Channel: UTF-8 bit representation - Super User

Answer by Phil P for UTF-8 bit representation


UTF-8 is self-synchronising. Something examining the bytes can tell if it's at the start of a UTF-8 character, or part-way through one.

Let's say you have two two-octet characters in your scheme, where every octet carries the same marker: 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

If the parser picks up at the second octet, it has no way to tell that the second and third octets must not be read together as one character. With UTF-8, the parser can tell that it is in the middle of a character and skip ahead to the start of the next one, while reporting the corrupted symbol.
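
As a rough illustration of that skip-ahead step, here is a minimal Python sketch. The names `is_continuation` and `resync` are made up for this example, and it assumes the input is otherwise well-formed UTF-8; it does no validation beyond looking at the top two bits.

```python
def is_continuation(octet: int) -> bool:
    # Continuation octets always have the bit pattern 10xxxxxx.
    return (octet & 0b1100_0000) == 0b1000_0000


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first octet at or after pos that starts a character."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Octets: 61 | c3 a9 | e2 82 ac | 62
data = "a\u00e9\u20acb".encode("utf-8")

# Index 4 lands in the middle of the three-octet euro sign.
start = resync(data, 4)
print(start, data[start:].decode("utf-8"))
```

A real decoder would also note that it dropped a partial character, for example by emitting U+FFFD before continuing.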

For the edit: if the top bit is clear, UTF-8 parsers know that they're looking at a character represented in one octet. If it is set, the octet is part of a multi-octet character: a lead octet starts with 11 and a continuation octet starts with 10.
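
The classification can be sketched in a few lines of Python. The function name `classify` and its return labels are invented for illustration, and the sketch only inspects the top two bits, so it will happily label octets that can never appear in valid UTF-8 (such as 0xFF) as lead octets.

```python
def classify(octet: int) -> str:
    """Classify one octet by its top bits (no validity checking)."""
    if (octet & 0b1000_0000) == 0:
        return "single"        # 0xxxxxxx: a complete one-octet character
    if (octet & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx: middle or end of a character
    return "lead"              # 11xxxxxx: first octet of a multi-octet character


for octet in "A\u00e9".encode("utf-8"):  # octets 41, c3, a9
    print(f"{octet:08b} {classify(octet)}")
```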

It's all about error recovery and easy classification of octets.

