Character Encoding and How It Works

Let's delve into a bit of history. In the early days of computing, different computer models and manufacturers used their own character encoding systems. IBM, for instance, had its own character sets for its mainframe computers, and other manufacturers developed proprietary schemes as well. As computers evolved and became increasingly popular, this lack of standardization made it difficult to reliably exchange data between systems.

In the early 1960s, ASCII (the American Standard Code for Information Interchange) was developed, mapping characters to binary representations.

It used 7 bits and could encode 128 characters: non-printable control codes, digits, punctuation marks, and letters. For example, 01100001 (decimal 97) is 'a', and 00110000 (decimal 48) is '0'. ASCII solved many compatibility problems and gained popularity in the United States.
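You can check these mappings yourself. Here's a tiny Python sketch using the built-in ord, chr, and bin functions:

```python
# ASCII maps every character to a number that fits in 7 bits.
print(ord("a"), bin(ord("a")))  # 97 0b1100001
print(ord("0"), bin(ord("0")))  # 48 0b110000
print(chr(97), chr(48))         # a 0
```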

[Image: ASCII table (Simple English Wikipedia)]

However, another snag was discovered! ASCII supported only the Latin alphabet and was not designed for other languages; it didn't even include a ¢ cent sign. As computers spread all over the world, the same compatibility problem resurfaced.

As a result, work on a universal character set began in the late 1980s, and the Unicode Consortium was formed to develop an encoding standard that could encompass characters from all the world's writing systems. Unicode uses a much larger number of bits to represent each character, allowing for over a million unique code points. This capacity makes it possible to include not just every current character from every script in the world, but also a vast array of symbols, emojis, and historical scripts.

Moreover, Unicode's first 128 code points are identical to ASCII, which makes migration even easier. The most widely adopted encoding for Unicode is UTF-8 (Unicode Transformation Format, 8-bit), which can encode all 1,112,064 valid Unicode code points using one to four one-byte code units; any valid ASCII text is automatically valid UTF-8.
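To see both properties in action, here's a quick Python sketch: ASCII text encodes to the same bytes in UTF-8 and ASCII, while non-ASCII characters take two to four bytes.

```python
# ASCII text is byte-for-byte identical in UTF-8.
print("hello".encode("utf-8") == "hello".encode("ascii"))  # True

# Characters outside ASCII take two to four bytes.
print("é".encode("utf-8"))    # b'\xc3\xa9'          (2 bytes)
print("€".encode("utf-8"))    # b'\xe2\x82\xac'      (3 bytes)
print("🙂".encode("utf-8"))   # b'\xf0\x9f\x99\x82'  (4 bytes)
```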

Encoding Logic:

  • Characters from the ASCII set (standard English characters and numbers) are represented in a single byte (8 bits), with the most significant bit set to 0.
  • For characters beyond the ASCII set, UTF-8 uses a sequence of two to four bytes. The number of bytes depends on the range of the Unicode character.
  • The first byte in a multi-byte sequence uses leading bits to indicate how many bytes are in the sequence (110xxxxx for two bytes, 1110xxxx for three, 11110xxx for four).
  • Additional bytes in the sequence begin with '10', followed by bits of the actual Unicode character.
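To make the rules above concrete, here is a simplified Python sketch of a UTF-8 encoder for a single code point. It follows the bit patterns from the list but skips real-world error handling (for example, rejecting surrogate code points), so treat it as an illustration rather than a production encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0x80:                        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Sanity check against Python's built-in encoder:
assert utf8_encode(ord("é")) == "é".encode("utf-8")
assert utf8_encode(ord("🙂")) == "🙂".encode("utf-8")
```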

Decoding Process:

  • When a computer reads a byte, it first checks the pattern of the initial bits. If it starts with 0, it's an ASCII character and is processed as such.
  • If it starts with the pattern 110, 1110, or 11110, the computer understands it's a multi-byte character. It then reads the number of bytes indicated by this pattern, each beginning with 10, to decode the full character.
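The reverse direction can be sketched the same way. The helper below (a hypothetical utf8_decode_first function, again without full validation) reads the leading byte, works out the sequence length, and folds in the continuation bytes:

```python
def utf8_decode_first(data: bytes) -> tuple[int, int]:
    """Decode the first code point from UTF-8 bytes.
    Returns (code_point, bytes_consumed)."""
    first = data[0]
    if first < 0x80:                  # 0xxxxxxx -> plain ASCII
        return first, 1
    if first >> 5 == 0b110:           # 110xxxxx -> 2-byte sequence
        length, code_point = 2, first & 0x1F
    elif first >> 4 == 0b1110:        # 1110xxxx -> 3-byte sequence
        length, code_point = 3, first & 0x0F
    elif first >> 3 == 0b11110:       # 11110xxx -> 4-byte sequence
        length, code_point = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    for byte in data[1:length]:       # continuation bytes start with 10
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point, length

print(utf8_decode_first("€".encode("utf-8")))  # (8364, 3); chr(8364) == '€'
```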

Some interesting facts:

  • When you send a POST request with some JSON in the body, your JSON is probably encoded using UTF-8 and then decoded by the server. This is necessary because computers, routers, switches, and everything else only understand binary.
  • Roughly 98% of all web pages use UTF-8.
  • Emojis are regular Unicode characters; in UTF-8, most of them are encoded as four bytes.
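As a small illustration of the first and last points (the field names here are made up), this is roughly what happens before a JSON body hits the wire:

```python
import json

# The JSON body is just text until it is encoded to bytes for transmission.
payload = {"user": "Алиса", "reaction": "🙂"}
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
print(body)                       # the raw UTF-8 bytes sent over the network
print(len("🙂".encode("utf-8")))  # 4 -- this emoji takes four bytes in UTF-8
```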

Thanks for reading

This article is coming to an end. Hats off to those of you who have read it all and a big thank you. It really means a lot to me.

I hope you found this article interesting and engaging. If so, please consider subscribing to my blog and Telegram channel. Perhaps you would like to debate, ask a question, or say a few kind words? Feel free to join the discussion below; I will be notified of any updates here, so go ahead!
