Encoding Basics


When you type something on a keyboard, or load a document, how does the computer figure out what to display? That is what character encoding is for. In the early days, character encoding received little attention because PCs did not communicate with one another. With the rise of the internet, it has become a mandatory, quasi-invisible task.
With the default configuration it usually works smoothly. But sooner or later a mistake shows up in a typical Python console with a message like: ‘SyntaxError: (unicode error) ‘unicodeescape’ codec can’t decode bytes in position 2-4’. As you can see, character encoding is the foundation our code sits on. I didn’t understand much about this subject before writing this article. If you invest about 5 minutes to grasp the basics, you can treat it as a small, easily interpretable bottleneck :)

In short, a character encoding maps bytes to letters, numbers, accent marks, special symbols, etc.


A font is a collection of glyphs: the shapes used to display characters; some fonts also ship extra glyphs, which makes them customizable.

For example, the byte sequence 61 20 62 (in hexadecimal) corresponds to the letter a (byte value 0x61), followed by a space (0x20) and the letter b (0x62); a short sketch after the reminder below shows this in practice.

Quick reminder:

  • 1 byte = 8 bits
  • 1 bit can take either the value 0 or 1 (so the maximum value of 1 byte is 11111111)
  • 2^8 = 256 -> One byte can store up to 256 patterns.
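
To make this concrete, here is a minimal sketch in a Python 3 console that turns the three bytes from the example above back into text:

>>> data = bytes.fromhex('61 20 62')   # the letter a, a space, the letter b
>>> data
b'a b'
>>> data.decode('ascii')               # interpret those bytes as ASCII text
'a b'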

Many different types

If you want to write Russian, it makes sense to use a character encoding that supports the Cyrillic alphabet well. If you communicate in Korean, then you’ll need something that handles the ‘Hangul’ and ‘Hanja’ scripts. In other words, which character encoding you use depends on your requirements.

The most popular encodings since the birth of computers:

In the “big picture”, the widely used encodings appeared one after another over time, each bringing improvements: more characters, or a different byte layout to make processing faster. In other words, an encoding created by company A would be extended months or years later by company B, which took the previous encoding and added more possibilities.

In chronological order:

  • ASCII – The American Standard Code for Information Interchange is one of the oldest character encodings. It was published in the early 1960s and grew out of telegraph codes. It is the basis of many later encodings and contains the Latin letters without accented characters. Its 7-bit design allows only 128 characters (2^7), which is why several regional variants are used around the world. The ASCII table
    As you can see, the first 32 characters (codes 0 to 31) plus the 127th (DEL), 33 in total, are control codes.

  • ISO-8859 – The International Organization for Standardization’s most widely used family of character encodings is the 8859 series. It appeared after ASCII to add more symbols (the German ß, some Scandinavian characters). Each specific encoding is designated by a number suffix, e.g. ISO-8859-4 covers the Latin-4 Northern European alphabet. In short, each of them is a superset of ASCII: the first 128 values are the same as ASCII. Being 8-bit, it allows 256 characters, which gives a lot more patterns (a short example follows this list).

      - ISO-8859-1 Latin-1 Western European
      - ISO-8859-2 Latin-2 Central European
      - ISO-8859-3 Latin-3 South European
      - ISO-8859-4 Latin-4 North European
      - ISO-8859-5 Cyrillic
      - ISO-8859-6 Arabic
      - ISO-8859-7 Greek …
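
To illustrate how much the chosen variant matters, here is a small sketch in a Python 3 console (the codec names below are the standard library’s spellings of these ISO-8859 variants) decoding the very same byte with three members of the family:

>>> raw = bytes([0xE6])
>>> raw.decode('iso-8859-1')   # Latin-1, Western European
'æ'
>>> raw.decode('iso-8859-5')   # Cyrillic
'ц'
>>> raw.decode('iso-8859-7')   # Greek
'ζ'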
    

ANSI character set: also known as Windows-1252, this became Microsoft’s own character set. It is a superset of ISO-8859-1 in which 27 positions of the control-code range (0x80 to 0x9F) are reassigned to printable characters such as curly quotes and the euro sign.
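
A quick check in a Python console shows the difference (cp1252 is the standard library’s name for the Windows-1252 codec):

>>> bytes([0x80]).decode('iso-8859-1')   # 0x80 is an invisible C1 control code in ISO-8859-1
'\x80'
>>> bytes([0x80]).decode('cp1252')       # ...but the euro sign in Windows-1252
'€'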

Apple, for its part, had its own proprietary text encoding, the MacRoman character set (which covers a similar variety of characters in the 128 to 255 range).

The Windows Notepad save dialog

Encoding usage over time

The most widely used encoding

As a developer or content author, UTF-8 is the go-to choice for producing content and developing apps. Python 3.7, for example, supports it by default, as do most web browsers and mail clients. UTF-8 allows for much better compatibility because it covers the full Unicode character set: it can store text in any language known to humanity, from Latin scripts to Greek, Cyrillic, Mandarin…

The Unicode standard introduced a new way of organizing characters: it describes characters by code points, and UTF-8 is built on this structure. A code point value is an integer in the range 0 to 0x10FFFF (about 1.1 million possible values, of which roughly 130 thousand are assigned so far). While the letter A has the code point U+0041 (65 in decimal), the Cyrillic character щ has the code point U+0449 (1097).
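
You can check these code points yourself in a Python console with the built-in ord() and chr() functions:

>>> ord('A')         # decimal code point of A
65
>>> hex(ord('A'))
'0x41'
>>> ord('щ')         # Cyrillic shcha
1097
>>> chr(0x449)       # from the code point back to the character
'щ'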

It’s also very convenient for people who use mathematical symbols or special characters such as squares, checkboxes and unusual punctuation.

Characters are stored in the computer as one or more bytes. In UTF-8, every code point from 0 to 127 fits in a single byte (it takes over the full ASCII table, in the same order!). Code points 128 and above are stored using 2 to 4 bytes. The Unicode standard contains many tables which list characters and their corresponding code points.
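
A quick way to see this variable length in action (plain Python 3 again):

>>> len('a'.encode('utf-8'))    # ASCII range: 1 byte
1
>>> len('é'.encode('utf-8'))    # accented Latin letter: 2 bytes
2
>>> len('€'.encode('utf-8'))    # euro sign: 3 bytes
3
>>> len('𝄞'.encode('utf-8'))    # musical symbol outside the basic plane: 4 bytes
4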

Modern globalized applications often use UTF-8 or UTF-16 to save text files. These Unicode encodings are a good choice because a single character encoding can handle any character you are likely to need. Besides UTF-8, the other Unicode encoding forms come in several variants:

  • UTF-16 little endian
  • UTF-16 big endian
  • UTF-32 little endian
  • UTF-32 big endian

UTF-8, UTF-16 and UTF-32 all cover the full set of Unicode code points; there is no difference in which characters each one can represent.

The distinction comes from what is called “endianness”: the order in which a sequence of bytes is stored in computer memory. Big-endian stores the “big end” (the most significant byte of the sequence) first, at the lowest storage address. Little-endian stores the “little end” (the least significant byte) first. Depending on the task, computer operations may be simpler and faster to perform with one order or the other.
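
You can observe the byte order directly by encoding a single character with both variants; note that the last result below assumes a little-endian machine (e.g. x86), since the plain utf-16 codec uses the platform’s native order:

>>> 'щ'.encode('utf-16-be').hex()   # big-endian: most significant byte (04) first
'0449'
>>> 'щ'.encode('utf-16-le').hex()   # little-endian: least significant byte (49) first
'4904'
>>> 'щ'.encode('utf-16').hex()      # prepends a BOM (byte order mark); result shown for a little-endian machine
'fffe4904'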

Detecting the encoding of a Text

The standard character encoding for Python 3 source files is UTF-8, but you can use almost any encoding if you declare it. This is done by including a special comment at the beginning of the source file:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

In Python, we can try the ASCII codec to find out whether a byte string really is ASCII:

def isASCII(data):
    # 'data' is expected to be a bytes object
    try:
        data.decode('ASCII')
    except UnicodeDecodeError:
        return False
    else:
        return True
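
For instance, with a couple of throwaway test strings:

>>> isASCII('hello'.encode('utf-8'))
True
>>> isASCII('café'.encode('utf-8'))    # é falls outside the ASCII range
False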

We can use the same approach with other encodings; just replace UTF-8 with any codec:

def isUTF8(data):
    try:
        data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        return True

To detect it reliably

There is no “free lunch theorem” for character-encoding detection. In general, the data provider gives you this information (database settings, software documentation, OS configuration, customer specs, a call to the supplier…). But sometimes there is no other way to obtain it than to ask the person who produced the data. And sometimes the declared encoding is simply wrong.

If you receive some data and cannot find out its encoding, you can force an imperfect decode with decode() by specifying the errors parameter. It can take the following values:

  • strict : raises an exception in case of error. This is the default behavior.
  • ignore : any character that causes an error is ignored.
  • replace : any character that causes an error is replaced by the official replacement character �.
>>> # The ã is deliberate: it can be encoded in UTF-8 but not in ASCII (the early encoding we discussed above)
>>> 'Santã claus is a biker'.encode('utf-8').decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
>>> # Note that ã is encoded in UTF-8 as the two bytes 0xC3 0xA3, neither of which is a valid ASCII value
>>> 'Santã claus is a biker'.encode('utf-8').decode('ascii', errors='ignore')
'Sant claus is a biker'
>>> 'Santã claus is a biker'.encode('utf-8').decode('ascii', errors='replace')
'Sant�� claus is a biker'

Ultimate solution

Mozilla comes to the rescue with its universal character-encoding detector, available in Python as the chardet library, which must first be installed:

pip install chardet

>>> import chardet
>>> chardet.detect('Il se monte facilement..'.encode('utf-8'))   # chardet expects bytes, hence the encode()
{'confidence': 0.8063275188616134, 'encoding': 'ISO-8859-2'}
>>> chardet.detect('Il se monte super facilement'.encode('utf-8'))
{'confidence': 0.87625, 'encoding': 'utf-8'}

It’s not too bad, but don’t expect miracles. The more text there is, the more accurate the detection, and the closer the confidence value gets to 1.

There are two kinds of strings in Python: Unicode strings (str), which are sequences of Unicode code points, and byte strings (bytes), which are sequences of raw bytes with no specified encoding. In Python 3, the regular “abc” literal syntax gives you a Unicode string, while b“abc” gives you a byte string (in Python 2 it was the other way round, which is why you still see the u“abc” syntax).
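
A last quick check in a Python 3 console:

>>> type('abc')              # a Unicode string
<class 'str'>
>>> type(b'abc')             # a byte string
<class 'bytes'>
>>> 'é'.encode('utf-8')      # going from str to bytes requires choosing an encoding
b'\xc3\xa9'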

