Lab 2.1: UTF-8 Encoding and Decoding

~75 minutes. Encode "café" in UTF-8 by hand; verify with Python; decode a mystery hex dump character by character.


Goal: understand the UTF-8 encoding algorithm at the byte level by working through it manually, then confirm your result with Python.

Estimated time: 75 minutes

Prerequisites: Week 2 lecture (UTF-8 encoding algorithm, Unicode code points)


Setup

mkdir -p ~/fnd-101/lab-2-1
cd ~/fnd-101/lab-2-1

Have a Python 3 interpreter available: python3


Part A: Encoding "café" by hand

Work through the UTF-8 encoding of each character in "café" by hand.

Step 1: Find the Unicode code point for each character.

Character   Unicode code point   Decimal
c           U+0063               99
a           U+0061               97
f           U+0066               102
é           U+00E9               233
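If you want to double-check the table, Python's built-in ord() returns a character's code point. The sketch below assumes the é is the single precomposed character U+00E9 (written as an escape so the source file's own encoding does not matter); the variable names are illustrative only.

```python
# Print the code point of each character in "café" in U+XXXX and decimal form.
word = "caf\u00e9"  # assumption: precomposed é (U+00E9), as in the table

for ch in word:
    cp = ord(ch)  # ord() gives the Unicode code point as an integer
    print(f"{ch}  U+{cp:04X}  {cp}")

code_points = [ord(ch) for ch in word]
```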

Step 2: Determine the byte length for each character.

UTF-8 byte lengths:

  • Code points 0x00-0x7F (0-127): 1 byte (ASCII range, use the code point directly)
  • Code points 0x80-0x7FF (128-2047): 2 bytes (prefix 110xxxxx 10xxxxxx)
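  • Code points 0x800-0xFFFF (2048-65535): 3 bytes (prefix 1110xxxx 10xxxxxx 10xxxxxx)
  • Code points 0x10000-0x10FFFF (65536-1114111): 4 bytes (prefix 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)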

Which byte length does each character in "café" use? Write your answer.
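Once you have written your answer, the range rules above can be sketched as a small function. This is only an illustration: utf8_length is a hypothetical helper name, and the sample characters are deliberately not the ones in "café" so the exercise stays yours. The sketch does not treat surrogate code points (U+D800-U+DFFF) specially.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare the rule with what Python's own encoder produces.
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]  # A, ñ, €, and an emoji
for ch in samples:
    print(f"U+{ord(ch):04X}", utf8_length(ord(ch)), len(ch.encode("utf-8")))
```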

Step 3: Encode each character.

For c, a, f (1-byte characters): the byte is just the ASCII value in hex.

For é (U+00E9 = 233 decimal = 0b11101001 binary):

  • It needs 2 bytes: 110xxxxx 10xxxxxx
  • The 2-byte pattern has 11 x positions, so write the code point as 11 bits, padding with leading zeros: 0xE9 = 0b11101001 becomes 00011101001; split into 5 + 6 bits: 00011 | 101001
  • First byte: 110 + 00011 = 0b11000011 = 0xC3
  • Second byte: 10 + 101001 = 0b10101001 = 0xA9
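The same bit manipulation can be written with shifts and masks. The following is a minimal sketch of the 2-byte case only, not a general encoder; encode_two_byte is a hypothetical name introduced for illustration.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x80-0x7FF as two UTF-8 bytes (sketch)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("not a 2-byte code point")
    first = 0b11000000 | (code_point >> 6)           # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10 + low 6 bits
    return bytes([first, second])

result = encode_two_byte(0xE9)
print(f"{0xE9:011b}")  # the code point padded to 11 bits: 00011101001
print(result.hex())    # c3a9
```

Comparing result with "\u00e9".encode("utf-8") is a quick way to confirm the hand calculation.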

Write out the complete byte sequence for "café" in hex.


Part B: Verify with Python

s = "café"
encoded = s.encode('utf-8')
print(encoded)          # shows bytes
print(encoded.hex())    # shows hex string
print(len(encoded))     # how many bytes?
print(len(s))           # how many characters?

Does the hex output match your hand calculation? Record the output in lab-2-1-notes.txt.


Part C: Decode a mystery hex sequence

Decode the following UTF-8 byte sequence by hand, character by character:

48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A

For each byte:

  1. Is it in the ASCII range (0x00-0x7F)? If yes, it is a 1-byte character; look up its ASCII value.
  2. Does it start with 110xxxxx? Then it is the start of a 2-byte character.
  3. Does it start with 10xxxxxx? Then it is a continuation byte (part of a multi-byte character).
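The three checks above can be sketched with bit masks. classify_byte is a hypothetical helper, and the demo uses the bytes 6E C3 A9 (the text "né") rather than the mystery sequence, so the decoding is still left to you. Lead bytes for 3- and 4-byte characters are lumped together as "other" in this sketch.

```python
def classify_byte(b: int) -> str:
    """Classify one byte by its leading bits (sketch covering the three cases above)."""
    if (b & 0b10000000) == 0b00000000:
        return "ascii"         # 0xxxxxxx: a complete 1-byte character
    if (b & 0b11100000) == 0b11000000:
        return "2-byte lead"   # 110xxxxx: first byte of a 2-byte character
    if (b & 0b11000000) == 0b10000000:
        return "continuation"  # 10xxxxxx: belongs to the preceding lead byte
    return "other"             # 3- or 4-byte lead, or an invalid byte

demo = bytes.fromhex("6ec3a9")
labels = [classify_byte(b) for b in demo]
for b, label in zip(demo, labels):
    print(f"{b:02X}  {b:08b}  {label}")
```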

Write the decoded text. Then verify:

bytes.fromhex("48656c6c6f2c20776f726c64210a").decode('utf-8')

Part D: A multi-byte sequence

Decode this sequence, which contains at least one multi-byte character:

43 61 66 C3 A9 20 61 75 20 6C 61 69 74

Hint: you already produced C3 A9 in Part A, as the encoding of é. Work through the others byte by byte.
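Decoding a 2-byte character reverses the Part A steps: strip the 110 and 10 prefixes, then join the remaining 5 + 6 bits. A minimal sketch, assuming a well-formed pair of bytes; decode_two_byte is a hypothetical name.

```python
def decode_two_byte(first: int, second: int) -> int:
    """Recover the code point from a 2-byte UTF-8 sequence (sketch)."""
    if (first & 0b11100000) != 0b11000000:
        raise ValueError("first byte is not a 2-byte lead (110xxxxx)")
    if (second & 0b11000000) != 0b10000000:
        raise ValueError("second byte is not a continuation (10xxxxxx)")
    high = first & 0b00011111   # keep the 5 payload bits
    low = second & 0b00111111   # keep the 6 payload bits
    return (high << 6) | low

cp = decode_two_byte(0xC3, 0xA9)
print(f"U+{cp:04X}", chr(cp))
```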

Verify with Python, using the exact sequence given above:

bytes.fromhex("436166c3a9206175206c616974").decode('utf-8')


Expected output / artifact

lab-2-1-notes.txt containing:

  • Your hand calculation for "café" encoding (Part A steps 1-3)
  • Python output from Part B
  • Your decoded text for Parts C and D
  • Python verification output for Parts C and D

Commit the file:

git add lab-2-1-notes.txt
git commit -m "lab-2-1: UTF-8 encoding by hand + Python verification"

Common pitfalls

  • Mixing up the bit layout: the 2-byte format is 110xxxxx 10xxxxxx; the x bits come from the code point, not from the byte values you produce
  • len(s) vs len(bytes): len("café") is 4 (4 characters); len("café".encode('utf-8')) is 5 (5 bytes). This asymmetry is the encoding in action.
  • Python 2 vs Python 3: this lab assumes Python 3. In Python 2, str is a byte string rather than text, so the calls in this lab behave differently or fail. If you see an error, check your Python version: python3 --version

Stretch (optional)

Find a language that uses characters requiring 3-byte UTF-8 encoding (Chinese, Japanese, Korean, Hindi, and Thai all have many such characters; basic Arabic and Hebrew letters lie below U+0800 and need only 2 bytes). Pick one character, find its code point, and encode it by hand. Verify with Python.
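For reference, the 3-byte layout is 1110xxxx 10xxxxxx 10xxxxxx, which carries 16 bits split as 4 + 6 + 6. The sketch below applies it to the euro sign (U+20AC), chosen so that your own character is still left to work out by hand; encode_three_byte is a hypothetical helper, not part of any library.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode a code point in the range 0x800-0xFFFF as three UTF-8 bytes (sketch)."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("not a 3-byte code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    first = 0b11100000 | (code_point >> 12)                 # 1110 + top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b00111111)  # 10 + middle 6 bits
    third = 0b10000000 | (code_point & 0b00111111)          # 10 + low 6 bits
    return bytes([first, second, third])

euro = encode_three_byte(0x20AC)
print(euro.hex())  # e282ac
```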


Lab 2.1 v0.1.