~75 minutes. Encode "café" in UTF-8 by hand; verify with Python; decode a mystery hex dump character by character.
Goal: understand the UTF-8 encoding algorithm at the byte level by working through it manually, then confirm your result with Python.
Estimated time: 75 minutes
Prerequisites: Week 2 lecture (UTF-8 encoding algorithm, Unicode code points)
Setup
mkdir -p ~/fnd-101/lab-2-1
cd ~/fnd-101/lab-2-1
Have a Python 3 interpreter available: python3
Part A: Encoding "café" by hand
Work through the UTF-8 encoding of each character in "café" by hand.
Step 1: Find the Unicode code point for each character.
| Character | Unicode code point | Decimal |
|---|---|---|
| c | U+0063 | 99 |
| a | U+0061 | 97 |
| f | U+0066 | 102 |
| é | U+00E9 | 233 |
Step 2: Determine the byte length for each character.
UTF-8 byte lengths:
- Code points 0x00-0x7F (0-127): 1 byte (ASCII range, use the code point directly)
- Code points 0x80-0x7FF (128-2047): 2 bytes (prefix 110xxxxx 10xxxxxx)
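- Code points 0x800-0xFFFF (2048-65535): 3 bytes (prefix 1110xxxx 10xxxxxx 10xxxxxx)
- Code points 0x10000-0x10FFFF (65536-1114111): 4 bytes (prefix 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)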
Which byte length does each character in "café" use? Write your answer.
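Once you have written your answer, you can check it with a few lines of Python. This is a minimal sketch of the range table above; the helper name utf8_length is made up for this lab and is not part of the standard library.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of UTF-8 bytes for one code point (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # ASCII range
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4  # 0x10000-0x10FFFF


# Print code point and byte length for each character in the lab word.
for ch in "café":
    print(ch, f"U+{ord(ch):04X}", utf8_length(ord(ch)))
```

Compare the printed lengths with what you wrote down before moving on to Step 3.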
Step 3: Encode each character.
For c, a, f (1-byte characters): the byte is just the ASCII value in hex.
For é (U+00E9 = 233 decimal = 0b11101001 binary):
- It needs 2 bytes: 110xxxxx 10xxxxxx
- The 2-byte pattern has 11 x positions (5 + 6). 0xE9 = 0b11101001 is only 8 bits, so pad it with leading zeros to 11 bits: 00011101001; then split into 5 + 6 bits: 00011 | 101001
- First byte: 110 + 00011 = 0b11000011 = 0xC3
- Second byte: 10 + 101001 = 0b10101001 = 0xA9
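The same shift-and-mask steps can be written out in Python. This is a sketch for the 2-byte case only, assuming a code point between 0x80 and 0x7FF; the function name encode_two_byte is invented for illustration.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode one code point in the 2-byte UTF-8 range (hypothetical helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is not in the 2-byte range")
    # First byte: prefix 110 followed by the top 5 of the 11 payload bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: prefix 10 followed by the low 6 payload bits.
    second = 0b10000000 | (code_point & 0b00111111)
    return bytes([first, second])


result = encode_two_byte(0xE9)
print(result.hex())  # c3a9
```

The right shift by 6 drops the low six bits, which is the same as taking the left group of the 5 + 6 split; the mask 0b00111111 keeps only the right group.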
Write out the complete byte sequence for "café" in hex.
Part B: Verify with Python
s = "café"
encoded = s.encode('utf-8')
print(encoded) # shows bytes
print(encoded.hex()) # shows hex string
print(len(encoded)) # how many bytes?
print(len(s)) # how many characters?
Does the hex output match your hand calculation? Record the output in lab-2-1-notes.txt.
Part C: Decode a mystery hex sequence
Decode the following UTF-8 byte sequence by hand, character by character:
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0A
For each byte:
- Is it in the ASCII range (0x00-0x7F, top bit 0)? If yes, it is a 1-byte character; look up the character in an ASCII table.
- Do its top bits match 110xxxxx? Then it is the first byte of a 2-byte character.
- Do its top bits match 10xxxxxx? Then it is a continuation byte (the second or later byte of a multi-byte character).
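The checks above are bit-mask tests on the top bits of each byte. The sketch below shows one way to write them; classify_byte and its label strings are made up for this lab. It also covers the 3-byte and 4-byte lead patterns (1110xxxx and 11110xxx) and only looks at the prefix bits, so it does not catch every byte that is invalid in UTF-8.

```python
def classify_byte(b: int) -> str:
    """Label a single byte by its UTF-8 prefix bits (hypothetical helper)."""
    if (b & 0b10000000) == 0b00000000:
        return "ascii"          # 0xxxxxxx
    if (b & 0b11000000) == 0b10000000:
        return "continuation"   # 10xxxxxx
    if (b & 0b11100000) == 0b11000000:
        return "lead-2"         # 110xxxxx
    if (b & 0b11110000) == 0b11100000:
        return "lead-3"         # 1110xxxx
    if (b & 0b11111000) == 0b11110000:
        return "lead-4"         # 11110xxx
    return "invalid"


# Small sample (not the lab's mystery sequence): hex, binary, and label per byte.
for b in bytes.fromhex("41 C3 A9"):
    print(f"{b:02X} {b:08b} {classify_byte(b)}")
```

Try the classification on paper first, then run the loop over the lab's byte sequence to check yourself.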
Write the decoded text. Then verify:
bytes.fromhex("48656c6c6f2c20776f726c64210a").decode('utf-8')
Part D: A multi-byte sequence
Decode this sequence, which contains at least one multi-byte character:
43 61 66 C3 A9 20 61 75 20 6C 61 69 74
Hint: you already produced C3 A9 in Part A when you encoded é. Work through the other bytes one at a time.
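Decoding a 2-byte pair reverses the Part A steps: strip the 110 and 10 prefixes, then join the remaining 5 + 6 bits. A minimal sketch, assuming a well-formed lead byte and continuation byte; decode_two_byte is a hypothetical name, not a library function.

```python
def decode_two_byte(first: int, second: int) -> str:
    """Decode one 2-byte UTF-8 sequence to a character (hypothetical helper)."""
    if (first & 0b11100000) != 0b11000000:
        raise ValueError("first byte does not match 110xxxxx")
    if (second & 0b11000000) != 0b10000000:
        raise ValueError("second byte does not match 10xxxxxx")
    # Keep the 5 payload bits of the first byte and the 6 of the second.
    code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)
    return chr(code_point)


print(decode_two_byte(0xC3, 0xA9))  # é
```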
Verify with Python:
bytes.fromhex("436166c3a9206175206c616974").decode('utf-8')
Expected output / artifact
lab-2-1-notes.txt containing:
- Your hand calculation for "café" encoding (Part A steps 1-3)
- Python output from Part B
- Your decoded text for Parts C and D
- Python verification output for Parts C and D
git add lab-2-1-notes.txt
git commit -m "lab-2-1: UTF-8 encoding by hand + Python verification"
Common pitfalls
- Mixing up the bit layout: the 2-byte format is 110xxxxx 10xxxxxx; the x bits come from the code point, not from the byte values you produce
- len(s) vs len(bytes): len("café") is 4 (4 characters); len("café".encode('utf-8')) is 5 (5 bytes). This asymmetry is the encoding in action.
- Python 2 vs Python 3: this lab assumes Python 3; in Python 2, string encoding works differently. If you see an error, check your Python version:
python3 --version
Stretch (optional)
Find a language whose script requires 3-byte UTF-8 encoding (Chinese, Japanese, Korean, Hindi, and Thai all have many such characters; Arabic and Hebrew letters fall in the 2-byte range, so they do not qualify). Pick one character, find its code point, and encode it by hand. Verify with Python.
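If you want to check your bit splitting for the 3-byte case, the sketch below extends the Part A approach to the pattern 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 payload bits). The function name encode_three_byte is invented for this lab, and the euro sign is used only as a neutral example so that your own character is still yours to work out.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Encode one code point in the 3-byte UTF-8 range (hypothetical helper)."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("code point is not in the 3-byte range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    first = 0b11100000 | (code_point >> 12)                 # top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b00111111)  # middle 6 bits
    third = 0b10000000 | (code_point & 0b00111111)          # low 6 bits
    return bytes([first, second, third])


# U+20AC is the euro sign.
print(encode_three_byte(0x20AC).hex())  # e282ac
```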
Lab 2.1 v0.1.