From numbers to text. Computers store text as numbers; this week you learn exactly which numbers, and how UTF-8 turns Unicode code points into the bytes you find in real files.
Theme
A text file is bytes. The letter 'A' is the byte 0x41. The letter 'e' with an accent (é) is the two-byte sequence 0xC3 0xA9 in UTF-8. An emoji like "thumbs up" is four bytes: 0xF0 0x9F 0x91 0x8D.
None of this is magic. There is a published standard (Unicode) and a published encoding algorithm (UTF-8) that specifies, exactly, how any character maps to bytes. This week you learn the mapping well enough that you can decode a short hex dump of a UTF-8 file by hand, character by character.
By the end of week 2 you can look at any ASCII character's hex value and name the character, explain how UTF-8 encodes multi-byte characters, and use xxd to decode a text file without opening it in a text editor.
Reading list (~1 hour)
- Petzold CODE, Ch 20 ("ASCII and a Cast of Characters"): the history of ASCII and why 7-bit character encoding was a practical limit
- Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" at https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum/: the clearest short explanation of why encodings exist and what code points are
- Wikipedia, "UTF-8" (the "Description" section through "Encoding"): the byte-level encoding algorithm
Read Joel's article first; it sets up the conceptual framework Petzold and the Wikipedia entry elaborate on.
Lecture outline (~2 hours)
Section 1: Why encoding exists
- A computer stores bytes; a byte is a number 0-255
- Text is not bytes; text is a human convention
- Encoding is the agreed-on mapping between human conventions (characters) and byte values
- Without agreement on the mapping, a file written by one program cannot be read correctly by another
- ASCII (1963) was the first widely standardized mapping: 128 characters, 7 bits each
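A minimal sketch of what "no agreement on the mapping" looks like in practice. It assumes the writer used UTF-8 and the reader guessed Latin-1 (ISO 8859-1), chosen here only for illustration; the variable names are illustrative too.

```python
# The writer encodes the character é using UTF-8.
data = "é".encode("utf-8")
print(data.hex())                 # c3a9

# A reader that agrees on UTF-8 gets the original character back.
as_utf8 = data.decode("utf-8")
print(as_utf8)                    # é

# A reader that assumes Latin-1 treats each byte as its own character.
as_latin1 = data.decode("latin-1")
print(as_latin1)                  # Ã©
```

The bytes never changed; only the mapping used to interpret them did.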
Section 2: ASCII
- The 128 ASCII characters: 0-31 are control characters (newline, tab, bell, etc.); 32-126 are printable (space, letters, digits, punctuation); 127 is DEL, one more control character
- Key values to memorize: 'A' = 65 (0x41), 'a' = 97 (0x61), '0' = 48 (0x30), space = 32 (0x20), newline = 10 (0x0A)
- The letter-case difference: 'A' and 'a' differ by exactly 32 (0x20), or one bit: bit 5
- ASCII only covers English; it has no room for accented characters, Greek, Arabic, Chinese, emoji
Browser tool: Multi-Base Number Slider. Drag the slider through the ASCII range and watch the character column update. Move to 65 and you get 'A'; move to 97 and you get 'a'; the difference is exactly the bit 5 swap the lecture noted. Try 32 (space), 48 ('0'), 10 (LF newline, marked non-printable). The slider is the fastest way to internalize which byte is which character.
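The key values and the bit-5 relationship from Section 2 can also be checked in Python. This is a small sketch; swap_case_bit is a hypothetical helper name, and flipping bit 5 only makes sense for the ASCII letters A-Z and a-z.

```python
# Print the key ASCII values in decimal and hex.
for ch in ["A", "a", "0", " ", "\n"]:
    print(repr(ch), ord(ch), hex(ord(ch)))

def swap_case_bit(ch: str) -> str:
    """Flip bit 5 (value 0x20) of one ASCII letter to change its case."""
    return chr(ord(ch) ^ 0x20)

print(ord("a") - ord("A"))        # 32
print(format(ord("A"), "08b"))    # 01000001
print(format(ord("a"), "08b"))    # 01100001
print(swap_case_bit("A"))         # a
print(swap_case_bit("a"))         # A
```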
Section 3: Unicode and code points
- Unicode is not an encoding; it is a character catalog. It aims to give every character in every writing system a "code point": a number like U+0041 for 'A', U+00E9 for 'e with acute' (é), U+1F44D for thumbs-up emoji
- As of 2024, Unicode catalogs over 150,000 characters
- The code point is an abstract number. UTF-8 is the algorithm that turns code points into bytes.
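To see the difference between a code point and its bytes, Python's built-in ord() returns the code point of a character, independent of any encoding. A short sketch follows; code_point_label is a hypothetical helper name.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# 'A', e with acute, and the thumbs-up emoji, written as escapes.
for ch in ["A", "\u00e9", "\U0001F44D"]:
    # The code point is one number; the UTF-8 byte count varies.
    print(code_point_label(ch), ord(ch), len(ch.encode("utf-8")))
# U+0041 65 1
# U+00E9 233 2
# U+1F44D 128077 4
```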
Section 4: UTF-8 encoding
- UTF-8 is a variable-width encoding: ASCII characters take 1 byte (backwards compatible); other code points take 2, 3, or 4 bytes
- The bit prefix of the first byte tells you how many bytes the character uses:
- 0xxxxxxx: 1-byte character (ASCII; code points 0-127)
- 110xxxxx 10xxxxxx: 2-byte character (code points 128-2047)
- 1110xxxx 10xxxxxx 10xxxxxx: 3-byte character (code points 2048-65535)
- 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte character (code points 65536-1114111)
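- The continuation bytes all have the form 10xxxxxx, and no first byte ever starts with 10; this lets a reader re-sync mid-stream by skipping forward to the next byte that is not a continuation byte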
- Example: é (U+00E9 = 233 decimal = 0b11101001)
- Fits in 2-byte form: 110xxxxx 10xxxxxx, where the 11 x bits hold the 8 bits of 0xE9 padded on the left with zeros: 00011101001
- Split those 11 bits into 5 bits + 6 bits: 00011 101001
- First byte: 110 00011 = 0xC3; second byte: 10 101001 = 0xA9
- Verify: python3 -c "print('é'.encode('utf-8').hex())" outputs c3a9
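The bit layouts in this section can be turned into a short hand-written encoder. This is a teaching sketch, not how Python implements UTF-8 internally; utf8_encode and sequence_length are hypothetical names. The check that rejects U+D800 through U+DFFF (the surrogate range, which UTF-8 does not encode) goes slightly beyond the outline above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 using the four bit layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx plus 3 continuations
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])

def sequence_length(first_byte: int) -> int:
    """Read the bit prefix of a first byte to get the sequence length."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid first byte")

print(utf8_encode(0xE9).hex())       # c3a9
print(utf8_encode(0x1F44D).hex())    # f09f918d
print(sequence_length(0xC3))         # 2
```

Comparing utf8_encode(cp) against chr(cp).encode("utf-8") for a few code points is a quick way to check hand calculations.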
Section 5: Magic bytes and file format identification
- Many binary file formats begin with a magic byte sequence: a fixed pattern at the start of the file that identifies the format
- PNG: 89 50 4E 47 0D 0A 1A 0A (the 89 signals "non-ASCII binary"; the next three bytes are "PNG" in ASCII)
- ZIP: 50 4B 03 04 ("PK" in ASCII, for Phil Katz who invented the format)
- PDF: 25 50 44 46 2D ("%PDF-" in ASCII)
- This is how the file command on Linux identifies file types without relying on the file extension
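A minimal sketch of the same idea in Python, using only the three signatures listed above. The real file command consults a much larger database of patterns; the names MAGIC, sniff, and sniff_file are hypothetical.

```python
# Signatures from this section, keyed by their leading bytes.
MAGIC = {
    bytes.fromhex("89504E470D0A1A0A"): "PNG",
    bytes.fromhex("504B0304"): "ZIP",
    bytes.fromhex("255044462D"): "PDF",
}

def sniff(data: bytes) -> str:
    """Return the format whose magic bytes start the data, else 'unknown'."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"

def sniff_file(path: str) -> str:
    """Read only the first 8 bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return sniff(f.read(8))

print(sniff(b"%PDF-1.7\n"))          # PDF
print(sniff(b"PK\x03\x04..."))       # ZIP
print(sniff(b"Hello, world!"))       # unknown
```

Note that sniff never looks at a file name, which is why renaming a file does not change the result.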
Labs (~90 minutes)
Lab 2.1: UTF-8 Encoding + Decoding (labs/lab-2-1-utf8-encoding.md)
- Encode "café" in UTF-8 by hand (on paper), then verify with Python
- Decode a short hex dump back to text character by character
- Artifact: Python script + hand-calculation worksheet committed to your Git repo
Lab 1.3 (Tier-1 companion; recommended as warm-up for 2.1): Multi-Base Slider Workout (worksheets/fnd-101/lab-multi-base-slider.md)
- Same worksheet introduced in Week 1; Table C (hex -> ASCII) directly exercises the bit-5 case-swap from this week's Section 2
- Drive the Multi-Base Slider through the printable-ASCII range and confirm the byte-to-character mapping you will need for Lab 2.1's UTF-8 decoding
- If you took the worksheet in Week 1, replay Table C with new hex values (0x21-0x7E) as a fluency drill
Independent practice (~4 hours)
- Write out the full ASCII table (0x20 through 0x7E) from memory on paper; check against a reference chart
- Decode the following hex sequence to UTF-8 text by hand:
48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
- Read the UTF-8 Wikipedia article's "History" section; note why the Basic Multilingual Plane (Plane 0) covers 65,536 code points and why higher planes exist
- Pick a language other than English (Spanish, French, Japanese, Arabic). Find 3 characters in that language that require multi-byte UTF-8 encoding. Look up their Unicode code points and verify the UTF-8 byte encoding by hand.
- Preview for next week: read Petzold CODE Ch 10 ("Logic and Switches"). Come to week 3 having seen the truth table concept once.
Reflection prompts (~30 minutes)
- The lecture said UTF-8 is backwards compatible with ASCII: any valid ASCII file is a valid UTF-8 file. Explain why this is true from the encoding rules.
- You encoded "café" as bytes. The letters 'c', 'a', and 'f' took 1 byte each; the 'é' took 2 bytes. Why is this asymmetry useful for English text?
- A URL sometimes contains sequences like %C3%A9 where you would expect an accented character. What are those percent signs doing there?
- What would break if two programs on the same computer used different character encodings for the same file?
- Magic bytes identify file types more reliably than file extensions. Why?
What comes next
Week 3 moves to Boolean logic: the AND, OR, NOT, XOR, NAND, NOR operations that are the algebraic skeleton of every circuit inside a computer. The hex dumps you decoded this week are data; next week you start learning about the logic that processes data.