
Week 2: Characters and Encoding


From numbers to text. Computers store text as numbers; this week you learn exactly which numbers, and how UTF-8 turns Unicode code points into the bytes you find in real files.


Theme

A text file is bytes. The letter 'A' is the byte 0x41. The letter 'e' with an accent (é) is the two-byte sequence 0xC3 0xA9 in UTF-8. An emoji like "thumbs up" is four bytes: 0xF0 0x9F 0x91 0x8D.
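All three byte sequences above are easy to verify yourself; a quick check in Python (3.8+ for the separator argument to hex()):

```python
# Print the UTF-8 bytes behind each character from the examples above.
for ch in ["A", "é", "👍"]:
    print(ch, "->", ch.encode("utf-8").hex(" "))
# A -> 41
# é -> c3 a9
# 👍 -> f0 9f 91 8d
```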

None of this is magic. There is a published standard (Unicode) and a published encoding algorithm (UTF-8) that specifies, exactly, how any character maps to bytes. This week you learn the mapping well enough that you can decode a short hex dump of a UTF-8 file by hand, character by character.

By the end of week 2 you can look at any ASCII character's hex value and name the character, explain how UTF-8 encodes multi-byte characters, and use xxd to decode a text file without opening it in a text editor.

Reading list (~1 hour)

  1. Petzold CODE, Ch 20 ("ASCII and a Cast of Characters"): the history of ASCII and why 7-bit character encoding was a practical limit
  2. Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" at https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum/: the clearest short explanation of why encodings exist and what code points are
  3. Wikipedia, "UTF-8" (the "Description" section through "Encoding"): the byte-level encoding algorithm

Read Joel's article first; it sets up the conceptual framework Petzold and the Wikipedia entry elaborate on.

Lecture outline (~2 hours)

Section 1: Why encoding exists

  • A computer stores bytes; a byte is a number 0-255
  • Text is not bytes; text is a human convention
  • Encoding is the agreed-on mapping between human conventions (characters) and byte values
  • Without agreement on the mapping, one program cannot read the files another writes
  • ASCII (1963) was the first widely standardized mapping: 128 characters, 7 bits each

Section 2: ASCII

  • The 128 ASCII characters: 0-31 and 127 (DEL) are control characters (newline, tab, bell, etc.); 32-126 are printable (space, letters, digits, punctuation)
  • Key values to memorize: 'A' = 65 (0x41), 'a' = 97 (0x61), '0' = 48 (0x30), space = 32 (0x20), newline = 10 (0x0A)
  • The letter-case difference: 'A' and 'a' differ by exactly 32 (0x20), or one bit: bit 5
  • ASCII only covers English; it has no room for accented characters, Greek, Arabic, Chinese, emoji
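The key values and the bit-5 case swap from the bullets above can be confirmed in a few lines of Python:

```python
# The ASCII values worth memorizing, checked with ord().
assert ord("A") == 65 == 0x41
assert ord("a") == 97 == 0x61
assert ord("0") == 48 == 0x30
assert ord(" ") == 32 == 0x20
assert ord("\n") == 10 == 0x0A

# Flipping bit 5 (value 32 = 0x20) toggles a letter's case.
print(chr(ord("A") ^ 0x20))  # a
print(chr(ord("z") ^ 0x20))  # Z
```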

Browser tool: Multi-Base Number Slider. Drag the slider through the ASCII range and watch the character column update. Move to 65 and you get 'A'; move to 97 and you get 'a'; the difference is exactly the bit 5 swap the lecture noted. Try 32 (space), 48 ('0'), 10 (LF newline, marked non-printable). The slider is the fastest way to internalize which byte is which character.

Section 3: Unicode and code points

  • Unicode is not an encoding; it is a character catalog. Every character in every writing system gets a "code point": a number like U+0041 for 'A', U+00E9 for 'e with acute' (é), U+1F44D for the thumbs-up emoji
  • As of 2024, Unicode catalogs over 150,000 characters
  • The code point is an abstract number. UTF-8 is one algorithm (by far the most common) for turning code points into bytes; UTF-16 and UTF-32 are others.
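Code points are just integers, and Python exposes them directly through ord() and chr(); a quick illustration with the three examples above:

```python
# ord() gives a character's Unicode code point; chr() goes the other way.
print(hex(ord("A")))   # 0x41   -> U+0041
print(hex(ord("é")))   # 0xe9   -> U+00E9
print(hex(ord("👍")))  # 0x1f44d -> U+1F44D
assert chr(0x1F44D) == "👍"

# Same characters, very different byte counts once encoded in UTF-8:
print([len(c.encode("utf-8")) for c in "Aé👍"])  # [1, 2, 4]
```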

Section 4: UTF-8 encoding

  • UTF-8 is a variable-width encoding: ASCII characters take 1 byte (backwards compatible); other code points take 2, 3, or 4 bytes
  • The bit prefix of the first byte tells you how many bytes the character uses:
    • 0xxxxxxx: 1-byte character (ASCII; code points 0-127)
    • 110xxxxx 10xxxxxx: 2-byte character (code points 128-2047)
    • 1110xxxx 10xxxxxx 10xxxxxx: 3-byte character (code points 2048-65535)
    • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4-byte character (code points 65536-1114111)
  • The continuation bytes all start with 10xxxxxx; this lets a reader re-sync mid-stream
  • Example: é (U+00E9 = 233 decimal = 0b11101001)
    • Fits in 2-byte form: 110xxxxx 10xxxxxx, where xxx...xxx are the 8 bits of 0xE9
    • 0b11101001 split across 5 bits + 6 bits: 00011 101001
    • First byte: 110 00011 = 0xC3; second byte: 10 101001 = 0xA9
    • Verify: python3 -c "print('é'.encode('utf-8').hex())" outputs c3a9
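The bit-shuffling in the é example generalizes to all four widths. Here is a sketch of an encoder written straight from the prefix table above (for study only; in practice str.encode does this for you):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8, following the prefix table."""
    if cp < 0x80:                  # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                 # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    if cp < 0x10000:               # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12,
                      0x80 | (cp >> 6) & 0x3F,
                      0x80 | cp & 0x3F])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18,
                  0x80 | (cp >> 12) & 0x3F,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

print(utf8_encode(0x00E9).hex())   # c3a9, matching the worked example
print(utf8_encode(0x1F44D).hex())  # f09f918d
```

Note how each continuation byte is built the same way: take the next 6 bits of the code point and OR in the 10xxxxxx prefix (0x80).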

Section 5: Magic bytes and file format identification

  • Every binary file format has a magic byte sequence at the start: a fixed pattern that identifies the format
  • PNG: 89 50 4E 47 0D 0A 1A 0A (the 89 signals "non-ASCII binary"; the next three bytes are "PNG" in ASCII)
  • ZIP: 50 4B 03 04 ("PK" in ASCII, for Phil Katz who invented the format)
  • PDF: 25 50 44 46 2D ("%PDF-" in ASCII)
  • This is how file on Linux identifies file types without looking at the file extension
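A toy version of what file does fits in a few lines; the signature table below is just the three formats from the bullets above (the real tool consults a much larger magic database):

```python
# Identify a file by its leading bytes instead of its extension.
MAGIC = {
    bytes.fromhex("89504E470D0A1A0A"): "PNG image",
    bytes.fromhex("504B0304"):         "ZIP archive",
    b"%PDF-":                          "PDF document",
}

def sniff(data: bytes) -> str:
    """Return a format name if the data starts with a known magic sequence."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"

print(sniff(b"%PDF-1.7 ..."))    # PDF document
print(sniff(b"PK\x03\x04rest"))  # ZIP archive
```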

Labs (~90 minutes)

Lab 2.1: UTF-8 Encoding + Decoding (labs/lab-2-1-utf8-encoding.md)

  • Encode "café" in UTF-8 by hand (on paper), then verify with Python
  • Decode a short hex dump back to text character by character
  • Artifact: Python script + hand-calculation worksheet committed to your Git repo

Lab 1.3 (Tier-1 companion; recommended as warm-up for 2.1): Multi-Base Slider Workout (worksheets/fnd-101/lab-multi-base-slider.md)

  • Same worksheet introduced in Week 1; Table C (hex -> ASCII) directly exercises the bit-5 case-swap from this week's Section 2
  • Drive the Multi-Base Slider through the printable-ASCII range and confirm the byte-to-character mapping you will need for Lab 2.1's UTF-8 decoding
  • If you took the worksheet in Week 1, replay Table C with new hex values (0x21-0x7E) as a fluency drill

Independent practice (~4 hours)

  1. Write out the full ASCII table (0x20 through 0x7E) from memory on paper; check against a reference chart
  2. Decode the following hex sequence to UTF-8 text by hand: 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21
  3. Read the UTF-8 Wikipedia article's "History" section; note why the Basic Multilingual Plane (plane 0) covers 65,536 characters and why higher planes exist
  4. Pick a language other than English (Spanish, French, Japanese, Arabic). Find 3 characters in that language that require multi-byte UTF-8 encoding. Look up their Unicode code points and verify the UTF-8 byte encoding by hand.
  5. Preview for next week: read Petzold CODE Ch 10 ("Logic and Switches"). Come to week 3 having seen the truth table concept once.
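For practice item 2 (and for Lab 2.1), you can check any hand decode after the fact with bytes.fromhex; shown here on the "café" bytes so the practice answer stays unspoiled:

```python
# Verify a hand-decoded hex dump: strip spaces, parse, decode as UTF-8.
hex_dump = "63 61 66 c3 a9"  # the UTF-8 bytes for "café"
text = bytes.fromhex(hex_dump.replace(" ", "")).decode("utf-8")
print(text)  # café
```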

Reflection prompts (~30 minutes)

  1. The lecture said UTF-8 is backwards compatible with ASCII: any valid ASCII file is also a valid UTF-8 file. Explain why this is true from the encoding rules.
  2. You decoded "café" from bytes. The letter 'c', 'a', 'f' took 1 byte each. The 'é' took 2 bytes. Why is this asymmetry useful for English text?
  3. A URL sometimes contains sequences like %C3%A9 where you would expect an accented character. What are those percent signs doing there?
  4. What would break if two programs on the same computer used different character encodings for the same file?
  5. Magic bytes identify file types more reliably than file extensions. Why?

What comes next

Week 3 moves to Boolean logic: the AND, OR, NOT, XOR, NAND, NOR operations that are the algebraic skeleton of every circuit inside a computer. The hex dumps you decoded this week are data; next week you start learning about the logic that processes data.