Week 5: File I/O

Read and write files. Plain text, CSV, JSON. The lab is a log-file scanner that reads a multi-megabyte system log and extracts every ERROR line.


Theme

A program that only manipulates values it holds in memory is limited to what fits in memory. As soon as you want to read a file (or write one), you cross from "data structures in your head" to "data structures on disk." Python's file I/O is intentionally simple at the surface: open(...), read(), write(), close(). The discipline is in the details: encoding, line endings, when to read line-by-line vs all at once, when to use the with block.

This week also introduces the two structured-data formats every working programmer reads and writes: CSV (one row per line, comma-separated) and JSON (nested key-value with arrays and objects). Both have a stdlib module in Python, and the modules are good enough that you almost never write custom parsers for either format.

The lab is a log-file scanner: given a multi-megabyte log file, find every line containing "ERROR" and report the count plus the first 10 matches. This is a real practitioner workflow: operations engineers scan logs this way routinely. It is also the foundation for week 6's rewrite with argparse (the same scanner with a real CLI) and week 7's regex work (extracting structured fields from log lines).

By the end of week 5 you can: open and close files safely with with; read a file line-by-line for memory efficiency; read and write JSON with json.load and json.dump; read and write CSV with csv.DictReader and csv.DictWriter; recognize encoding errors (UnicodeDecodeError) and pick the right encoding for the file at hand.

Reading list (~1 hour)

  1. Matthes, Python Crash Course 2nd ed., Ch 10 ("Files and Exceptions"). Matthes covers with open(...) as f:, reading and writing, and the try/except framing for missing-file errors. FND-102 introduces try/except at the topic level here (full coverage in week 9).
  2. Sweigart, Automate the Boring Stuff with Python 2nd ed., Ch 9 ("Reading and Writing Files") at https://automatetheboringstuff.com/2e/chapter9/ and Ch 16 ("Working with CSV Files and JSON Data") at https://automatetheboringstuff.com/2e/chapter16/. Free online. Sweigart's framing of pathlib.Path as the modern replacement for os.path is the cleanest in print.
  3. Allen B. Downey, Think Python 2nd ed., Ch 14 ("Files") at https://greenteapress.com/thinkpython2/html/thinkpython2015.html. Free online. Downey's section on "Filenames and paths" (14.2) is a useful complement.
  4. Real Python: "Python's pathlib Module: Taming the File System" at https://realpython.com/python-pathlib/. ~20 min read. The modern alternative to os.path; FND-102 uses pathlib from this week onward.

Lecture outline (~1.5 hours, 2 sessions of ~50 min)

Session 1: Opening, reading, writing

Section 1.1: The open() function and the with block

  • The basic read pattern:
    with open('data.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    
  • with is the context manager form. Python guarantees the file is closed when the block exits, even if an exception is raised inside.
  • Without with, you must call f.close() manually:
    f = open('data.txt')
    content = f.read()
    f.close()
    
  • The non-with form is fragile: a crash between open and close leaks the file handle. ALWAYS use with in new code.
  • Mode strings: 'r' (read, default), 'w' (write, TRUNCATES the file), 'a' (append), 'r+' (read+write). Add 'b' for binary: 'rb', 'wb'.
  • The default mode is text ('rt'); the default encoding depends on your platform. ALWAYS pass encoding='utf-8' explicitly for text files; on Windows the platform default is often a legacy code page such as cp1252, which garbles or rejects non-ASCII characters in a UTF-8 file.
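
The mode strings above are easiest to trust once you have watched them act on the same file. The following is a minimal sketch, not part of the lab; the file name notes.txt and the use of a temporary directory are assumptions made so that nothing real is overwritten.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'notes.txt'  # hypothetical file name

    # 'w' creates the file, or truncates it if it already exists.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('first\n')

    # 'a' appends to the end without touching existing content.
    with open(path, 'a', encoding='utf-8') as f:
        f.write('second\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_append = f.read()

    # Reopening with 'w' discards everything written so far.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('third\n')

    with open(path, 'r', encoding='utf-8') as f:
        after_truncate = f.read()

    # The with block has exited, so the handle is already closed.
    handle_closed = f.closed

print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_truncate))  # 'third\n'
print(handle_closed)         # True
```

The second 'w' is the one to watch: it silently replaces the two earlier lines, which is why 'a' is the safer choice when adding to an existing file.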

Section 1.2: Reading patterns

  • f.read() returns the entire file as one string. Fine for small files (< 10 MB). Memory-hungry for large ones.
  • f.read(n) reads at most n characters. Useful for streaming.
  • f.readline() reads one line including the trailing newline.
  • f.readlines() returns a list of all lines. Same memory issue as f.read().
  • The Pythonic pattern for line-by-line:
    with open('big.log', encoding='utf-8') as f:
        for line in f:
            if 'ERROR' in line:
                print(line.rstrip())
    
  • Iterating a file object yields one line at a time WITHOUT loading the whole file into memory. This is how you scan a multi-gigabyte log without crashing.
  • Lines include the trailing \n. .rstrip() removes trailing whitespace (including \n).
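
As a rough illustration of f.read(n) and f.readline(), the sketch below reads a small in-memory "file" in fixed-size chunks. The helper name count_chars_in_chunks and the sample text are made up for this example, and io.StringIO stands in for a real file opened with open(...).

```python
import io

def count_chars_in_chunks(f, chunk_size=4096):
    """Count characters by reading fixed-size chunks instead of the whole file."""
    total = 0
    while True:
        chunk = f.read(chunk_size)  # returns at most chunk_size characters
        if not chunk:               # an empty string means end of file
            break
        total += len(chunk)
    return total

# io.StringIO behaves like a text file that lives in memory.
sample = io.StringIO('INFO ok\nERROR disk full\nINFO ok\n')
total_chars = count_chars_in_chunks(sample, chunk_size=8)
print(total_chars)            # 32

# Rewind, then read a single line: the trailing newline is kept.
sample.seek(0)
first = sample.readline()
print(repr(first))            # 'INFO ok\n'
print(repr(first.rstrip()))   # 'INFO ok'
```

Chunked reads are mostly useful when the data has no meaningful lines (or very long ones); for ordinary logs, iterating over the file object is simpler.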

Section 1.3: Writing patterns

  • The write pattern:
    with open('output.txt', 'w', encoding='utf-8') as f:
        f.write('hello\n')
        f.write('world\n')
    
  • f.write() does NOT add a newline; you do that yourself.
  • f.writelines(list_of_strings) writes a list of strings. Also does NOT add newlines.
  • For mixed text formatting, use print(..., file=f):
    with open('output.txt', 'w', encoding='utf-8') as f:
        print('hello', file=f)  # print adds the newline for you
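
To see the newline behavior of writelines() and print(..., file=f) side by side, here is a small sketch. The list of names and the file name names.txt are invented for illustration, and the file is written to a temporary directory.

```python
import tempfile
from pathlib import Path

names = ['alice', 'bob', 'carol']  # hypothetical sample data

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'names.txt'

    # writelines() writes each string as-is, so append '\n' yourself.
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(name + '\n' for name in names)

    # print(..., file=f) joins its arguments with spaces and adds the newline.
    with open(path, 'a', encoding='utf-8') as f:
        print('total:', len(names), file=f)

    written = path.read_text(encoding='utf-8')

print(repr(written))  # 'alice\nbob\ncarol\ntotal: 3\n'
```

Dropping the + '\n' would produce the single line alicebobcarol, which is the most common surprise with writelines().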
    

Section 1.4: pathlib for paths

  • pathlib.Path is the modern, OS-agnostic way to handle file paths.
  • Construct: from pathlib import Path; p = Path('data.txt'). Or: Path.home() / 'fnd-102' / 'lab-5' / 'data.txt'.
  • The / operator joins path components. Cross-platform: works on Windows (handles backslashes) and Unix.
  • Useful methods:
    • p.exists() returns True if the path exists
    • p.is_file(), p.is_dir() distinguish file vs directory
    • p.read_text(encoding='utf-8') reads the whole file (shortcut for with open(p)...)
    • p.write_text('content', encoding='utf-8') writes the whole file
    • p.glob('*.log') yields all paths matching the pattern
    • p.parent, p.name, p.suffix, p.stem for path-component access
  • Older code uses os.path.join, os.path.exists, etc. They still work. Prefer pathlib in new code.
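
The methods listed above can be tried without touching your real home directory. The sketch below builds a small directory tree in a temporary location; the directory and file names mirror the lab layout but are otherwise arbitrary, and mkdir(parents=True) is one more Path method used here to create the folders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    lab_dir = Path(tmp) / 'fnd-102' / 'lab-5'   # / joins path components
    lab_dir.mkdir(parents=True)                  # create intermediate directories too

    log_path = lab_dir / 'sample.log'
    log_path.write_text('INFO request OK\n', encoding='utf-8')
    (lab_dir / 'notes.txt').write_text('not a log\n', encoding='utf-8')

    # Path-component access.
    parts = (log_path.name, log_path.stem, log_path.suffix, log_path.parent.name)

    # Existence and type checks.
    is_file = log_path.is_file()
    is_dir = lab_dir.is_dir()
    missing_exists = (lab_dir / 'nope.txt').exists()

    # glob() only yields paths that match the pattern.
    log_names = sorted(p.name for p in lab_dir.glob('*.log'))

print(parts)           # ('sample.log', 'sample', '.log', 'lab-5')
print(is_file, is_dir) # True True
print(missing_exists)  # False
print(log_names)       # ['sample.log']
```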

Session 2: CSV, JSON, encodings

Section 2.1: The csv module

  • CSV (comma-separated values) is the universal interchange format for tabular data.
  • Read with csv.DictReader (each row is a dict keyed by the header):
    import csv
    with open('students.csv', newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            print(row['name'], row['grade'])
    
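  • Always pass newline='' when opening a file for the csv module. It tells open not to translate line endings, so the csv module can handle \r\n vs \n itself. Without it, newlines inside quoted fields can be misread, and files written on Windows can gain an extra \r on every row.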
  • Write with csv.DictWriter:
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'grade'])
        writer.writeheader()
        writer.writerow({'name': 'alice', 'grade': 9})
    
  • CSV is deceptively simple. Real-world quirks: quoted fields containing commas, escaped quotes, multi-line fields, different delimiters (semicolon in European CSV). csv.reader and csv.DictReader handle these if you pass the right dialect or formatting parameters (for example, delimiter=';'). The Python docs at https://docs.python.org/3/library/csv.html cover the corner cases.
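
A few of those quirks can be reproduced in memory. This is only a sketch: the sample rows are invented, and io.StringIO stands in for a file opened with newline=''.

```python
import csv
import io

# Quoted fields may contain the delimiter or an escaped quote ("").
comma_text = 'name,comment\r\nalice,"likes commas, a lot"\r\nbob,"said ""hi"""\r\n'
rows = list(csv.DictReader(io.StringIO(comma_text, newline='')))
print(rows[0]['comment'])  # likes commas, a lot
print(rows[1]['comment'])  # said "hi"

# A semicolon-delimited file needs delimiter=';'. Values always arrive as strings.
semicolon_text = 'name;grade\r\ncarol;9\r\n'
semi_rows = list(csv.DictReader(io.StringIO(semicolon_text, newline=''), delimiter=';'))
print(semi_rows[0])        # {'name': 'carol', 'grade': '9'}

# On output, the writer adds quotes only where a field needs them.
out = io.StringIO(newline='')
csv.writer(out).writerow(['dave', 'a, b'])
written = out.getvalue()
print(repr(written))       # 'dave,"a, b"\r\n'
```

Note that the grade comes back as the string '9', not the integer 9; converting types is your job, not the csv module's.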

Section 2.2: The json module

  • JSON (JavaScript Object Notation) is the universal interchange format for nested key-value data.
  • Read a JSON file into Python data:
    import json
    with open('config.json', encoding='utf-8') as f:
        data = json.load(f)
    
  • Write Python data to JSON:
    with open('config.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=2, sort_keys=True)
    
  • The translation: Python dict becomes JSON object; Python list becomes JSON array; Python str becomes JSON string; Python int/float become JSON number; Python True/False/None become JSON true/false/null. A Python tuple is also written as a JSON array (and comes back as a list on read). Other Python types (set, datetime) need a custom encoder.
  • JSON keys MUST be strings. A dict with int keys (like {9: ['alice']} from Lab 4) gets stringified on write: {"9": ["alice"]}. On read, it stays a string.
  • String-vs-file: json.loads(s) parses a string; json.load(f) parses from an open file. Likewise, json.dumps(obj) returns a string and json.dump(obj, f) writes to a file. The 's' is for "string."
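
The key-stringifying behavior and the custom-encoder escape hatch can be sketched in a few lines. The function name encode_extra and the sample record are hypothetical; the default= parameter of json.dumps is the standard hook for types the module does not know.

```python
import json
from datetime import date

# Int keys are converted to strings on write and stay strings on read.
by_grade = {9: ['alice']}
text = json.dumps(by_grade)
restored = json.loads(text)
print(text)       # {"9": ["alice"]}
print(restored)   # {'9': ['alice']}

def encode_extra(obj):
    """Hypothetical fallback for types the json module cannot serialize."""
    if isinstance(obj, set):
        return sorted(obj)        # a set becomes a sorted JSON array
    if isinstance(obj, date):
        return obj.isoformat()    # a date becomes an ISO 8601 string
    raise TypeError(f'cannot serialize {type(obj).__name__}')

record = {'tags': {'b', 'a'}, 'due': date(2024, 1, 15)}
encoded = json.dumps(record, default=encode_extra)
print(encoded)    # {"tags": ["a", "b"], "due": "2024-01-15"}
```

The conversion is one-way: reading the result back gives a list and a plain string, so any code that loads the file has to rebuild the set and the date itself.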

Section 2.3: Encodings

  • A file on disk is bytes. To read it as text, Python needs to know the encoding. UTF-8 is the right answer most of the time; the rest is usually Latin-1, Windows-1252, plain ASCII, or UTF-16.
  • A UnicodeDecodeError means "the bytes you read are not valid in the encoding I tried." Fix: figure out the actual encoding (run file my.txt on macOS / Linux; try the chardet Python package; ask the file's source).
  • For BINARY files (images, PDFs, executables), open in binary mode ('rb') and skip the encoding argument.
  • The errors= argument controls behavior on invalid bytes: errors='strict' (default; raise), errors='replace' (substitute ?), errors='ignore' (drop). Use replace for "I want to scan this log and don't care about a few weird bytes"; use strict for "this file should be UTF-8 and I want to know if it isn't."
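
The same rules apply to bytes.decode(), which makes them easy to try without creating a file. The sketch below uses the word 'café' as an arbitrary example; the open(..., encoding=..., errors=...) arguments behave the same way when reading a file.

```python
# Four characters, five bytes: 'é' takes two bytes in UTF-8.
raw = 'café'.encode('utf-8')
print(raw)                      # b'caf\xc3\xa9'

# The right encoding round-trips cleanly.
as_utf8 = raw.decode('utf-8')

# The wrong encoding with errors='strict' (the default) raises.
try:
    raw.decode('ascii')
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict:', exc.reason)

# 'replace' keeps going and marks each undecodable byte.
replaced = raw.decode('ascii', errors='replace')
print(repr(replaced))

# 'ignore' silently drops the undecodable bytes.
ignored = raw.decode('ascii', errors='ignore')
print(repr(ignored))            # 'caf'

# A wrong-but-permissive encoding never raises; it just produces mojibake.
as_latin1 = raw.decode('latin-1')
print(as_latin1)                # cafÃ©
```

The last case is the dangerous one: Latin-1 accepts every byte value, so a wrong guess gives no error, only wrong text.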

Section 2.4: A first taste of try/except

  • The full coverage is week 9; here is enough to handle a missing file:
    try:
        with open('config.json', encoding='utf-8') as f:
            data = json.load(f)
    except FileNotFoundError:
        print('config.json not found; using defaults')
        data = {}
    
  • Catch the SPECIFIC exception class, not bare except: (which catches KeyboardInterrupt and hides bugs).
  • In week 5 use try/except only for the missing-file case in your Lab 5. Week 9 unpacks the full exception model.
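
One way to package the missing-file pattern is a small helper function. This is a sketch under assumptions: the name load_config, the defaults parameter, and the "top" setting are all invented for illustration, and only FileNotFoundError is handled, in keeping with the week 5 scope.

```python
import json
import tempfile
from pathlib import Path

def load_config(path, defaults=None):
    """Hypothetical helper: return parsed JSON, or a copy of defaults if the file is missing."""
    try:
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        # Only the missing-file case is handled; malformed JSON still raises.
        return dict(defaults or {})

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / 'config.json'

    # The file does not exist yet, so the defaults come back.
    when_missing = load_config(cfg_path, defaults={'top': 10})

    # Once the file exists, its contents win.
    cfg_path.write_text('{"top": 3}', encoding='utf-8')
    when_present = load_config(cfg_path, defaults={'top': 10})

print(when_missing)   # {'top': 10}
print(when_present)   # {'top': 3}
```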

Labs (~90 minutes)

Lab 5: Log-File Scanner (labs/lab-5-log-scanner.md)

  • Goal: build a CLI tool that reads a multi-MB log file, filters lines containing "ERROR", and reports the count + first 10 matches
  • Time: ~90 minutes
  • Artifact: lab-5-scanner.py + a sample log file, sample.log, in ~/fnd-102/lab-5/, committed to Git

Independent practice (~4 hours)

  1. Generate a test log (30 min). Write a Python script that creates a sample log file, sample.log, with 10,000 lines. Most lines are "INFO request OK"; ~5% are "ERROR something failed"; ~1% are "WARNING slow request." Use random to interleave. This is the log your Lab 5 will scan.

  2. Streaming vs slurping (30 min). Generate a large text file (the script below writes roughly 60 MB; raise the line count for a bigger file, or just use any large file you have):

    with open('big.txt', 'w', encoding='utf-8') as f:
        for i in range(1_000_000):
            f.write(f'line {i}: this is some filler text padded out to be longish.\n')
    

    Time these two approaches:

    • Slurping:
      with open('big.txt', encoding='utf-8') as f:
          lines = f.readlines()
      count = sum(1 for line in lines if 'line 999' in line)
    • Streaming:
      count = 0
      with open('big.txt', encoding='utf-8') as f:
          for line in f:
              if 'line 999' in line:
                  count += 1

    The first slurps the file; the second streams. Wall-clock time should be roughly the same, but the streaming version holds only one line (plus a small read buffer) in memory at a time, while the slurping version holds every line at once. Verify with the tracemalloc module, resource.getrusage (macOS / Linux only), or htop.

  3. CSV round-trip (45 min). Take your students.csv from Lab 4. Read it with csv.DictReader, mutate one field (e.g., uppercase all names), write back to students-upper.csv with csv.DictWriter. Verify the output round-trips: read the new CSV with DictReader, confirm the names are uppercase.

  4. JSON round-trip (30 min). Take any JSON file you can find (or write one by hand with json.dump). Read it with json.load; modify one field; write back. Verify the indent / sort_keys options change the output formatting.

  5. pathlib exploration (45 min). In a Python REPL:

    from itertools import islice
    from pathlib import Path
    home = Path.home()
    print(home)
    list(home.glob('*'))                    # all files/dirs in home
    list(islice(home.glob('**/*.py'), 5))   # first 5 .py files anywhere in home (recursive); glob returns a generator, so slice it with islice
    

    Then write a script that, given a directory path on the command line (manual sys.argv parsing), walks the directory and prints the total size in bytes of all .py files. Hint: sum(p.stat().st_size for p in Path(sys.argv[1]).rglob('*.py')).

  6. Encoding detective (30 min). Save a small text file with non-ASCII characters (your name with accent marks, or emoji, or kanji) in UTF-8. Read it in Python with encoding='utf-8' (works); then with encoding='ascii' (fails); then with encoding='ascii', errors='replace' (works, with ? substitutions). This is the failure mode every text-processing program eventually hits.
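
For practice item 2, one possible way to compare memory use from inside Python is the standard-library tracemalloc module. The sketch below is an assumption-laden illustration, not the required method: the helper names count_slurp, count_stream, and peak_bytes are made up, and the file is kept small (20,000 lines) so it runs quickly. Scale the line count up to see a bigger gap.

```python
import tempfile
import tracemalloc
from pathlib import Path

def count_slurp(path, needle):
    """Load every line into a list, then count matches."""
    with open(path, encoding='utf-8') as f:
        lines = f.readlines()
    return sum(1 for line in lines if needle in line)

def count_stream(path, needle):
    """Count matches one line at a time."""
    count = 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            if needle in line:
                count += 1
    return count

def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    result = func(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'big.txt'
    with open(path, 'w', encoding='utf-8') as f:
        for i in range(20_000):
            f.write(f'line {i}: this is some filler text padded out to be longish.\n')

    slurp_count, slurp_peak = peak_bytes(count_slurp, path, 'line 999')
    stream_count, stream_peak = peak_bytes(count_stream, path, 'line 999')

print('matches:', slurp_count, stream_count)
print('peak bytes (slurp): ', slurp_peak)
print('peak bytes (stream):', stream_peak)
```

tracemalloc only counts memory allocated by Python itself, so the numbers will not match what htop reports for the whole process, but the relative difference between the two approaches should be clear.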

Reflection prompts (~30 minutes)

  1. The with block was introduced as the "right" way to open files. Have you used it in any other context yet? (Hint: any context manager, such as a lock, a connection, or a transaction, works the same way.) Why might the language designers have added syntax just for this pattern?
  2. CSV and JSON both encode structured data. When would you pick CSV? When JSON? Your Lab 4 emitted JSON from CSV input; was that the right direction?
  3. The streaming-vs-slurping practice (item 2 above) shows a large memory difference for the same task. Have you ever had a program crash because it loaded too much data? What was the fix?
  4. pathlib.Path and os.path do the same things in slightly different shapes. Why might Python add a second way to do the same thing?
  5. One thing from this week you want to know more about?

Tool journal (week 5)

  • open(), with: opening files safely
  • f.read(), f.readline(), iteration: three ways to consume a file
  • f.write(), print(..., file=f): writing
  • pathlib.Path: modern path handling
  • csv.DictReader, csv.DictWriter: tabular data
  • json.load, json.dump: structured data
  • encoding='utf-8', errors='replace': text encoding control
  • try/except FileNotFoundError (first taste): handle missing files

What comes next

Week 6 introduces modules and the standard library tour. Your week 5 code probably has all the file reading inline in main(); week 6 teaches you to organize multi-function programs into modules, then refactors your Lab 5 to use argparse for the command-line interface (--input, --output, --top N, --verbose) and logging for the debug output. The same scanner becomes a real, configurable tool.