Read and write files. Plain text, CSV, JSON. The lab is a log-file scanner that reads a multi-megabyte system log and extracts every ERROR line.
Theme
A program that only manipulates values it holds in memory is limited to what fits in memory. As soon as you want to read a file (or write one), you cross from "data structures in your head" to "data structures on disk." Python's file I/O is intentionally simple at the surface: open(...), read(), write(), close(). The discipline is in the details: encoding, line endings, when to read line-by-line vs all at once, when to use the with block.
This week also introduces the two structured-data formats every working programmer reads and writes: CSV (one row per line, comma-separated) and JSON (nested key-value with arrays and objects). Both have a stdlib module in Python, and the modules are good enough that you almost never write custom parsers for either format.
The lab is a log-file scanner: given a multi-megabyte log file, find every line containing "ERROR" and report the count plus the first 10. This is a real practitioner workflow that operations engineers use routinely, and it is the foundation for week 6's rewrite with argparse (the same scanner with a real CLI) and week 7's regex work (extracting structured fields from log lines).
By the end of week 5 you can: open and close files safely with with; read a file line-by-line for memory efficiency; read and write JSON with json.load and json.dump; read and write CSV with csv.DictReader and csv.DictWriter; recognize encoding errors (UnicodeDecodeError) and pick the right encoding for the file at hand.
Reading list (~1 hour)
- Matthes, Python Crash Course 2nd ed., Ch 10 ("Files and Exceptions"). Matthes covers with open(...) as f:, reading and writing, and the try/except framing for missing-file errors. FND-102 introduces try/except at the topic level here (full coverage in week 9).
- Sweigart, Automate the Boring Stuff with Python 2nd ed., Ch 9 ("Reading and Writing Files") at https://automatetheboringstuff.com/2e/chapter9/ and Ch 16 ("Working with CSV Files and JSON Data") at https://automatetheboringstuff.com/2e/chapter16/. Free online. Sweigart's framing of pathlib.Path as the modern replacement for os.path is the cleanest in print.
- Allen B. Downey, Think Python 2nd ed., Ch 14 ("Files") at https://greenteapress.com/thinkpython2/html/thinkpython2015.html. Free online. Downey's section on "Filenames and paths" (14.4) is a useful complement.
- Real Python: "Python's pathlib Module: Taming the File System" at https://realpython.com/python-pathlib/. ~20 min read. The modern alternative to os.path; FND-102 uses pathlib from this week onward.
Lecture outline (~1.5 hours, 2 sessions of ~50 min)
Session 1: Opening, reading, writing
Section 1.1: The open() function and the with block
- The basic read pattern:

      with open('data.txt', 'r', encoding='utf-8') as f:
          content = f.read()

- with is the context manager form. Python guarantees the file is closed when the block exits, even if an exception is raised inside.
- Without with, you must call f.close() manually:

      f = open('data.txt')
      content = f.read()
      f.close()

- The non-with form is fragile: a crash between open and close leaks the file handle. ALWAYS use with in new code.
- Mode strings: 'r' (read, default), 'w' (write, TRUNCATES the file), 'a' (append), 'r+' (read+write). Add 'b' for binary: 'rb', 'wb'.
- The default mode is text ('rt'); the default encoding depends on your platform. ALWAYS pass encoding='utf-8' explicitly for text files; the default can bite you when a Windows machine reads a UTF-8 file.
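The mode strings and the closing guarantee above can be checked in a few lines. This is a minimal sketch, not part of the course materials: it works in a throwaway temporary directory, and the file name data.txt and the simulated RuntimeError are illustrative assumptions.

```python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch touches no real files.
workdir = Path(tempfile.mkdtemp())
path = workdir / "data.txt"  # hypothetical file name

# 'w' creates the file, or truncates it if it already exists.
with open(path, "w", encoding="utf-8") as f:
    f.write("first\n")

# 'a' appends to the end instead of truncating.
with open(path, "a", encoding="utf-8") as f:
    f.write("second\n")

# The file is closed when the block exits, even if an exception escapes it.
try:
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
        raise RuntimeError("simulated crash inside the with block")
except RuntimeError:
    pass
closed_after_error = f.closed

# Reopening with 'w' discards the previous contents.
with open(path, "w", encoding="utf-8") as f:
    f.write("replaced\n")
final_text = path.read_text(encoding="utf-8")

print(repr(content))       # both lines written so far
print(closed_after_error)  # the handle was closed despite the exception
print(repr(final_text))    # only the text from the last 'w' open
```

The f.closed check is the part worth remembering: the with block released the handle even though the body raised.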
Section 1.2: Reading patterns
- f.read() returns the entire file as one string. Fine for small files (< 10 MB). Memory-hungry for large ones.
- f.read(n) reads at most n characters. Useful for streaming.
- f.readline() reads one line including the trailing newline.
- f.readlines() returns a list of all lines. Same memory issue as f.read().
- The Pythonic pattern for line-by-line:

      with open('big.log', encoding='utf-8') as f:
          for line in f:
              if 'ERROR' in line:
                  print(line.rstrip())

- Iterating a file object yields one line at a time WITHOUT loading the whole file into memory. This is how you scan a multi-gigabyte log without crashing.
- Lines include the trailing \n. .rstrip() removes trailing whitespace (including \n).
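To compare the reading patterns side by side, the sketch below runs readline(), read(n), and plain iteration over the same three-line file. The file name small.log, its contents, and the 16-character chunk size are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
log_path = workdir / "small.log"  # hypothetical file name
original = "INFO start\nERROR disk full\nINFO done\n"
log_path.write_text(original, encoding="utf-8")

# readline(): one line per call, trailing newline included.
with open(log_path, encoding="utf-8") as f:
    first_line = f.readline()

# read(n): at most n characters per call; an empty string signals end of file.
chunks = []
with open(log_path, encoding="utf-8") as f:
    while True:
        chunk = f.read(16)
        if not chunk:
            break
        chunks.append(chunk)

# Iteration: one line at a time, without holding the whole file in memory.
error_lines = []
with open(log_path, encoding="utf-8") as f:
    for line in f:
        if "ERROR" in line:
            error_lines.append(line.rstrip())

print(repr(first_line))
print(len(chunks), "chunks")
print(error_lines)
```

Joining the chunks reproduces the file exactly, which is why read(n) is safe for streaming data where line boundaries do not matter.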
Section 1.3: Writing patterns
- The write pattern:

      with open('output.txt', 'w', encoding='utf-8') as f:
          f.write('hello\n')
          f.write('world\n')

- f.write() does NOT add a newline; you do that yourself.
- f.writelines(list_of_strings) writes a list of strings. Also does NOT add newlines.
- For mixed text formatting, use print(..., file=f):

      with open('output.txt', 'w', encoding='utf-8') as f:
          print('hello', file=f)  # print adds the newline for you
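The three writing calls differ only in who supplies the newline. A short sketch, with an assumed list of names and a temporary output file, shows them together:

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
out_path = workdir / "output.txt"  # hypothetical file name

names = ["alice", "bob", "carol"]  # sample data for illustration

with open(out_path, "w", encoding="utf-8") as f:
    f.write("names:\n")                          # write(): you add the newline
    f.writelines(f"{name}\n" for name in names)  # writelines(): same rule, per item
    print("total", len(names), sep="=", file=f)  # print(): newline added for you

written = out_path.read_text(encoding="utf-8")
print(written, end="")
```

Note that writelines accepts any iterable of strings, so a generator expression works as well as a list.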
Section 1.4: pathlib for paths
- pathlib.Path is the modern, OS-agnostic way to handle file paths.
- Construct: from pathlib import Path; p = Path('data.txt'). Or: Path.home() / 'fnd-102' / 'lab-5' / 'data.txt'.
- The / operator joins path components. Cross-platform: works on Windows (handles backslashes) and Unix.
- Useful methods:
  - p.exists() returns True if the path exists
  - p.is_file(), p.is_dir() distinguish file vs directory
  - p.read_text(encoding='utf-8') reads the whole file (shortcut for with open(p)...)
  - p.write_text('content', encoding='utf-8') writes the whole file
  - p.glob('*.log') yields all paths matching the pattern
  - p.parent, p.name, p.suffix, p.stem for path-component access
- Older code uses os.path.join, os.path.exists, etc. They still work. Prefer pathlib in new code.
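The pathlib methods listed above can be tried without touching your real home directory. The sketch below builds a small tree under a temporary directory; the directory layout mirrors the course's fnd-102/lab-5 convention, while the file names and contents are assumptions, and mkdir(parents=True) is a Path method not covered in the list.

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
lab_dir = root / "fnd-102" / "lab-5"  # the / operator joins components
lab_dir.mkdir(parents=True)           # create intermediate directories too

log_path = lab_dir / "sample.log"
log_path.write_text("INFO ok\nERROR failed\n", encoding="utf-8")
(lab_dir / "notes.txt").write_text("scratch\n", encoding="utf-8")

checks = (log_path.exists(), log_path.is_file(), lab_dir.is_dir())
parts = (log_path.name, log_path.stem, log_path.suffix)
log_names = sorted(p.name for p in lab_dir.glob("*.log"))
text = log_path.read_text(encoding="utf-8")

print(checks)
print(parts)
print(log_path.parent == lab_dir)
print(log_names)
```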
Session 2: CSV, JSON, encodings
Section 2.1: The csv module
- CSV (comma-separated values) is the universal interchange format for tabular data.
- Read with csv.DictReader (each row is a dict keyed by the header):

      import csv

      with open('students.csv', newline='', encoding='utf-8') as f:
          reader = csv.DictReader(f)
          for row in reader:
              print(row['name'], row['grade'])

- Treat the newline='' argument as mandatory: it tells open not to translate line endings, so the csv module handles \r\n vs \n correctly. Without it, newlines embedded in quoted fields can be misread, and on Windows the writer can emit an extra \r on every row.
- Write with csv.DictWriter:

      with open('output.csv', 'w', newline='', encoding='utf-8') as f:
          writer = csv.DictWriter(f, fieldnames=['name', 'grade'])
          writer.writeheader()
          writer.writerow({'name': 'alice', 'grade': 9})

- CSV is deceptively simple. Real-world quirks: quoted fields containing commas, escaped quotes, multi-line fields, different delimiters (semicolon in European CSV).
- csv.reader handles all of them if you pass the right dialect or formatting parameters (for example, delimiter=';'). The Python docs at https://docs.python.org/3/library/csv.html cover the corner cases.
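The quirks listed above are easiest to trust once you have seen them survive a round trip. This sketch writes three awkward rows with a semicolon delimiter and reads them back; the file name, field names, and row contents are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "comments.csv"  # hypothetical file name

rows_out = [
    {"name": "alice", "comment": "likes commas, and; semicolons"},  # contains the delimiter
    {"name": "bob", "comment": 'said "hi"'},                        # contains quotes
    {"name": "carol", "comment": "line one\nline two"},             # multi-line field
]

# newline='' on both sides lets the csv module manage line endings itself.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "comment"], delimiter=";")
    writer.writeheader()
    writer.writerows(rows_out)

with open(csv_path, newline="", encoding="utf-8") as f:
    rows_in = list(csv.DictReader(f, delimiter=";"))

print(len(rows_in), "rows read back")
print(rows_in == rows_out)
```

The writer quotes any field that needs it, and the reader undoes the quoting, which is the main reason to avoid splitting lines on the delimiter by hand.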
Section 2.2: The json module
- JSON (JavaScript Object Notation) is the universal interchange format for nested key-value data.
- Read a JSON file into Python data:

      import json

      with open('config.json', encoding='utf-8') as f:
          data = json.load(f)

- Write Python data to JSON:

      with open('config.json', 'w', encoding='utf-8') as f:
          json.dump(data, f, indent=2, sort_keys=True)

- The translation: Python dict becomes JSON object; Python list and tuple become JSON array (so a tuple comes back as a list); Python str becomes JSON string; Python int/float become JSON number; Python True/False/None become JSON true/false/null. Other Python types (set, datetime) need a custom encoder.
- JSON keys MUST be strings. A dict with int keys (like {9: ['alice']} from Lab 4) gets stringified on write: {"9": ["alice"]}. On read, it stays a string.
- String vs. file: json.loads(s) parses a string; json.load(f) parses from a file. The 's' is for "string." The same split applies to json.dumps and json.dump.
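The type-translation rules above have a few consequences that are easier to see than to remember. The sketch below uses json.dumps and json.loads on in-memory strings; the by_grade dict echoes the Lab 4 example, and the encode_extra helper is a hypothetical name for one possible default= hook, not a function the json module provides.

```python
import json
from datetime import date

# Int keys are stringified on write and stay strings on read.
by_grade = {9: ["alice"], 10: ["bob"]}
text = json.dumps(by_grade)
restored = json.loads(text)
fixed = {int(key): value for key, value in restored.items()}  # convert back by hand

# A tuple is written as a JSON array and comes back as a list.
tuple_roundtrip = json.loads(json.dumps((1, 2)))

# A set is rejected unless you tell json how to encode it.
try:
    json.dumps({"tags": {"a", "b"}})
    set_rejected = False
except TypeError:
    set_rejected = True


def encode_extra(obj):
    """Hypothetical default= hook: sets become sorted lists, dates become ISO strings."""
    if isinstance(obj, set):
        return sorted(obj)
    if isinstance(obj, date):
        return obj.isoformat()
    raise TypeError(f"cannot serialize {type(obj).__name__}")


encoded = json.dumps({"tags": {"b", "a"}, "due": date(2024, 1, 15)}, default=encode_extra)

print(text)
print(restored)
print(tuple_roundtrip)
print(encoded)
```

The default= hook only runs for objects the encoder cannot handle on its own, so ordinary dicts, lists, and strings are unaffected.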
Section 2.3: Encodings
- A file on disk is bytes. To read it as text, Python needs to know the encoding. UTF-8 is the answer about 95% of the time; the remaining 5% is Latin-1, Windows-1252, ASCII-strict, or UTF-16.
- A UnicodeDecodeError means "the bytes you read are not valid in the encoding I tried." Fix: figure out the actual encoding (try file my.txt on macOS / Linux; try the chardet Python package; ask the file's source).
- For BINARY files (images, PDFs, executables), open in binary mode ('rb') and skip the encoding argument.
- The errors= argument controls behavior on invalid bytes: errors='strict' (default; raise), errors='replace' (substitute the replacement character U+FFFD, often displayed as a question mark in a diamond), errors='ignore' (drop). Use replace for "I want to scan this log and don't care about a few weird bytes"; use strict for "this file should be UTF-8 and I want to know if it isn't."
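The effect of each errors= setting is easiest to see on a file with known bytes. This sketch writes five non-ASCII bytes (an accented e and a coffee-cup emoji, both encoded as UTF-8) into a temporary file and reads it back several ways; the file name and the sample text are assumptions.

```python
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
path = workdir / "menu.txt"  # hypothetical file name

# "caf\u00e9 \u2615" is "cafe" with an accented e, a space, and a coffee-cup emoji.
path.write_bytes("caf\u00e9 \u2615\n".encode("utf-8"))

with open(path, encoding="utf-8") as f:
    good = f.read()

# errors='strict' is the default: invalid bytes raise UnicodeDecodeError.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace': each undecodable byte becomes U+FFFD.
with open(path, encoding="ascii", errors="replace") as f:
    replaced = f.read()

# errors='ignore': undecodable bytes are silently dropped.
with open(path, encoding="ascii", errors="ignore") as f:
    ignored = f.read()

# The wrong encoding may not fail at all: Latin-1 accepts every byte and yields mojibake.
with open(path, encoding="latin-1") as f:
    mojibake = f.read()

# ascii() keeps the output printable on any console.
print(ascii(good))
print(strict_failed)
print(ascii(replaced))
print(ascii(ignored))
print(ascii(mojibake))
```

The Latin-1 case is the dangerous one: no exception is raised, so the only symptom is garbled text further down the pipeline.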
Section 2.4: A first taste of try/except
- The full coverage is week 9; here is enough to handle a missing file:

      import json

      try:
          with open('config.json', encoding='utf-8') as f:
              data = json.load(f)
      except FileNotFoundError:
          print('config.json not found; using defaults')
          data = {}

- Catch the SPECIFIC exception class, not bare except: (which catches KeyboardInterrupt and hides bugs).
- In week 5 use try/except only for the missing-file case in your Lab 5. Week 9 unpacks the full exception model.
Labs (~90 minutes)
Lab 5: Log-File Scanner (labs/lab-5-log-scanner.md)
- Goal: build a CLI tool that reads a multi-MB log file, filters lines containing "ERROR", and reports the count + first 10 matches
- Time: ~90 minutes
- Artifact: lab-5-scanner.py + sample sample.log in ~/fnd-102/lab-5/, committed to Git
Independent practice (~4 hours)
1. Generate a test log (30 min). Write a Python script that creates a sample sample.log with 10,000 lines. Most lines are "INFO request OK"; ~5% are "ERROR something failed"; ~1% are "WARNING slow request." Use random to interleave. This is the log your Lab 5 will scan.

2. Streaming vs slurping (30 min). Generate a large text file (the loop below writes roughly 60 MB; raise the line count to reach 100 MB, or just use any large file you have):

       with open('big.txt', 'w', encoding='utf-8') as f:
           for i in range(1_000_000):
               f.write(f'line {i}: this is some filler text padded out to be longish.\n')

   Time these two approaches:

       # Approach 1: slurp the whole file into a list
       with open('big.txt', encoding='utf-8') as f:
           lines = f.readlines()
       count = sum(1 for l in lines if 'line 999' in l)

       # Approach 2: stream one line at a time
       count = 0
       with open('big.txt', encoding='utf-8') as f:
           for line in f:
               if 'line 999' in line:
                   count += 1

   The first slurps the file; the second streams. For a file this size the streaming version takes roughly the same wall-clock time but uses a small fraction of the memory (on the order of 100x less, depending on how you measure). Verify with the stdlib tracemalloc module, with /usr/bin/time -v on Linux (/usr/bin/time -l on macOS), or with htop.

3. CSV round-trip (45 min). Take your students.csv from Lab 4. Read it with csv.DictReader, mutate one field (e.g., uppercase all names), write back to students-upper.csv with csv.DictWriter. Verify the output round-trips: read the new CSV with DictReader, confirm the names are uppercase.

4. JSON round-trip (30 min). Take any JSON file you can find (or write one by hand with json.dump). Read it with json.load; modify one field; write back. Verify the indent / sort_keys options change the output formatting.

5. pathlib exploration (45 min). In a Python REPL:

       from itertools import islice
       from pathlib import Path

       home = Path.home()
       print(home)
       list(home.glob('*'))                   # all files/dirs in home
       list(islice(home.glob('**/*.py'), 5))  # first 5 .py files anywhere in home (recursive)

   glob returns a generator, so it cannot be sliced with [:5] directly; islice takes the first five results without walking the rest of the tree. Then write a script that, given a directory path on the command line (manual sys.argv parsing), walks the directory and prints the total size in bytes of all .py files. Hint: sum(p.stat().st_size for p in Path(sys.argv[1]).rglob('*.py')).

6. Encoding detective (30 min). Save a small text file with non-ASCII characters (your name with accent marks, or emoji, or kanji) in UTF-8. Read it in Python with encoding='utf-8' (works); then with encoding='ascii' (fails); then with encoding='ascii', errors='replace' (works, with U+FFFD replacement characters substituted for the undecodable bytes). This is the failure mode every text-processing program eventually hits.
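For the streaming-vs-slurping exercise, one way to see the memory gap from inside Python is the stdlib tracemalloc module. The sketch below is a scaled-down version under stated assumptions: it writes only 50,000 lines to a temporary file so it runs quickly, and count_slurp, count_stream, and peak_bytes are hypothetical helper names. tracemalloc reports Python-level allocations only, so the ratio will differ from what htop shows for the whole process.

```python
import tempfile
import tracemalloc
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
big_path = workdir / "big.txt"  # scaled-down stand-in for the exercise file

with open(big_path, "w", encoding="utf-8") as f:
    for i in range(50_000):
        f.write(f"line {i}: this is some filler text padded out to be longish.\n")


def count_slurp(path, needle):
    """Load every line into a list, then count the matches."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    return sum(1 for line in lines if needle in line)


def count_stream(path, needle):
    """Count matches while holding only one line at a time."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if needle in line:
                count += 1
    return count


def peak_bytes(func, *args):
    """Run func and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


slurp_count, slurp_peak = peak_bytes(count_slurp, big_path, "line 999")
stream_count, stream_peak = peak_bytes(count_stream, big_path, "line 999")

print("matches:", slurp_count, stream_count)
print("slurp peak bytes: ", slurp_peak)
print("stream peak bytes:", stream_peak)
```

Both functions return the same count; only the peak differs. Raising the line count toward the exercise's 1,000,000 widens the gap, because the slurping peak grows with the file while the streaming peak stays roughly flat.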
Reflection prompts (~30 minutes)
- The with block was introduced as the "right" way to open files. Have you used it in any other context yet? (Hint: any context manager, such as locks, connections, or transactions, works the same way.) Why might the language designers have added syntax just for this pattern?
- CSV and JSON both encode structured data. When would you pick CSV? When JSON? Your Lab 4 emitted JSON from CSV input; was that the right direction?
- The streaming-vs-slurping practice (item 2 above) shows a 100x memory difference for the same task. Have you ever had a program crash because it loaded too much data? What was the fix?
- pathlib.Path and os.path do the same things in slightly different shapes. Why might Python add a second way to do the same thing?
- One thing from this week you want to know more about?
Tool journal (week 5)
- open(), with: opening files safely
- f.read(), f.readline(), iteration: three ways to consume a file
- f.write(), print(..., file=f): writing
- pathlib.Path: modern path handling
- csv.DictReader, csv.DictWriter: tabular data
- json.load, json.dump: structured data
- encoding='utf-8', errors='replace': text encoding control
- try/except FileNotFoundError (first taste): handle missing files
What comes next
Week 6 introduces modules and the standard library tour. Your week 5 code probably has all the file reading inline in main(); week 6 teaches you to organize multi-function programs into modules, then refactors your Lab 5 to use argparse for the CLI interface (--input, --output, --top N, --verbose) and logging for the debug output. The same scanner becomes a real, configurable tool.