~90 minutes. Read a multi-MB log file line-by-line; filter ERROR lines; report count and first 10 matches. Practice file I/O streaming, string matching, and the with open(...) as f: idiom.
Goal: ship a Python CLI tool that scans a log file for ERROR lines and reports a summary.
Estimated time: 90 minutes
Prerequisites: Week 5 lecture (file I/O, with blocks, pathlib). Labs 1-4 complete.
Setup
mkdir -p ~/fnd-102/lab-5
cd ~/fnd-102/lab-5
You will need a sample log file. Use the script below to generate one (save as gen-log.py and run once):
import random

random.seed(42)
lines = []
for i in range(10000):
    r = random.random()
    if r < 0.05:
        # ERROR line
        kind = random.choice(['DatabaseTimeout', 'AuthFailed', 'DiskFull', 'NetworkUnreachable'])
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 ERROR request_id={i:06d} {kind}: see logs')
    elif r < 0.10:
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 WARNING request_id={i:06d} slow response')
    else:
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 INFO request_id={i:06d} request OK 200')

with open('sample.log', 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + '\n')

print(f'Wrote sample.log with {len(lines)} lines.')
Run it:
python3 gen-log.py
ls -lh sample.log
The file is roughly 590 KB (10,000 lines at about 58 bytes each), of which ~500 lines are ERROR.
Part A: Stream the file (20 min)
Write lab-5-scanner.py. The minimum behavior:
from pathlib import Path


def scan(path):
    """Yield ERROR lines from a log file at `path` (one line at a time)."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if 'ERROR' in line:
                yield line.rstrip()


def main():
    log_path = Path('sample.log')
    error_lines = list(scan(log_path))
    print(f'Found {len(error_lines)} ERROR lines.')
    print('First 10:')
    for line in error_lines[:10]:
        print(f' {line}')


if __name__ == '__main__':
    main()
Run it:
python3 lab-5-scanner.py
You should see "Found ~500 ERROR lines" and the first 10 lines printed.
Three things worth noting:
- scan is a generator. The yield keyword makes it a generator function: instead of building a list of all matches in memory, it yields one match at a time. The caller (list(scan(...))) collects them into a list. To avoid the list entirely (true streaming, with memory use independent of the number of matches), iterate the generator directly.
- Line-by-line iteration. for line in f: reads the file one line at a time. This works for files of any size; the loop itself could scan a 10 GB log without running out of memory.
- line.rstrip() strips trailing whitespace. Lines read from a file include the trailing \n. Without rstrip, print adds a second newline and your output is double-spaced.
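To see the laziness described above, here is a minimal sketch. It assumes a hypothetical scan_lines helper that takes any iterable of lines rather than a path, and uses io.StringIO as a stand-in for an open log file; neither is part of the lab's scanner.

```python
import io


def scan_lines(lines):
    """Yield stripped lines containing 'ERROR' from any iterable of lines."""
    for line in lines:
        if 'ERROR' in line:
            yield line.rstrip()


# io.StringIO stands in for an open log file so the sketch needs no file on disk.
fake_log = io.StringIO(
    '12:00:00 INFO request OK 200\n'
    '12:01:00 ERROR DiskFull: see logs\n'
    '12:02:00 ERROR AuthFailed: see logs\n'
)

matches = scan_lines(fake_log)   # calling the generator function reads nothing yet
first = next(matches)            # reads only as far as the first ERROR line

# Tally the remaining matches one at a time, without building a list.
remaining = sum(1 for _ in matches)

print(first)
print(f'{remaining} more match(es)')
```

Because nothing is collected, memory use stays flat no matter how many lines match; the trade-off is that you can only walk the matches once.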
Part B: Add a count summary (10 min)
The output should also break down the count by error kind. Looking at the sample log, ERROR lines look like:
2026-05-28T12:45:00 ERROR request_id=001234 DatabaseTimeout: see logs
Extract the kind (the word after the request_id field) and count occurrences per kind:
from collections import Counter


def main():
    log_path = Path('sample.log')
    error_lines = list(scan(log_path))
    print(f'Found {len(error_lines)} ERROR lines.')

    # break down by kind
    kinds = []
    for line in error_lines:
        # split on whitespace; the kind is the field that contains ':' before ' see logs'
        parts = line.split()
        # parts looks like: ['2026-...', 'ERROR', 'request_id=001234', 'DatabaseTimeout:', 'see', 'logs']
        kind = parts[3].rstrip(':')
        kinds.append(kind)
    counts = Counter(kinds)

    print('Breakdown by kind:')
    for kind, n in counts.most_common():
        print(f' {kind:25s} {n:5d}')

    print('First 10:')
    for line in error_lines[:10]:
        print(f' {line}')
Run again. You should see a breakdown like:
Breakdown by kind:
 DatabaseTimeout             130
 DiskFull                    127
 NetworkUnreachable          125
 AuthFailed                  118
The exact numbers depend on the seed (we set it to 42 for reproducibility).
collections.Counter is a stdlib utility that counts hashable items; most_common() returns the counts sorted by frequency. Cleaner than building the same with a plain dict.
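For comparison, here is a small sketch of the same tally done both ways. The kinds list is made-up sample data, not output from sample.log.

```python
from collections import Counter

# Made-up sample data for illustration.
kinds = ['DiskFull', 'AuthFailed', 'DiskFull', 'DatabaseTimeout', 'DiskFull', 'AuthFailed']

# Plain-dict version: each key needs a default before it can be incremented.
manual = {}
for kind in kinds:
    manual[kind] = manual.get(kind, 0) + 1

# Counter version: the counting loop is built in.
counts = Counter(kinds)

print(counts.most_common(2))   # the two most frequent kinds, highest count first
print(counts['NeverSeen'])     # a missing key reads as 0 instead of raising KeyError
```

Both produce the same totals; Counter adds most_common() and the zero default for unseen keys.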
Part C: Read the path from the command line (10 min)
Hard-coding 'sample.log' is fine for one file. Let the user pass the path:
import sys


def main():
    if len(sys.argv) != 2:
        print('Usage: lab-5-scanner.py LOGFILE', file=sys.stderr)
        sys.exit(2)
    log_path = Path(sys.argv[1])
    if not log_path.exists():
        print(f'Error: {log_path} does not exist', file=sys.stderr)
        sys.exit(1)
    # ... rest as before ...
The manual sys.argv parsing is the pre-argparse version. Week 6 introduces argparse and you will rewrite this with a real CLI parser.
Test:
python3 lab-5-scanner.py sample.log # works
python3 lab-5-scanner.py missing.log # prints error, exits 1
python3 lab-5-scanner.py # prints usage, exits 2
The exit codes follow convention: 0 success, 1 runtime error (missing file), 2 usage error (wrong arguments).
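One way to make the three exit codes easy to check is to have main return the code instead of calling sys.exit() itself. The sketch below is a variation on the lab code, not a required structure; passing argv as a parameter is an assumption made here so the function can be called directly.

```python
import sys
from pathlib import Path


def main(argv):
    """Return the process exit code rather than exiting directly."""
    if len(argv) != 2:
        print('Usage: lab-5-scanner.py LOGFILE', file=sys.stderr)
        return 2   # usage error: wrong number of arguments
    log_path = Path(argv[1])
    if not log_path.exists():
        print(f'Error: {log_path} does not exist', file=sys.stderr)
        return 1   # runtime error: missing file
    # ... scanning and reporting would go here ...
    return 0       # success


# In the real script the entry point would then be:
#     if __name__ == '__main__':
#         sys.exit(main(sys.argv))
```

With this shape you can call main(['lab-5-scanner.py']) from a Python prompt and inspect the return value without the interpreter exiting.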
Part D: Verify memory behavior (15 min, optional but recommended)
Create a much bigger log file:
# from inside ~/fnd-102/lab-5
python3 -c "
import random
random.seed(42)
with open('big.log', 'w') as f:
    for i in range(1_000_000):  # 1 million lines, ~42 MB
        r = random.random()
        if r < 0.05:
            f.write(f'2026-05-28T12:00:00 ERROR id={i:09d} Something failed\n')
        else:
            f.write(f'2026-05-28T12:00:00 INFO id={i:09d} OK\n')
"
ls -lh big.log
Run your scanner on the big log:
time python3 lab-5-scanner.py big.log
It should finish in under 10 seconds and use less than 100 MB of memory (most of that is the list of matches). On Linux you can confirm with /usr/bin/time -v python3 ... and look at "Maximum resident set size."
The discipline here: line-by-line iteration scales to arbitrary file sizes. Slurping the whole file with f.read() needs the entire file in memory at once and would likely exhaust it on a 10 GB log; the scanning loop in your tool holds only one line at a time.
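If you are not on Linux, the standard-library tracemalloc module gives a rough view of the same difference. The sketch below is illustrative only: peak_bytes, count_streaming, and count_slurping are hypothetical helpers, the demo file is a small temporary one rather than big.log, and tracemalloc counts only allocations made by Python, so the numbers will not match "Maximum resident set size."

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Run func(path) and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_streaming(path):
    """Count ERROR lines while holding one line at a time."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for line in f if 'ERROR' in line)


def count_slurping(path):
    """Count ERROR lines after loading the whole file into one string."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    return sum(1 for line in text.splitlines() if 'ERROR' in line)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.log')
    with open(path, 'w', encoding='utf-8') as f:
        for i in range(100_000):
            level = 'ERROR' if i % 20 == 0 else 'INFO'
            f.write(f'2026-05-28T12:00:00 {level} id={i:09d} demo\n')
    streamed, stream_peak = peak_bytes(count_streaming, path)
    slurped, slurp_peak = peak_bytes(count_slurping, path)

print(f'streaming: {streamed} errors, peak {stream_peak / 1024:.0f} KiB')
print(f'slurping:  {slurped} errors, peak {slurp_peak / 1024:.0f} KiB')
```

Both functions return the same count; only the peak differs, and the gap grows with the size of the file.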
Part E: Commit your work (10 min)
cd ~/fnd-102/lab-5
git add gen-log.py lab-5-scanner.py sample.log
# do NOT commit big.log; it's not part of the lab artifact
git commit -m "lab-5: log-file scanner with kind-breakdown and CLI path argument"
For the streaming-verification step:
git commit -m "lab-5: verify memory behavior on 1M-line log file" --allow-empty
(An empty commit documents a verification step that produced no file changes. Some teams discourage empty commits; FND-102 allows them for documentation purposes.)
Expected output / artifact
lab-5-scanner.py should:
- Accept a log file path as a CLI argument
- Stream the file line-by-line (NOT f.read())
- Count ERROR lines and break down by kind (using Counter or equivalent)
- Print the first 10 matches
- Exit 0 on success, 1 if the file does not exist, 2 if the wrong number of arguments was passed
Files committed: gen-log.py, lab-5-scanner.py, sample.log.
What's the failure mode?
This tool's likely failure modes:
- Non-UTF-8 log file. Real-world logs sometimes contain garbage bytes (binary data leaked into a text log, mixed encodings). Reading a file opened with open(path, encoding='utf-8') raises UnicodeDecodeError when it reaches the first bad byte. Possible fix: open(path, encoding='utf-8', errors='replace'), which substitutes U+FFFD (the Unicode replacement character) for bad bytes and keeps scanning.
- Log format does not match the assumption. Your parts[3] indexing assumes a specific structure. If the log line is shorter (e.g., a malformed line), parts[3] raises IndexError. Defensive fix (week 9): try/except around the parse, count failures separately.
- Path contains a tilde. Path('~/fnd-102/lab-5/sample.log').exists() returns False because pathlib does not expand ~; only the shell does, and only for an unquoted argument. Use Path('~/fnd-102/...').expanduser() to expand it, or pass the absolute path on the command line.
- Out of memory on list(scan(...)). If the log has millions of ERROR lines, collecting them all into a list could exhaust memory. The streaming generator does not crash; the list() call does. For very large logs, iterate and tally without collecting.
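The first, second, and fourth failure modes can be handled together. The following is a minimal sketch under the lab's log-format assumptions, not the required solution: tally_kinds is a hypothetical helper, and the messy log it reads is made-up test data written to a temporary directory.

```python
import os
import tempfile
from collections import Counter


def tally_kinds(path):
    """Return (Counter of kinds, number of ERROR lines that did not parse)."""
    counts = Counter()
    unparsed = 0
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding='utf-8', errors='replace') as f:
        for line in f:
            if 'ERROR' not in line:
                continue
            try:
                kind = line.split()[3].rstrip(':')
            except IndexError:
                unparsed += 1    # too few fields: count it and keep scanning
                continue
            counts[kind] += 1    # tally directly; no list of matches is built
    return counts, unparsed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'messy.log')
    with open(path, 'wb') as f:
        f.write(b'2026-05-28T12:00:00 ERROR request_id=000001 DiskFull: see logs\n')
        f.write(b'2026-05-28T12:01:00 ERROR request_id=000002 DiskFull: \xff\xfe junk\n')
        f.write(b'ERROR truncated\n')
        f.write(b'2026-05-28T12:02:00 INFO request_id=000003 request OK 200\n')
    counts, unparsed = tally_kinds(path)

print(counts)
print(f'{unparsed} ERROR line(s) could not be parsed')
```

Counting unparsed lines separately, rather than silently skipping them, tells you when the log format has drifted from what the parser expects.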
Common pitfalls
- for line in f: after f.read(). Once you have called f.read(), the file position is at EOF and iteration yields nothing. Use one or the other, not both.
- Forgetting with. f = open(...) followed by reads, without f.close(), leaks the file handle. Always use with.
- if 'ERROR' in line vs if line.startswith('ERROR'). The first matches ERROR anywhere; the second matches only at the start. For your log format, "anywhere" works because the level appears mid-line; for a different log format you might need startswith.
- Case sensitivity. 'ERROR' in line does not match 'error'. If your log mixes cases, normalize: if 'error' in line.lower().
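The read-then-iterate and matching pitfalls are easy to reproduce in a few lines. This sketch uses io.StringIO in place of a real file and three made-up log lines; level_is_error is a hypothetical helper that assumes the level is always the second whitespace-separated field.

```python
import io

log_text = (
    '2026-05-28T12:00:00 INFO request_id=000001 retrying after ERROR\n'
    '2026-05-28T12:01:00 ERROR request_id=000002 DiskFull: see logs\n'
    '2026-05-28T12:02:00 error request_id=000003 AuthFailed: see logs\n'
)

# After read(), the position is at EOF, so iterating yields nothing.
f = io.StringIO(log_text)
f.read()
lines_after_read = list(f)

# seek(0) rewinds if you genuinely need a second pass.
f.seek(0)
lines = [line.rstrip() for line in f]

# Substring match: also catches the INFO line that mentions ERROR in its message,
# and misses the lowercase 'error' line.
substring_hits = [line for line in lines if 'ERROR' in line]


def level_is_error(line):
    """Compare the level field itself, ignoring case."""
    parts = line.split()
    return len(parts) > 1 and parts[1].upper() == 'ERROR'


field_hits = [line for line in lines if level_is_error(line)]

print(len(lines_after_read), len(substring_hits), len(field_hits))
```

Both matchers find two lines here, but not the same two: the substring test picks up the INFO line and drops the lowercase one, while the field test does the reverse.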
Stretch (optional)
- Add --top N (manual sys.argv parsing). Show the top N error kinds instead of all of them.
- Stream the matches to an output file. Instead of printing, write to sample.errors.log and report the count to stdout.
- Parse the timestamp. Use datetime.strptime(parts[0], '%Y-%m-%dT%H:%M:%S') and report the timespan of the ERROR lines (first error to last error).
- Iterate without list(). Refactor so the scanner streams matches to print or write without ever building the full list in memory. Useful when the log has millions of errors.
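As a starting point for the timestamp item, here is a hedged sketch of the strptime call on three made-up ERROR lines. timestamp_of is a hypothetical helper. Note that gen-log.py wraps the minute field every 60 lines, so file order is not chronological in sample.log; the sketch therefore takes the earliest and latest timestamps rather than the first and last lines.

```python
from datetime import datetime

# Made-up ERROR lines in the lab's format, deliberately out of time order.
error_lines = [
    '2026-05-28T12:03:00 ERROR request_id=000003 DiskFull: see logs',
    '2026-05-28T12:45:00 ERROR request_id=000045 AuthFailed: see logs',
    '2026-05-28T12:17:00 ERROR request_id=000077 DiskFull: see logs',
]


def timestamp_of(line):
    """Parse the first whitespace-separated field as a timestamp."""
    return datetime.strptime(line.split()[0], '%Y-%m-%dT%H:%M:%S')


stamps = [timestamp_of(line) for line in error_lines]
span = max(stamps) - min(stamps)   # a datetime.timedelta

print(f'Errors span {span}')
```

Subtracting two datetime objects gives a timedelta, which you can print directly or convert with total_seconds().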
Lab 5 v0.1.