Lab 5: Log-File Scanner

~90 minutes. Read a multi-MB log file line-by-line; filter ERROR lines; report count and first 10 matches. Practice file I/O streaming, string matching, and the with open(...) as f: idiom.


Goal: ship a Python CLI tool that scans a log file for ERROR lines and reports a summary.

Estimated time: 90 minutes

Prerequisites: Week 5 lecture (file I/O, with blocks, pathlib). Labs 1-4 complete.


Setup

mkdir -p ~/fnd-102/lab-5
cd ~/fnd-102/lab-5

You will need a sample log file. Use the script below to generate one (save as gen-log.py and run once):

import random
random.seed(42)

lines = []
for i in range(10000):
    r = random.random()
    if r < 0.05:
        # ERROR line
        kind = random.choice(['DatabaseTimeout', 'AuthFailed', 'DiskFull', 'NetworkUnreachable'])
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 ERROR request_id={i:06d} {kind}: see logs')
    elif r < 0.10:
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 WARNING request_id={i:06d} slow response')
    else:
        lines.append(f'2026-05-28T12:{(i % 60):02d}:00 INFO request_id={i:06d} request OK 200')

with open('sample.log', 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + '\n')

print(f'Wrote sample.log with {len(lines)} lines.')

Run it:

python3 gen-log.py
ls -lh sample.log

The file is just under 600 KB with 10,000 lines, of which ~500 (about 5%) are ERROR.


Part A: Stream the file (20 min)

Write lab-5-scanner.py. The minimum behavior:

from pathlib import Path

def scan(path):
    """Yield ERROR lines from a log file at `path` (one line at a time)."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if 'ERROR' in line:
                yield line.rstrip()

def main():
    log_path = Path('sample.log')
    error_lines = list(scan(log_path))
    print(f'Found {len(error_lines)} ERROR lines.')
    print('First 10:')
    for line in error_lines[:10]:
        print(f'  {line}')

if __name__ == '__main__':
    main()

Run it:

python3 lab-5-scanner.py

You should see "Found ~500 ERROR lines" and the first 10 lines printed.

Three things worth noting:

  1. scan is a generator. The yield keyword makes it a generator function: instead of building a list of all matches in memory, it yields one match at a time. The caller (list(scan(...))) collects them into a list. To avoid the list entirely (true streaming, where memory use stays flat no matter how many lines match), iterate over the generator directly.
  2. Line-by-line iteration. for line in f: reads the file one line at a time, so the loop holds only the current line in memory. This works for files of any size; reading a 10 GB log this way never loads the whole file, and the only thing that grows is whatever you choose to collect from it.
  3. line.rstrip() strips trailing whitespace. File lines include the trailing \n. Without rstrip, your print output has double-spaced lines.
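
To see the laziness described in point 1, here is a minimal sketch that consumes scan without calling list() on all of it. The tiny.log file and its contents are made up for the demonstration; scan is the same function as in lab-5-scanner.py.

```python
import itertools
import tempfile
from pathlib import Path

def scan(path):
    """Yield ERROR lines from a log file at `path` (one line at a time)."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if 'ERROR' in line:
                yield line.rstrip()

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical 10-line log: odd-numbered lines are ERROR, even ones INFO.
    log = Path(tmp) / 'tiny.log'
    log.write_text(
        ''.join(f'line {i} {"ERROR" if i % 2 else "INFO"}\n' for i in range(10)),
        encoding='utf-8',
    )

    matches = scan(log)                              # nothing has been read yet
    first_two = list(itertools.islice(matches, 2))   # reads only as far as needed
    matches.close()                                  # stop early; the with block closes the file

    # Counting without keeping the lines: one match in memory at a time.
    total = sum(1 for _ in scan(log))

print(first_two)
print(total)
```

itertools.islice stops pulling from the generator after two matches, so the rest of the file is never read. The sum(...) line shows that a count does not require a list at all.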

Part B: Add a count summary (10 min)

The output should also break down the count by error kind. Looking at the sample log, ERROR lines look like:

2026-05-28T12:45:00 ERROR request_id=001234 DatabaseTimeout: see logs

Extract the kind (the word after the request_id field), count occurrences per kind:

from collections import Counter

def main():
    log_path = Path('sample.log')
    error_lines = list(scan(log_path))
    print(f'Found {len(error_lines)} ERROR lines.')

    # break down by kind
    kinds = []
    for line in error_lines:
        # split on whitespace; the kind is the field that contains ':' before ' see logs'
        parts = line.split()
        # parts looks like: ['2026-...', 'ERROR', 'request_id=001234', 'DatabaseTimeout:', 'see', 'logs']
        kind = parts[3].rstrip(':')
        kinds.append(kind)

    counts = Counter(kinds)
    print('Breakdown by kind:')
    for kind, n in counts.most_common():
        print(f'  {kind:25s} {n:5d}')

    print('First 10:')
    for line in error_lines[:10]:
        print(f'  {line}')

Run again. You should see a breakdown like:

Breakdown by kind:
  DatabaseTimeout             130
  DiskFull                    127
  NetworkUnreachable          125
  AuthFailed                  118

Your numbers may not match these exactly, but because gen-log.py fixes the seed at 42, they will be the same every time you regenerate sample.log and rerun the scanner.

collections.Counter is a stdlib utility that counts hashable items; most_common() returns the counts sorted by frequency. Cleaner than building the same with a plain dict.
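
For comparison, a short sketch of the same tally done both ways. The kinds list here is invented sample data, not output from sample.log.

```python
from collections import Counter

# Hypothetical sample data standing in for the kinds extracted in main().
kinds = ['DiskFull', 'AuthFailed', 'DiskFull', 'DatabaseTimeout', 'DiskFull', 'AuthFailed']

# Plain dict: count by hand, then sort by count (highest first).
manual = {}
for kind in kinds:
    manual[kind] = manual.get(kind, 0) + 1
manual_sorted = sorted(manual.items(), key=lambda item: item[1], reverse=True)

# Counter: the counting and the sorting are built in.
counts = Counter(kinds)

print(manual_sorted)
print(counts.most_common())
```

Both prints show [('DiskFull', 3), ('AuthFailed', 2), ('DatabaseTimeout', 1)]; Counter simply gets there with less code.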


Part C: Read the path from the command line (10 min)

Hard-coding 'sample.log' is fine for one file. Let the user pass the path:

import sys

def main():
    if len(sys.argv) != 2:
        print('Usage: lab-5-scanner.py LOGFILE', file=sys.stderr)
        sys.exit(2)
    log_path = Path(sys.argv[1])
    if not log_path.exists():
        print(f'Error: {log_path} does not exist', file=sys.stderr)
        sys.exit(1)
    # ... rest as before ...

The manual sys.argv parsing is the pre-argparse version. Week 6 introduces argparse and you will rewrite this with a real CLI parser.

Test:

python3 lab-5-scanner.py sample.log     # works
python3 lab-5-scanner.py missing.log    # prints error, exits 1
python3 lab-5-scanner.py                # prints usage, exits 2

The exit codes follow convention: 0 success, 1 runtime error (missing file), 2 usage error (wrong arguments).
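
If you would like to check the three exit codes from Python rather than by eye, one possible sketch uses subprocess. The scanner_stub.py script below is a stand-in that contains only the argument handling from Part C (it is not your full scanner), and the exit_code helper is a name invented for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in script: only the argument handling from Part C, nothing else.
SCRIPT = '''
import sys
from pathlib import Path

def main():
    if len(sys.argv) != 2:
        print('Usage: lab-5-scanner.py LOGFILE', file=sys.stderr)
        sys.exit(2)
    log_path = Path(sys.argv[1])
    if not log_path.exists():
        print(f'Error: {log_path} does not exist', file=sys.stderr)
        sys.exit(1)

if __name__ == '__main__':
    main()
'''

def exit_code(script, *args):
    """Run `script` with `args` in a child interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, str(script), *args],
        capture_output=True,
        text=True,
    )
    return result.returncode

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'scanner_stub.py'
    script.write_text(SCRIPT, encoding='utf-8')
    log = Path(tmp) / 'sample.log'
    log.write_text('2026-05-28T12:00:00 INFO request_id=000000 request OK 200\n',
                   encoding='utf-8')

    codes = {
        'ok': exit_code(script, str(log)),                             # file exists
        'missing': exit_code(script, str(Path(tmp) / 'missing.log')),  # no such file
        'usage': exit_code(script),                                    # no argument
    }

print(codes)
```

In a shell, the same check is to run the command and then print $? immediately afterwards.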


Part D: Verify memory behavior (15 min, optional but recommended)

Create a much bigger log file:

# from inside ~/fnd-102/lab-5
python3 -c "
import random
random.seed(42)
with open('big.log', 'w', encoding='utf-8') as f:
    for i in range(1_000_000):  # 1 million lines, roughly 40 MB
        r = random.random()
        if r < 0.05:
            f.write(f'2026-05-28T12:00:00 ERROR id={i:09d} Something failed\n')
        else:
            f.write(f'2026-05-28T12:00:00 INFO id={i:09d} OK\n')
"
ls -lh big.log

Run your scanner on the big log:

time python3 lab-5-scanner.py big.log

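It should finish in under 10 seconds and use less than 100 MB of memory (the part that grows with the input is the list of roughly 50,000 matches). The kind breakdown will show a single kind, Something, because big.log uses a simpler line format than sample.log. On Linux you can confirm memory use with /usr/bin/time -v python3 ... and look at "Maximum resident set size"; on macOS the equivalent flag is /usr/bin/time -l.

The discipline here: line-by-line iteration scales to arbitrary file sizes. Slurping the whole file with f.read() needs memory proportional to the file size and can exhaust RAM on a 10 GB log; your scanner only ever holds one line plus the matches it collects.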
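
If /usr/bin/time is not available, the standard-library tracemalloc module gives a rough in-process comparison. The sketch below is an illustration under assumptions: the 100,000-line demo.log, the count_streaming and count_slurping functions, and the peak_bytes helper are all invented for this example, and tracemalloc only sees allocations made by Python itself.

```python
import tempfile
import tracemalloc
from pathlib import Path

def count_streaming(path):
    """Count ERROR lines while holding one line at a time."""
    count = 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            if 'ERROR' in line:
                count += 1
    return count

def count_slurping(path):
    """Count ERROR lines after loading the whole file into one string."""
    with open(path, encoding='utf-8') as f:
        text = f.read()
    return sum(1 for line in text.splitlines() if 'ERROR' in line)

def peak_bytes(func, path):
    """Return (result, peak traced memory in bytes) for one call of func(path)."""
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / 'demo.log'
    with open(log, 'w', encoding='utf-8') as f:
        for i in range(100_000):
            level = 'ERROR' if i % 20 == 0 else 'INFO'
            f.write(f'2026-05-28T12:00:00 {level} id={i:09d} message\n')

    stream_count, stream_peak = peak_bytes(count_streaming, log)
    slurp_count, slurp_peak = peak_bytes(count_slurping, log)

print(f'streaming: {stream_count} errors, peak {stream_peak / 1024:.0f} KiB')
print(f'slurping:  {slurp_count} errors, peak {slurp_peak / 1024:.0f} KiB')
```

Both approaches find the same 5,000 ERROR lines. The streaming peak stays small regardless of file size, while the slurping peak grows with the file.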


Part E: Commit your work (10 min)

cd ~/fnd-102/lab-5
git add gen-log.py lab-5-scanner.py sample.log
# do NOT commit big.log; it's not part of the lab artifact
git commit -m "lab-5: log-file scanner with kind-breakdown and CLI path argument"

For the streaming-verification step:

git commit -m "lab-5: verify memory behavior on 1M-line log file" --allow-empty

(An empty commit documents a verification step that produced no file changes. Some teams discourage empty commits; FND-102 allows them for documentation purposes.)


Expected output / artifact

lab-5-scanner.py should:

  • Accept a log file path as a CLI argument
  • Stream the file line-by-line (NOT f.read())
  • Count ERROR lines and break down by kind (using Counter or equivalent)
  • Print the first 10 matches
  • Exit 0 on success, 1 if the file does not exist, 2 if the wrong number of arguments was passed

Files committed: gen-log.py, lab-5-scanner.py, sample.log.


What's the failure mode?

This tool's likely failure modes:

  1. Non-UTF-8 log file. Real-world logs sometimes contain garbage bytes (binary data leaked into a text log, mixed encodings). open(path, encoding='utf-8') raises UnicodeDecodeError on the first bad byte. Forward-stretch fix: open(path, encoding='utf-8', errors='replace') to substitute ? for bad bytes and keep scanning.
  2. Log format does not match the assumption. Your parts[3] indexing assumes a specific structure. If the log line is shorter (e.g., a malformed line), parts[3] raises IndexError. Defensive fix (week 9): try/except around the parse, count failures separately.
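  3. Path contains a tilde. Path('~/fnd-102/lab-5/sample.log').exists() is False because pathlib does not expand ~ on its own. The shell normally expands an unquoted ~ on the command line, but a quoted argument or a hard-coded string reaches Python unexpanded. Use Path('~/fnd-102/...').expanduser() to expand it, or pass the absolute path on the command line.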
  4. Out of memory on list(scan(...)). If the log has millions of ERROR lines, collecting them all into a list could exhaust memory. The streaming generator does not crash; the list() call does. For very large logs, iterate and tally without collecting.
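
Failure modes 1 and 2 can be reproduced in a few lines. This is a sketch under assumptions: dirty.log and its three lines are fabricated for the demonstration, and parse_kind is a hypothetical helper, not part of the required lab code.

```python
import tempfile
from pathlib import Path

def parse_kind(line):
    """Return the error kind from an ERROR line, or None if the line is too short."""
    parts = line.split()
    try:
        return parts[3].rstrip(':')
    except IndexError:
        return None

with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / 'dirty.log'
    log.write_bytes(
        b'2026-05-28T12:00:00 ERROR request_id=000001 DiskFull: see logs\n'
        b'2026-05-28T12:01:00 ERROR \xff\xfe\n'   # bytes that are not valid UTF-8
        b'2026-05-28T12:02:00 ERROR\n'            # malformed: too few fields
    )

    # Strict decoding stops at the first bad byte.
    try:
        log.read_text(encoding='utf-8')
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # errors='replace' swaps each bad byte for U+FFFD and keeps going.
    with open(log, encoding='utf-8', errors='replace') as f:
        lines = [line.rstrip() for line in f]

kinds = [parse_kind(line) for line in lines]
unparsed = kinds.count(None)

print(strict_failed)
print(kinds)
print(f'{unparsed} line(s) could not be parsed')
```

Counting the None results separately, instead of silently dropping them, keeps malformed lines visible in the summary.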

Common pitfalls

  • for line in f: after f.read(). Once you have called f.read(), the file position is at EOF and iteration yields nothing. Use one or the other, not both.
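  • Forgetting with. f = open(...) followed by reads, without f.close(), leaves the file handle open until the interpreter happens to reclaim it. A with block closes the file even if an exception is raised partway through. Always use with.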
  • if 'ERROR' in line vs if line.startswith('ERROR'). The first matches ERROR anywhere; the second matches only at the start. For your log format, "anywhere" works because the level appears mid-line; for a different log format you might need startswith.
  • Case sensitivity. 'ERROR' in line does not match 'error'. If your log mixes cases, normalize: if 'error' in line.lower().
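
A small sketch makes the first, third, and fourth pitfalls concrete. It uses io.StringIO as a stand-in for an open text file (an assumption made for brevity; a real file object behaves the same way for these operations), and the three log lines are invented.

```python
import io

# In-memory stand-in for an open log file.
log = io.StringIO('a INFO ok\nb ERROR boom\nc error lowercase\n')

whole = log.read()                     # position is now at EOF
leftover = [line for line in log]      # iteration after read() yields nothing

log.seek(0)                            # rewind to read the lines again
lines = [line.rstrip() for line in log]

exact = [l for l in lines if 'ERROR' in l]             # case-sensitive, anywhere in line
any_case = [l for l in lines if 'error' in l.lower()]  # case-insensitive
at_start = [l for l in lines if l.startswith('ERROR')] # only at the start of the line

print(leftover)
print(exact)
print(any_case)
print(at_start)
```

Here leftover and at_start are both empty, exact contains one line, and any_case contains two.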

Stretch (optional)

  1. Add --top N (manual sys.argv parsing). Show the top N error kinds instead of all of them.
  2. Stream the matches to an output file. Instead of printing, write to sample.errors.log and report the count to stdout.
  3. Parse the timestamp. Use datetime.strptime(parts[0], '%Y-%m-%dT%H:%M:%S') and report the timespan of the ERROR lines (first error to last error).
  4. Iterate without list(). Refactor so the scanner streams matches to print or write without ever building the full list in memory. Useful when the log has millions of errors.

Lab 5 v0.1.