Lab 6: Argparse + Logging Refactor

~90 minutes. Take Lab 5's scanner and rewrite it with a real argparse CLI and a logging setup. The behavior is the same; the interface becomes professional.


Goal: ship lab-6-scanner.py with a proper argparse-based CLI (--top, --output, --verbose, --debug) and logging-based diagnostics (DEBUG and above always to a file; WARNING and above to stderr by default, INFO with --verbose, DEBUG with --debug).

Estimated time: 90 minutes

Prerequisites: Week 6 lecture (argparse, logging, modules). Lab 5 complete and working.


Setup

mkdir -p ~/fnd-102/lab-6
cd ~/fnd-102/lab-6
cp ../lab-5/lab-5-scanner.py lab-6-scanner.py
cp ../lab-5/sample.log .   # so the lab is self-contained

Open lab-6-scanner.py.


Part A: Build the argparse CLI (25 min)

Replace the manual sys.argv parsing from Lab 5 with argparse:

import argparse
from pathlib import Path

def build_parser():
    """Return an ArgumentParser for the log-scanner CLI."""
    parser = argparse.ArgumentParser(
        description='Scan a log file for ERROR lines and report counts.',
        epilog='Example: %(prog)s sample.log --top 20 --verbose'
    )
    parser.add_argument(
        'input',
        type=Path,
        help='path to the log file to scan'
    )
    parser.add_argument(
        '--top',
        type=int,
        default=10,
        help='show the top N matches (default: %(default)s)'
    )
    parser.add_argument(
        '--output',
        type=Path,
        default=None,
        help='write matching lines to this file (default: print to stdout)'
    )
    parser.add_argument(
        '--verbose', '-v',
        action='store_true',
        help='enable INFO-level logging to stderr'
    )
    parser.add_argument(
        '--debug',
        action='store_true',
        help='enable DEBUG-level logging to stderr (very chatty)'
    )
    return parser

Notice:

  • type=Path parses the string into a pathlib.Path automatically.
  • %(default)s in the help text is an argparse template that substitutes the argument's actual default value when --help is rendered.
  • %(prog)s in the epilog is replaced with the program's name.
  • Both --verbose and --debug are flags; both could be passed but --debug is more verbose. We will give --debug precedence in the logging setup.
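
These points can be checked without touching the command line, because parse_args accepts an explicit list of strings. The following is a minimal sketch, not the lab solution: demo_parser is a hypothetical cut-down copy of build_parser that reproduces only the input, --top, and --verbose arguments.

```python
import argparse
from pathlib import Path

# Hypothetical cut-down copy of build_parser(), for experimenting in isolation.
demo_parser = argparse.ArgumentParser(prog='lab-6-scanner.py')
demo_parser.add_argument('input', type=Path)
demo_parser.add_argument('--top', type=int, default=10,
                         help='show the top N matches (default: %(default)s)')
demo_parser.add_argument('--verbose', '-v', action='store_true')

# parse_args() takes an explicit argv list, which is handy for quick checks.
args = demo_parser.parse_args(['sample.log', '--top', '20', '-v'])
print(isinstance(args.input, Path), args.top, args.verbose)

# With the optional flags omitted, the declared defaults apply.
defaults = demo_parser.parse_args(['sample.log'])
print(defaults.top, defaults.verbose)

# format_help() returns the --help text with %(default)s already substituted.
help_text = demo_parser.format_help()
print(help_text)
```

The same trick (passing a list to parse_args) is a convenient way to try each flag combination before wiring the parser into main.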

Run python3 lab-6-scanner.py --help and read the output. If any help string is field-name shaped instead of sentence shaped, rewrite it.


Part B: Set up logging (20 min)

Add the logging configuration:

import logging
import sys

log = logging.getLogger('scanner')

def configure_logging(verbose: bool, debug: bool):
    """Configure logging. DEBUG always goes to scanner.debug.log; stderr level depends on flags."""
    log.setLevel(logging.DEBUG)

    # File handler: always DEBUG-level
    file_handler = logging.FileHandler('scanner.debug.log', mode='w', encoding='utf-8')
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    log.addHandler(file_handler)

    # Stderr handler: WARNING by default; INFO with --verbose; DEBUG with --debug
    stream_handler = logging.StreamHandler(sys.stderr)
    if debug:
        stream_handler.setLevel(logging.DEBUG)
    elif verbose:
        stream_handler.setLevel(logging.INFO)
    else:
        stream_handler.setLevel(logging.WARNING)
    stream_handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
    log.addHandler(stream_handler)

Three things to notice:

  1. The logger name is 'scanner', not __name__. For a single-file script either works; the __name__ convention pays off when the script is split into modules (each module gets its own named logger, and you can tune levels per module).
  2. File logging is unconditional. scanner.debug.log is always written with the full DEBUG trace; the stderr level decides what the USER sees. Operations engineers running this in production rely on the debug log being there when something goes wrong.
  3. mode='w' truncates the debug log on each run. mode='a' would append; for a CLI tool a fresh log per invocation is usually right.
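
How the two handler levels interact can be seen in isolation. The sketch below is not the lab's configure_logging: it swaps both destinations for in-memory io.StringIO buffers and uses a hypothetical logger name, 'scanner.demo', so nothing is written to disk or to stderr.

```python
import io
import logging

# Hypothetical logger name, so this demo does not touch the real 'scanner' logger.
demo_log = logging.getLogger('scanner.demo')
demo_log.setLevel(logging.DEBUG)   # the logger itself lets everything through
demo_log.propagate = False         # keep records away from any parent handlers

# Stand-in for the file handler: accepts DEBUG and above.
file_like = io.StringIO()
file_side = logging.StreamHandler(file_like)
file_side.setLevel(logging.DEBUG)
demo_log.addHandler(file_side)

# Stand-in for the stderr handler when no flags are passed: WARNING and above.
stderr_like = io.StringIO()
stderr_side = logging.StreamHandler(stderr_like)
stderr_side.setLevel(logging.WARNING)
demo_log.addHandler(stderr_side)

demo_log.debug('opening sample.log')
demo_log.info('scanning sample.log')
demo_log.warning('something looks off')

file_lines = file_like.getvalue().splitlines()
stderr_lines = stderr_like.getvalue().splitlines()
print(len(file_lines), len(stderr_lines))   # prints: 3 1
```

The logger's own level acts as a first gate and each handler's level as a second one, which is why the logger is set to DEBUG and the filtering is left to the handlers.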

Part C: Wire the scanner into the new shell (25 min)

The scan generator from Lab 5 keeps its logic; the only additions are the log.debug calls. Replace the old main with one that uses argparse + logging:

from collections import Counter

def scan(path: Path):
    """Yield ERROR lines from a log file."""
    log.debug(f'opening {path}')
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f, start=1):
            if i % 10000 == 0:
                log.debug(f'scanned {i} lines so far')
            if 'ERROR' in line:
                yield line.rstrip()

def main():
    args = build_parser().parse_args()
    configure_logging(args.verbose, args.debug)

    if not args.input.exists():
        log.error(f'input file does not exist: {args.input}')
        sys.exit(1)

    log.info(f'scanning {args.input}')
    matches = list(scan(args.input))
    log.info(f'found {len(matches)} ERROR lines')

    # breakdown by kind
    kinds = []
    for line in matches:
        parts = line.split()
        if len(parts) > 3:
            kinds.append(parts[3].rstrip(':'))
    counts = Counter(kinds)

    # write or print matches
    if args.output:
        with open(args.output, 'w', encoding='utf-8') as f:
            for line in matches[:args.top]:
                f.write(line + '\n')
        log.info(f'wrote {min(args.top, len(matches))} matches to {args.output}')
    else:
        print(f'Found {len(matches)} ERROR lines.')
        print('Breakdown by kind:')
        for kind, n in counts.most_common():
            print(f'  {kind:25s} {n:5d}')
        print(f'First {args.top}:')
        for line in matches[:args.top]:
            print(f'  {line}')

    sys.exit(0)

if __name__ == '__main__':
    main()

Test the four call paths:

# Default (no flags): no INFO output, just the human result
python3 lab-6-scanner.py sample.log

# Verbose: INFO output to stderr; same result to stdout
python3 lab-6-scanner.py sample.log --verbose

# Debug: chatty DEBUG output to stderr
python3 lab-6-scanner.py sample.log --debug

# Output to file: stdout silent; matches in the file
python3 lab-6-scanner.py sample.log --output matches.txt
cat matches.txt

Check scanner.debug.log: it should always have the full DEBUG trace regardless of the stderr level.
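
The breakdown step in main assumes that the fourth whitespace-separated field of each ERROR line names the error kind. The sketch below runs that same logic on invented lines; the timestamps, kinds, and messages are made up for illustration, and the real layout of sample.log may differ.

```python
from collections import Counter

# Invented lines for illustration; the real sample.log layout may differ.
sample_matches = [
    '2024-01-01 12:00:00 ERROR DiskFull: /var is at 100%',
    '2024-01-01 12:00:05 ERROR Timeout: upstream took 30s',
    '2024-01-01 12:00:09 ERROR DiskFull: /tmp is at 100%',
    'ERROR truncated',   # fewer than four fields, so it is skipped
]

kinds = []
for line in sample_matches:
    parts = line.split()
    if len(parts) > 3:                       # need at least four fields
        kinds.append(parts[3].rstrip(':'))   # 'DiskFull:' -> 'DiskFull'

counts = Counter(kinds)
for kind, n in counts.most_common():
    print(f'  {kind:25s} {n:5d}')
```

Lines with fewer than four fields are dropped from the breakdown without any message, so the per-kind counts can add up to less than the total number of ERROR lines.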


Part D: Check --help is good (10 min)

Run:

python3 lab-6-scanner.py --help

The output should be readable. A good --help looks like:

usage: lab-6-scanner.py [-h] [--top TOP] [--output OUTPUT] [--verbose] [--debug] input

Scan a log file for ERROR lines and report counts.

positional arguments:
  input                 path to the log file to scan

options:
  -h, --help            show this help message and exit
  --top TOP             show the top N matches (default: 10)
  --output OUTPUT       write matching lines to this file (default: print to stdout)
  --verbose, -v         enable INFO-level logging to stderr
  --debug               enable DEBUG-level logging to stderr (very chatty)

Example: lab-6-scanner.py sample.log --top 20 --verbose

If any line reads like a database field name rather than a sentence, rewrite it. The --help is your tool's documentation; treat it with the same care as a README.


Part E: Commit your work (10 min)

cd ~/fnd-102/lab-6
git add lab-6-scanner.py sample.log
# scanner.debug.log is an output artifact; do not commit
echo 'scanner.debug.log' >> ~/fnd-102/.gitignore
git add ~/fnd-102/.gitignore
git commit -m "lab-6: refactor scanner with argparse CLI and logging (file DEBUG + stderr level by flag)"

A .gitignore keeps generated artifacts out of the repo. It holds one pattern per line, and patterns can use globs (*.log).


Expected output / artifact

lab-6-scanner.py should:

  • Use argparse.ArgumentParser with at least five arguments: positional input, plus optional --top, --output, --verbose, and --debug
  • Use logging.getLogger with two handlers (file + stderr)
  • File handler always DEBUG; stderr handler WARNING / INFO / DEBUG based on flags
  • Produce identical results to Lab 5 on the same input
  • --help reads like documentation, not field names

Files committed: lab-6-scanner.py, sample.log, .gitignore.


What's the failure mode?

This tool's likely failure modes:

  1. scanner.debug.log not writable. If you run the scanner in a read-only directory, the FileHandler raises PermissionError at config time. The user sees a traceback before any work happens. Defensive fix: try the file handler; on failure, log a warning and continue with stderr only.
  2. --output path's directory does not exist. open('subdir/out.txt', 'w') fails if subdir/ does not exist. Defensive fix: args.output.parent.mkdir(parents=True, exist_ok=True) before opening.
  3. The scanner crashes mid-stream. If a single log line has unexpected encoding, the for line in f: loop raises UnicodeDecodeError and you lose all matches found so far. Fix (week 9): try/except around the read.
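  4. --top of 0. Argparse accepts --top 0; matches[:0] is empty, so the user sees "First 0:" followed by nothing. Not a crash, but confusing. Forward-stretch: validate the value manually or with a custom type= function. choices=range(1, 1001) also works, but argparse then lists all 1000 allowed values in its error message (and in --help unless you also set metavar).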
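
The defensive fixes for failure modes 1, 2, and 4 could be sketched as follows. The helper names positive_int, try_file_handler, and open_output are hypothetical, and a missing directory stands in for the read-only directory because it is easier to reproduce; both cases raise a subclass of OSError.

```python
import argparse
import logging
import tempfile
from pathlib import Path

def positive_int(text: str) -> int:
    """argparse type: accept only integers >= 1 (hypothetical helper)."""
    value = int(text)   # a ValueError here is reported by argparse as invalid
    if value < 1:
        raise argparse.ArgumentTypeError(f'must be at least 1, got {value}')
    return value

def try_file_handler(path: Path):
    """Return a DEBUG FileHandler, or None if the file cannot be opened."""
    try:
        handler = logging.FileHandler(path, mode='w', encoding='utf-8')
    except OSError:     # PermissionError and FileNotFoundError are subclasses
        return None
    handler.setLevel(logging.DEBUG)
    return handler

def open_output(path: Path):
    """Create missing parent directories, then open the file for writing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return open(path, 'w', encoding='utf-8')

with tempfile.TemporaryDirectory() as tmp:
    # Failure mode 1: the log file cannot be opened, so no handler is returned.
    missing = try_file_handler(Path(tmp) / 'no-such-dir' / 'scanner.debug.log')
    print(missing is None)

    # Failure mode 2: parent directories are created on demand.
    target = Path(tmp) / 'subdir' / 'out.txt'
    with open_output(target) as f:
        f.write('ERROR example\n')
    created = target.exists()
    print(created)

# Failure mode 4: values below 1 are rejected at parse time.
demo = argparse.ArgumentParser()
demo.add_argument('--top', type=positive_int, default=10)
top_value = demo.parse_args(['--top', '5']).top
print(top_value)
```

When try_file_handler returns None, the caller would attach only the stderr handler and log a warning through it. Passing --top 0 to this parser makes argparse print a usage error and exit with status 2.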

Common pitfalls

  • log.info(f'count: {n}') vs log.info('count: %d', n). The first formats eagerly (always); the second formats lazily (only if the record is actually emitted). Tight loops should always use the lazy form. The lecture mentioned this; reinforce it by running a million-iteration loop with each style and comparing wall time.
  • Forgetting if __name__ == '__main__':. Without it, importing the script runs the scanner, and the tests in week 13 will fail loudly. A module name containing hyphens cannot be imported with a plain import statement anyway; rename the file to lab_6_scanner.py if you plan to test it.
  • --verbose AND --debug passed together. Your config gives --debug precedence (good). Document this in the help string if it matters.
  • Argparse default of False for flags. action='store_true' defaults to False if the flag is absent. action='store_false' defaults to True. Pick the one that matches the natural off-state.
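
The eager-versus-lazy difference from the first pitfall can be shown without a stopwatch. In this sketch, Costly is a hypothetical class that counts how often it is converted to text, and 'scanner.lazy_demo' is a hypothetical logger name with INFO disabled.

```python
import logging

class Costly:
    """Hypothetical value that counts how often it is turned into text."""
    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return 'costly value'

demo_log = logging.getLogger('scanner.lazy_demo')   # hypothetical name
demo_log.setLevel(logging.WARNING)                  # INFO is disabled
demo_log.propagate = False
demo_log.addHandler(logging.NullHandler())

eager, lazy = Costly(), Costly()
for _ in range(1000):
    demo_log.info(f'value: {eager}')   # the f-string is built before info() runs
    demo_log.info('value: %s', lazy)   # formatting is deferred; never happens here

print(eager.renders, lazy.renders)     # prints: 1000 0
```

In this lab's own configuration the logger level is DEBUG and the file handler accepts everything, so every record is formatted for the file regardless; the saving appears only when a level is actually disabled.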

Stretch (optional)

  1. Add --threshold N that only reports kinds with ≥ N occurrences. This is practice in filtering on a threshold supplied through argparse.
  2. Add --format with choices text, json, csv for the output format. Each choice changes how matches and counts are serialized.
  3. Split the file into a package. scanner/__init__.py, scanner/cli.py (argparse), scanner/core.py (scan generator), scanner/__main__.py (entry). Run with python3 -m scanner. This is the conventional shape for a real Python CLI tool.
  4. Add a config-file argument (--config config.json) that pre-loads defaults for --top and --threshold. CLI overrides config; config overrides argparse defaults.
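
For stretch item 4, one possible way to get the precedence right is a two-pass parse combined with ArgumentParser.set_defaults. This is a sketch under assumptions, not the expected solution: parse_with_config is a hypothetical helper, the --threshold default of 1 is invented, and the config file is assumed to be a flat JSON object whose keys match the argument names.

```python
import argparse
import json
import tempfile
from pathlib import Path

def parse_with_config(argv):
    """Parse argv so that CLI > config file > argparse defaults."""
    # Pass 1: look only for --config, ignoring everything else.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument('--config', type=Path, default=None)
    known, _ = pre.parse_known_args(argv)

    # Pass 2: the full parser (only two of the lab's options are shown).
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument('--top', type=int, default=10)
    parser.add_argument('--threshold', type=int, default=1)  # assumed default
    if known.config is not None:
        with open(known.config, encoding='utf-8') as f:
            # set_defaults() replaces the defaults given to add_argument()
            parser.set_defaults(**json.load(f))
    return parser.parse_args(argv)

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / 'config.json'
    config_path.write_text(json.dumps({'top': 3, 'threshold': 5}), encoding='utf-8')

    from_config = parse_with_config(['--config', str(config_path)])
    cli_wins = parse_with_config(['--config', str(config_path), '--top', '7'])

no_config = parse_with_config([])
print(from_config.top, cli_wins.top, no_config.top)   # prints: 3 7 10
```

The sketch does not validate the config: unknown keys end up as extra attributes on the namespace, and values of the wrong type are not converted, so a real implementation would want to check both.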

Lab 6 v0.1.