Lab 9: Disk-Usage Reporter + Debugging Exercise

~90 minutes. Build a CLI tool that wraps du via subprocess and reports the largest directories under a path. Then debug a planted bug using pdb.


Goal: ship lab-9-du.py that wraps du -k, parses the output, and emits a sorted summary of the top-N largest directories. Then fix a bug in a provided lab-9-bug.py using pdb.

Estimated time: 90 minutes (60 for the tool, 30 for the debug exercise)

Prerequisites: Week 9 lecture (subprocess, try/except, pdb). All prior labs.


Setup

mkdir -p ~/fnd-102/lab-9
cd ~/fnd-102/lab-9

This lab assumes a Unix-like system: du is the standard disk-usage tool on Linux and macOS. Windows students should use WSL2, which includes du. Python's shutil.disk_usage avoids subprocess entirely, but it reports usage for a whole filesystem rather than per directory, so it is not a drop-in replacement (see the stretch exercises at the bottom).


Part A: Wrap du (30 min)

The Unix du ("disk usage") utility reports the size of each directory in a tree. The output looks like:

$ du -k /var/log
4    /var/log/cups
12   /var/log/sssd
84   /var/log/journal/4f4e...
84   /var/log/journal
8520 /var/log

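Each line has the form <size_in_kb><tab><path>: the size and the path are separated by a single tab character, even though the listing above renders it as aligned spaces. -k forces 1024-byte (kilobyte) units.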

Wrap it from Python:

import argparse
import subprocess
import sys
from pathlib import Path

def run_du(path):
    """Run `du -k path` and return a list of (size_kb, path) tuples.

    Raises FileNotFoundError if path does not exist.
    Raises subprocess.CalledProcessError if du fails for any other reason.
    """
    if not Path(path).exists():
        raise FileNotFoundError(f'path does not exist: {path}')
    result = subprocess.run(
        ['du', '-k', str(path)],
        capture_output=True, text=True, check=True
    )
    entries = []
    for line in result.stdout.splitlines():
        size_str, _, dir_path = line.partition('\t')
        try:
            entries.append((int(size_str), dir_path))
        except ValueError:
            # malformed line; skip silently
            continue
    return entries

Notice:

  • List form, not shell=True. No shell injection risk; the user's path is a single argument.
  • check=True raises CalledProcessError on a nonzero exit. Without it, you have to check result.returncode manually.
  • line.partition('\t') splits on the first tab; the rest of the path can contain whitespace (rare on Unix but possible).
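  • Skip malformed lines rather than crashing. This does not cover permission errors: du writes "Permission denied" messages to stderr (not stdout), keeps walking, and exits nonzero at the end, so with check=True the whole call raises CalledProcessError even though most of the output was usable. Forward-stretch: handle the permission-denied case explicitly.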
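
One possible way to handle the permission-denied case is to drop check=True and decide for yourself what a nonzero exit means. The sketch below is not part of the required solution: run_du_partial and its cmd parameter are hypothetical names, and the policy (treat a nonzero exit as a failure only when nothing parseable came back) is an assumption you may want to tighten.

```python
import subprocess
import sys


def run_du_partial(path, cmd=('du', '-k')):
    """Hypothetical variant of run_du that tolerates partial failures.

    Returns (entries, warnings): the parsed (size_kb, path) tuples and
    the lines the command wrote to stderr (for example "Permission denied").
    """
    result = subprocess.run(
        [*cmd, str(path)],
        capture_output=True, text=True,  # no check=True: returncode is inspected below
    )
    entries = []
    for line in result.stdout.splitlines():
        size_str, _, dir_path = line.partition('\t')
        try:
            entries.append((int(size_str), dir_path))
        except ValueError:
            continue  # malformed line; skip it
    warnings = result.stderr.splitlines()
    if result.returncode != 0 and not entries:
        # Nothing usable came back, so report this as a real failure.
        raise subprocess.CalledProcessError(
            result.returncode, result.args, result.stdout, result.stderr
        )
    return entries, warnings
```

A caller can then print the warnings to stderr and still report the directories that were readable.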

Part B: Build the CLI (15 min)

def build_parser():
    parser = argparse.ArgumentParser(description='Report the top N largest directories under a path (wraps `du -k`).')
    parser.add_argument('path', type=Path, help='directory to analyze')
    parser.add_argument('--top', type=int, default=10, help='show the top N largest (default: %(default)s)')
    parser.add_argument('--human', action='store_true', help='show sizes in MB / GB instead of kilobytes')
    return parser

def format_size(kb, human=False):
    if not human:
        return f'{kb} KB'
    if kb >= 1024 * 1024:
        return f'{kb / 1024 / 1024:.2f} GB'
    if kb >= 1024:
        return f'{kb / 1024:.2f} MB'
    return f'{kb} KB'

def main():
    args = build_parser().parse_args()
    try:
        entries = run_du(args.path)
    except FileNotFoundError as e:
        print(f'Error: {e}', file=sys.stderr)
        sys.exit(1)
    except subprocess.CalledProcessError as e:
        print(f'Error: du failed with exit code {e.returncode}: {e.stderr}', file=sys.stderr)
        sys.exit(1)

    entries.sort(reverse=True)
    print(f'Top {args.top} directories under {args.path}:')
    for size_kb, dir_path in entries[:args.top]:
        print(f'  {format_size(size_kb, args.human):>12s}  {dir_path}')

if __name__ == '__main__':
    main()

Run it:

python3 lab-9-du.py /usr/share/doc --top 5 --human
python3 lab-9-du.py /tmp --top 10
python3 lab-9-du.py /nope     # error: path does not exist

A representative output:

Top 5 directories under /usr/share/doc:
       28.40 MB  /usr/share/doc
        5.20 MB  /usr/share/doc/libreoffice-common
        4.10 MB  /usr/share/doc/python3-pkg-resources
        2.30 MB  /usr/share/doc/python3-setuptools
        1.50 MB  /usr/share/doc/python3
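
To make the sort-and-slice step in main() concrete, here is a small worked example using sizes from the /var/log listing in Part A (the elided journal subdirectory is left out). It repeats format_size so that it runs on its own; nothing here is new API.

```python
def format_size(kb, human=False):
    # Same logic as format_size in Part B, repeated so this runs standalone.
    if not human:
        return f'{kb} KB'
    if kb >= 1024 * 1024:
        return f'{kb / 1024 / 1024:.2f} GB'
    if kb >= 1024:
        return f'{kb / 1024:.2f} MB'
    return f'{kb} KB'


entries = [
    (4, '/var/log/cups'),
    (12, '/var/log/sssd'),
    (84, '/var/log/journal'),
    (8520, '/var/log'),
]

# Tuples compare element by element, so this orders by size first and
# falls back to the path only when two sizes are equal.
entries.sort(reverse=True)
top_2 = entries[:2]

lines = [
    f'  {format_size(size_kb, True):>12s}  {dir_path}'
    for size_kb, dir_path in top_2
]
# 8520 / 1024 is about 8.32, so the first line reads "8.32 MB";
# 84 is below 1024, so the second line stays in KB.
```

Note that the root path is always the largest entry, because du includes every subdirectory in its parent's total.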

Part C: The debugging exercise (30 min)

Create lab-9-bug.py. This is a buggy implementation; your job is to find and fix the bug with pdb:

"""lab-9-bug.py - a buggy implementation. Find the bug with pdb."""

def parse_du_output(text):
    """Parse `du -k` output into a list of (size_kb, path) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split('\t')
        if len(parts) != 2:
            continue
        size_str, dir_path = parts
        # BUG: this multiplies the size by 100. `du -k` already reports
        # kilobytes, so no multiplier is needed (the original author was
        # converting KB to "centi-MB" or something equally nonsensical;
        # whatever they were doing, it produces wrong totals downstream)
        size_kb = int(size_str) * 100
        entries.append((size_kb, dir_path))
    return entries

def total_size(entries):
    """Return the total size in kilobytes."""
    return sum(size for size, _ in entries)

def main():
    sample_output = """\
1024\t/path/a
2048\t/path/b
512\t/path/c"""
    entries = parse_du_output(sample_output)
    total = total_size(entries)
    print(f'Total: {total} KB')
    # Expected: 1024 + 2048 + 512 = 3584 KB
    # Actual: prints something much larger because of the bug
    assert total == 3584, f'expected 3584, got {total}'

if __name__ == '__main__':
    main()

Save this file unmodified. Run it:

python3 lab-9-bug.py

You should see AssertionError: expected 3584, got 358400 (the 100x multiplier).

Now find the bug with pdb:

  1. Add breakpoint() inside parse_du_output just before the entries.append line.
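  2. Run python3 lab-9-bug.py. You drop into pdb the first time the loop reaches the breakpoint (it fires once per input line; c continues to the next stop).
  3. p size_str prints '1024'.
  4. p size_kb prints 102400 (after the * 100).
  5. The bug is the * 100. Quit pdb with q, then change the line to size_kb = int(size_str).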
  6. Remove the breakpoint() line.
  7. Re-run; the assertion passes.
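
For reference, after the fix the parsing function might look like the sketch below. The comment marks where the temporary breakpoint() sat during the session; the last lines are a quick regression check that reuses the sample data from main().

```python
def parse_du_output(text):
    """Parse `du -k` output into a list of (size_kb, path) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split('\t')
        if len(parts) != 2:
            continue
        size_str, dir_path = parts
        size_kb = int(size_str)  # fixed: du -k already reports kilobytes
        # (the temporary breakpoint() sat here, just before the append)
        entries.append((size_kb, dir_path))
    return entries


sample_output = "1024\t/path/a\n2048\t/path/b\n512\t/path/c"
entries = parse_du_output(sample_output)
total = sum(size for size, _ in entries)
assert total == 3584, f'expected 3584, got {total}'
```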

Commit the fix:

git add lab-9-bug.py
git commit -m "lab-9: fix size-parsing bug in lab-9-bug.py (drop spurious *100 multiplier)"

The instructive part of this exercise: pdb let you find the bug quickly because you could inspect size_kb at the moment it was computed wrong. With print-debugging, you would have added a print, run the script, read the output, and then gone back to add another print to confirm. pdb compressed that loop into a single session.


Part D: Commit your work (15 min)

cd ~/fnd-102/lab-9
git add lab-9-du.py lab-9-bug.py
git commit -m "lab-9: disk-usage reporter wrapping du via subprocess; pdb debugging exercise complete"

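A second commit if you added stretch features (stage the changed file first; --human is part of the required tool, so name the stretch feature you actually added):

git add lab-9-du.py
git commit -m "lab-9: add --exclude flag (stretch)"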

Expected output / artifact

lab-9-du.py should:

  • Wrap du -k PATH via subprocess.run with the list form (NO shell=True)
  • Parse the output into (size_kb, path) tuples
  • Sort by size descending; report top N
  • Handle missing path with FileNotFoundError + clean error message + nonzero exit
  • Handle du errors with CalledProcessError + nonzero exit
  • Support --top and --human argparse flags

lab-9-bug.py should:

  • Be modified ONLY to remove the * 100 multiplier
  • Run without raising AssertionError
  • Have a commit message that names the bug

Files committed: lab-9-du.py, lab-9-bug.py.


What's the failure mode?

This tool's likely failure modes:

  1. du not installed (Windows without WSL). subprocess.run raises FileNotFoundError (the EXECUTABLE not found, not the path argument). Catch FileNotFoundError separately from the missing-path case, or check shutil.which('du') before running.
  2. Permission denied on subdirectories. du prints "Permission denied" to stderr for unreadable subdirs, continues with the readable ones, and exits nonzero once the walk finishes. With check=True that nonzero exit raises CalledProcessError and the partial results are never parsed; you may want to drop check=True and inspect returncode yourself.
  3. Very large directories (millions of entries). du walks the entire tree before returning. For huge trees this can take minutes; the user sees no feedback. Forward-stretch: stream the output line-by-line with subprocess.Popen and report progress.
  4. Paths with tabs. '\t' in path is rare but possible. line.partition('\t') correctly splits on the FIRST tab, so the path keeps any subsequent tabs. Good.
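
For the first failure mode, a pre-flight check along these lines keeps the "du is missing" message distinct from the "path is missing" one. This is a minimal sketch: require_tool is a hypothetical helper name, and the exit code of 1 simply mirrors the one main() already uses.

```python
import shutil
import sys


def require_tool(name='du'):
    """Return the full path to an executable, or exit with a clear message.

    Hypothetical helper: call it at the top of main(), before run_du().
    """
    tool_path = shutil.which(name)
    if tool_path is None:
        print(f'Error: {name!r} was not found on PATH '
              '(on Windows, run this lab inside WSL2)', file=sys.stderr)
        sys.exit(1)
    return tool_path
```

shutil.which returns None when nothing on PATH matches, so the check costs one lookup and never launches a process.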

Common pitfalls

  • shell=True with user input. Discussed in lecture. Lab 9 uses list form; verify yours does too.
  • Forgetting text=True. Without it, result.stdout is bytes; line.partition('\t') raises TypeError. Always text=True for text output.
  • Forgetting check=True (or relying on it inappropriately). With check=True, ANY nonzero exit is an exception. For du which sometimes exits nonzero on partial-permission cases, you might want to handle the exit code manually.
  • int(size_str) on a row that does not split cleanly. The exception-handling try/except in Part A's run_du skips these. Without it, one malformed row crashes the program.
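
The text=True pitfall is easy to reproduce. The sketch below uses a child Python process as a stand-in for du (an assumption made so that it runs even where du is not installed); the behavior of subprocess.run is the same either way.

```python
import subprocess
import sys

# Stand-in for `du -k`: a child Python process that prints one du-style line.
cmd = [sys.executable, '-c', r"print('4\t/var/log/cups')"]

raw = subprocess.run(cmd, capture_output=True, check=True)
# Without text=True, stdout is bytes, and a str separator is rejected.
try:
    raw.stdout.partition('\t')
    bytes_error = None
except TypeError as exc:
    bytes_error = exc

decoded = subprocess.run(cmd, capture_output=True, text=True, check=True)
# With text=True, stdout is str and the Part A parsing works unchanged.
size_str, _, dir_path = decoded.stdout.splitlines()[0].partition('\t')
```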

Stretch (optional)

  1. Replace du with shutil.disk_usage. Python's stdlib has shutil.disk_usage(path) which returns total/used/free for the FILESYSTEM containing path. Not the same as per-directory size, but no subprocess needed.
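  2. Recursive Python implementation. Implement disk-usage in pure Python using pathlib.Path.rglob('*') and Path.stat().st_size. Expect the totals to differ somewhat from du: st_size is a file's apparent size, while du counts allocated disk blocks. Compare speed with du as well; for small directories Python is typically comparable, for large trees du is usually faster.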
  3. Stream du output with Popen. Use subprocess.Popen(...) + proc.stdout iteration to process lines as they arrive instead of waiting for du to finish. Useful for trees that take minutes.
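  4. Add a --exclude PATTERN flag. Pass through to du --exclude=PATTERN (a GNU du option; the BSD du on macOS uses -I instead). Confirm that an awkward pattern (spaces, quotes, semicolons) reaches du as a single argument; the list form means there is no shell to inject into.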
  5. Plant your own pdb-worthy bug. Take any of your prior labs; intentionally introduce a subtle bug; commit the buggy version; give it to a peer; ask them to find it with pdb. The teaching reverses.
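
As a starting point for the Popen stretch, the sketch below yields entries while the command is still running. Treat it as one possible shape rather than the expected answer: stream_du, collect_with_progress, and the cmd and every parameters are hypothetical names, stderr is discarded to keep it short, and the exit status is not checked.

```python
import subprocess
import sys


def stream_du(path, cmd=('du', '-k')):
    """Yield (size_kb, path) tuples while the command is still running.

    Hypothetical helper. stderr is discarded here to keep the sketch short;
    a fuller version would collect the warnings instead.
    """
    with subprocess.Popen(
        [*cmd, str(path)],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True,
    ) as proc:
        for line in proc.stdout:  # one line at a time, as output arrives
            size_str, _, dir_path = line.rstrip('\n').partition('\t')
            try:
                yield int(size_str), dir_path
            except ValueError:
                continue  # malformed line; skip it, as run_du does


def collect_with_progress(path, every=1000, cmd=('du', '-k')):
    """Gather all entries, printing a progress note every `every` lines."""
    entries = []
    for entry in stream_du(path, cmd=cmd):
        entries.append(entry)
        if len(entries) % every == 0:
            print(f'... {len(entries)} directories so far', file=sys.stderr)
    return entries
```

Because du buffers its output when writing to a pipe, lines tend to arrive in bursts rather than one by one, but the user still sees progress long before the walk ends.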

Lab 9 v0.1.