~90 minutes. Build a CLI tool that wraps du via subprocess and reports the largest directories under a path. Then debug a planted bug using pdb.
Goal: ship lab-9-du.py that wraps du -k, parses the output, and emits a sorted summary of the top-N largest directories. Then fix a bug in a provided lab-9-bug.py using pdb.
Estimated time: 90 minutes (60 for the tool, 30 for the debug exercise)
Prerequisites: Week 9 lecture (subprocess, try/except, pdb). All prior labs.
Setup
mkdir -p ~/fnd-102/lab-9
cd ~/fnd-102/lab-9
This lab assumes a Unix-like system (du is the standard disk-usage tool on Linux and macOS). Windows students should use WSL2, which includes du. A subprocess-free alternative, Python's shutil.disk_usage, is a stretch exercise at the bottom; note that it reports usage for a whole filesystem, not per directory.
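To make the difference concrete, here is a minimal sketch of what shutil.disk_usage returns. The helper name filesystem_usage_kb is hypothetical (it is not part of the lab files); it only converts the byte counts to kilobytes for the filesystem that contains the given path.

```python
import shutil


def filesystem_usage_kb(path):
    """Return (total_kb, used_kb, free_kb) for the filesystem containing path.

    Hypothetical helper for illustration; not part of the lab's required code.
    """
    usage = shutil.disk_usage(path)  # named tuple of total, used, free (bytes)
    return usage.total // 1024, usage.used // 1024, usage.free // 1024


total_kb, used_kb, free_kb = filesystem_usage_kb('.')
print(f'total={total_kb} KB  used={used_kb} KB  free={free_kb} KB')
```

Whichever directory you pass, the numbers describe the whole filesystem, which is why this cannot replace the per-directory report that du gives.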
Part A: Wrap du (30 min)
The Unix du ("disk usage") utility reports the size of each directory in a tree. The output looks like:
$ du -k /var/log
4 /var/log/cups
12 /var/log/sssd
84 /var/log/journal/4f4e...
84 /var/log/journal
8520 /var/log
Each line: <size_in_kb><tab><path>. -k forces kilobyte units.
Wrap it from Python:
import argparse
import subprocess
import sys
from pathlib import Path
def run_du(path):
    """Run `du -k path` and return a list of (size_kb, path) tuples.

    Raises FileNotFoundError if path does not exist.
    Raises subprocess.CalledProcessError if du fails for any other reason.
    """
    if not Path(path).exists():
        raise FileNotFoundError(f'path does not exist: {path}')
    result = subprocess.run(
        ['du', '-k', str(path)],
        capture_output=True, text=True, check=True
    )
    entries = []
    for line in result.stdout.splitlines():
        size_str, _, dir_path = line.partition('\t')
        try:
            entries.append((int(size_str), dir_path))
        except ValueError:
            # malformed line; skip silently
            continue
    return entries
Notice:
- List form, not shell=True. No shell injection risk; the user's path is passed as a single argument.
- check=True raises CalledProcessError on a nonzero exit. Without it, you have to check result.returncode manually.
- line.partition('\t') splits on the first tab; the rest of the path can contain whitespace (rare on Unix but possible).
- Skip malformed lines rather than crashing. Separately, du sometimes writes "Permission denied" messages to stderr, keeps going, and then exits nonzero. With check=True that raises CalledProcessError once du has finished, and run_du discards the partial results. Forward-stretch: handle the permission-denied case explicitly.
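One way to approach the permission-denied stretch is sketched below. The names parse_du_lines and run_du_lenient are hypothetical, and the policy (keep partial results, fail only when nothing usable came back) is an assumption rather than a requirement of the lab.

```python
import subprocess


def parse_du_lines(stdout_text):
    """Parse `du -k` stdout into (size_kb, path) tuples, skipping malformed lines."""
    entries = []
    for line in stdout_text.splitlines():
        size_str, _, dir_path = line.partition('\t')
        try:
            entries.append((int(size_str), dir_path))
        except ValueError:
            continue  # no tab, or a non-numeric size
    return entries


def run_du_lenient(path):
    """Hypothetical variant of run_du that keeps partial results.

    Returns (entries, warnings); warnings holds du's stderr lines.
    """
    result = subprocess.run(
        ['du', '-k', str(path)],
        capture_output=True, text=True,  # no check=True: we inspect returncode
    )
    entries = parse_du_lines(result.stdout)
    warnings = result.stderr.splitlines()
    if result.returncode != 0 and not entries:
        # Nothing usable came back, so treat it as a hard failure.
        raise subprocess.CalledProcessError(
            result.returncode, result.args, result.stdout, result.stderr
        )
    return entries, warnings
```

Keeping the parsing in its own function also means it can be checked against a literal string without running du at all.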
Part B: Build the CLI (15 min)
def build_parser():
    parser = argparse.ArgumentParser(
        description='Report the top N largest directories under a path (wraps `du -k`).')
    parser.add_argument('path', type=Path, help='directory to analyze')
    parser.add_argument('--top', type=int, default=10,
                        help='show the top N largest (default: %(default)s)')
    parser.add_argument('--human', action='store_true',
                        help='show sizes in MB / GB instead of kilobytes')
    return parser


def format_size(kb, human=False):
    if not human:
        return f'{kb} KB'
    if kb >= 1024 * 1024:
        return f'{kb / 1024 / 1024:.2f} GB'
    if kb >= 1024:
        return f'{kb / 1024:.2f} MB'
    return f'{kb} KB'


def main():
    args = build_parser().parse_args()
    try:
        entries = run_du(args.path)
    except FileNotFoundError as e:
        print(f'Error: {e}', file=sys.stderr)
        sys.exit(1)
    except subprocess.CalledProcessError as e:
        print(f'Error: du failed with exit code {e.returncode}: {e.stderr}', file=sys.stderr)
        sys.exit(1)
    entries.sort(reverse=True)
    print(f'Top {args.top} directories under {args.path}:')
    for size_kb, dir_path in entries[:args.top]:
        print(f' {format_size(size_kb, args.human):>12s} {dir_path}')


if __name__ == '__main__':
    main()
Run it:
python3 lab-9-du.py /usr/share/doc --top 5 --human
python3 lab-9-du.py /tmp --top 10
python3 lab-9-du.py /nope # error: path does not exist
A representative output:
Top 5 directories under /usr/share/doc:
28.40 MB /usr/share/doc
5.20 MB /usr/share/doc/libreoffice-common
4.10 MB /usr/share/doc/python3-pkg-resources
2.30 MB /usr/share/doc/python3-setuptools
1.50 MB /usr/share/doc/python3
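A note on the sort step: entries.sort(reverse=True) compares whole (size_kb, path) tuples, so directories with equal sizes end up ordered by path in reverse alphabetical order. The sketch below is one possible alternative, not the lab's required approach. It uses a hypothetical helper named top_n that compares sizes only and avoids sorting the full list.

```python
import heapq


def top_n(entries, n):
    """Return the n largest (size_kb, path) entries, biggest first.

    Hypothetical helper; the key function compares sizes only, never paths.
    """
    return heapq.nlargest(n, entries, key=lambda entry: entry[0])


sample = [(4, '/var/log/cups'), (8520, '/var/log'), (84, '/var/log/journal')]
print(top_n(sample, 2))
```

For the small trees in this lab the difference is negligible; sort-then-slice is perfectly fine for the submission.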
Part C: The debugging exercise (30 min)
Create lab-9-bug.py. This is a buggy implementation; your job is to find and fix the bug with pdb:
"""lab-9-bug.py - a buggy implementation. Find the bug with pdb."""
def parse_du_output(text):
    """Parse `du -k` output into a list of (size_kb, path) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split('\t')
        if len(parts) != 2:
            continue
        size_str, dir_path = parts
        # BUG: this multiplies the size by 100, but the input is already in KB
        # (the original author was converting KB to "centi-MB" or something
        # equally nonsensical; whatever they were doing, it produces wrong
        # totals downstream)
        size_kb = int(size_str) * 100
        entries.append((size_kb, dir_path))
    return entries


def total_size(entries):
    """Return the total size in kilobytes."""
    return sum(size for size, _ in entries)


def main():
    sample_output = """\
1024\t/path/a
2048\t/path/b
512\t/path/c"""
    entries = parse_du_output(sample_output)
    total = total_size(entries)
    print(f'Total: {total} KB')
    # Expected: 1024 + 2048 + 512 = 3584 KB
    # Actual: prints something much larger because of the bug
    assert total == 3584, f'expected 3584, got {total}'


if __name__ == '__main__':
    main()
Save this file unmodified. Run it:
python3 lab-9-bug.py
You should see AssertionError: expected 3584, got 358400 (the 100x multiplier).
Now find the bug with pdb:
- Add breakpoint() inside parse_du_output just before the entries.append line.
- Run python3 lab-9-bug.py. You drop into pdb at the breakpoint.
- p size_str prints '1024'.
- p size_kb prints 102400 (after the * 100).
- The bug is the * 100. Fix to * 1, or just int(size_str).
- Remove the breakpoint() line.
- Re-run; the assertion passes.
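For reference, a minimal sketch of what the parser might look like once the steps above are done, with the same sample data as a quick sanity check. This is one acceptable form of the fix, not the only one.

```python
def parse_du_output(text):
    """Parse `du -k` output into a list of (size_kb, path) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split('\t')
        if len(parts) != 2:
            continue
        size_str, dir_path = parts
        size_kb = int(size_str)  # du -k already reports kilobytes; no multiplier
        entries.append((size_kb, dir_path))
    return entries


sample = '1024\t/path/a\n2048\t/path/b\n512\t/path/c'
entries = parse_du_output(sample)
total = sum(size for size, _ in entries)
print(f'Total: {total} KB')
```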
Commit the fix:
git add lab-9-bug.py
git commit -m "lab-9: fix size-parsing bug in lab-9-bug.py (drop spurious *100 multiplier)"
The instructive part of this exercise: you found the bug in 30 seconds with pdb because you could inspect size_kb at the moment it was computed wrong. With print-debugging, you would have added a print, run, observed, gone back to add another print to confirm... pdb compressed the loop.
Part D: Commit your work (15 min)
cd ~/fnd-102/lab-9
git add lab-9-du.py lab-9-bug.py
git commit -m "lab-9: disk-usage reporter wrapping du via subprocess; pdb debugging exercise complete"
A second commit if you added stretch features (stage the changes first, and name the feature in the message):
git add lab-9-du.py
git commit -m "lab-9: add --exclude flag (stretch)"
Expected output / artifact
lab-9-du.py should:
- Wrap du -k PATH via subprocess.run with the list form (NO shell=True)
- Parse the output into (size_kb, path) tuples
- Sort by size descending; report top N
- Handle missing path with FileNotFoundError + clean error message + nonzero exit
- Handle du errors with CalledProcessError + nonzero exit
- Support --top and --human argparse flags
lab-9-bug.py should:
- Be modified ONLY to remove the * 100 multiplier
- Run without raising AssertionError
- Have a commit message that names the bug
Files committed: lab-9-du.py, lab-9-bug.py.
What's the failure mode?
This tool's likely failure modes:
- du not installed (Windows without WSL). subprocess.run raises FileNotFoundError (the EXECUTABLE was not found, not the path argument). Catch this FileNotFoundError separately from the missing-path case, or check shutil.which('du') before running.
- Permission denied on subdirectories. du prints "Permission denied" to stderr for unreadable subdirs, continues with the readable ones, and exits nonzero once it has finished. With check=True that raises CalledProcessError even though stdout holds usable results; you may want to drop check=True and inspect returncode yourself.
- Very large directories (millions of entries). subprocess.run returns nothing until du has walked the entire tree. For huge trees this can take minutes; the user sees no feedback. Forward-stretch: stream the output line-by-line with subprocess.Popen and report progress.
- Paths with tabs. '\t' in a path is rare but possible. line.partition('\t') correctly splits on the FIRST tab, so the path keeps any subsequent tabs. Good.
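The first and third failure modes can be sketched together. The helper names require_tool, stream_lines, and stream_du are hypothetical, and the demo at the end uses a stand-in Python child process (not du) so that it runs even on a machine without du.

```python
import shutil
import subprocess
import sys


def require_tool(name):
    """Return the full path of an executable, or raise FileNotFoundError."""
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(f'executable not found on PATH: {name}')
    return found


def stream_lines(cmd):
    """Yield stdout lines from cmd as they arrive, without waiting for exit."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line.rstrip('\n')
    # Leaving the with-block waits for the child process to finish.


def stream_du(path):
    """Yield (size_kb, path) tuples while du is still running."""
    du = require_tool('du')  # fails early, with a message that names the tool
    for line in stream_lines([du, '-k', str(path)]):
        size_str, _, dir_path = line.partition('\t')
        try:
            yield int(size_str), dir_path
        except ValueError:
            continue  # malformed line; skip it


# Stand-in child process that prints two du-style lines.
demo_cmd = [sys.executable, '-c', r"print('4\t/a'); print('8\t/b')"]
demo = list(stream_lines(demo_cmd))
print(demo)
```

Because stream_du is a generator, a caller could print a progress counter every few thousand entries instead of staying silent until du exits.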
Common pitfalls
- shell=True with user input. Discussed in lecture. Lab 9 uses list form; verify yours does too.
- Forgetting text=True. Without it, result.stdout is bytes; line.partition('\t') raises TypeError. Always pass text=True for text output.
- Forgetting check=True (or relying on it inappropriately). With check=True, ANY nonzero exit is an exception. For du, which sometimes exits nonzero on partial-permission cases, you might want to handle the exit code manually.
- int(size_str) on a row that does not split cleanly. The try/except in Part A's run_du skips these. Without it, one malformed row crashes the program.
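The text=True pitfall is easy to reproduce. The sketch below runs a stand-in Python child process (an assumption made so that the example does not depend on du) once without text=True and once with it.

```python
import subprocess
import sys

# Stand-in child process that prints one du-style line.
cmd = [sys.executable, '-c', r"print('4\t/var/log/cups')"]

as_bytes = subprocess.run(cmd, capture_output=True, check=True).stdout
as_text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

try:
    as_bytes.partition('\t')  # str separator on a bytes object
    raised = False
except TypeError:
    raised = True

size_str, _, dir_path = as_text.splitlines()[0].partition('\t')
print(type(as_bytes).__name__, type(as_text).__name__, raised)
```

If you ever do need the raw bytes, decode them explicitly before parsing instead of mixing bytes and str.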
Stretch (optional)
- Replace du with shutil.disk_usage. Python's stdlib has shutil.disk_usage(path), which returns total/used/free for the FILESYSTEM containing path. Not the same as per-directory size, but no subprocess needed.
- Recursive Python implementation. Implement disk-usage in pure Python using pathlib.Path.rglob('*') and Path.stat().st_size. Note that st_size is the apparent file size, so totals can differ from du's block counts. Compare speed with du; for small directories Python is comparable, for large trees du is usually faster.
- Stream du output with Popen. Use subprocess.Popen(...) + proc.stdout iteration to process lines as they arrive instead of waiting for du to finish. Useful for trees that take minutes.
- Add a --exclude PATTERN flag. Pass through to du --exclude=PATTERN (a GNU du option; the BSD du on macOS uses -I instead). Test that the user's pattern is shell-safe (no injection risk because of list form).
- Plant your own pdb-worthy bug. Take any of your prior labs; intentionally introduce a subtle bug; commit the buggy version; give it to a peer; ask them to find it with pdb. The teaching reverses.
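As a starting point for the recursive stretch, here is a minimal sketch under stated assumptions: the function name dir_sizes_bytes is hypothetical, symlinks are skipped, sizes are apparent sizes from st_size (so they will not match du exactly), and unreadable directories are not handled.

```python
import tempfile
from pathlib import Path


def dir_sizes_bytes(root):
    """Return {directory: total_bytes}, each total including all descendants."""
    root = Path(root)
    totals = {root: 0}
    for entry in root.rglob('*'):
        if entry.is_symlink():
            continue  # do not count symlinks or their targets
        if entry.is_dir():
            totals.setdefault(entry, 0)  # make sure empty directories appear
        elif entry.is_file():
            size = entry.stat().st_size
            # Charge the file to every ancestor directory up to root.
            for parent in entry.parents:
                totals[parent] = totals.get(parent, 0) + size
                if parent == root:
                    break
    return totals


# Small throwaway tree to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / 'a' / 'b').mkdir(parents=True)
    (base / 'a' / 'x.bin').write_bytes(b'\0' * 2048)
    (base / 'a' / 'b' / 'y.bin').write_bytes(b'\0' * 1024)
    (base / 'z.bin').write_bytes(b'\0' * 100)
    sizes = dir_sizes_bytes(base)
    demo = {d.relative_to(base).as_posix(): n for d, n in sizes.items()}

print(demo)
```

To compare against your du wrapper, divide the byte totals by 1024 and expect small differences, since du counts allocated blocks.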
Lab 9 v0.1.