Classroom Glossary Public page

Lab 11: Directory Integrity Checker

802 words

~90 minutes. Build a CLI tool that computes SHA-256 for every file in a directory tree; save to a manifest; verify a tree against a saved manifest. The structural shape of every intrusion-detection system.


Goal: ship lab-11-checker.py with two subcommands: manifest DIR --out FILE writes a manifest; verify DIR --manifest FILE reads the manifest and reports modified / added / deleted files.

Estimated time: 90 minutes

Prerequisites: Week 11 lecture (hashlib, pathlib, chunked-read). All prior labs.


Setup

mkdir -p ~/fnd-102/lab-11
cd ~/fnd-102/lab-11

Create a sample directory to scan:

mkdir -p sample-tree/{src,docs,tests}
echo "main code" > sample-tree/src/main.py
echo "utility module" > sample-tree/src/utils.py
echo "user guide" > sample-tree/docs/README.md
echo "test suite" > sample-tree/tests/test_main.py

You should now have 4 files in 3 subdirectories.


Part A: Hash a single file (15 min)

Write the chunked-read file-hash function in lab-11-checker.py:

import hashlib
from pathlib import Path

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    Reads in chunks so the function works on files larger than memory.
    """
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

Test in the REPL:

>>> from lab_11_checker import hash_file
>>> hash_file('sample-tree/src/main.py')
'<some 64-char hex string>'

Cross-check with the system sha256sum:

sha256sum sample-tree/src/main.py
# Should match your Python hash output

If they don't match, you opened the file in text mode (you should be in binary mode); or you forgot to write the trailing newline when you created the file.


Part B: Build the manifest (25 min)

import json

def walk_files(root):
    """Yield (relative_path, full_path) for every file under root."""
    root = Path(root)
    for p in sorted(root.rglob('*')):
        if p.is_file():
            yield (str(p.relative_to(root)), p)

def build_manifest(root):
    """Return a dict {relative_path: sha256_hex} for every file under root."""
    return {rel: hash_file(full) for rel, full in walk_files(root)}

def write_manifest(manifest, path):
    """Write the manifest dict to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({
            'version': 1,
            'algorithm': 'sha256',
            'files': manifest
        }, f, indent=2, sort_keys=True)

Two observations:

  1. sorted(root.rglob('*')) for deterministic ordering. Without sorting, different runs may produce files in different orders. Sorted = reproducible.
  2. p.relative_to(root) gives paths relative to the scan root. Without it, the manifest would store absolute paths, which break the moment you move the directory.

The manifest's top-level dict has a version and algorithm field. This is forward-thinking: a future v2 manifest might use a different algorithm; the version field lets the verifier know how to interpret it.


Part C: Build the verifier (25 min)

def read_manifest(path):
    """Read a manifest from disk; return the files dict.

    Raises ValueError if the manifest version or algorithm is unknown.
    """
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    if data.get('version') != 1:
        raise ValueError(f'unsupported manifest version: {data.get("version")}')
    if data.get('algorithm') != 'sha256':
        raise ValueError(f'unsupported algorithm: {data.get("algorithm")}')
    return data['files']

def verify(root, manifest_path):
    """Compare current state of root against the manifest.

    Returns a dict with three sets: modified, added, deleted.
    """
    expected = read_manifest(manifest_path)
    current = build_manifest(root)

    expected_paths = set(expected.keys())
    current_paths = set(current.keys())

    added = current_paths - expected_paths
    deleted = expected_paths - current_paths
    common = expected_paths & current_paths
    modified = {p for p in common if expected[p] != current[p]}

    return {'modified': modified, 'added': added, 'deleted': deleted}

The function returns a dict of three sets. The caller decides how to display them (text, JSON, exit code). Separation of computation from presentation is the discipline you have been practicing.


Part D: Wire up the CLI (15 min)

Use argparse subparsers (one parser for manifest, one for verify):

import argparse
import sys

def build_parser():
    parser = argparse.ArgumentParser(
        description='Directory integrity checker. Compute SHA-256 manifests and verify against them.'
    )
    sub = parser.add_subparsers(dest='command', required=True)

    p_manifest = sub.add_parser('manifest', help='write a manifest for a directory tree')
    p_manifest.add_argument('directory', type=Path, help='directory to scan')
    p_manifest.add_argument('--out', type=Path, required=True, help='manifest output path')

    p_verify = sub.add_parser('verify', help='check a directory tree against a manifest')
    p_verify.add_argument('directory', type=Path, help='directory to verify')
    p_verify.add_argument('--manifest', type=Path, required=True, help='manifest input path')

    return parser

def main():
    args = build_parser().parse_args()

    if args.command == 'manifest':
        if not args.directory.is_dir():
            print(f'Error: {args.directory} is not a directory', file=sys.stderr)
            sys.exit(1)
        manifest = build_manifest(args.directory)
        write_manifest(manifest, args.out)
        print(f'Wrote manifest with {len(manifest)} files to {args.out}')
        sys.exit(0)

    elif args.command == 'verify':
        if not args.directory.is_dir():
            print(f'Error: {args.directory} is not a directory', file=sys.stderr)
            sys.exit(1)
        if not args.manifest.exists():
            print(f'Error: manifest {args.manifest} does not exist', file=sys.stderr)
            sys.exit(1)
        result = verify(args.directory, args.manifest)
        clean = not (result['modified'] or result['added'] or result['deleted'])
        if clean:
            print('OK: all files match the manifest')
            sys.exit(0)
        else:
            print('CHANGES DETECTED:')
            for p in sorted(result['modified']):
                print(f'  MODIFIED  {p}')
            for p in sorted(result['added']):
                print(f'  ADDED     {p}')
            for p in sorted(result['deleted']):
                print(f'  DELETED   {p}')
            sys.exit(1)

if __name__ == '__main__':
    main()

Test:

# Write a manifest of the clean tree
python3 lab-11-checker.py manifest sample-tree --out manifest.json

# Verify (no changes yet; should be OK)
python3 lab-11-checker.py verify sample-tree --manifest manifest.json
# OK: all files match the manifest

# Now modify a file, add a file, delete a file
echo "MODIFIED" >> sample-tree/src/main.py
echo "new file" > sample-tree/src/new.py
rm sample-tree/docs/README.md

# Verify again (should report all three changes)
python3 lab-11-checker.py verify sample-tree --manifest manifest.json
# CHANGES DETECTED:
#   MODIFIED  src/main.py
#   ADDED     src/new.py
#   DELETED   docs/README.md

The exit codes follow integrity-tool convention: 0 if clean, 1 if changes detected. A wrapper script (cron job, systemd timer, CI) can rely on this.


Part E: Commit your work (10 min)

cd ~/fnd-102/lab-11
git add lab-11-checker.py
# do NOT commit manifest.json or sample-tree/; they are test artifacts
echo 'manifest.json' >> ~/fnd-102/.gitignore
echo 'sample-tree/' >> ~/fnd-102/.gitignore
git add ~/fnd-102/.gitignore
git commit -m "lab-11: directory integrity checker with SHA-256 manifests; manifest/verify subcommands"

Expected output / artifact

lab-11-checker.py should:

  • Use hashlib.sha256 with chunked reads (works on files larger than memory)
  • Walk the directory with pathlib.Path.rglob('*')
  • Skip directories and symlinks; only hash regular files
  • Write the manifest as JSON with version and algorithm keys
  • Read the manifest and reject unknown versions / algorithms
  • Use argparse subparsers (manifest and verify)
  • Exit 0 on success / clean; exit 1 on errors or detected changes

Files committed: lab-11-checker.py, .gitignore update.


What's the failure mode?

This tool's likely failure modes:

  1. Symbolic links. Path.is_file() returns True for both regular files AND for symlinks that point to regular files. Hashing the link follows it to the target. If your tree has symlinks to outside the scan root, you may hash files that are not "in" the tree. Defensive: use p.is_file() and not p.is_symlink(), or hash the link target's path string instead.
  2. Files modified between hash and write. If a file is being written while you scan, the hash is of the partial content. The next verification will report it as modified. This is a real production issue (databases, log files); the typical workaround is "scan during a quiet period" or "use a filesystem snapshot."
  3. The manifest itself. Your manifest is stored next to the files; an attacker who modifies a file also modifies the manifest. For real intrusion detection, store the manifest on a separate, write-protected medium (read-only filesystem; off-host).
  4. Permission denied. A file your process cannot read raises PermissionError. The current code crashes. Defensive: catch and report; continue with other files.

Common pitfalls

  • Text mode for hashing. open(path, 'r') (default text mode) re-encodes line endings on Windows, changing the hash. ALWAYS 'rb' for hashing.
  • Forgetting to sort rglob. Without sorting, the manifest's iteration order is filesystem-dependent. The JSON sort_keys=True on output saves you for the manifest file itself, but if you ever rely on walk_files() order elsewhere, sort.
  • Hashing the manifest in its own scan. If manifest.json is inside sample-tree/, the second run will see the manifest itself in the scan, fail to find its hash in itself (chicken-and-egg), and report it as added. Either keep the manifest outside the tree, or exclude it explicitly.
  • compare_digest vs ==. For integrity checking against a known hash, == is fine. For HMAC verification (forward-pointer to SEC-101), use hashlib.compare_digest to defend against timing attacks.

Stretch (optional)

  1. Add --algorithm argparse argument with choices SHA-256, SHA-512, BLAKE2b. Manifest version 2 supports the new field.
  2. Add --exclude PATTERN for the manifest command. Glob-style patterns excluded from the scan.
  3. Hash in parallel. Use concurrent.futures.ThreadPoolExecutor or ProcessPoolExecutor to hash files in parallel. For large trees on multi-core machines this is significantly faster.
  4. Stream progress. For huge trees the current code prints nothing until done. Add periodic progress lines: "scanned 1000 files so far."
  5. Track file metadata too. Not just hash: size, mtime, permissions. Useful for detecting changes that do not modify content (a file moved by cp -p may have identical content but new metadata).
  6. HMAC-sign the manifest. Use hashlib.blake2b(key=...) or hmac.new(key, ...) to sign the manifest with a secret. A tamperer who modifies a file AND the manifest must also forge the signature, which they can't without the key.

Lab 11 v0.1.