Cryptographic hashes as a tool for detecting change. The lab is a directory-integrity checker: compute SHA-256 for every file in a tree, save the manifest, later detect modifications.
Theme
A cryptographic hash function takes any input (a file, a string, a directory tree) and produces a fixed-size fingerprint. The fingerprint has two useful properties: (1) the same input always produces the same fingerprint; (2) a tiny change to the input produces a wildly different fingerprint. These two properties make hashes the standard tool for integrity checking: did this file change since I last saw it?
SHA-256 is the standard modern hash function: 256 bits of output, no known collisions, and fast on commodity CPUs (many have dedicated instructions for it). The Python stdlib's hashlib.sha256 is what you reach for. Other hash functions exist (SHA-1, MD5, BLAKE2, SHA-3); SHA-256 is the safe default.
The lab is a directory-integrity checker. Given a directory path, the tool walks the tree, computes SHA-256 for every file, and writes the (path, hash) pairs to a manifest file. Running the tool a second time with --verify compares current hashes against the manifest and reports any files that changed, were added, or were deleted. This is structurally what intrusion-detection systems (AIDE, Tripwire) and backup-verification tools do.
This week introduces hashing at the practical level (use hashlib, compare digests, report differences). The cryptographic theory (collision resistance, birthday attacks, why SHA-1 is deprecated) is SEC-101 / CSA-201 material; FND-102 stops at "this is how you use it correctly."
By the end of week 11 you can: compute a SHA-256 hash with hashlib; chunk-read large files to hash them without loading them into memory; pick between hexdigest (human-readable) and digest (raw bytes); recognize the integrity-vs-encryption-vs-signing distinction (hashing alone does not authenticate; you also need a MAC or a signature for that).
Reading list (~1 hour)
- Matthes does not cover hashlib. Use Sweigart.
- Sweigart, Cracking Codes with Python Ch 1 ("Making Paper Cryptography Tools") at the academy library copy. ISBN 978-1-59327-822-9; Calibre id 243. The first chapter sets cryptographic context without requiring math; useful framing even though SHA-256 is not a cipher.
- Python `hashlib` module docs at https://docs.python.org/3/library/hashlib.html. ~15 min read. Read at least the introduction and the "Hash algorithms" section.
- Real Python: "Python's hash() Function and the hashlib Module" at https://realpython.com/python-hash-table/ (covers `hash()` for dict keys) and https://realpython.com/python-hashlib/ (covers `hashlib` for cryptographic hashing). ~25 min total.
- NIST SHA-256 spec (FIPS 180-4) at https://csrc.nist.gov/publications/detail/fips/180/4/final. Reference only; do not read end-to-end. Useful to know it exists.
Lecture outline (~1.5 hours, 2 sessions of ~50 min)
Session 1: Hashing basics
Section 1.1: What a hash function is
- A hash function maps inputs of any size to outputs of fixed size.
- A cryptographic hash function additionally satisfies three security properties:
  - Pre-image resistance: given a hash, it is infeasible to find an input that produces it
  - Second pre-image resistance: given an input, it is infeasible to find a different input with the same hash
  - Collision resistance: it is infeasible to find any two distinct inputs with the same hash
- SHA-256 (Secure Hash Algorithm, 256-bit output) has no known practical attack against any of the three. SHA-1 still resists practical pre-image attacks, but its collision resistance was broken in 2017 (the SHAttered attack: https://shattered.io/). MD5's collision resistance is broken even more thoroughly (collisions take seconds on a laptop); its pre-image resistance has not been broken in practice, but do not use it for any new code.
- "Infeasible" means computationally expensive, not impossible. For SHA-256, finding a collision takes ~2^128 operations, far beyond any computing power available today.
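The ~2^128 figure comes from the birthday bound: for an ideal n-bit hash, a collision is expected after roughly sqrt(pi/2 * 2^n) random inputs. The sketch below is only arithmetic, not an attack; the attacker speed `HASHES_PER_SECOND` is a made-up illustrative number, and `expected_trials_for_collision` is a hypothetical helper name.

```python
import math


def expected_trials_for_collision(bits: int) -> float:
    """Approximate number of random inputs hashed before the first collision
    for an ideal hash with `bits` bits of output (birthday bound)."""
    return math.sqrt(math.pi / 2 * 2.0 ** bits)


# Hypothetical attacker speed, chosen only for illustration.
HASHES_PER_SECOND = 1e12
SECONDS_PER_YEAR = 365.25 * 24 * 3600

for bits in (16, 64, 256):
    trials = expected_trials_for_collision(bits)
    years = trials / HASHES_PER_SECOND / SECONDS_PER_YEAR
    print(f"{bits:>3}-bit hash: ~2^{math.log2(trials):.1f} trials, ~{years:.3g} years")
```

The same formula explains why a hash truncated to 16 bits collides after a few hundred inputs while full SHA-256 does not collide in any realistic time.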
Section 1.2: SHA-256 in Python
- The basic pattern:

      import hashlib

      h = hashlib.sha256()
      h.update(b'hello world')
      print(h.hexdigest())
      # prints b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9

- The hash object accumulates input via `update()`. Call it multiple times to hash content in chunks:

      h = hashlib.sha256()
      h.update(b'hello ')
      h.update(b'world')
      h.hexdigest()  # same as above; the function does not care about chunking

- `hexdigest()` returns a 64-character hex string (256 bits / 4 bits per hex char = 64 chars). `digest()` returns 32 bytes (the raw output). Use hex for human display; use raw bytes for storage or further computation.
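A small sketch of the two properties from the Theme (same input gives the same fingerprint; a tiny change gives a very different one) together with the relationship between `digest()` and `hexdigest()`. The variable names are illustrative only.

```python
import hashlib

a = hashlib.sha256(b"hello world")
b = hashlib.sha256(b"hello world")
c = hashlib.sha256(b"hello worle")  # only the last character differs

raw = a.digest()       # 32 raw bytes
hexed = a.hexdigest()  # the same 256 bits as 64 hex characters

same = a.digest() == b.digest()  # deterministic: identical input, identical digest

# Count how many of the 256 output bits differ after the one-character change.
diff_bits = bin(
    int.from_bytes(a.digest(), "big") ^ int.from_bytes(c.digest(), "big")
).count("1")

print("same input, same digest:", same)
print("lengths:", len(raw), "bytes /", len(hexed), "hex chars")
print("hex form round-trips to raw bytes:", bytes.fromhex(hexed) == raw)
print("bits changed by a one-character edit:", diff_bits, "of 256")
```

For a well-behaved hash, roughly half of the output bits typically flip after any change to the input, which is why a modified file is so easy to spot.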
Section 1.3: Hashing a file
- For files small enough to fit in memory:

      with open('myfile', 'rb') as f:
          data = f.read()
      print(hashlib.sha256(data).hexdigest())

- For larger files, chunk-read:

      def hash_file(path, chunk_size=65536):
          """Return the SHA-256 hex digest of the file at `path`."""
          h = hashlib.sha256()
          with open(path, 'rb') as f:
              while chunk := f.read(chunk_size):
                  h.update(chunk)
          return h.hexdigest()

- The `chunk := f.read(...)` expression uses the walrus operator (Python 3.8+). Assign-and-test: keep reading until you get an empty bytes object (EOF).
- `'rb'` mode is critical: binary mode. Text mode returns `str` instead of `bytes` (which `update()` rejects), and decoding translates line endings, so hashing the re-encoded text would give a different hash.
- The chunk size 65536 (64 KiB) is a common choice (a multiple of the typical 4 KiB OS page size); 4096 to 1048576 all work fine.
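The two claims above can be checked directly: chunking does not change the digest, and reading in text mode does. This is a minimal sketch that writes a throwaway file with Windows-style line endings into a temporary directory; the file name `sample.txt` and the payload are arbitrary choices for the demonstration.

```python
import hashlib
import os
import tempfile


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


payload = b"line one\r\nline two\r\n" * 10_000  # Windows-style line endings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(payload)

    whole = hashlib.sha256(payload).hexdigest()
    chunked = hash_file(path, chunk_size=4096)

    # Text mode decodes bytes to str and translates "\r\n" to "\n" while reading,
    # so the re-encoded text is no longer the bytes that are on disk.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    text_mode = hashlib.sha256(text.encode("utf-8")).hexdigest()

print("chunked == whole:  ", chunked == whole)
print("text-mode == whole:", text_mode == whole)
```

The first comparison holds and the second does not, which is the usual reason a Python digest disagrees with `sha256sum`.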
Section 1.4: Verifying a file against a known hash
- The pattern for "did this file change?":

      expected = 'b94d27b9934d...'
      actual = hash_file('myfile')
      if actual != expected:
          print('FILE CHANGED')

- Use `hmac.compare_digest(a, b)` instead of `==` for constant-time comparison (defends against timing attacks). The function lives in the `hmac` module, not `hashlib`. For integrity checking of file contents this is rarely needed; for HMAC verification it matters.
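A minimal sketch of the verification pattern using the constant-time comparison, which the standard library provides as `hmac.compare_digest`. The helper name `matches` is hypothetical; the expected value is the SHA-256 of `b'hello world'` shown earlier in this section.

```python
import hashlib
import hmac


def matches(expected_hex: str, data: bytes) -> bool:
    """Return True if SHA-256(data) equals `expected_hex` (case-insensitive)."""
    actual_hex = hashlib.sha256(data).hexdigest()
    # compare_digest takes time independent of where the first mismatch occurs.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


EXPECTED = "b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9"

print(matches(EXPECTED, b"hello world"))   # unchanged content
print(matches(EXPECTED, b"hello world!"))  # modified content
```

Lower-casing the expected value is a design choice for this sketch: published checksums sometimes use upper-case hex, while `hexdigest()` always returns lower-case.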
Session 2: Manifests, the integrity-vs-authenticity distinction
Section 2.1: A manifest file
- A manifest is "the list of expected hashes for a set of files." Typical formats:
  - Plain text: one line per file, format `<hexdigest>  <path>` (the convention `sha256sum` uses)
  - JSON: `{ "files": { "path1": "hash1", "path2": "hash2" } }`
  - CSV: `path,hash` with a header row
- The standard Unix `sha256sum` utility produces and verifies the plain-text format:

      sha256sum file1 file2 > manifest.txt
      sha256sum -c manifest.txt

- Your Lab 11 reimplements this in Python. The Python version has advantages: cross-platform, can walk directories recursively, can output JSON for downstream tools.
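One possible way to write and read the plain-text format, shown as a sketch and not as the Lab 11 solution. In `sha256sum` output the digest is followed by two characters before the path: a space, then either a second space (text mode) or `*` (binary mode). The function names `format_manifest` and `parse_manifest` are hypothetical, and the sketch assumes paths without newlines or backslashes (which `sha256sum` escapes specially).

```python
import hashlib


def format_manifest(hashes: dict) -> str:
    """Render {path: hexdigest} as sha256sum-style lines, sorted by path."""
    lines = [f"{digest}  {path}" for path, digest in sorted(hashes.items())]
    return "\n".join(lines) + "\n"


def parse_manifest(text: str) -> dict:
    """Parse sha256sum-style lines back into {path: hexdigest}."""
    hashes = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        digest, _, rest = line.partition(" ")
        # rest[0] is the mode marker (" " or "*"); the path follows it.
        hashes[rest[1:]] = digest
    return hashes


example = {
    "notes.txt": hashlib.sha256(b"alpha").hexdigest(),
    "data/report.csv": hashlib.sha256(b"beta").hexdigest(),
}
text = format_manifest(example)
print(text, end="")
print(parse_manifest(text) == example)
```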
Section 2.2: Walking a directory
- The stdlib `pathlib.Path` walks a tree:

      from pathlib import Path

      for p in Path('mydir').rglob('*'):
          if p.is_file():
              print(p, hash_file(p))

- `rglob('*')` is recursive glob; matches every name in every subdirectory.
- Filter to files only with `p.is_file()`; directories should not be hashed as files. Note that `is_file()` follows symbolic links, so also check `p.is_symlink()` if links should be skipped.
- Sort the output (by path or by hash) for deterministic ordering across runs.
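Putting the walk, the file filter, and the sort together might look like the sketch below. It stores paths relative to the root with forward slashes so the manifest does not depend on where the tree lives or on the OS path separator; that choice, the symlink skip, and the name `build_manifest` are assumptions of this example and not requirements of the lab.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Return {relative_posix_path: hexdigest} for every regular file under root."""
    root = Path(root)
    manifest = {}
    for p in sorted(root.rglob("*")):
        # is_file() follows symlinks, so exclude links explicitly.
        if p.is_symlink() or not p.is_file():
            continue
        manifest[p.relative_to(root).as_posix()] = hash_file(p)
    return manifest


# Demonstration on a throwaway tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_bytes(b"alpha")
    (root / "sub" / "b.txt").write_bytes(b"beta")
    demo = build_manifest(root)

for rel_path, digest in demo.items():
    print(digest, rel_path)
```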
Section 2.3: Detecting changes
- Three change categories:
  - Modified: path exists in both the manifest and the current tree; hashes differ
  - Added: path exists in the current tree but not in the manifest
  - Deleted: path exists in the manifest but not in the current tree
- Compute as set operations:

      manifest = {'a.txt': 'hash1', 'b.txt': 'hash2'}
      current = {'a.txt': 'hash1_NEW', 'c.txt': 'hash3'}

      modified = {p for p in manifest if p in current and manifest[p] != current[p]}  # {'a.txt'}
      added = current.keys() - manifest.keys()    # {'c.txt'}
      deleted = manifest.keys() - current.keys()  # {'b.txt'}

- Set operations take time roughly proportional to the number of paths, so they stay fast even on millions of files.
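To compare against a manifest from an earlier run, the manifest has to be saved and reloaded first. The sketch below uses the JSON layout from Section 2.1 and wraps the three set operations in one function that returns sorted lists, which makes the report deterministic. The names `save_manifest`, `load_manifest`, and `diff_manifests` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_manifest(manifest, manifest_path):
    """Write {path: hexdigest} as JSON under a top-level "files" key."""
    payload = json.dumps({"files": manifest}, indent=2, sort_keys=True)
    Path(manifest_path).write_text(payload, encoding="utf-8")


def load_manifest(manifest_path):
    """Read a manifest written by save_manifest back into a dict."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))["files"]


def diff_manifests(saved, current):
    """Classify paths as modified, added, or deleted."""
    return {
        "modified": sorted(p for p in saved.keys() & current.keys()
                           if saved[p] != current[p]),
        "added": sorted(current.keys() - saved.keys()),
        "deleted": sorted(saved.keys() - current.keys()),
    }


with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    save_manifest({"a.txt": "hash1", "b.txt": "hash2"}, manifest_path)
    saved = load_manifest(manifest_path)

report = diff_manifests(saved, {"a.txt": "hash1_NEW", "c.txt": "hash3"})
for category, paths in report.items():
    print(f"{category}: {paths}")
```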
Section 2.4: Integrity vs authenticity vs encryption
- Integrity: "this file has not been modified." A hash detects integrity violations.
- Authenticity: "this file came from a specific sender." Requires a keyed check: a MAC (HMAC with a shared secret) or a digital signature (asymmetric crypto). A bare hash is NOT authentication: an attacker who can modify the file can also modify the manifest.
- Encryption: "this file's contents are hidden." A hash does NOT encrypt; it produces a fingerprint, not a reversible transformation.
- For Lab 11 (your own directory, your own manifest on the same disk), bare hashing is enough; the attacker model is "did I accidentally modify a file?", not "is an adversary tampering with my files?". For real intrusion detection (SEC-101), you also store the manifest on a write-protected or off-host medium so a tamperer cannot edit both at once.
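The difference between a bare hash and an authenticated tag can be seen in a few lines. This sketch uses the stdlib `hmac` module with SHA-256; the key, the manifest bytes, and the helper names `make_tag` and `verify_tag` are invented for illustration, and a real deployment would need proper key storage (SEC-101 material).

```python
import hashlib
import hmac

SECRET_KEY = b"example-shared-secret"  # hypothetical key, for illustration only


def make_tag(data: bytes, key: bytes = SECRET_KEY) -> str:
    """Return the HMAC-SHA256 tag of `data` as hex."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify_tag(data: bytes, tag: str, key: bytes = SECRET_KEY) -> bool:
    """Check `tag` against `data` in constant time."""
    return hmac.compare_digest(make_tag(data, key), tag)


manifest_bytes = b'{"files": {"a.txt": "hash1"}}'
tag = make_tag(manifest_bytes)

# An attacker edits the manifest. Anyone can recompute a bare SHA-256 for the
# edited bytes, but producing a valid tag requires the secret key.
tampered = manifest_bytes.replace(b"hash1", b"hash9")
forged = make_tag(tampered, key=b"attacker-guess")

print(verify_tag(manifest_bytes, tag))  # genuine manifest, genuine tag
print(verify_tag(tampered, tag))        # edited manifest, old tag
print(verify_tag(tampered, forged))     # edited manifest, tag made without the key
```

Only the first check succeeds: the tag binds the data to the key, which is what a bare hash cannot do.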
Section 2.5: Other hash functions
- `hashlib.sha1` exists; do NOT use for new code (collision-broken).
- `hashlib.md5` exists; do NOT use for new code (collision-broken; pre-image still mostly intact but no reason to pick it).
- `hashlib.sha512` (SHA-2, 512-bit output): often faster than SHA-256 on 64-bit CPUs that lack dedicated SHA-256 instructions; rarely needed (SHA-256's 128-bit collision resistance is already overkill for most purposes).
- `hashlib.sha3_256` (SHA-3, 256-bit output): different construction (Keccak sponge); use if you need diversity for paranoid reasons. SHA-256 is the safe default.
- `hashlib.blake2b` and `blake2s`: fast modern alternatives, considered as secure as SHA-256. Use if you need speed and SHA-2 is a bottleneck (rare).
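All of these share the same interface, so switching algorithms is a one-word change. The sketch below uses `hashlib.new(name)` to build each hasher by name and prints its output size with a prefix of its digest; the list of names is just the set discussed above.

```python
import hashlib

ALGORITHMS = ("sha256", "sha512", "sha3_256", "blake2b", "blake2s")

# Output size in bytes for each algorithm (default parameters).
sizes = {name: hashlib.new(name).digest_size for name in ALGORITHMS}

for name in ALGORITHMS:
    h = hashlib.new(name, b"hello world")
    print(f"{name:<9} {sizes[name] * 8:>3} bits  {h.hexdigest()[:16]}...")
```

Because the digests differ in length and value, a manifest should record which algorithm produced it if there is any chance the default will change later.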
Labs (~90 minutes)
Lab 11: Directory Integrity Checker (labs/lab-11-integrity.md)
- Goal: build a CLI tool that computes SHA-256 for every file in a directory tree; saves to a manifest; can verify a tree against a saved manifest
- Time: ~90 minutes
- Artifact: `lab-11-checker.py` in `~/fnd-102/lab-11/`, committed to Git
Independent practice (~4 hours)
- `sha256sum` comparison (30 min). On macOS or Linux: run `sha256sum myfile` (or `shasum -a 256 myfile` if `sha256sum` is not installed) and compare to your Python `hashlib.sha256` output. They MUST match. If they don't, the most likely cause is that you opened the file in text mode (you should be in binary mode).
- Hash a directory (45 min). Extend the file-hash idea: combine hashes of all files in a directory to produce a single "directory hash." Standard pattern:

      def hash_directory(path):
          h = hashlib.sha256()
          for p in sorted(Path(path).rglob('*')):
              if p.is_file():
                  h.update(str(p.relative_to(path)).encode('utf-8'))
                  h.update(b'\0')
                  h.update(hash_file(p).encode('ascii'))
                  h.update(b'\0')
          return h.hexdigest()

  What makes the order matter here? Why the `b'\0'` separator?
- Hash collision search (30 min). SHA-256 collisions are infeasible. But for a SMALLER hash (truncated to 16 bits), they're easy to find:

      import hashlib
      import itertools
      import string

      def hash16(s):
          return hashlib.sha256(s.encode()).hexdigest()[:4]  # 16 bits as hex

      def find_collision():
          seen = {}
          for length in range(1, 8):
              for combo in itertools.product(string.ascii_lowercase, repeat=length):
                  s = ''.join(combo)
                  h = hash16(s)
                  if h in seen:
                      return s, seen[h], h  # stop at the first collision
                  seen[h] = s

      s, earlier, h = find_collision()
      print(f'collision: {s!r} and {earlier!r} both hash to {h}')

  Run and observe. By how many strings do you expect a collision? (Birthday paradox: ~2^8 = 256 strings as an order of magnitude; the expected value is closer to 320.)
- Read a real Tripwire policy (30 min). Read the documentation on the AIDE intrusion detection system (https://aide.github.io/) or the older Tripwire. Notice the manifest is signed AND stored off-host. Your Lab 11 manifest is not signed; what attack would compromise it?
- Hash a stream from stdin (30 min). Write a script that reads from stdin and prints the SHA-256 hash of the input. Usage: `cat myfile | python3 myhash.py`. Useful when piping data through a pipeline.
- Use `hashlib.blake2b` keyed mode (45 min, optional stretch). BLAKE2 supports keyed hashing (effectively a built-in replacement for HMAC):

      import hashlib

      h = hashlib.blake2b(key=b'my-secret-key')
      h.update(b'hello')
      print(h.hexdigest())

  This is the simplest way to authenticate (not just integrity-check) a piece of data without rolling your own HMAC. Forward-pointer to SEC-101.
Reflection prompts (~30 minutes)
- SHA-256 is collision-resistant; SHA-1 is not. What does this mean concretely for someone considering whether to upgrade existing SHA-1-based integrity checks?
- Your Lab 11 stores the manifest in the same directory as the files it checks. What attack would compromise this? How would you defend against it?
- Hashing detects modification; it does NOT authenticate the source. Give an example where you'd care about authentication beyond integrity.
- The chunked-read pattern (`while chunk := f.read(...)`) lets you hash files larger than memory. Did your Lab 11 use it? What is the largest file your laptop has where this matters?
- One thing from this week you want to know more about?
Tool journal (week 11)
- `hashlib.sha256`: the cryptographic hash function default
- `h.update()`, `h.hexdigest()`, `h.digest()`: incremental hashing
- Walrus operator `:=`: for chunked-read loops
- `pathlib.Path.rglob('*')`: recursive directory walk
- `'rb'` mode: always for hashing (no encoding translation)
- `hmac.compare_digest`: constant-time comparison (timing-attack-safe)
- Manifest formats: `sha256sum`-style plain text, JSON, CSV
- `hashlib.blake2b`, `hashlib.sha3_256`: alternative hash functions
- Integrity vs authenticity vs encryption: three different properties
What comes next
Week 12 introduces HTTP via the requests library. Lab 12 is a weather-report CLI that reads a free weather API. The first time the course goes to the network; the first time you install a non-stdlib package; the first time you handle JSON from an external source.