Week 11: Hashing and Integrity · FND-102 · Virtus Cyber Academy Classroom

Cryptographic hashes as a tool for detecting change. The lab is a directory-integrity checker: compute SHA-256 for every file in a tree, save the manifest, later detect modifications.

Theme

A cryptographic hash function takes any input (a file, a string, a directory tree) and produces a fixed-size fingerprint. The fingerprint has two useful properties: (1) the same input always produces the same fingerprint; (2) a tiny change to the input produces a wildly different fingerprint. These two properties make hashes the standard tool for integrity checking: did this file change since I last saw it?
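
Both properties are easy to observe directly. The sketch below hashes the same input twice, then an input that differs in a single character, and counts how many of the 256 output bits differ. The helper name differing_bits is illustrative only; it is not part of hashlib.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

first = hashlib.sha256(b"hello world").digest()
again = hashlib.sha256(b"hello world").digest()
changed = hashlib.sha256(b"hello worle").digest()  # last character altered

print(first == again)                  # property 1: same input, same fingerprint
print(differing_bits(first, changed))  # property 2: typically near half of the 256 bits
```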

SHA-256 is the canonical modern hash function: 256 bits of output, no known practical collisions, fast on every CPU. The Python stdlib's hashlib.sha256 is what you reach for. Other hash functions exist (SHA-1, MD5, BLAKE2, SHA-3); SHA-256 is the safe default.

The lab is a directory-integrity checker. Given a directory path, the tool walks the tree, computes SHA-256 for every file, and writes the (path, hash) pairs to a manifest file. Running the tool a second time with --verify compares current hashes against the manifest and reports any files that changed, were added, or were deleted. This is structurally what intrusion-detection systems (AIDE, Tripwire) and backup-verification tools do.

This week introduces hashing at the practical level (use hashlib, compare digests, report differences). The cryptographic theory (collision resistance, birthday attacks, why SHA-1 is deprecated) is SEC-101 / CSA-201 material; FND-102 stops at "this is how you use it correctly."

By the end of week 11 you can: compute a SHA-256 hash with hashlib; chunk-read large files to hash them without loading into memory; pick between hexdigest (human) and digest (bytes); recognize the integrity-vs-encryption-vs-signing distinction (hashing alone does not authenticate; you also need a signature for that).

Reading list (~1 hour)

Matthes does not cover hashlib. Use Sweigart.
Sweigart, Cracking Codes with Python Ch 1 ("Making Paper Cryptography Tools") at the academy library copy. ISBN 978-1-59327-822-9; Calibre id 243. The first chapter sets cryptographic context without requiring math; useful framing even though SHA-256 is not a cipher.
Python hashlib module docs at https://docs.python.org/3/library/hashlib.html. ~15 min read. Read at least the introduction and the "Hash algorithms" section.
Real Python: "Python's hash() Function and the hashlib Module" at https://realpython.com/python-hash-table/ (covers hash() for dict keys) and https://realpython.com/python-hashlib/ (covers hashlib for cryptographic hashing). ~25 min total.
NIST SHA-256 spec (FIPS 180-4) at https://csrc.nist.gov/publications/detail/fips/180/4/final. Reference only; do not read end-to-end. Useful to know it exists.

Lecture outline (~1.5 hours, 2 sessions of ~50 min)

Session 1: Hashing basics

Section 1.1: What a hash function is

A hash function maps inputs of any size to outputs of fixed size.
A cryptographic hash function additionally satisfies three security properties:
- Pre-image resistance: given a hash, it is infeasible to find an input that produces it
- Second pre-image resistance: given an input, it is infeasible to find a different input with the same hash
- Collision resistance: it is infeasible to find any two distinct inputs with the same hash
SHA-256 (Secure Hash Algorithm, 256-bit output) satisfies all three. SHA-1 still has no known practical pre-image or second pre-image attack, but its collision resistance was broken in practice in 2017 (the SHAttered attack: https://shattered.io/). MD5's collision resistance is thoroughly broken and its pre-image resistance is theoretically weakened; do not use MD5 for any new code.
"Infeasible" means computationally out of reach, not impossible. For SHA-256, finding a collision takes ~2^128 operations (the birthday bound for a 256-bit output), which is far beyond current computing power.

Section 1.2: SHA-256 in Python

The basic pattern:

import hashlib
h = hashlib.sha256()
h.update(b'hello world')
print(h.hexdigest())   # 'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'

The hash object accumulates input via update(). Call multiple times to hash content in chunks:

h = hashlib.sha256()
h.update(b'hello ')
h.update(b'world')
h.hexdigest()   # same as above; the function does not care about chunking

hexdigest() returns a 64-character hex string (256 bits / 4 bits per hex char = 64 chars).
digest() returns 32 bytes (the raw output). Use hex for human display; use raw bytes for storage or further computation.
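
As a quick check of the two output forms, the following sketch shows that hexdigest() is simply a hex encoding of the raw digest() bytes, and that the conversion is reversible. The variable names are illustrative.

```python
import hashlib

h = hashlib.sha256(b"hello world")
raw = h.digest()       # bytes object, 32 bytes long
text = h.hexdigest()   # str, 64 hex characters

print(len(raw), len(text))          # 32 64
print(raw.hex() == text)            # True: hex is an encoding of the raw bytes
print(bytes.fromhex(text) == raw)   # True: the conversion goes both ways
```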

Section 1.3: Hashing a file

For files small enough to fit in memory:

with open('myfile', 'rb') as f:
    data = f.read()
print(hashlib.sha256(data).hexdigest())

For larger files, chunk-read:

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

The chunk := f.read(...) is the walrus operator (Python 3.8+). Assign-and-test: keep reading until you get an empty bytes object (EOF).
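'rb' mode is critical: binary mode. Text mode decodes the bytes to str (which update() rejects) and translates line endings on read, so re-encoded text can hash differently from the bytes on disk.
The chunk size 65536 (64 KiB) is a common choice (a multiple of the typical 4096-byte OS page size); 4096 to 1048576 all work fine.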
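
To see why binary mode matters, the sketch below writes a small temporary file with CRLF line endings and hashes it two ways: once from the raw bytes, and once after a text-mode read. The file contents and variable names are made up for the demonstration.

```python
import hashlib
import os
import tempfile

# Write a small file with Windows-style (CRLF) line endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line one\r\nline two\r\n")
    path = tmp.name

# Binary mode: hash exactly the bytes stored on disk.
with open(path, "rb") as f:
    binary_hash = hashlib.sha256(f.read()).hexdigest()

# Text mode: decodes to str and, by default, turns "\r\n" into "\n" on read.
with open(path, "r", encoding="utf-8") as f:
    text_hash = hashlib.sha256(f.read().encode("utf-8")).hexdigest()

os.remove(path)
print(binary_hash == text_hash)   # False: the text-mode read changed the content
```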

Section 1.4: Verifying a file against a known hash

The pattern for "did this file change?":

expected = 'b94d27b9934d...'
actual = hash_file('myfile')
if actual != expected:
    print('FILE CHANGED')

Use hmac.compare_digest(a, b) instead of == for constant-time comparison (defends against timing attacks); the function lives in the hmac module, not in hashlib. For integrity checking of file contents this is rarely needed; for HMAC verification it matters.
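
A minimal sketch of the constant-time variant, using the stdlib function hmac.compare_digest. The helper name matches is hypothetical, and the expected digest is the SHA-256 of b'hello world' from Section 1.2.

```python
import hashlib
import hmac

def matches(expected_hex: str, data: bytes) -> bool:
    """Return True if SHA-256(data) equals expected_hex, compared in constant time."""
    actual_hex = hashlib.sha256(data).hexdigest()
    # hexdigest() is lowercase, so normalize the expected value before comparing.
    return hmac.compare_digest(actual_hex, expected_hex.lower())

expected = "b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9"
print(matches(expected, b"hello world"))   # True
print(matches(expected, b"hello w0rld"))   # False
```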

Session 2: Manifests, the integrity-vs-authenticity distinction

Section 2.1: A manifest file

A manifest is "the list of expected hashes for a set of files." Typical formats:
- Plain text: one line per file, format <hexdigest> <path> (the convention sha256sum uses)
- JSON: { "files": { "path1": "hash1", "path2": "hash2" } }
- CSV: path,hash with a header row

The standard Unix sha256sum utility produces and verifies the plain-text format:

sha256sum file1 file2 > manifest.txt
sha256sum -c manifest.txt

Your Lab 11 reimplements this in Python. The Python version has advantages: cross-platform, can walk directories recursively, can output JSON for downstream tools.
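
One possible way to write and read the plain-text format in Python is sketched below. The function names are hypothetical. The sketch assumes the common sha256sum layout (digest, a space, a one-character mode marker that is either a space or '*', then the path) and ignores the escaping that sha256sum applies to unusual file names.

```python
def format_line(hexdigest: str, path: str) -> str:
    """Render one manifest line as: <hexdigest>, two spaces, <path>."""
    return f"{hexdigest}  {path}"

def parse_line(line: str) -> tuple[str, str]:
    """Split one manifest line into (path, hexdigest)."""
    hexdigest, _, rest = line.rstrip("\n").partition(" ")
    # rest starts with the mode marker (' ' or '*'); the path follows it.
    return rest[1:], hexdigest

line = format_line("ab" * 32, "docs/my notes.txt")
print(parse_line(line))
```

Splitting on the first space only means paths that themselves contain spaces survive the round trip.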

Section 2.2: Walking a directory

The stdlib pathlib.Path walks a tree:

from pathlib import Path
for p in Path('mydir').rglob('*'):
    if p.is_file():
        print(p, hash_file(p))

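rglob('*') is a recursive glob; it matches every name in every subdirectory.
Filter to files only with p.is_file(); directories should not be hashed as files. Note that is_file() follows symbolic links, so a link to a regular file passes the test; add a p.is_symlink() check if links should be skipped.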
Sort the output (paths or hashes) for deterministic ordering across runs.
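
Putting the walk, the filter, and the sort together, a manifest builder might look like the sketch below. The name build_manifest is hypothetical, skipping symlinks is one reasonable policy rather than a requirement, and hash_file is repeated from Section 1.3 so the block is self-contained.

```python
import hashlib
from pathlib import Path

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each regular file under `root` (as a relative POSIX path) to its digest."""
    root = Path(root)
    manifest = {}
    for p in sorted(root.rglob("*")):        # sorted for deterministic ordering
        if p.is_symlink() or not p.is_file():
            continue                         # skip links, directories, and other non-files
        manifest[p.relative_to(root).as_posix()] = hash_file(p)
    return manifest
```

Storing relative paths with forward slashes keeps the manifest comparable if the tree is moved or checked on another operating system.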

Section 2.3: Detecting changes

Three change categories:
- Modified: path exists in both manifests; hashes differ
- Added: path exists in current but not in manifest
- Deleted: path exists in manifest but not in current

Compute as set operations:

manifest = {'a.txt': 'hash1', 'b.txt': 'hash2'}
current = {'a.txt': 'hash1_NEW', 'c.txt': 'hash3'}
modified = {p for p in manifest if p in current and manifest[p] != current[p]}  # {'a.txt'}
added = current.keys() - manifest.keys()  # {'c.txt'}
deleted = manifest.keys() - current.keys()  # {'b.txt'}

Set operations are fast even on millions of files.
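
The three categories can be wrapped in a single helper. This is a sketch only; the function name compare_manifests and the choice to return sorted lists are assumptions, not part of the lab specification.

```python
def compare_manifests(manifest, current):
    """Return (modified, added, deleted) as sorted lists of paths."""
    common = manifest.keys() & current.keys()   # paths present in both
    modified = sorted(p for p in common if manifest[p] != current[p])
    added = sorted(current.keys() - manifest.keys())
    deleted = sorted(manifest.keys() - current.keys())
    return modified, added, deleted

manifest = {"a.txt": "hash1", "b.txt": "hash2"}
current = {"a.txt": "hash1_NEW", "c.txt": "hash3"}
print(compare_manifests(manifest, current))   # (['a.txt'], ['c.txt'], ['b.txt'])
```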

Section 2.4: Integrity vs authenticity vs encryption

Integrity: "this file has not been modified." A hash detects integrity violations.
Authenticity: "this file came from a specific sender." Requires a keyed check: a MAC (HMAC with a shared secret) or a digital signature (asymmetric crypto). A bare hash is NOT authentication: an attacker who can modify the file can also modify the manifest.
Encryption: "this file's contents are hidden." A hash does NOT encrypt; it produces a fingerprint, not a reversible transformation.
For Lab 11 (your own directory, your own manifest on the same disk), bare hashing is enough; the attacker model is "did I accidentally modify a file?", not "is an adversary tampering with my files?". For real intrusion detection (SEC-101), you also store the manifest on a write-protected or off-host medium so a tamperer cannot edit both at once.
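
To make the integrity-vs-authenticity gap concrete, the sketch below protects a manifest with an HMAC from the stdlib hmac module. The function names, the key, and the manifest contents are all made up for illustration, and key storage is out of scope here. An attacker who edits the manifest but does not know the key cannot produce a matching tag.

```python
import hashlib
import hmac

def tag_manifest(manifest_bytes: bytes, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over the manifest contents."""
    return hmac.new(key, manifest_bytes, hashlib.sha256).hexdigest()

def verify_manifest(manifest_bytes: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time."""
    return hmac.compare_digest(tag_manifest(manifest_bytes, key), tag)

key = b"example-shared-secret"               # hypothetical key, for the demo only
original = b"hash1  a.txt\nhash2  b.txt\n"   # made-up manifest contents
tag = tag_manifest(original, key)

tampered = b"hashX  a.txt\nhash2  b.txt\n"
print(verify_manifest(original, key, tag))   # True
print(verify_manifest(tampered, key, tag))   # False
```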

Section 2.5: Other hash functions

hashlib.sha1 exists; do NOT use for new code (collision-broken).
hashlib.md5 exists; do NOT use for new code (collision-broken; pre-image still mostly intact but no reason to pick it).
hashlib.sha512 (SHA-2, 512-bit output): often faster than SHA-256 on 64-bit CPUs that lack SHA hardware instructions; rarely needed (SHA-256's 128-bit collision resistance is already overkill for most purposes).
hashlib.sha3_256 (SHA-3, 256-bit output): different construction (Keccak sponge); use if you need diversity for paranoid reasons. SHA-256 is the safe default.
hashlib.blake2b and blake2s: fast modern alternatives with no known practical attacks. Use if you need speed and SHA-2 is a bottleneck (rare).
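
The alternatives share the same interface, so switching is mostly a matter of changing the name. As a small sketch, hashlib.new() selects an algorithm by name and digest_size reports the output length in bytes. The loop covers only the recommended functions from the list above.

```python
import hashlib

sizes = {}
for name in ("sha256", "sha512", "sha3_256", "blake2b", "blake2s"):
    h = hashlib.new(name, b"hello world")   # same update()/hexdigest() API for all
    sizes[name] = h.digest_size * 8         # output length in bits
    print(f"{name:9} {sizes[name]:4} bits")
```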

Labs (~90 minutes)

Lab 11: Directory Integrity Checker (labs/lab-11-integrity.md)

Goal: build a CLI tool that computes SHA-256 for every file in a directory tree; saves to a manifest; can verify a tree against a saved manifest
Time: ~90 minutes
Artifact: lab-11-checker.py in ~/fnd-102/lab-11/, committed to Git

Independent practice (~4 hours)

sha256sum comparison (30 min). On Linux, run sha256sum myfile; on macOS, shasum -a 256 myfile works if sha256sum is not installed. Compare the result to your Python hashlib.sha256 output. They MUST match. If they don't, the most likely cause is that you read the file in text mode (you should be in binary mode).

Hash a directory (45 min). Extend the file-hash idea: combine hashes of all files in a directory to produce a single "directory hash." Standard pattern:

def hash_directory(path):
    h = hashlib.sha256()
    for p in sorted(Path(path).rglob('*')):
        if p.is_file():
            h.update(str(p.relative_to(path)).encode('utf-8'))
            h.update(b'\0')
            h.update(hash_file(p).encode('ascii'))
            h.update(b'\0')
    return h.hexdigest()

What makes the order matter here? Why the b'\0' separator?

Hash collision search (30 min). SHA-256 collisions are infeasible. But for a SMALLER hash (truncated to 16 bits), they're easy to find:

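import hashlib
import itertools
import string

def hash16(s):
    """Return the first 16 bits of SHA-256(s) as 4 hex characters."""
    return hashlib.sha256(s.encode()).hexdigest()[:4]

def find_collision():
    """Return (new, earlier, hash, tries) for the first 16-bit collision found."""
    seen = {}  # truncated hash -> first string that produced it
    for length in range(1, 8):
        for combo in itertools.product(string.ascii_lowercase, repeat=length):
            s = ''.join(combo)
            h = hash16(s)
            if h in seen:
                # return (not break) so the outer loop stops too
                return s, seen[h], h, len(seen) + 1
            seen[h] = s

s, earlier, h, tries = find_collision()
print(f'collision after {tries} strings: {s!r} and {earlier!r} both hash to {h}')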

Run and observe. After how many strings do you expect a collision? (Birthday paradox: on the order of 2^8 = 256 strings for a 16-bit hash; the expected value is about 1.25 x 2^8, roughly 320.)
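
The birthday estimate can be checked numerically. The sketch below uses the standard approximations for N equally likely outputs: the expected number of draws before the first collision is about sqrt(pi * N / 2), and the chance of at least one collision after n draws is about 1 - exp(-n(n-1) / (2N)). The function name is illustrative.

```python
import math

N = 2**16   # number of possible 16-bit hash values

# Approximate expected number of strings hashed before the first collision.
expected_draws = math.sqrt(math.pi * N / 2)

def collision_probability(n: int, space: int = N) -> float:
    """Approximate chance of at least one collision among n random hashes."""
    return 1 - math.exp(-n * (n - 1) / (2 * space))

print(round(expected_draws))                  # 321
print(round(collision_probability(256), 2))   # 0.39
```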

Read a real Tripwire policy (30 min). Read the documentation on the AIDE intrusion detection system (https://aide.github.io/) or the older Tripwire. Notice the manifest is signed AND stored off-host. Your Lab 11 manifest is not signed; what attack would compromise it?
Hash a stream from stdin (30 min). Write a script that reads from stdin and prints the SHA-256 hash of the input. Usage: cat myfile | python3 myhash.py. Useful when piping data through a pipeline.

Use hashlib.blake2b keyed mode (45 min, optional stretch). BLAKE2 supports keyed hashing (a built-in MAC that can stand in for HMAC):

import hashlib
h = hashlib.blake2b(key=b'my-secret-key')
h.update(b'hello')
print(h.hexdigest())

This is the simplest way to authenticate (not just integrity-check) a piece of data without rolling your own HMAC. Forward-pointer to SEC-101.
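
A minimal sketch of the verification side, assuming both parties share the key: recompute the keyed digest and compare it in constant time with hmac.compare_digest. The function names and the key are hypothetical. BLAKE2b keys may be at most 64 bytes.

```python
import hashlib
import hmac

def make_tag(message: bytes, key: bytes) -> bytes:
    """Return the keyed BLAKE2b digest of message (raw bytes)."""
    return hashlib.blake2b(message, key=key).digest()

def check_tag(message: bytes, key: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(message, key), tag)

key = b"my-secret-key"                        # demo key only
tag = make_tag(b"hello", key)
print(check_tag(b"hello", key, tag))          # True
print(check_tag(b"hellp", key, tag))          # False: message changed
print(check_tag(b"hello", b"other-key", tag)) # False: wrong key
```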

Reflection prompts (~30 minutes)

SHA-256 is collision-resistant; SHA-1 is not. What does this mean concretely for someone considering whether to upgrade existing SHA-1-based integrity checks?
Your Lab 11 stores the manifest in the same directory as the files it checks. What attack would compromise this? How would you defend against it?
Hashing detects modification; it does NOT authenticate the source. Give an example where you'd care about authentication beyond integrity.
The chunked-read pattern (while chunk := f.read(...)) lets you hash files larger than memory. Did your Lab 11 use it? What is the largest file your laptop has where this matters?
One thing from this week you want to know more about?

Tool journal (week 11)

hashlib.sha256: the cryptographic hash function default
h.update(), h.hexdigest(), h.digest(): incremental hashing
Walrus operator :=: for chunked-read loops
pathlib.Path.rglob('*'): recursive directory walk
'rb' mode: always for hashing (no encoding translation)
hmac.compare_digest: constant-time comparison (timing-attack-safe)
Manifest formats: sha256sum-style plain text, JSON, CSV
hashlib.blake2b, hashlib.sha3_256: alternative hash functions
Integrity vs authenticity vs encryption: three different properties

What comes next

Week 12 introduces HTTP via the requests library. Lab 12 is a weather-report CLI that reads a free weather API. The first time the course goes to the network; the first time you install a non-stdlib package; the first time you handle JSON from an external source.