Lab 7: Regex IP Extractor

~90 minutes. Read a network log file; extract every IPv4 and IPv6 address; emit a sorted unique list. Practice re.findall, character classes, alternation, and the ipaddress stdlib validator.


Goal: ship lab-7-extract.py that scans a log file and produces a sorted unique list of IP addresses found.

Estimated time: 90 minutes

Prerequisites: Week 7 lecture (regex). Lab 6 complete and working.


Setup

mkdir -p ~/fnd-102/lab-7
cd ~/fnd-102/lab-7

Create a sample network.log (use the generator below or write one by hand with 30-100 lines):

# gen-network-log.py
import random
random.seed(42)

ipv4s = ['192.168.1.10', '10.0.0.5', '172.16.0.100', '8.8.8.8', '1.1.1.1', '203.0.113.42']
ipv6s = ['fe80::1', '2001:db8::1', '::1', '2001:db8:85a3::8a2e:370:7334']
fakes = ['999.999.999.999', '256.256.256.256', '300.1.1.1', 'foo:bar:baz', 'gggg::1']

lines = []
for i in range(200):
    # Spread entries across minutes so the seconds field stays within 00-59.
    ts = f'2026-05-28T12:{i // 60:02d}:{i % 60:02d}'
    r = random.random()
    if r < 0.5:
        ip = random.choice(ipv4s)
        port = random.choice([80, 443, 22, 8080])
        lines.append(f'{ts} INFO request from {ip}:{port}')
    elif r < 0.8:
        ip = random.choice(ipv6s)
        lines.append(f'{ts} INFO request from [{ip}]:443')
    elif r < 0.95:
        lines.append(f'{ts} INFO unrelated line, no IP here')
    else:
        # garbage that looks like an IP
        ip = random.choice(fakes)
        lines.append(f'{ts} WARN parse failed: {ip}')

with open('network.log', 'w') as f:
    for line in lines:
        f.write(line + '\n')

print(f'Wrote network.log with {len(lines)} lines.')

Run it:

python3 gen-network-log.py
head network.log

Part A: IPv4 extraction (25 min)

A rough IPv4 pattern: r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'

import re
from pathlib import Path

IPV4_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def extract_ipv4(text):
    """Return a list of IPv4-shaped substrings found in `text`."""
    return IPV4_RE.findall(text)

def main():
    log = Path('network.log')
    with open(log, encoding='utf-8') as f:
        text = f.read()
    found = extract_ipv4(text)
    print(f'Found {len(found)} IPv4-shaped strings.')
    print('Unique:')
    for ip in sorted(set(found)):
        print(f'  {ip}')

if __name__ == '__main__':
    main()

Run it. You should see the real IPv4s from your generator PLUS the fakes (999.999.999.999, 256.256.256.256, 300.1.1.1). That is because the regex matches "shape," not "valid value."

Two observations:

  1. \b (word boundary) lets the pattern pull 192.168.1.10 cleanly out of 192.168.1.10:80: the : is a non-word character, so a boundary exists between the final 0 and the :. Without \b, a longer digit run such as 1234.5.6.7890 would yield the partial match 234.5.6.789.
  2. re.compile builds the pattern object once so it can be reused by name. Python's re module also caches recently used patterns internally, so the speed gain is usually modest; the main benefit is a single named pattern shared by every function that needs it.
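
To see the first observation in action, here is a small sketch that runs the same pattern with and without the \b anchors. The two sample strings are made up for illustration and do not come from network.log.

```python
import re

# Same pattern as IPV4_RE above, plus a variant with the \b anchors removed.
WITH_BOUNDARY = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
WITHOUT_BOUNDARY = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

# Hypothetical sample strings (assumptions for this demo).
with_port = 'request from 192.168.1.10:80'
long_digits = 'id 1234.5.6.7890 seen'

port_match = WITH_BOUNDARY.findall(with_port)
bounded = WITH_BOUNDARY.findall(long_digits)
unbounded = WITHOUT_BOUNDARY.findall(long_digits)

print(port_match)  # ['192.168.1.10']  (the ':' ends the match cleanly)
print(bounded)     # []                (no boundary inside the digit runs)
print(unbounded)   # ['234.5.6.789']   (partial match carved out of longer runs)
```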

Part B: Validate with ipaddress (15 min)

Filter out the fakes by using the ipaddress stdlib module to validate each match:

import ipaddress

def is_valid_ipv4(s):
    """Return True if s is a valid IPv4 address (octets 0-255)."""
    try:
        ipaddress.IPv4Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

def extract_valid_ipv4(text):
    """Return IPv4-shaped strings that are also valid IPv4 addresses."""
    candidates = IPV4_RE.findall(text)
    return [ip for ip in candidates if is_valid_ipv4(ip)]

The pattern is "regex for shape, library for validation." This is faster to write than a regex that enforces octet ranges (the strict-octet regex is r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}' and it is brittle as well as ugly).

Run with the validator. You should now see only the real IPv4s; 999.999.999.999 and friends are filtered.
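
A quick way to check both halves of the pattern without the log file is to feed it a few hand-written lines. The sketch below also shows one way the strict-octet regex can be brittle: used without anchors, it quietly carves a valid-looking address out of an invalid one. The sample text and the STRICT_UNANCHORED name are assumptions made for this demo.

```python
import ipaddress
import re

IPV4_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# The strict-octet regex quoted above, used here without any \b anchors.
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
STRICT_UNANCHORED = re.compile(OCTET + r'(?:\.' + OCTET + r'){3}')

def is_valid_ipv4(s):
    """Return True if s is a valid IPv4 address (octets 0-255)."""
    try:
        ipaddress.IPv4Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

# Hypothetical log lines in the same style as the generator output.
sample = (
    'INFO request from 10.0.0.5:443\n'
    'WARN parse failed: 300.1.1.1\n'
    'WARN parse failed: 256.256.256.256\n'
    'INFO request from 8.8.8.8:80\n'
)

shape_matches = IPV4_RE.findall(sample)
valid = [ip for ip in shape_matches if is_valid_ipv4(ip)]
print(shape_matches)  # ['10.0.0.5', '300.1.1.1', '256.256.256.256', '8.8.8.8']
print(valid)          # ['10.0.0.5', '8.8.8.8']

# Without anchors, the strict regex skips the leading '30' and matches the rest.
strict_hit = STRICT_UNANCHORED.findall('WARN parse failed: 300.1.1.1')
print(strict_hit)     # ['0.1.1.1']
```

The shape-plus-validator version rejects 300.1.1.1 outright, while the unanchored strict regex reports an address that never appeared in the log.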


Part C: IPv6 extraction (20 min)

IPv6 is harder. The full grammar allows for compressed forms (::1, fe80::1), expanded forms (2001:db8:0:0:0:0:0:1), and bracketed-with-port ([fe80::1]:443). A pragmatic regex that catches most forms:

# This is intentionally permissive; the ipaddress validator does the real filtering.
IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')

def is_valid_ipv6(s):
    try:
        ipaddress.IPv6Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

def extract_valid_ipv6(text):
    candidates = IPV6_RE.findall(text)
    return [ip for ip in candidates if is_valid_ipv6(ip)]

The lookarounds (?<![\w:]) and (?![\w:]) are lookbehind and lookahead assertions: "not preceded by a word character or colon" and "not followed by a word character or colon." They prevent matching IPv6 fragments inside larger strings.
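
The effect of the lookarounds is easiest to see on a line that contains a timestamp. The sketch below compares IPV6_RE with a variant that drops both assertions; the NO_LOOKAROUND name and the sample line are assumptions for illustration.

```python
import re

IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')
# Same core pattern with the lookbehind and lookahead removed.
NO_LOOKAROUND = re.compile(r'([0-9a-fA-F:]+:[0-9a-fA-F:]+)')

# Hypothetical line in the style of the generator output.
line = '2026-05-28T12:00:07 INFO request from [fe80::1]:443'

guarded = IPV6_RE.findall(line)
unguarded = NO_LOOKAROUND.findall(line)

print(guarded)    # ['fe80::1']
print(unguarded)  # ['12:00:07', 'fe80::1']
```

The time 12:00:07 sits directly after the letter T, so the lookbehind rejects it. The validator would also reject it later, but the lookarounds keep the candidate list shorter.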

The "permissive regex + library validator" pattern is the same as IPv4. Trying to enforce IPv6 grammar in pure regex is a project; the ipaddress module already does it correctly.


Part D: Combine and emit sorted unique (15 min)

Final main():

def main():
    log = Path('network.log')
    with open(log, encoding='utf-8') as f:
        text = f.read()

    ipv4s = set(extract_valid_ipv4(text))
    ipv6s = set(extract_valid_ipv6(text))

    print(f'Found {len(ipv4s)} unique IPv4 addresses:')
    for ip in sorted(ipv4s, key=lambda s: ipaddress.IPv4Address(s)):
        print(f'  {ip}')

    print(f'Found {len(ipv6s)} unique IPv6 addresses:')
    for ip in sorted(ipv6s, key=lambda s: ipaddress.IPv6Address(s)):
        print(f'  {ip}')

The key=lambda s: ipaddress.IPv4Address(s) sorts IPs in numeric (not lexicographic) order. '10.0.0.5' and '9.0.0.5' would sort wrong as strings; the IPv4Address class sorts them correctly because the class implements __lt__.
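
A short comparison makes the difference concrete. The address list below is a made-up sample; passing the class itself as the key is equivalent to the lambda used in main().

```python
import ipaddress

# Hypothetical sample addresses chosen to expose the string-sort problem.
addrs = ['10.0.0.5', '9.0.0.5', '192.168.1.10', '8.8.8.8']

as_strings = sorted(addrs)
as_addresses = sorted(addrs, key=ipaddress.IPv4Address)

print(as_strings)    # ['10.0.0.5', '192.168.1.10', '8.8.8.8', '9.0.0.5']
print(as_addresses)  # ['8.8.8.8', '9.0.0.5', '10.0.0.5', '192.168.1.10']
```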

Test:

python3 lab-7-extract.py

Expected output: a sorted unique list of the real IPs from your generator, no fakes.


Part E: Add CLI arguments (10 min)

Polish the tool with argparse:

import argparse
import sys

def build_parser():
    parser = argparse.ArgumentParser(description='Extract IPv4 and IPv6 addresses from a log file.')
    parser.add_argument('input', type=Path, help='path to the log file')
    parser.add_argument('--ipv4-only', action='store_true', help='only extract IPv4 addresses')
    parser.add_argument('--ipv6-only', action='store_true', help='only extract IPv6 addresses')
    parser.add_argument('--no-validate', action='store_true', help='do not validate; report shape matches')
    return parser

def main():
    args = build_parser().parse_args()
    if not args.input.exists():
        print(f'Error: {args.input} does not exist', file=sys.stderr)
        sys.exit(1)
    with open(args.input, encoding='utf-8') as f:
        text = f.read()

    if not args.ipv6_only:
        if args.no_validate:
            ipv4s = set(IPV4_RE.findall(text))
        else:
            ipv4s = set(extract_valid_ipv4(text))
        print(f'IPv4 ({len(ipv4s)}):')
        for ip in sorted(ipv4s, key=lambda s: ipaddress.ip_address(s) if not args.no_validate else s):
            print(f'  {ip}')

    if not args.ipv4_only:
        if args.no_validate:
            ipv6s = set(IPV6_RE.findall(text))
        else:
            ipv6s = set(extract_valid_ipv6(text))
        print(f'IPv6 ({len(ipv6s)}):')
        for ip in sorted(ipv6s, key=lambda s: ipaddress.ip_address(s) if not args.no_validate else s):
            print(f'  {ip}')

Test all four call paths:

python3 lab-7-extract.py network.log
python3 lab-7-extract.py network.log --ipv4-only
python3 lab-7-extract.py network.log --ipv6-only
python3 lab-7-extract.py network.log --no-validate
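
One call path the four tests do not cover is passing --ipv4-only and --ipv6-only together. As written, main() would skip both branches and print nothing. A possible refinement, sketched here as an optional variant rather than a required change, is to let argparse reject that combination with add_mutually_exclusive_group.

```python
import argparse
from pathlib import Path

def build_parser():
    parser = argparse.ArgumentParser(
        description='Extract IPv4 and IPv6 addresses from a log file.')
    parser.add_argument('input', type=Path, help='path to the log file')
    # argparse exits with a usage error if both flags in the group are given.
    family = parser.add_mutually_exclusive_group()
    family.add_argument('--ipv4-only', action='store_true',
                        help='only extract IPv4 addresses')
    family.add_argument('--ipv6-only', action='store_true',
                        help='only extract IPv6 addresses')
    parser.add_argument('--no-validate', action='store_true',
                        help='do not validate; report shape matches')
    return parser

# Passing an explicit list lets you exercise the parser without a shell.
args = build_parser().parse_args(['network.log', '--ipv4-only'])
print(args.ipv4_only, args.ipv6_only)  # True False
```

With this variant, running the script with both flags prints a usage message and exits with status 2 instead of silently producing no output.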

Part F: Commit your work (10 min)

cd ~/fnd-102/lab-7
git add gen-network-log.py lab-7-extract.py network.log
git commit -m "lab-7: regex IP extractor with shape+validate pattern, ipv4/ipv6 split, argparse"

Expected output / artifact

lab-7-extract.py should:

  • Use re.compile with the r'...' raw-string prefix
  • Extract IPv4-shaped substrings and validate with ipaddress.IPv4Address
  • Extract IPv6-shaped substrings and validate with ipaddress.IPv6Address
  • Emit sorted unique lists, sorted in numeric (not lexicographic) order
  • Support --ipv4-only, --ipv6-only, --no-validate argparse flags
  • Filter out invalid IPs like 999.999.999.999

Files committed: gen-network-log.py, lab-7-extract.py, network.log.


What's the failure mode?

This tool's likely failure modes:

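  1. IPv6 that looks like a MAC address. aa:bb:cc:dd:ee:ff is a MAC, not an IP. The permissive IPv6 regex matches it (it is all hex digits and colons); the validator rejects it because it has six groups and no ::. Good: the layered pattern saves you.
  2. IPv4 inside a longer dotted string. In version 192.168.1.10.beta the pattern above still matches 192.168.1.10, because . is a non-word character and \b holds on both sides. Likewise 1.2.3.4.5 yields 1.2.3.4. Test with such strings and decide whether that is what you want.
  3. Compressed IPv6 with ::. ::1 is a valid IPv6 address (loopback), so your permissive regex must handle the :: form. The generator's ipv6s list already includes ::1; confirm it appears in your output.
  4. IPv6 with embedded IPv4. ::ffff:192.168.1.1 is a valid IPv6 address (IPv4-mapped). The IPv6 regex has no . in its character class, so it captures only ::ffff:192, which the validator accepts as a different, valid address; the IPv4 regex separately matches the trailing 192.168.1.1. Decide whether to report both or only the IPv6 form, and extend the regex if you need the full mapped address.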
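
Items 1 and 4 can be checked directly. The sketch below reuses the two patterns from Parts A and C on two hand-picked strings; the is_valid helper is a hypothetical shorthand introduced only for this demo.

```python
import ipaddress
import re

IPV4_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')

def is_valid(cls, s):
    """Hypothetical helper: True if cls (IPv4Address or IPv6Address) accepts s."""
    try:
        cls(s)
        return True
    except ipaddress.AddressValueError:
        return False

mac = 'aa:bb:cc:dd:ee:ff'
mapped = '::ffff:192.168.1.1'

# Item 1: the regex matches the MAC, the validator throws it out.
mac_candidates = IPV6_RE.findall(mac)
mac_valid = [s for s in mac_candidates if is_valid(ipaddress.IPv6Address, s)]
print(mac_candidates)  # ['aa:bb:cc:dd:ee:ff']
print(mac_valid)       # []

# Item 4: the IPv6 regex stops at the first dot; the IPv4 regex takes the tail.
mapped_v6 = IPV6_RE.findall(mapped)
mapped_v4 = IPV4_RE.findall(mapped)
print(mapped_v6)       # ['::ffff:192']
print(mapped_v4)       # ['192.168.1.1']
# The truncated candidate passes validation, so it would be reported.
print(is_valid(ipaddress.IPv6Address, mapped_v6[0]))  # True
```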

Common pitfalls

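  • Forgetting the r prefix. '\d+' works by coincidence (Python has no \d string escape, so the backslash survives, though recent Python versions warn about it); '\b' does NOT (Python interprets \b as backspace, not word boundary). Always r'...'.
  • Greedy vs non-greedy. The + quantifiers in the IPv6 regex above are greedy. They stay bounded only because the character class admits nothing but hex digits and colons and the lookarounds pin both ends. A greedy pattern built on a broader class (such as .+) can over-match on lines with multiple colons.
  • Forgetting word boundaries. Without \b, a longer digit run such as 1234.5.6.7890 produces the partial match 234.5.6.789. re.findall returns non-overlapping matches, so you get one wrong answer rather than several, but it is still wrong.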
  • Sorting IPs as strings. '10.0.0.5' < '9.0.0.5' is True (lexicographic). For numeric ordering, use ipaddress.IPv4Address as the key.
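
The raw-string pitfall is worth seeing once. In this minimal sketch, the non-raw pattern is assembled from pieces so that only the \b part goes through Python's string escaping; the sample text is an assumption.

```python
import re

# In a normal string literal, '\b' is a single backspace character (0x08).
plain = '\b'
raw = r'\b'
print(len(plain), len(raw))  # 1 2
print(plain == '\x08')       # True

text = 'port 443 open'

# Non-raw boundary: the regex searches for literal backspace characters.
no_match = re.findall('\b' + r'\d+' + '\b', text)
# Raw boundary: the regex engine receives \b and treats it as a word boundary.
match = re.findall(r'\b\d+\b', text)

print(no_match)  # []
print(match)     # ['443']
```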

Stretch (optional)

  1. Extract URLs too. Add a URL regex (rough): r'https?://\S+'. Emit a sorted unique URL list alongside the IPs.
  2. Count per-IP frequency. Use collections.Counter to count how many times each IP appears. Emit by-frequency-desc.
  3. Geolocate (offline). If you can install the geoip2 library and download the free GeoLite2 database from MaxMind, augment each IP with a country code. Forward-stretch; reads like a real SRE tool.
  4. Stream the file (week 5 pattern). Replace f.read() with line-by-line iteration so the tool scales to large logs.
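  5. CIDR ranges. Accept an --allow CIDR argument; only report IPs that fall inside the CIDR block. Use address in network (with an ipaddress.IPv4Network) for the membership check. Avoid testing against .hosts(): it enumerates the block and, for a typical IPv4 block, leaves out the network and broadcast addresses.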
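
As a starting point for stretch item 5, here is a minimal sketch of the membership check on its own, without the argparse wiring. The filter_by_cidr name and the sample addresses are assumptions; it expects strings that have already passed validation.

```python
import ipaddress

def filter_by_cidr(addresses, cidr):
    """Keep only the address strings that fall inside the CIDR block.

    Hypothetical helper for the --allow stretch goal.
    """
    network = ipaddress.ip_network(cidr)
    kept = []
    for s in addresses:
        addr = ipaddress.ip_address(s)
        # Compare versions first so IPv6 addresses never match an IPv4 block.
        if addr.version == network.version and addr in network:
            kept.append(s)
    return kept

found = ['192.168.1.10', '10.0.0.5', '192.168.1.0', 'fe80::1']
allowed = filter_by_cidr(found, '192.168.1.0/24')
print(allowed)  # ['192.168.1.10', '192.168.1.0']

# hosts() omits the network and broadcast addresses of this block.
hosts = [str(h) for h in ipaddress.ip_network('192.168.1.0/30').hosts()]
print(hosts)    # ['192.168.1.1', '192.168.1.2']
```

Note that 192.168.1.0 is kept by the in check but would be missed by a lookup against hosts().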

Lab 7 v0.1.