~90 minutes. Read a network log file; extract every IPv4 and IPv6 address; emit a sorted unique list. Practice re.findall, character classes, alternation, and the ipaddress stdlib validator.
Goal: ship lab-7-extract.py that scans a log file and produces a sorted unique list of IP addresses found.
Estimated time: 90 minutes
Prerequisites: Week 7 lecture (regex). Lab 6 complete and working.
Setup
mkdir -p ~/fnd-102/lab-7
cd ~/fnd-102/lab-7
Create a sample network.log (use the generator below or write one by hand with 30-100 lines):
# gen-network-log.py
import random

random.seed(42)  # fixed seed so every run writes the same log

ipv4s = ['192.168.1.10', '10.0.0.5', '172.16.0.100', '8.8.8.8', '1.1.1.1', '203.0.113.42']
ipv6s = ['fe80::1', '2001:db8::1', '::1', '2001:db8:85a3::8a2e:370:7334']
fakes = ['999.999.999.999', '256.256.256.256', '300.1.1.1', 'foo:bar:baz', 'gggg::1']

lines = []
for i in range(200):
    # Split i into minutes and seconds so the timestamp stays valid past i = 59.
    ts = f'2026-05-28T12:{i // 60:02d}:{i % 60:02d}'
    r = random.random()
    if r < 0.5:
        ip = random.choice(ipv4s)
        port = random.choice([80, 443, 22, 8080])
        lines.append(f'{ts} INFO request from {ip}:{port}')
    elif r < 0.8:
        ip = random.choice(ipv6s)
        lines.append(f'{ts} INFO request from [{ip}]:443')
    elif r < 0.95:
        lines.append(f'{ts} INFO unrelated line, no IP here')
    else:
        # garbage that looks like an IP
        ip = random.choice(fakes)
        lines.append(f'{ts} WARN parse failed: {ip}')

with open('network.log', 'w') as f:
    for line in lines:
        f.write(line + '\n')

print(f'Wrote network.log with {len(lines)} lines.')
Run it:
python3 gen-network-log.py
head network.log
Part A: IPv4 extraction (25 min)
A rough IPv4 pattern: r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
import re
from pathlib import Path

IPV4_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def extract_ipv4(text):
    """Return a list of IPv4-shaped substrings found in `text`."""
    return IPV4_RE.findall(text)

def main():
    log = Path('network.log')
    with open(log, encoding='utf-8') as f:
        text = f.read()
    found = extract_ipv4(text)
    print(f'Found {len(found)} IPv4-shaped strings.')
    print('Unique:')
    for ip in sorted(set(found)):
        print(f'  {ip}')

if __name__ == '__main__':
    main()
Run it. You should see the real IPv4s from your generator PLUS the fakes (999.999.999.999, 256.256.256.256, 300.1.1.1). That is because the regex matches "shape," not "valid value."
Two observations:
- \b (word boundary) lets the pattern pull 192.168.1.10 cleanly out of 192.168.1.10:80 (the : is a non-word character, so the boundary holds). Without \b, you risk partial matches on longer numeric sequences.
- re.compile is the pre-compile-once-use-many pattern. For a one-shot script this is overkill; for a function called in a loop it keeps the pattern in one named place and skips the per-call pattern lookup.
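To see the first observation in action, here is a minimal sketch that runs the Part A pattern with and without \b. The two sample strings are made up for illustration and do not come from network.log.

```python
import re

# Same shape pattern as Part A, with and without word boundaries.
WITH_B = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
WITHOUT_B = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

# Hypothetical sample strings.
with_port = 'request from 192.168.1.10:80'
long_run = 'id 1192.168.1.1000 end'

port_match = WITH_B.findall(with_port)   # ':' is a non-word char, so \b holds
bounded = WITH_B.findall(long_run)       # no boundary inside the digit runs
unbounded = WITHOUT_B.findall(long_run)  # carves a partial match out of the run

print(port_match)  # ['192.168.1.10']
print(bounded)     # []
print(unbounded)   # ['192.168.1.100']
```

The unbounded pattern reports an address that never appeared in the text, which is the partial-match risk described above.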
Part B: Validate with ipaddress (15 min)
Filter out the fakes by using the ipaddress stdlib module to validate each match:
import ipaddress

def is_valid_ipv4(s):
    """Return True if s is a valid IPv4 address (octets 0-255)."""
    try:
        ipaddress.IPv4Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

def extract_valid_ipv4(text):
    """Return IPv4-shaped strings that are also valid IPv4 addresses."""
    candidates = IPV4_RE.findall(text)
    return [ip for ip in candidates if is_valid_ipv4(ip)]
The pattern is "regex for shape, library for validation." This is faster to write than a regex that enforces octet ranges (the strict-octet regex is r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(?:\.(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}' and it is brittle as well as ugly).
Run with the validator. You should now see only the real IPv4s; 999.999.999.999 and friends are filtered.
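If you want to convince yourself that the library validator and the strict-octet regex agree, a small comparison like the one below may help. It is only a sketch: the names STRICT_OCTET, STRICT_IPV4_RE, valid_by_regex, and valid_by_library are made up for this example, and the sample list is illustrative.

```python
import ipaddress
import re

# The strict-octet regex quoted above, assembled from one octet sub-pattern.
STRICT_OCTET = r'(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
STRICT_IPV4_RE = re.compile(rf'{STRICT_OCTET}(?:\.{STRICT_OCTET}){{3}}')

def valid_by_regex(s):
    """True if the whole string matches the strict-octet pattern."""
    return STRICT_IPV4_RE.fullmatch(s) is not None

def valid_by_library(s):
    """True if ipaddress accepts the string as an IPv4 address."""
    try:
        ipaddress.IPv4Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

# Illustrative samples: three real addresses and two of the generator's fakes.
samples = ['8.8.8.8', '255.255.255.255', '0.0.0.0', '256.256.256.256', '300.1.1.1']
results = {s: (valid_by_regex(s), valid_by_library(s)) for s in samples}

for s, (by_regex, by_library) in results.items():
    print(f'{s:18} regex={by_regex} library={by_library}')
```

Both approaches give the same verdict on these samples; the library version is the one you can read at a glance.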
Part C: IPv6 extraction (20 min)
IPv6 is harder. The full grammar allows for compressed forms (::1, fe80::1), expanded forms (2001:db8:0:0:0:0:0:1), and bracketed-with-port ([fe80::1]:443). A pragmatic regex that catches most forms:
# This is intentionally permissive; the ipaddress validator does the real filtering.
IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')

def is_valid_ipv6(s):
    """Return True if s is a valid IPv6 address."""
    try:
        ipaddress.IPv6Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

def extract_valid_ipv6(text):
    """Return IPv6-shaped strings that are also valid IPv6 addresses."""
    candidates = IPV6_RE.findall(text)
    return [ip for ip in candidates if is_valid_ipv6(ip)]
The lookarounds (?<![\w:]) and (?![\w:]) are lookbehind and lookahead assertions: "not preceded by a word character or colon" and "not followed by a word character or colon." They prevent matching IPv6 fragments inside larger strings.
The "permissive regex + library validator" pattern is the same as IPv4. Trying to enforce IPv6 grammar in pure regex is a project; the ipaddress module already does it correctly.
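The sketch below traces what the lookarounds and the validator each contribute on two sample lines. The lines are hypothetical but follow the generator's format; the regex and validator are copied from Part C so the snippet runs on its own.

```python
import ipaddress
import re

IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')

def is_valid_ipv6(s):
    try:
        ipaddress.IPv6Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

# Hypothetical lines in the generator's format.
v6_line = '2026-05-28T12:00:07 INFO request from [fe80::1]:443'
v4_line = '2026-05-28T12:00:08 INFO request from 10.0.0.5:8080'

# The timestamp 12:00:07 is skipped: it is preceded by the word character 'T'.
v6_candidates = IPV6_RE.findall(v6_line)
# The permissive regex still picks up the tail of an IPv4 address plus port.
v4_candidates = IPV6_RE.findall(v4_line)

print(v6_candidates)  # ['fe80::1']
print(v4_candidates)  # ['5:8080']
print([c for c in v4_candidates if is_valid_ipv6(c)])  # []
```

The second line shows why the validator is not optional: without it, 5:8080 would be reported as an IPv6 address.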
Part D: Combine and emit sorted unique (15 min)
Final main():
def main():
    log = Path('network.log')
    with open(log, encoding='utf-8') as f:
        text = f.read()
    ipv4s = set(extract_valid_ipv4(text))
    ipv6s = set(extract_valid_ipv6(text))
    print(f'Found {len(ipv4s)} unique IPv4 addresses:')
    for ip in sorted(ipv4s, key=lambda s: ipaddress.IPv4Address(s)):
        print(f'  {ip}')
    print(f'Found {len(ipv6s)} unique IPv6 addresses:')
    for ip in sorted(ipv6s, key=lambda s: ipaddress.IPv6Address(s)):
        print(f'  {ip}')
The key=lambda s: ipaddress.IPv4Address(s) sorts IPs in numeric (not lexicographic) order. '10.0.0.5' and '9.0.0.5' would sort wrong as strings; the IPv4Address class sorts them correctly because the class implements __lt__.
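A quick way to see the difference is to sort the same list both ways. The address list here is illustrative; passing ipaddress.IPv4Address directly as the key is equivalent to the lambda used in main().

```python
import ipaddress

# Illustrative addresses, including the 10.x / 9.x pair discussed above.
addrs = ['10.0.0.5', '9.0.0.5', '192.168.1.10', '8.8.8.8']

as_strings = sorted(addrs)                             # character-by-character
as_numbers = sorted(addrs, key=ipaddress.IPv4Address)  # by 32-bit value

print(as_strings)  # ['10.0.0.5', '192.168.1.10', '8.8.8.8', '9.0.0.5']
print(as_numbers)  # ['8.8.8.8', '9.0.0.5', '10.0.0.5', '192.168.1.10']
```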
Test:
python3 lab-7-extract.py
Expected output: a sorted unique list of the real IPs from your generator, no fakes.
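One caveat about "unique": a set of strings removes duplicate text, not duplicate addresses. The generator writes each address in a single form, so this does not affect the lab output, but real logs may mix compressed, expanded, and upper-case spellings. A minimal sketch of the difference, using made-up sample spellings of one address:

```python
import ipaddress

# Three spellings of the same IPv6 address (illustrative input).
forms = ['2001:db8::1', '2001:db8:0:0:0:0:0:1', '2001:DB8::1']

unique_strings = set(forms)                                   # text-level dedup
unique_addresses = {ipaddress.IPv6Address(s) for s in forms}  # value-level dedup
canonical = sorted(str(a) for a in unique_addresses)

print(len(unique_strings))    # 3
print(len(unique_addresses))  # 1
print(canonical)              # ['2001:db8::1']
```

If you want value-level uniqueness, one option is to build the set from address objects and convert back to strings only when printing.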
Part E: Add CLI arguments (10 min)
Polish the tool with argparse:
import argparse
import sys  # needed for sys.stderr and sys.exit below

def build_parser():
    parser = argparse.ArgumentParser(description='Extract IPv4 and IPv6 addresses from a log file.')
    parser.add_argument('input', type=Path, help='path to the log file')
    parser.add_argument('--ipv4-only', action='store_true', help='only extract IPv4 addresses')
    parser.add_argument('--ipv6-only', action='store_true', help='only extract IPv6 addresses')
    parser.add_argument('--no-validate', action='store_true', help='do not validate; report shape matches')
    return parser

def main():
    args = build_parser().parse_args()
    if not args.input.exists():
        print(f'Error: {args.input} does not exist', file=sys.stderr)
        sys.exit(1)
    with open(args.input, encoding='utf-8') as f:
        text = f.read()
    # With --no-validate the matches may not parse as addresses, so they are sorted as plain strings.
    if not args.ipv6_only:
        if args.no_validate:
            ipv4s = set(IPV4_RE.findall(text))
        else:
            ipv4s = set(extract_valid_ipv4(text))
        print(f'IPv4 ({len(ipv4s)}):')
        for ip in sorted(ipv4s, key=lambda s: ipaddress.ip_address(s) if not args.no_validate else s):
            print(f'  {ip}')
    if not args.ipv4_only:
        if args.no_validate:
            ipv6s = set(IPV6_RE.findall(text))
        else:
            ipv6s = set(extract_valid_ipv6(text))
        print(f'IPv6 ({len(ipv6s)}):')
        for ip in sorted(ipv6s, key=lambda s: ipaddress.ip_address(s) if not args.no_validate else s):
            print(f'  {ip}')
Test all four call paths:
python3 lab-7-extract.py network.log
python3 lab-7-extract.py network.log --ipv4-only
python3 lab-7-extract.py network.log --ipv6-only
python3 lab-7-extract.py network.log --no-validate
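You can also exercise the parser without touching the shell by handing parse_args an explicit argument list. The sketch below repeats the Part E flags (help strings trimmed) so that it runs on its own; the variable names args and both are just for illustration.

```python
import argparse
from pathlib import Path

def build_parser():
    # Same flags as Part E, repeated so this snippet is self-contained.
    parser = argparse.ArgumentParser(description='Extract IPv4 and IPv6 addresses from a log file.')
    parser.add_argument('input', type=Path, help='path to the log file')
    parser.add_argument('--ipv4-only', action='store_true')
    parser.add_argument('--ipv6-only', action='store_true')
    parser.add_argument('--no-validate', action='store_true')
    return parser

# Passing a list bypasses sys.argv, which is handy for quick checks.
args = build_parser().parse_args(['network.log', '--ipv4-only'])
print(args.input, args.ipv4_only, args.ipv6_only, args.no_validate)

# A fifth call path worth knowing about: both "only" flags at once.
both = build_parser().parse_args(['network.log', '--ipv4-only', '--ipv6-only'])
print(both.ipv4_only and both.ipv6_only)  # True
```

With both flags set, the Part E main() skips both branches and prints nothing. Whether to reject that combination is a design decision left to you.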
Part F: Commit your work (10 min)
cd ~/fnd-102/lab-7
git add gen-network-log.py lab-7-extract.py network.log
git commit -m "lab-7: regex IP extractor with shape+validate pattern, ipv4/ipv6 split, argparse"
Expected output / artifact
lab-7-extract.py should:
- Use re.compile with the r'...' raw-string prefix
- Extract IPv4-shaped substrings and validate with ipaddress.IPv4Address
- Extract IPv6-shaped substrings and validate with ipaddress.IPv6Address
- Emit sorted unique lists, sorted in numeric (not lexicographic) order
- Support --ipv4-only, --ipv6-only, --no-validate argparse flags
- Filter out invalid IPs like 999.999.999.999
Files committed: gen-network-log.py, lab-7-extract.py, network.log.
What's the failure mode?
This tool's likely failure modes:
- IPv6 that looks like a MAC address. aa:bb:cc:dd:ee:ff is a MAC, not an IP. The IPv6 regex matches it; the validator rejects it because it has six groups and no ::. Good: the layered pattern saves you.
- IPv4 inside a longer dotted string. In version 192.168.1.10.beta the pattern still matches 192.168.1.10, because the . after 10 is a non-word character and satisfies \b. Likewise, 1.2.3.4.5 yields 1.2.3.4. Test with such strings and decide whether you want them reported.
- Compressed IPv6 with ::. ::1 is a valid IPv6 (loopback). Your permissive regex must include the :: form. The generator already emits ::1, so confirm that it shows up in your output.
- IPv6 with embedded IPv4. ::ffff:192.168.1.1 is a valid IPv6 (IPv4-mapped IPv6). The IPv6 character class has no ., so the IPv6 regex matches only the substring ::ffff:192, which the validator accepts as a different but valid address. The IPv4 regex matches the trailing 192.168.1.1. Decide whether to report both or only IPv6.
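Two of these failure modes are easy to reproduce. The sketch below runs the Part A and Part C patterns over two hypothetical log lines, one containing a MAC address and one containing an IPv4-mapped IPv6 address.

```python
import ipaddress
import re

IPV4_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
IPV6_RE = re.compile(r'(?<![\w:])([0-9a-fA-F:]+:[0-9a-fA-F:]+)(?![\w:])')

def is_valid_ipv6(s):
    try:
        ipaddress.IPv6Address(s)
        return True
    except ipaddress.AddressValueError:
        return False

# Hypothetical lines; neither is produced by the generator.
mac_line = 'WARN dhcp lease for aa:bb:cc:dd:ee:ff'
mapped_line = 'INFO peer ::ffff:192.168.1.1 connected'

mac_candidates = IPV6_RE.findall(mac_line)
mac_valid = [c for c in mac_candidates if is_valid_ipv6(c)]

mapped_v6 = IPV6_RE.findall(mapped_line)  # stops at the first '.'
mapped_v6_valid = [c for c in mapped_v6 if is_valid_ipv6(c)]
mapped_v4 = IPV4_RE.findall(mapped_line)

print(mac_candidates, mac_valid)   # ['aa:bb:cc:dd:ee:ff'] []
print(mapped_v6, mapped_v6_valid)  # ['::ffff:192'] ['::ffff:192']
print(mapped_v4)                   # ['192.168.1.1']
```

The MAC case is caught by the validator. The mapped case is not: the truncated ::ffff:192 passes validation even though it is not the address in the log, so this one needs a decision in your regex or your reporting.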
Common pitfalls
- Forgetting the r prefix. '\d+' works by coincidence (Python does not have a \d escape, so the backslash is kept, though recent Python versions warn about it); '\b' does NOT (Python interprets \b as backspace, not word boundary). Always r'...'.
- Greedy vs non-greedy. The + quantifiers in the IPv6 regex above are greedy. They do not over-match here because the character class admits only hex digits and colons and the lookarounds pin both ends. A greedy pattern with a broader class (such as .+) can over-match on lines with multiple colons.
- Forgetting word boundaries. Without \b, a longer digit run such as 1192.168.1.1000 produces the partial match 192.168.1.100. re.findall is non-overlapping, but boundaries still help.
- Sorting IPs as strings. '10.0.0.5' < '9.0.0.5' is True (lexicographic). For numeric ordering, use ipaddress.IPv4Address as the key.
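The raw-string pitfall is the easiest one to check for yourself. In this sketch the sample text is made up, and the non-raw pattern uses [0-9] rather than \d so that the only difference between the two patterns is how \b is interpreted.

```python
import re

plain = '\b'   # one character: backspace (0x08)
raw = r'\b'    # two characters: a backslash and the letter b

print(len(plain), len(raw))  # 1 2

text = 'ip 10.0.0.5 ok'  # illustrative input

with_raw = re.findall(r'\b[0-9]+\b', text)   # word boundaries around digit runs
with_plain = re.findall('\b[0-9]+\b', text)  # looks for literal backspace characters

print(with_raw)    # ['10', '0', '0', '5']
print(with_plain)  # []
```

The non-raw pattern fails silently: no error, just no matches.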
Stretch (optional)
- Extract URLs too. Add a URL regex (rough): r'https?://\S+'. Emit a sorted unique URL list alongside the IPs.
- Count per-IP frequency. Use collections.Counter to count how many times each IP appears. Emit by-frequency-desc.
- Geolocate (offline). If you can install the geoip2 library and download the free GeoLite2 database from MaxMind, augment each IP with a country code. Forward-stretch; reads like a real SRE tool.
- Stream the file (week 5 pattern). Replace f.read() with line-by-line iteration so the tool scales to large logs.
- CIDR ranges. Accept an --allow CIDR argument; only report IPs that fall inside the CIDR block. Use ipaddress.IPv4Network(...).hosts() or address in network for the membership check.
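As a starting point for the frequency and CIDR items, here is a minimal sketch of the two stdlib calls involved. The list found stands in for the validated matches from your extractor (duplicates kept), and the 10.0.0.0/8 block is an arbitrary example; wiring either into argparse is left to you.

```python
import ipaddress
from collections import Counter

# Stand-in for the validated matches, before deduplication (illustrative data).
found = ['10.0.0.5', '8.8.8.8', '10.0.0.5', '192.168.1.10', '10.0.0.5', '8.8.8.8']

# Per-IP frequency, most frequent first.
by_frequency = Counter(found).most_common()

# CIDR membership: "address in network" checks without enumerating hosts.
allowed = ipaddress.ip_network('10.0.0.0/8')  # example block
inside = sorted(
    {ip for ip in found if ipaddress.ip_address(ip) in allowed},
    key=ipaddress.ip_address,
)

print(by_frequency)  # [('10.0.0.5', 3), ('8.8.8.8', 2), ('192.168.1.10', 1)]
print(inside)        # ['10.0.0.5']
```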
Lab 7 v0.1.