Lab 8.2: Text Search

~60 minutes. Use grep, cut, sort, and uniq to extract and summarize information from a sample log file.


Goal: build grep-based pipelines to extract structured information from text, identify patterns, and produce summary statistics.

Estimated time: 60 minutes

Prerequisites: Week 8 lecture (grep flags, pipes, cut, sort, uniq); Lab 8.1 completed


Setup

cd ~/fnd-101/lab-8-1    # reuse the sample from Lab 8.1

You will work with sample/logs/access.log from Lab 8.1.


Part A: grep flags in practice

For each task, write the command and record the output (or count).

  1. Case-insensitive search: count lines containing "get" regardless of case:

    grep -ic "get" sample/logs/access.log
    
  2. Inverted search: show only lines that do NOT contain "200":

    grep -v "200" sample/logs/access.log
    
  3. Line numbers: show lines containing "403" with their line numbers:

    grep -n "403" sample/logs/access.log
    
  4. Recursive search: create a second log file, sample/logs/error.log, with a few lines, then search for "error" in every file under sample/logs/:

    echo "ERROR: connection refused" > sample/logs/error.log
    echo "WARNING: disk space low" >> sample/logs/error.log
    grep -ri "error" sample/logs/
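If any of these flags behaves unexpectedly, check it against a tiny throwaway file first. The log lines below are invented for illustration; they are not taken from the real access.log:

```shell
# Three invented lines just to exercise the flags (not the real log).
printf 'GET /index 200\nget /about 403\nPOST /login 200\n' > /tmp/mini.log

grep -ic "get" /tmp/mini.log    # -i ignores case, -c counts matching lines
grep -vn "200" /tmp/mini.log    # -v inverts the match, -n adds line numbers
```

The first command prints 2 (both "GET" and "get" match), and the second prints only line 2, the single line without "200".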
    

Part B: cut + sort + uniq pipeline

Find the top 3 IP addresses by request count:

cut -d' ' -f1 sample/logs/access.log | sort | uniq -c | sort -rn | head -3

Break down what each stage does:

  1. cut -d' ' -f1: split each line on space, take the first field (the IP address)
  2. sort: sort alphabetically so identical IPs are adjacent
  3. uniq -c: count consecutive identical lines; produces "count IP" output
  4. sort -rn: sort numerically in reverse order (highest count first)
  5. head -3: keep only the top 3

In your worksheet, record the output of this pipeline and write one sentence per stage explaining what it does.
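To see what each stage contributes, you can trace the pipeline on a few invented lines (the IPs below are made up for illustration; your real output comes from access.log):

```shell
# Three invented requests; 10.0.0.2 appears twice, 10.0.0.1 once.
printf '10.0.0.2 GET /a 200\n10.0.0.1 GET /b 200\n10.0.0.2 GET /c 404\n' |
    cut -d' ' -f1 |   # keep only the IP column, one IP per line
    sort |            # groups the two 10.0.0.2 lines together
    uniq -c |         # "2 10.0.0.2" and "1 10.0.0.1"
    sort -rn |        # highest count first
    head -3           # at most three lines (only two exist here)
```

A good way to study any pipeline is to run it cumulatively: stop after cut, then after sort, and so on, watching the data change shape at each stage.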


Part C: Extracting specific fields

The log format is: IP METHOD PATH STATUS

  1. Extract only the STATUS field (field 4) and list all unique values:

    cut -d' ' -f4 sample/logs/access.log | sort | uniq
    
  2. Count requests per HTTP method (GET, POST):

    cut -d' ' -f2 sample/logs/access.log | sort | uniq -c
    
  3. Find all paths that were requested more than once:

    cut -d' ' -f3 sample/logs/access.log | sort | uniq -c | awk '$1 > 1'
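The awk '$1 > 1' stage keeps lines whose first field, the count added by uniq -c, is greater than 1. If you only need the repeated paths themselves and not their counts, uniq -d (print only duplicated lines) gets there without awk. The paths here are invented for illustration:

```shell
# uniq -d prints one copy of each line that repeats in sorted input.
printf '/a\n/b\n/a\n' | sort | uniq -d    # prints /a
```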
    

Part D: Write the pipeline as a script

Save your "top IP addresses" pipeline as a reusable script:

cat > lab-8-2.sh << 'EOF'
#!/bin/bash
# Usage: ./lab-8-2.sh <logfile>
# Prints the top 5 IP addresses by request count

if [ -z "$1" ]; then
    echo "Usage: $0 <logfile>"
    exit 1
fi

if [ ! -f "$1" ]; then
    echo "Error: $1 not found"
    exit 1
fi

cut -d' ' -f1 "$1" | sort | uniq -c | sort -rn | head -5
EOF

chmod +x lab-8-2.sh
./lab-8-2.sh sample/logs/access.log
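The script's two guard clauses rely on test operators worth knowing: [ -z "$1" ] is true when the first argument is empty or missing, and [ ! -f "$1" ] is true when the path is not a regular file. A quick sanity check of both:

```shell
# Demonstrate the two test operators used in lab-8-2.sh.
arg=""
[ -z "$arg" ] && echo "empty: script would print usage and exit 1"
[ ! -f /no/such/file ] && echo "missing: script would print error and exit 1"
```

Also try running the script with no argument and with a nonexistent file, and confirm with echo $? that it exits with status 1 in both cases.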

Expected output / artifact

lab-8-2-notes.txt with:

  • Output of each Part A command
  • Output and explanation for the Part B pipeline
  • Output of each Part C command

Also commit lab-8-2.sh (the script):

git add lab-8-2-notes.txt lab-8-2.sh
git commit -m "lab-8-2: text search pipelines + top-IP script"

Common pitfalls

  • uniq -c requires sorted input: uniq only counts consecutive identical lines; if you skip the sort stage, it will under-count
  • Field numbers in cut: field numbering starts at 1, not 0
  • sort -rn vs sort -r: -n sorts numerically; without it, sort treats "10" as less than "9" (lexicographic ordering)
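Both pitfalls are easy to demonstrate on tiny invented inputs:

```shell
# 1) uniq only merges adjacent duplicates: without sort, the two "a"
#    lines fall in separate runs, so you get 3 output lines instead of 2.
printf 'a\nb\na\n' | uniq -c
printf 'a\nb\na\n' | sort | uniq -c

# 2) Text sort compares character by character, so "10" sorts before "9";
#    sort -n compares the numeric values instead.
printf '9\n10\n' | sort      # 10 first
printf '9\n10\n' | sort -n   # 9 first
```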

Stretch (optional)

Modify lab-8-2.sh to accept a second argument: a minimum count threshold. Only print IPs that appear at least N times. Test with N=2 on the sample log.


Lab 8.2 v0.1.