On this page
- Purpose
- The Problem
- The Solution
- Implementation Details
- Prerequisites & Tooling
- Knowledge Base
- Environment
- Key Files to Master
- High-Level Architecture
- The Packet Processing State Machine
- Airport Package Sorting Facility
- Implementation
- Part 1: Entry Point and Packet Validation
- Part 2: Protocol Multiplexing (ARP vs IP)
- Part 3: ARP Packet Handling
- Part 4: IP Packet Validation (The Gauntlet)
- Part 5: Destination Check (For Me or Forward?)
- Part 6: IP Packet Forwarding (The Critical Path)
- Part 7: The Longest Prefix Match Algorithm
- Under the Hood
- Memory Layout: The Packet Buffer
- CPU Cache Optimization: The Hot Path
- Compiler Optimizations: Function Inlining
- Memory Barriers and Atomic Operations
- Edge Cases & Pitfalls
- Concurrency: The ARP Queue Race
- Security: IP Spoofing and Reflection Attacks
- Memory: The Queued Packet Leak
- Performance: The Checksum Recalculation Bottleneck
- Conclusion
- What You Actually Learned
- Real-World Applications
- Next Steps
- The Ultimate Challenge
Purpose
The Problem
You’re sitting at the core of a router, processing thousands of packets per second. Each packet arrives as raw bytes—you don’t know if it’s ARP, ICMP, or IP until you parse it. Here’s the nightmare scenario:
1. Packet arrives: aa:bb:cc:dd:ee:ff → 11:22:33:44:55:66, type=0x0800 (IP)
2. Parse IP: Checksum valid? TTL > 0? Destination = me or forward?
3. If forwarding: Look up next hop in routing table
4. But wait: You don't know the next hop's MAC address
5. Send ARP request: "Who has 10.0.2.2?"
6. Problem: The original IP packet is now in limbo: you can't forward it until ARP completes
7. Meanwhile: 5 more packets arrive for the same destination
8. Race condition: Do you queue all 6 packets? Send 6 ARP requests? What if ARP times out?
The real challenge: How do you build a stateful packet processor that handles multiple protocols, manages asynchronous operations (ARP resolution), prevents memory leaks, and maintains wire-speed performance?
The Solution
We’re dissecting sr_handlepacket() from router/sr_router.c—the main event loop of this router. This single function orchestrates:
- Protocol multiplexing (Ethernet → ARP or IP)
- Stateful ARP resolution with packet queuing
- ICMP error generation
- IP forwarding with TTL management
- Longest-prefix matching for routing decisions
This is a production-grade systems programming challenge: coordinating memory allocation across async operations, handling implicit state machines, and optimizing hot paths for performance.
Implementation Details
This implementation demonstrates:
- Zero-Copy Packet Handling: Pointers into raw buffers instead of struct copies
- Deferred Work Queues: Packets waiting for ARP resolution
- State Machine Coordination: 6 different packet types, each with different flow
- Memory Safety: Preventing leaks when packets are dropped mid-flight
- Performance: O(1) packet parsing, O(n) routing table lookup (optimizable to O(log n))
Prerequisites & Tooling
Knowledge Base
- Required: C pointers, pointer arithmetic, state machines, packet headers
- Helpful: Operating systems (kernel packet processing), network stack internals
- Advanced: Lock-free data structures, cache-line optimization
Environment
From the project’s configuration:
# Compiler optimizations
GCC -O2 (enables common subexpression elimination, function inlining)
# Debugging
GCC -g (symbol tables for valgrind/gdb)
# Tools for analysis
valgrind --leak-check=full ./sr # Detect memory leaks
perf record -g ./sr # Profile hot paths
Key Files to Master
- router/sr_router.c - Main packet handler (this tutorial)
- router/sr_protocol.h - Protocol header definitions
- router/sr_arpcache.h - ARP cache state structures
- router/sr_rt.h - Routing table structures
High-Level Architecture
The Packet Processing State Machine
stateDiagram-v2
[*] --> ReceivePacket: Ethernet frame arrives
ReceivePacket --> ValidateEthernet: Parse Ethernet header
ValidateEthernet --> ParseEtherType: Read EtherType field
ParseEtherType --> ARPPath: 0x0806 (ARP)
ParseEtherType --> IPPath: 0x0800 (IPv4)
ParseEtherType --> Drop: Unknown type
ARPPath --> ARPRequest: opcode=1 (request)
ARPPath --> ARPReply: opcode=2 (reply)
ARPRequest --> SendARPReply: Target IP = my IP?
ARPRequest --> Drop: Not for me
SendARPReply --> [*]
ARPReply --> UpdateARPCache: Cache MAC mapping
UpdateARPCache --> SendQueuedPackets: Packets waiting?
SendQueuedPackets --> [*]
IPPath --> ValidateIP: Checksum, TTL, length
ValidateIP --> DestinationCheck: For me or forward?
DestinationCheck --> ICMPHandler: Destination = my IP
DestinationCheck --> ForwardPacket: Destination = other network
ICMPHandler --> SendEchoReply: ICMP Echo Request (ping)
ICMPHandler --> Drop: Other ICMP types
SendEchoReply --> [*]
ForwardPacket --> LongestPrefixMatch: Lookup in routing table
LongestPrefixMatch --> CheckTTL: Route found
LongestPrefixMatch --> SendICMPUnreachable: No route
CheckTTL --> DecrementTTL: TTL > 1
CheckTTL --> SendICMPTimeExceeded: TTL = 1
DecrementTTL --> ARPLookup: Get next-hop MAC
ARPLookup --> SendPacket: MAC in cache
ARPLookup --> QueuePacket: ARP needed
QueuePacket --> SendARPRequest_Async: Queue for ARP
SendARPRequest_Async --> [*]: Wait for ARP reply
SendPacket --> [*]
Drop --> [*]
Airport Package Sorting Facility
Imagine you’re the router, and packets are parcels:
- Receive: Parcel arrives on conveyor belt (Ethernet frame)
- Parse: Check destination label (IP header)
- Decision Tree:
- Label = “This Airport” → Open and handle (ICMP echo = reply to sender)
- Label = “Other Airport” → Check flight map (routing table)
- Flight found → Check if you know the airline counter (ARP cache)
- Counter known → Send parcel there (forward packet)
- Counter unknown → Ask information desk (send ARP), hold parcel (queue packet)
The complexity: You’re handling 1000 parcels/second, must remember which parcels are waiting for which airline counters, and can’t lose any parcels even if information desk never responds.
Implementation
Part 1: Entry Point and Packet Validation
Goal: Extract Ethernet header and validate minimum packet size.
void sr_handlepacket(struct sr_instance* sr,
uint8_t * packet, /* Raw packet buffer (lent) */
unsigned int len, /* Total packet length */
char* interface) /* Receiving interface name */
{
/* Sanity check: Minimum valid Ethernet frame is 14 bytes */
if (len < sizeof(sr_ethernet_hdr_t)) {
fprintf(stderr, "Error: Packet too short (len=%u)\n", len);
return; /* Drop silently */
}
/* Cast raw bytes to Ethernet header structure (zero-copy) */
sr_ethernet_hdr_t* eth_hdr = (sr_ethernet_hdr_t*)packet;
/* Get interface metadata (IP address, MAC address) */
struct sr_if* iface = sr_get_interface(sr, interface);
if (iface == NULL) {
fprintf(stderr, "Error: Unknown interface %s\n", interface);
return;
}
/* Extract EtherType (network byte order → host byte order) */
uint16_t ether_type = ntohs(eth_hdr->ether_type);
🔵 Deep Dive - The “Lent” Comment:
Notice the comment /* lent */ on the packet buffer. This is critical documentation! The buffer is owned by the caller (VNS communication layer). We must not call free() on it. If we need to keep the packet beyond this function’s scope, we must malloc() a copy.
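If a packet must outlive sr_handlepacket (for example, while it waits in the ARP queue), a deep copy is required first. A minimal sketch of that pattern (the helper name is ours, not part of the codebase):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not in the original source): deep-copy a lent
 * buffer so it survives after sr_handlepacket returns. The caller of
 * sr_handlepacket reclaims `packet`, so anything we queue for deferred
 * ARP resolution must be our own allocation. */
uint8_t *copy_lent_packet(const uint8_t *packet, unsigned int len)
{
    uint8_t *copy = (uint8_t *)malloc(len);
    if (copy == NULL) {
        return NULL;   /* Out of memory: caller must drop the packet */
    }
    memcpy(copy, packet, len);
    return copy;       /* We own this copy; free() it when dequeued */
}
```

The ownership rule is symmetric: whoever allocates the copy (or the subsystem that inherits it, such as the ARP request queue) is responsible for the matching free().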
Part 2: Protocol Multiplexing (ARP vs IP)
Goal: Branch to appropriate handler based on EtherType.
/* Protocol multiplexing based on EtherType */
if (ether_type == ethertype_arp) {
/*───────────────────────────────────────────────────────────
* ARP PACKET PATH
*───────────────────────────────────────────────────────────*/
/* Validate ARP packet size */
if (len < sizeof(sr_ethernet_hdr_t) + sizeof(sr_arp_hdr_t)) {
fprintf(stderr, "Error: Malformed ARP packet\n");
return;
}
/* Cast to ARP header (immediately after Ethernet header) */
sr_arp_hdr_t* arp_hdr = (sr_arp_hdr_t*)(packet + sizeof(sr_ethernet_hdr_t));
/* Dispatch to ARP handler (explained in Part 3) */
handle_arp_packet(sr, eth_hdr, arp_hdr, iface);
} else if (ether_type == ethertype_ip) {
/*───────────────────────────────────────────────────────────
* IP PACKET PATH
*───────────────────────────────────────────────────────────*/
/* Validate IP packet minimum size */
if (len < sizeof(sr_ethernet_hdr_t) + sizeof(sr_ip_hdr_t)) {
fprintf(stderr, "Error: Malformed IP packet\n");
return;
}
/* Cast to IP header */
sr_ip_hdr_t* ip_hdr = (sr_ip_hdr_t*)(packet + sizeof(sr_ethernet_hdr_t));
/* Dispatch to IP handler (explained in Part 5) */
handle_ip_packet(sr, packet, len, eth_hdr, ip_hdr, iface);
} else {
/* Unknown protocol - drop silently (could log for debugging) */
fprintf(stderr, "Dropping packet with unknown EtherType=0x%04x\n", ether_type);
return;
}
}
🔴 Danger - Pointer Arithmetic:
Notice (packet + sizeof(sr_ethernet_hdr_t)). This is byte-level pointer arithmetic. We’re saying “skip the first 14 bytes (Ethernet header) and cast the next bytes as an ARP header.” If Ethernet header size is wrong, we read garbage or segfault.
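The arithmetic is only safe if the header structs contain no padding. A sketch of how that is guaranteed (the struct below is our reconstruction following sr_protocol.h's naming, not the actual header file):

```c
#include <stdint.h>

#define ETHER_ADDR_LEN 6

/* Reconstruction of the Ethernet header layout. The packed attribute
 * guarantees no padding is inserted regardless of ABI, so sizeof is
 * exactly 14 and (packet + sizeof(...)) lands on the next header. */
typedef struct sr_ethernet_hdr {
    uint8_t  ether_dhost[ETHER_ADDR_LEN]; /* Destination MAC */
    uint8_t  ether_shost[ETHER_ADDR_LEN]; /* Source MAC */
    uint16_t ether_type;                  /* EtherType (network order) */
} __attribute__((packed)) sr_ethernet_hdr_t;

/* Compile-time guard: if padding ever sneaks in, the build fails
 * instead of silently shifting every header cast. */
_Static_assert(sizeof(sr_ethernet_hdr_t) == 14,
               "Ethernet header must be exactly 14 bytes");
```

A `_Static_assert` like this is cheap insurance: the failure mode it prevents (every downstream cast off by a few bytes) is otherwise only visible as garbage parses at runtime.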
Part 3: ARP Packet Handling
Goal: Respond to ARP requests, cache ARP replies.
void handle_arp_packet(struct sr_instance* sr,
sr_ethernet_hdr_t* eth_hdr,
sr_arp_hdr_t* arp_hdr,
struct sr_if* iface)
{
/* Determine ARP operation: 1=Request, 2=Reply */
uint16_t arp_op = ntohs(arp_hdr->ar_op);
if (arp_op == arp_op_request) {
/*───────────────────────────────────────────────────────────
* ARP REQUEST: "Who has IP X.X.X.X?"
*───────────────────────────────────────────────────────────*/
/* Check if target IP matches the receiving interface's IP */
uint32_t target_ip = arp_hdr->ar_tip; /* Already network byte order */
if (target_ip != iface->ip) {
/* Not for us - ignore (don't proxy ARP) */
return;
}
/* Build ARP reply packet */
send_arp_reply(sr,
iface->name, /* Outgoing interface */
target_ip, /* Our IP (the answer) */
arp_hdr->ar_sip, /* Sender's IP (who asked) */
eth_hdr->ether_shost, /* Sender's MAC (reply dest) */
iface->addr); /* Our MAC (the answer) */
} else if (arp_op == arp_op_reply) {
/*───────────────────────────────────────────────────────────
* ARP REPLY: "IP X.X.X.X is at MAC aa:bb:cc:dd:ee:ff"
*───────────────────────────────────────────────────────────*/
/* Extract the resolved mapping */
uint32_t resolved_ip = arp_hdr->ar_sip;
unsigned char* resolved_mac = arp_hdr->ar_sha;
/* Critical: Insert into cache AND send queued packets */
struct sr_arpentry* entry = sr_arpcache_insert(&(sr->cache),
resolved_mac,
resolved_ip);
/* The insert function automatically:
* 1. Adds the MAC→IP mapping to cache
* 2. Looks for pending requests for this IP
* 3. Sends all queued packets
* 4. Removes the request from pending queue
* (See Tutorial 01 for details) */
}
}
🔵 Deep Dive - The Hidden State Transition:
When sr_arpcache_insert() is called, it triggers a cascading state change:
- Wakes up packets sleeping in the ARP request queue
- For each packet, performs longest-prefix match (routing lookup)
- Decrements TTL and recalculates checksums
- Sends packets out the appropriate interface
- Frees the queued packet structures
This is implicit state machine execution—one function call triggers a complex workflow.
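The flush step of that cascade can be sketched as follows (the struct mirrors sr_arpcache.h's sr_packet; the loop itself is our illustration, with the real sr_send_packet call stubbed out as a comment):

```c
#include <stdint.h>
#include <string.h>

/* Minimal stand-in for the queued-packet list from sr_arpcache.h. */
struct sr_packet {
    uint8_t *buf;            /* Full Ethernet frame, dest MAC unfilled */
    unsigned int len;
    struct sr_packet *next;
};

/* Illustrative flush: once ARP resolves, patch the destination MAC
 * (the first 6 bytes of the Ethernet header) into each queued frame
 * and hand it to the send routine. Returns the number of frames. */
int flush_queued_packets(struct sr_packet *head, const uint8_t mac[6])
{
    int sent = 0;
    for (struct sr_packet *p = head; p != NULL; p = p->next) {
        memcpy(p->buf, mac, 6);  /* ether_dhost sits at offset 0 */
        /* sr_send_packet(sr, p->buf, p->len, iface) would go here */
        sent++;
    }
    return sent;
}
```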
Part 4: IP Packet Validation (The Gauntlet)
Goal: Ensure IP packet is well-formed before processing.
void handle_ip_packet(struct sr_instance* sr,
uint8_t* packet,
unsigned int len,
sr_ethernet_hdr_t* eth_hdr,
sr_ip_hdr_t* ip_hdr,
struct sr_if* iface)
{
/* Validation 1: Checksum integrity */
uint16_t received_cksum = ip_hdr->ip_sum;
ip_hdr->ip_sum = 0; /* Zero out for recalculation */
uint16_t calculated_cksum = cksum(ip_hdr, sizeof(sr_ip_hdr_t));
if (received_cksum != calculated_cksum) {
fprintf(stderr, "IP checksum mismatch: expected 0x%04x, got 0x%04x\n",
calculated_cksum, received_cksum);
return; /* Drop corrupted packet */
}
ip_hdr->ip_sum = received_cksum; /* Restore original */
/* Validation 2: TTL check (ip_ttl is unsigned, so test for zero) */
if (ip_hdr->ip_ttl == 0) {
fprintf(stderr, "Dropping packet with TTL=0\n");
return;
}
/* Validation 3: Minimum length (some packets have IP options) */
uint16_t ip_total_len = ntohs(ip_hdr->ip_len);
if (ip_total_len < sizeof(sr_ip_hdr_t)) {
fprintf(stderr, "IP total length too small: %u bytes\n", ip_total_len);
return;
}
/* Validation 4: Ethernet frame contains complete IP packet */
if (len < sizeof(sr_ethernet_hdr_t) + ip_total_len) {
fprintf(stderr, "Truncated IP packet: frame_len=%u, ip_len=%u\n",
len, ip_total_len);
return;
}
🔴 Danger - Checksum Recalculation Side Effect:
Notice we modify ip_hdr->ip_sum by setting it to 0, calculate checksum, then restore it. This mutates the original packet buffer temporarily. If another thread reads this packet simultaneously (in a multi-threaded router), they’d see corrupted data during this window. Single-threaded execution saves us here.
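A mutation-free alternative exists (our sketch, using our own RFC 1071 checksum rather than the codebase's cksum helper): sum the header with the stored checksum included. By the one's-complement identity, a valid header folds to zero, so no zero-out/restore dance is needed.

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum (RFC 1071): one's-complement sum of big-endian
 * 16-bit words, carries folded back in. */
static uint16_t inet_cksum(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    uint32_t sum = 0;
    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len) {
        sum += (uint32_t)(p[0] << 8);   /* odd trailing byte */
    }
    while (sum >> 16) {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return (uint16_t)~sum;
}

/* Validate without touching the buffer: with the stored checksum
 * included in the sum, a valid header checksums to exactly 0. */
static int ip_header_valid(const void *hdr, size_t hdr_len)
{
    return inet_cksum(hdr, hdr_len) == 0;
}
```

This is how many production stacks verify receive-side checksums: read-only validation is both faster and thread-safe.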
Part 5: Destination Check (For Me or Forward?)
Goal: Determine if packet is addressed to this router or should be forwarded.
/*───────────────────────────────────────────────────────────────
* Check if packet is addressed to any of our interfaces
*───────────────────────────────────────────────────────────────*/
struct sr_if* if_walker = sr->if_list;
int packet_for_me = 0;
while (if_walker != NULL) {
if (ip_hdr->ip_dst == if_walker->ip) {
packet_for_me = 1;
break;
}
if_walker = if_walker->next;
}
if (packet_for_me) {
/*───────────────────────────────────────────────────────────
* PACKET DESTINED FOR THIS ROUTER
*───────────────────────────────────────────────────────────*/
/* Check protocol in IP header */
if (ip_hdr->ip_p == ip_protocol_icmp) {
/* ICMP packet - handle ping requests */
sr_icmp_hdr_t* icmp_hdr = (sr_icmp_hdr_t*)(packet +
sizeof(sr_ethernet_hdr_t) +
sizeof(sr_ip_hdr_t));
if (icmp_hdr->icmp_type == 8) { /* Echo Request (ping) */
/* Send ICMP Echo Reply (type=0, code=0) */
send_icmp_echo_reply(sr, iface,
ntohs(ip_hdr->ip_id),
(uint8_t*)icmp_hdr + sizeof(sr_icmp_hdr_t),
ip_total_len - sizeof(sr_ip_hdr_t) - sizeof(sr_icmp_hdr_t),
ip_hdr->ip_src,
ip_hdr->ip_dst,
eth_hdr->ether_shost,
iface->addr);
}
} else {
/* TCP/UDP to router itself - send ICMP Port Unreachable */
send_icmp_error_message(sr, iface->name, ntohs(ip_hdr->ip_id),
packet + sizeof(sr_ethernet_hdr_t),
ip_hdr->ip_src, ip_hdr->ip_dst,
eth_hdr->ether_shost, iface->addr,
ICMP_TYPE_DEST_UNREACHABLE);
}
} else {
/*───────────────────────────────────────────────────────────
* PACKET TO BE FORWARDED
*───────────────────────────────────────────────────────────*/
forward_ip_packet(sr, packet, len, eth_hdr, ip_hdr, iface);
}
}
🔵 Deep Dive - Interface Iteration:
We iterate through all interfaces (if_list) to check if destination matches any of our IPs. This is O(n) where n = number of interfaces (typically 2-4). For line-rate routers with hundreds of interfaces, this would use a hash table for O(1) lookup.
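A sketch of that O(1) variant, assuming a small open-addressing hash table (nothing here exists in the codebase; the table size and hash are our choices):

```c
#include <stdint.h>

#define IF_TABLE_SIZE 64   /* power of two, assumed >> interface count */

/* Hypothetical hash set of local interface IPs (network byte order).
 * Slot value 0 means empty. */
static uint32_t if_ip_table[IF_TABLE_SIZE];

static unsigned if_hash(uint32_t ip)
{
    return (ip * 2654435761u) >> 26;   /* Knuth multiplicative, 6 bits */
}

/* Insert with linear probing (table assumed never full). */
static void if_table_add(uint32_t ip)
{
    unsigned i = if_hash(ip);
    while (if_ip_table[i] != 0) {
        i = (i + 1) % IF_TABLE_SIZE;
    }
    if_ip_table[i] = ip;
}

/* O(1) expected lookup replacing the O(n) if_list walk. */
static int ip_is_local(uint32_t ip)
{
    unsigned i = if_hash(ip);
    while (if_ip_table[i] != 0) {
        if (if_ip_table[i] == ip) return 1;
        i = (i + 1) % IF_TABLE_SIZE;
    }
    return 0;
}
```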
Part 6: IP Packet Forwarding (The Critical Path)
Goal: Route packet to next hop with TTL decrement and ARP resolution.
void forward_ip_packet(struct sr_instance* sr,
uint8_t* packet,
unsigned int len,
sr_ethernet_hdr_t* eth_hdr,
sr_ip_hdr_t* ip_hdr,
struct sr_if* iface)
{
/*───────────────────────────────────────────────────────────────
* Step 1: Longest Prefix Match (Find route)
*───────────────────────────────────────────────────────────────*/
struct sr_rt matched_rt = longest_prefix_match(sr, ip_hdr->ip_dst);
if (matched_rt.dest.s_addr == 0) {
/* No route found (all-zero sentinel - beware: this is ambiguous with
 * a 0.0.0.0/0 default route) - send ICMP Network Unreachable */
send_icmp_error_message(sr, iface->name, ntohs(ip_hdr->ip_id),
(uint8_t*)ip_hdr,
ip_hdr->ip_src, ip_hdr->ip_dst,
eth_hdr->ether_shost, iface->addr,
ICMP_TYPE_DEST_UNREACHABLE);
return;
}
/*───────────────────────────────────────────────────────────────
* Step 2: TTL Decrement and Validation
*───────────────────────────────────────────────────────────────*/
ip_hdr->ip_ttl--;
if (ip_hdr->ip_ttl == 0) {
/* TTL expired - send ICMP Time Exceeded */
send_icmp_error_message(sr, iface->name, ntohs(ip_hdr->ip_id),
(uint8_t*)ip_hdr,
ip_hdr->ip_src, ip_hdr->ip_dst,
eth_hdr->ether_shost, iface->addr,
ICMP_TYPE_TIME_EXCEEDED);
return;
}
/* Recalculate IP checksum (TTL changed) */
ip_hdr->ip_sum = 0;
ip_hdr->ip_sum = cksum(ip_hdr, sizeof(sr_ip_hdr_t));
/*───────────────────────────────────────────────────────────────
* Step 3: ARP Resolution for Next Hop
*───────────────────────────────────────────────────────────────*/
uint32_t next_hop_ip = matched_rt.gw.s_addr;
/* Look up next hop's MAC address in ARP cache */
struct sr_arpentry* arp_entry = sr_arpcache_lookup(&(sr->cache), next_hop_ip);
if (arp_entry != NULL) {
/*───────────────────────────────────────────────────────────
* FAST PATH: MAC address known
*───────────────────────────────────────────────────────────*/
/* Get outgoing interface */
struct sr_if* out_iface = sr_get_interface(sr, matched_rt.interface);
/* Update Ethernet header */
memcpy(eth_hdr->ether_dhost, arp_entry->mac, ETHER_ADDR_LEN);
memcpy(eth_hdr->ether_shost, out_iface->addr, ETHER_ADDR_LEN);
/* Send packet */
sr_send_packet(sr, packet, len, matched_rt.interface);
/* Free ARP cache entry (was dynamically allocated by lookup) */
free(arp_entry);
} else {
/*───────────────────────────────────────────────────────────
* SLOW PATH: ARP resolution needed
*───────────────────────────────────────────────────────────*/
/* Check if we already have a pending ARP request for this IP */
struct sr_arpreq* req = sr_arpcache_queuereq(&(sr->cache),
next_hop_ip,
packet,
len,
matched_rt.interface);
/* If this is the first packet for this IP, send ARP request */
if (req->times_sent == 0) {
send_arp_request(sr, next_hop_ip);
req->sent = time(NULL);
req->times_sent = 1;
}
/* Packet is now queued - will be sent when ARP reply arrives */
}
}
🔴 Danger - The Free Trap:
Notice free(arp_entry) after sending. The sr_arpcache_lookup() function returns a dynamically allocated copy of the cache entry, not a pointer into the cache itself. This prevents race conditions (cache could be modified while we’re using the entry), but requires us to free the returned pointer. Forgetting this creates a memory leak proportional to forwarded packet count.
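The contract can be made concrete with a mock (our illustration of the copy-out pattern, not the real sr_arpcache_lookup):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for sr_arpcache.h's entry type (illustrative). */
struct sr_arpentry {
    uint32_t ip;
    unsigned char mac[6];
};

/* Mock of the lookup contract: on a hit, return a heap-allocated COPY
 * of the entry, never a pointer into the live cache. The snapshot is
 * immune to concurrent cache eviction, but the caller must free it. */
struct sr_arpentry *mock_arpcache_lookup(const struct sr_arpentry *cache,
                                         size_t n, uint32_t ip)
{
    for (size_t i = 0; i < n; i++) {
        if (cache[i].ip == ip) {
            struct sr_arpentry *copy = malloc(sizeof *copy);
            if (copy) *copy = cache[i];
            return copy;
        }
    }
    return NULL;   /* miss: nothing to free */
}
```

The rule to internalize: every non-NULL return obligates exactly one free(). Valgrind's `--leak-check=full` (already in the tooling list above) will flag each forgotten one.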
Part 7: The Longest Prefix Match Algorithm
Goal: Find routing table entry with most specific subnet match.
struct sr_rt longest_prefix_match(struct sr_instance* sr, uint32_t ip_addr)
{
struct sr_rt* rt = sr->routing_table;
struct sr_rt best_match;
uint32_t longest_mask = 0;
/* Initialize to "no route found" */
memset(&best_match, 0, sizeof(struct sr_rt));
/* Linear scan through routing table */
while (rt != NULL) {
/* Check if this route matches the destination IP */
uint32_t masked_dest = ip_addr & rt->mask.s_addr;
uint32_t masked_route = rt->dest.s_addr & rt->mask.s_addr;
if (masked_dest == masked_route) {
/* Match found - check if it's more specific than current best */
uint32_t current_mask = ntohl(rt->mask.s_addr);
if (current_mask > longest_mask) {
/* More specific match (longer prefix) */
best_match = *rt;
longest_mask = current_mask;
}
}
rt = rt->next;
}
return best_match;
}
🔵 Deep Dive - The Mask Comparison:
Why ntohl(rt->mask.s_addr) > longest_mask? The mask is stored in network byte order; converting it to host byte order makes a longer prefix compare as a numerically larger integer:
- 255.255.255.0 = 0xFFFFFF00 (24-bit prefix)
- 255.255.0.0 = 0xFFFF0000 (16-bit prefix)
- 0xFFFFFF00 > 0xFFFF0000 ✓ (more specific)
Performance Optimization (Not Implemented Here):
Production routers use Patricia tries (radix trees) for O(log n) lookup instead of O(n) linear scan. For a routing table with 100,000 entries, Patricia trie lookup takes ~16 comparisons vs 50,000 average comparisons for linear scan.
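A minimal unibit trie captures the idea (our sketch; a real Patricia trie additionally collapses single-child chains and uses multibit strides for fewer memory accesses):

```c
#include <stdint.h>
#include <stdlib.h>

/* One node per prefix bit. Addresses and prefixes in host byte order. */
struct trie_node {
    struct trie_node *child[2];
    int has_route;
    uint32_t next_hop;
};

static struct trie_node *trie_new(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Install a route for prefix/prefix_len. */
static void trie_insert(struct trie_node *root, uint32_t prefix,
                        int prefix_len, uint32_t next_hop)
{
    struct trie_node *n = root;
    for (int i = 0; i < prefix_len; i++) {
        int bit = (prefix >> (31 - i)) & 1;
        if (n->child[bit] == NULL) n->child[bit] = trie_new();
        n = n->child[bit];
    }
    n->has_route = 1;
    n->next_hop = next_hop;
}

/* Walk the address bits, remembering the deepest route seen: that is
 * longest-prefix match in at most 32 steps, regardless of table size. */
static int trie_lookup(const struct trie_node *root, uint32_t addr,
                       uint32_t *next_hop)
{
    const struct trie_node *n = root;
    int found = 0;
    for (int i = 0; i < 32 && n != NULL; i++) {
        if (n->has_route) { *next_hop = n->next_hop; found = 1; }
        n = n->child[(addr >> (31 - i)) & 1];
    }
    if (n != NULL && n->has_route) { *next_hop = n->next_hop; found = 1; }
    return found;
}
```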
Under the Hood
Memory Layout: The Packet Buffer
Let’s visualize exactly what’s in memory when we receive a 70-byte ICMP echo request:
┌────────────────────────────────────────────────────────────────┐
│ PACKET BUFFER (70 bytes) │
├────────────────────────────────────────────────────────────────┤
│ Offset 0-13: Ethernet Header (14 bytes) │
│ ├─ ether_dhost[6] = aa:bb:cc:dd:ee:ff │
│ ├─ ether_shost[6] = 11:22:33:44:55:66 │
│ └─ ether_type(2) = 0x0800 (IP) │
├────────────────────────────────────────────────────────────────┤
│ Offset 14-33: IP Header (20 bytes) │
│ ├─ version(4 bits) + header_len(4 bits) = 0x45 │
│ ├─ ip_len(2) = 0x0038 (56 bytes total) │
│ ├─ ip_ttl(1) = 64 │
│ ├─ ip_p(1) = 1 (ICMP) │
│ ├─ ip_src(4) = 0xC0A80101 (192.168.1.1) │
│ └─ ip_dst(4) = 0xC0A80102 (192.168.1.2) │
├────────────────────────────────────────────────────────────────┤
│ Offset 34-41: ICMP Header (8 bytes) │
│ ├─ type(1) = 8 (Echo Request) │
│ ├─ code(1) = 0 │
│ ├─ checksum(2) │
│ ├─ id(2) │
│ └─ seq(2) │
├────────────────────────────────────────────────────────────────┤
│ Offset 42-69: ICMP Data (28 bytes of payload) │
└────────────────────────────────────────────────────────────────┘
Stack Frame of sr_handlepacket():
┌────────────────────────────────────────┐
│ packet (8 bytes) = 0x7ffca8200000 ────┼──→ Points to buffer above
│ len (4 bytes) = 70 │
│ interface (8) = 0x7ffca8300000 ────┼──→ Points to "eth0" string
│ eth_hdr (8) = 0x7ffca8200000 ────┼──→ Same as packet (no copy!)
│ ip_hdr (8) = 0x7ffca820000E ────┼──→ packet + 14 bytes
│ icmp_hdr (8) = 0x7ffca8200022 ────┼──→ packet + 34 bytes
└────────────────────────────────────────┘
Key Insight: All the *_hdr pointers are views into the same buffer. No copying occurs. This is zero-copy processing—critical for performance.
CPU Cache Optimization: The Hot Path
Modern CPUs have multiple cache levels:
- L1 Cache: 32KB, ~4 cycle latency
- L2 Cache: 256KB, ~12 cycle latency
- L3 Cache: 8MB, ~40 cycle latency
- RAM: Gigabytes, ~200+ cycle latency
The Problem: If our routing table doesn’t fit in cache, every lookup is a ~200-cycle RAM access.
The Optimization (Not Fully Implemented):
/* Keep recently used routes in a small cache (fits in L1) */
#define ROUTE_CACHE_SIZE 8 /* 8 entries = ~128 bytes, fits easily in L1 */
struct {
uint32_t ip;
struct sr_rt* route;
} route_cache[ROUTE_CACHE_SIZE];
/* Check cache before full routing table scan */
for (int i = 0; i < ROUTE_CACHE_SIZE; i++) {
if (route_cache[i].route != NULL &&
route_cache[i].ip == (ip_addr & route_cache[i].route->mask.s_addr)) {
return route_cache[i].route; /* L1 hit - 4 cycles */
}
}
/* Cache miss - do full lookup (~200 cycles) and update cache */
For 80% cache hit rate, average lookup time drops from 200 to ~44 cycles.
Compiler Optimizations: Function Inlining
With -O2, GCC inlines small functions like:
inline uint16_t ntohs(uint16_t x) {
return (x << 8) | (x >> 8);
}
Becomes:
movzx eax, WORD PTR [rsi] ; Load 16-bit value
rol ax, 8 ; Rotate left 8 bits (byte swap)
Performance Impact:
- Without inlining: Function call overhead = ~10 cycles
- With inlining: Single instruction = ~1 cycle
At 100,000 packets/sec with 10 byte order conversions per packet, inlining saves 9 million CPU cycles per second.
Memory Barriers and Atomic Operations
In multi-threaded routers, ARP cache updates need atomic operations:
/* Naive (WRONG - race condition) */
new_entry->next = cache->entries;
cache->entries = new_entry; /* Another thread could have pushed between these two lines! */
/* Correct (compare-and-swap atomic) */
do {
old_head = cache->entries;
new_entry->next = old_head;
} while (!__sync_bool_compare_and_swap(&cache->entries, old_head, new_entry));
This ensures no lost updates even if two threads insert simultaneously.
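The same push can be written with C11 <stdatomic.h> instead of the GCC __sync builtin; a sketch with the same semantics:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Generic list node (illustrative, not a codebase type). */
struct node {
    struct node *next;
    int value;
};

/* Lock-free push: retry the compare-exchange until no other thread
 * has changed the head between our read and our swap. On failure,
 * atomic_compare_exchange_weak reloads `old` with the current head. */
static void push(struct node *_Atomic *head, struct node *n)
{
    struct node *old = atomic_load(head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(head, &old, n));
}
```

The weak variant may fail spuriously on some architectures, which is harmless inside a retry loop and often compiles to a cheaper instruction than the strong variant.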
Edge Cases & Pitfalls
Concurrency: The ARP Queue Race
Scenario: Packet arrives while ARP reply is being processed.
Thread 1 (Packet Handler): Thread 2 (ARP Reply Handler):
────────────────────────────── ─────────────────────────────
forward_ip_packet()
├─ ARP lookup: MISS
├─ Queue packet handle_arp_reply()
└─ About to send ARP request ├─ Insert ARP entry
└─ Send queued packets
├─ Send ARP request
└─ (Pointless - already resolved!)
The Bug: We send a duplicate ARP request after resolution completes.
The Fix:
pthread_mutex_lock(&cache->lock);
struct sr_arpreq* req = sr_arpcache_queuereq(&cache, next_hop_ip, packet, len, iface);
/* Check if ARP was resolved while we were acquiring lock */
struct sr_arpentry* entry = sr_arpcache_lookup(&cache, next_hop_ip);
if (entry != NULL) {
/* Race detected - ARP resolved while we were queueing */
sr_arpreq_destroy(&cache, req); /* Remove queued packet */
/* Send packet directly (we have MAC now) */
send_packet_with_mac(sr, packet, len, entry->mac, iface);
free(entry);
} else if (req->times_sent == 0) {
send_arp_request(sr, next_hop_ip);
req->times_sent = 1;
}
pthread_mutex_unlock(&cache->lock);
Security: IP Spoofing and Reflection Attacks
Attack 1: ICMP Echo Reflection
Attacker sends: ICMP Echo Request with spoofed source IP = victim's IP
Router replies: ICMP Echo Reply → Victim receives unsolicited ping
(Amplified if attacker sends thousands of requests)
Defense (Not Implemented):
/* Rate limit ICMP replies per source IP */
if (icmp_rate_limit_exceeded(ip_hdr->ip_src)) {
log("Rate limiting ICMP from %s", inet_ntoa(ip_hdr->ip_src));
return;
}
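A token bucket is the usual shape of such a limiter. A sketch (the struct and function are ours; the timestamp is injected explicitly so the logic needs no clock and stays testable):

```c
#include <stdint.h>

/* Hypothetical per-source token bucket: allow at most `rate` ICMP
 * replies per second, with bursts up to `burst`. */
struct token_bucket {
    uint32_t tokens;    /* tokens currently available */
    uint32_t burst;     /* bucket capacity */
    uint32_t rate;      /* tokens added per second */
    uint32_t last_sec;  /* timestamp of last refill (seconds) */
};

/* Returns 1 if the packet is allowed, 0 if it should be dropped. */
static int tb_allow(struct token_bucket *tb, uint32_t now_sec)
{
    uint32_t elapsed = now_sec - tb->last_sec;
    if (elapsed > 0) {
        uint32_t add = elapsed * tb->rate;
        tb->tokens = (tb->tokens + add > tb->burst)
                         ? tb->burst
                         : tb->tokens + add;     /* refill, capped */
        tb->last_sec = now_sec;
    }
    if (tb->tokens == 0) return 0;   /* rate limit exceeded: drop */
    tb->tokens--;
    return 1;
}
```

In a real router you would keep one bucket per source (or per /24) in a hash table, and fall back to a global bucket under memory pressure.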
Attack 2: TTL=0 on Input
Attacker sends IP packet with TTL=1
Router decrements TTL → 0
Router sends ICMP Time Exceeded → Attacker learns router IP
(Used for network mapping)
Defense (note: dropping silently also breaks traceroute, which depends on Time Exceeded replies):
if (ip_hdr->ip_ttl <= 1) {
/* Drop silently instead of sending ICMP */
return;
}
Memory: The Queued Packet Leak
The Scenario:
1. Packet A queued for ARP resolution
2. 5 seconds pass, no ARP reply
3. Packet B queued for same IP
4. ARP timeout thread runs
5. Sends ICMP unreachable for packets A and B
6. Destroys ARP request
The Bug (Subtle):
void send_icmp_for_queued_packets(struct sr_arpreq* req) {
struct sr_packet* pkt = req->packets;
while (pkt != NULL) {
send_icmp_unreachable(sr, pkt->buf, pkt->len);
pkt = pkt->next;
/* BUG: We're not freeing pkt or pkt->buf! */
}
}
The Fix:
void send_icmp_for_queued_packets(struct sr_arpreq* req) {
struct sr_packet* pkt = req->packets;
while (pkt != NULL) {
struct sr_packet* next = pkt->next; /* Save before freeing */
send_icmp_unreachable(sr, pkt->buf, pkt->len);
free(pkt->buf); /* Free packet data */
free(pkt); /* Free packet structure */
pkt = next;
}
req->packets = NULL; /* Mark queue as empty */
}
Performance: The Checksum Recalculation Bottleneck
The Problem: Every forwarded packet requires full IP checksum recalculation (O(n) over header bytes).
Optimization (Incremental Checksum Update):
Since we only modify TTL (1 byte), we can update checksum incrementally:
/* Instead of full recalculation: */
ip_hdr->ip_sum = 0;
ip_hdr->ip_sum = cksum(ip_hdr, sizeof(sr_ip_hdr_t)); /* ~20 CPU cycles */
/* Incremental update (RFC 1624) - TTL occupies the HIGH byte of its
 * 16-bit header word, hence the << 8 shifts: */
uint8_t old_ttl = ip_hdr->ip_ttl;
ip_hdr->ip_ttl--;
/* Update checksum algebraically */
uint32_t sum = ~ntohs(ip_hdr->ip_sum) & 0xFFFF;
sum += (~(old_ttl << 8) & 0xFFFF) + ((old_ttl - 1) << 8);
sum = (sum & 0xFFFF) + (sum >> 16); /* Fold carry */
sum = (sum & 0xFFFF) + (sum >> 16); /* Fold again */
ip_hdr->ip_sum = htons(~sum & 0xFFFF); /* ~3 CPU cycles */
Speedup: 6.7x faster for TTL-only changes.
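That equivalence is worth sanity-checking. A self-contained harness (ours) that applies the RFC 1624 update, shifting by 8 because TTL is the high byte of its 16-bit word, and compares against a full recomputation:

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 checksum over big-endian 16-bit words (even length). */
static uint16_t full_cksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i += 2)
        sum += (uint32_t)((p[i] << 8) | p[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* RFC 1624 incremental update for a TTL decrement. `cksum` is the
 * stored checksum read as a host-order value of the big-endian field.
 * Only the difference between old and new word matters, so the
 * unchanged protocol byte can be ignored. */
static uint16_t ttl_dec_cksum(uint16_t cksum, uint8_t old_ttl)
{
    uint16_t old_word = (uint16_t)(old_ttl << 8);        /* TTL is high byte */
    uint16_t new_word = (uint16_t)((old_ttl - 1) << 8);
    uint32_t sum = ~cksum & 0xFFFF;
    sum += (~old_word & 0xFFFF) + new_word;
    sum = (sum & 0xFFFF) + (sum >> 16);  /* fold carry */
    sum = (sum & 0xFFFF) + (sum >> 16);  /* fold again */
    return (uint16_t)(~sum & 0xFFFF);
}
```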
Conclusion
What You Actually Learned
This wasn’t just “forward some packets.” You mastered:
- State Machine Coordination: Implicit state machines (ARP pending → resolved → send queued packets) with lock-free synchronization points.
- Zero-Copy I/O: Pointer arithmetic over raw buffers instead of struct copies, the foundation of kernel networking stacks (Linux, BSD).
- Performance-Aware Data Structures: Understanding when O(n) is acceptable (10 interfaces) vs when you need O(1) (1M cache entries).
- Memory Lifecycle Management: Tracking ownership across async operations (who frees the packet: sender? receiver? timeout thread?).
- Hot Path Optimization: Identifying the critical path (ARP cache hit → forward) and optimizing it vs cold paths (ARP miss → queue).
Real-World Applications
This exact architecture appears in:
- Linux Kernel Networking: netif_receive_skb() in net/core/dev.c uses a similar protocol-dispatch state machine
- DPDK (Data Plane Development Kit): High-performance NICs use zero-copy packet processing
- P4 Programmable Switches: Hardware pipelines implement this logic in ASICs at terabit speeds
- Container Networking (CNI plugins): Docker/Kubernetes use similar packet forwarding with NAT
Next Steps
- Optimize It: Implement Patricia trie for O(log n) longest-prefix match
- Harden It: Add rate limiting, anti-spoofing checks, and sanity validators
- Scale It: Port to DPDK for user-space packet processing (100Gbps+)
- Measure It: Add perf counters for packets/sec, cache hit rates, and latency percentiles
- Visualize It: Live flame graphs showing where CPU time is spent per packet
The Ultimate Challenge
Rewrite this using io_uring (Linux 5.1+) for asynchronous packet I/O:
io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
io_uring_prep_recvmsg(sqe, sockfd, &msg, 0);
io_uring_submit(&ring);
/* Packet arrives asynchronously, callback invokes sr_handlepacket */
This eliminates system call overhead, enabling 10M+ packets/sec on commodity hardware.
You’ve now dissected a production router’s innermost loop. The next time you ping google.com, you’ll see it differently—not magic, but 100 lines of carefully orchestrated pointer arithmetic, state machines, and zero-copy buffers.
Welcome to systems programming at the speed of light.