The Exciting Case of Schrodinger's Packets

Published 2019-04-22

You are probably familiar with the quantum theory thought experiment of Schrodinger's cat, wherein a cat in a box is both alive and dead, but you don't know which until you look inside the box.

In this story, our network packets both arrive and do not arrive, but we don't know which until we look inside the box. Thankfully we can build modern network applications on top of HTTP instead of quantum physics ethernet. For now, anyway.

Disappearing Packets

Several years ago I received a call from the NOC at work informing me that an obscure service was down. I may or may not have been on-call at the time, but when it came to obscure services the NOC usually just called me anyway because I (though technically a software engineer) was good at reverse-engineering long-forgotten production systems that no one remembered existed in the first place.

After some troubleshooting I determined that there was a network problem. Every so often, all network traffic on the box would drop. Nothing came in, and nothing came out. Sometimes for minutes at a time. I called in a more experienced network engineer to help me diagnose the problem.

We ran through every OS-based diagnostic we could think of, and he looked at the routing and firewall logs to see if there was some other kind of network problem. He could see the traffic drop off but there was no apparent explanation for it in either the OS or the network diagnostics.

Since there was a pair of boxes running this service, and only one of them exhibited problems, we ruled out a problem with the wider network. After exhausting our options over SSH and finding nothing, my expert networking colleague chalked it up to faulty hardware. We reconfigured a load balancer to route production traffic to the remaining healthy node and scheduled the fussy one for recommissioning at a later date.

I got back to work and forgot about it.

A few weeks later someone came by my desk and said, "Hey, remember that box that died with service X?" "Oh yeah, I remember." "Ops figured it out. Turns out two machines were assigned the same IP address."

The machine was fine; it had just been misconfigured. Or, perhaps, it was configured properly and its IP assignment was never recorded, and later a new machine was brought online and stole the IP. Except the IP theft would randomly revert.

IP Addressing

Networking is based on a layer cake of protocols that (from the top down) ultimately end up in hardware, with physical things connected to each other. Layer 1 is literally the copper cable that is plugged into the socket. The whole layer cake is called "the OSI model".

IP addresses don't show up until layer 3, and DHCP (commonly used for assigning IPs to various boxes) is layer 7, the Application Layer. There's a lot of black magic and ASICs involved in making networks fast, and a lot of the things we think about, like machine-to-machine communication, don't actually exist at the networking level. They are just abstractions that save us time and mental capacity.
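
For reference, here is a small sketch of the layer cake as a lookup table. The seven layer names are the standard OSI ones; the EXAMPLES mapping and the describe helper are just illustrative names, and the three placements are the ones mentioned above.

```python
# The seven layers of the OSI model, numbered from the bottom up.
OSI_LAYERS = {
    1: "Physical",
    2: "Data Link",
    3: "Network",
    4: "Transport",
    5: "Session",
    6: "Presentation",
    7: "Application",
}

# Where the things mentioned in this post sit (illustrative mapping).
EXAMPLES = {"copper cable": 1, "IP": 3, "DHCP": 7}


def describe(name):
    """Return a one-line description of where `name` sits in the model."""
    layer = EXAMPLES[name]
    return f"{name}: layer {layer} ({OSI_LAYERS[layer]})"


for name in EXAMPLES:
    print(describe(name))
```

Note the gap: the thing that hands out addresses sits four layers above the thing that uses them.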

If we cast around for a simple analogy we'll find that a lot of networking is like plumbing. The water just flows generally around all connected pipes, not specifically through this pipe or that pipe, unless there is a lower-level mechanism (i.e. a cable) that is physically segmented.

Both the router and the network card on your computer need to agree about your IP address in order for you to get the proper traffic. Actually, your computer is getting all kinds of traffic all the time via broadcasts that it ignores. And at the same time the router may be passing along a lot of traffic that your machine doesn't recognize, so the network card or OS just ignores it.

Similarly, the router may not agree that you have the IP you say you do, and in that case it's sending your traffic somewhere else: maybe to another machine, or maybe into a black hole. If there is a network card on the other end that gets your traffic, it doesn't send it back to you, or even back to the router; it just discards it.

(All this is to say nothing of the various switches and other intermediate devices whose maximum understanding may be layer 1, 3, 5, or something else, and whose basic error handling is "forward, repeat, or drop".)

Normally IP works fine, but in the case where two machines are fighting over an IP, it's like someone moving the address for your house so all your mail is delivered to them instead. And when you switch the address back to your house, you get some of your mail but you also get some of their mail, and some of the mail you sent doesn't get where it's supposed to go either (or maybe it does, but you'll never know, because the reply packets were sent somewhere else).

Since the decision about which packet goes where is handled in hardware, it's unlikely you'll be able to see what configuration is being used, or where the traffic is actually going. The IP could flip between machines at any time. DHCP does not prevent this either, because it operates at layer 7, not at layer 3 where packets are addressed and delivered.
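
To make the flip-flopping concrete, here is a toy simulation. It is a sketch under a big assumption: the network keeps a single "who owns this IP" entry and the most recent claim wins, which is loosely how an address-resolution cache behaves but is not a model of any particular router. The names ToyNetwork, box-a, box-b, and the address 10.0.0.5 are all made up for illustration.

```python
import random


class ToyNetwork:
    """A toy last-claim-wins table mapping an IP address to one machine."""

    def __init__(self):
        self.owner_of = {}  # ip -> machine name

    def claim(self, ip, machine):
        # Whoever announced the address most recently gets its traffic.
        self.owner_of[ip] = machine

    def deliver(self, ip):
        # Return the machine that would receive a packet sent to `ip`.
        return self.owner_of.get(ip)


def simulate(steps=10, seed=1):
    """Send one packet per step while two machines fight over one IP."""
    rng = random.Random(seed)
    net = ToyNetwork()
    ip = "10.0.0.5"  # hypothetical address claimed by both machines
    received = {"box-a": 0, "box-b": 0}
    for _ in range(steps):
        # At unpredictable moments one of the two re-announces the address.
        net.claim(ip, rng.choice(["box-a", "box-b"]))
        received[net.deliver(ip)] += 1
    return received


if __name__ == "__main__":
    print(simulate(steps=100))
```

Every packet is delivered to exactly one of the two boxes, so from either machine's point of view the traffic simply stops and starts with no local explanation.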

Once More, with Feeling

A week ago I ran into this problem again in VMware. I was trying to set up SSH to a VM and its clone, but one of them, in spite of having a seemingly valid DHCP lease, was not reachable on the network. Because I was able to poke around at all parts of the problem simultaneously (both VMs, and the virtual network) I noticed both machines were leased the same IP, and both machines were trying to bind to it.

I thought perhaps I had identified the wrong IPs from the DHCP lease table, so I looked at it again. While each machine had a different IP assigned in DHCP, they also had multiple overlapping leases, and at least one of those was invalid and was causing both machines to (attempt to) acquire the same IP. My code wasn't faring any worse than the Linux DHCP client, so I figured it was not a problem I caused.
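
The check I did by eye can be sketched in a few lines. This assumes the lease table has already been parsed into a list of dictionaries with "ip" and "mac" keys; that format, the function name, and the sample addresses are hypothetical, not what VMware actually stores.

```python
from collections import defaultdict


def find_conflicting_leases(leases):
    """Return {ip: sorted MACs} for every IP leased to more than one MAC."""
    macs_by_ip = defaultdict(set)
    for lease in leases:
        macs_by_ip[lease["ip"]].add(lease["mac"])
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


# Hypothetical lease entries: each VM has its own lease, plus one stale
# lease that overlaps with the other VM's address.
leases = [
    {"ip": "192.168.56.101", "mac": "00:0c:29:aa:aa:aa"},
    {"ip": "192.168.56.102", "mac": "00:0c:29:bb:bb:bb"},
    {"ip": "192.168.56.101", "mac": "00:0c:29:bb:bb:bb"},  # stale overlap
]

conflicts = find_conflicting_leases(leases)
print(conflicts)
```

Any IP that shows up in the result is one that two machines may both try to bind to.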

One machine would get network connectivity for a few minutes, and then lose it again when the other one took it back. DHCP was not able to fix the problem in part because the machine that was disconnected was not able to receive DHCP packets, and in part because it was stuck with a stale config because of the borked DHCP lease table which kept handing out bad IPs.

I eventually wiped out the entire DHCP lease table, restarted the virtual network and restarted both machines, and the problem went away. But in the interim there were a lot of misrouted packets.

In Conclusion

Obviously you should not assign two machines the same IP address. However, if this does happen you can expect random, mysterious, total packet loss while everything else (OS, hardware, and even host-specific networking configuration) appears to be fine. It's only when you inspect the network as a whole that you'll be able to identify the problem.

