TCP is an interesting protocol. It’s interesting mostly because it is less a specification of bytes and more a specification of behavior. Most TCP implementations have developed from the sort of arcane knowledge that you can only amass after trying to implement such a basic protocol over a long period of time.
By the time TCP (and indeed the whole IP stack) made it to my desktop, it had been on a long journey. Mac OS X’s network stack has a storied pedigree that goes back deep into the iterations of the original BSD Unix. There is, as they say, heavy voodoo.
Today I hit a corner case that only made sense after some pretty serious debugging. I’m sharing it here in the hope that it may save you the headache.
The backbone of the Internet is designed with the expectation that routes will just disappear for a while. You get a few ICMP messages back if you’re very lucky. Otherwise, your packets might simply vanish.
One of the nice features of TCP is that it’s incredibly resilient to network links just disappearing. This is no problem for it. In fact, if you’re not sending any traffic, you may not even notice that you’re down. Failures being invisible is a nice feature when you’re not doing anything.
That said, there are protocols that really want to know when they’re down. XMPP is one of them. For protocols like XMPP, there is a pretty standard procedure of having some sort of “keep-alive” data that you occasionally send. Since XMPP data streams are just XML documents, most XMPP implementations just send a few whitespace bytes in between stanzas when idle.
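To make the whitespace keep-alive concrete, here is a minimal sketch in Python. The function name `send_whitespace_keepalive` is my own invention for illustration, and it assumes the XMPP stream runs over a plain, already-connected socket (no TLS wrapper).

```python
import socket


def send_whitespace_keepalive(sock: socket.socket) -> bool:
    """Send a single space between stanzas on an idle stream.

    Returns True if the kernel accepted the byte, False if the send
    raised an error. True does NOT prove the peer received anything;
    it only means the local stack did not complain.
    """
    try:
        sock.sendall(b" ")
        return True
    except OSError:
        # Covers resets, broken pipes, and closed descriptors.
        return False
```

The weakness is in the docstring: a keep-alive like this only helps if the stack eventually reports a failure on the socket.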
Today, I was debugging an XMPP connection over a 3G modem, which shows up as a ppp0 link in Mac OS X. Without really thinking about it, I walked around with my laptop. In one spot the modem lost signal and the link dropped. When I noticed, I reconnected the modem. This provoked some very interesting behavior (or rather lack of behavior) from the BSD IP stack.
Normally, when some fundamental aspect of the network changes, something will interrupt your connection. For example, if my XMPP server had lost power, then once it recovered the keepalive packets would have triggered a TCP reset, which breaks the connection. Similarly, if I remove an IP address from a Linux machine, connections on that IP are interrupted. It just so happens that in this case, the IP stack did NOT break the connections.
In fact, it just silently ate any data that the connections attempted to send. So the keepalives completely failed to kill the dead connection. It took almost fifteen minutes before the IP stack noticed that the connection should be killed.
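Since the sends themselves report nothing, one workaround is to stop trusting them and put a deadline on inbound traffic instead. The sketch below is only an illustration of that idea: the `LivenessTracker` class and its method names are hypothetical, and it assumes the peer can be made to send something back regularly (a bare whitespace keep-alive gets no reply, so this needs some kind of request/response ping).

```python
import time


class LivenessTracker:
    """Declare a connection dead when nothing has arrived for too long."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_heard = clock()

    def heard_from_peer(self) -> None:
        """Call this whenever any bytes arrive from the peer."""
        self._last_heard = self._clock()

    def is_dead(self) -> bool:
        """True once the silence has lasted longer than the timeout."""
        return self._clock() - self._last_heard > self._timeout_s
```

In a read loop you would call `heard_from_peer()` after every successful `recv()` and check `is_dead()` on a timer, tearing down and reconnecting when it returns True. The timeout is a trade-off: too short and a brief outage kills a connection TCP would have recovered.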
It took a while to track down what was happening, but apparently the connections were maintained (so says netstat) and the sent packets just disappeared without any sort of send error! Very weird behavior triggered by an odd corner case. I’ve also discovered that this appears to happen when you close your laptop and move to another wireless access point. This is ugly for my use case, as I want the agent on the laptop to reconnect when it has a new address. If anyone has a good way to detect this in a portable way (i.e. not plugging into Apple’s NetKit watchers), please let me know.
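One candidate for a portable check, offered as an untested-in-the-field sketch rather than a known fix: ask the kernel which source address it would pick for the peer right now, and compare that with the address the TCP socket is bound to. Connecting a UDP socket sends no packets; it only makes the kernel do a route lookup. The function names here are my own, the code assumes IPv4, and it cannot detect the case where the link comes back with the same address.

```python
import socket


def current_source_address(peer_host: str, peer_port: int) -> str:
    """Return the local IPv4 address the kernel would use to reach the peer.

    connect() on a UDP socket transmits nothing; it just selects a route.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.connect((peer_host, peer_port))
        return probe.getsockname()[0]


def local_address_changed(tcp_sock: socket.socket) -> bool:
    """True if tcp_sock is bound to an address we would no longer choose."""
    bound_addr = tcp_sock.getsockname()[0]
    peer_host, peer_port = tcp_sock.getpeername()[:2]
    try:
        return current_source_address(peer_host, peer_port) != bound_addr
    except OSError:
        # No usable route to the peer at the moment: treat as changed.
        return True
```

Polling `local_address_changed()` every few seconds and reconnecting when it returns True would be one way to use this. On a multi-homed machine a routing change can trigger it even though the old address is still valid, so treat a True result as a hint to re-check the connection, not as proof that it is dead.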
While I find this mildly annoying, I have to admit that I can’t fault TCP. If the packets are just disappearing, the best behavior is to resend and keep waiting for the connection to come back up, with a reasonably long timeout. This is exactly what happened. Still, I hope that Apple will eventually do what Linux does and push an error into the socket when it tries to send from an address that isn’t valid for that machine anymore.