Belt & Suspenders

An interesting post on the Tindie forums described how PayPal seemed to have stopped sending IPNs (Instant Payment Notifications). The notifications simply never arrived, and there was no notice from PayPal. Apparently this had been going on for several days, and it’s possible PayPal didn’t even know about it. Well, that’s annoying.

After reading the post, I found myself a bit grumpy… but not at PayPal. Why was Tindie relying on IPNs so heavily that missing them would actually harm its business? I wrote this reply to the article on Hacker News:

You can make an API call to check the status of a transaction very easily.

When my users are redirected back to my site (thanks page, or similar), I check whether their transaction is completed. If not, I kick off an every-five-seconds check while asking the user to hold on while we talk with PayPal. I will eventually fail after some number of checks, of course, but this means PayPal can stop sending IPNs and everything will keep going along just fine.

If the user might not end up back on your site for some reason, run a cronjob that tries to verify transactions created in the past day/hour/whatever.

An issue like this doesn't have to, and really shouldn't, cripple your business.
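
To make that reply a little more concrete, here is a minimal sketch of both ideas in Python: a bounded poll for the thanks page and a sweep that a cron job could run. It is not the code I actually run. `Transaction`, `poll_until_complete`, `reconcile_pending`, and the `is_completed` callback are all hypothetical names made up for illustration; `is_completed` stands in for whatever transaction-lookup call your payment provider offers, and the default limit of 12 checks is an arbitrary assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Transaction:
    """Hypothetical local record of a payment we are waiting on."""
    txn_id: str
    status: str = "pending"  # "pending" or "completed"


# Stand-in for the provider's status lookup (an assumption, not a real API):
# takes a transaction ID and returns True once the payment is completed.
StatusCheck = Callable[[str], bool]


def poll_until_complete(
    txn: Transaction,
    is_completed: StatusCheck,
    interval_seconds: float = 5.0,
    max_checks: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check the transaction every few seconds, giving up after max_checks."""
    for attempt in range(max_checks):
        if is_completed(txn.txn_id):
            txn.status = "completed"
            return True
        # Wait before the next check, but not after the final one.
        if attempt < max_checks - 1:
            sleep(interval_seconds)
    return False


def reconcile_pending(
    transactions: Iterable[Transaction],
    is_completed: StatusCheck,
) -> List[str]:
    """Cron-style sweep: re-check every pending transaction once.

    Returns the IDs that were found to be completed during this sweep.
    """
    newly_completed = []
    for txn in transactions:
        if txn.status == "pending" and is_completed(txn.txn_id):
            txn.status = "completed"
            newly_completed.append(txn.txn_id)
    return newly_completed
```

The poll covers the user who comes back to the site; the sweep covers the one who never does. If an IPN does arrive, it just becomes a third path that marks the same transaction as completed.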

That pretty much sums up my PayPal-IPN-specific thoughts on the subject, but the whole situation got me thinking about how we so often just assume things will keep working. A reply on HN from mcguire succinctly referred to my approach above as Belt & Suspenders.

…and an elastic waistband, too.

As software developers, we like to think our code works. Maybe not their code, but certainly our code. The problem is that all code has someone behind it who thinks it works just fine. PayPal’s engineers presumably have a hole in their testing (or something like it) that lets them believe everything is working, even though no IPNs have been sent for 9 days (to at least some segment of applications, anyway).

Maybe PayPal is in a unique position here, where they don’t actually care that IPNs aren’t being sent. It doesn’t significantly impact their business if your code doesn’t account for a lost IPN. What part of your infrastructure can be broken for 9 days with no impact to the bottom line? Is a part of your infrastructure broken right now, and you aren’t aware of it?

Having confidence our stuff is working is a desirable position. If you think your code works then you can go to sleep at night and not wake up until morning. And no nightmares! But, unfortunately, you should be having nightmares.

How much code that isn’t yours does your code interact with while you are sleeping? Thinking about my own products, CourseCraft and Bugrocket, here is a list off the top of my head. There’s the important AWS stuff, like PostgreSQL, plus the fact that both apps sit behind ELBs, are routed to via Route 53, and upload files to S3. Then there’s Stripe, PayPal, and Keen for white-labelled analytics. Bugrocket adds MongoHQ, Websolr, and Braintree for subscription payment processing. Both apps use Postmark for transactional mail. The list gets much bigger if I start including projects for clients.

All of these fail sometimes. You probably have a similar list of things that break from time to time. Do you have some kind of… belt, suspenders, and an elastic waistband in place for when they fail?

I don't believe you.

If not, you should probably go (run) and do something about that!

Especially you, PayPal :)