State machines are a fundamental part of any payments system. Managing payment state is critical to providing a trustworthy and reliable service, which is why you can find state machines everywhere. Beyond abstracting away complexity, I believe they’re such a great fit because payment processing is in the business of money movement, and with movement there must be a destination, whether the payment succeeds or not.
All their accolades aside, state machines are only as good as your data hygiene. A system that tolerates incorrect object state or even permits the occasional anomaly undermines itself. Unreliable state paves the way for bugs and can severely impact an engineering team’s ability to add new features. After all, how do you expect to account for edge cases when your data isn’t consistent enough to convey anything meaningful?
The onus is on you to keep object state pristine. Let’s take a look at how you can tell if your data hygiene is questionable.
Checking your scent
Despite their common functionality, every payment system has its own nuances, so data hygiene smells show up differently in each. Fortunately, we can use SQL to uncover two of the most common scenarios with minimal effort.
Objects stuck in an invalid state
Using Spree’s state machine as a base, let’s pretend the Payment object should stay in a “processing” state for no more than a day before it transitions to either “completed” or “failed.” We expect “processing” to be temporary, and any objects outside our expectation are cause for concern.
SELECT count(*) FROM payments WHERE state = 'processing' AND processing_state_at < (NOW() - INTERVAL 1 DAY)
Invalid collaborating objects
Imagine we have an Order that has one Invoice. For simplicity, they both start in a “pending” state and ideally end up in “settled.” It’d be concerning if we found an Order in a “pending” state if its Invoice were “settled.”
SELECT count(*) FROM orders JOIN invoices ON orders.invoice_id = invoices.id WHERE orders.state = 'pending' AND invoices.state = 'settled'
If either of the above queries returns a nonzero count, you likely have data hygiene issues. Both queries are dead simple, but the same approach can quickly reveal what kinds of state issues your application is silently permitting.
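To try the two checks without touching a real database, here is a minimal sketch using Python's built-in sqlite3 module. The schema, the sample rows, and the function names are assumptions made up for illustration, and the date arithmetic uses SQLite's datetime('now', '-1 day') in place of the INTERVAL syntax shown above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payments (
            id INTEGER PRIMARY KEY,
            state TEXT NOT NULL,
            processing_state_at TEXT
        );
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            state TEXT NOT NULL,
            invoice_id INTEGER REFERENCES invoices (id)
        );

        -- One payment stuck for three days, one still within the window,
        -- and one that already completed.
        INSERT INTO payments (state, processing_state_at) VALUES
            ('processing', datetime('now', '-3 days')),
            ('processing', datetime('now', '-1 hour')),
            ('completed',  datetime('now', '-5 days'));

        -- Order 1 is still pending although its invoice has settled.
        INSERT INTO invoices (id, state) VALUES (1, 'settled'), (2, 'pending');
        INSERT INTO orders (state, invoice_id) VALUES ('pending', 1), ('pending', 2);
        """
    )
    return conn


def count_stuck_payments(conn: sqlite3.Connection) -> int:
    """Payments that have been 'processing' for more than a day."""
    row = conn.execute(
        "SELECT count(*) FROM payments "
        "WHERE state = 'processing' "
        "AND processing_state_at < datetime('now', '-1 day')"
    ).fetchone()
    return row[0]


def count_mismatched_orders(conn: sqlite3.Connection) -> int:
    """Pending orders whose invoice is already settled."""
    row = conn.execute(
        "SELECT count(*) FROM orders "
        "JOIN invoices ON orders.invoice_id = invoices.id "
        "WHERE orders.state = 'pending' AND invoices.state = 'settled'"
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print("stuck payments:", count_stuck_payments(conn))
    print("mismatched orders:", count_mismatched_orders(conn))
```

With the sample rows above, each check finds exactly one offending row. Against a real schema, the table and column names would need to match your own.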
One of my favorite things to do in a new codebase is to turn the state machines into queries and use SQL to ask questions. More often than not, I can uncover behavior within a domain that was either unintentional or never explicitly documented. This is especially useful when working in large codebases that have seen churn throughout the years.
Keep it clean
In no particular order, here are some suggestions to get your state management on the right track.
Never bypass a state machine
State machines do no good when objects are updated manually. As innocent as it may seem, bypassing transitions is a great way to miss out on valuable validations or necessary interactions with collaborators. Bugs introduced this way are some of the most pernicious and difficult to detect.
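One way to make bypassing harder is to expose state as read-only and route every change through a single transition method. The sketch below is a hypothetical, stripped-down example: the Payment class, the transition table, and the state names are assumptions for illustration, not Spree's actual implementation.

```python
class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition table."""


# Hypothetical transition table: current state -> states it may move to.
TRANSITIONS = {
    "checkout": {"processing"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


class Payment:
    def __init__(self) -> None:
        self._state = "checkout"
        self.history = []  # (from_state, to_state) pairs, oldest first

    @property
    def state(self) -> str:
        # Read-only: there is no setter, so `payment.state = ...` raises
        # AttributeError instead of silently skipping validation.
        return self._state

    def transition_to(self, new_state: str) -> None:
        allowed = TRANSITIONS.get(self._state, set())
        if new_state not in allowed:
            raise InvalidTransition(f"{self._state} -> {new_state} is not allowed")
        # Validations and collaborator updates would hook in here.
        self.history.append((self._state, new_state))
        self._state = new_state


if __name__ == "__main__":
    payment = Payment()
    payment.transition_to("processing")
    payment.transition_to("completed")
    print(payment.state, payment.history)
```

This only guards the in-memory object. A raw UPDATE against the database would still bypass it, which is why the SQL checks remain useful as a backstop.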
Webhooks are an optimization
Many PSPs use webhooks to send meaningful events, and it’s tempting to set up some handlers and call it a day. No matter how reliable they are, webhooks can still arrive late, out of order, or not at all, so the onus is on your application to ensure each object’s state is correct, and that logic should be baked into your domain.
One solution would be to go the traditional route and poll your PSP for updates. There are a lot of ways to skin that cat, but Kill Bill, an open-source billing and payment platform, has a robust solution called janitor that’s worth taking a look at. In essence, it acknowledges that some state is temporary and runs in the background to resolve transitions.
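As a rough illustration of the polling idea (not Kill Bill's janitor, whose internals are not described here), the sketch below walks over payments still in "processing", asks the PSP for their current state, and applies a transition when the PSP reports a terminal state. The function names, the dictionary shape of a payment, and the two callbacks are all assumptions; fetch_remote_state stands in for whatever status lookup your PSP offers.

```python
from typing import Callable, Dict, Iterable, List

Payment = Dict[str, str]  # hypothetical shape: {"id": ..., "state": ...}
TERMINAL_STATES = {"completed", "failed"}


def reconcile_processing(
    payments: Iterable[Payment],
    fetch_remote_state: Callable[[str], str],
    transition: Callable[[Payment, str], None],
) -> List[str]:
    """Resolve payments stuck in 'processing'; return the ids that were resolved."""
    resolved = []
    for payment in payments:
        if payment["state"] != "processing":
            continue  # only temporary states need reconciling
        remote_state = fetch_remote_state(payment["id"])
        if remote_state in TERMINAL_STATES:
            # Delegate to the state machine rather than assigning directly.
            transition(payment, remote_state)
            resolved.append(payment["id"])
    return resolved


if __name__ == "__main__":
    # Fake PSP responses, for demonstration only.
    remote = {"p1": "completed", "p2": "processing"}
    local = [
        {"id": "p1", "state": "processing"},
        {"id": "p2", "state": "processing"},
    ]

    def apply(payment: Payment, new_state: str) -> None:
        payment["state"] = new_state

    print(reconcile_processing(local, remote.__getitem__, apply))
```

Run on a schedule, a job like this makes webhooks a latency optimization: if an event is missed, the next pass still resolves the payment.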
Set up monitoring
There’s no excuse for lack of insight. Turn SQL queries into anomaly alerts or aggregate log events – do something to ensure you can quickly and easily know if your state is misbehaving. If your team performs retros for bugs/incidents, a good practice to introduce could be to find out how a monitor could have detected the issue sooner.
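A small sketch of the "SQL query as alert" idea: each check pairs a name with a count query and a threshold, and a runner reports every check whose count exceeds its threshold. StateCheck and run_checks are hypothetical names, and SQLite is used only to keep the example self-contained; how the alerts reach your pager or dashboard is left out.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class StateCheck:
    name: str
    sql: str            # must return a single row with a single count
    threshold: int = 0  # alert when the count is greater than this


def run_checks(conn: sqlite3.Connection, checks: Iterable[StateCheck]) -> List[str]:
    """Run each check and return one message per check that breached its threshold."""
    alerts = []
    for check in checks:
        (count,) = conn.execute(check.sql).fetchone()
        if count > check.threshold:
            alerts.append(f"{check.name}: {count} rows (threshold {check.threshold})")
    return alerts


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, state TEXT)")
    conn.executemany(
        "INSERT INTO payments (state) VALUES (?)",
        [("processing",), ("completed",)],
    )
    checks = [
        StateCheck(
            name="payments in processing",
            sql="SELECT count(*) FROM payments WHERE state = 'processing'",
        ),
    ]
    for alert in run_checks(conn, checks):
        print(alert)
```

Keeping the checks as data makes it cheap to add one after every incident retro.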
Clean up after bugs
If a bug caused state to get out of sync, set the precedent that your team must correct the objects. Time and time again, I’ve seen bugs get fixed, but the fallout is left behind because it doesn’t impact the user. That incorrect data will make it harder for future engineers to understand the objects' behavior or intent. Always clean up.