There was an Old Lady who Swallowed a Fly
In May of 2017, I estimated the size of my crowd, walked up to the keynote stage, and confidently pre-announced a real-time security analytics product in front of several thousand of our best customers.
The adventure that unfolded over the next two and a half years provides the backdrop against which I co-founded Grainite. I’m sharing the story because over the last five years, not much has changed in how we build streaming applications. However, many more of us (businesses, business owners, engineers, and operators) are on this journey now.
The Strategic Imperative
We were almost late to the market. We had a ton of relevant data. We were in the pole position to harness this data for security signals. The opportunity was ours to lose. Given the strategic push, I figured we could get this shipped before the year was up. All that remained… was to build the app on top of one of those data pipelines. How hard could that be?
Ubiquity of streaming applications
After all, everyone was building these pipelines. Hardly a day would go by without hearing of a Google or a Facebook or an Amazon revealing an astounding new capability, powered by some new-fangled data pipeline.
And it was not just the internet behemoths either. There were startups, some of them - competing with us (those ankle-biters!) - who were proudly parading their pipeline prowess.
If they could build one with five engineers, surely we could too, with a team of sharp-shooters ten times that, and time-to-market squarely in their cross-hairs.
Among the Fortune 5000, there was growing talk of data intensive streaming applications too. The moniker 360° was often used to describe this. For example - 360° security, or 360° customer data processing, or 360° user personalization, or 360° monitoring and visibility.
Then there was real-time. Real-time fraud detection. Real-time payment processing. Real-time supply-chain. Real-time inventory management. And so on.
Each of these use cases was powered by a streaming application crunching volumes of data, and producing insights in real-time.
The paper napkin design
We researched (erm... googled) the best practices and quickly etched out a sketch. This was roughly how everyone appeared to be building their data pipelines.
We’d have a stream ingestion stage - use Kafka or equivalent for that. Unbeknownst to me, I had swallowed a fly…
To process the ingested data, we’d bring in Spark (n.b. explore Flink too). And just like that, I had swallowed a spider…
To store the results, we’d use a NoSQL database. And I had swallowed a bird…
Pretty soon, it was clear that we’d need a durable state store beyond just the results-database. A scratchpad database. We pulled in another (cheaper) database for the scratchpad. I had swallowed a cat!
To scale reads, we concluded we’d employ one database to serve writes, and a different one to serve reads. I had swallowed a dog.
At our estimated processing rates, our chosen databases would either not scale, or would cost 100s of times more than we were prepared to pay. We needed a cache! And I had swallowed a cow!
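Reduced to code, the napkin sketch already shows how many systems a single event must touch. Here is a toy Python sketch, with in-memory dictionaries standing in for Kafka, the scratchpad, the read and write databases, and the cache (all names are hypothetical stand-ins, not our actual stack):

```python
# Toy model of the paper-napkin pipeline: one event touches five systems.
# Every dict below stands in for a separate distributed cluster.

ingest_log = []   # stand-in for the Kafka topic
scratchpad = {}   # stand-in for the durable scratchpad database
write_db = {}     # stand-in for the write-optimized results database
read_db = {}      # stand-in for the read-optimized replica
cache = {}        # stand-in for the cache in front of the read path

def handle_event(event):
    ingest_log.append(event)                   # 1. ingest
    partial = scratchpad.get(event["key"], 0)  # 2. load intermediate state
    partial += event["value"]                  # 3. the "Spark" business logic
    scratchpad[event["key"]] = partial         # 4. persist intermediate state
    write_db[event["key"]] = partial           # 5. store the result
    read_db[event["key"]] = partial            # 6. replicate to the read side
    cache[event["key"]] = partial              # 7. warm the cache

handle_event({"key": "user-1", "value": 5})
handle_event({"key": "user-1", "value": 3})
print(cache["user-1"])  # 8 -- and four other copies of the same number
```

Seven steps, five systems, and five copies of one number that must be kept consistent. In the real architecture, every step is a network hop to a separate cluster that can fail independently.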
With this many disparate systems, the design felt complex even at the paper-napkin stage. But to the first order, it was what everyone was building. Paper-napkin in one hand and an open checkbook in the other, we set out to turn our vision into reality.
Building the team | Unobtainium
Almost immediately, we ran into a snag. As a Fortune 1000 company, betting on all these open source projects, we wanted a bit more control over our destiny. “Could we hire a committer from each of the FOSS projects we plan to use?” Turns out there are not enough of those to go around.
It took several months just to get engineers with the right set of expertise on board. I jokingly referred to the skills profile we were looking for as Unobtainium. It seemed there was a better chance of finding a buried treasure in my backyard, than of finding an architect with meaningful experience using each of the building blocks we were looking to use in the pipeline.
Building the Application
The pipeline architecture quickly grew a lot more complex. Each of the components was itself a distributed system, with multiple nodes. And within each there were ever more moving parts - some of them, distributed systems in their own right (please welcome: ZooKeeper! I swallowed a horse).
The half-life of streaming technology is ever shrinking
To make matters worse, the half-life of streaming technology is ever shrinking, and we found ourselves chasing newer or different technologies as we ran into problems with the ones we had picked. Should it be Spark or Flink? Do we want to try a graph database, or stick to key-value NoSQL databases? Which kind of consistency (there are so many flavors) is right for us in Azure? The TLA+ spec for each is most helpfully provided - but did we have anyone on the team who could make sense of it? Regardless, with the many copies of the data - in the cache, the scratchpad, the read DB, and the write DB - we could never be sure that what we had was consistent.
In the cloud, and especially with streaming pipelines, we learnt - the hard way - that resiliency is a property of the application, not of the infrastructure. At every step of the way, we had to write extra code for handling failures and crashes - both in our own code, and in our dependencies.
The application developer had to solve for ordering and linearizability. For locking. Idempotency. Deduplication. And on and on.
Resiliency is a property of the application, not the infrastructure
What could have been 200 lines of simple business logic ballooned into 2,000 lines of nuanced distributed-systems code. Weeks turned into months, and months into years. “Everything is harder when you put it in the pipeline,” I found myself saying.
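A taste of how that ballooning happens: even a trivial “count events per user” handler accretes deduplication scaffolding once deliveries can repeat. A minimal sketch, assuming at-least-once delivery, with an in-memory set standing in for what a real pipeline would need as yet another durable store (names hypothetical):

```python
# At-least-once delivery means the same event can arrive more than once.
# The one line of business logic needs a dedup ledger wrapped around it.

counts = {}            # business state: events seen per user
processed_ids = set()  # dedup ledger -- in production, another durable store

def handle(event_id, user):
    if event_id in processed_ids:
        return  # duplicate delivery: must be a no-op (idempotency)
    counts[user] = counts.get(user, 0) + 1  # the actual business logic
    processed_ids.add(event_id)             # record only after the update

# The broker times out and redelivers event "e1":
for event_id, user in [("e1", "alice"), ("e2", "alice"), ("e1", "alice")]:
    handle(event_id, user)

print(counts["alice"])  # 2, not 3 -- the redelivery was absorbed
```

And in a real deployment the counter update and the ledger write live in two different stores, so making them atomic is itself a distributed-systems problem - which is exactly how the line count multiplies by ten.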
Operating the Pipeline
Operating the application at scale was an adventure in itself.
A tell-tale sign of complexity is when simple questions have no simple answers
I had searched for Kafka countless times before, but now I started noticing that if you type “kafka consumer” into Google, it suggests “lag” as the next term.
A tell-tale sign of unwieldy complexity is when simple questions have no simple answers. Questions like:
- how do you back up the system for disaster recovery? The system internally consists of several disparate data stores. What is a consistent backup? Each system we used had its own (and very different) ideas about how it wanted to be backed up. There was no coordination of consistency across the systems.
- how many copies of each piece of data do we make in the system? What is the read and write amplification?
- what is the RPO and RTO?
- what happens if the cache fails? “Wait, we have scratchpad data in there that is not expendable?”
And the answer was almost always: “Let’s get back to you on that one, Abhishek!” “Is it minutes, hours, or days, for recovery?” “We’ll get back to you…”
To add insult to injury, operating multiple clusters of disparate systems wasn’t just complex and inefficient. It was exquisitely expensive. Some days I’d question whether a viable business could ever be built, given our cost of goods.
Running with Shards
Sharding is how data workloads scale out. The more the shards, the more the parallelism. But operating disparate sharded systems comes with a long list of pitfalls. You pay by the shard, but you don’t know how many you should have. You don’t know when you need to add more. And if you do add more, you don’t know that in many systems, you can’t decrease the number of shards later! Wait, what!?
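The trap is easy to demonstrate. With the common `hash(key) % num_shards` placement scheme, changing the shard count silently reassigns most keys, so any per-shard state or per-key ordering is lost. A small sketch (the `stable_hash` helper is just for reproducibility; Python’s built-in `hash()` is salted per process):

```python
import hashlib

def stable_hash(key):
    # Deterministic hash so the demo is reproducible across runs.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def shard_for(key, num_shards):
    return stable_hash(key) % num_shards

keys = [f"user-{i}" for i in range(1000)]

before = {k: shard_for(k, 8) for k in keys}   # the cluster runs with 8 shards
after = {k: shard_for(k, 12) for k in keys}   # someone "just adds" 4 more

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys changed shards")  # the majority move
```

Going the other way is worse: Kafka, for one, lets you add partitions to a topic but never remove them, precisely because existing keys would land on different partitions and per-key ordering would break.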
We found running with shards to be just as perilous as running with scissors!
In operations, we spent an inordinate amount of time managing, coordinating, and matching the shards across disparate systems.
And then there were application level failures that posed a long-tail problem. We had ended up with a complex application. The business logic was sprinkled with code to handle the vagaries of distributed systems. We had a complex data model, few consistency guarantees, performance imperatives, misplaced expectations and flawed assumptions. With that came a never-ending onslaught of difficult-to-debug failures.
In 2019, I found the man with all the answers: Ashish Kumar. At Google, he was responsible for all the databases Google had built, as well as the world’s largest real-time processing pipeline behind Google’s personalization engines. I asked him how they did it at Google.
How would a typical Fortune 5000 enterprise solve this?
“Well, you know that guy who wrote that new paper on distributed mumbo jumbo…? He works in my team. So does the person who invented….” He quickly rattled off the names of several distinguished figures in the world of distributed systems. “In fact, our database team fields the most senior cast of engineers anywhere in Google!”
“But wait”, I stopped him. “It's one thing for Google or Netflix to solve it. How would a typical Fortune 5000 enterprise solve this?”
In the pregnant silence that followed, the idea of Grainite was born.