Orchestra, the developers of the wildly popular Mailbox mobile app, have a problem every app developer dreams of having: they need to seriously scale, really fast. Few developers, however, can claim to have scaled to one million users in just six weeks with hardly a glitch.
To learn from Mailbox's success, ReadWrite sat down with Orchestra's Mailbox engineering lead, Sean Beausoleil. Alongside now-common refrains such as the need to iterate continually on a project, Beausoleil offers advice that may be novel to many: sharply limit the number of moving infrastructure parts, and use a reservation system to control the load on a new service.
Planning For Scale
ReadWrite: Orchestra took Mailbox to 1 million users in just six weeks. Did you expect that level of success?
Beausoleil: We always planned for a large scale system because email is a high-volume data problem, but we weren't expecting the demand that we saw. When we launched our video back in December, we were hoping for 100,000 views as we ramped towards product launch, but the video received that in under four hours, much to our surprise. This initial interest made it quite apparent that we would need to support more ambitious scale than we had been expecting.
ReadWrite: How do you plan for that kind of growth? You were very deliberate about how you staged your launch. Please walk through the process by which you've scaled up the infrastructure behind Mailbox.
Beausoleil: I think there were three critical phases in the evolution of the Mailbox infrastructure that led to our current scale.
- Designing, iterating and building with scale and correctness in mind;
- Simulating large scale as best we could; and
- Reacting to production load: developing and executing rapidly to scale just in time.
Since we were dealing with email and email is business critical, we designed the system with scalability, availability and correctness in mind. Our goal was to design a scalable system while forcing ourselves to move quickly and iterate on our system and product. In order to do so, we built a modularized system and relentlessly iterated on each component.
In order to find as many incorrect assumptions and bottlenecks as we could before launch, we built a clone of our system and an IMAP server that simulated production load. This allowed us to find limitations and problems with our backend that would have been painful to fix while trying to keep the system alive.
However, we knew that we weren't going to build a perfect system on day one. When building software, assumptions and constraints change rapidly as the problem and your understanding of the problem evolve, so you need to bake in learning and adjustment time as a necessary part of the process. After seeing the initial demand driven by the video and realizing that the system needed time to evolve, we decided that we had to build a reservation system to help us control the load on our system.
Our highest priority was ensuring that everyone already using the app to manage their email continued to have a great experience.
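[Editor's illustration: the interview does not describe how the reservation system was built. The sketch below is not Orchestra's code. It is a minimal first-come, first-served waitlist that admits new users only while a capacity limit allows, so existing users are never crowded out. The class name, method names and capacity numbers are all hypothetical.]

```python
from collections import deque


class ReservationQueue:
    """Hypothetical FIFO waitlist that admits users only while capacity remains."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed limit on users the backend can serve now
        self.active = set()        # users already admitted to the app
        self.waiting = deque()     # users holding a reservation, in arrival order

    def reserve(self, user_id):
        """Put a user in line and return their 1-based position (0 if already admitted)."""
        if user_id in self.active:
            return 0
        if user_id not in self.waiting:
            self.waiting.append(user_id)
        return self.waiting.index(user_id) + 1

    def admit(self):
        """Admit waiting users in arrival order until the capacity limit is reached."""
        admitted = []
        while self.waiting and len(self.active) < self.capacity:
            user_id = self.waiting.popleft()
            self.active.add(user_id)
            admitted.append(user_id)
        return admitted

    def raise_capacity(self, new_capacity):
        """Raise the limit after the backend has been scaled, then admit more users."""
        self.capacity = max(self.capacity, new_capacity)
        return self.admit()


if __name__ == "__main__":
    queue = ReservationQueue(capacity=2)
    for user in ("ana", "ben", "cho"):
        print(user, "is number", queue.reserve(user), "in line")
    print("admitted:", queue.admit())
    print("admitted after scaling:", queue.raise_capacity(3))
```

[The point of the design is that the admission rate is decided by the operators rather than by demand: the limit is raised only after the backend has been fixed or scaled, which matches the priority described above of protecting people already using the app.]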
For several weeks after we launched, our entire engineering team worked literally around the clock to identify issues and fix them so that we could continue allowing people into the app. This phase was the raw horsepower behind scaling so quickly. Various pieces of the core infrastructure were either tweaked, sharded, or removed entirely as we learned how our data and users behaved.
Limiting The Number Of Moving Parts
ReadWrite: How did you select the components of your infrastructure? Was it technology that you had used on the to-do app?
Beausoleil: The components of our infrastructure are another story of iteration and evolution. Since Mailbox itself was really just an iteration on our to-do app, the technology similarly evolved out of our Orchestra To-Do backend. We were privileged with an opportunity that very few startups are fortunate to have: the chance to completely rewrite our entire system. This allowed us to take the things we knew worked, scrap the parts that we knew didn't (both technologies and code that we had written), and start fresh.
However, as we were developing Mailbox, we discovered that some previous technologies we were using either weren't going to cut it given our new constraints or were simply not the right fit for what we needed. So we spent quite a bit of time vetting all of the options available.
For example, I remember one weekend where we built a huge whiteboard matrix with a dozen database options and the pros/cons of each. That allowed us to make the best decision we could at the time for our system. And then we just ran with it.
Our iPhone app was a pretty direct evolution of Orchestra To-Do, though. We took the data parsing and networking frameworks that we had built to facilitate real-time messaging for Orchestra To-Do and iterated on the pieces that needed improvement. We also took what we had learned about building an efficient and responsive iOS UI and applied it to our own custom front-end framework, which makes the app draw quickly and feel fast.
One principle we stuck to was keeping the number of different technologies to a minimum. We didn't want to have to become experts in 20 different things while building out our system. We wanted to become really good at three things and focus as much as we could on our product.
ReadWrite: Mailbox's infrastructure runs in the cloud. Did you ever consider building out the infrastructure in your own data center? Any thoughts of moving to a dedicated data center as you grow?
Beausoleil: A dedicated data center requires a lot of resources and up-front commitment. We were just a small team trying to build out a large-scale backend. We didn't have the resources to manage a dedicated data center. AWS [Amazon Web Services] was an awesome partner, giving us the flexibility to iterate and scale out our system. The platform proved to be both cost-effective and efficient for our team to build on top of, which was a necessity given our limited resources and tight timeline.
Expecting The Unexpected
ReadWrite: Your launch wasn't without glitches. At one point, messages wouldn't load, which Orchestra blamed on an "unusual server issue." Looking back, is this something you could have anticipated? Should you have anticipated it, or was it a known unknown?
Beausoleil: I think hindsight is always going to be 20/20. When you look back at any issue with any piece of software that you write, you could say that you might have prevented the issue or should have caught the bug because of reason X. But that's because you now understand part of the problem or have insight into some code path that you didn't have before. It's nearly impossible to write perfect software on your first try, and while I think there are things we could have done to prevent various issues or catch them earlier, I'm sure something else would have popped up. When you launch something with any kind of scale, there will invariably be things that fail and need to be fixed.
ReadWrite: Were you to launch over again, what would you have done differently? Are there components of your stack that you've found work less well than you had hoped, that you're hoping to replace?
Beausoleil: Again, hindsight is 20/20. Knowing what we know now, knowing which parts broke and how we were able to fix them, we would definitely have improved certain things. But then we wouldn't have had the opportunity to experience building and scaling our system in such a short amount of time. The end result is awesome, but the journey is what really matters.
We are continually improving everything that we're doing and trying to make our system more efficient and faster for the end user. I think that will be a forever effort. There's always a better way to do something; some are just harder than others to achieve.
Advice To Other Startups Hoping To Scale
ReadWrite: Any other advice you'd share with startups hoping to get to Mailbox scale?
Beausoleil: Iterate, iterate, iterate. Whatever your current state is, it can be better. Just keep going at it and keep making it better. Iterating through your initial assumptions will lead to a better overall architectural design, and eventually to a more scalable system as you learn how your implementation and data behave.
Details matter. Be obsessed with the details, but don't let them get in the way of executing quickly. It requires a lot of really hard work to do so, but is worth every ounce of effort.