Charity Majors: Hi. This is a beautiful place. Hi, people. Yay! How many of you here work on platforms? Cool. Good. I also work on platforms. I love them. I have, as my friend Matt Zimmerman used to say, "a platforms problem." Normal systems just seem so boring after you've been building the systems that people build their systems on. Like if it's not fractal badness all the way down, is it even worth solving? As you can probably tell, my background is in operations. I strongly believe that the only good diff is a red diff. I will write code, but only when I have no other choice. [chuckle] Code causes problems. So you might have heard ____ and go, oh, she's gonna try and sell me something, but I swear to God I'm not, partly because I'm bad at selling things, but partly because I actually just have a rant about platforms. So I worked on Parse. I led backend engineering there for three and a half years: pre-acquisition, actually before we went into beta, through the Facebook acquisition, and afterwards. There were zero users in production when I started working on Parse. There were 60,000 when we got acquired, and there were over a million when I left. After which, five months later, they shut it all down, but no hard feelings... [chuckle]
CM: We went through a really... We were doing microservices before they were called microservices. We were doing a lot of cutting-edge stuff, and we experienced a lot of trauma, and Honeycomb, in a very real way, was built out of that trauma. So I am actually gonna tell the story of what we went through: the pain, the recovery, and the problems that we're trying to solve. We don't solve all of it; we solve a lot of it. I hate monitoring. Like, I hate computers, but I really hate monitoring. It's the most depressing form of systems. [chuckle] I have strong feelings about platforms, and I expect many of you do too. So let's talk a little bit about what makes a platform successful from a technical point of view. I started listing off a bunch of... I just started writing down history and what I thought platforms were like, and after a while, it was really late, and I was in a lot of pain. [chuckle] Too much scar tissue. Oh, my God. And I found this beautiful post somebody else wrote for me; very nice of them. And I agree with a lot of it. So I just put the links up with the slides so you can find them. But basically, there's some stuff that's very obvious. You should provide something that people need. Cool. You should have a business plan. Also seems very obvious, but if you're an engineer it's probably not. Takes us a while to figure these things out.
CM: It should be simple. It should be flexible. You don't have users using it, you have developers using it, which is a super different thing. You need to have composable and yet feature-complete building blocks... Am I doing that? I'm probably doing that. You should provide great developer support. And the one that wasn't on that list is stability. I think that's also fairly important. But what these lists have in common is that they're about your platform being understandable. Right? About you being able to ask and answer arbitrary questions, any question. The beautiful and terrible thing about platforms is you can't predict how people are going to use them. Right? If you're working with an engineering team, and they deploy a terrible query that does a 5x full table scan, you can walk over there and throw things at them. If they're in London, you can still shake your angry fist and threaten them with performance review feedback or whatever. You have a certain amount of leverage and control over the developers you work with.
CM: The developers who are using your platform, you're supposed to be nice to. They tell me. [chuckle] And this means that your support people need to be able to understand and answer very complicated new questions, all the time, about your platform, just like your systems engineers or software engineers need to be able to do internally for reliability. Like, at Parse, out of that list, we had definitely built something people wanted; our API was terrific. The business model was shaky, but we would get there. We had great developer support, until we started to grow. We had the intention to provide incredible support. We were so nice. We wrote such great docs. We spent so much time pairing with people. The first problem was that it was built in Ruby on Rails, which was a perfectly respectable choice in 2012.
CM: When we started to scale... I don't know if you've ever used Ruby on Rails and Unicorn, but only one request can be in flight per process at any given time. So you'll be running along, 20 percent of all of your workers in use, and then one of your databases gets slow and suddenly, boom, you're at 100% and everything's just 504ing and 502ing. And we often could not track down why, with any regularity or any predictability. And this brings me to the difference between monitoring and observability. Like all stupid terminology arguments, this is highly contentious, but I don't actually give a shit. Monitoring is very much about: you build a system, you look at it, you try to predict how it's going to break, and you write a bunch of checks, right? So you're just constantly going, "Ah, do you fall within these bounds?" and then you spend your entire life tweaking those checks and those thresholds, because otherwise you're getting woken up. But you have this illusion that you can predict what your system is going to do, and that you would know if it's not healthy in some way.
CM: Observability, the term, is taken from control theory... But I think of it much more as: the world the way it really is. It's about deeply instrumenting your code, your entire stack, everything from your database internals to your own software to whatever you can get out of the third-party software that you're running, which is surprisingly a lot, and then asking questions about what's happening as it surfaces to you. There are no honest narrators in technology, as anywhere else, but it's a hell of a lot closer. Oh, this is my alter ego... Couldn't resist. What would happen to me many times a week, let's say around the time of the acquisition, when we had 60,000 apps: people would just come to me and be like, "Parse is down, I just got an email, or somebody just filed a task, Parse is down," and I'm like, "Parse is not fucking down. Look at my beautiful wall, full of beautiful dashboards, and they're all green. Nothing's wrong!" [chuckle]
CM: Nothing was wrong, really, I guess, sort of. But from that user's perspective, something was. They did not give a shit that my service was up; what they cared about was that, for them, it was timing out, right? And the longer I argued with them, the more I just lost credibility. So we had built a system, with some of the smartest, best engineers I've ever worked with, using all of the best-in-class techniques of the time: metrics and dashboards. And we had a ton of unknown unknowns every day. A metric, for those of you... you all probably know this, but it's basically a dot of information: an integer, a counter, a gauge, whatever. And then you can append a bunch of tags to that dot, so that you can find that dot and other dots like it. Like a host tag, or whatever.
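That "dot plus tags" shape can be sketched as a plain data structure. This is an illustration of the concept, not any particular vendor's API; all the names and values here are made up:

```python
import time

# A metric is one number plus a small, bounded set of tags.
metric = {
    "name": "api.request.latency_ms",
    "value": 312,             # the dot: a single counter/gauge reading
    "timestamp": time.time(),
    "tags": {                 # tags let you find this dot and dots like it
        "host": "web-14",
        "endpoint": "/classes/query",
        "status": "200",
    },
}
```

The crucial limitation is that the tag set has to stay small and low-cardinality, which is exactly the problem described next.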
CM: But the thing about those tags is that there's a hard upper limit, because of write amplification. I think the hard limit on Datadog is 250... is that right, Mike Fiedler? Higher than that? Well, it's a lot. But it doesn't matter; it's still a hard cap, when actually you want tens of millions. And your performance goes to shit as you approach that upper bound of tags. So we had a system with metrics and dashboards, and what would happen was... I should probably fast-forward through some slides; whatever, I'll just tell you stories... People would be like, "Oh, something's wrong," and I would start going through my dashboards, just pattern-matching: "Was it this scenario? Was it that scenario? Is it this scenario? Is it that scenario?" "Oh, I've gone through all my dashboards." Fuck, it's a new problem.
CM: We had all these unknown unknowns every day because we had people being really creative with our systems. And it was very time-consuming. And honestly, this is just a fact about distributed systems: you want events, because the way an event is stored on disk is not a dot with a bunch of tags. It's a wide record, as many key-value pairs as you want, saying all these things happened at once, or this thing happened and here is all the information I have about it. The user ID, the application ID, the platform, the version of the SDK, the available CPU, free memory, whatever. Just throw it all in there, because you don't know what you don't know. You haven't crafted a dashboard for it, and you shouldn't have crafted a dashboard for it, because that's just gonna make you dashboard-blind. If you have all these fucking dashboards (and you probably have dashboards to define your dashboards with your dashboards), with that many dashboards you're not actually doing science. [chuckle] You're not actually formulating a hypothesis and asking a question about your system.
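A wide event of the kind she's describing, sketched as one record per request. Field names and values are illustrative; the point is that there is no fixed schema cap and nothing is aggregated before it's written:

```python
import json
import time

# One wide event per request: as many key-value pairs as you have context for.
event = {
    "timestamp": time.time(),
    "request_id": "req-8f3a",    # lets you find this exact request later
    "user_id": 42017,
    "app_id": "a91b",
    "platform": "ios",
    "sdk_version": "1.2.11",
    "endpoint": "/classes/query",
    "duration_ms": 312,
    "db_rows_scanned": 95000,
    "available_cpu_pct": 12.5,
    "free_memory_mb": 183,
}

# Ship it as one line of structured output; the raw event survives intact.
print(json.dumps(event))
```

Compare this to the metric above: every field here is a potential group-by or filter key, including the arbitrarily high-cardinality ones like user_id and request_id.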
CM: It's not debugging; it's pattern matching with your eyeballs, and I find it exhausting. It must be event-driven, and it must not be pre-aggregated. This is super important! What many time-series databases do is store everything at a one-second interval, and they smoosh everything that happens in that second into one little speck of data. You can never get back to the original source of truth after that. You can never get back to your original event. Boom, you've blown your shot. Events tell stories. And events work the way our brains work. Events make it much easier for you to reconstruct what happened, what it looked like when things happened, and causes and effects.
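A toy illustration of why pre-aggregation is a one-way door. The numbers are made up; the point is that once raw per-request values are smooshed into a summary for the interval, the outlier that explains the incident is unrecoverable:

```python
# Five requests landing in the same one-second bucket, latency in ms.
raw_events = [12, 15, 11, 14, 4900]  # one request stalled on something

# Pre-aggregated storage keeps only a summary for that second...
avg = sum(raw_events) / len(raw_events)  # 990.4 ms: "kind of slow?"

# ...so you can no longer answer "which request was slow, and why?"
# With the raw events still on disk, you can:
worst = max(raw_events)  # 4900 ms: there's your outlier, go find that request
print(avg, worst)
```

From the average alone, "everything got a bit slower" and "one request hit a wall" are indistinguishable; from the raw events, they aren't.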
CM: I think that at Facebook, every API request, every website request that comes in, generates hundreds of Scuba events. Scuba... you can think of it as Honeycomb's godmother. It's their internal, event-driven, exploratory system. ____ destroy your details, and you need more, not less. Honestly, the goal is for... What was I thinking? I just distracted myself. Oh, they're still swelling... Oh, yeah, so you're looking for needles in haystacks. You should be able to locate every single request that comes into your system. If a user accidentally deleted their database, boy, it'd be nice to be able to find that request, right? So it's not exactly a stack of needles; it's more like a flaming trash heap, but you need to be able to find every piece of trash in that flaming trash heap. And the thing about metrics and dashboards is that they squish those down, and they ditch the... whatever, you get what I'm saying. And high cardinality becomes not a nice-to-have; it's an absolute necessity. So, the happy ending to this story: we got acquired by Facebook, and they started pushing all of their crap on us. "Use this, use this, use this." Most of it was not appropriate for us. We got scarred.
CM: And a lot of people kept telling us, "Use Scuba." And I was like, "No, get away from me." But we finally did, because somebody wanted it so badly that they put the edge dataset in, and suddenly it was like, "Oh, my God." Instead of taking hours, or sometimes days, to track down each of these user reports, it would take seconds or minutes, reliably, just "boom, boom, boom." And the key to this was high cardinality. The ability to understand not just what the system looks like... fuck the system, nobody cares about the health of the system anymore. That's ____ things. You care about the health of the system as experienced by every single individual user, or application, or combination of user and application, or combination of user and application broken down by SDK version. Build IDs, what are they? A monotonically, infinitely increasing integer. That's tough to stick in tags unless you're going to be reaping them constantly and doing all this shit. High cardinality is at the root of it. And I feel like we've almost lost a generation to fucking metrics. Kids these days get out of college, and honestly they are better at using explorable, event-driven debugging systems than very senior engineers whose brains have been stuck in metrics for 20 years, because those engineers get locked into assumptions about how data works. They're not how data works; they're how metrics work. I really hate metrics.
CM: Don't tell all the cool kids on the internet that I said that, please. They're already pissed at me. For platforms, the whole point of a platform is that your users are given building blocks to be creative with it. With creativity comes great chaos. [chuckle] You want to be able to isolate that chaos and at least understand it before you decide if you're going to shut them off and throttle them or what. There's also some really cool shit that you can do with these databases. Facebook Scuba was actually developed for them to help them understand how MySQL was behaving as they grew. And you can totally understand why there are very few data debugging tools out there because it requires what? It requires a very wide, rich event. It requires you to have the rawQuery, the host IP, the username, the query family, all this information about it and what do most people know how to do? They know how to find the slow query log, right? And then they're like "oh, but these queries haven't changed in years", "oh, this data hasn't grown either, what's happened?" Well, usually it's because the slow query log is full of slow queries that are reads that are getting slower 'cause they could yield. Rights can't yield, they're getting slower because the right lock is saturated on some row or some table or something.
CM: And if you would just sum up all of the lock time being held, then you could go, "Oh, 97.3% of the lock time is being held by that one user. No, we don't need to reprovision and reshard; we just need to deal with that user." [chuckle] Turns out this is not a hard problem. Databases are not a hard problem. I think the only reason DBAs even exist is because we had such terrible tools for so long. I'm like a third of the way through my slides, and I have three minutes left, so... ____ to the norm. Your system is never up... I skipped one. The health of the system definitely doesn't matter. Think about it. What would you rather have? All of your users' requests are getting served on time, but 50% of your provisioned instances are down, and it's the middle of the night. Do you want to get woken up for that? Maybe. Do you wanna get woken up if all of your graphs and dashboards look great, but Disney is your biggest customer and they're getting blocked at the edge? It's not even a choice.
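That "sum up the lock time" move is just a group-by over wide events. A minimal sketch, assuming each query event carries the user who ran it and the lock time it held; field names and numbers are invented for illustration:

```python
from collections import defaultdict

# Per-query events, each recording who ran it and how long it held the write lock.
events = [
    {"user_id": "acme",   "lock_ms": 4},
    {"user_id": "bigco",  "lock_ms": 3900},
    {"user_id": "acme",   "lock_ms": 6},
    {"user_id": "bigco",  "lock_ms": 4100},
    {"user_id": "tinyco", "lock_ms": 2},
]

# SUM(lock_ms) GROUP BY user_id, in three lines.
lock_by_user = defaultdict(int)
for e in events:
    lock_by_user[e["user_id"]] += e["lock_ms"]

total = sum(lock_by_user.values())
worst_user, worst_ms = max(lock_by_user.items(), key=lambda kv: kv[1])
share = 100 * worst_ms / total
print(f"{worst_user} holds {share:.1f}% of total lock time")
```

Because user_id is a high-cardinality field on the event, this breakdown works for ten users or ten million, which is exactly what the tag-capped metrics model can't do.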
CM: Your system is never really up. I feel like this is the first step of coming to terms with the new world order in distributed systems. We're all distributed systems engineers now, by the way; congratulations. I think you can ask for more money now. [chuckle] But the system's never up, because so many catastrophic states exist all the time, right now. You have so many terrible bugs in your system. So, "sleep well" is a huge lie. Our graphs, our dashboards: they just mean that our tools aren't good enough to find the problems that we have. And I feel like the key knob for the future is really finding a way to take these unknown unknowns and turn them into known unknowns, quickly and predictably. To not have it be this open-ended "I don't know. Never seen this before. Cannot tell you if it's going to be hours or days or weeks or never until I figure it out," because that's an engineering problem. And engineering problems are hard, and we don't like those. We like support problems. Turning engineering problems into support problems is literally our entire job.
CM: Examples of stuff. Doesn't matter. A picture of a ____ flaming trash heap would probably be more appropriate. That's my definition of a platform. Developers are the worst users, right? They expect everything. They expect to pay nothing. They're convinced that they could do your job better than you, and they'll tell anyone on Hacker News who will listen.
CM: I feel like this is really... These are the converging trends that mean... This is why we built Honeycomb. Because that experience of scaling with Scuba was so transformative. It saved us. It added two nines to our reliability numbers. It added a whole lot of sleep for all of us. It let us put all of our software engineers in the on-call rotation. The best engineers I ever worked with at Facebook would spend half their time in their IDE and half their time in Scuba, understanding the consequences of what they had deployed. And this is really how I think about it: it's not necessarily a thing that you look at when things are on fire. It's the kind of thing you need in order to understand what's happening in production after you've hit deploy.
CM: It's insane. I hope and pray that in a few years we look back on these as the dark ages, when we would just ship code and not look at it, not explore it. Just have the habit of going and comparing: does the canary's memory usage look the same as the control's? Is the thing that I think I'm shipping the thing that I actually shipped? Does anything else look weird? It matters. And this is really how to get it: you instrument everything. I don't have time to talk about it, whatever. Know your isolation model, because those points where you have multi-tenant systems are always going to be the ones where you need this sort of understanding. You need to know who's consuming what percentage of the resources, and this is the only way to get it.
CM: See if I have any other interesting things. "Parse is down." "But I have a wall full of green dashboards. You are wrong." Put your engineers in support rotations. I believe in this strongly. Debugging must be explorable. And test before prod, and test in prod. I was gonna get to this; I don't have any time. You have to test in production. This doesn't mean you only test in production; you test before you get to production too. But I hate that meme, the Most Interesting Man in the World: "I don't always test, but when I do, I do it in production." It's done a huge amount of damage. There are so many categories of things, of behaviors, of experiments, of load tests... When you have a large, complex distributed system, you can't just spin up a copy of it. Can you spin up a copy of Facebook? No. If you could, would it be useful? No. Because it's that long, thin tail of things that almost never happen that matters. And you literally can't test for all of them, because you could not anticipate them. If you capture and replay all of yesterday's traffic, it's not gonna tell you everything that's gonna happen tomorrow. I speak as someone who's a giant fan of, and has written, multiple of those load-testing frameworks. It's not good enough. And if you don't like that example, what about the national power grid? You can't spin up a copy of that.
CM: Unknown unknowns mean testing in production, and that's why it's really worth investing all of that extra effort into making it resilient: making it so that you can recover quickly, making it so that everyone who has access to merge to master also knows how to roll back to a good state, knows how to back out their changes. Invest in your canaries, invest in your rolling deploys, and all this stuff. And let's build tools that don't lie to us as much. That's really all I'm going for. Thank you. [applause]
CM: I have a couple of minutes. Anyone have questions? Christ. [background conversation]
Audience Question 1: What's the bare minimum of instrumentation?
CM: The bare minimum of instrumentation... the answer is always: it depends what kind of system you have. But I think the best place to start is at your edge. If you're running Nginx or HAProxy or anything like that at your edge, there's this really cool thing you can do: in your application code, if you wrap every call out to a database or out to another service in a timing check, you can put that timing in a header, and that'll get passed back to Nginx or HAProxy or whatever. The hardest problem is often finding where the problem is; once you've found which component it's in, it's pretty easy to find the actual problem. But in a distributed system, where you've got requests looping back in, and fan-out, and one node can make latency rise for every service because of interdependencies, what matters most is being able to find where the slow thing or the erroring thing is. And you can do that with nothing but your edge dataset and some headers.
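A minimal sketch of that pattern, with made-up helper and header names; the header format loosely follows the `name;dur=ms` style of the Server-Timing convention, but nothing here is a specific framework's API:

```python
import time

def timed(name, fn, timings):
    """Run fn() and record its wall-clock duration in ms under `name`."""
    start = time.perf_counter()
    try:
        return fn()
    finally:
        timings[name] = (time.perf_counter() - start) * 1000

def handle_request(db_query, cache_get):
    """Hypothetical request handler: time each downstream call,
    then report the per-component breakdown in a response header."""
    timings = {}
    timed("cache", cache_get, timings)
    timed("db", db_query, timings)
    # The edge proxy can be configured to log this response header,
    # so the edge access log alone tells you *where* the time went.
    header = ",".join(f"{k};dur={v:.1f}" for k, v in timings.items())
    return {"X-Backend-Timing": header}
```

Nginx, for example, can log response headers via its `$sent_http_*` variables, so one edge dataset ends up carrying the per-component timing for every request that passed through.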
CM: Yeah. I mean, one of the interesting things about these shifts in systems is that you used to be able to attach a debugger. Or be like, "Find me when it breaks again and I'll go look at it and run strace or dtrace or whatever." And now we're breaking everything up so that it always hops across the network, so you can't do that. You're not really debugging code anymore so much as you're debugging systems, which is a different, but related, skill set. Anything else? Yeah.
Audience Question 2: Do you try and keep systems simple or do you just embrace complexity?
CM: Oh, fuck yeah! Of course I try to keep them simple. It's the world that's conspiring against me, that won't let me. [chuckle] No, absolutely: as simple as possible. So one of the things I put in there was polyglot persistence. It used to be that the best practice was: you have a database. You can have MySQL or you can have Postgres, you get to choose, and then you put everything in that database. And that was, and still is, a really great practice when everything you want to do can be done by that one database. But there has been such a proliferation of new and exciting and awesome things you can do with other databases. MongoDB gets a lot of shit. I maintain that the best thing that ever happened to Mongo was Parse getting acquired by Facebook, 'cause Mark Callaghan, my lord and savior, got involved, and now it has a pluggable storage engine API and it's a real database. I digress. There's cool shit you can do with new data stores, and you may very well need to run a few of them. So: the simplest architectural model that will let you achieve your goals. But as a startup, your advantage is often that you can afford to take more risks. You can afford to try new stuff. And you should embrace that, because it's gonna go away. Alright. Awesome. Thanks. I'll be around for a little while. [applause]