S1: Chaos Engineering at Netflix

Episode Summary

Meet the teams that aim to provide an uninterrupted Netflix experience! Haley Tucker from Resilience Engineering and Aaron Blohowiak from Demand Engineering discuss Chaos Engineering, High Availability and how they create a reliable system out of inherently unreliable parts.

Episode Transcription

I would love to have an episode of We Are Netflix in which we learn together exactly what happens when a person logs in and starts watching Netflix, but our systems are complex.  To serve over 130 million people all over the planet takes a bit of infrastructure, and no one person really knows every piece of the system.  So, how do we make sure to keep everything running smoothly, making sure everyone can enjoy Netflix whenever they want?  Well, years ago, we realized that causing parts of the system to fail on purpose could help us make a fault-tolerant Netflix, and that is what we do.  Welcome to We Are Netflix.  I'm Lyle Troxell.  On this episode, I chat with Aaron Blohowiak from Demand Engineering and Haley Tucker from Resilience Engineering to learn what it takes to make a robust and stable Netflix service. 


                        You guys, kind of, are two pieces of a large, complex system that we deal with.  So, I'm really excited to, kind of, talk about it.  Before I started with Netflix, the one thing I knew about engineering at Netflix is that we, kind of, embrace chaos.  We randomly do things on the network, and it's called chaos monkey engineering.  Can you please describe that, Haley?


Haley:              Yeah.  Several years ago, we started doing a series of experiments.  Like, one was chaos monkey.  There was latency monkey, and what that is is we have something that will go into an individual AWS cluster and, like, kill one node.  And, the purpose of that is to make sure that you have redundancy built into your cluster and that killing that one node doesn't take down your entire application. 


Lyle:                Why don't you just design it so that you do have redundancy?  Why do you have to actually do that?


Haley:              That's the thing is, in general, people do design for that and they do think about redundancy.  It's more of just, like, a sanity check to make sure.  Like, so, it, kind of, brings it into the culture, and so, you know this chaos monkey is out there and it's going to kill one of my instances.  So, I better make sure that I have them there when that happens.  And so, if you push your service out and you don't have that, it's like a quick wakeup call.  Like, we'll even run it in the test environment, and so, it's a quick wakeup call before it even gets to production sometimes.  That's like, "Oh, yeah.  I need to go set all these numbers up correctly to make sure I have redundancy."


Lyle:                It's interesting.  When I do these technical discussions, I have to think about the person that's listening that has no technical knowledge and decide are they going to stick with this or are they going to leave.  Should I say something to them?  Should I break down what a cluster is and what an instance is?  And, it's a hard thing to decide to do.  So, let's do this.  Let's table a bit of the tech and talk a bit about stuff that's apparent to lots of people, and then, we'll get back into that because I want to talk about, well, how many clusters do we have, what do you mean a cluster.  You know, we'll get into what that means, but that is, kind of, the level of discussion we just had is what I talked about years ago when it first became public that Netflix does this.  But, there's a whole industry around that.  So, that's what we're going to talk about today.  So, first, let's talk about when you started.  Haley, you started five years ago.  You've been here about the same time as me.  What industry did you come from?  Where do you come from? 


Haley:              So, prior to this, I was in a small consulting company in Dallas, and I did a bunch of, like, integrations around an Oracle billing and revenue management platform.  So, for, like, telco companies, when you get your cell phone bill at the end of the month, they have to, kind of, sum all of the usage and generate a bill. 


Lyle:                Oh, I love those bills. 


Haley:              I know.  So, it's everybody's favorite thing.  So, yeah.  When I came to Netflix, I was super excited to work on something that people actually liked. 


Lyle:                You literally delivered something that people hated. 


Haley:              Exactly. 


Lyle:                Was it Resilience Engineering?  Was that the kind of thing you were doing, or was it something else?


Haley:              No.  It was software engineering development work. 


Lyle:                UI or—


Haley:              All back end, Java. 


Lyle:                Java stuff. 


Haley:              Yeah, Java stuff. 


Lyle:                So, you're a Java engineer.  And did you move right into the chaos area or the resilience area at Netflix, or did you do something else first? 


Haley:              So, when I started, I was actually on the playback services team, so I spent a little over three years there owning and operating several services that get hit when a user clicks the play button. 


Lyle:                Yeah, okay.  All right, good.  So, traditional software engineer, moved to Netflix.  What was most surprising to you coming in here?  Like, what did it feel like, "Oh, I didn't expect that"?


Haley:              I think, like, the very first day, I came in, and my team was like, "Okay.  You're going to push something to production today."  I was like, "What?" 


Lyle:                Daisy Rowe [phonetic] [00:04:27]. 


Haley:              Like Daisy Rowe.  They're like, "Yeah.  Get your workspace set up.  Here's the instructions."  That was relatively quick, and then, they're like, "Do this one little change, and we'll get it out into production." 


Lyle:                And, don't worry.  It's only going to affect millions of people. 


Haley:              Right?  And so, like, I think that was the first time that I was like, "Whoa."  And, you start to learn, like, all of the, kind of, guardrails and safety checks that we have built up across the company that, kind of, enable that rapid velocity-type change, but it also brings a whole category of things to deal with in the production environment. 


Lyle:                Yeah, okay.  All right.  Aaron, you've been here less, about three years, two and a half years.  Where do you come from?  What kind of industry did you come from? 


Aaron:             I have been working at startups and contracting at startups for about the past 10 years, a whole variety of different industries, and what qualified me, I guess, for my role was that I had brought down production many times over those 10 years.  So, I had an intimate experience with how systems fail at scale in prod. 


Lyle:                Were you, kind of, an ops person?  Were you handling those fires, or were you just taking, you know, taking them down and going, "Oh, that's a problem"?


Aaron:             In small companies, there's really a blur between development and operations. 


Lyle:                Right.  You're just everything. 


Aaron:             Yeah.  I did primarily application development, but the whole philosophy that we have here of developers own and operate their own services is, kind of, like, the default assumption when you have a team of five, six, even 20 people. 


Lyle:                Yeah. 


Aaron:             So, being on call, handling production issues has been a part of my life for the better part of a decade. 


Lyle:                So, you saw recruiting?  Or, did you get recruited, or did you just see a posting?  How did it happen? 


Aaron:             One of the people that I had worked with a long time ago actually had joined Netflix for a while, and after he joined, he started talking to me about, you know, "Hey.  It's pretty cool.  You should come."  And, I was like, "Hmm, you know, maybe I've heard some things.  You let me know after a year if you still like it."  And, a year later, he reached out to me, and, at that point, I was like, "All right.  Well, if it really is as good as you say, I'll come check it out."  And, that's how I came. 


Lyle:                So, you had the benefit of somebody else testing the waters for you? You cheated.


Aaron:             I really did.  You know, here in the Valley, we hear lots of different things about different companies. 


Lyle:                What had you heard about Netflix? 


Aaron:             I had heard that the expectations were extremely high, and I was always confident in my ability to, like, have a technical career, but I had real doubts if I was, like, Netflix-good, if you will.  And, one of the ways that those doubts were allayed, if you will, was this person I knew was just like, "No.  You know you're good.  You know, there's no magical different class of people that, you know, work at Netflix.  They're really good, really smart people with a very high talent density, but, you know, you're pretty good too.  Like, come check it out.  See how things are going.  We'll see if there's a fit."  And, that really just, sort of, made me realize, you know, there is no exceptional, truly out-of-this-world level that you can attain.  You know, through the depth of my experience, I'd seen so many things.  I realized, you know, maybe this is time for me to take my chances and give it my shot.


Lyle:                You've been here five years, so you must have gotten some critique that was hard to hear.  But, give me an example of critique you were given, who gave it to you, and what happened. 


Haley:              Sure.  So, I have gotten this critique a few times, which is that I can come off, kind of, snippy or frustrated-sounding with people on occasion, and it's one of those things that I, like, when I talk to the person about it and, like, can you give me an example and they give me the example, I was like, "Oh, yeah.  You're right.  I absolutely did that."  And, like, we can talk through it, and I can be like, "Okay.  I can first apologize."  But then, also, I can say like, "Okay.  So, here's some things that I'm going to work on.  Here's some things that you could do to help me."  Because, in a lot of those cases, you know, I was under water and super stressed out, and I just, like, you know, snapped or whatever.  I'm like okay.  So, if I'm like that and you could tell I'm like that, maybe just, like, send me a message first and then let me, like, reset my mental map and then we can, like, get into the conversation.  So, like, it, kind of, works both ways.  You're able to, kind of, set better expectations on both sides, and then, you're aware of it.  And so, like, I've done a lot to, kind of, breathe, not, like, immediately react to people, and it's something that I know I struggle with, but it's also gotten, you know, like, I feel like I've gotten a lot better.  Aaron could probably talk since he's seen more recently.  I think I've gotten that in check now. 


Aaron:             Absolutely.  I think—


Haley:              Because, I don't want to be the scary, like, jerk that nobody wants to come talk to. 


Aaron:             Oh, no, not at all. 


Lyle:                And, Haley, it sounds like when someone says, "Hey.  You were snippy with me in feedback," and you hear that from them, you go, "Yeah.  You're right.  In that case I was.  Help me next time."  That next time, when it occurs, they can call you in the moment and go, "Hey—"


Haley:              Exactly. 


Lyle:                And, you guys both know what you're talking about, right?


Haley:              Yep.  And, it's, sort of, like, acknowledging that people are going to have feelings and people are going to do things you don't like and just, like, making it okay to have those interactions and work through them, as opposed to, like, letting it fester.


Aaron:             Yeah.  I think that was one of the big ideas I had in my mind was that a Netflix engineer doesn't make mistakes.  You know, that was part of that, you know, notion of we have all these stunning colleagues, and we do, but they're human.  Right?  And so, our culture involves, like, interacting with them and embracing that and saying, "Hey.  You know, I've got you.  Do something different next time." 


Lyle:                Yeah.  It's funny because your job really is to just ensure that our customers have a good time, even when we make mistakes. 


Aaron:             Exactly. 


Lyle:                It's, kind of, neat. 


Aaron:             So, it all comes back.  Like, the culture deck says we prefer rapid recovery to preventing errors because that preserves freedom, and a lot of prevention methods take away from your ability to pursue the best choice in time.  And, I find that this is true not only in terms of our culture, but also our technology. 


Lyle:                Yeah.  All right.  Well, let's talk a bit more about our technology.  There's a lay down for anybody that wants to listen, and now, let's get into it a little.  First, let's describe, kind of, the space we're talking about.  Earlier, you talked about AWS.  That's Amazon Cloud Services, AWS.  Amazon—


Haley:              Amazon Web Services. 


Lyle:                Thank you.  I knew there was an initial in there somewhere.  So, basically, almost all the code that I write, almost all the code we write is actually sitting on Amazon's infrastructure, and Amazon, of course, has giant datacenters in multiple regions.  And, we have almost all of our stuff running there.  The one caveat to that has to do with our video files.  They actually sit mostly on boxes we manage, but, other than that, all AWS.  So, what's a cluster? 


Haley:              Okay.  So, we've got AWS, which is the cloud provider.  If you break it down into that, they've got multiple datacenters around the world, so we talk about this as being regions.  So, we run in three regions, so that's three areas of the world where we deploy software.  If you break it down from that within each region, we run in multiple datacenters, and then, when you break it down from that, we deploy our applications.  And, each of those applications is deployed into clusters.  So, a cluster is a group of servers that are running our application code. 


Lyle:                Let's not get down to the process here because I saw your eyes roll up when you said servers.  It's actually virtual servers, but we won't get into that. 


Haley:              No, we won't. 


Lyle:                So, when I say I want—I do the iOS app, right, and the iOS application on the server side is called the endpoint.  And so, there's Groovy code, Java code, that runs when anybody asks for stuff for their iPhone.  It goes and hits some cluster, which is dedicated, or multiple clusters, which is dedicated for that app, kind of. 


Haley:              Correct. 


Lyle:                Okay.  And, yeah.


Aaron:             And, the app that's servicing your request directly then makes calls to other apps and other apps and other apps, and so, we actually have hundreds of different apps that power your iPhone experience. 


Lyle:                And, we call those microservices. 


Aaron:             Exactly. 


Lyle:                Just little applications.  Okay.  And so, for an example of one of—Well, let's do an example of those.  When we release new strings, new languages—let's say we release a whole new language set or someone modifies it—there's a service called Obelisk and that Obelisk service actually provides translations for all of our different services.  So, that's an example of, like, I don't actually interact.  I don't do anything with Obelisk, but I use it all the time in my app and whenever I use it, as actually a secondary service.  And there's hundreds of those you said? 


Aaron:             Yep. 


Lyle:                Okay.  All right.  So, there's also a lot of software engineers, and we're making changes a bit. 


Aaron:             Yeah.  We have well north of 1,000 software engineers, and we're pushing, I think, over 5,000 production changes a day. 


Lyle:                Okay.  How can 1,000 people be pushing that many production changes a day? 


Haley:              So, when we talk about production changes, we're talking about code deployments, but we're also talking about data changes, so, like, turning on a feature, maybe a piece of data that we change in production.  So, there could be, like, any number of things that are altering production behavior, and it's not necessarily just code pushes. 


Lyle:                Would you consider pushing a new image to the service?  Is that considered one of these production changes? 


Haley:              Yep. 


Lyle:                Okay.  So, really not code?  And so, what you're saying is, like, you know, we get Season 3 of Stranger Things, the first episode, and we have to upload the video and we have to upload the files, and all that's happening by outside vendors, by local people in L.A., by maybe some people up here in Los Gatos.  And, when they make the change, they're using some tool that we wrote that then deploys it, and then, multiple services grab it and push it all over the world. 


Haley:              Correct. 


Lyle:                All that process is part of this, like, over 5,000 production changes, and any one of those, theoretically, could go bad at any point. 


Aaron:             Exactly.  And, it's impossible for any single person to hold the entirety of the Netflix infrastructure in their head at any single time, and, even if they could, to be able to imagine ahead of time all of the ripple effects of any single change would take a long time and actually be impossible.  So, there are, like, multiple different sources of error.  There are environmental sources, like Haley was talking about earlier.  Computers die.  They crash all the time.  It happens.  There are things that we're introducing change into the system, and then, growth itself actually could be a potential source of error because that can put pressure on our ability to scale and handle larger data. 


Lyle:                Example of that, let's say Apple releases a new phone and people get it on, you know, day one, and so, all of a sudden, thousands of new devices are coming online and the old devices might not be going offline.  So, all of a sudden, we get this boost, and it's a slightly different call methodology.  And, that can happen everywhere all the time.  Okay. 


Haley:              Yep. 


Lyle:                So, we, kind of, laid back a little bit of the complexity, that there are 1,000 engineers making changes, but above and beyond 1,000 engineers is all this other content as well.  And then, there's the users' behavior can be different, and then, you're not even talking about, like, I don't know, power outage in one of those servers.  Like, when you talk about a region, you're actually talking multiple buildings, and one of those buildings could go offline for some reason as well. 


Haley:              Right, correct. 


Aaron:             Undersea cables getting cut, all kinds of fun stuff. 


Lyle:                And, some of these things we've seen. 


Aaron:             Oh, yeah. 


Lyle:                Or, we've seen all of them. 


Aaron:             Yeah. 


Lyle:                So, one way to think about that is just to make sure that things are functional, and you guys actually, kind of, represent, as I hinted at earlier, two different sides of this way of thinking about it.  So, let's take a break from chaos because I talked about that for a second, and we'll get back to the chaos aspect idea of it.  But, let's talk just about what does happen if all of a sudden, you know, one of our regions, like the West Coast, networking problems emerge and we have somebody in California watching Netflix and all of a sudden something is wrong with the computer they're connecting to.  What happens?  What can you do? 


Aaron:             We can—So, we run our system in three different regions, I think as Haley mentioned, and we can actually totally evacuate any one of those regions at a time and redirect that traffic to the other regions to preserve the customer experience.  And, we have gotten so good at this that we can actually do it without customers noticing impact in a huge percentage of cases. 


Lyle:                Okay.  So, what you're saying is I can be watching Netflix in my home in the Santa Cruz Mountains and be enjoying it or not, depending on what show it is, but enjoying it, and all of a sudden, all my network connections happening to, you know, someplace in the West Coast somewhere.  And, all of a sudden, someone in your team says let's move it to the East Coast, and the next request that goes up gets redirected to New York or somewhere out in the East Coast.  And, I get my data back, and, at no point, did I notice anything different happening.


Aaron:             That's the goal. 


Lyle:                That's the goal. 


Aaron:             And, that's the usual experience. 


Lyle:                So, we do that every 10 years or so?  How often does this happen? 


Aaron:             We do it on a planned cadence every two weeks, and every other one of those planned experiences, we actually do it twice in the same day for various detailed reasons. 


Lyle:                So, what you're saying is that you turn off one of these regions.  You turn off all the West Coast every two weeks, or whatever the cycle is? 


Aaron:             Yeah.  It rotates.


Lyle:                Rotating every six weeks, you turn off the West Coast.


Aaron:             Hm-hmm [affirmative]. 


Lyle:                And, the thousands and thousands of people, because we're talking about three regions.  We're talking about one-third of the population of Netflix customers, which is over 130 million people, and, of course, that's a customer and they can actually have multiple profiles.  There's more devices than that.  You just move that traffic to another place? 


Aaron:             Yeah.  It's pretty amazing, and we have to do a lot of stuff to make that happen.  So, we have to make sure that we have sufficient capacity in the other regions that are going to absorb that traffic, and that's a big portion of what my team does is understanding how Netflix scales as a result of different traffic patterns from different kinds of devices which are represented in different ratios and different parts of the world.  So, each of those different types of what we call evacuations have different resulting characteristics on the other regions. 
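
A minimal sketch of the capacity check Aaron describes, under loud assumptions: the region names, load numbers, and capacity figures are hypothetical, and the evacuated traffic is split evenly across the surviving regions.  The even split is a simplification for illustration only; as the conversation goes on to explain, the real effect depends on the device mix of the traffic being moved.

```python
def evacuate(traffic, capacity, region):
    """Return per-region load after moving all of `region`'s traffic elsewhere.

    traffic  -- dict of region name -> current load (any consistent unit)
    capacity -- dict of region name -> maximum load that region can serve
    region   -- the region being evacuated
    Assumption: evacuated load is split evenly across the surviving regions.
    """
    if region not in traffic:
        raise KeyError(region)
    survivors = [r for r in traffic if r != region]
    if not survivors:
        raise ValueError("no region left to absorb the traffic")

    share = traffic[region] / len(survivors)
    shifted = {r: traffic[r] + share for r in survivors}
    shifted[region] = 0.0

    # Refuse the evacuation if any surviving region would be overloaded.
    overloaded = [r for r in survivors if shifted[r] > capacity[r]]
    if overloaded:
        raise RuntimeError(f"insufficient capacity in: {sorted(overloaded)}")
    return shifted


# Hypothetical numbers: load expressed as a percentage of global peak traffic.
traffic = {"west": 30.0, "east": 40.0, "europe": 30.0}
capacity = {"west": 60.0, "east": 60.0, "europe": 60.0}
after = evacuate(traffic, capacity, "west")
```

With these made-up numbers, evacuating "west" leaves "east" at 55 and "europe" at 45, both under their capacity of 60.  Lowering every capacity to 50 makes the same evacuation fail, which is the situation capacity planning is meant to catch ahead of time.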


Lyle:                You're saying that, if you move traffic from the West Coast, the way the West Coast operates—and you hinted at these—the East is going to change, not just double its size, but actually change the way it behaves? 


Aaron:             Exactly.  So, in some parts of the world, smart TVs are very popular.  In other parts of the world, mobile devices are a huge portion of usage, and that has very different effects when we're talking about those hundreds of different services on the back end.  Some are involved much more commonly in certain types of devices and others for other types of devices.  So, we have this huge what we call, like, demand mapping project to understand those ripple effects throughout the infrastructure. 


Lyle:                To add some more clarity and, like, real ideas of the differences, we can talk about, like, the way the TV app works versus the way the phone app works. 


Aaron:             Absolutely. 


Lyle:                When you log into Netflix on your phone and you get all the movie listings, that payload all happens one time.  The entire screen you can navigate around happens once, and, in TV, the TV systems are, can't hold all that effectively in memory.  And so, as you scroll through the lists, you get them loaded in piecemeal.  So, the operation of a phone, it's a one-time heavy hit to us, and the operation of the TV is a slow trickly hit.  And, both of those have very different performance characteristics. 


Aaron:             Exactly. 


Lyle:                Okay.  I, kind of, get that idea.  But why did we come to this idea that we should just do it for practice? 


Aaron:             We didn't start that way.  We started doing it infrequently, and, whenever you're new at something, you usually aren't extremely good at it from the beginning.  And, not being great at it meant that we were a little afraid of it, and that fear led to an infrequency.  And, the less frequently we did it, the more that we would accumulate what we would call, like, regressions or things where we would slip back into things not working as perfectly as they should.  And, we realized that this was a cycle that we were, kind of, stuck in.  So, we leaned into doing it more and more frequently, and that led us to get better and better.  There's this pithy, little phrase that, like, "if it hurts, do it more often," and that was originally in the context of continuous deployment, but it's true also for these kinds of practicing and remediations, if you will. 


Lyle:                I think that works for mental challenges, but I don't know if that works for physical things.  Like, if you're hurting your legs running, maybe don't do it. 


Aaron:             But, like, maybe you should go to the gym a bit if, you know, you get a little sore or something. 


Lyle:                Get stronger and then it won't hurt as much.  Okay. 


Aaron:             So, it's like eustress and distress.  I'm going with eustress and not distress. 


Lyle:                Okay.  So, how do you—Were you around when we increased the cadence to every other week?


Aaron:             Yeah.  It's been a gradual process that started before I joined Netflix, but yeah.  Going to every other week was during my tenure.


Lyle:                Okay.  So, that's the, kind of, worst-case scenario and protection for that, but, of course, there's only three regions and I'm sure it's, kind of, expensive for us to keep that maintenance of high level.  Because, basically, every region has to have double the capacity of the usage.


Aaron:             It's not quite double, interestingly enough.  So, when you're only in one region, you only need to have 100 percent of capacity for your global traffic peak.  If you're in two regions, either one would have to handle 100 percent.  That leads you to 200 percent total.  But with three regions, any two need to equal 100 percent.  So, if you do three times 50 percent, you get to 150 percent of your capacity costs.  So, it's actually cheaper to run in three regions than two.  I actually gave a talk about this at SREcon EMEA and you can find the talk online. 
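
The arithmetic Aaron walks through can be sketched as a small function.  This is an illustration under stated assumptions, not Netflix's planning model: traffic is assumed to split evenly across regions, only one region is lost at a time, and cost is assumed proportional to provisioned capacity.  The function name is made up for this example.

```python
from fractions import Fraction


def total_capacity_percent(regions: int) -> Fraction:
    """Total provisioned capacity, as a percentage of global peak traffic,
    needed so that the remaining regions can absorb the loss of any one.

    Assumes equal-sized regions and a single region failure at a time.
    """
    if regions < 1:
        raise ValueError("need at least one region")
    if regions == 1:
        # A single region just has to carry the whole peak; no failover.
        return Fraction(100)
    # The surviving (regions - 1) regions must together cover 100 percent.
    per_region = Fraction(100, regions - 1)
    return per_region * regions


for n in (1, 2, 3, 4):
    print(n, f"{float(total_capacity_percent(n)):.1f}")
# 1 100.0
# 2 200.0
# 3 150.0
# 4 133.3
```

Under the same assumptions, each additional region lowers the total further, but with diminishing returns.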


Lyle:                All right.  Cool.  Again, the conference was what? 


Aaron:             SREcon, and then EMEA is the Europe/Middle East and Africa edition. 


Haley:              And, one of the other things to remember is anytime, like, we're not using that capacity for streaming needs, we have a lot of applications that will use our unused capacity for other, sort of, batch jobs.  A lot of our, like, encoding pipelines and things like that leverage that capacity.  So, it's not sitting there wasted. 


Lyle:                Right.  Okay.  So, that's the big picture, rollover thing that we do, which is, kind of, awesome that we do that, and I have, of course, as an engineer, on a lot of the tools that I use, there'll be banners out that say, "Hey.  We're, you know, evacuating East."  When I first started, I recall—And, this was, like, five years ago or so.  When I first started, I recall seeing problems when I was devving.  You know, all of a sudden, the East Coast would go offline, and test accounts happened to be there for some reason.  And so, I'd be like, "Wait.  This is problematic."  And, it was something I paid attention to.  I don't pay attention to it at all anymore.  I, like, I mean, no offense, but I ignore the warnings. 


Aaron:             No.  Thank you very much.  That means we're doing—


Haley:              A compliment. 


Aaron:             …our job well.  Yeah. 


Lyle:                Yeah.  Like, even as an engineer, you can't detect that it's happening really, which is, kind of, amazing. 


Aaron:             That's the goal.  Yeah.  We work very hard to take this from an operational burden to a strategic advantage. 


Lyle:                Yeah. 


Aaron:             And, part of that means getting out of the way of the other engineers. 


Lyle:                Is there any desire to go to a fourth region? 


Aaron:             There are.  There are locality benefits that you get in terms of latency.  When I was talking about those cost dynamics earlier, that assumes that your cost is primarily based on throughput. 


Lyle:                When we're talking about, like, shaping that and the cost, the static cost and all that stuff, are you involved highly in that discussion process?


Aaron:             I help guide it. 


Lyle:                Why? 


Aaron:             That's the job.  So, we sit at the nexus of many different teams, and being at the intersection of many different concerns gives us a unique perspective on the org.  And so, we don't have a particular stake in, say, how the databases work or even how the transit between your device and the cloud works.  What we care about is the global distribution of load and data and how that affects our availability, latency, and costs. 


Lyle:                Okay.  So, you're looking at those larger pictures.  Of course, you know the best about latency and about scale and all of those things because you're always doing analysis and testing it. 


Aaron:             Well, the people on my team do.  I try to keep up. 


Lyle:                Okay.  Let's move over to the chaos area a bit.  So, we were talking about all the clusters on AWS and with these different regions that we just talked about, but there's also these instances.  And, one little instance, kind of, represents a computer, and we have lots of them.  And, I think you inferred earlier that we take them down, we just kill them sometimes.


Haley:              Yeah.  So, we have several different types of chaos experiments that run.  One of them is just take down an instance out of a cluster, make sure the rest of the cluster stays healthy, and that there's no ripple effects.  So, I would say, like, that's, kind of, the most, almost most rudimentary form of chaos experiments at work. 


Lyle:                How do you take that down? 


Haley:              We will basically make a call to Amazon and say terminate this instance. 


Aaron:             It's, kind of, like pulling the plug out of the wall. 


Haley:              Yeah. 


Lyle:                So, you just do that? 


Haley:              We just do that. 


Lyle:                So, if I'm running, like, the infrastructure for the Apple devices, right, because that's one of the things I work on.  I deploy the endpoint code and help deploy that and all that.  Teams of people, of course, do this.  I'm simplifying.  And, I have, you know, 50 instances running for some reason.  You might just, without talking to me, just kill one? 


Haley:              Correct. 


Lyle:                But, that seems rude. 


Haley:              When chaos monkey first started, it was an opt-in.  So, service owners were like, "No."  They would say, "Okay.  Yeah.  Add my app to this.  I have decided that I'm set up and I should be able to handle this just fine." 


Lyle:                So, before them opting in, they did something? 


Haley:              So, before them opting in, they did something.  So, we operated in that mode for several years, and then, it got to the point where service owners were like, "Well, no.  It should just always be opted in because, like, I know this now.  Like, this is part of what we do, and I don't want to have to tell you every time I want to opt my application in."  So, now, the model has been flipped so that the default behavior is that you're opted in, but service owners can still go opt out.  So, they have an ability to go.  So, like, a lot of our stateful services, they'll go in and turn that off because they manage their redundancy in a different way.  So, there's ways for service owners to turn that off. 
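
[Illustrative sketch, not from the episode: one way the opt-out model and the "terminate this instance" call described above might fit together.  The cluster names, the `opted_out` set, and both helper functions are hypothetical.  The only real API assumed is the boto3 EC2 client method `terminate_instances(InstanceIds=[...])`, and the client is passed in so nothing here touches AWS on its own.]

```python
import random


def pick_victim(clusters, opted_out, rng=random):
    """Pick one instance from one cluster that has not opted out.

    clusters:  dict mapping cluster name -> list of instance IDs (hypothetical data)
    opted_out: set of cluster names whose owners turned chaos off
    Returns (cluster_name, instance_id), or None if nothing is eligible.
    """
    eligible = {
        name: ids for name, ids in clusters.items()
        if name not in opted_out and ids
    }
    if not eligible:
        return None
    cluster = rng.choice(sorted(eligible))
    return cluster, rng.choice(eligible[cluster])


def terminate(ec2_client, instance_id):
    """Ask the cloud provider to terminate one instance.

    ec2_client is assumed to behave like a boto3 EC2 client.
    """
    ec2_client.terminate_instances(InstanceIds=[instance_id])
```

[In this sketch a stateful service would simply appear in `opted_out`, which matches the default-on, opt-out behavior Haley describes.]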


Aaron:             I think this goes back to what we were talking about earlier.  Like, if it hurts, do it more often.  So, if chaos monkey weren't running, those instance failures would still happen.  They would just be much more rare, and so, we wouldn't have the habit of being able to respond. 


Haley:              Right. 


Lyle:                Well, oddly though, I don't even think—Like, I, kind of, know that happens, but I don't really think about it as an effect.  I definitely don't feel like I see it.  So, if I'm a customer and I say, you know, I want the details of Stranger Things, Season Two, and so, that means that, on the device, you click on it and you see the list of episodes to find out what episode I want to, you know, show my family or whatever.  When I do that, a request goes out to Netflix and data gets returned.  What—If that box that's trying to return that data to me gets killed, don't I no longer have that data and all of a sudden, as a customer, I see a problem?  How do we actually handle that? 


Haley:              Yeah.  So, in that case, it should be at the point, at the microservice layer where that call is being made to get that information, what should happen is it hits that instance that goes away and then it should do a retry.  And so, from the customer perspective, you won't see it.  Another option is it goes to get that data, it fails, and then we provide a fallback.  So, you may not see all of the data.  You may see 80 percent of the data that's been cached.  And so, the goal is to make sure that we handle that case in a way that the customer doesn't feel the pain. 
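
[Illustrative sketch, not from the episode: the retry-then-fallback behavior described above, reduced to a few lines.  The function names, the instance IDs, and the cached episode list are all made up for the example.  It assumes a failed call surfaces as an exception and that a partial, cached answer is acceptable.]

```python
class InstanceGone(Exception):
    """Raised when the instance handling the call has been terminated."""


def call_with_retry_and_fallback(fetch, instances, fallback, max_attempts=2):
    """Try the call on up to max_attempts instances, then fall back.

    fetch:     function taking an instance ID and returning data
    instances: ordered list of instance IDs to try
    fallback:  zero-argument function returning degraded (e.g. cached) data
    """
    for instance in list(instances)[:max_attempts]:
        try:
            return fetch(instance)
        except InstanceGone:
            continue  # that instance went away: retry on the next one
    return fallback()


def fetch_episodes(instance):
    # Stand-in for the real call; "i-dead" simulates a killed instance.
    if instance == "i-dead":
        raise InstanceGone(instance)
    return ["S2E1", "S2E2", "S2E3"]


def cached_episodes():
    # Stand-in for a cache that only holds part of the data.
    return ["S2E1", "S2E2"]
```

[With one dead instance the retry hides the failure completely.  With every instance dead the caller still gets the cached subset rather than an error.]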


Lyle:                Right. And, if the data is lost, at least hopefully, they still got what they needed? 


Haley:              Right. 


Lyle:                Right.  It's, kind of, the worst-case scenario is that they didn't get all the stuff they could have gotten. 


Haley:              Right. 


Lyle:                And, they might not even be aware that that occurred.


Haley:              Yeah.  So, one example, this is getting a little bit more into other types of chaos experiments, but we have a service that provides the little badges.  So, it'll say, like, there's 5.1 audio or HD video, and those little badges show up on the UI.  And, if that's down or if that goes latent, that shouldn't prevent you from playing your movie. 


Lyle:                Okay. 


Haley:              And so, one of the things that we try to do is we'll actually run tests to make sure that happens.  So, there's the chaos monkey type experiments that are, kind of, like, killing one instance, but that doesn't necessarily give you the tight feedback loop that we want to make sure that we're not impacting customers.  So, one of the things that we've been moving toward is being able to run, like, for my device, I want to just fail the calls to that service—


Lyle:                The microservice, hm-hmm [affirmative]. 


Haley:              …that's providing those badges and make sure that I still get my movie playback. 


Lyle:                Okay.  So, you're saying—Are you working towards that?  Because, I would like that feature. 


Haley:              So, we have that now. 


Lyle:                So, I can actually say for my account, or whatever, or my device ID, I could say please disable that microservice completely and I can try it?


Haley:              Absolutely. 


Lyle:                Okay.  I need to surface that to my team. 


Haley:              Yes. 


Lyle:                I need to do that more.  That sounds fantastic because what I'm normally doing—This is a little bit of inside baseball. Sorry.  But, I'm normally modifying my endpoint so that service call rejects or returns it all or doesn't return it empty and see if I still get a payload, and, if I do, then, say, of course, my client, I assume that I won't have it.  But, the manual process would be easier.  It'll be I'll flip a switch and just test it. 


Haley:              Yeah.  And so, we have integrations in place now, like, at all the Cassandra layers, at all of the RPC layers, EVCache.  So, like, pretty much any network boundary that you could hit, you can go in and say, "For my device, fail it or add latency and see what happens."  And so, that gives us a lot more flexibility, one, in, like, what we're testing, and it also allows us to do a lot more of that feedback loop.  So, like, in experiments that we run, we can monitor, like, what is actually happening to the end user and make sure that their experience isn't altered. 
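
[Illustrative sketch, not from the episode: what "for my device, fail it or add latency" could look like at a single network boundary.  Every class, function, and field name below is hypothetical.  It shows only the shape of the idea: a rule keyed on dependency and device ID, and a caller that treats the badge service as non-critical.]

```python
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised instead of making the real call when a fail rule matches."""


@dataclass(frozen=True)
class FaultRule:
    target: str                   # hypothetical dependency name, e.g. "badges"
    device_id: str                # only this device is affected
    fail: bool = False            # fail the call outright
    added_latency_s: float = 0.0  # or delay it by this many seconds


class FaultInjector:
    """Wraps calls that cross a network boundary and applies matching rules."""

    def __init__(self, sleep=time.sleep):
        self._rules = {}
        self._sleep = sleep

    def add_rule(self, rule):
        self._rules[(rule.target, rule.device_id)] = rule

    def call(self, target, device_id, fn, *args):
        rule = self._rules.get((target, device_id))
        if rule is not None:
            if rule.added_latency_s > 0:
                self._sleep(rule.added_latency_s)
            if rule.fail:
                raise InjectedFailure(f"{target} failed for {device_id}")
        return fn(*args)


def fetch_badges(title):
    # Stand-in for the real badge service.
    return ["HD", "5.1"]


def start_playback(injector, device_id, title):
    """Playback must succeed even when the non-critical badge call fails."""
    try:
        badges = injector.call("badges", device_id, fetch_badges, title)
    except InjectedFailure:
        badges = []  # fallback: play the title without badges
    return {"title": title, "playable": True, "badges": badges}
```

[Because the rule is keyed on a device ID, only that one device sees the failure, and everyone else gets the normal response.]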


Aaron:             How can you do that safely? 


Haley:              So, one of the things that we've been really focused on over the last couple of years is how to do that safely, and historically, Netflix has had a pretty good culture of running canaries, like A/B experiments, to make sure that you have a baseline to compare against.  And, we've, kind of, taken that approach with chaos experiments as well.  So, we will take a population of users that we're not injecting any failure into, and we'll take a population of users where we, you know, inject failure or latency into one of these non-critical components.  And, we monitor that they're still able to get playback in both populations, and then, what that allows us to do is, if those populations diverge too much, we can shut down the experiment within minutes because we're able to get that, like, really tight feedback loop and know that we're causing pain.  And, we can stop it before we do any other testing.
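
[Illustrative sketch, not from the episode: the control-versus-experiment comparison described above, reduced to a single check.  The function names and the fixed `max_drop` threshold are assumptions made for the example.  A real system would more likely use a statistical comparison than a hard-coded cutoff.]

```python
def playback_success_rate(successes, attempts):
    """Fraction of playback attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("need at least one playback attempt")
    return successes / attempts


def should_stop_experiment(control, experiment, max_drop=0.05):
    """Decide whether the two populations have diverged too much.

    control, experiment: (successes, attempts) tuples for each population
    max_drop: hypothetical tolerance for how far the experiment group's
              success rate may fall below the control group's
    """
    control_rate = playback_success_rate(*control)
    experiment_rate = playback_success_rate(*experiment)
    return (control_rate - experiment_rate) > max_drop
```

[Running this check on a short interval is what would make the tight feedback loop possible: the experiment ends as soon as the check returns True.]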


Lyle:                So, it's a much finer thought process than just killing an instance.  What you're basically doing is segregate the population, turn off or slow down a service and see what happens.  And, when you say you do it so that you can turn off the experiment really quick, what you're saying is that if you see that, "Oh, look.  This population that we're testing against is actually having, you know, a five percent less watching in this 20 minutes.  Oh, let's turn this experiment off.  We don't want to affect users like that." 


Haley:              Right. 


Lyle:                It's that kind of granularity. 


Haley:              It's that kind of granularity, and we can, like, in a major problem, we can literally shut it down within, like, two minutes. 


Lyle:                Just like region rollover stuff, we just go, "Okay.  Don't point to those boxes anymore.  That thing is dirty and bad."


Haley:              Exactly. 


Lyle:                Now, why do that in production?  Why do that to real customers?  Why can't we just have test accounts doing that? 


Haley:              That's a good question.  So, to go back to the example of these little badges that show up on the UI, we actually had this exact case last year or the year before where we had, you know, tons of unit tests, tons of regression tests, integration tests.  The service looked healthy.  Everything was happy.  We got it out into production, and one class of customers were getting failures.  We're like, "Why is this happening?"  Like, that service went down, and they were being impacted.  So, they couldn't play—


Lyle:                And, when you say class of customers, like a type of device, like some TV by XYZ company? 


Haley:              So, it actually turned out to be certain languages. 


Lyle:                Interesting.  Okay. 


Haley:              And so, what the issue was is all of the testing that we did didn't cover every single combination of languages, and for detailed reasons I'm not going to get into, there is a language impact for display in some of those badges, particularly around the audio—


Lyle:                Yeah.  Okay.  Interesting. 


Haley:              …and what audio streams are available.  So, I like that example because it shows, like, no matter how much testing you do in any sort of a test environment, there's going to be a class of error conditions that you can't catch there.  And so, in order to get, like, a real representation of what your customers are seeing, you have to do some amount of it in production. 


Lyle:                Yeah.  And so, what you're saying, all those tests and stuff, we didn't turn those tests off.  We still do all the integration tests, all that stuff. 


Haley:              Right. 


Lyle:                We also know that there's going to be some small percentage that might come out.  So, we roll things out slowly, carefully, look at our customers, see if they're still enjoying the service, and, if they are, we're good.  And, if not, we wait and reinvestigate. 


Haley:              Right. 


Lyle:                Okay.  So, it's not like you're really just, like, willy-nilly testing on customers. 


Haley:              No. 


Lyle:                There's really, like, no other way to see that. 


Haley:              Right.  And, we expect—Like, one thing that I really want to call out is, when we're running these experiments, we expect them to pass.  So, like, we're testing things that engineers have invested time and energy into adding fallbacks or tuning their service to make sure that latency doesn't cause a cascading failure.  So, like, there's been investments made.  They think it's fine, but, until you get it under load and at scale, sometimes you just don't ever see these issues.  And, it's not realistic to mimic the Netflix system in a test environment because there's just too much on there. 


Lyle:                It's interesting, that case.  It just feels like, my gut reaction is to go, "Well, why didn't we test every language?  We know the languages we have."  Like, it seems like there's always a way to do more.  Is it that it just multiplies too quickly? 


Haley:              Well, so, this particular example actually comes back to those 5,000 changes a day.  Those languages didn't even exist at the time that the tests were written.  So, like, you know, when we come back to, like, new content being pushed out, new languages being turned on, like, there's, it's a moving target.  So, even if you have, like, a pretty good coverage of all of your languages at Point X, like, a month down the road, that may look completely different. 


Lyle:                It's interesting the idea that—It just, kind of, flashes on me that you talk about it didn't exist when we were at the test, and then it exists later.  It's, like, almost, like, every bit of things we're doing is a growth.  It's going to keep on changing and growing.  And so, it's more like an organism.  Like, right now, if you tried to build from scratch the Netflix infrastructure and do this, it would be, kind of, astronomical to get to where we are because we grew into that space rather than developed it. 


Aaron:             I really like to take an ecological approach because you have all of these different actors, and they can only see their local environment.  And yet, any local decision could potentially impact the entire ecosystem at large, and so, not only, like—In this case, you could say, "Oh, well, maybe they should have thought of that particular language."  But, this introduction of the language was by a totally different team sitting somewhere else on its totally separate release cadence.  So, we have this philosophy and belief in being highly aligned, but loosely coupled, and when you get into the potential proliferation of interactions between systems, that leads to emergent phenomena. 


Lyle:                Okay.  Do you really think that we're a complicated enough system for emergent phenomena to occur?


Aaron:             Oh, absolutely. 


Lyle:                Even, like, textbook definition of that? 


Aaron:             Yeah.  It depends on your textbook, I guess, but yeah.  I totally think so because, to me, the key there is that you have these radical discontinuities in behavior through seemingly small changes to input, and that happens on the regular here.  And, we need to be really, really good then at figuring out how to verify our expectations of resilience, like the chaos team does, and recover when those expectations aren't upheld, which is, kind of, what my team focuses on. 


Lyle:                The idea of a computer system being known and mathematical, and, like, all that really makes sense to me as an engineer, and then, the idea of adding chaos to that mix seems to be two separate buckets of cognitive ideas.


Haley:              So, one thing that might help with that, I, kind of, look at the system that we're operating as the chaos, and we're trying to understand it by seeing how it operates under different conditions. 


Lyle:                By, like, poking and prodding at it. 


Haley:              Right. 


Aaron:             You're bringing the latent uncertainty and entropy forward so you can understand it in a controlled manner. 


Haley:              Exactly. 


Aaron:             And, to your notion of, like, believing that the computer is a machine, kind of like a clockwork, that, kind of, gets to, like, to be super obtuse again, you know, Newton and then Laplace came along.  And, there's this notion of Laplace's demon, which is this thinking that, like, there could be perhaps a being that knows the current state of the universe and the mechanism by which the next state is achieved.  Then, you could have perfect foreknowledge and also knowledge of history.  But, it turns out, through various math stuff that I don't understand at all, that not only is that—


Lyle:                That quantum theory. 


Aaron:             Yeah, exactly.  Not only is that intractable for anyone to do, that's also thermodynamically impossible.  And so, it's also, like, not useful for us to think about because we create these, like, mental abstractions and models.  And so, what we're trying to really drive forward is this notion of, like, propensities of causality instead of strict determinism.  And then, within that, there's always these, like, statistical models which leave room for chance and error and unexpected things, and I think, in many ways, these practices that we're doing are an outgrowth of, like, the work of Karl Popper, for instance. 


Lyle:                Yeah.  Okay.  I like that you're thinking about the system like that because it does drive home this idea, that I find fascinating, that we are participating in the creation of something that we can't understand, and that's, I mean, that's true all the time.  Right?  Your relationships you can't understand. 


Aaron:             Absolutely. 


Lyle:                It's always true as a human, but it's just so funny to see that matched with the dichotomy of this is an OR gate.  You know what I mean?  When it comes down to it, it's true, thermodynamics, the non-ability of understanding actually a system.  We can actually detrace it.  Like, we do know what's happening on the processor, not at the electrical level, but at the representation electrical level. 


Aaron:             Can you do that across multiple machines?  Because that gets into believing that there's this notion of simultaneity, and, if you can figure out how to do that in a distributed system, you would get all the PhDs. 


Lyle:                All of them? 


Haley:              All of them. 


Aaron:             All of them. 


Lyle:                I want all the PhDs.  What are you looking at to change for your team?  What challenges are you going to grow into next? 


Aaron:             For my team, we have this special knowledge of how the system scales as a result of load, and, so far, its main application has been for reliability for recovery of production issues.  But, we realize that there's a lot of value there that we can use to help the business for efficiency efforts and also for things like holiday planning for the growth of our system and to increase availability by making sure that we have sufficient capacity to handle the kinds of discontinuities in load that we anticipate having in the future. 


Lyle:                So, it's about changing the way we're doing things to better align with what we need to do? 


Aaron:             Hm-hmm [affirmative].  Taking advantage of the knowledge that we have gained from empowering failures.


Lyle:                So, what would that look like to one of those apps?  What kind of things would you hand to them and suggest to them? 


Aaron:             Sure.  So, a lot of what we want to do actually builds off of some of the things that the Resilience Team is focusing on, and one of the areas—Or, why don't you go ahead and talk about what you're currently working on, and then, we'll get back to it. 


Haley:              Sure.  So, as I mentioned, the last couple of years, there's been a lot of focus on how to run experiments in production safely, and what we've realized is a lot of the components that we've built and a lot of the time that we've spent on being able to, you know, shut off experiments quickly could be leveraged across any number of production changes.  And so, we're looking at updating our platform to not just be able to run chaos experiments, but to also be able to run load experiments, make sure that services don't fall over, like, find the limits of them.  We also could potentially run just regular code canaries with all the safety checks in place, sticky canaries for client teams potentially. 


Lyle:                And, that generally means, like, I think my code is ready to deploy.  Let's do it in a tested way, just like you're talking about. 


Haley:              Right.  Where we reduce the blast radius.  We have the monitoring in place so we know if we're negatively affecting customers, and then, be able to provide that feedback back to the service owner to know like, "Hey.  This is causing problems for your customers." 


Lyle:                That sounds like you're, kind of, growing into a new area in some regard because you're hoping to help teams deploy different things. 


Haley:              Right.  Yeah.  So, it's very different, but it's basically leveraging a lot of the components that we've already built.  And also, I just want to call out, like, we haven't built everything from scratch.  Like, we use our existing canary analysis for doing the, like, number crunching and the statistics around whether or not something passes.  We leverage our Spinnaker infrastructure for actually spinning up new clusters and tearing them down.  So, like, we're basically just gluing a bunch of these components together that we have across the company, but in just a more unified way and providing it to our end users that can use it for all sorts of different types of production changes. 


Aaron:             It's totally awesome.  Having a safe, effective, and easy-to-use production experimentation platform is going to unlock so much developer productivity and innovation.  There's a lot of, like, keeping up with the changing Netflix that has to happen, a new operating system release gets out, a new dependency gets upgraded.  Right now, we ask each of the different service owners to have to keep up with this pace of change, whereas, if we are able to fully take advantage of this production experimentation platform, you could imagine that the centralized teams could actually perform these upgrades on those service owners' behalves just by checking the continuing health of their service through this experimentation, which is amazing. 


Lyle:                Right.  So, for example, some expert in some Groovy library or something everybody seems to be using, they could say, "Well, I'm going to make a change to that library and roll it out in a very small percentage, effectively making it more efficient."  And, a developer like me that actually owns that code would later see a check-in that was changing it.  And, like, we test this.  It's cool.  I'll be like, "Oh, cool.  Better efficiency."  That's neat.  That's a really neat—


Aaron:             Without requiring you to go in and do it yourself. 


Lyle:                Right.  Without training every one of the engineers that might have to do it about the process.  Yeah.  That's cool.  I like that idea.  I'm a thumbs up on that one. 


Haley:              Awesome. 


Aaron:             So, one of the ways that my team is going to try to take advantage of this platform, particularly at the load testing, is then use that to figure out how big a computer the software should run on or what we call right sizing their instances.  So, our cloud provider offers a huge variety of different sizes and shapes of resources per machine. 


Lyle:                At different costs. 


Aaron:             At different costs and different performance characteristics.  And so, that's an opportunity for us to really serve the business.  And then, it also provides my team the ability to create scaling policies and other kinds of things without requiring the service owners themselves to do work.  So, it's really this general trend of being able to centralize and routinize a lot of these more, like, laborious operational burdens. 


Lyle:                Right.  There's a service that I use that's logging analysis stuff called Raven, and whenever you do that and actually want to get charts out of it through this other service called Mantis, you actually have to decide to spin up an instance as a developer.  And, there's, like, a drop down, and there's, like, five different instance sizes.  And, I never know what to put.  Like, how would I know it's appropriate? 


Aaron:             It's the biggest, right? 


Lyle:                I'm going to run it for five days, so biggest one it is.


Aaron:             Yeah. 


Lyle:                And, of course, we were really smart about also going, "Hey.  By the way, how much did it cost Netflix to do that?"  So, I can go, "Is this enough?  Is the research I'm doing worth the cost, you know?"  So, the service thing, that's, kind of, how we run with just everyone makes the right decision effectively.  But, it's neat to hear that what would be, of course, easier in a production world, I say, "I need an instance," and you already know that AWS is my application and you provide the instance and the scale, or even the service of, like, we think this is the one that will work for your service.  That would be also very beneficial to making sure we're not overusing, like, underusing the stuff we have.  Right?  And, that's a big reason for it. 


Aaron:             Absolutely.  And, with our culture of freedom and responsibility, you get a lot of local decision making happening where people pursue their local optima, but, if you were to combine all these different decisions together, it may not be the same as what we call, like, a more globally optimal solution.  So, for instance, when we think about what kind of excess capacity we should have to handle load spikes, each individual service owner is going to look at what a particular load spike would mean for their system, but not all load spikes have the same kind of ripple effect throughout all of the infrastructure.  So, as a centralized team that understands these dynamics and interrelations, we are in a better position to have more accurate decision making. 


Lyle:                Yeah.  All right.  So, we've talked a bit about what you guys are excited about changing in the future, how the systems work.  Aaron, you have a book on this.  So, this has become something that we've published. 


Aaron:             Well, I'm the first author on the second line.  There are five of us part of that effort. 


Lyle:                All Netflix employees. 


Aaron:             Yeah. 


Lyle:                And why did you guys write a book about this?  I mean, it's beneficial for us, obviously, to run the company.  We talk about it publicly.  You go to conferences and stuff.  What's the reason for you to go to conferences and to also write a book and share this knowledge?  What benefit does it get to the company, or is it just an ego boost for you? 


Aaron:             Well, follow me on Twitter. 


Lyle:                What's your handle on Twitter? 


Aaron:             aaronblohowiak 


Lyle:                Okay.  Thank you.


Aaron:             More seriously though, I want to create as much value as I can in my life, and I also want the systems that I use to work.  So, this fulfills both of my personal goals and desires, and then, one of the things that we want to do is to help other people understand our thinking about these problems so we can find people who think about them similarly or want to tackle these kinds of challenges in the same kind of ways, or even come to us, challenge our thinking, and broaden our horizons.  And so, it's all about being able to connect with people and hopefully figure out how we can work with them at some point. 


Lyle:                Haley, did he just say it's a recruiting tool? 


Haley:              Pretty much. 


Lyle:                Okay.  Thank you.


Aaron:             I am a manager. 


Haley:              Yeah.  So, I think there's a couple of things.  One is the idea of running chaos experiments in production is relatively, like, it's new and it's growing.  And, like, so, there is a community around it, and it's nice to, like, see what other people are doing.  So, one of the benefits of going to conferences is seeing what other people are doing also and hearing their stories and hearing things that they're struggling with, and you can, kind of, compare notes.  So, I think that's, kind of, a nice aspect of going to conferences, but yeah.  Recruiting is obviously always a—


Lyle:                What's the pitch when you're looking at an engineer and they're, like, really awesome at their job and they're looking at reliability and they're, kind of, in that space?  What do you say to them that convinces them, ah, maybe Netflix is the right place? 


Haley:              One of the things that I like most about Netflix is being able to work with, like, really amazing people, work on really big problems.  There's always something new to go after, and I also have the freedom and ability to, kind of, tackle these things—


Lyle:                That you find interesting? 


Haley:              …that I find interesting.  Yeah.  So, if you're passionate about distributed systems and how they fail.  Like, one of the things that I've always enjoyed since I've been here is there's this culture that just because you cause a problem in production doesn't mean you're going to get fired.  There's, like, this exercise around understanding what happened, and building, like making the system better.  And so, if you're excited about those sorts of things, then I think this is a great company for you. 


Lyle:                That's one of your, that idea, the kind of person that would be excited about how we do that and what we do. 


Haley:              Right. 


Lyle:                Yeah.  I think that production fail thing is always a good way to—When people go, "Well, but, if you took down Netflix, that would be really bad," the thing is, once you've taken down Netflix, you're the least likely person to do that ever again.


Haley:              Exactly. 


Lyle:                That's the kind of person you want around. 


Aaron:             I've taken down Netflix multiple times. 


Lyle:                What do you mean you take it down multiple times?  Like, Tuesday, you just unplug something, or—


Aaron:             We swing a very big hammer.  Sometimes, that can cause a bit of damage here and there, but it's important to do it and it makes us stronger overall.  And, to Haley's point, it's all about learning, and we are progressively approaching higher and higher and higher levels of quality.  But, it's not a straight line.  Right?  There are going to be paths where we have to backtrack and try a different approach. 


Lyle:                So, Aaron, you're a manager.  You talk to people all the time, engineers.  What do you say to them to let them change their mindset to maybe I do want to work at Netflix?  What do you say? 


Aaron:             I'm still working on it.  I think that—


Lyle:                Do you pull out philosophy, French philosophy? 


Aaron:             No.  No.  No, no, no, no, no.  And, I definitely don't expect anyone else to be interested in that.  That is not a job requirement.  One of the things I do try and say is that we don't want more of the same of what we have.  We actually want to have a variety of different types of viewpoints and approaches.  I guess I really say that we want you to do your best work, and we trust you to do that.  And, that is so profoundly different from most other places where I've worked, and we value high quality and treating people like adults.  So, if you need to take time off to go do whatever it is that you need to do, do it, and then get your work done.  And, it'll be cool.  Like, the people over process is something we live and breathe, and it's just such a breath of fresh air. 


Lyle:                Let's wrap up with something a little fun.  What are you watching, Haley? 


Haley:              Oh, man.  So, I've been watching The Great British Bakeoff.  I've gotten totally addicted to it, and I can't even explain it.  But, I love it. 


Lyle:                I agree.  It's a fantastic show.  That's one of the family, my family watches.  Yeah.  So, that's great. 


Aaron:             Have you started looking for proofing cabinets online? 


Haley:              No, but I love the oven doors that slide underneath the—


Aaron:             Ooh, very nice. 


Lyle:                And, Aaron, what are you—


Aaron:             I love Bojack and Kimmy Schmidt.  So, if I want to feel sad or happy, I pick one of those.


Haley:              I know.  I love Kimmy.


Lyle:                Well, Aaron and Haley, thank you so much for being on We Are Netflix.  I really appreciate it. 


Haley:              Thanks for having me. 


Aaron:             Thank you. 


Lyle:                And, thanks for keeping our systems up.  I definitely appreciate that.  This has been the We Are Netflix podcast.  I'm Lyle Troxell.  Thanks for listening.