WeAreNetflix

S1: Machine Learning Infrastructure at Netflix

Episode Summary

Hiring, firing, managing, and freedom and responsibility. Julie Pitt and Faisal Siddiqi discuss improving our infrastructure around new technologies and empowering data scientists to use their machine learning models to make Netflix better every day.

Episode Notes

Episode Transcription

[Music]

Female: We are Netflix, the podcast for people who love Netflix and want to learn how we do what we do.

Faisal: I paint and draw.

Lyle: What kind of painting?

Faisal: Landscapes, acrylics, oils, all sorts of things. I have a large studio at home.

Lyle: Are you the Bob Ross of machine learning? Faisal Siddiqi holds a bachelor of science in Nautical Engineering from AMU in India and a master of science in Electrical Engineering from Stanford University. Faisal started working on wireless application protocol gateways, first as a cofounder of his own startup, and then spent four years as a software engineer at Route [phonetic 00:00:41] Science Technologies, where he pioneered Route Science’s network monitoring and decision-making application software.

Faisal moved to be an engineering manager at Avaya and then as director of engineering at Conviva, where he led platform engineering teams working on quality of experience of streaming video utilizing machine learning. Over three years ago, Faisal Siddiqi joined Netflix as an engineering manager of personalization infrastructure. Thank you, and welcome to We Are Netflix.

Faisal: Thank you. Great to be here.

Lyle: Julie Pitt has an associate of arts from De Anza College and a bachelor of science in Applied Mathematics from the University of California-Davis, where she also did graduate work in Computer Science. During college and afterward, Julie worked at Lawrence Livermore National Laboratory, where she worked on, among other things, the bio-encyclopedia.

Julie Pitt was first with Netflix about 10 years ago as a senior software engineer, transitioned to a manager of streaming server engineering, and then left Netflix for a few years to help in the recommendations infrastructure at StumbleUpon, then on to Lyve Minds, Inc. as an engineering manager helping to bring forgotten memories to life.

Julie then started Order of Magnitude Labs, a startup introducing autonomous agents that learn common sense to problems. Julie Pitt, welcome back to Netflix and your five-month position as director of machine learning infrastructure. What’s it like coming back? Oh, and welcome to We Are Netflix.

Julie: Thank you. That was a very comprehensive introduction.

Lyle: We have impressive people here. I want to showcase that.

Julie: Yeah. It’s amazing to be back. In the five years since I’ve been gone, there’s a whole new set of problems to solve. You know, since I’ve been gone, Netflix has become not just a tech company that has great content, but is now developing original content and has become a studio. And we’re now applying tech to these problems, and so it’s really exciting to explore this new space.

Lyle: Yeah. Julie, you left Netflix of your own choice?

Julie: Yeah.

Lyle: Okay. And, yeah. [Laughter.] We’ve had other examples of people leaving not by their choice and actually coming back. See the last episode of this show. But what did you miss when you left Netflix? Like, when you came back, it was exciting because of course we’re doing all this new, you know, originals programming. We’re doing a lot of new creation of content. But, from a business perspective, what was it like here versus the places you were during your time away?

Julie: I would say, I mean, they were very different environments. One of the environments was a company trying to establish a culture. And so one thing that I missed was a culture that had been established but was still evolving. Right? At Netflix, the culture is never done, but it was much more mature and much farther along. And so, at these other startups that I was at, it was not so far along. And so it made me appreciate just how much focus is required at all levels of the organization to create a great, fantastic culture.

Michael: One of the benefits we have here is that, you know, we’ve talked about it in some internal podcasting, but at a certain point we decided how important culture was. And that was at a very early stage within the company. So we’ve had a long runway to really develop that, whereas it feels like a lot of companies accidentally fall into the culture that they have. And so we’ve kind of shaped what we want for many, many years.

Lyle: Faisal, how do you feel about the culture here? Is it important to you how we operate?

Faisal: Absolutely. I mean, you know, I joined a little over three years ago, and 100 days after joining, I wrote my first-100-days blog post. And I talked about some of the things that actually stood out to me. I’d always admired Netflix as a company for its, you know, technical prowess. But, as I read more about it when I was contemplating joining, I saw a lot of good things coming from what I read in the Culture Deck, or the Culture Memo now.

And then there’s also a bunch of FUD out there about, oh, it’s a cutthroat culture. And, you know, as a prospective candidate evaluating whether I should join something like this, it was not easy to figure out what is the reality versus what’s not. And that’s why, once I had been here for three months, I felt that I needed to sort of clarify that for others who were in a similar situation. And I was amazed at how real and how living the culture deck had been.

And I saw examples that ran counter to what I had seen in other places, where people were really sometimes brutally honest with the feedback that they wanted to give. My first rude awakening happened in a meeting that I had organized, where I had come with quarterly plans of what my team should do. And, basically, one of the engineers looked at me and said, “Yeah, we don’t do that here.”

Lyle: Can you dive in a bit more? What do you mean?

Faisal: Yeah. So the idea there was I was trying to predict what are some of the things that we should plan out for the next three quarters, and let’s try to fit this in and all that.

Lyle: And assign those things to people…?

Faisal: Yeah. And so, basically, it really depends—various things here have a lot more autonomy, and especially we want to try to make sure that the managers are playing the role of context providers and enablers and people who connect the dots, rather than people with strong opinions who want to influence how the product is going to be.

And so, when you’re trying to say here are the various things that, you know, we want to be able to do, obviously there’s a certain degree of prioritization that is necessary when we do that. But engineers don’t like to be told, you know, how to do things. And, you know, that context, not control, is something that’s sort of ingrained in the culture and...

Lyle: So you’re basically saying, you know, these are the things we need to solve, and here’s the context of why. And then the engineer has more freedom to solve the problem in the way that makes sense to that person.

Faisal: Absolutely. And so our role is really about identifying the opportunities, making sure that people have the right connections and are talking to the right people, and providing the business context on why we are prioritizing one thing over the other. But then how to build it out, what technologies to use, what languages to use, you know, what the long-term implications are of picking a new language like Rust or Go, it’s all left up to our, you know, well-formed adults, as we like to say our engineers are.

Lyle: Faisal, you’ve got like 12 direct reports. That seems like too many people. How do you handle that?

Faisal: It’s about right, in my opinion. And it really depends upon the kind of people that you have. If you have people who have enough autonomy that they can essentially self-direct, from a technical perspective, what they want to do, it really makes my job easier, because I can then focus on what I said: how can I provide them context, and how can I make sure that what they’re working on is relevant? And there are different teams.

There are some teams that are small. I have a sub-team where a manager on my team manages about six people. I’d say somewhere between 12 and 15 becomes too much. We’re always thinking about how to scale and figure things out. But at this point, I feel that, because many of those people are experienced and have been doing this for a while, they have a lot of autonomy, and they’re essentially running projects which are two- or three-person projects.

So there’s a whole plethora of different tools that we’re building. And it’s manageable. And the only reason it’s manageable is because of the kind of people we are able to bring in. If we bring in a lot of people who are very junior and need a lot of mentorship and handholding, we won’t be able to scale. And that’s true for most of Netflix.

Lyle: Yeah. So let me ask you both, since you both manage people, and since you mentioned earlier, Faisal, the cutthroat mentality that people outside Netflix think we have. How many people have you guys, you know, fired, had to let go?

Faisal: I had to let go of one engineer about a couple of years ago. And it’s never easy.

Lyle: You manage 12 people and more than that under you, and the last two years you’ve let go of one person?

Faisal: Yes.

Lyle: So it was a difficult thing, but that doesn’t seem cutthroat to me, in the sense that, like, if you wanted to really prune, you could do it every year or something.

Faisal: Right, exactly. And, you know, we say we hire and fire smartly, and we don’t take these decisions lightly. And it’s never a surprise to that person. It’s never a surprise to their team. If either of those is happening, then it’s the manager’s fault, and the manager should probably be fired. [Laughs.]

So, you know, we try to give feedback, obviously. And sometimes there are certain behaviors that are just not improving. We don’t believe a lot in having all these performance improvement plans and structure. You know, you can read about that in our memo. But sometimes we just have to make the right decision, and then the team, surprisingly or not, is actually thankful and grateful for being shown that the quality of the people matters.

Lyle: Yeah, and it’s challenging. Okay. Well, it’s good to hear the perspective from a manager. We’ve talked a lot with independent people and stuff about what it feels like.

Faisal: It’s hard to have that conversation, but you have to do it because, you know, I wouldn’t be doing my job and I wouldn’t be making my team as effective if I don’t have the A team that I need. And we talk about having a dream team, not a family.

Lyle: All right, thanks. Now let’s go ahead and talk about what you actually do here. [Laughing.] Let’s talk about machine learning and infrastructure at Netflix. You clarified earlier that, as the two infrastructure arms, you guys are doing infrastructure, not the machine learning algorithms. I don’t think a lot about the infrastructure side. Everything’s just in the cloud. It all just works. So, can one of you tell me what it means to make infrastructure for machine learning?

Faisal: Yeah, sure. I can take a stab at it. Actually, there’s a famous paper by some folks at Google who’ve talked about how, as machine learning has really expanded, there is so much emphasis on the actual core ML algorithm code, which, you know, obviously requires a lot of science and research to get to that point.

But then there’s this whole other set of systems that, you know, have to be designed well, have to scale well, and have to be reliable, and that are necessary for getting to those really small, but very meaningful, percentage gains that machine learning models and the quality of those models are after.

And so it’s about data quality. It’s about making sure that the input to your training data is appropriate and making sure that you have the right configuration parameter management, making sure that you are able to do the right feature engineering, making sure that you are able to give the researchers the ability to use any of the many upcoming ML toolkits, and then about model selection, picking out the right parameters. Lots of these things have to be in place.

And, in many ways, you know, you have all the requirements of good, well-engineered software engineering systems, and on top of that, machine learning adds, you know, a sense of sort of how you need to get your data, you know, sanitized. And so it adds its own level of essentially new engineering requirements on top.

So a lot of what you do in ML is about picking the right algorithms and the right network architectures for your deep learning models, but a lot of it that happens behind the scenes is just good software engineering. That’s basically what our two teams essentially focus on, trying to build up that platform so that the researchers can innovate on top of that platform.

Julie: So, let me put it this way. If Netflix just said, “Hey, we made the content and we have a bunch of MP4 and audio files. Here you go, users,” would we have a product? Well, guess what. There’s a whole lot of infrastructure to get that content to the user that needs to be built. And a similar problem exists for data science. Data scientists create models, and there’s a whole lot of work that goes into that. And when you’re trying to get that into a product, that’s just the beginning.

Lyle: Julie, let’s—I think I’ll—the four of us in the room right now are all engineers and understand this stuff pretty well, but I want to kind of open up some of this conversation to just anybody that happens to be listening to this podcast in multiple fields. Tell me what a model actually is in regards to machine learning.

Julie: So the simplest way to think of a model is really just a function that gets learned that takes some input, some X, and produces an output, Y. And you hope that the output is going to predict something of value to you. So, for example, one of the problems that we have is, when we are about to decide whether to make new content or purchase new content, how much is that content going to be viewed by users globally?

And, if we have a good idea of that, we can understand how much this content is worth. So the model would essentially take in a wide variety of inputs that might signal what the demand could be and produce this predictive signal that’s useful for folks making decisions.

Michael: So, in other words, it’s just a function approximator?

Julie: That’s a very good way to think about it.
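As a minimal sketch of the "function approximator" idea above: a model is learned from example (X, Y) pairs and then used to predict Y for a new X. The data, the single input, and the `fit_line` helper are all invented for illustration and are not Netflix's demand model; real models take many inputs.

```python
def fit_line(xs, ys):
    """Learn y ~ slope * x + intercept from examples by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # The returned function is the "model": input X in, prediction Y out.
    return lambda x: slope * x + intercept


# Hypothetical training examples: one input signal and the observed outcome.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
prediction = model(5.0)  # predict the outcome for an input not seen in training
print(prediction)
```

The training step (`fit_line`) and the prediction step (`model(...)`) are deliberately separate, which mirrors the split between learning a function and using it.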

Lyle: So, in general, the scope there in that problem that you were stating is, we have a whole bunch of people that are paying us money every month. And they all have multiple profiles. Those are whole families of people. And all of them like watching certain things. And we kind of know a bit about what they like watching, because they watch stuff with us.

And we have a whole side that we’ll talk about in a second, about just giving them the right content. But when we’re talking with a creative person who’s written a script or has a few actors involved in a project, and it’s about shoe sales in the U.S. or something, some documentary or something like that, there’s this question of, you know, whether we can find interest in that. People around the company can find interest.

But the ideas of what we might like, as the people that choose what content to pay for, aren’t a good representation of what our audience would like. So we really want to try to marry those two. We want to buy stuff that our audience is going to enjoy, and we want to make sure that we produce that. And we want to spend money in a way that makes sure we please as many of our users as possible. Is that kind of the scope you’re talking about?

Julie: Yeah. I mean, the bottom line is that we’re building a global content business. We’re building content for audiences across the world. And, yes, we have expertise in house, and we’re trying to keep that as a diverse set of expertise. But at the same time, that’s not the complete picture. And we want to inform those choices by the data that we use.

Lyle: It seems like that’s a very hard problem to solve, in the sense that how do you find inputs to new creative that are valid inputs?

Julie: That’s an active area of research, as a matter of fact, because if you don’t have traditional things like box office numbers, because it’s something brand new, there’s a whole host of things that we can try. And this is really only one problem. You can actually take these models that might predict demand for content and extend them to, “How do we distribute the content on our Open Connect CDN?”

Lyle: Is that in your bailiwick as well? Open Connect?

Julie: So Open Connect is on a different team. But there are data scientists actively working on this problem who are going to use our infrastructure to develop their models to solve it. And the question is, if you have a brand new piece of content that’s never been released, where in the world and to what caches are you going to deploy it?

That’s another hard problem that has a different quality from buying content. At this point, you have to know exactly which assets, what bitrates, what languages for audio and subtitles are you going to put, and where are you going to put them in the limited space that you have?

Michael: So, just to kind of catch up our users, Open Connect is kind of these boxes that we’ve spread throughout the world that contain all of our content. They are...

Lyle: But not all of our content.

Michael: Not all of it. And the reason why they cannot have it all is because we have lots of different bitrates and lots of different encodings, which means that it would be just petabytes of data. Therefore, since we only have a finite amount of space, we have to put whatever we think is the most popular in each region.

Lyle: Keep in mind, I don’t know if it’s really petabytes. Do you know that?

Male: No.

Michael: Well, I can do some napkin math, but it is a lot. I don’t know which one’s bigger, exa or peta, but one of the two is bigger.

Faisal: Exa is definitely bigger.

Michael: Okay. We’re probably not in exa then.

Lyle: So, one of those boxes that might be in the south of France would have content that hopefully the people in that area are planning on watching. If somebody in that area starts watching something that’s available to them but not on that box, it’s a slower playback for them, because we have to move that file or have them stream it from somewhere else.

So we want a predictive model that says, “When this new piece of content comes out, which of these boxes does it go onto?” And that’s a data science machine learning problem that we have to solve, yeah.

Julie: That’s right. That’s right. And that’s something you could solve with kind of traditional methods of, “Hey, based on how much this content has been viewed in the past, what do we predict?” Those methods could be improved.
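A rough sketch of that "traditional" approach, under stated assumptions: demand is forecast as the mean of past daily views, and a box with limited space is filled greedily by predicted views per gigabyte. The title names, sizes, view counts, and both helper functions are hypothetical, and the greedy heuristic is one simple option rather than the method Open Connect actually uses.

```python
def predict_demand(past_daily_views):
    """Naive forecast: the mean of recent daily views."""
    return sum(past_daily_views) / len(past_daily_views)


def choose_titles(catalog, capacity_gb):
    """Greedily pick the titles with the most predicted views per GB that still fit."""
    ranked = sorted(
        catalog,
        key=lambda t: predict_demand(t["views"]) / t["size_gb"],
        reverse=True,
    )
    chosen, used_gb = [], 0.0
    for title in ranked:
        if used_gb + title["size_gb"] <= capacity_gb:
            chosen.append(title["name"])
            used_gb += title["size_gb"]
    return chosen


# Hypothetical regional catalog: storage footprint and recent daily views.
catalog = [
    {"name": "title_a", "size_gb": 40.0, "views": [900, 1100, 1000]},
    {"name": "title_b", "size_gb": 10.0, "views": [300, 500, 400]},
    {"name": "title_c", "size_gb": 60.0, "views": [500, 700, 600]},
]

placement = choose_titles(catalog, capacity_gb=60.0)
print(placement)
```

A brand new title has no view history, so `predict_demand` has nothing to average; that gap is where a learned model with other input signals would come in.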

Michael: So, the Open Connect kind of filling of data, do we also use the same prediction models for how we promote within our service? Like if we go, “Hey, this area is going to be hot for this upcoming new show. Hey, let’s make sure we promote in the same way in the same region.” Or is it two different...

Julie: Well, it’s interesting that you bring that up, because those two problems are related, because what you market will affect what the demand is. The question is, can we connect that kind of on the backend, that the algorithms that are recommending content, marketing content, inform the distribution algorithms? And that’s an active area of research.

Faisal: Yeah. And I would add to that that, essentially, it’s the same set of inputs that are going into them. We’re not necessarily sharing the exact models, because if you look at it closely, the problems are actually slightly different, and the actual predictive functions that the distribution requirements versus the promotional requirements need are slightly different. But a lot of the common inputs, what people have watched, in what geographies, what kind of content is popular in a particular area, are very similar.

And so, to that extent, we share a lot of upfront analysis or ETL outputs, but then the actual problem statement and therefore the models themselves are different. But, as Julie mentioned, we are always looking for exploration in how we can sort of better work together in terms of the various efforts that could be well aligned together. And this is certainly one area where demand and essentially supply need to be aligned.

Lyle: So, you just mentioned two examples where I didn’t really think of us using machine learning: potential investment in content, and also the distribution of that content. Because, of course, when people think about algorithms and machine learning at Netflix, they think about personalization. Let’s talk a bit about the scale problems of making sure that the right movie goes to the right user. What kind of infrastructure stuff do you solve?

Faisal: Yeah. So machine learning at Netflix, you know, can be traced back almost a decade, starting with the Netflix Prize. And, really, that was a benchmark in leveraging external academic and industry research to come together on a problem, when we put out basically a large dataset of our movie ratings and asked people to better our algorithms.

Lyle: And offered a big prize?

Faisal: We offered them $1 million, which was later on claimed by an ensemble of teams. But it really sort of set up an example of how to open up some data and get collaborative effort coming out of it. And we actually used some of the algorithms in the product that were suggested by some of those teams. It’s been a long time since then. The technology has evolved.

There’ve been two secular trends. One is that the compute infrastructure has just gotten far better with Moore’s Law. The machines and the GPUs, especially around 2010, 2011, have gotten substantially more powerful. And, essentially, with that compute infrastructure, the power of complex deep-learned models has actually started to become very useful in business applications. And that’s what we have been focusing on for the last two, three years.

So the infrastructure that we’ve built in the personalization space has to scale at the scale of our member profiles. And, you know, we have 135 million members and growing. And, for each of those accounts, we have, you know, up to five different profiles. And so, now we have this problem of hundreds of thousands of content pieces and hundreds of millions of members. And we need to match them together.

Lyle: And those datasets get bigger all the time.

Faisal: And those datasets are only getting bigger, which is a great problem to have, but nonetheless something that we need to scale to. So, one of the unique requirements in our work, which isn’t as much of a thing that Julie’s team has to worry about, is that, because our member-facing systems are essentially all Java microservices, we have to make sure that the infrastructure we build can scale and fit well in the JVM-based environment. And so, you know, we utilize Apache Spark a lot for computing, because it allows us to do essentially large-scale distributed computing, all in an environment that fits in well with the JVM world. We...

Julie: Is most of this running in Amazon instances?

Faisal: Yeah. Almost everything is running in AWS. Our entire control plane is on AWS. The Open Connect part that was alluded to earlier is obviously separate. That’s essentially our CDN, and that’s running in various stacks of machines and various high speeds. We call that the data plane, and everything else that we all do here is the control plane. And our control plane is in AWS. And Netflix has invested a lot of time building really good tooling on AWS, so...

Lyle: And open sourced some of that tooling too. Yeah.

Faisal: A lot of it actually has been open sourced, and some of it has come out from our personalization teams as well, and so—

Michael: Are any of the projects your teams work on open source?

Faisal: Yeah. One of the teams—one of the projects that my team had worked on early on is called EVCache.

Michael: Yeah, sure.

Faisal: And EVCache is used extensively within the various microservices teams to store data. It’s basically a high-throughput, low-latency cache built on top of Memcached, and we provide essentially global replication in AWS and all the fun stuff there.

Lyle: I didn’t know EVCache was a Netflix product. I know it’s…

Faisal: Yeah. We’ve been using it extensively, and I think the personalization use case is one of the first big users of it, but it’s been used extensively now. And the way we use it is because you can imagine, when somebody hits play on their Netflix app, we really need to be able to, you know, get the content to them really quickly. And so, in the request path, we can’t be doing these big math computations that machine learning requires.

So the way we use EVCache is, we essentially do a lot of pre-computations. As an example, let’s say we have a model for PVR. PVR stands for personalized video ranker. And, essentially, we rank the entire Netflix catalog for every active member profile every so often. So we always know what is title number one for you, Lyle, and title number, you know, one for...

Lyle: Seven or whatever.

Faisal: Exactly.

Lyle: Of the shows that we haven’t watched, most likely?

Faisal: It’s the entire catalog that we rank, and then, obviously, the ones that are at the top are the most relevant ones.

Lyle: So, when I get a row on Netflix of movies, the very first box art in the very first row is the one that’s the highest of that rank, kind of?

Faisal: Kind of.

Michael: Ballpark.

Faisal: And the reason for ballpark is that there are different models. The default model that actually does the column ranking, that is PVR, personalized video ranker. But some of these models are much more volatile. They need to be updated more often. So, for instance, the Continue Watching row is often the top row, and that can be ranked just by your preference in our understanding. It’s really literally what you’ve watched, and you want to continue watching that.

Lyle: And there’s another one called Trending Now and another one called Popular on Netflix. These are things that have different algorithms behind them.

Faisal: They all have different algorithms behind them, and they have different sensitivity to time. For instance, Trending Now, as you can imagine, needs to be updated much more frequently, whereas many others can be, you know, scaled and ranked, you know, once a day or once every few hours.

Lyle: So you were just saying that, when a person pushes play, you don’t want to do these calculations. Why would you do any calculations of that stuff when you push play?

Faisal: Yeah. So, technically, it’s when they go on the app.

Lyle: Okay, when they launch the app.

Faisal: And we have to actually show them their home page. That’s when we need to make sure that we’re presenting the right recommendations to you.

Lyle: But even then they can push play too.

Faisal: They can push play, exactly. But even that has to happen on the order of milliseconds. And so you still can’t do a lot of inferencing and prediction, which also needs to happen using machine learning. What we do is, we basically run these precomputes every few hours, and then we store the results, the ranks, the PVR ranks, in EVCache. And so, when the request comes in...

Lyle: Per user?

Faisal: Per user, per profile.
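The precompute-then-look-up pattern described above can be sketched as follows. A plain dictionary stands in for the cache, and the scoring rule, profile data, and function names are all hypothetical; nothing here uses the real EVCache client or the actual PVR model. The fallback to a non-personalized row on a cache miss is also an assumption of this sketch.

```python
def score(profile_history, title_genres):
    """Hypothetical scoring rule: add up how often the profile watched each genre."""
    return sum(profile_history.get(genre, 0) for genre in title_genres)


def precompute_rankings(profiles, catalog):
    """Offline batch job: rank the whole catalog for every profile, keyed by profile."""
    cache = {}
    for profile_id, history in profiles.items():
        cache[profile_id] = sorted(
            catalog,
            key=lambda title: score(history, catalog[title]),
            reverse=True,
        )
    return cache


def homepage_row(cache, profile_id, default_row, size=2):
    """Request path: one lookup and a slice, with no model math at request time."""
    return cache.get(profile_id, default_row)[:size]


# Hypothetical catalog (title -> genres) and per-profile genre watch counts.
catalog = {
    "murder_mystery_1": ["mystery", "crime"],
    "baking_show": ["food"],
    "heist_film": ["crime"],
}
profiles = {
    "lyle": {"mystery": 5, "crime": 2},
    "michael": {"food": 3},
}
default_row = ["baking_show", "heist_film", "murder_mystery_1"]

cache = precompute_rankings(profiles, catalog)  # would be rerun every few hours
lyle_row = homepage_row(cache, "lyle", default_row)
unknown_row = homepage_row(cache, "new_profile", default_row)
print(lyle_row, unknown_row)
```

All of the expensive work lives in `precompute_rankings`; the request path only reads, which is what lets it stay within a millisecond-scale budget.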

Lyle: So, does that mean that that EVCache is different per region? Because there’s no reason for me to—like right now I’m in California—there’s no reason for my ratings to be in India somewhere, right?

Michael: It’s globally replicated.

Faisal: It’s globally replicated across the various regions.

Lyle: Just in case I’m traveling and I log out.

Faisal: Absolutely. And it happens. It happens. I mean, there are multiple regions, multiple AWS regions even within the U.S., and sometimes there’s a case where you actually were traveling. You were on your mobile phone. You were recommended something.

Lyle: I travel sometimes, yeah.

Faisal: Well, what I was saying was that you were traveling through an area where you were handed off between different regions. And so we need to make sure that the caches in all the regions are up. And, obviously, there’s some smartness to storing the right amount of data, because, you know, if you are generally using the product in the United States, we don’t have to have all of the recommendations in other regions. But we have globally replicated EVCache, and you should be able to access the data from a close region.

Michael: And you got to remember, we also run a bunch of Chaos Kong exercises. We take down regions. We route people throughout the world.

Faisal: Excellent point.

Michael: And so, if it’s not globally replicated, you get a default LoLoMo pretty quickly.

Lyle: That’s actually—we should probably talk about that in some of the future episodes of the show. But, basically, that idea is, “Let’s go ahead, and if we have different fallbacks, let’s test that fallback experience. Let’s destroy a whole region, turn it off, and see if the service is working.” And we do that regularly just to make sure that everything’s working properly.

Faisal: And that’s a great example of Netflix’s, you know, fail-fast culture. We will literally take down an entire AWS region just to prove that we’ll be able to do fine. And it actually happens, and it’s not an issue. And all of our systems have to be supportive of that.

Lyle: And when we’re talking about regions, we’re talking about AWS, Amazon Web Services. So they have a West Coast and an East Coast and a European region. There may be another two or so. And those are serving traffic. If you’re on the West Coast of the United States, for example, you’re probably served from that West region. But we might turn that off one day and just have everything served from the East side. And the reason for that is, if there’s a problem, we want to be prepared for it. So that’s the...

Faisal: Yeah.

Lyle: Back on track to machine learning now. So you’re saying that we have to precompute that stuff, and we have to build infrastructure like the EVCache so that we can achieve the high performance throughput that we need for the user with that model that we’ve trained?

Faisal: That’s right. And EVCache is actually a fairly generic infrastructure that is used by a whole lot of other applications, not just ML infrastructures. But some of the ML-specific infrastructure that we have to do is a lot around, “How do you sort of manage the orchestration of these pipelines for these models that have to be trained?” So you have to—as I mentioned earlier on, there’s a lot of emphasis on data.

And so, you know, you need to make sure that your data is prepared well, your features are extracted out of that data. And then you feed that to essentially a trainer. And then trainer, which is essentially the learning function, comes up with a function. And then you have to do all sorts of hyper-parameter optimization.

So there’s a workflow orchestration systems that we’ve built. Meson is a tool. Dagobyson [phonetic 00:26:59] is a tool that my team has worked on. And then there are just various libraries that help you with that core feature engineering and training pieces.

There’s two libraries that I want to call out. One is called AlgoCommons, which is essentially a set of building blocks for various features and feature encoders and data maps that let you do the core—as a researcher, do the core definition of what is a model. Right? A model essentially is a specification of how data comes in, what to learn, what things to look at, and then what to predict. And so AlgoCommons provides that base layer.
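
The pipeline Faisal outlines (prepare the data, extract features, feed a trainer that returns a function, then tune hyperparameters) can be illustrated with a minimal Python sketch. This is not AlgoCommons or any Netflix code: the records, the function names and the choice of ridge regression are all assumptions made for illustration.

```python
# Hypothetical raw records; one row has a missing value on purpose.
RAW_TRAIN = [
    {"minutes": 30, "label": 2.0},
    {"minutes": 60, "label": 3.0},
    {"minutes": None, "label": 4.0},
    {"minutes": 120, "label": 5.0},
    {"minutes": 180, "label": 7.0},
]
RAW_VALID = [{"minutes": 90, "label": 4.0}, {"minutes": 150, "label": 6.0}]


def prepare(rows):
    """Data preparation: drop rows that contain missing values."""
    return [r for r in rows if None not in r.values()]


def extract_features(row):
    """Feature extraction: a bias term plus hours watched."""
    return [1.0, row["minutes"] / 60.0]


def train(xs, ys, l2):
    """Trainer: two-weight ridge regression solved with Cramer's rule.

    Returns a function, matching the idea that training "comes up with a function".
    """
    a11 = sum(x[0] * x[0] for x in xs) + l2
    a12 = sum(x[0] * x[1] for x in xs)
    a22 = sum(x[1] * x[1] for x in xs) + l2
    b1 = sum(x[0] * y for x, y in zip(xs, ys))
    b2 = sum(x[1] * y for x, y in zip(xs, ys))
    det = a11 * a22 - a12 * a12
    w0 = (b1 * a22 - a12 * b2) / det
    w1 = (a11 * b2 - a12 * b1) / det
    return lambda x: w0 * x[0] + w1 * x[1]


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def to_xy(rows):
    clean = prepare(rows)
    return [extract_features(r) for r in clean], [r["label"] for r in clean]


train_x, train_y = to_xy(RAW_TRAIN)
valid_x, valid_y = to_xy(RAW_VALID)

# Hyperparameter optimization: try each ridge strength, keep the best on validation data.
best_l2 = min(
    [0.0, 0.1, 1.0],
    key=lambda l2: mean_squared_error(train(train_x, train_y, l2), valid_x, valid_y),
)
model = train(train_x, train_y, best_l2)
```

In a production setting each of these stages would typically be a separate job, which is where a workflow orchestrator comes in.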

Lyle: And is AlgoCommons a Java library?

Faisal: It’s a Java library. And then there is a Scala library sitting on top. It’s a higher level ML API called Boson. And that essentially uses all the building blocks that AlgoCommons has and essentially allows you to do this high-level pipeline I talked about from data preparation, feature engineering, training and so forth.

Lyle: In Scala?

Male: That’s in Scala.

Lyle: And why is that in Scala?

Faisal: Yeah, it’s a great question. One of the reasons why it’s in Scala is because we wanted to leverage the power of distributed computing that Spark brings. Apache Spark is this really popular big data compute platform, and we use that heavily. And the reason for using Scala is because Scala fits well in the JVM world. And remember how I said that we need to be fitting well with the member-facing side, which is all Java apps?

Lyle: Yeah.

Faisal: One of the important aspects of machine learning is that there’s this—you train the models and then you score the models. Right? So the model training happens in an offline context. The model scoring, where you’re actually coming up with the prediction, that’s happening in an essentially online context.

Lyle: And the concept there is that if I all of a sudden start watching a whole bunch of murder mysteries, we want the system to go, “Oh, here’s some more of those.”

Faisal: Absolutely.

Lyle: And so it has to learn quickly my behavior.

Faisal: That’s right.

Lyle: But the training mechanism of like how our titles perform, you look at the data for a week or something, and then you match titles together and stuff. You can have a little bit more time with it.

Faisal: It depends. That’s true. You know, we can go back several weeks. In certain cases we have to train in a much shorter window, depending upon the kind of use case. But the point I was trying to make is that, oftentimes when you compute metrics after you’ve trained your models, you have some idea of how good the model quality is.

But then when you deploy it and you actually get real traffic on it, you may get differences in those metrics. And the goal of ML researchers is to align these two as closely as possible. And one of the ways we do that is by literally running the same code in AlgoCommons in an online context as we can in an offline context.

Lyle: So therefore it has to run the Java [unintelligible 00:29:25] too?

Faisal: Exactly. So it has to be compliant with JVM world, and therefore those feature encoders that AlgoCommons provides, they run for training in Scala through Boson, and then they run in an online context, so that’s additional.
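
The idea of running the same feature-encoding code for offline training and online scoring can be sketched in a few lines. The libraries discussed here are Java and Scala; the Python below is only an illustration, and the encoder class, the toy "model" (an average play rate per genre) and the data are all hypothetical.

```python
class GenreEncoder:
    """Hypothetical feature encoder: one-hot encodes a genre string."""

    def __init__(self, vocabulary):
        self.index = {genre: i for i, genre in enumerate(sorted(vocabulary))}

    def encode(self, genre):
        vector = [0.0] * len(self.index)
        if genre in self.index:
            vector[self.index[genre]] = 1.0
        return vector


def train_offline(encoder, history):
    """Offline path: learn one weight per feature slot (the average label)."""
    sums = [0.0] * len(encoder.index)
    counts = [0.0] * len(encoder.index)
    for genre, played in history:
        for i, value in enumerate(encoder.encode(genre)):
            sums[i] += value * played
            counts[i] += value
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def score_online(encoder, weights, genre):
    """Online path: the very same encoder object turns a request into features."""
    return sum(w * v for w, v in zip(weights, encoder.encode(genre)))


encoder = GenreEncoder({"mystery", "comedy"})
history = [("mystery", 1), ("mystery", 1), ("comedy", 0), ("mystery", 0), ("comedy", 1)]
weights = train_offline(encoder, history)
mystery_score = score_online(encoder, weights, "mystery")
```

Because both paths call `encoder.encode`, a change to the encoding logic cannot silently apply to training but not to serving, which is one way offline and online metrics drift apart.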

Lyle: So, do you have the same problems, Julie? Do you have to do the JVM context? Are you a Python?

Julie: So we don’t have that constraint. And we’re solving problems that are largely consumed by internal business users, so analysts, for example, being an end user. And, given the size and the breadth of the data science teams that we support, we want to enable them to be fairly autonomous from developing the model, from prototyping the model, to serving that model to those end user applications.

So the best path to doing that is, rather than asking data scientists, “Please learn Java on the JVM,” is to build tools that work in environments that they’re most familiar with. And that includes both Python as well as the R ecosystem. There’s actually a pretty solid base, both in Python and R.

Lyle: For machine learning problems?

Julie: For machine learning problems. And one of the core components that we’ve built is called Metaflow. And it’s a very unopinionated framework that has a client library and a service that allows a data scientist to very simply specify a step-by-step flow, a DAG or a directed acyclic graph of computation.

Lyle: Of course. Yeah, right. You know I know what you’re talking about. Go on.

Julie: And that basically allows them to say, “Hey, step one I’m going to get my training data. Step two, I’m going to do a hyper parameter exploration,” which basically means there’s different parameters of the model that you might want to try. And do you want to try those out in parallel? For example, if you’re training a deep neural network, one hyper parameter would be, “Well, how deep is that network? How many layers are there? Or how wide is each layer?” And so, doing that in serial, one by one, is slow. Doing it in parallel can be very fast. So we want to—

Lyle: It sounds like people are doing a lot of research here.

Julie: So we want to enable data scientists to do this kind of work very quickly, very efficiently, parallelize it without kind of having to have the understanding of, “Well, how do I parallelize, or how do I distribute this problem?” So Metaflow allows them to do that, and then bring it back together, select the next model. And then another use case would be what we call bulk scoring. Scoring is really just giving new data that the model hasn’t seen and then asking it to make a prediction. And you sometimes want to do that in these large bulk settings.
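
The step-by-step flow Julie describes (get the data, fan out over hyperparameters, bring the results back together, pick a model) can be sketched with nothing but the Python standard library. This is not Metaflow's API; the function names, the toy model and the learning rates are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def get_training_data():
    """Step one: fetch training data (toy points on the line y = 3x)."""
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def train_and_evaluate(data, learning_rate, epochs=50):
    """One branch of the fan-out: fit y = w * x by gradient descent."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return {"learning_rate": learning_rate, "weight": w, "loss": loss}


def run_flow(learning_rates):
    data = get_training_data()
    # Step two: one branch per hyperparameter value. Threads only show the
    # fan-out shape here; a real system would dispatch each branch to its
    # own machine or container.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda lr: train_and_evaluate(data, lr), learning_rates))
    # Join step: bring the branches back together and keep the lowest loss.
    return min(results, key=lambda r: r["loss"])


best = run_flow([0.001, 0.01, 0.04, 0.1])
```

The data scientist only writes the per-branch function and the selection rule; how the branches get distributed is the framework's concern.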

Lyle: So it seems to me the problem scope that you’re dealing with in—has a slower feedback loop than like the play data of a user. Like when the models on the client give me a row of content and I click the first or second or third box art in one of the rows, the model goes, “Oh, that one worked.” Right? Like we slowly are able to train the system by trying for the model you see how well I play. In the loops, in some of these loops though, they have to be a lot slower on whether you know you’ve got a good predictive model. Is that true?

Julie: Well, in some cases, if you’re buying content and there’s sort of a year runway for that, you might not learn whether the prediction was good for a year.

Lyle: Yeah.

Julie: In terms of, you know, how often are these models updated, in a lot of cases they’re retrained on a daily basis or rescored on a daily basis. So, yes, it’s not as real-time as the product-facing use cases, but we definitely have a lot of use cases where it does need to be pretty fresh.

Lyle: So your role, basically, is to make sure that infrastructure for the data scientists is running smoothly and there’s easy tools for trying these things to empower them to test things quicker?

Julie: To test things, but also to deliver them into some end-facing application. And so, one of the problems is serving that model. “Hey, I’ve developed this model, but now it needs to get into some kind of application that’s going to pull together multiple data sources.”

Lyle: Can we talk about a concrete example of something that you recently produced an application out of your—

Julie: Yeah. So one fun example is—there’s one component I should describe first, which is our model hosting service. And so, once you’ve trained a model with Metaflow, it produces an artifact, which is basically a serialized model. And the question is, “Well, what do you do with that? It doesn’t do you any good.”

Lyle: But that’s the function you were talking about earlier? It’s now this—

Julie: No.

Lyle: —big mathematical function you can put stuff in and get stuff out?

Julie: Exactly. And so, well, what do you do with that? Well, it’s not really any good unless you can kind of turn that into a service that some other application can call. And so we have a model hosting service. It’s kind of like a metaservice, because it’s a service that hosts other services.

Lyle: Sure.

Julie: So the data scientist can basically specify a fairly straightforward Python class that would define their own custom REST interface that basically loads the model and provides a REST interface that then some end business application can call. It’s a microservice within, you know, the hosting service. And so one fun example is one of our early adopters, he’s really enthusiastic about using Metaflow and the hosting service.
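
As a rough picture of the "fairly straightforward Python class" idea, the sketch below shows a class that loads a serialized model and answers one REST-style route. The class name, the `load` and `handle` methods, the route and the JSON artifact format are all hypothetical; the actual hosting service's interface is not described in this conversation.

```python
import json


class ThresholdEndpoint:
    """Hypothetical class a data scientist might hand to a hosting service."""

    def load(self, artifact_bytes):
        """Deserialize the model artifact produced by training."""
        self.threshold = json.loads(artifact_bytes.decode("utf-8"))["threshold"]

    def handle(self, method, path, body):
        """Map one HTTP-style request to a (status, JSON body) response."""
        if method == "POST" and path == "/predict":
            value = json.loads(body)["value"]
            return 200, json.dumps({"prediction": value >= self.threshold})
        return 404, json.dumps({"error": "not found"})


# The hosting layer would own the web server; here we call the class directly.
artifact = json.dumps({"threshold": 0.5}).encode("utf-8")
endpoint = ThresholdEndpoint()
endpoint.load(artifact)
status, response = endpoint.handle("POST", "/predict", json.dumps({"value": 0.7}))
```

Everything outside this class (repository setup, deployment configuration, scaling) is what the hosting service is meant to take off the data scientist's plate.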

And he’s been doing research in the NLP space, so, for example, pulling out named entities from screenplays and trying to understand how these entities relate to one another. And the realization was, well, NLP is a pretty general capability that’s needed—

Lyle: What’s that, multilayered perceptron?

Julie: Oh, that’s a great question. Natural language processing is NLP. Yeah, so it’s all kinds of machine learning-type applications to understand human language, written text.

Lyle: So this engineer has written a model to start consuming screenplays?

Julie: So he’s not an engineer. He’s a data scientist.

Lyle: Oh.

Julie: So he’s come up with this model. And what he did was, he defined a REST interface in the hosting service, and now he’s actually telling other data scientists about it. So the promise is, we, the machine learning infrastructure team, didn’t actually build a REST interface. He built it. And now he’s actually making it available to other teams.

Lyle: So, what does it do?

Julie: It essentially allows you to—let’s say one use case would be you send it a blob of text, so some kind of paragraph, and it will tell you, you know, what entities. Oh, here’s the—you know, this sentence mentions Washington, D.C. so it says, “Oh, there’s a city in this, and here’s the text Washington, D.C.” Or kind of—
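
To make the request and response shape concrete: text goes in, a list of typed entities comes out. The sketch below uses a plain dictionary lookup as a stand-in for the trained model, so it is not an NLP model at all; the entity list, labels and function name are hypothetical.

```python
import re

# Hypothetical lookup table standing in for a trained entity model.
KNOWN_ENTITIES = {"Washington, D.C.": "CITY", "Netflix": "ORGANIZATION"}


def tag_entities(text):
    """Return every known entity found in the text, in order of appearance."""
    found = []
    for name, label in KNOWN_ENTITIES.items():
        for match in re.finditer(re.escape(name), text):
            found.append({"text": name, "type": label, "start": match.start()})
    return sorted(found, key=lambda entity: entity["start"])


sentence = "The meeting is in Washington, D.C. next week."
entities = tag_entities(sentence)
```

A real model would also find entities it has never seen before, which a lookup table cannot do.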

Lyle: Kind of an example test case that we talk about in general, you know, when a new library comes online for machine learning, that’s the kind of stuff that hits the headlines. You can give it a paragraph, and it classifies the paragraph as much as possible. Are we using that system right now at Netflix to help us parse content?

Julie: That’s all in the early stages. So this is the fun part about research, right? Is that making it available is the first step. So there’s a lot of exciting work planned for the next couple of quarters on this.

Michael: So, are you guys building a microservice of machine learning services? Have I got the right term there?

Julie: It’s really enabling data scientists to build their own microservice with a very simple interface without necessarily requiring them to, you know, “Okay, create a new repo with their project with all the configuration that’s needed to stand up a new service,” and all that kind of boilerplate. Really, it’s just enabling them to worry about just this very thin interface.

Lyle: Their expertise, or their main expertise in a small interface.

Julie: Yeah. And one cool thing is that we are highly leveraging a newly released open source project by Netflix, which is called Titus. And Titus is—you could kind of think of it as an alternative to Kubernetes, where—

Lyle: That doesn’t help me at all.

Julie: [Laughs.] So it’s—

Michael: Containers.

Lyle: Containers. Okay.

Julie: Containers, yes. Okay.

Lyle: Virtual instances containers.

Julie: Basically virtual instances containers, and it will orchestrate the containers for you. So if you need 200 of these containers all doing the same task, it will orchestrate that for you.

Lyle: So, when I work—if I was a data scientist and I wanted to come up with a natural language processing system that would classify all these different known entities that we deal with, I build my model out, and then I say, “Okay, I want this, and I need it at this kind of scale and this kind of resource,” and then in the background your tool, Titus, is taking it and spinning up instances necessary to run that code. And, as a data scientist, it all just kind of works for me?

Julie: Yes, exactly.

Lyle: Yeah, that’s pretty cool. So that’s your role? What are your roles?

Julie: So, and I should clarify that the cool thing is that machine learning infrastructure did not develop Titus. We’re actually leveraging this project that’s being developed within a different team called platform engineering.

Lyle: Yeah, and Titus of course is the thing that actually spins up all the instances to run—when you talk—when your TV talks to Netflix, it’s actually talking to a virtual instance that’s managed—that’s inside Titus?

Faisal: In some cases, in some cases, because Titus is more of a container-based view, versus the traditional EC2-based infrastructure that the majority of our microservices currently run on.

Lyle: Okay, I see. So our edge instances right now don’t run in Titus?

Faisal: Most of them don’t.

Lyle: Why is that? I know we’re going outside of our—

Faisal: Only—some of them are actually using it. And, actually, there’s a layer of UI code that probably you guys are familiar with more that—called Node Cork [phonetic 00:38:50], which is essentially executing on Titus containers. But containers is more of an up-and-coming technology, and we are trying to make sure that you can have the same level of reliability that regular EC2 virtualized instances have, but that do give us more flexibility in terms of, you know, better utilization of resources. And that’s why we’re sort of starting to go towards that. I think, you know, in two years’ time the landscape might be quite different.

Lyle: Yeah. I’m actually—I work primarily on the iOS application space.

Faisal: Ah, I see.

Lyle: And so we are the ones moving to the Node Cork sooner than everywhere else.

Faisal: I see.

Lyle: If you search on your iPhone, for example, that would be in a Titus instance.

Faisal: Yeah. And so it definitely goes through Titus, and then eventually it bounces back to the rest of the stuff. But I would like to add one more thing to what you were saying. And I think much of the stuff that we have been building, we’ve been—being able to leverage technologies that have been built, not just within our own teams, but in a bunch of other platform engineering organizations.

And another example of collaboration between our two teams is that the cool Metaflow tool that she’s talking about, that also uses a tool that my team had built. I mentioned the Meson workflow orchestration engine. So Metaflow uses Meson for orchestrating and then Titus for actual execution. And so we’re basically leveraging the power of the expertise and things that we have built in other parts of Netflix so that we can focus on the areas that nobody else is looking at.

Lyle: Right. So we’ve been streaming video for years and making people happy with the content that we either buy from other distributors and all of that—or different—other makers and everything, and now we’re doing a lot of originals. I’m assuming both of you are desperately hiring at different levels, and you can find the most current jobs at jobs.netflix.com.

But, Julie, I’m assuming your space is growing a little faster, because our content exploration space is just massive right now. We’re doing so much more. Can you talk a bit about some of the challenges you’ve got coming up in the side of—the non-member-facing side?

Julie: So one big challenge—if I were to lay out the whole landscape of the thing, and I think about where I started as a software engineer, interestingly enough, this will all tie back together. Where I started was—it was nearly 20 years ago, and at that time it was actually really difficult to be collaborative as a software engineer and to be really productive, because if you look back, we didn’t have distributed version control. We didn’t have Git. We didn’t have Agile methodologies. We didn’t have continuous integration, continuous deployment. We didn’t have the cloud. There are a lot of things we didn’t have.

Lyle: You’re describing a lot of stuff that makes my job a lot better. It’s true.

Julie: And, as an individual software engineer, I mean, it was so much harder to work together. And I look at the data science space as a whole, and I realize that where data science is today is where software engineering was 20 years ago. And so I look—like the big, big picture is, we actually have as software engineer practitioners a big opportunity to bring those best practices to data science.

And I think the larger challenge is, there’s a lot of amazing research going on and a lot of people who’ve kind of cut their teeth in academia that have great ideas to bring to the table. And how do we marry that with a product, with delivering kind of the end product? And so some of the challenges—you know, it’s not just about content. It’s also about, you know, delivering the content. It’s about studio production.

There’s all kinds of interesting problems, challenges there. It’s even about, how do we improve the quality of the streaming experience itself? So one example, a model that’s actually running on the client devices as an A/B test right now is to replace an existing rule-based method for deciding what bitrate to stream with a machine learning-based method.

Lyle: So what you’re telling me—earlier, Michael mentioned that we have lots—on these boxes we have lots of encoding versions of the file.

Julie: Right.

Lyle: Basically, we have a really high res and a lower res and a different res and a smaller res, so that, if your network connection is really bad, instead of having to wait for a long time with buffering, you get a video playback that’s not as rich and beautiful as the high quality, but still good enough to experience it.

So that bitrate delivery, we traditionally have had these rules, like somebody coded, “Well, if their network connection looks like this, let’s start up with this, and then after five minutes, let’s try to up the speed.” All those rules are kind of codified by a developer, a group of developers. And now you’re saying, instead of doing that, since we have data on how well those perform in different environments and different devices, we could actually train models with that?

Julie: Exactly. And, interestingly enough, my previous stint at Netflix, that algorithm, which is called adaptive streaming, was actually being developed. And I can tell you that it was painstakingly A/B tested over many years to get to where it is today. And the question is, can we make meaningful improvements using reinforcement learning to do the same thing, but better?
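
For a sense of what a hand-coded rule of this kind looks like, here is a minimal Python sketch of rule-based bitrate selection. It is not Netflix's adaptive streaming algorithm: the bitrate ladder, the safety factor and the buffer threshold are invented values, and a learned policy would replace the body of this function with a model's decision.

```python
# Hypothetical encoding ladder, in kilobits per second, lowest to highest.
BITRATE_LADDER_KBPS = [235, 750, 1750, 3000, 5800]


def pick_bitrate(throughput_kbps, buffer_seconds, safety=0.8, low_buffer=10.0):
    """Pick the highest bitrate that fits a conservative bandwidth budget."""
    budget = throughput_kbps * safety
    if buffer_seconds < low_buffer:
        # Little video buffered: be extra cautious to avoid rebuffering.
        budget *= 0.5
    candidates = [rate for rate in BITRATE_LADDER_KBPS if rate <= budget]
    # Fall back to the lowest rung when nothing fits the budget.
    return candidates[-1] if candidates else BITRATE_LADDER_KBPS[0]


healthy = pick_bitrate(throughput_kbps=4000, buffer_seconds=30)
starved = pick_bitrate(throughput_kbps=4000, buffer_seconds=5)
```

Every constant in a rule like this has to be tuned by hand and validated through A/B tests, which is the cost a learned approach hopes to reduce.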

Lyle: Michael, didn’t you work on an A/B test last year that dealt with video bitrate for—

Michael: Yeah, we’ve done a lot of different testing throughout our app and then I’ve done a lot personally on that. And like one big thing we have to sacrifice with the main kind of display on Netflix is that we want to both have high quality and fast start time. And you really can’t have both of those all the time, and so we have to sacrifice startup time intentionally to have a higher quality. And that’s a very hard decision to make in software.

Faisal: Or the other way around in certain cases.

Michael: Yes, exactly. And so, when we do promotional content, we want that higher quality, because if we give somebody a trailer, something new that’s about to come out that we really think is going to be good for them and it’s just completely blotchy and, you know, 256, something really, really low, people aren’t going to watch that. It just doesn’t look good. It doesn’t—nothing’s appealing about it. And so we have to play a very careful game on what is the quality versus the startup time, because if they don’t ever play it, then it was obviously I’d say equally as bad as a really low quality play.

Lyle: Seems like there’s an opportunity there for personalization in that as well, in the sense that some people might care about one value over the other, even subtly.

Julie: Yeah, bingo. So, and to tie it back to my original point, there’s so many different problems to solve. And so our problem is really making each individual data scientist more productive. But the other major one is actually allowing and enabling data scientists to collaborate, because the question is, what are the big problems that we’re not solving, because today collaboration is very hard?

Lyle: Yeah, that seems like a serious—and you guys are both [unintelligible 00:45:02], I’m assuming?

Faisal: At different levels, yes, absolutely. And I think we are actually leveraging some shared infrastructure that other teams are doing around notebooks and how—you know, notebooks is this idea of code and output of the code together with a visualization, and so it makes it easier for people to share the results of, “Hey, here’s some new cool algorithm I tried. And here are the metrics. What do you think about that?” And then somebody can easily, you know, put codes in it and give more ideas. So we’re both using notebooks. There’s also other ways and other tools that we’re using. But, yep, it’s a pretty exciting time.

Julie: There’s so much—I have so much more material I could talk about, just culture and lessons and like—

Faisal: Yeah, culture.

Julie: No, I can tell you about feedback and how I learned how feedback works at Netflix, which is I was meeting with Patty McCord, who used to be the chief talent officer.

Lyle: The Patty McCord?

Julie: The Patty McCord. And I was talking to her about a situation that I had with another individual. And I said, “Hey, you know, I’m kind of concerned about this situation.” I laid it all out for her. And then she said, “Well, so what did they say when you told them that?” And then that was it. It’s like, “Oh, I get it now. I give feedback.”

Lyle: Was it a hard thing to say to that person?

Julie: Initially, yes. Now, one advantage of being a leader is that you have these kinds of conversations all day long. You’re not allowing things to get to a point where it becomes difficult and it becomes awkward. You’re having this kind of continuous feedback all the time, and you just get better and better at having these conversations. And, yeah, the first couple of times it’s like, “Oh, my God, what do I say? And how do I say it right?” And it’s really daunting.

Faisal: And another example, which—there might be other companies that do that, but it’s really ingrained in the culture here, is you have to take responsibility of the decision that you’re making and you have to be able to speak to the person who is impacted directly.

And this applies even for when we’re hiring people and when we realize some candidate is not good enough and we have to end early. As manager, as the hiring manager, we basically go and give them direct feedback in the moment on how the interview went and tell them that, sorry, we won’t be able to hire you for this and this reasons.

But we have to do it in a way that, you know, is—you know, that is respectful of the time and effort they’ve put into it and appreciative of the interest that they’ve taken into it, and give them actual usable feedback so they can get better in their next companies. And I’ve actually gotten feedback from some people who I passed on, gave them feedback, and then later on, they said, “Actually, your feedback helped me interview at other companies, and here’s where I am.”

Lyle: Nice.

Faisal: So, you know, close that loop and—

Lyle: It’s the worst feeling to spend a day as an engineer and being hired somewhere, and at the end of it, like, “No, we’re passing.” “Can you give me feedback?” “Nope.” And the reason, of course, is the whole legal aspect of it, right? But I’m glad that we do candidate feedback for people we don’t hire.

Faisal: We give candidate feedback directly. And, I mean, it depends on who you’re asking, but most companies that I’ve interviewed with, and when I have not been successful at getting hired at some place, it’s always, “Yeah, we’ll get back to you.” And it’s this wait thing, and then you just don’t hear about it. I’d much rather, “Okay, here it is.” Yes, I’ll feel a little bit bad, but I’ll be over it. And if I can get some, you know, usable lessons for future, that will be awesome.

Lyle: If they handle that really badly, at the end you’re like, “Well, good. I don’t want to be in a place that that’s how you treat people,” you know.

Michael: Having active live feedback, you kind of can go, “Okay, at the end of this interview, I can see where I failed and I can see why they did this,” whereas if you go and hire at other places, you just get the, “No.” You’re like, “Why not?”

Lyle: That’s an excellent point.

Michael: And like, “No, we can’t talk about it.” You’re like, “Well, how do I know things are actually fair?” whereas here you kind of get the exact breakdown.

Julie: Well, and it goes both ways too. So I’ve on occasion had a follow-up phone conversation with the candidate that I’ve passed on. And it ends up being a two-way conversation. So I’m kind of like, “Hey, here’s what I’m looking for, and here’s what the interview feedback was. And, by the way, can you tell me about ways that I can improve your experience?” Because I’ve learned, you know, plenty from other candidates who’ve said, “Hey, you can improve this process.”

Michael: So, are you of reinforcement learning?

Julie: Absolutely.

Faisal: Absolutely. We’re all living this one-side matrix.

Julie: I mean, the other thing is like we—I mean, I don’t look at interviews as this transaction. I look at it as a relationship. There’s this kind of ongoing interaction. And you never know what’s going to happen two or three years down the line. Somebody that didn’t work out for this one role at this time might be fantastic for this other role at this other time.

Michael: We actually have a person here right now, he did not make it two interviews in a row. And on the third one he did make it. Now the team that hired him is extremely happy about his work there. So…

Lyle: Is this he named Michael? [Laughter.]

Michael: I don’t want to talk about my personal experiences.

Lyle: So I think you touched upon—Julie, you touched upon something that was really fundamental to the way a manager should be working. And I think this goes for any company. If you go into a situation where at some point you go from they’re doing a great job to, okay, now we have to do review, and there’s a problem we have to address.

And it’s like a step that happens, where now they’re in a process of possibly having to leave. That’s failing in management, right? The entire time you should be looking at the very small steps. You don’t want to change your modes with an employee because you’re doing a situation where you’re treating them differently in a different state. You want to always treat them the same way, with respect, communication, all that.

Julie: Right. And just to go back to Faisal’s earlier comment about, you know, from the outside it can look very cutthroat, you know, fear—culture of fear kind of thing. You know, if somebody makes bad decisions, the first question I ask is not like, “Oh, well, what did they do wrong?” The first question I ask is, “Hey, what context was missing that didn’t allow that person to make good decisions?”

And so there’s not like, “Oh, you made a mistake, and you’re out.” It’s actually a whole process where we need to examine, like what is actually going on here? And how can we improve the decision-making process as a whole, in addition to, how can that person improve themselves?

Faisal: Yeah, I mean, that’s a great point. I think that ties into this culture of how failure is celebrated. And, you know, we all make mistakes. That’s understandable. And the kind of mistakes that we don’t want people to make are judgment mistakes, right? Because that basically speaks to who you are and how you will behave. But, you know, pushing code that takes down some portion of the service, well, guess what? That’s the price of innovation and moving fast.

And we need to be able to celebrate it. Now, what we do need to do is to learn from those mistakes, and the managers need to basically set up an environment that people feel comfortable taking those risks and knowing that, “Hey, if a mistake is made, we will learn from it, try to get better at it, but there won’t be blame attribution and there won’t be, you know, negative outcomes coming out of that.” Obviously, we all have to get better and so that we reduce the rate at which we make mistakes.

But there is just a fundamental, you know, error rate, so to speak, that you have to be okay with when you have to move fast. And that applies not just in the way you develop your software, but also how you interact. And from a cultural standpoint it’s rather important to have this environment of psychological safety that’s so much been written about, that where it’s okay for people to, you know, say something that is occasionally not in the best team dynamics.

But they should be open to receiving feedback. And, as Julie was mentioning earlier, there’s this constant culture of giving feedbacks and receiving feedbacks. And, you know, every quarter or every month I would ask in my one-on-ones with my team members, you know, “What else can I be doing differently to make your job easier and better?”

And, you know, we have to have that mentality of serving the people who are basically building the stuff. You know, we’re really getting the right people who are experts at doing it. And so it’s our job to make sure that they can be as effective as possible. And that requires constantly listening in, giving feedback, receiving feedback, and fixing our behaviors. That’s fundamental to how stuff gets done at Netflix.

Julie: And I would also add that feedback is not something that the manager or the leader does. It’s actually something that everybody does. And we all have not only the freedom to give feedback, but we also have the responsibility to give feedback when we have it. One thing that actually happened when I was here the first time was, I was a manager.

And then I was in a series of meetings with my manager and kind of realized, hey, I’ve got this handled. And I just gave my manager direct feedback and said, “Hey, you don’t need to be in these meetings.” And he said, “Oh, great. That’s wonderful.” And he stopped going to the meetings. So, you know, feedback goes both ways.

Lyle: And everything worked.

Julie: Yeah, and it worked. It went fine.

Lyle: Well, Julie Pitt and Faisal Siddiqi, thank you so much for joining us on We Are Netflix podcast. And thank you for being here making things better for our data scientists and our platform and teams.

Faisal: Thank you so much. That’s been great talking to both of you.

Julie: Yeah, this has been a delight.

[Music]