Thursday, May 25, 2017

Walking Down Memory Lane

In October of 2007, I was working at Netflix managing the testing team for the website. I presented a paper on the challenges of scaling up a software delivery effort as a business grows at the Pacific Northwest Software Quality Conference. Someone I met at the conference and who I connected with on LinkedIn reached out to me and prompted me to look back at the paper.

On re-reading it, I was reminded of many things I learned there and have since forgotten. Just in case you, dear reader, might also be interested, I am reproducing that paper here. Something I found amusing was how I managed to find citations I could use in footnotes to make it look more like research rather than the retrospective that it clearly is.

Scaling Quality


When we develop software to support our business’ customers, we hope to make our products successful. We want to build software that will: be used, make users’ lives easier, do what it is supposed to do, and perform acceptably. If we are fortunate enough to be successful, the user base grows, and we have to learn how to scale the systems to support the increasing load. Everyone knows the three key subjects to study to deliver scalable software: design, coding and testing. However, there’s a fourth scalability subject that’s often overlooked. That is scaling the development and quality teams themselves.

With more cooks in the kitchen, the development environment needs to evolve. Communication within the organization has to move past the ‘prairie-dogs popping over the cubicle wall’ style. The code base has to be organized so that functional teams can build, test and deploy independently. Most importantly, the new process needs to stay as nimble as the process that fostered your current success. The pace of business innovation and deployment of systems to support it cannot be bogged down by process and bureaucracy.

Software tools that deliver operational insight and continuous testing aid and inform how the product is developed and delivered at all levels: product management, design, server engineering, production engineering, operations, and of course, quality assurance. This paper examines how Netflix has navigated between the Scylla of growing system demands and the Charybdis of chaos at the hands of ever-larger teams.


Some organizations become victims of their own success. Instincts that serve a startup team well can cause problems in a larger team if careful attention isn’t paid to how things are done as an organization grows. Deployability will not scale where the development engineer is also the QA, configuration, release and operations engineer. As the systems increase and grow and more people are working together on them, it becomes less likely that the organization can deploy code that “worked on my box.”

The Environment


The corporate culture at Netflix is all about High Performance. As the company has grown, the management team’s most important task has been to find and nurture star performers. As the complexity and volume of work increase, there are two ways to grow to meet the challenge of getting more done.

One is to add reasonably competent junior people (who are plentiful) and layers of management to guide them. As the number of people and the layers of management grow, the complexity, interconnections and overhead of coordination grow even faster. This leads to chaos. The usual response to the pain resulting from chaos is more rules and more process. More process requires more people to do the “real” work alongside an ever increasing volume of process work. This leads to diminishing returns for each addition to the team.

The alternative is to find and hire senior engineers (who are scarce) who can handle increasing complexity without the need for close supervision or rigid process. This allows the organization to scale more efficiently. The time and effort needed to hire is significantly greater when hiring only people that raise the performance bar. This is true because you’re recruiting out of a smaller pool to start with. It’s also true that as you raise the bar with each successive hire, you’re further shrinking the pool for the next hire. However, it’s the more effective way to meet the challenge of increasing complexity because you minimize the additional coordination cost and maximize the additional work done per person added to the team.

The consequences of this approach are openly communicated to all job candidates from the very beginning of their job search at Netflix. Reason number six on the company website’s “8 great reasons to work at Netflix” page states:
“At Netflix, adequate performance gets a generous severance package.”1
Here, the expectation is set immediately that there is no safe haven, that there is no cushy corner you can expect to hide in that you’ll be able to retire from. In a High Performance culture, everyone needs to be contributing to the bottom line all of the time. Anyone not pushing the envelope of their discipline can expect to be replaced by someone that will.

Pace of Innovation

Rapid change is in the DNA of Netflix. We use agile processes, not necessarily any particular brand of Agile Process. It’s a core competitive advantage to be able to deploy new features and functions every two weeks. This presents a significant challenge to both the work and mentality of the quality assurance team. In some ways, the most important skill of a QA engineer isn’t her ability to test, but her ability to decide what to test. We focus our limited testing resources on those changes that present the largest risk to the business. This means that we rely on our ability to: quickly detect bugs in production, find the root cause, correct the problem and deploy the fix to handle those defects not detected before deployment. A great luxury of being a web based application is that deploying new code is transparent to users and immediately fixes the problem for the entire user base. Contrast this with the world of shrink-wrap software, where a new release means shoving around a big pile of atoms (CDs, boxes, documentation, etc.) and not just bits. Even in the case of electronically delivered patches, you have to solve a number of problems with delivery, authentication and authorization that are free when your core business revolves around a website.

Nimble Processes

If you run with scissors and you trip, you have to be good at recovering. The ability to change the application so often is a double-edged sword. Repeated success with short release cycles trains business owners to rely on vision becoming working code in record time. Keeping serious defects at bay can become a challenge when the flow of requirements becomes too mercurial. However, this nimbleness is a great competitive advantage because you can quickly react to changes in the market. More importantly, you can also quickly iterate on proposed improvements to the service. With these costs and benefits in mind, experimentation has become the core value driving how Netflix continuously improves the website and supporting systems.

We test not only the quality of the software, but the quality of the members’ experience. In the process of creating any given feature, we will develop variations on how it works, and test those variations on prospective, new and existing members before rolling them out as the standard experience. We instrument code to collect metrics around whatever member behavior that feature is supposed to drive, and we pick the version with the biggest lift in the desired behavior. The ability to assess all of the variations during the QA cycle becomes a great challenge. To that end, developers build in URL-parameter based hooks and other shortcuts that a software tester can use to force the presentation of a given customer experience. Manual and automated software tests can reproducibly verify specific versions of a member experience without lengthy set up and tear down operations in the database or a long series of UI interactions.
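The URL-parameter hooks described above might look something like the following minimal sketch. The experiment name, the cell names, the `force_<experiment>` parameter and the modulo-based allocation are all assumptions made up for illustration; the paper does not describe the actual hook format.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical experiment and test cells; names are illustrative only.
KNOWN_CELLS = {"signup_flow": {"control", "short_form", "long_form"}}


def assigned_cell(customer_id: int, experiment: str) -> str:
    """Stand-in for normal allocation: a deterministic pick based on customer id."""
    cells = sorted(KNOWN_CELLS[experiment])
    return cells[customer_id % len(cells)]


def resolve_cell(url: str, customer_id: int, experiment: str) -> str:
    """Return the cell forced by the URL if it is valid, else the normal allocation."""
    params = parse_qs(urlparse(url).query)
    forced = params.get(f"force_{experiment}", [None])[0]
    if forced in KNOWN_CELLS[experiment]:
        return forced
    return assigned_cell(customer_id, experiment)
```

A test can then request a page with `?force_signup_flow=short_form` and land in that experience directly, with no database set up or UI clicks, while requests without the parameter fall through to the ordinary allocation.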

By being good at creating and verifying partitioned customer experiences (A/B tests), we can:
  • understand what is important to the users
  • understand how well we’ve addressed those needs
  • iterate quickly, homing in on the most relevant and rewarding solutions
In order to support this degree of experimentation across a growing number of functional silos, it’s important to have tools that let us walk through the exponentially growing number of variations of: browser, feature, test cell and customer state. To that end, we’re using the Selenium framework to test different presentations of customer experiences undergoing A/B testing.2 This allows us to programmatically walk through the same basic flows while validating that all customers will have the correct experience.
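To give a feel for how quickly those variations multiply, here is a small sketch that enumerates every combination of the four dimensions. The specific browsers, features, cells and customer states are placeholders, not the real test matrix.

```python
from itertools import product

# Placeholder values for each dimension; the real lists are not given in the paper.
BROWSERS = ["firefox", "ie7", "safari"]
FEATURES = ["queue", "ratings"]
TEST_CELLS = ["control", "variant_a", "variant_b"]
CUSTOMER_STATES = ["prospect", "new_member", "existing_member"]


def variations():
    """Yield one dict per combination that a shared test flow would need to cover."""
    for browser, feature, cell, state in product(
        BROWSERS, FEATURES, TEST_CELLS, CUSTOMER_STATES
    ):
        yield {
            "browser": browser,
            "feature": feature,
            "test_cell": cell,
            "customer_state": state,
        }


matrix = list(variations())
print(len(matrix), "combinations for one shared flow")
```

Adding one more value to any dimension multiplies the total, which is why a single parameterized script is more practical than a hand-written test per combination.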

Selenium is a powerful framework for testing browser based applications. Because it is written in JavaScript and runs in the actual target browser, test scripts have access to all of the components that make up the rendered page. The inherent power of this model can be leveraged by judicious design and development of the code that creates the user interface. By providing a clear and consistent framework for rendering the site, it becomes simpler to discover and verify that the right content is being displayed in the right places on the page.

By providing a ‘real end-user browser experience’ to test against, and by providing tools as part of the website itself to drive a test into a specific user experience, one set of scripts can verify that common functionality doesn’t break, even across many variations of the user experience.

Scaling Systems


In the beginning, everything lived in the web application, all of the classes were available to each other, and that was good enough. Then, the needs of the application grew. There were more features, customers and movies. A monolithic application could no longer scale. This was not good. The testability of the application was impaired because in order to test anything, you had to spin up the whole web server and all of its supporting components.

As the computational needs of different parts of the application have become more intense, the carrying capacity of an individual web server in terms of the number of simultaneous customers has shrunk. Upgrading the web server hardware has helped expand capacity in counterpoint to this trend. Migrating to a 64-bit OS and JVM helped to crack the 1.6GB effective maximum heap size imposed by the 2GB process space limit of a 32-bit OS. However, as Parkinson’s Law3 has been applied, the base heap consumption of a web server immediately following startup has risen from on the order of 400MB to on the order of 2GB. A ‘fully loaded’ 32-bit server carrying 1,600 customer sessions consumed approximately 1.2GB of the 1.6GB available. When the application first migrated from 32-bit to 64-bit, the 8GB heap was able to comfortably support up to 6,400 customer sessions. This has dwindled over the last year down to 2,400 customer sessions per server as the per-customer data used to customize the site presentation and the computational cost of that customization have grown.
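As a rough illustration of that trend, the sketch below back-computes heap per session from the figures above. It assumes 1GB = 1024MB, that the base heap can simply be subtracted from the total, and that the 8GB heap is fully consumed at 2,400 sessions. None of these assumptions come from measurements, and the function name is invented for this example.

```python
def per_session_mb(heap_used_mb: float, base_heap_mb: float, sessions: int) -> float:
    """Heap left after the startup baseline, divided evenly across sessions."""
    return (heap_used_mb - base_heap_mb) / sessions


# 32-bit era: about 1.2GB used at 1,600 sessions, with a base heap around 400MB.
old_mb = per_session_mb(1.2 * 1024, 400, 1600)

# 64-bit era today: 8GB heap (assumed full) at 2,400 sessions, base heap around 2GB.
new_mb = per_session_mb(8 * 1024, 2 * 1024, 2400)

print(f"32-bit: about {old_mb:.2f} MB per session")
print(f"64-bit: about {new_mb:.2f} MB per session")
print(f"growth: roughly {new_mb / old_mb:.1f}x")
```

Under those assumptions the per-session footprint has grown several-fold, which is consistent with the drop in sessions per server even after the heap itself became five times larger.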

A major effort is underway to abstract business logic out of the core web application into dedicated middle-tier servers. This is in order to meet the challenges of greater data volumes, more personalization and more customers. This will enable improved scalability of the web application, and improved testability of the business logic that drives the web application. It will improve the scalability of the application because individual components serving individual functions can be horizontally scaled. Testability of the business logic is improved because the presentation layer will reside in the front-end and all business logic must be accessed through a well defined API.

Operational Insight

When many developers are modifying the same code base, you cannot foresee and test all of the possible interactions. This is why it is important to build the application on a foundation of insight into the runtime operating environment. For example, developers have built a ‘tracer’ class that can be used to instrument method calls. The data gathered and summarized in hourly buckets includes:
  • minimum, maximum and mean execution time
  • the number of executions
  • a histogram for the number of executions taking <10 msec, 10-50 msec, 50-100 msec, and on up to > 10 seconds.
Tools like these can be used both during the QA cycle and during the triage of production problems. The ability to see what’s going on inside the system at production runtime is invaluable in recovering quickly and effectively from performance issues. It is also invaluable in understanding the change in the performance profile of an application over time.
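A tracer of the kind described above could be sketched as follows. This is not the actual Netflix class: the names `Tracer`, `Stats` and `trace` are invented for illustration, and the histogram edges between 100 msec and 10 seconds are assumed, since the paper only names the first few buckets and the last one.

```python
import time
from bisect import bisect_right
from collections import defaultdict
from functools import wraps

# Upper edges of histogram buckets in msec. Only <10, 10-50, 50-100 and >10 seconds
# are named in the text; the edges in between are assumptions.
BUCKET_EDGES_MS = [10, 50, 100, 500, 1000, 5000, 10000]


class Stats:
    """Running summary for one method within one hourly bucket."""

    def __init__(self):
        self.count = 0
        self.total_ms = 0.0
        self.min_ms = float("inf")
        self.max_ms = 0.0
        self.histogram = [0] * (len(BUCKET_EDGES_MS) + 1)

    def record(self, elapsed_ms):
        self.count += 1
        self.total_ms += elapsed_ms
        self.min_ms = min(self.min_ms, elapsed_ms)
        self.max_ms = max(self.max_ms, elapsed_ms)
        # bisect_right puts 10 msec in the 10-50 bucket and >10000 in the last one.
        self.histogram[bisect_right(BUCKET_EDGES_MS, elapsed_ms)] += 1

    @property
    def mean_ms(self):
        return self.total_ms / self.count if self.count else 0.0


class Tracer:
    """Collects per-method timings, summarized by (method name, hour index)."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.buckets = defaultdict(Stats)

    def record(self, name, elapsed_ms, when=None):
        when = self.clock() if when is None else when
        self.buckets[(name, int(when // 3600))].record(elapsed_ms)

    def trace(self, func):
        """Decorator that times each call and records it, even if the call raises."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                self.record(func.__qualname__, elapsed_ms)
        return wrapper
```

Decorating a method with `@tracer.trace` is enough to start collecting its timings, and the same summaries can be read during a QA cycle or while triaging a production problem.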

Scaling Teams

Team Communications

Four years ago, the web quality assurance team was four QA engineers and a manager. This was the only QA team in the company at the time. Back end systems developers did their own testing. The web QA manager was also in charge of the engineers who built the user interface of the website. There were three functional silos and everyone working on the website was on the same wing of the same floor. There was even room for the data warehouse team in the same space. The QA engineers and production engineers were sitting in the middle of the developers for their silo, along with the product managers for that silo. If there were questions or clarifications, they were to be had by sticking your head up over the cubicle wall and asking.

Fast forward four years, and what was Web Engineering is now Product Development. This team now covers two floors of a much larger building, split into twelve different functional silos. The three QA teams have three managers, fourteen QA engineers (with open requisitions) and a configuration and release engineer. A separate QA team for the back end systems has also grown from non-existence to eight members. Six product managers and six designers provide work for the functional silos, but now they’re all sitting clustered together instead of living with their teams. The QA engineers and eight production engineers are still sitting among the developers they support, but there are many more developers being supported. The number of interested parties has grown, but the number of interconnections has grown faster. As the QA engineers have become more specialized, the need to share expertise with peers has become more critical.

Component Organization

With a monolithic web application and many development teams, coordination becomes a major source of overhead. Push schedules across the different functional silos must be coordinated because tightly coupled classes and business logic spread across the web application and the database require that we deploy the whole world at once. To allow functional silos to push when they’re ready instead of all pushing at the same time, we must break down the monolith into independent services running in separate processes on separate machines. This enforces encapsulation, reduces coupling and increases the cohesion of each service. This allows for more focused triage in the event of a system issue. If something melts down, only the team responsible for the problem area has to stop new development work to figure it out.

QA Team Organization

Four years ago, every QA engineer on the team did everything:
  • tested features
  • built and shared what test automation and tools we had
  • shared responsibility for builds
  • shared responsibility for deployments
  • pitched in on production monitoring tools and triage of site emergencies

An experiment in the fall of 2004 spun out the tool building and test automation function into a dedicated role. That experiment met with some limited successes, and still shows promise for future investment. We have gained some traction in having:
  • a consistently running nightly regression
  • a greater awareness among developers of the need for unit tests
  • a greater focus on capturing QA’s expertise in automated tests instead of tribal knowledge

In the fall of 2005, another experiment dedicated one person to configuration and release duties. That experiment has been a complete success. The tool of yesterday was a Word document describing all of the manual steps needed to build, deploy and restart a web server from the command line of each server in the farm. The tool of today allows for selection of a particular build from a particular branch with push-button build, deploy and restart, on multiple QA or production servers simultaneously, all triggered and monitored from a simple and intuitive web interface.

Now, every QA engineer can spend their time focused on the tasks that maximize the mitigation of business risk:
  • testing features
  • building automated tests
  • pitching in on production monitoring tools and triage of site emergencies
It’s time for another experiment. The team has grown from three full time engineers and a hands-on manager to ten engineers and a full-time manager. The single manager has become a choke-point for all of the relevant information that needs to flow from the development teams through QA to the operational teams. To address this, splitting the team is proposed.

Instead of a single QA organization concerned with the health of the entire website, two teams are formed, each focused on one aspect of the whole. The first is concerned with “Customers” and the second is concerned with “Movies”. The Customer team has responsibility for all parts of the system that are about the subscribers: signing them up, maintaining their account information and queues, Customer Service tools and managing who they associate with via the Friends feature. The Movie team has responsibility for all parts of the system that are about the movies: merchandising, metadata (titles, actors, inventory levels, etc.), ratings, recommendations and instant viewing.

Within these teams there is still the issue of solving the business problem of more engineers producing more code deployed as more components. To address this, a virtual split of the two QA teams is proposed. This is a further specialization within each sub-team, but not specialization by functional area. This split will focus the orientation of the QA engineers into two areas: functional QA and operational QA.

These are not job descriptions for which specific engineers are hired to do each type of testing. The following table describes more the nature of a tester and how they approach their job within the context of a particular functional silo.

Testing Perspective
  • Functional QA: Inward looking (back towards the developers and product managers)
  • Operational QA: Outward looking (forward towards the NOC and the production systems)
Testing Focus
  • Functional QA: Reviews changes at the feature and functional level
  • Operational QA: Reviews changes at the system level and integrated systems level
Guardian of...
  • Functional QA: ... the member experience
  • Operational QA: ... business continuity
Strongest Testing Style
  • Functional QA: Black box, but not necessarily all black box testing
  • Operational QA: White box, but also production monitoring and production operational tools
Answers the question...
  • Functional QA: “Does it do what we said we want it to do?”
  • Operational QA: “Will this behave how we need it to when we roll it out?”
Bonus Points
  • Both: Build tools to help figure out how to fix the system when we find that we answered these questions incorrectly.

The expected benefits are:
  1. Each sub-team improves their load sharing among the team members by raising awareness of these two sets of attributes within the role of each QA engineer. Each team member identifies their strengths and weaknesses relative to these attributes and seeks other team members to shore up their weak points.
  2. Each sub-team is able to better identify risks to the business from each major functional area because of the many eyes on the narrowed but shared focus (Customer vs. Movies).
  3. Each sub-team is able to flexibly distribute testing load over a given push cycle without negotiations at the level of the functional silo engineering managers having to take place.
The intent is not to remove the barriers between the functional silos, but to lower them. If you compare this to college, every QA engineer should have a Major which corresponds to her functional silo, and a Minor within the Customer or Movie team that she can pitch in on when there is more business risk in someone else’s area. This proposal lets the QA engineers more flexibly address areas where the most business risk exists within a smaller, more focused team instead of trying to allocate resources and cross train for the whole site.


As a software product grows, development organizations face challenges that grow as fast or faster. Blindly throwing more people at these challenges can result in a tangled mess of toes being trod upon and defects slipping into production. Throwing intelligent people that know how to get things done at these challenges can delay the day of reckoning. This solution relies on personal heroics and sheer talent. However, the scalable solution is to use these talented people to decompose the problem in two dimensions. First, simplify the application by breaking it down into smaller units that can be tested in an automated fashion and can independently move forward as the part of the business it serves needs it to. Second, simplify the team by breaking down into smaller units focused on individual components instead of the whole system.

A lot can be learned from movies. For example, there are lessons to be found in The Wizard of Oz that can be applied to how to scale a software quality team. Dorothy is a perfect example of what can be done with a small team. She starts out all alone, lost and bewildered in a land that someone else has built. She does what any good QA engineer would do: she starts asking intelligent questions. Along the way, she builds a team that has all the right skills to solve a variety of problems, where the strengths and weaknesses of the team members all complement each other. Eventually, the team gets to where she thinks it needs to be. They’ve reached the oracle that knows all the answers and can fix all their problems. However, she then learns that her journey has just begun. This all powerful wizard she’s been seeking out is merely a clever man who built an automated system so as to appear oracular. Dorothy wakes from her dream just as the real fun in Oz starts, where real problems get solved in a repeatable manner by attacking the most important ones in an intelligent fashion.

Netflix. 2007. “8 Great Reasons to Work at Netflix.” Retrieved June 25, 2007 (
Wikipedia. 2007. “Parkinson’s Law”. Retrieved June 26, 2007 ('s_law)

Saturday, December 17, 2016

A Letter To President Obama

I read Joe Klein's "Amazing Grace" article in the Dec 19 edition of Time, and was inspired to send the following letter to our outgoing President today:

Dear President Obama,

You may recall that you spoke at OPOWER's Arlington headquarters in March of 2010. Just a few months later, in June of 2010, I was hired to help build the engineering team for OPOWER's San Francisco office. That was the first time I missed you.

In the two weeks leading up to June 24 of 2010, I was ramping up on my new job at the Arlington office. As I was leaving the next day, my boss took me and another co-worker out to lunch on that Thursday. He offered me the choice of where to go. Two of his suggestions were Guajillo Mexican food or Ray's Hell Burger. I thought Mexican sounded good, so we walked there. A short while later, black SUVs showed up and a Secret Service agent posted himself inside the door of Guajillo to make sure nobody left while you and Dmitry Medvedev enjoyed a burger and fries next door at Ray's. This was the second time I missed you.

Now we come to the end of your term, and I can't help but already be missing you for a third time. I will most miss your clarity of vision, our steady progress towards broader public good and the compassion and leadership you demonstrated and modelled for our whole nation. Whether in the face of good, bad or sometimes ugly events, I always felt reassured by your hand at the wheel.

I have grave fears that the same legislative engine that so vigorously opposed all the good that you've accomplished in spite of them will now spitefully dismantle it all in concert with the President Elect. These fears have caused me to question the 22nd Amendment. However, you have inspired in me great hope that what has been done once can be done again. That anything undone can be re-done with sufficient clarity of vision, steady progress, compassion and leadership.

Respectfully and Thankfully,

Rob Fagen

Wednesday, March 23, 2016

Getting Screwed (a little) By Starbucks

On April 12, 2016, Starbucks is adjusting their customer loyalty program. Instead of one Star per purchase, with a free item every 12 stars, you will now earn 2 Stars per dollar you spend, and need to collect 125 Stars for a free item. Additionally, at the transition any Stars you have on April 12 will be multiplied by 11.

I did a little analysis of whether the transition and ongoing purchases are more or less rewarding under the new regime. It is (unsurprisingly) sensitive to what your average purchase is. Here's the numbers:

Not to spoil the surprise, but if your average purchase is less than $5.21, you will not be receiving as many rewards as you used to. For example, someone who spends $3.50 per day will now have to spend $62.50 for a freebie instead of $42. They are getting a 5.6% return on their purchases instead of the 8.3% they used to. That's a pretty big haircut.
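For anyone who wants to check the arithmetic, here is a small sketch of the calculation. It assumes the free item is worth the same as your average purchase and ignores the one-time conversion of existing Stars; the function names are just for this example.

```python
OLD_PURCHASES_PER_REWARD = 12   # old plan: one Star per purchase, free item at 12
NEW_STARS_PER_DOLLAR = 2        # new plan: 2 Stars per dollar spent
NEW_STARS_PER_REWARD = 125      # new plan: free item at 125 Stars


def old_spend_per_reward(avg_purchase: float) -> float:
    """Dollars spent to earn one free item under the old plan."""
    return OLD_PURCHASES_PER_REWARD * avg_purchase


def new_spend_per_reward() -> float:
    """Dollars spent to earn one free item under the new plan."""
    return NEW_STARS_PER_REWARD / NEW_STARS_PER_DOLLAR


def break_even_purchase() -> float:
    """Average purchase at which both plans cost the same per free item."""
    return new_spend_per_reward() / OLD_PURCHASES_PER_REWARD


def reward_rate(avg_purchase: float, spend_per_reward: float) -> float:
    """Value of the free item as a fraction of the spend needed to earn it."""
    return avg_purchase / spend_per_reward


avg = 3.50
print(f"old plan: ${old_spend_per_reward(avg):.2f} per freebie, "
      f"{reward_rate(avg, old_spend_per_reward(avg)):.1%} return")
print(f"new plan: ${new_spend_per_reward():.2f} per freebie, "
      f"{reward_rate(avg, new_spend_per_reward()):.1%} return")
print(f"break-even average purchase: ${break_even_purchase():.4f}")
```

The break-even comes out just above $5.20, so anyone whose average purchase is below roughly $5.21 earns free items more slowly under the new plan.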

Then again, you're spending $3.50 for something with a make-at-home cost of less than $0.50 plus the amortization of whatever equipment you use. If you buy a $400 espresso maker and it lasts for four years of daily-weekday coffee (250/year), that's an extra $0.40, for less than $1.00 total.

Actually, if you're going to go that route, the new Starbucks rewards plan is a great change for you. You'll now get a free bag of beans after every five bags instead of every 12.

Thursday, March 17, 2016

Fight back against net anti-neutrality

I just sent this message to the FCC via -- I would recommend you do as well.
Dear FCC, 
Please don't let self-interested actors gut your brave and excellent work in preventing them from violating one of the internet's most fundamental operating principles: net neutrality. As a collaborative form of communication, the internet has thrived when all connected networks have been fair, balanced and unbiased towards the traffic passing over them. 
Previously, the attempts to capture the internet focused on charging content providers for unimpeded data transmission to the customers of an ISP. When that consumer requests content from a paying content provider, service is fast and efficient. When that consumer requests content from a non-paying content provider, service is throttled. This was stopped by last year's ruling. 
Now, networks are implementing plans where total data for a customer is capped, followed by a reduction in level of service when the cap is exceeded. They are exempting their own content or the content of a paying content provider from affecting those caps. The equation is the same whether the network constraint is on the service provided or the source of the data. The outcome is identical if the behavior at the consuming endpoint is impaired when a non-exempt data source serves data. The outcome is against the Title II rules. 
Please act to enforce the rules and keep the internet open and thriving.

Tuesday, May 26, 2015

A Mathematical Riddle I Thought Of At Random

A bit more than ten-thousand, or just seventeen.
What is the number that you think I mean?

Wednesday, March 12, 2014

And so it begins...

Welcome to the harbinger of the next financial crisis.

King Digital Entertainment, maker of Candy Crush (and a bunch of other stuff that hasn't been nearly as successful) is going public with a multi-billion dollar valuation and 600+ employees.

Granted, they are profitable, so the half billion dollars they raise will give them infinite runway. However, I have but three words: this. seems. crazy.

Bless them, and I hope they're successful, but my gut tells me we've reached the peak. I think I'm heading towards a lot more cash.

Tuesday, November 05, 2013

Brilliant or sociopathic: you decide

Got an email from someone at work, and it included a number of attachments. Turns out there were no real attachments, just all of the images that this person includes in their email signature. I turn to my neighbor and ask "Am I a crotchety old fart if I don't like seeing all these images pasted into email signatures? Doesn't anyone read RFCs any more?". He did tell me I was a crotchety old fart, but he did agree it was kind of ridiculous the number and size of images in signatures these days.

Then, inspiration struck. Stick a Google display ad in your signature. Stick an Amazon affiliate block in your email signature. Clickthru rates will stink, but think of the pageviews!