On re-reading it, I was reminded of many things I learned there and have since forgotten. Just in case you, dear reader, might also be interested, I am reproducing that paper here. Something I found amusing was how I managed to find citations I could use in footnotes to make it look more like research than the retrospective it clearly is.
Scaling Quality
Abstract
When we develop software to support our business’s customers, we hope to make our products successful. We want to build software that will: be used, make users’ lives easier, do what it is supposed to do, and perform acceptably. If we are fortunate enough to be successful, the user base grows, and we have to learn how to scale the systems to support the increasing load. Everyone knows the three key subjects to study to deliver scalable software: design, coding and testing. However, there’s a fourth scalability subject that’s often overlooked. That is scaling the development and quality teams themselves.
With more cooks in the kitchen, the development environment needs to evolve. Communication within the organization has to move past the ‘prairie-dogs popping over the cubicle wall’ style. The code base has to be organized so that functional teams can build, test and deploy independently. Most importantly, the new process needs to stay as nimble as the process that fostered your current success. The pace of business innovation and deployment of systems to support it cannot be bogged down by process and bureaucracy.
Software tools that deliver operational insight and continuous testing aid and inform how the product is developed and delivered at all levels: product management, design, server engineering, production engineering, operations, and of course, quality assurance. This paper examines how Netflix has navigated between the Scylla of growing system demands and the Charybdis of chaos at the hands of ever-larger teams.
Introduction
Some organizations become victims of their own success. Instincts that serve a startup team well can cause problems in a larger team if careful attention isn’t paid to how things are done as an organization grows. Deployability will not scale where the development engineer is also the QA, configuration, release and operations engineer. As the systems increase and grow and more people are working together on them, it becomes less likely that the organization can deploy code that “worked on my box.”
The Environment
People
The corporate culture at Netflix is all about High Performance. As the company has grown, the management team’s most important task has been to find and nurture star performers. As the complexity and volume of work increases, there are two ways to grow to meet the challenge of getting more done.
One is to add reasonably competent junior people (who are plentiful) and layers of management to guide them. As the number of people and the layers of management grow, the complexity, interconnections and overhead of coordination grow even faster. This leads to chaos. The usual response to the pain resulting from chaos is more rules and more process. More process requires more people to do the “real” work alongside an ever increasing volume of process work. This leads to diminishing returns for each addition to the team.
The alternative is to find and hire senior engineers (who are scarce) who can handle increasing complexity without the need for close supervision or rigid process. This allows the organization to scale more efficiently. The time and effort needed to hire is significantly greater when hiring only people that raise the performance bar. This is true because you’re recruiting out of a smaller pool to start with. It’s also true that as you raise the bar with each successive hire, you’re further shrinking the pool for the next hire. However, it’s the more effective way to meet the challenge of increasing complexity because you minimize the additional coordination cost and maximize the additional work done per person added to the team.
The consequences of this approach are openly communicated to all job candidates from the very beginning of their job search at Netflix. Reason number six on the company website’s “8 great reasons to work at Netflix” page states:
“At Netflix, adequate performance gets a generous severance package.” [1]
Here, the expectation is set immediately that there is no safe haven, that there is no cushy corner you can expect to hide in that you’ll be able to retire from. In a High Performance culture, everyone needs to be contributing to the bottom line all of the time. Anyone not pushing the envelope of their discipline can expect to be replaced by someone who will.
Pace of Innovation
Rapid change is in the DNA of Netflix. We use agile processes, not necessarily any particular brand of Agile Process. It’s a core competitive advantage to be able to deploy new features and functions every two weeks. This presents a significant challenge to both the work and mentality of the quality assurance team. In some ways, the most important skill of a QA engineer isn’t her ability to test, but her ability to decide what to test. We focus our limited testing resources on those changes that present the largest risk to the business. This means that we rely on our ability to: quickly detect bugs in production, find the root cause, correct the problem and deploy the fix to handle those defects not detected before deployment. A great luxury of being a web-based application is that deploying new code is transparent to users and immediately fixes the problem for the entire user base. Contrast this with the world of shrink-wrap software, where a new release means shoving around a big pile of atoms (CDs, boxes, documentation, etc.) and not just bits. Even in the case of electronically delivered patches, you have to solve a number of problems with delivery, authentication and authorization that are free when your core business revolves around a website.
Nimble Processes
If you run with scissors and you trip, you have to be good at recovering. The ability to change the application so often is a double-edged sword. Repeated success with short release cycles trains business owners to rely on vision becoming working code in record time. Keeping serious defects at bay can become a challenge when the flow of requirements becomes too mercurial. However, this nimbleness is a great competitive advantage because you can quickly react to changes in the market. More importantly, you can also quickly iterate on proposed improvements to the service. With these costs and benefits in mind, experimentation has become the core value driving how Netflix continuously improves the website and supporting systems.
We test not only the quality of the software, but the quality of the members’ experience. In the process of creating any given feature, we will develop variations on how it works, and test those variations on prospective, new and existing members before rolling them out as the standard experience. We instrument code to collect metrics around whatever member behavior that feature is supposed to drive, and we pick the version with the biggest lift in the desired behavior. The ability to assess all of the variations during the QA cycle becomes a great challenge. To that end, developers build in URL-parameter based hooks and other shortcuts that a software tester can use to force the presentation of a given customer experience. Manual and automated software tests can reproducibly verify specific versions of a member experience without lengthy set up and tear down operations in the database or a long series of UI interactions.
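As a rough illustration of the URL-parameter hooks described above, here is a minimal Python sketch. The paper does not name the real parameters or describe how customers are assigned to test cells, so the `force_cell` parameter, the hash-based assignment and all function names are assumptions made for the example.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name; the paper does not say what the real hooks are called.
FORCE_PARAM = "force_cell"


def assign_cell(customer_id, test_name, cells):
    """Deterministically bucket a customer into one cell of a test (assumed scheme)."""
    digest = hashlib.sha256(f"{test_name}:{customer_id}".encode()).hexdigest()
    return cells[int(digest, 16) % len(cells)]


def resolve_cell(url, customer_id, test_name, cells):
    """Return the forced cell if the URL carries a valid override, else the normal assignment."""
    query = dict(parse_qsl(urlsplit(url).query))
    forced = query.get(FORCE_PARAM)
    if forced in cells:
        return forced
    return assign_cell(customer_id, test_name, cells)


def force_cell_url(url, cell):
    """Build the URL a tester would request to land in a specific cell."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query[FORCE_PARAM] = cell
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    cells = ["control", "variant_a", "variant_b"]
    url = force_cell_url("http://example.com/queue?page=2", "variant_b")
    print(url)
    print(resolve_cell(url, "customer-42", "queue_layout", cells))
```

With a hook like this, a manual or automated test can request the forced URL directly instead of preparing database state to land a test account in the right cell.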
By being good at creating and verifying partitioned customer experiences (A/B tests), we can:
- understand what is important to the users
- understand how well we’ve addressed those needs
- iterate quickly, homing in on the most relevant and rewarding solutions
In order to support this degree of experimentation across a growing number of functional silos, it’s important to have tools that let us walk through the exponentially growing number of variations of: browser, feature, test cell and customer state. To that end, we’re using the Selenium framework to test different presentations of customer experiences undergoing A/B testing. [2] This allows us to programmatically walk through the same basic flows while validating that all customers will have the correct experience.
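To make the size of that variation space concrete, the sketch below enumerates every combination of the four dimensions and runs one shared check against each. The dimension values are invented for illustration, and the `check` callable stands in for whatever browser-driving script (Selenium or otherwise) would walk the flow; none of these names come from the paper.

```python
from itertools import product

# Illustrative values only; the paper does not list the real browsers, features, cells or states.
BROWSERS = ["firefox", "ie"]
FEATURES = ["signup", "queue"]
TEST_CELLS = ["control", "cell_1", "cell_2"]
CUSTOMER_STATES = ["prospect", "new", "existing"]


def variations():
    """Yield one dict per combination of browser, feature, test cell and customer state."""
    for browser, feature, cell, state in product(
        BROWSERS, FEATURES, TEST_CELLS, CUSTOMER_STATES
    ):
        yield {"browser": browser, "feature": feature, "cell": cell, "state": state}


def run_matrix(check):
    """Run check(variation) for every combination and collect the failures."""
    failures = []
    for variation in variations():
        try:
            check(variation)
        except AssertionError as err:
            failures.append((variation, str(err)))
    return failures


if __name__ == "__main__":
    def check(variation):
        # Placeholder for a real flow; here we only confirm the cell is a known one.
        assert variation["cell"] in TEST_CELLS, "unknown test cell"

    print(len(list(variations())), "combinations")
    print(len(run_matrix(check)), "failures")
```

Each added dimension value multiplies the number of combinations, which is why a single parameterized script is preferable to hand-written cases for each one.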
Selenium is a powerful framework for testing browser-based applications. Because it is written in JavaScript and runs in the actual target browser, test scripts have access to all of the components that make up the rendered page. The inherent power of this model can be leveraged by judicious design and development of the code that creates the user interface. By providing a clear and consistent framework for rendering the site, it becomes simpler to discover and verify that the right content is being displayed in the right places on the page.
By providing a ‘real end-user browser experience’ to test against, and by providing tools as part of the website itself to drive a test into a specific user experience, one set of scripts can verify that common functionality doesn’t break, even across many variations of the user experience.
Scaling Systems
Components
In the beginning, everything lived in the web application, all of the classes were available to each other, and that was good enough. Then, the needs of the application grew. There were more features, customers and movies. A monolithic application could no longer scale. This was not good. The testability of the application was impaired because in order to test anything, you had to spin up the whole web server and all of its supporting components.
As the computational needs of different parts of the application have become more intense, the carrying capacity of an individual web server in terms of the number of simultaneous customers has shrunk. Upgrading the web server hardware has helped expand capacity in counterpoint to this trend. Migrating to a 64-bit OS and JVM helped to crack the 1.6GB effective maximum heap size imposed by the 2GB process space limit of a 32-bit OS. However, as Parkinson’s Law [3] has been applied, the base heap consumption of a web server immediately following startup has risen from on the order of 400MB to on the order of 2GB. A ‘fully loaded’ 32-bit server carrying 1,600 customer sessions consumed approximately 1.2GB of the 1.6GB available. When the application first migrated from 32-bit to 64-bit, the 8GB heap was able to comfortably support up to 6,400 customer sessions. This has dwindled over the last year down to 2,400 customer sessions per server as the per-customer data used to customize the site presentation and the computational cost of that customization have grown.
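A back-of-the-envelope calculation helps show how much the per-session footprint has grown. The sketch below relies on assumptions the paper does not state: that the 400MB base figure applies to the fully loaded 32-bit server, that the 2GB base figure applies to today's 64-bit server, and that the 8GB heap is effectively full at 2,400 sessions.

```python
MB_PER_GB = 1024


def per_session_mb(heap_used_gb, base_mb, sessions):
    """Heap left after the startup baseline, divided evenly across sessions."""
    return (heap_used_gb * MB_PER_GB - base_mb) / sessions


# 32-bit era: ~1.2GB used, ~400MB baseline (assumed to apply), 1,600 sessions.
old = per_session_mb(1.2, 400, 1600)

# 64-bit today: 8GB heap (assumed full), ~2GB baseline, 2,400 sessions.
new = per_session_mb(8, 2 * MB_PER_GB, 2400)

print(f"32-bit era: ~{old:.2f} MB per session")
print(f"64-bit now: ~{new:.2f} MB per session")
print(f"growth:     ~{new / old:.1f}x")
```

Under these assumptions the per-session cost rises from roughly half a megabyte to about 2.6MB, close to a fivefold increase, which is consistent with capacity falling even though the heap grew.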
A major effort is underway to abstract business logic out of the core web application into dedicated middle-tier servers. This is in order to meet the challenges of greater data volumes, more personalization and more customers. This will enable improved scalability of the web application, and improved testability of the business logic that drives the web application. It will improve the scalability of the application because individual components serving individual functions can be horizontally scaled. Testability of the business logic is improved because the presentation layer will reside in the front-end and all business logic must be accessed through a well defined API.
Operational Insight
When many developers are modifying the same code base, you cannot foresee and test all of the possible interactions. This is why it is important to build the application on a foundation of insight into the runtime operating environment. For example, developers have built a ‘tracer’ class that can be used to instrument method calls. The data gathered and summarized in hourly buckets includes:
- minimum, maximum and mean execution time
- the number of executions
- a histogram for the number of executions taking <10 milliseconds, 10-50 msec, 50-100 msec, and on up to > 10 seconds.
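The statistics listed above could be gathered by something like the following Python sketch. It is not the actual Netflix tracer, whose design the paper does not show; the class and method names are hypothetical, and the histogram edges between 100 msec and 10 seconds are assumed.

```python
import time
from bisect import bisect_right
from collections import defaultdict
from contextlib import contextmanager

# Upper bucket edges in milliseconds. The paper names <10, 10-50, 50-100 and >10 s;
# the edges in between are assumptions.
EDGES_MS = [10, 50, 100, 500, 1000, 5000, 10000]


class Bucket:
    """Summary statistics for one traced name within one hour."""

    def __init__(self):
        self.count = 0
        self.total_ms = 0.0
        self.min_ms = float("inf")
        self.max_ms = 0.0
        self.histogram = [0] * (len(EDGES_MS) + 1)

    def record(self, elapsed_ms):
        self.count += 1
        self.total_ms += elapsed_ms
        self.min_ms = min(self.min_ms, elapsed_ms)
        self.max_ms = max(self.max_ms, elapsed_ms)
        # bisect_right maps e.g. 5 -> bucket 0 (<10) and 12000 -> last bucket (>=10 s).
        self.histogram[bisect_right(EDGES_MS, elapsed_ms)] += 1

    @property
    def mean_ms(self):
        return self.total_ms / self.count if self.count else 0.0


class Tracer:
    """Collects timings per (name, hour) bucket."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._buckets = defaultdict(Bucket)

    def record(self, name, elapsed_ms, when=None):
        when = self._clock() if when is None else when
        self._buckets[(name, int(when // 3600))].record(elapsed_ms)

    @contextmanager
    def trace(self, name):
        """Time the enclosed block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000.0)

    def summary(self, name, hour):
        return self._buckets.get((name, hour))
```

A caller would wrap an instrumented call in `with tracer.trace("QueueService.load"):` (a made-up name) and read back `tracer.summary(...)` for the hour of interest.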
Tools like these can be used both during the QA cycle and during the triage of production problems. The ability to see what’s going on inside the system at production runtime is invaluable in recovering quickly and effectively from performance issues. It is also invaluable in understanding the change in the performance profile of an application over time.
Scaling Teams
Team Communications
Four years ago, the web quality assurance team was four QA engineers and a manager. This was the only QA team in the company at the time. Back end systems developers did their own testing. The web QA manager was also in charge of the engineers who built the user interface of the website. There were three functional silos and everyone working on the website was on the same wing of the same floor. There was even room for the data warehouse team in the same space. The QA engineers and production engineers were sitting in the middle of the developers for their silo, along with the product managers for that silo. If there were questions or clarifications, they were to be had by sticking your head up over the cubicle wall and asking.
Fast forward four years, and what was Web Engineering is now Product Development. This team now covers two floors of a much larger building, split into twelve different functional silos. The three QA teams have three managers, fourteen QA engineers (with open requisitions) and a configuration and release engineer. A separate QA team for the back end systems has also grown from non-existence to eight members. Six product managers and six designers provide work for the functional silos, but now they’re all sitting clustered together instead of living with their teams. The QA engineers and eight production engineers are still sitting among the developers they support, but there are many more developers being supported. The number of interested parties has grown, but the number of interconnections has grown faster. As the QA engineers have become more specialized, the need to share expertise with peers has become more critical.
Component Organization
With a monolithic web application and many development teams, coordination becomes a major source of overhead. Push schedules across the different functional silos must be coordinated because tightly coupled classes and business logic spread across the web application and the database require that we deploy the whole world at once. To allow functional silos to push when they’re ready instead of all pushing at the same time, we must break down the monolith into independent services running in separate processes on separate machines. This enforces encapsulation, reduces coupling and increases the cohesion of each service. This allows for more focused triage in the event of a system issue. If something melts down, only the team responsible for the problem area has to stop new development work to figure it out.
QA Team Organization
Four years ago, every QA engineer on the team did everything:
- tested features
- built and shared what test automation and tools we had
- shared responsibility for builds
- shared responsibility for deployments
- pitched in on production monitoring tools and triage of site emergencies
An experiment in the fall of 2004 spun out the tool building and test automation function into a dedicated role. That experiment met with some limited successes, and still shows promise for future investment. We have gained some traction in having:
- a consistently running nightly regression
- a greater awareness among developers of the need for unit tests
- a greater focus on capturing QA’s expertise in automated tests instead of tribal knowledge
In the fall of 2005, another experiment dedicated one person to configuration and release duties. That experiment has been a complete success. The tool of yesterday was a Word document describing all of the manual steps needed to build, deploy and restart a web server from the command line of each server in the farm. The tool of today allows for selection of a particular build from a particular branch with push-button build, deploy and restart, on multiple QA or production servers simultaneously, all triggered and monitored from a simple and intuitive web interface.
Now, every QA engineer can spend their time focused on the tasks that maximize the mitigation of business risk:
- testing features
- building automated tests
- pitching in on production monitoring tools and triage of site emergencies
It’s time for another experiment. The team has grown from three full-time engineers and a hands-on manager to ten engineers and a full-time manager. The single manager has become a choke-point for all of the relevant information that needs to flow from the development teams through QA to the operational teams. To address this, splitting the team is proposed.
Instead of a single QA organization concerned with the health of the entire website, two teams are formed, each focused on one aspect of the whole. The first is concerned with “Customers” and the second is concerned with “Movies”. The Customer team has responsibility for all parts of the system that are about the subscribers: signing them up, maintaining their account information and queues, Customer Service tools and managing who they associate with via the Friends feature. The Movie team has responsibility for all parts of the system that are about the movies: merchandising, metadata (titles, actors, inventory levels, etc.), ratings, recommendations and instant viewing.
These are not job descriptions for which specific engineers are hired to do each type of testing. The following table describes more the nature of a tester and how they approach their job within the context of a particular functional silo.
| Attribute | Functional | Operational |
| --- | --- | --- |
| Testing Perspective | Inward looking (back towards the developers and product managers) | Outward looking (forward towards the NOC and the production systems) |
| Testing Focus | Reviews changes at the feature and functional level | Reviews changes at the system level and integrated systems level |
| Guardian of... | ... the member experience | ... business continuity |
| Strongest Testing Style | Black box, but not necessarily all black box testing | White box, but also production monitoring and production operational tools |
| Answers the question... | “Does it do what we said we want it to do?” | “Will this behave how we need it to when we roll it out?” |

Bonus Points (both columns): Build tools to help figure out how to fix the system when we find that we answered these questions incorrectly.
The expected benefits are:
- Each sub-team improves their load sharing among the team members by raising awareness of these two sets of attributes within the role of each QA engineer. Each team member identifies their strengths and weaknesses relative to these attributes and seeks other team members to shore up their weak points.
- Each sub-team is able to better identify risks to the business from each major functional area because of the many eyes on the narrowed but shared focus (Customer vs. Movies).
- Each sub-team is able to flexibly distribute testing load over a given push cycle without negotiations at the level of the functional silo engineering managers having to take place.
Conclusion
As a software product grows, development organizations face challenges that grow as fast or faster. Blindly throwing more people at these challenges can result in a tangled mess of toes being trod upon and defects slipping into production. Throwing intelligent people who know how to get things done at these challenges can delay the day of reckoning. This solution relies on personal heroics and sheer talent. However, the scalable solution is to use these talented people to decompose the problem in two dimensions. First, simplify the application by breaking it down into smaller units that can be tested in an automated fashion and can independently move forward as the part of the business it serves needs it to. Second, simplify the team by breaking down into smaller units focused on individual components instead of the whole system.
A lot can be learned from movies. For example, there are lessons to be found in The Wizard of Oz that can be applied to how to scale a software quality team. Dorothy is a perfect example of what can be done with a small team. She starts out all alone, lost and bewildered in a land that someone else has built. She does what any good QA engineer would do: she starts asking intelligent questions. Along the way, she builds a team that has all the right skills to solve a variety of problems, where the strengths and weaknesses of the team members all complement each other. Eventually, the team gets to where she thinks it needs to be. They’ve reached the oracle that knows all the answers and can fix all their problems. However, she then learns that her journey has just begun. This all-powerful wizard she’s been seeking out is merely a clever man who built an automated system so as to appear oracular. Dorothy wakes from her dream just as the real fun in Oz starts, where real problems get solved in a repeatable manner by attacking the most important ones in an intelligent fashion.
[1] Netflix. 2007. “8 Great Reasons to Work at Netflix.” Retrieved June 25, 2007 (http://www.netflix.com/Jobs?id=5366#reason6)
[2] OpenQA. “Selenium.” (http://www.openqa.org/selenium/)
[3] Wikipedia. 2007. “Parkinson’s Law.” Retrieved June 26, 2007 (http://en.wikipedia.org/wiki/Parkinson's_law)