The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win

Metadata

Highlights & Notes

Because they’re likely art or music majors, not people with a technology background, they’ll publicly promise the impossible, and IT will have to figure out how to deliver.

Each year, it gets harder. We have to do more with less, to simultaneously maintain competitiveness and reduce costs.

Situations like this only reinforce my deep suspicion of developers: They’re often carelessly breaking things and then disappearing, leaving Operations to clean up the mess. The only thing more dangerous than a developer is a developer conspiring with Security. The two working together gives us means, motive, and opportunity.

When John agrees, I thank him for his time. “Wait, one more question. Why do you believe that this product didn’t cause the failure? Did you test the change?” There’s a short silence on the phone before John replies, “No, we couldn’t test the change. There’s no test environment. Apparently, you guys requested a budget years ago, but…” I should have known.

She says defensively, “We need more process around here and better support from the top, including IT process tooling and training. Everyone thinks that the real way to get work done is to just do it. That makes my job nearly impossible.”

“What do you mean, ‘e-mail everyone?’ There’s no system where people put in their changes? What about our ticketing system or the change-authorization system?” I ask, stunned. This is like Scotland Yard e-mailing everyone in London to find out who was near the scene of a crime.

Something seems wrong in a world where half the e-mail messages sent are urgent. Can everything really be that important?

He says, “I’m pretty sure it’s about Brent not finishing up that configuration work for the Phoenix developers. Everyone is chasing their tails because the developers can’t actually tell us what the test environment should look like. We’re doing our best, but every time we deliver something, they tell us we did it wrong.”

“Two weeks ago. It’s the typical bullshit with Development, but worse. They’re so freaked out about hitting their deadlines, they’re only now starting to think about how to test and deploy it. Apparently, they’re making it our problem. I hope you’re wearing your asbestos underwear like me. Sarah is going to be at that meeting with torches, wanting to throw us onto the bonfire.”

Chris is constantly asked to deliver more features and do it in less time, with less money.

I gape at Chris. He just made up an arbitrary date to go into production, with complete disregard for all the things we need to do before deployment.

“Now just wait a minute here!” Wes interjects, pounding the table. “What the hell are you trying to pull? We just found out two weeks ago about the specifics of the Phoenix deployment. Your guys still haven’t told us what sort of infrastructure we need, so we can’t even order the necessary server and networking gear. And by the way, the vendors are already quoting us three-week delivery times!” He is now facing Chris, pointing at him angrily. “Oh, and I’ve heard that the performance of your code is so shitty, we’re going to need the hottest, fastest gear out there. You’re supposed to support 250 transactions per second, and you’re barely doing even four! We’re going to need so much hardware that we’ll need another chassis to put it all in and probably have to pay a custom-manufacturing fee to get it in time. God knows what this will do to the budget.” Chris wants to respond, but Wes is relentless. “We still don’t have a concrete specification of how the production and test systems should be configured. Oh, do you guys not need a test environment anymore? You haven’t even done any real testing of your code yet, because that fell off the schedule, too!”

The plot is simple: First, you take an urgent date-driven project, where the shipment date cannot be delayed because of external commitments made to Wall Street or customers. Then you add a bunch of developers who use up all the time in the schedule, leaving no time for testing or operations deployment. And because no one is willing to slip the deployment date, everyone after Development has to take outrageous and unacceptable shortcuts to hit the date.

“What the hell happened in there? How did we get into this position? Does anyone know what’s required from us to support this launch?” “No one has a clue,” he says, shaking his head in disgust. “We haven’t even agreed on how to do the handoff with Development. In the past, they’ve just pointed to a network folder and said, ‘Deploy that.’ There are newborn babies dropped off at church doorsteps with more operating instructions than what they’re giving us.”

He’s right. Unless we can break this cycle, we’ll stay in our terrible downward spiral. Brent needs to work with developers to fix issues at the source so we can stop fighting fires. But Brent can’t attend, because he’s too busy fighting fires.

But, even after two years, all we have is a great process on paper that no one follows and a tool that no one uses. When I pester people to use them, all I get are complaints and excuses.”

Before, I was merely worried that IT Operations was under attack by Development, Information Security, Audit, and the business. Now, I’m starting to realize that my primary managers seem to be at war with each other, as well. What will it take for us to all get along?

“Oh, come on.” I mutter. “Brent. Brent, Brent, Brent! Can’t we do anything without him? Look at us! We’re trying to have a management discussion about commitments and resources, and all we do is talk about one guy! I don’t care how talented he is. If you’re telling me that our organization can’t do anything without him, we’ve got a big problem.”

I have a sinking feeling in the pit of my stomach. How can we manage production if we don’t know what the demand, priorities, status of work in process, and resource availability are? Suddenly, I’m kicking myself that I didn’t ask these questions on my first day.

“Well, we need to start. We can’t make new commitments to other people when we don’t even know what our commitments are now!” I say. “At the very least, get me the work estimate to fix the audit findings. Then, for each of those resources, tell me what their other commitments are that we’re going to be pulling them off of.”

“We should have done this a long time ago. We bump up the priorities of things all the time, but we never really know what just got bumped down. That is, until someone screams at us, demanding to know why we haven’t delivered something.”

Any improvement made after the bottleneck is useless, because it will always remain starved, waiting for work from the bottleneck. And any improvements made before the bottleneck merely result in more inventory piling up at the bottleneck.”

“Oh, really?” He turns to me, frowning intensely. “Let me guess. You’re going to say that IT is pure knowledge work, and so therefore, all your work is like that of an artisan. Therefore, there’s no place for standardization, documented work procedures, and all that high-falutin’ ‘rigor and discipline’ that you claimed to hold so near and dear.”

“Your job as VP of IT Operations is to ensure the fast, predictable, and uninterrupted flow of planned work that delivers value to the business while minimizing the impact and disruption of unplanned work, so you can provide stable, predictable, and secure IT service.”

“The First Way helps us understand how to create fast flow of work as it moves from Development into IT Operations, because that’s what’s between the business and the customer. The Second Way shows us how to shorten and amplify feedback loops, so we can fix quality at the source and avoid rework. And the Third Way shows us how to create a culture that simultaneously fosters experimentation, learning from failure, and understanding that repetition and practice are the prerequisites to mastery.”

I look levelly at her, wondering whether these briefings that we give the outside world might be what is causing such pressure on Chris’ team to release features so prematurely.

He continues in a more sympathetic voice. “My suggestion to you? Go to your peers and make your case to them. If your case is really valid, they should be willing to transfer some of their budget to you. But let me be clear: Any budget increases are out of the question. If anything, we may have to cut some heads in your area.”

I realize that changes are the third category of work. When Patty’s people moved around the change cards, from Friday to earlier in the week, they were changing our work schedule. Each of those change cards defined the work that my team was going to be doing that day.

But what is the relationship between changes and projects? Are they equally important?

If we had exactly the amount of resources to take on all our project work, does this mean we might not have enough cycles to implement all these changes?

“You mean, like in our ticketing system? No, because opening up a ticket for each of those calls would take longer than fixing the problem,” Brent says dismissively.

For way too many things, Brent seems to be the only one who knows how they actually work.”

“I’m not suggesting Brent is doing this deliberately, but I wonder whether Brent views all his knowledge as a sort of power. Maybe some part of him is reluctant to give that up. It does put him in this position where he’s virtually impossible to replace.” “Maybe. Maybe not,” I say. “I’ll tell you what I do know, though. Every time that we let Brent fix something that none of us can replicate, Brent gets a little smarter, and the entire system gets dumber. We’ve got to put an end to that. “Maybe we create a resource pool of level 3 engineers to handle the escalations, but keep Brent out of that pool. The level 3s would be responsible for resolving all incidents to closure, and would be the only people who can get access to Brent—on one condition. “If they want to talk with Brent, they must first get Wes’ or my approval,” I say. “They’d be responsible for documenting what they learned, and Brent would never be allowed to work on the same problem twice. I’d review each of the issues weekly, and if I find out that Brent worked a problem twice, there will be hell to pay. For both the level 3s and Brent.” I add, “Based on Wes’ story, we shouldn’t even let Brent touch the keyboard. He’s allowed to tell people what to type and shoulder-surf, but under no condition will we allow him to do something that we can’t document afterward. Is that clear?”

It’s not a good sign when they’re still attaching parts to the space shuttle at liftoff time.

One of the developers had actually walked in a couple of minutes ago and said, “Look, it’s running on my laptop. How hard can it be?”

I smirk at the reference to smoke tests, a term circuit designers use. The saying goes, “If you turn the circuit board on and no smoke comes out, it’ll probably work.” He shakes his head and says, “We have yet to make it through the smoke test. I’m concerned that we no longer have sufficient version control—we’ve gotten so sloppy about keeping track of version numbers of the entire release. Each time they fix something, they’re usually breaking something else. So, they’re sending single files over instead of the entire package.”

I wince, thinking about how this will tie up even more of our guys, doing menial work that the broken application should be doing. Nothing worries auditors more than direct edits of data without audit trails and proper controls.

Although Wes and Patty seem simultaneously pleased and taken aback by this sudden cooperation from Development, Sarah is not pleased. She says, “I don’t agree. We’ve got to be able to respond to the market, and the market is telling us that Phoenix is too hard to use. We can’t afford to screw this up.” Chris replies, “Look, the time for usability testing and validation was months ago. If we didn’t get it right the first time, we’re not going to get it right without some real work. Have your product managers work on their revised mockups and proposals. We’ll try to get it in as soon as we can after the crisis is over.”

As I’m talking, I realize how liberating it is to state that my team is absolutely at capacity and that there aren’t any calories left over for any new tasks, and people actually believe me.

“Maybe my group being outsourced wouldn’t be the worst thing in the world. I’ve been in software development for virtually my entire career. I’m used to everyone demanding miracles, expecting the impossible, people changing requirements at the last minute, but, after living through this latest nightmare project, I wonder if it might be time for a change…”

“It’s like the free puppy,” I continue. “It’s not the upfront capital that kills you, it’s the operations and maintenance on the back end.” Chris cracks up. “Yes, exactly! They’ll say, ‘The puppy can’t quite do everything we need. Can you train it to fly airplanes? It’s just a simple matter of coding, right?’”

“You know, we’re struggling, too. We’ve never had so many problems hitting our ship dates. My engineers keep getting pulled off of feature development to handle escalations when things break. And deployments keep taking longer and longer. What used to take ten minutes to deploy starts taking an hour. Then a full day, then an entire weekend, then four days. I’ve even got some deployments that are now taking over a week to complete.

“What use is it having all these offshore developers building features if we aren’t getting to market any faster? We keep lengthening the deployment intervals, so that we can get more features deployed in each batch.”

What’s happening with Phoenix is a combination of the need to deliver needed features to market, forcing us to take shortcuts, which are causing ever-worsening deployments.

Ah… Now I see it. What can displace planned work? Unplanned work. Of course.

“Yes, I think I can,” I say. “At the plant, I gave you one category, which was business projects, like Phoenix,” I say. “Later, I realized that I didn’t mention internal IT projects. A week after that, I realized that changes are another category of work. But it was only after the Phoenix fiasco that I saw the last one, because of how it prevented all other work from getting completed, and that’s the last category, isn’t it? Firefighting. Unplanned work.” “Precisely!” I hear Erik say. “You even used the term I like most for it: unplanned work. Firefighting is vividly descriptive, but ‘unplanned work’ is even better. It might even be better to call it ‘anti-work,’ since it further highlights its destructive and avoidable nature. “Unlike the other categories of work, unplanned work is recovery work, which almost always takes you away from your goals. That’s why it’s so important to know where your unplanned work is coming from.” I smile as he acknowledges my correct answer, and am oddly pleased that he validated my antimatter notion of unplanned work, as well.

“Very good,” he says. “You’ve put together tools to help with the visual management of work and pulling work through the system. This is a critical part of the First Way, which is creating fast flow of work through Development and IT Operations. Index cards on a kanban board are one of the best mechanisms to do this, because everyone can see WIP. Now you must continually eradicate your largest sources of unplanned work, per the Second Way.”

You’ve started to take steps to stabilize the operational environment, you’ve started to visually manage WIP within IT Operations, and you’ve started to protect your constraint, Brent. You’ve also reinforced a culture of operational rigor and discipline.

Remember, unplanned work kills your ability to do planned work, so you must always do whatever it takes to eradicate it. Murphy does exist, so you’ll always have unplanned work, but it must be handled efficiently.

You get what you design for. Chester, your peer in Development, is spending all his cycles on features, instead of stability, security, scalability, manageability, operability, continuity, and all those other beautiful ’itties.

Being able to take needless work out of the system is more important than being able to put more work into the system. To do that, you need to know what matters to the achievement of the business objectives, whether it’s projects, operations, strategy, compliance with laws and regulations, security, or whatever.”

“Remember, outcomes are what matter—not the process, not controls, or, for that matter, what work you complete.”

“A great team doesn’t mean that they had the smartest people. What made those teams great is that everyone trusted one another. It can be a powerful thing when that magic dynamic exists.

Five Dysfunctions of a Team, by Patrick Lencioni.

“If that’s true,” I say, digging in, “there’s something really wrong with our definition of what a ‘completed project’ is. If it means ‘Did Chris get all his Phoenix tasks done?’ then it was a success. But if we wanted Phoenix in production that fulfilled the business goals, without setting the entire business on fire, we should call it a total failure.”

“I don’t know. But this is a recurring pattern. Chris’ group never factors in all the work that Operations needs to do. And even when they do, they use up all the time in the schedule, leaving none for us. And we’re always left cleaning up the mess, long afterward.”

I reply, “Erik has helped me understand that there are four types of IT Operations work: business projects, IT Operations projects, changes, and unplanned work. But, we’re only talking about the first type of work, and the unplanned work that gets created when we do it wrong. We’re only talking about half the work we do in IT Operations.”

“Erik asked me how we made the same type of decision in IT,” I recall. “I told him then, and I’ll tell you now, I don’t know. I’m pretty sure we don’t do any sort of analysis of capacity and demand before we accept work. Which means we’re always scrambling, having to take shortcuts, which means more fragile applications in production. Which means more unplanned work and firefighting in the future. So, around and around we go.”

To my surprise, Erik interrupts. “Well put, Bill. You’ve just described ‘technical debt’ that is not being paid down. It comes from taking shortcuts, which may make sense in the short-term. But like financial debt, the compounding interest costs grow over time. If an organization doesn’t pay down its technical debt, every calorie in the organization can be spent just paying interest, in the form of unplanned work.”

“As you know, unplanned work is not free,” he continues. “Quite the opposite. It’s very expensive, because unplanned work comes at the expense of…” He looks around professorially for an answer. Wes finally speaks up, “Planned work?”

“Precisely!” Erik says jovially. “Yes, that’s exactly right, Chester. Bill mentioned the four types of work: business projects, IT Operations projects, changes, and unplanned work. Left unchecked, technical debt will ensure that the only work that gets done is unplanned work!”

He addresses the rest of the room. “Unplanned work has another side effect. When you spend all your time firefighting, there’s little time or energy left for planning. When all you do is react, there’s not enough time to do the hard mental work of figuring out whether you can accept new work. So, more projects are crammed onto the plate, with fewer cycles available to each one, which means more bad multitasking, more escalations from poor code, which mean more shortcuts. As Bill said, ‘around and around we go.’ It’s the IT capacity death spiral.”

Uncertain, I ask Steve, “Are we even allowed to say no? Every time I’ve asked you to prioritize or defer work on a project, you’ve bitten my head off. When everyone is conditioned to believe that no isn’t an acceptable answer, we all just became compliant order takers, blindly marching down a doomed path. I wonder if this is what happened to my predecessors, too.”

I need you to say no! We cannot afford to have this leadership team be order takers. We pay you to think, not just

“If you, or for that matter, anyone knows that a project will fail, I need you to say so. And I need it backed up with data. Give me data like that plant coordinator showed you, so we can understand why. Sorry, Bill, I like you a lot, but saying no just based on your gut is not enough.”

“We started restoring sanity when we figured out where our constraint was. Then we protected it, making sure that time on the constraint was never wasted. And we did everything to make sure work flowed through it.”

“Remember, Jimmy, the goal is to increase the throughput of the entire system, not just increase the number of tasks being done. And if you don’t have a trustworthy system of work, why should I trust your system of security controls? Bah. A total waste of time.”

“If we single-task on the most important project for two weeks and still aren’t able to make a big dent, then I think we should all find new day jobs.”

Let the inevitable happen, and we’ll see what we can learn from it.”

It’s astonishing what we agree to in the next hour. IT Operations will freeze all non-Phoenix work. Development can’t idle the twenty-plus non-Phoenix projects, but will freeze all deployments. In other words, no work will flow from Development to IT Operations for another two weeks. Furthermore, we will identify the top areas of technical debt, which Development will tackle to decrease the unplanned work being created by problematic applications in production. This will all make a huge difference in my team’s workload.

Wes leans back and says, “Well, that makes it official. Your project freeze is actually working.” Patty looks over at him, appearing surprised. “You actually doubted it? Come on, we’ve both been talking about how we’ve never seen people so focused before. It’s amazing how the project freeze has reduced the priority conflicts and bad multitasking. We know it’s made a huge difference in productivity.”

“Well,” Wes says, “we trust them to make the right decision, based on the data they have. That’s why we hire smart people.” This is not good.

“Let’s be honest,” Patty says. “Priority 1 is whoever is yelling the loudest, with the tie-breaker being who can escalate to the most senior executive. Except when they’re more subtle. I’ve seen a bunch of my staff always prioritizing a certain manager’s requests, because he takes them out to lunch once a month.”

I’m more unsettled than ever. When we have more than one project in the system at the same time, how do we protect the work from being interrupted or having its priority trumped by almost anyone in the business or someone else in IT?

Despite my fretting, I realize how refreshing it is to be able to think about what work we need to be doing and how to prioritize and release it. For a moment, I marvel at the lack of constant firefighting that dominated so much of my career in IT. The types of issues we’re having to solve lately are so…cerebral. It’s what I thought management was all about when I got my MBA. I’m convinced that if we do a good job thinking, we can make a real difference.

“Good. Understanding the flow of work is key to achieving the First Way,”

“Brent is a worker, not a work center,” I say again. “And I’m betting that Brent is probably a worker supporting way too many work centers. Which is why he’s a constraint.”

continues, “every work center is made up of four things: the machine, the man, the method, and the measures.

“You’re right. I’ve heard my managers complain that if Brent were hit by the proverbial bus, we’d be completely up the creek. No one knows what’s in Brent’s head. Which is one of the reasons I’ve created the level 3 escalation pool.”

“You’re standardizing Brent’s work so that other people can execute it. And because you’re finally getting those steps documented, you’re able to enforce some level of consistency and quality, as well. You’re not only reducing the number of work centers where Brent is required, you’re generating documentation that will enable you to automate some of them.”

“Incidentally, until you do this, no matter how many more Brents you hire, Brent will always remain your constraint. Anyone you hire will just end up standing around.”

Even though he got the additional headcount to hire more Brents, we never were able to actually increase our throughput.

“The candidate projects which are safe to release are those that don’t require Brent.”

“What you’re building is the bill of materials for all the work that you do in IT Operations. But instead of a list of parts and subassemblies, like moldings, screws, and casters, you’re cataloging all the prerequisites of what you need before you can complete the work—like laptop model numbers, specifications of user information, the software and licenses needed, their configurations, version information, the security and capacity and continuity requirements, yada yada…”

“Well, to be more accurate, you’re actually building a bill of resources. That’s the bill of materials along with the list of the required work centers and the routing. Once you have that, along with the work orders and your resources, you’ll finally be able to get a handle on what your capacity and demand is. This is what will enable you to finally know whether you can accept new work and then actually be able to schedule the work.”

says, “Your second question was whether it was safe to start your monitoring project. You already established it doesn’t require Brent. Furthermore, you say that the goal of this project is to prevent outages, which prevents Brent escalations. More than that, when outages do occur, you’ll need less of Brent’s time to troubleshoot and fix. You’ve already identified the constraint, exploited it to squeeze the most out of it, and then you’ve subordinated the flow of work to the constraint. So, how important is this monitoring project?” I think for a moment. And then groan at the obvious answer. I run my fingers through my hair. “You said that we always need to be looking for ways to elevate the constraint, which means I need to do whatever is required to get more cycles from Brent. That’s exactly what the monitoring project does!” I say with some disbelief that I didn’t see this before, “The monitoring project is probably the most important improvement project we have—we need to start this project right away.”

“Properly elevating preventive work is at the heart of programs like Total Productive Maintenance, which has been embraced by the Lean Community. TPM insists that we do whatever it takes to assure machine availability by elevating maintenance.

‘Improving daily work is even more important than doing daily work.’

The Third Way is all about ensuring that we’re continually putting tension into the system, so that we’re continually reinforcing habits and improving something. Resilience engineering tells us that we should routinely inject faults into the system, doing them frequently, to make them less painful.

“Sensei Mike Rother says that it almost doesn’t matter what you improve, as long as you’re improving something. Why? Because if you are not improving, entropy guarantees that you are actually getting worse, which ensures that there is no path to zero errors, zero work-related accidents, and zero loss.”

“Sensei Rother calls this the Improvement Kata,” he continues. “He used the word kata, because he understood that repetition creates habits, and habits are what enable mastery. Whether you’re talking about sports training, learning a musical instrument, or training in the Special Forces, nothing is more important to mastery than practice and drills. Studies have shown that practicing five minutes daily is better than practicing once a week for three hours. And if you want to create a genuine culture of improvement, you must create those habits.”

Just as important as throttling the release of work is managing the handoffs. The wait time for a given resource is the percentage that resource is busy, divided by the percentage that resource is idle.

“A critical part of the Second Way is making wait times visible, so you know when your work spends days sitting in someone’s queue—or worse, when work has to go backward, because it doesn’t have all the parts or requires rework.

“No! When we actually followed the parts around on the plant floor, we found that for the majority of time, the parts were just sitting in queues. In other words, the ‘touch time’ was a tiny fraction of ‘total process time.’ Our expediters had to search through mountains of work to find the parts and push them through the work center,” he says incredulously. “That’s happening at your plant, too, so watch for it,” he says.

“You screw up something that jeopardizes the business’ ability to maintain compliance with relevant laws and regulations? You better fix it—or you should be fired.”

“Tell me. All those projects that Jimmy, your CISO, is pushing. Do they increase the flow of project work through the IT organization?” “No,” I quickly answer, rushing to catch up again. “Do they increase operational stability or decrease the time required to detect and recover from outages or security breaches?” I think a bit longer. “Probably not. A lot of it is just more busywork, and in most cases, the work they’re asking for is risky and actually could cause outages.” “Do these projects increase Brent’s capacity?” I laugh humorlessly. “No, the opposite. The audit issues alone could tie up Brent for the next year.” “And what would doing all of Jimmy’s projects do to WIP levels?” he asks, opening the door that takes us back into the stairwell. Exasperated, I say as we descend the two sets of stairs, “It would go through the roof. Again.” When we reach the bottom, Erik suddenly stops and asks, “Okay. These ‘security’ projects decrease your project throughput, which is the constraint for the entire business. And swamp the most constrained resource in your organization. And they don’t do squat for scalability, availability, survivability, sustainability, security, supportability, or the defensibility of the organization.” He asks deadpan, “So, genius: Do Jimmy’s projects sound like a good use of time to you?” As I start to answer, he just opens the exit door and walks through it. Apparently, it was a rhetorical question.

You win when you protect the organization without putting meaningless work into the IT system. And you win even more when you can take meaningless work out of the IT system.”

“This guy is like the QA manager who has his group writing millions of new tests for a product we don’t even ship anymore and then files millions of bug reports for features that no longer exist. Obviously, he is making what you and I would call a ‘scoping error.’”

“Jimmy, Parts Unlimited has at least four of my family’s credit card numbers in your systems. I need you to protect that data. But you’ll never adequately protect it when the work product is already in production. You need to protect it in the processes that create the work product.”

Patty explains that a kanban board, among many other things, is one of the primary ways our manufacturing plants schedule and pull work through the system. It makes demand and WIP visible, and is used to signal upstream and downstream stations.

“If it’s not on the kanban board, it won’t get done,” she says. “And more importantly, if it is on the kanban board, it will get done quickly. You’d be amazed at how fast work is getting completed, because we’re limiting the work in process. Based on our experiments so far, I think we’re going to be able to predict lead times for work and get faster throughput than ever.”
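
Note: a toy sketch (my own, not from the book) of the WIP-limit mechanic Patty describes, where a column refuses new cards until something already on the board is finished:

from collections import deque

class KanbanColumn:
    """Minimal kanban column: work is pulled in only while under the WIP limit."""
    def __init__(self, name, wip_limit):
        self.name = name
        self.wip_limit = wip_limit
        self.cards = deque()

    def can_pull(self):
        return len(self.cards) < self.wip_limit

    def pull(self, card):
        if not self.can_pull():
            raise RuntimeError(f"{self.name} is at its WIP limit; finish a card first")
        self.cards.append(card)

    def finish(self):
        return self.cards.popleft()  # the completed card signals the next station

doing = KanbanColumn("Doing", wip_limit=3)
for card in ("laptop build", "firewall change", "test environment"):
    doing.pull(card)
print(doing.can_pull())   # False: new work must wait until something finishes
doing.finish()
print(doing.can_pull())   # True: finishing work, not starting it, creates capacity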

“You know, I did a quick poll of people we’ve issued laptops to. It usually takes fifteen turns to finally get them configured correctly. I’m tracking that now, and trying to drive this down to three. We’re putting in checklists everywhere, especially when we do handoffs within the team. It’s really making a difference. Error rates are way down.”

We need to know whether it increases our capacity at our constraint, which is still Brent. Unless the project reduces his workload or enables someone else to take it over, maybe we shouldn’t even be doing it. On the other hand, if a project doesn’t even require Brent, there’s no reason we shouldn’t just do it.”

Preventive work is important, but it always gets deferred. We’ve been trying to do some of these projects for years! This is our chance to get caught up.”

Improving something anywhere not at the constraint is an illusion.

“The task that Kirsten called about is delivering a test environment to QA. As she said, Brent estimated that it would take only forty-five minutes.”

“The wait time is the ‘percentage of time busy’ divided by the ‘percentage of time idle.’ In other words, if a resource is fifty percent busy, then it’s fifty percent idle. The wait time is fifty percent divided by fifty percent, so one unit of time. Let’s call it one hour. So, on average, our task would wait in the queue for one hour before it gets worked. “On the other hand, if a resource is ninety percent busy, the wait time is ‘ninety percent divided by ten percent’, or nine hours. In other words, our task would wait in queue nine times longer than if the resource were fifty percent idle.”
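
Note: the utilization arithmetic in this passage is easy to check. A minimal sketch using the same busy/idle rule of thumb quoted above (the function name is mine):

def wait_time(busy_fraction):
    """Rule of thumb from the passage: wait time = percent busy / percent idle."""
    idle_fraction = 1.0 - busy_fraction
    return busy_fraction / idle_fraction  # in arbitrary units of time

# Reproduces the book's examples and shows how waits explode near full utilization.
for busy in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{busy:.0%} busy -> wait of {wait_time(busy):.0f} unit(s)")
# 50% busy -> 1 unit, 90% busy -> 9 units, 99% busy -> 99 units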

Creating and prioritizing work inside a department is hard. Managing work among departments must be at least ten times more difficult.

“What that graph says is that everyone needs idle time, or slack time. If no one has slack time, WIP gets stuck in the system. Or more specifically, stuck in queues, just waiting.”

I add, “You know, if we can standardize all our recurring deployment work, we’ll finally be able to enforce uniformity of our production configurations. That would fix our infrastructure snowflake problem—you know—no two alike. How Brent turned into Brent is that we allowed him to build infrastructure only he can understand. We can’t let that happen again.” “Good point,” Wes grunts. “You know, it’s odd. So many of these problems we’ve been facing are caused by decisions we made. We have met the enemy. And he is us.”

“You know, deployments are like final assembly in a manufacturing plant. Every flow of work goes through it, and you can’t ship the product without it.

Over the next forty-five minutes, we create our plan. Patty is going to work with Wes’ team to assemble the top twenty most frequently recurring tasks. She will also figure out how to better manage and control tasks when they are queued. Patty proposes a new role, a combination of a project manager and expediter. Instead of day-by-day oversight, they would provide minute-by-minute control. She says, “We need fast and effective handoffs of any completed work to the next work center. If necessary, this person will wait at the work center until the work is completed and carry it to the next work center. We’ll never let critical work get lost in a pile of tickets again.” “What? Someone assigned to carry around tasks from person to person, like a waiter?” Wes asks in disbelief. “At MRP-8, they have a ‘water spider’ role that does exactly that,” she counters. “Almost all of this latest Phoenix delay was due to tasks waiting in queues or handoffs. This will make sure it doesn’t happen again. “Eventually,” she adds, “I’ll want to move all the kanbans, so that we don’t need a person acting as the signaling mechanism for work handoffs. Don’t worry. I’ll have it figured out in a couple of days.” Wes and I don’t dare doubt her.

There were still so many uncertainties. But unlike before, our challenges feel within our ability to understand and conquer. Our goals finally seem achievable. I no longer feel like I am always on my heels, with more and more people piling on, trying to push me over.

“To tell the truth is an act of love. To withhold the truth is an act of hate. Or worse, apathy.”

To help ensure that the company achieves our goals, I set up the objectives and measurements program for the entire management team. I wanted to keep all our managers accountable, ensure that they have the skills necessary to succeed, and help make sure that complex initiatives always have the right stakeholders involved and so forth.”

“In those moments, you wonder whether the problem is the economy, our strategy, our management team, you it guys, or, quite frankly, maybe the entire problem is me. Those are the days I just want to retire.”

CFO GOALS

Health of company:
  • Revenue
  • Market share
  • Average order size
  • Profitability
  • Return on assets

Health of Finance:
  • Order to cash cycle
  • Accounts receivable
  • Accurate and timely financial reporting
  • Borrowing costs

“These are the company goals and the objectives I’ve set for finance,” he explains. “I’ve learned that while the finance goals are important, they’re not the most important. Finance can hit all our objectives, and the company still can fail. After all, the best accounts receivables team on the planet can’t save us if we’re in the wrong market with the wrong product strategy with an R&D team that can’t deliver.” Startled, I realize he’s talking about Erik’s First Way. He’s talking about systems thinking, always confirming that the entire organization achieves its goal, not just one part of it.

Are we competitive?
  • Understanding customer needs and wants: Do we know what to build?
  • Product portfolio: Do we have the right products?
  • R&D effectiveness: Can we build it effectively?
  • Time to market: Can we ship it soon enough to matter?
  • Sales pipeline: Can we convert products to interested prospects?

Are we effective?
  • Customer on-time delivery: Are customers getting what we promised them?
  • Customer retention: Are we gaining or losing customers?
  • Sales forecast accuracy: Can we factor this into our sales planning process?

“Well, good for Jimmy. Or maybe I should call him ‘John.’ He finally got his head far enough out of his ass to begin to see,” I hear Erik say as he laughs, not unkindly. “As part of the First Way, you must gain a true understanding of the business system that IT operates in. W. Edwards Deming called this ‘appreciation for the system.’ When it comes to IT, you face two difficulties: On the one hand, in Dick’s second slide, you now see that there are organizational commitments that IT is responsible for helping uphold and protect that no one has verbalized precisely yet. On the other hand, John has discovered that some IT controls he holds near and dear aren’t needed, because other parts of the organization are adequately mitigating those risks. “This is all about scoping what really matters inside of IT. And like when Mr. Sphere told everyone in Flatland, you must leave the realm of IT to discover where the business relies on IT to achieve its goals.” I hear him continue, “Your mission is twofold: You must find where you’ve under-scoped IT—where certain portions of the processes and technology you manage actively jeopardize the achievement of business goals—as codified by Dick’s measurements. And secondly, John must find where he’s over-scoped IT, such as all those SOX-404 IT controls that weren’t necessary to detect material errors in the financial statements. “You may think that we’re mixing apples and oranges, but I assure you that we are not,” he continues. “Some of the wisest auditors say that there are only three internal control objectives: to gain assurance for reliability of financial reporting, compliance with laws and regulations, and efficiency and effectiveness of operations. That’s it. What you and John are talking about are just different sides of what is called the ‘COSO Cube.’”

Go talk to the business process owners for the objectives on Dick’s second slide. Find out what their exact roles are, what business processes underpin their goals, and then get from them the top list of things that jeopardize those goals.

“People think that just because IT doesn’t use motor oil and carry physical packages that it doesn’t need preventive maintenance,” Erik says, chuckling to himself. “That somehow, because the work and the cargo that IT carries are invisible, you just need to sprinkle more magic dust on the computers to get them running again. “Metaphors like oil changes help people make that connection. Preventive oil changes and vehicle maintenance policies are like preventive vendor patches and change management policies. By showing how IT risks jeopardize business performance measures, you can start making better business decisions.

“You’ll be ready for your meeting with Dick when you’ve built out the value chains, linking his objectives to how IT jeopardizes them. Assemble concrete examples of how IT issues have jeopardized those goals in the past. Make sure you’re prepared.”

There’s at least a predictable funnel that comes from marketing campaigns, generating prospects, leads, qualified leads, and sales opportunities that leads to a sales pipeline. One sales person missing their number rarely jeopardizes the entire department. On the other hand, any of my engineers can get me fired by making a seemingly small, harmless change that results in a crippling, enterprise-wide outage.

“Okay, it’s your nickel…” he says. “If you want to talk about sales forecast accuracy, you first need to know why it’s so inaccurate. It starts when Steve and Dick hand me a crazy revenue target, leaving me to figure out how to deliver on it. For years, I’ve had to assign way too much quota capacity to my team, so of course we keep missing our numbers! I tell Steve and Dick this, year after year, but they don’t listen. Probably because they’re having some arbitrary revenue target jammed down their throats by the board.

“We are clueless about what our customers want! We have too much product that will never sell and never enough of the ones that do.”

“You’re saying that ‘sales forecast accuracy’ is being jeopardized by our poor grasp of ‘understanding our customer needs and wants?’ And that if we know what products were out of stock in the stores, we could increase sales?” “You got it,” he says. “With the traffic we get in the stores, that’s the fastest and easiest way to increase revenue. It’s a lot easier than dealing with the fickle whims of our large automotive buyers, that’s for sure.” I make a note to myself to find out how stockout data are generated, and I see Patty furiously taking notes as well.

“Most of the time, we’re flying blind. Ideally, our sales data would tell us what customers want. You’d think that with all the data in our order entry and inventory management systems, we could do this. But we can’t, because the data are almost always wrong.”

“Our data quality is so bad that we can’t rely on it to do any sort of forecasting. The best data that we have right now comes from interviewing our store managers every two months and the customer focus groups we do twice a year. You can’t run a billion-dollar business this way and expect to succeed!

“Here, we get them once a month from Finance, but they’re full of errors. What do you expect? They’re done by a bunch of college interns, copying and pasting numbers between a million spreadsheets.”

“That’s a big magic wand,” she says, laughing. “I want accurate and timely order information from our stores and online channels. I want to press a button and get it, instead of running it through the circus we’ve created. I’d use that data to create marketing campaigns that continually do A/B testing of offers, finding the ones that our customers jump at. When we find out what works, we’d replicate it across our entire customer list. By doing this, we’d be creating a huge and predictable sales funnel for Ron. “I’d use that information to drive our production schedule, so we can manage our supply and demand curves. We’d keep the right products on the right store shelves and keep them stocked. Our revenue per customer would go through the roof. Our average order sizes would go up. We’d finally increase our market share and start beating the competition again.”

“In these competitive times, the name of the game is quick time to market and to fail fast. We just can’t have multiyear product development timelines, waiting until the end to figure out whether we have a winner or loser on our hands. We need short and quick cycle times to continually integrate feedback from the marketplace.

“But that’s just half the picture,” she continues. “The longer the product development cycle, the longer the company capital is locked up and not giving us a return. Dick expects that on average, our R&D investments return more than ten percent. That’s the internal hurdle rate. If we don’t beat the hurdle rate, the company capital would have been better spent being invested in the stock market or gambled on racehorses.

“When R&D capital is locked up as WIP for more than a year, not returning cash back to the business, it becomes almost impossible to pay back the business,”

We’ve spent over $20 million on Phoenix over three years. With all that WIP and capital locked inside the project, it will likely never clear the ten percent internal hurdle rate. In other words, Phoenix should not have been approved.
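
Note: the hurdle-rate arithmetic here is straightforward to work out. A back-of-the-envelope sketch using the figures quoted in the passage, assuming the ten percent compounds annually:

capital_locked = 20_000_000   # spent on Phoenix over three years (from the passage)
hurdle_rate = 0.10            # internal hurdle rate
years_locked = 3              # capital tied up as WIP, returning no cash

# What the same capital would have had to earn elsewhere at the hurdle rate.
break_even = capital_locked * (1 + hurdle_rate) ** years_locked
print(f"Phoenix must now return about ${break_even:,.0f} just to match the hurdle rate")
# roughly $26.6 million, before it creates any value beyond the alternative use of capital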

“We’re going way too slowly, with too much WIP and too many features in flight. We need to make our releases smaller and shorter and deliver cash back faster, so we can beat the internal hurdle rate.

“I’m rebuilding our compliance program from scratch, based upon our new understanding of precisely where we’re relying on our controls,” John says. “That dictates what matters. It’s like having a magic set of glasses that can differentiate what controls are earth-shatteringly important versus those that have no value at all.”

We can outsource the work but not the responsibility.”

“Over my dead body,” John says flatly, crossing his arms. “My fifth and last proposal is that we pay down all the technical debt in Phoenix, using all the time we’ve saved from my previous proposals. We know there’s a huge amount of risk in Phoenix: strategic risk, operational risk, huge security and compliance risk. Almost all of Dick’s key measures hinge on it.

There’s another Phoenix deployment scheduled for Friday. It’s only a bunch of defect fixes, with no major functionality added or changed, so it should be much better than last time. We’ve completed all of our deliverables on time, but as usual, there are still a million details that need to be worked out. I’m grateful that my team can stay so focused on Phoenix, because we’ve stabilized our infrastructure. When the inevitable outages and incidents do occur, we’re operating like a well-oiled machine. We’re building a body of tribal knowledge that’s helping us fix things faster than ever, and, when we do need to escalate, it’s controlled and orderly. Because of our ever-improving production monitoring of the infrastructure and applications, more often than not, we know about the incidents before the business does. Our project backlog has been cut way down, partially from eradicating dumb projects from our queue.

Because we have a better idea of what our flows of work are and are carefully managing which ones are allowed to go to Brent, we’re finding that we can keep releasing more projects without impacting our existing commitments.

Erik says that we are starting to master the First Way: We’re curbing the handoffs of defects to downstream work centers, managing the flow of work, setting the tempo by our constraints, and, based on our results from audit and from Dick, we’re understanding better than we ever have what is important versus what is not.

There should be absolutely no way that the Dev and qa environments don’t match the production environment.”

The First Way is all about controlling the flow of work from Development to IT Operations. You’ve improved flow by freezing and throttling the project releases, but your batch sizes are still way too large. The deployment failure on Friday is proof. You also have way too much WIP still trapped inside the plant, and the worst kind, too. Your deployments are causing unplanned recovery work downstream.” He continues, “Now you must prove that you can master the Second Way, creating constant feedback loops from IT Operations back into Development, designing quality into the product at the earliest stages. To do that, you can’t have nine-month-long releases. You need much faster feedback. “You’ll never hit the target you’re aiming at if you can fire the cannon only once every nine months. Stop thinking about Civil War era cannons. Think antiaircraft guns.”

“In any system of work, the theoretical ideal is single-piece flow, which maximizes throughput and minimizes variance. You get there by continually reducing batch sizes.
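
Note: a tiny illustration (illustrative numbers, not the book's) of why smaller transfer batches shorten feedback. With batch-and-queue flow, the first finished item has to wait for its whole batch at every station:

def first_item_lead_time(batch_size, stations=2, time_per_item=1):
    # With batch transfer, every station processes the full batch before handing it off.
    return stations * batch_size * time_per_item

for batch in (10, 5, 1):
    print(f"batch of {batch:2d}: first finished item after {first_item_lead_time(batch)} time units")
# batch of 10 -> 20 units, batch of 1 -> 2 units: single-piece flow gives the fastest feedback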

“You’re doing the exact opposite by lengthening the Phoenix release intervals and increasing the number of features in each release. You’ve even lost the ability to control variance from one release to the next.”

“The flow of work goes in one direction only: forward. Create a system of work in IT that does that. Remember, the goal is single-piece flow.”

An inevitable consequence of long release cycles is that you’ll never hit the internal rate of return targets, once you factor in the cost of labor. You must have faster cycle times.

“The benefits were enormous,” he says with pride. “First, when defects were found, we fixed them immediately and we didn’t have to scrap all the other parts in that batch. Second, wip was brought down because each work center never overproduced product, only to sit in the queue of the next work center. But the most important benefit was that order lead times were cut from one month to less than a week. We could build and deliver whatever and however many the customer wanted and never had a warehouse full of crap that we’d need to liquidate at fire-sale prices.

“Allspaw taught us that Dev and Ops working together, along with QA and the business, are a super-tribe that can achieve amazing things. They also knew that until code is in production, no value is actually being generated, because it’s merely WIP stuck in the system. He kept reducing the batch size, enabling fast feature flow. In part, he did this by ensuring environments were always available when they were needed. He automated the build and deployment process, recognizing that infrastructure could be treated as code, just like the application that Development ships. That enabled him to create a one-step environment creation and deploy procedure, just like we figured out a way to do one-step painting and curing.

grasshopper. In order for you to keep up with customer demand, which includes your upstream comrades in Development,” he says, “you need to create what Humble and Farley called a deployment pipeline. That’s your entire value stream from code check-in to production. That’s not an art. That’s production. You need to get everything in version control. Everything. Not just the code, but everything required to build the environment. Then you need to automate the entire environment creation process. You need a deployment pipeline where you can create test and production environments, and then deploy code into them, entirely on-demand. That’s how you reduce your setup times and eliminate errors, so you can finally match whatever rate of change Development sets the tempo at.”
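
Note: a minimal sketch of the deployment pipeline idea described here, where every step from checkout to deploy is a scripted stage kept in version control and run on demand. The stage script names are hypothetical placeholders, not real tooling from the book:

import subprocess

# Hypothetical stage scripts; in a real pipeline these would live in version control
# alongside the application code and the environment definitions.
PIPELINE = [
    ["git", "checkout", "release-candidate"],
    ["./create_environment.sh", "test"],        # environment as code, built on demand
    ["./deploy.sh", "test"],
    ["./run_smoke_tests.sh", "test"],
    ["./create_environment.sh", "production"],
    ["./deploy.sh", "production"],              # the "one-step" deploy
]

def run_pipeline():
    for step in PIPELINE:
        print("running:", " ".join(step))
        subprocess.run(step, check=True)        # fail fast: a broken stage stops everything

if __name__ == "__main__":
    run_pipeline()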

Get humans out of the deployment business. Figure out how to get to ten deploys a day.”

“If you can’t out-experiment and beat your competitors in time to market and agility, you are sunk. Features are always a gamble. If you’re lucky, ten percent will get the desired benefits. So the faster you can get those features to market and test them, the better off you’ll be. Incidentally, you also pay back the business faster for the use of capital, which means the business starts making money faster, too.

Surprisingly, Chris speaks out the most fiercely. “What? Why in the world would we need to do ten deploys a day? Our sprint intervals are three weeks long. We don’t have anything to deploy ten times a day!” Patty shakes her head. “Are you sure? What about bug fixes? What about performance enhancements when the site grinds to a halt, like what’s happened during the last two major launches? Wouldn’t you love to do these types of changes in production routinely, without having to break all the rules to do some sort of emergency change?” Chris thinks for a couple of moments before responding. “Interesting. I would normally call those types of fixes a patch or a minor release. But you’re right—those are deployments, too. It would be great if we could roll out fixes more quickly, but come on, ten deploys a day?”

Countless configurations need to be set correctly, systems need enough memory, all the files need to be put in the right place, and all code and the entire environment need to be operating correctly. Even one small mistake could take everything down. Surely this meant that we needed even more rigor and discipline and planning than in manufacturing.

the release instructions are never up-to-date, so we’re always scrambling, trying to futz with it, having to rewrite the installer scripts and install it over and over again…”

“If we had a common build procedure, and everyone used these tools to create their environments, the developers would actually be writing code in an environment that at least resembles the Production environment. That alone would be a huge improvement.”

Right now, we focus mostly on having deployable code at the end of the project. I propose we change that requirement. At each three-week sprint interval, we not only need to have deployable code but also the exact environment that the code deploys into, and have that checked into version control, too.”

With the objective of doing whatever it takes to deliver effective customer recommendations and promotions, we started with a clean code base that was completely decoupled from the Phoenix behemoth.

Since the entire company could be out of business by that time, the developers and Brent decided to create a completely new database, using open source tools, with data copied from not only Phoenix but also the order entry and inventory management systems. By doing this, we could develop, test, and even run in operations without impacting Phoenix or other business critical applications. And by decoupling ourselves from the other projects, we could make all the changes we needed to without putting other projects at risk.

“For Phoenix, it takes us three or four weeks for new developers to get builds running on their machine, because we’ve never assembled the complete list of the gazillion things you need installed in order for it to compile and run. But now all we have to do is check out the virtual machine that Brent and team built, and they’re all ready to go.”

Because of our rapid progress, we decided to shrink the sprint interval to two weeks. By doing this, we could reduce our planning horizon, to make and execute decisions more frequently, as opposed to sticking to a plan made almost a month ago.

By the end of the meeting, I’m surprised at the unanticipated payoffs of automating our deployment process. The developers can more quickly scale the application, and potentially fewer changes would be required from us.

We’ve put in regular checks to make sure that the developers who have daily access to production only have read-only access, and we’re making good progress on integrating our security tests into the build procedures. I’m pretty confident that any changes that could affect data security or the authentication modules will get caught quickly.”

“But the notion of having to keep up with ten deploys a day?” he continues. “Complete lunacy! But after being forced to automate our security testing, and integrating it into the same process that William uses for his automated QA testing, we’re testing every time a developer commits code.

We had a genuine showstopper when QA discovered that we were recommending items that were out of stock. That would have been disastrous, as customers would excitedly click on the promotion, only to find them listed as “backordered.” Incredibly, Development developed a fix within a day, and it was deployed within an hour.

Maggie quickly agreed, but it still took the developers two hours to change and deploy. Now, this feature can be disabled with a configuration setting, so we can do it in minutes next time around, instead of requiring a full code rollout.
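
Note: a toy sketch of the configuration switch described here, where the feature is wrapped in a runtime flag so it can be turned off in minutes instead of through a full code rollout. The flag and item names are illustrative, not from the book:

FLAGS = {"holiday_promotions": True}   # would normally be loaded from a config store

def promotions_for(customer_id):
    if not FLAGS.get("holiday_promotions", False):
        return []                      # feature disabled by configuration, no deploy needed
    return ["item-123", "item-456"]    # placeholder for the real promotion logic

print(promotions_for("customer-42"))   # feature on
FLAGS["holiday_promotions"] = False    # ops flips the flag during an incident
print(promotions_for("customer-42"))   # feature off within minutes: []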

“The Unicorn team is kicking butt. They’ve moved from doing deployments every two weeks to every week, and we’re now experimenting with doing daily deployments. Because the batch size is so much smaller, we can make small changes very quickly. We’re now doing A/B testing all the time. In short, we’ve never been able to respond to the market this quickly, and I’m sure there are more rabbits we can pull out of this hat.”

I’m especially proud that for an entire month, my group hit our target of spending fifteen percent of our time on preventive infrastructure projects. And it shows.

Erik’s Third Way. We need to create a culture that reinforces the value of taking risks and learning from failure and the need for repetition and practice to create mastery.

I don’t want posters about quality and security. I want improvement of our daily work showing up where it needs to be: in our daily work.

My board holds me responsible for making the best use of company resources to achieve the goals that maximize shareholder value. My primary job is to lead my management team to make that happen.”

He continues, “I need each and every one of my business managers to take calculated risks, without jeopardizing the entire enterprise. People everywhere in the business are using technology, so it’s like the Wild West again—for better or for worse. Businesses that can’t learn to compete in this new world will perish.”

“In ten years, I’m certain every COO worth their salt will have come from IT. Any COO who doesn’t intimately understand the IT systems that actually run the business is just an empty suit, relying on someone else to do their job.”

As if Steve knows what I’m thinking, he says, “You know, when Erik and I first met, many months ago, he said that the relationship between IT and the business is like a dysfunctional marriage—both feel powerless and held hostage by the other. I’ve thought about this for months, and I finally figured something out. A dysfunctional marriage assumes that the business and IT are two separate entities. IT should either be embedded into business operations or into the business. Voilà! There you go. No tension. No marriage, and maybe no IT Department, either.”

“I’ve long believed that to effectively manage IT is not only a critical competency but a significant predictor of company performance,”

don’t be the idiot that fails because he didn’t ask for help.”

‘Messiahs are good, but scripture is better.’”

So much about DevOps is counterintuitive, contrary to common practice, and even controversial. If production deployments are problematic, how on earth can deploying more frequently be a good idea? How can reducing the number of controls actually increase the security of our applications and environments? And can technology really learn anything from manufacturing?

audiobook Beyond the Goal,

I’ve also come across otherwise smart [people] who are of the mistaken belief that if they hold on to a task, something only they know how to do, it’ll ensure job security. These people are knowledge hoarders. This doesn’t work. Everyone is replaceable, no matter how talented they are. Sure, it may take longer at first to find out how to do that special task, but it will happen without them.

And without doubt, as you speculated, our real-life Brent always had the best interests of the organization at heart and was merely a victim of the system.

Whether we are a John, a Brent, a Wes, a Patty, or a Bill, when we’re trapped in a system that prevents us from succeeding, our job becomes thankless, reinforces a feeling of powerlessness, and leaves us feeling that failure is preordained. And worse, technical debt that is not paid down ensures that the system gets worse over time, regardless of how hard we try.

We now know that DevOps principles and patterns are what allow us to turn this downward spiral into a virtuous spiral, through a combination of cultural norms, architecture, and technical practices.

Myth—DevOps is Only for Startups: While DevOps practices have been pioneered by the web-scale, Internet “unicorn” companies such as Google, Amazon, Netflix, and Etsy, each of these organizations has, at some point in their history, risked going out of business because of the problems associated with more traditional “horse” organizations: highly dangerous code releases that were prone to catastrophic failure, inability to release features fast enough to beat the competition, compliance concerns, an inability to scale, high levels of distrust between Development and Operations, and so forth. However, each of these organizations was able to transform their architecture, technical practices, and culture to create the amazing outcomes that we associate with DevOps. As Dr. Branden Williams, an information security executive, quipped, “Let there be no more talk of DevOps unicorns or horses but only thoroughbreds and horses heading to the glue factory.”

Myth—DevOps Replaces Agile: DevOps principles and practices are compatible with Agile, with many observing that DevOps is a logical continuation of the Agile journey that started in 2001. Agile often serves as an effective enabler of DevOps, because of its focus on small teams continually delivering high quality code to customers. Many DevOps practices emerge if we continue to manage our work beyond the goal of “potentially shippable code” at the end of each iteration, extending it to having our code always in a deployable state, with developers checking into trunk daily, and that we demonstrate our features in production-like environments.

Myth—DevOps is Incompatible with Information Security and Compliance: The absence of traditional controls (e.g., segregation of duty, change approval processes, manual security reviews at the end of the project) may dismay information security and compliance professionals. However, that doesn’t mean that DevOps organizations don’t have effective controls. Instead of security and compliance activities only being performed at the end of the project, controls are integrated into every stage of daily work in the software development life cycle, resulting in better quality, security, and compliance outcomes.

Imagine a world where product owners, Development, QA, IT Operations, and Infosec work together, not only to help each other, but also to ensure that the overall organization succeeds. By working toward a common goal, they enable the fast flow of planned work into production (e.g., performing tens, hundreds, or even thousands of code deploys per day), while achieving world-class stability, reliability, availability, and security. In this world, cross-functional teams rigorously test their hypotheses of which features will most delight users and advance the organizational goals. They care not just about implementing user features, but also actively ensure their work flows smoothly and frequently through the entire value stream without causing chaos and disruption to IT Operations or any other internal or external customer. Simultaneously, QA, IT Operations, and Infosec are always working on ways to reduce friction for the team, creating the work systems that enable developers to be more productive and get better outcomes. By adding the expertise of QA, IT Operations, and Infosec into delivery teams and automated self-service tools and platforms, teams are able to use that expertise in their daily work without being dependent on other teams. This enables organizations to create a safe system of work, where small teams are able to quickly and independently develop, test, and deploy code and value quickly, safely, securely, and reliably to customers. This allows organizations to maximize developer productivity, enable organizational learning, create high employee satisfaction, and win in the marketplace.

In our world, Development and IT Operations are adversaries; testing and Infosec activities happen only at the end of a project, too late to correct any problems found; and almost any critical activity requires too much manual effort and too many handoffs, leaving us to always be waiting. Not only does this contribute to extremely long lead times to get anything done, but the quality of our work, especially production deployments, is also problematic and chaotic, resulting in negative impacts to our customers and our business.

As a result, we fall far short of our goals, and the whole organization is dissatisfied with the performance of IT, resulting in budget reductions and frustrated, unhappy employees who feel powerless to change the process and its outcomes.† The solution? We need to change how we work; DevOps shows us the best way forward.

Today, organizations adopting DevOps principles and practices often deploy changes hundreds or even thousands of times per day. In an age where competitive advantage requires fast time to market and relentless experimentation, organizations that are unable to replicate these outcomes are destined to lose in the marketplace to more nimble competitors and could potentially go out of business entirely, much like the manufacturing organizations that did not adopt Lean principles.

In an age where competitive advantage requires fast time to market, high service levels, and relentless experimentation, these organizations are at a significant competitive disadvantage. This is in large part due to their inability to resolve a core, chronic conflict within their technology organization.

In almost every IT organization, there is an inherent conflict between Development and IT Operations which creates a downward spiral, resulting in ever-slower time to market for new products and features, reduced quality, increased outages, and, worst of all, an ever-increasing amount of technical debt. The term “technical debt” was first coined by Ward Cunningham. Analogous to financial debt, technical debt describes how decisions we make lead to problems that get increasingly more difficult to fix over time, continually reducing our available options in the future—even when taken on judiciously, we still incur interest.

Frequently, Development will take responsibility for responding to changes in the market, deploying features and changes into production as quickly as possible. IT Operations will take responsibility for providing customers with IT service that is stable, reliable, and secure, making it difficult or even impossible for anyone to introduce production changes that could jeopardize production. Configured this way, Development and IT Operations have diametrically opposed goals and incentives.

Alarmingly, our most fragile artifacts support either our most important revenue-generating systems or our most critical projects. In other words, the systems most prone to failure are also our most important and are at the epicenter of our most urgent changes. When these changes fail, they jeopardize our most important organizational promises, such as availability to customers, revenue goals, security of customer data, accurate financial reporting, and so forth.

Ideally, small teams of developers independently implement their features, validate their correctness in production-like environments, and have their code deployed into production quickly, safely and securely. Code deployments are routine and predictable. Instead of starting deployments at midnight on Friday and spending all weekend working to complete them, deployments occur throughout the business day when everyone is already in the office and without our customers even noticing—except when they see new features and bug fixes that delight them. And, by deploying code in the middle of the workday, for the first time in decades IT Operations is working during normal business hours like everyone else. By creating fast feedback loops at every step of the process, everyone can immediately see the effects of their actions. Whenever changes are committed into version control, fast automated tests are run in production-like environments, giving continual assurance that the code and environments operate as designed and are always in a secure and deployable state.

Automated testing helps developers discover their mistakes quickly (usually within minutes), which enables faster fixes as well as genuine learning—learning that is impossible when mistakes are discovered six months later during integration testing, when memories and the link between cause and effect have long faded. Instead of accruing technical debt, problems are fixed as they are found, mobilizing the entire organization if needed, because global goals outweigh local goals.

Furthermore, everyone is constantly learning, fostering a hypothesis-driven culture where the scientific method is used to ensure nothing is taken for granted—we do nothing without measuring and treating product development and process improvement as experiments.

The principles of Flow, which accelerate the delivery of work from Development to Operations to our customers; the principles of Feedback, which enable us to create ever safer systems of work; and the principles of Continual Learning and Experimentation, which foster a high-trust culture and a scientific approach to organizational improvement and risk-taking as part of our daily work.

“the sequence of activities an organization undertakes to deliver upon a customer request,” or “the sequence of activities required to design, produce, and deliver a good or service to a customer, including the dual flows of information and material.”

In DevOps, we typically define our technology value stream as the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer.

The input to our process is the formulation of a business objective, concept, idea, or hypothesis, and starts when we accept the work in Development, adding it to our committed backlog of work. From there, Development teams that follow a typical Agile or iterative process will likely transform that idea into user stories and some sort of feature specification, which is then implemented in code into the application or service being built. The code is then checked in to the version control repository, where each change is integrated and tested with the rest of the software system. Because value is created only when our services are running in production, we must ensure that we are not only delivering fast flow, but that our deployments can also be performed without causing chaos and disruptions such as service outages, service impairments, or security or compliance failures.

Instead of large batches of work being processed sequentially through the design/development value stream and then through the test/operations value stream (such as when we have a large batch waterfall process or long-lived feature branches), our goal is to have testing and operations happening simultaneously with design/development, enabling fast flow and high quality. This method succeeds when we work in small batches and build quality into every part of our value stream.

Whereas the lead time clock starts when the request is made and ends when it is fulfilled, the process time clock starts only when we begin work on the customer request—specifically, it omits the time that the work is in queue, waiting to be processed (figure 2).

Figure 2. Lead time vs. process time of a deployment operation

Because lead time is what the customer experiences, we typically focus our process improvement attention there instead of on process time. However, the proportion of process time to lead time serves as an important measure of efficiency—achieving fast flow and short lead times almost always requires reducing the time our work is waiting in queues.
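
A small worked example (with made-up timestamps) makes the distinction concrete: lead time runs from request to fulfillment, process time only from when work actually starts, and the gap between them is queue time.

```python
# Worked example (fabricated timestamps): lead time runs from request to
# fulfillment, process time only from when work actually begins.
from datetime import datetime

requested = datetime(2024, 3, 1, 9, 0)   # customer request created
started   = datetime(2024, 3, 5, 13, 0)  # work actually begins
fulfilled = datetime(2024, 3, 6, 17, 0)  # change deployed / request done

lead_time    = fulfilled - requested       # what the customer experiences
process_time = fulfilled - started         # hands-on working time
queue_time   = started - requested         # time spent waiting

flow_efficiency = process_time / lead_time  # fraction of lead time spent working

print(f"lead time:       {lead_time}")
print(f"process time:    {process_time}")
print(f"queue time:      {queue_time}")
print(f"flow efficiency: {flow_efficiency:.0%}")
```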

When we have long deployment lead times, heroics are required at almost every stage of the value stream. We may discover that nothing works at the end of the project when we merge all the development team’s changes together, resulting in code that no longer builds correctly or passes any of our tests. Fixing each problem requires days or weeks of investigation to determine who broke the code and how it can be fixed, and still results in poor customer outcomes.

In the DevOps ideal, developers receive fast, constant feedback on their work, which enables them to quickly and independently implement, integrate, and validate their code, and have the code deployed into the production environment (either by deploying the code themselves or by others). We achieve this by continually checking small code changes into our version control repository, performing automated and exploratory testing against it, and deploying it into production. This enables us to have a high degree of confidence that our changes will operate as designed in production and that any problems can be quickly detected and corrected. This is most easily achieved when we have architecture that is modular, well encapsulated, and loosely-coupled so that small teams are able to work with high degrees of autonomy, with failures being small and contained, and without causing global disruptions.

By speeding up flow through the technology value stream, we reduce the lead time required to fulfill internal or customer requests, especially the time required to deploy code into the production environment. By doing this, we increase the quality of work as well as our throughput, and boost our ability to out-experiment the competition. The resulting practices include continuous build, integration, test, and deployment processes; creating environments on demand; limiting work in process (WIP); and building systems and organizations that are safe to change.

The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or enable faster detection and recovery. By doing this, we create quality at the source and generate or embed knowledge where it is needed—this allows us to create ever-safer systems of work where problems are found and fixed long before a catastrophic failure occurs. By seeing problems as they occur and swarming them until effective countermeasures are in place, we continually shorten and amplify our feedback loops, a core tenet of virtually all modern process improvement methodologies. This maximizes the opportunities for our organization to learn and improve.

The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures. Furthermore, by continually shortening and amplifying our feedback loops, we create ever-safer systems of work and are better able to take risks and perform experiments that help us learn faster than our competition and win in the marketplace. As part of the Third Way, we also design our system of work so that we can multiply the effects of new knowledge, transforming local discoveries into global improvements. Regardless of where someone performs work, they do so with the cumulative and collective experience of everyone in the organization.

We increase flow by making work visible, by reducing batch sizes and intervals of work, and by building quality in, preventing defects from being passed to downstream work centers.

A significant difference between technology and manufacturing value streams is that our work is invisible. Unlike physical processes, in the technology value stream we cannot easily see where flow is being impeded or when work is piling up in front of constrained work centers. In manufacturing, by contrast, transferring work between work centers is usually highly visible and slow because inventory must be physically moved.

Work is not done when Development completes the implementation of a feature—rather, it is only done when our application is running successfully in production, delivering value to the customer.

However, interrupting technology workers is easy, because the consequences are invisible to almost everyone, even though the negative impact to productivity may be far greater than in manufacturing.

Dominica DeGrandis, one of the leading experts on using kanbans in DevOps value streams, notes that “controlling queue size [WIP] is an extremely powerful management tool, as it is one of the few leading indicators of lead time—with most work items, we don’t know how long it will take until it’s actually completed.”
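
One way to see why queue size is a leading indicator is Little’s Law, which relates average lead time to WIP divided by throughput; the law isn’t named in the passage, but it captures the same relationship. A worked example with assumed numbers:

```python
# Worked example (assumed numbers) of Little's Law: with a stable completion
# rate, the only way to shorten average lead time is to lower WIP.
def expected_lead_time(wip_items: float, throughput_per_day: float) -> float:
    """Average lead time in days for a stable system."""
    return wip_items / throughput_per_day

team_throughput = 5          # work items completed per day (assumed)
for wip in (10, 25, 50):
    days = expected_lead_time(wip, team_throughput)
    print(f"WIP={wip:>2} items -> expected lead time of about {days:.1f} days")
```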

Bad multitasking often occurs when people are assigned to multiple projects, resulting in many prioritization problems.

One of the key lessons in Lean is that in order to shrink lead times and increase quality, we must strive to continually shrink batch sizes. The theoretical lower limit for batch size is single-piece flow, where each operation is performed one unit at a time.

The large batch strategy (i.e., “mass production”) would be to sequentially perform one operation on each of the ten brochures. In other words, we would first fold all ten sheets of paper, then insert each of them into envelopes, then seal all ten envelopes, and then stamp them. On the other hand, in the small batch strategy (i.e., “single-piece flow”), all the steps required to complete each brochure are performed sequentially before starting on the next brochure. In other words, we fold one sheet of paper, insert it into the envelope, seal it, and stamp it—only then do we start the process over with the next sheet of paper.
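
With an assumed ten seconds per operation, the envelope example can be worked through numerically: the small-batch strategy produces its first finished brochure after four operations, while the large-batch strategy produces nothing finished until near the very end.

```python
# Worked example (assumed 10 seconds per operation, 10 brochures, 4 operations):
# compare when the FIRST completed brochure appears under each strategy.
STEP_SECONDS = 10
BROCHURES = 10
OPERATIONS = 4          # fold, insert, seal, stamp

# Large batch ("mass production"): all folding, then all inserting, and so on.
# The first finished brochure appears only during the final (stamping) phase.
first_done_large_batch = (OPERATIONS - 1) * BROCHURES * STEP_SECONDS + STEP_SECONDS

# Small batch ("single-piece flow"): finish one brochure before starting the next.
first_done_small_batch = OPERATIONS * STEP_SECONDS

total_time_either_way = OPERATIONS * BROCHURES * STEP_SECONDS

print(f"first brochure done (large batch): {first_done_large_batch} s")   # 310 s
print(f"first brochure done (small batch): {first_done_small_batch} s")   #  40 s
print(f"total time for all ten (both):     {total_time_either_way} s")    # 400 s
```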

Like in manufacturing, this large batch release creates sudden, high levels of WIP and massive disruptions to all downstream work centers, resulting in poor flow and poor quality outcomes. This validates our common experience that the larger the change going into production, the more difficult the production errors are to diagnose and fix, and the longer they take to remediate.

The equivalent to single-piece flow in the technology value stream is realized with continuous deployment, where each change committed to version control is integrated, tested, and deployed into production.

Each time the work passes from team to team, we require all sorts of communication: requesting, specifying, signaling, coordinating, and often prioritizing, scheduling, deconflicting, testing, and verifying. This may require using different ticketing or project management systems; writing technical specification documents; communicating via meetings, emails, or phone calls; and using file system shares, FTP servers, and Wiki pages. Each of these steps is a potential queue where work will wait when we rely on resources that are shared between different value streams (e.g., centralized operations). The lead times for these requests are often so long that there is constant escalation to have work performed within the needed timelines.

To mitigate these types of problems, we strive to reduce the number of handoffs, either by automating significant portions of the work or by reorganizing teams so they can deliver value to the customer themselves, instead of having to be constantly dependent on others.

“In any value stream, there is always a direction of flow, and there is always one and only one constraint; any improvement not made at that constraint is an illusion.” If we improve a work center that is positioned before the constraint, work will merely pile up at the bottleneck even…

1. Identify the system’s constraint.
2. Decide how to exploit the system’s constraint.
3. Subordinate everything else to the above decisions.
4. Elevate the system’s constraint.
5. If in the previous steps a constraint has been broken, go back to step…

Environment creation: We cannot achieve deployments on demand if we always have to wait weeks or months for production or test environments. The countermeasure is to create environments that are on demand and completely self-serviced, so that they are always available when we need them.

Code deployment: We cannot achieve deployments on demand if each of our production code deployments takes weeks or months to perform (i.e., each deployment requires 1,300 manual, error-prone steps, involving up to three hundred engineers). The countermeasure is to automate our deployments as much as possible, with the goal of being completely automated so they can be done self-service by any developer.

Test setup and run: We cannot achieve deployments on demand if every code deployment requires two weeks to set up our test environments and data sets, and another four weeks to manually execute all our regression tests. The countermeasure is to automate our tests so we can execute deployments safely and to parallelize them so the test rate can keep up with our code development rate.

Overly tight architecture: We cannot achieve deployments on demand if overly tight architecture means that every time…

After all these constraints have been broken, our constraint will likely be Development or the product owners. Because our goal is to enable small teams of developers to independently develop, test, and deploy value to customers quickly and reliably, this is where we want our constraint to be. In high-performing organizations, engineers in Development, QA, Ops, and Infosec alike state that their goal is to help maximize developer productivity. When the constraint is here, we are limited only by the number of good business hypotheses we create and our ability to develop the code necessary to test these hypotheses with real customers. The progression of constraints listed above are generalizations…

Implementing Lean Software Development: From Concept to Cash,

Partially done work: This includes any work in the value stream that has not been completed (e.g., requirement documents or change orders not yet reviewed) and work that is sitting in queue (e.g., waiting for QA review or server admin ticket). Partially done work becomes obsolete and loses value as time progresses.

Extra processes: Any additional work that is being performed in a process that does not add value to the customer. This may include documentation not used in a downstream work center, or reviews or approvals that do not add value to the output. Extra processes add effort and increase lead times.

Extra features: Features built into the service that are not needed by the organization or the customer (e.g., “gold plating”). Extra features add complexity and effort to testing and managing functionality.

Task switching: When people are assigned to multiple projects and value streams, requiring them to context switch and manage dependencies between work, adding additional effort and time into the value stream.

Waiting: Any delays between work requiring resources to wait until they can complete the current work. Delays increase cycle time and prevent the customer from getting value.

Motion: The amount of effort to move information or materials from one work center to another. Motion waste can be created when people who need to communicate frequently are not colocated. Handoffs also create motion waste and often require additional communication to resolve ambiguities.

Defects: Incorrect, missing, or unclear information, materials, or products create waste, as effort is needed to resolve these issues. The longer the time between defect creation and defect detection, the more difficult it is to resolve the defect.

Nonstandard or manual work: Reliance on nonstandard or manual work from others, such as using non-rebuildable servers, test environments, and configurations. Ideally, any dependencies on Operations should be automated, self-serviced, and available on demand.

Heroics: In order for an organization to achieve goals, individuals and teams are put in a position where they must perform unreasonable acts, which may even become a part of their daily work (e.g., nightly 2:00 a.m. problems in production, creating hundreds of work tickets as part of every software release).†

In technology, our work happens almost entirely within complex systems with a high risk of catastrophic consequences. As in manufacturing, we often discover problems only when large failures are underway, such as a massive production outage or a security breach resulting in the theft of customer data.

We make our system of work safer by creating fast, frequent, high quality information flow throughout our value stream and our organization, which includes feedback and feedforward loops. This allows us to detect and remediate problems while they are smaller, cheaper, and easier to fix; avert problems before they cause catastrophe; and create organizational learning that we integrate into future work. When failures and accidents occur, we treat them as opportunities for learning, as opposed to a cause for punishment and blame.

doing the same thing twice will not predictably or necessarily lead to the same result. It is this characteristic that makes static checklists and best practices, while valuable, insufficient to prevent catastrophes from occurring.

Therefore, because failure is inherent and inevitable in complex systems, we must design a safe system of work, whether in manufacturing or technology, where we can perform work without fear, confident that any errors will be detected quickly, long before they cause catastrophic outcomes, such as worker injury, product defects, or negative customer impact.

Complex work is managed so that problems in design and operations are revealed.
Problems are swarmed and solved, resulting in quick construction of new knowledge.
New local knowledge is exploited globally throughout the organization.
Leaders create other leaders who continually grow these types of capabilities.

The Fifth Discipline: The Art & Practice of the Learning Organization

In contrast, our goal is to create fast feedback and feedforward loops wherever work is performed, at all stages of the technology value stream, encompassing Product Management, Development, QA, Infosec, and Operations. This includes the creation of automated build, integration, and test processes, so that we can immediately detect when a change has been introduced that takes us out of a correctly functioning and deployable state. We also create pervasive telemetry so we can see how all our system components are operating in the production environment, so that we can quickly detect when they are not operating as expected. Telemetry also allows us to measure whether we are achieving our intended goals and, ideally, is radiated to the entire value stream so we can see how our actions affect other portions of the system as a whole.
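
“Pervasive telemetry” in practice means every operation reports what it did and how long it took as a side effect of doing the work. A minimal sketch with an in-memory recorder standing in for a real metrics backend; all names are illustrative:

```python
# Minimal telemetry sketch: wrap a unit of work so that latency and outcome
# are always recorded. The in-memory 'metrics' list stands in for a real
# metrics backend (StatsD, Prometheus, etc.).
import time
from contextlib import contextmanager

metrics: list[dict] = []

@contextmanager
def timed(operation: str):
    start = time.monotonic()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        metrics.append({
            "operation": operation,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        })

if __name__ == "__main__":
    with timed("render_promotions"):
        time.sleep(0.05)            # stand-in for real work
    print(metrics)
```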

Feedback is critical because it is what allows us to steer. We must constantly validate between customer needs, our intentions and our implementations. Testing is merely one sort of feedback.”

It prevents the problem from progressing downstream, where the cost and effort to repair it increases exponentially and technical debt is allowed to accumulate. It prevents the work center from starting new work, which will likely introduce new errors into the system. If the problem is not addressed, the work center could potentially have the same problem in the next operation (e.g., fifty-five seconds later), requiring more fixes and work.

It is only through the swarming of ever smaller problems discovered ever earlier in the life cycle that we can deflect problems before a catastrophe occurs. In other words, when the nuclear reactor melts down, it is already too late to avert the worst outcomes.

To enable fast feedback in the technology value stream, we must create the equivalent of an Andon cord and the related swarming response. This requires that we also create the culture that makes it safe, and even encouraged, to pull the Andon cord when something goes wrong, whether it is when a production incident occurs or when errors occur earlier in the value stream, such as when someone introduces a change that breaks our continuous build or test processes. When conditions trigger an Andon cord pull, we swarm to solve the problem and prevent the introduction of new work until the issue has been resolved.† This provides fast feedback for everyone in the value stream (especially the person who caused the system to fail), enables us to quickly isolate and diagnose the problem, and prevents further complicating factors that can obscure cause and effect.

Preventing the introduction of new work enables continuous integration and deployment, which is single-piece flow in the technology value stream. All changes that pass our continuous build and integration tests are deployed into production, and any changes that cause any tests to fail trigger our Andon cord and are swarmed until resolved.
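
In a deployment pipeline, the Andon cord amounts to: the moment any gate fails, stop accepting new changes until the failure is resolved. A minimal sketch, with an illustrative queue of changes and a single stand-in gate:

```python
# Minimal "Andon cord" sketch for a pipeline: a failed gate halts intake of
# new changes until the failure is cleared. Gates and change names are
# illustrative stand-ins for real build/test stages.
from collections import deque

def run_gates(change: str, gates: list) -> bool:
    """Run every gate against a change; any failure pulls the cord."""
    return all(gate(change) for gate in gates)

def process(queue: deque, gates: list) -> None:
    while queue:
        change = queue[0]                  # peek at the next change
        if run_gates(change, gates):
            queue.popleft()
            print(f"{change}: deployed")
        else:
            # Cord pulled: stop taking new work; swarm on this change first.
            print(f"{change}: gate failed, pipeline halted until fixed")
            break

if __name__ == "__main__":
    passing = lambda change: "broken" not in change   # stand-in build/test gate
    changes = deque(["change-101", "change-102-broken", "change-103"])
    process(changes, [passing])
    print("still queued:", list(changes))  # 102 and 103 wait until 102 is fixed
```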

Examples of ineffective quality controls include:

Requiring another team to complete tedious, error-prone, and manual tasks that could be easily automated and run as needed by the team who needs the work performed

Requiring approvals from busy people who are distant from the work, forcing them to make decisions without an adequate knowledge of the work or the potential implications, or to merely rubber stamp their approvals

Creating large volumes of documentation of questionable detail which become obsolete shortly after they are written

Pushing large batches of work to teams and special committees for approval and processing and then waiting for responses

Instead of developers needing to request or schedule a test to be run, these tests can be performed on demand, enabling developers to quickly test their own code and even deploy those changes into production themselves.

Having developers share responsibility for the quality of the systems they build not only improves outcomes but also accelerates learning. This is especially important for developers as they are typically the team that is furthest removed from the customer. As Gary Gruver observes, “It’s impossible for a developer to learn anything when someone yells at them for something they broke six months ago—that’s why we need to provide feedback to everyone as quickly as possible, in minutes, not months.”

Lean defines two types of customers that we must design for: the external customer (who most likely pays for the service we are delivering) and the internal customer (who receives and processes the work immediately after us). According to Lean, our most important customer is our next step downstream. Optimizing our work for them requires that we have empathy for their problems in order to better identify the design problems that prevent fast and smooth flow. In the technology value stream, we optimize for downstream work centers by designing for operations, where operational non-functional requirements (e.g., architecture, performance, stability, testability, configurability, and security) are prioritized as highly as user features. By doing this, we create quality at the source, likely resulting in a set of codified non-functional requirements that we can proactively integrate into every service we build.

In these environments, there is also often a culture of fear and low trust, where workers who make mistakes are punished, and those who make suggestions or point out problems are viewed as whistle-blowers and troublemakers. When this occurs, leadership is actively suppressing, even punishing, learning and improvement, perpetuating quality and safety problems. In contrast, high-performing manufacturing operations require and actively promote learning—instead of work being rigidly defined, the system of work is dynamic, with line workers performing experiments in their daily work to generate new improvements, enabled by rigorous standardization of work procedures and documentation of the results.

In the technology value stream, our goal is to create a high-trust culture, reinforcing that we are all lifelong learners who must take risks in our daily work. By applying a scientific approach to both process improvement and product development, we learn from our successes and failures, identifying which ideas don’t work and reinforcing those that do. Moreover, any local learnings are rapidly turned into global improvements, so that new techniques and practices can be used by the entire organization. We reserve time for the improvement of daily work and to further accelerate and ensure learning. We consistently introduce stress into our systems to force continual improvement. We even simulate and inject failures in our production services under controlled conditions to increase our resilience. By creating this continual and dynamic system of learning, we enable teams to rapidly and automatically adapt to an ever-changing environment, which ultimately helps us win in the marketplace.

“Responses to incidents and accidents that are seen as unjust can impede safety investigations, promote fear rather than mindfulness in people who do safety-critical work, make organizations more bureaucratic rather than more careful, and cultivate professional secrecy, evasion, and self-protection.” These issues are especially problematic in the technology value stream—our work is almost always performed within a complex system, and how management chooses to react to failures and accidents leads to a culture of fear, which then makes it unlikely that problems and failure signals are ever reported. The result is that problems remain hidden until a catastrophe occurs.

three types of culture:

Pathological organizations are characterized by large amounts of fear and threat. People often hoard information, withhold it for political reasons, or distort it to make themselves look better. Failure is often hidden.

Bureaucratic organizations are characterized by rules and processes, often to help individual departments maintain their “turf.” Failure is processed through a system of judgment, resulting in either punishment or justice and mercy.

Generative organizations are characterized by actively seeking and sharing information to better enable the organization to achieve its mission. Responsibilities are shared throughout the value stream, and failure results in reflection and genuine inquiry.

In the technology value stream, we establish the foundations of a generative culture by striving to create a safe system of work. When accidents and failures occur, instead of looking for human error, we look for how we can redesign the system to prevent the accident from happening again. For instance, we may conduct a blameless post-mortem after every incident to gain the best understanding of how the accident occurred and agree upon what the best countermeasures are to improve the system, ideally preventing the problem from occurring again and enabling faster detection and recovery.

“By removing blame, you remove fear; by removing fear, you enable honesty; and honesty enables prevention.”

Teams are often not able or not willing to improve the processes they operate within. The result is not only that they continue to suffer from their current problems, but their suffering also grows worse over time. Mike Rother observed in Toyota Kata that in the absence of improvements, processes don’t stay the same—due to chaos and entropy, processes actually degrade over time.

“Even more important than daily work is the improvement of daily work.”

We improve daily work by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of our code and environments—we do this by reserving cycles in each development interval, or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want.

The result of these practices is that everyone finds and fixes problems in their area of control, all the time, as part of their daily work. When we finally fix the daily problems that we’ve worked around for months (or years), we can eradicate from our system the less obvious problems. By detecting and responding to these ever-weaker failure signals, we fix problems when it is not only easier and cheaper but also when the consequences are smaller.

Similarly, in the technology value stream, as we make our system of work safer, we find and fix problems from ever weaker failure signals. For example, we may initially perform blameless post-mortems only for customer-impacting incidents. Over time, we may perform them for lesser team-impacting incidents and near misses as well.

When new learnings are discovered locally, there must also be some mechanism to enable the rest of the organization to use and benefit from that knowledge. In other words, when teams or individuals have experiences that create expertise, our goal is to convert that tacit knowledge (i.e., knowledge that is difficult to transfer to another person by means of writing it down or verbalizing) into explicit, codified knowledge, which becomes someone else’s expertise through practice.

In the technology value stream, we must create similar mechanisms to create global knowledge, such as making all our blameless post-mortem reports searchable by teams trying to solve similar problems, and by creating shared source code repositories that span the entire organization, where shared code, libraries, and configurations that embody the best collective knowledge of the entire organization can be easily utilized. All these mechanisms help convert individual expertise into artifacts that the rest of the organization can use.

By relentless and constant experimentation in their daily work, they were able to continually increase capacity, often without adding any new equipment or hiring more people. The emergent pattern that results from these types of improvement rituals not only improves performance but also improves resilience, because the organization is always in a state of tension and change. This process of applying stress to increase resilience was named antifragility by author and risk analyst Nassim Nicholas Taleb.

In the technology value stream, we can introduce the same type of tension into our systems by seeking to always reduce deployment lead times, increase test coverage, decrease test execution times, and even by re-architecting if necessary to increase developer productivity or increase reliability.

we explicitly state the problem we are seeking to solve, our hypothesis of how our proposed countermeasure will solve it, our methods for testing that hypothesis, our interpretation of the results, and our use of learnings to inform the next iteration.

The leader helps coach the person conducting the experiment with questions that may include: What was your last step and what happened? What did you learn? What is your condition now? What is your next target condition? What obstacle are you working on now? What is your next step? What is your expected outcome? When can we check?

In the technology value stream, this scientific approach and iterative method guides not only all of our internal improvement processes but also how we perform experiments to ensure that the products we build actually help our internal and external customers achieve their goals.