Kanban – a success story and a humbling experience

November 27, 2011

A success story

In 2008, I became manager of an outsourced team that did all the support, configuration and development on SAP, intranet and extranet applications for a packaging materials manufacturing company in Brazil (subsidiary of a British company), with sales around 1 billion USD/year.

My team varied in size responding to demand, but it typically contained between 18 and 25 people (on average, we tallied 3200 person-hours/month during 2009), and one typical configuration would be 5 MM analysts, 3 SD analysts, 1 FI/CO analyst, 4 ABAP programmers, 3 .Net programmers, 1 Basis consultant, 3 specialists on specific applications (HR/Payroll, MES and international commerce) and myself. The team was responsible for both support and changes – during 2009, the average was 1400 person-hours/month on support and 1800 person-hours/month on changes.

The client had a small internal IT team that acted as liaison between the internal clients and the outsourced team, did quality assurance on the outsourcing company’s work, and kept the knowledge in house to minimize vendor lock-in. The outsourced team did have direct access to the end users, but the internal analysts always had to be involved in the conversations and in the decision-making process.

When the SAP ERP (v. 4.6) was implemented in 2005, new IT governance procedures were put in place. Nothing could be deployed to the production environment without a documented change request, and the CRs became the standard unit for conversation with the internal clients, deployment, work order for the team, tracking and reporting. A CR was meant to be a “minimal marketable feature” or “business value increment”, a unit that should be deployed as a whole to add value to the user. It could involve more than one system, such as the e-procurement system, the middleware (SAP PI) and SAP. The median size of a CR was 40 engineering hours, and the size distribution is given by the chart below.

Each internal analyst was responsible for one business process (order to cash, procure to pay, finance, HR, manufacturing), and each had a monthly meeting with their internal clients – the procure to pay analyst would meet with the procurement director, the logistics director, the subordinates they indicated according to the hot topics of the month, and occasionally other directors who would be affected by the CRs being discussed. A typical meeting would go as follows:

  • Review a long backlog of CRs (several dozen), several of which would already have initial effort estimation.
  • Analyze the current workload of the SAP analysts of the specific module, and prioritize enough CRs to fill the available time of those analysts in the following month. For instance, if there were 3 MM analysts in the team (corresponding to 12 person-weeks until the next meeting), and the CRs they were currently working on would take 4 more person-weeks, the directors would select CRs summing up to slightly more than 8 person-weeks of estimated effort (usually between 8 and 16).
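The replenishment arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the 4-week interval is inferred from "3 analysts = 12 person-weeks", and the CR names and estimates in the backlog are hypothetical, not data from the team.

```python
from typing import List, Tuple


def free_capacity(analysts: int, weeks_until_next_meeting: int,
                  committed_person_weeks: float) -> float:
    """Person-weeks still available once work in progress is accounted for."""
    return analysts * weeks_until_next_meeting - committed_person_weeks


def fill_month(prioritized_backlog: List[Tuple[str, float]],
               capacity: float) -> Tuple[List[str], float]:
    """Take CRs in priority order until the estimated effort covers the capacity.

    The last CR taken may push the total slightly over the capacity,
    which mirrors "slightly more than 8 person-weeks" in the text.
    """
    selected: List[str] = []
    total = 0.0
    for name, estimate in prioritized_backlog:
        if total >= capacity:
            break
        selected.append(name)
        total += estimate
    return selected, total


# 3 MM analysts, 4 weeks until the next meeting, 4 person-weeks already committed.
capacity = free_capacity(analysts=3, weeks_until_next_meeting=4,
                         committed_person_weeks=4)

# Hypothetical backlog, already ordered by the directors' priority
# (name, estimated effort in person-weeks).
backlog = [("CR-A", 3.0), ("CR-B", 2.5), ("CR-C", 2.0), ("CR-D", 1.0), ("CR-E", 4.0)]
selected, total = fill_month(backlog, capacity)
print(capacity, selected, total)
```

Note that this is exactly the push-style scheduling that the rest of the story ends up replacing: the queue is filled once a month according to estimated capacity, regardless of how much work is actually flowing out.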

For each new CR, the first thing the analyst (SAP analyst and/or one of the satellite application analysts) would have to do was write a scope statement document. It was usually a one to five page document describing in general terms what would be done, the business rules, and the detailed effort estimate. The configuration/development could only begin when the internal analyst (and, for a few critical months, the CIO himself) approved the scope statement. That was done for a few reasons: it surfaced misunderstandings before too much money was spent; it gave the IT department an estimate of how much that change was going to cost, so they could challenge the business about the expected benefits if it smelled of waste; and, last but not least, it helped avoid unnecessary work. When working with SAP, the mindset is that you only do software development when there’s no reasonable way to do what you need without using code. Ideally, everything should be done using the standard SAP transactions and configuration (that’s why the programmers were less than one third of the team). When writing the scope document, the SAP analyst would check whether the need could be fulfilled without ABAP code, and, if not, identify which gaps needed custom code.

Our team had software in which we tracked the engineering time and the progress on each CR or support ticket, so we had very precise historical data and had managed to calibrate our estimates so that every month the deviation between our estimated and actual effort would fall between -5% and +10%, without requiring sophisticated and time-consuming estimation techniques (we had become what Doug Hubbard refers to as “calibrated experts”).


Sometime in the first semester of 2009, in the Scrum Yahoo! Group (in which I was basically trolling around since I could see no possible fit whatsoever between Scrum and my context with highly specialized resources, multiple customers, and continuous one-click deployment), I read some posts about this new method called Kanban, which pointed to introductory papers and the kanbandev group. That made sense to me and it sounded like it could work in my context, so around July I called my team to a quick stand-up meeting (we were all co-located in an open office) where I explained the basics of Kanban and proposed that we map our process and WIP on a board and start visualizing it to see what it would tell us. The investment was near 0 (two flipchart paper sheets, some post-it notes, and a couple of hours to set up an account on AgileZen and configure our process), and, at the very least, we would improve the visualization of our work, so everybody agreed that it wouldn’t hurt.

At first, the columns in our board were:

  • Estimating: an analyst was writing the scope statement
  • Estimated: the scope statement was done and waiting for approval
  • Specification/Configuration: the analyst was configuring SAP or writing a specification document for an ABAP development
  • ABAP development
  • Internal test: the analyst was validating the ABAP + configuration on the DEV environment
  • Validation: the user was validating the CR on the QA environment
  • Done: the CR was deployed in production

We used different colors to represent the different internal analysts, and we didn’t have classes of service (that concept wasn’t even well established in the Kanban literature in those pre-book days).
From experience, I knew this wasn’t the best description we could make of our process, but that was the way the SAP analysts thought the process worked, so that’s how they described it – we specify, they develop, we validate.

We hadn’t explicitly described our policies, but everybody seemed to agree on what each column meant, and how things should flow on the board. Very soon, several things started to emerge, even before we thought about limiting WIP or making any other significant change:

  • Several items were in the “Estimated” and “Validation” states.
  • Several items were in the “Specification” state, but, when I asked the analyst, she would say the spec was done and she was waiting for a specific programmer to be free (with luck the analysts would accept any programmer, but they tended to have their favorites).
  • Several items were in the “ABAP development” state, but, when asked, the programmer would say he was done and the analyst wasn’t available to test it.
  • The items circulated a lot between “Specification”, “ABAP development” and “Internal test”.
  • Several items were in a state such as “Specification”, but no one was working on them because they needed something like an answer from the client, or something from the infrastructure team.

Based on these observations, I proposed that the team redesign our board, replacing the columns “Specification”, “ABAP development” and “Internal test” with “In progress” and “In progress – ABAP”, meaning, respectively, that the item was being worked on by an analyst without any ABAP programmer, and that the item was being worked on by an ABAP programmer, possibly with some analyst. Whenever an analyst was blocked on an item “In progress” because she needed an ABAP programmer, she would signal it with a green magnet (which we represented as a “done” marker in AgileZen), and the first ABAP programmer that became free would check for these green signs and pair with the analyst. They would typically pair for some hours, in which the analyst would explain the spec and do some pair programming until the main skeleton of the program was ready, and then the ABAP developer would continue working alone, filling in the details of the code. When the programmer was done or needed the analyst for some clarification, he would try to get her help immediately. If she didn’t happen to be available, he would mark the item with a green magnet on the board. When the analyst considered the item to be ready for user acceptance, she would request the development/configuration to be transported to the QA environment, and then move the item to the “Validation” state.

We also decided to mark any item that was depending on someone external to the team – except in the states “Estimated” and “Validation”, which were, by definition, the responsibility of someone else – as blocked, signaled with a red magnet on the board, and with a “blocked” mark on AgileZen.
We showed our board to the client’s IT team, which not only loved the new level of visibility it gave them, but rushed to approve the “Estimated” items.

After one month, we had delivered 18 CRs, but our WIP had gone from 43 CRs to 60! In other words, 35 CRs had entered the system, but only about half of them had flowed through. I considered we had enough information to take the next step, and I scheduled a meeting with the CIO for September 11, 2009.
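The "35 CRs had entered" figure follows from a simple flow balance: whatever entered the system either left it (was delivered) or is still inside (the growth in WIP). A minimal sketch, using only the numbers quoted above (the function name is just for illustration):

```python
def arrivals(wip_start: int, wip_end: int, delivered: int) -> int:
    """CRs that entered the system during the period.

    Flow balance: arrivals = delivered + (WIP at end - WIP at start).
    """
    return delivered + (wip_end - wip_start)


entered = arrivals(wip_start=43, wip_end=60, delivered=18)
share_delivered = 18 / entered

print(entered)                    # CRs that entered during the month
print(round(share_delivered, 2))  # deliveries as a fraction of arrivals
```

As long as deliveries stay below arrivals, WIP keeps growing, which is what the board was showing.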

Since it was a manufacturing company, I knew I could count on a general knowledge of Lean (the Lean manufacturing guys even had a Lean office initiative, but being myopic to all things not manufacturing, they had concluded that the only Lean tool that was applicable to the office was 5S), so I would base my rhetoric on this, but I needed some high-impact data to show the problem on a deep emotional level.

Fortunately, I had several months’ worth of valuable data, and I came up with the metric which I considered would be closest to the CIO’s heart – the price he paid (represented by the number of person-hours worked on CRs) versus the value he got (represented by the number of person-hours worked on CRs that actually got deployed to production). To avoid any misinterpretation, I decided to present both the estimated ideal hours and the actual worked hours, to underline that it was not a matter of poor estimation, but of poor process efficiency. The actual graph I presented was this:

where the blue line is the number of person-hours worked in each month, the pink line is the sum of the estimated person-hours of the CRs that got deployed in that month, and the yellow line is the sum of the actual person-hours spent on the same CRs. The outliers in April and May correspond to a project in which several CRs were deployed en masse, and are related to the peak of hours spent in March. That anomaly aside, every other month had efficiencies below 40%!
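For readers who want to compute the same metric from their own time-tracking data, here is a minimal sketch. It assumes that "efficiency" means the actual hours spent on the CRs deployed in a month divided by all the hours worked on CRs in that month (yellow over blue), which is my reading of the chart description. The month and its numbers are hypothetical, not the real data behind the graph.

```python
from dataclasses import dataclass


@dataclass
class MonthlyHours:
    name: str
    worked_hours: float              # blue line: hours worked on CRs in the month
    deployed_estimated_hours: float  # pink line: estimates of CRs deployed in the month
    deployed_actual_hours: float     # yellow line: actual hours of those same CRs


def efficiency(month: MonthlyHours) -> float:
    """Share of the hours paid for that ended up in deployed CRs."""
    if month.worked_hours <= 0:
        raise ValueError("worked_hours must be positive")
    return month.deployed_actual_hours / month.worked_hours


def estimate_deviation(month: MonthlyHours) -> float:
    """Relative deviation of actual versus estimated effort on deployed CRs."""
    return ((month.deployed_actual_hours - month.deployed_estimated_hours)
            / month.deployed_estimated_hours)


# Hypothetical month, for illustration only.
sample = MonthlyHours("sample month", worked_hours=1800,
                      deployed_estimated_hours=600, deployed_actual_hours=630)

print(f"{efficiency(sample):.0%}")          # low: most hours paid for are still in WIP
print(f"{estimate_deviation(sample):+.0%}") # small: estimation is not the problem
```

Showing the two numbers side by side makes the same point as the pink and yellow lines: when the estimate deviation is small but the efficiency is low, the problem is flow, not estimation.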

Then I moved on to explain the causes:

  • Work being pushed in big batches regardless of the WIP
  • Items waiting too much time on validation
  • Several blocked items

And then I proposed the solution: limit WIP (we decided to do it at the system level – CONWIP), start pulling work instead of pushing it, and attack the validation bottleneck. The validation bottleneck was an especially sensitive point, because, up to that point, the mindset of both the outsourced team and the internal IT team was that, once a CR was deployed to the QA environment, IT’s job was done, and it was now someone else’s responsibility. So much so that they hardly ever mentioned those CRs in the monthly meetings with the business. Hence, I had to recommend explicitly that validation should be a joint responsibility of all the parties (team, IT and business) if we wanted to improve the efficiency.

The CIO gave me permission to present a somewhat edited version of the presentation to some key internal clients, such as the logistics, sales and procurement directors (all of whom had Lean knowledge; two were Six Sigma green belts and one was a black belt). They got it right away, and were sincerely surprised by some of the data presented. They all bought into the validation taskforce, and the change to the queue replenishment policy in the monthly meetings.

Then a curious thing happened. As per my conversations with the clients, I started following up with all the requesting users (their subordinates) on the validation of all the CRs in the “Validation” state. As it turned out, more than half of the CRs hadn’t been validated so far because either the user had not been told that the CR was ready, or hadn’t received a basic explanation of the solution so he could validate it, or the QA environment lacked some pre-requisite, such as enough test data, or the user didn’t have enough access. It became clear that, since the team considered its work done as soon as the technical solution was deployed to the QA environment, but the user could not validate the solution unless he was properly informed and all the pre-requisites were in place, there was a limbo in our process that hadn’t even surfaced up to that moment. We then defined the explicit policy that an item should be moved to “Validation” only when, if the user were available to validate it, he would be able to do it immediately. If, at any later moment, that condition ceased to be true (for instance, the QA environment went down), the item should be marked as blocked.

The following weeks saw the team, the internal IT and the users rushing to validate CRs, and the results were dramatic. Whereas we had delivered 20 CRs the month before, we delivered 33 CRs between September 11 and October 11, without increasing the WIP, as shown by the CFDs below (the second one is WIP-only).

After the first month, we decided there were additional measures we could implement, to improve our flow:

  • Using Little’s law, we predicted that, if we could keep our throughput and WIP at September’s level, we would reduce the average lead time from 11 weeks to 6 weeks. On the other hand, we knew that September’s throughput was somewhat artificially high, since we had started with many quick wins in the “Validation” state. So to keep our lead times low even with a slightly lower throughput, we decided to gradually reduce our WIP to 45 by the end of the year.
  • Using the Lean manufacturing body of knowledge, we knew that we needed to keep the bottleneck at our most expensive and most inelastic resource (i.e. the one to which it is most difficult to add extra capacity). That would be the SAP analyst. We, the contractor, had a pool of ABAP programmers, so we could add an additional programmer with 1-week lead time, and for periods as short as 3 weeks. The SAP analysts, on the other hand, took on average 3 weeks to start (we usually had to bring them from other cities or hire new analysts), and had a much slower learning curve until they understood all the specific configurations of the client’s SAP, so they usually stayed for at least 3 months to pay it back. Until then, every addition to the team needed to be requested by the internal IT analyst and validated by the CIO, and was based purely on feeling, or on the imperfect scheduling mechanism of the monthly business meeting. The CIO decided to change that policy, and I was given permission to add as many programmers to the team as needed to keep them from becoming a bottleneck, as long as I communicated it to him, and removed the programmers as soon as possible.
  • Additionally, since we implemented the policy of using the first ABAP programmer available, and we had a fluctuating population of temporary and sometimes inexperienced programmers, I was given permission to use the most experienced programmer on the team as a wildcard. He wouldn’t be allocated to any specific CR, but he would promiscuously pair with all the other programmers, help the analysts on estimates, and do peer reviews.
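The Little’s law reasoning in the first item above (average lead time = WIP / throughput) can be sketched as follows. The WIP of 60, the target WIP of 45 and the throughput of 33 CRs/month are figures quoted in this post; the conversion of 52/12 weeks per month and the throughput of 25 CRs/month are assumptions for illustration. The exact inputs behind the 11 and 6 weeks quoted above are not given, so the outputs here need not match them.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption used to convert months to weeks


def lead_time_weeks(wip: float, throughput_per_month: float) -> float:
    """Little's law: average lead time = WIP / throughput (converted to weeks)."""
    if throughput_per_month <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput_per_month * WEEKS_PER_MONTH


def wip_for_target(lead_time_target_weeks: float, throughput_per_month: float) -> float:
    """Little's law rearranged: the WIP that yields a target average lead time."""
    return throughput_per_month * lead_time_target_weeks / WEEKS_PER_MONTH


# WIP and throughput figures quoted in the post.
at_current_wip = lead_time_weeks(wip=60, throughput_per_month=33)
at_reduced_wip = lead_time_weeks(wip=45, throughput_per_month=33)

# Hypothetical lower throughput, once the quick wins are gone.
at_lower_throughput = lead_time_weeks(wip=45, throughput_per_month=25)

print(round(at_current_wip, 1), round(at_reduced_wip, 1), round(at_lower_throughput, 1))
```

The design choice behind the WIP reduction is visible in the formula: when throughput is expected to drop, the only lever left to keep the average lead time down is the WIP limit.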

For the remainder of the year, we operated under these new policies, and we managed to sustain a much steadier flow, dramatically increase efficiency, and reduce WIP, as the charts below show.

A humbling experience

Then, in December 2009, I received a job offer. The company was a consumer packaged goods company in Brazil (subsidiary of a French company), with around 1 billion USD/year of sales. They had implemented SAP 4.6 four years previously, and had several satellite systems. The new CIO had just arrived, and he had diagnosed that there were issues in the delivery process, and that a new professional should be hired, with the title of PMO manager, but whose scope of work would cover not only projects, but the IT processes and delivery in general. The core IT team consisted, besides the infrastructure team, of 5 SAP analysts (MM/WM, FI/CO and SD), 2 ABAP programmers, 4 satellite system analysts, and contractors that were hired either for turnkey projects or as time-and-material additional staff.

Armed with all the self-confidence the previous semester had given me, and the apparently almost perfect similarity between the two contexts, I thought to myself: “Piece of cake!” A couple of interviews and the recommendation of my previous client’s CIO later, the job was mine. In the first few weeks, I lectured my boss and peer IT managers on pull systems, cost of delay and WIP limits, and how all our problems could be solved in as little as one semester using the magic of Kanban. At this point, I already had an early beta version of David’s book, and I used the Microsoft XIT case as the core of my sales pitch.

Here I am, 22 months later, and a good deal wiser, but without having anything close to the spectacular results I saw in my previous Kanban implementation. While in the other implementation we very quickly managed to reach all the 5 levels of a Kanban implementation, namely:
1. Visualize the work
2. Limit WIP
3. Manage the flow
4. Make your policies explicit
5. Improve your process using models & flow theory

now we’re still struggling with level 1.

Very naturally, despite all our knowledge of Deming, my initial reaction was to blame the people, which in this case meant none other than myself. Of course, I was a fraud, and my initial success could only be attributed to pure luck. Luckily enough, I know better than that, and, even more luckily, not having experienced how much better we could have become, my company still considers that I’m doing a reasonably good job, so I can still do a deeper analysis of the situation instead of spending my time looking for a new position.

Looking carefully, the apparent similarity of the contexts is only skin-deep. I’ve managed to identify several relevant differences; some of them I already know how to treat, others I’m still in the process of discovering. Anyway, I think exposing these roadblocks may be valuable for other teams trying to start a Kanban implementation.

1 – IT governance

In my first implementation (from now on referred to as company A), the concept of change request was well established, the route and policy of CR submission was known and accepted by the internal customers, and this policy was enforced in the contract with the team (which was actually a legal contract between two companies, not merely an implicit contract between two different departments in a single company). In my current company (company B), there’s no standard procedure for requests to come to the IT department; prioritization is often achieved via political escalation, and sometimes using shortcuts based on personal informal relationships with members of the team (under the radar).
While company A was mostly a manufacturing and logistics company that served a few big customers and had few departments, mostly headed by engineers preoccupied with operational excellence, company B has 4 independent divisions, is focused on marketing, and sees operations and processes as a nuisance, a necessary evil between what’s really valued – innovation and marketing – and the actual delivery of products to consumers. So it’s very common for requests to come at a very late time with comments like “I didn’t realize IT should be involved in the implementation of the new sales channel of having the products parachuted to the consumer’s house anytime she mentions the product name on twitter; now we have already promised the president it will be operational next week”.

2 – Batch size

While the item size in company A was very consistent and reasonably small, in company B the dynamics of the system create two attractors for item sizes – either big projects or very small requests that can fly under the radar. There are several factors that contribute to this:

  • The satellite system team doesn’t do any internal development, and, except for one application (SFA), doesn’t have an outsourcing contract designed for small batches. The consequence is that the demand is batched into chunks big enough to be contracted as a turnkey project.
  • The SFA vendor, even though they pay lip service to using Scrum and/or Kanban in most of their teams, has proved to be unable to manage small batch flow. We have done 97 CRs with them in the last 12 months, which we had planned to deliver in 7 releases (named waves 1 to 7), one a month over 7 months. They delivered waves 1, 2 and 3 in month 6 and waves 4, 5, 6 and 7 in month 12. Although I insisted that they report to us weekly the status of each CR in their workflow, in every single report all the CRs of the same wave were in the same state. We have been trying to push them to adopt a steady-flow, limited-WIP process, and we’ve recommended that they contact one of the very good Kanban consultants we have in Brazil, but that’s the limit of what we can do with our current contract and leverage.
  • There are several group initiatives led by our competence center in France, such as the SAP version upgrade, BI platform replacement, segregation of duties implementation, etc. They have to roll out these programs to all 19 platforms using mixed teams, with central and local resources. It would be practically unmanageable to do that without planning each individual roll-out as a big project and limiting the number of simultaneous projects at group level, which constrains the scheduling for each subsidiary even more. We could still do Kanban within the projects instead of at the portfolio level, but our problem is not at the project level (at least in the last two years we’ve had a good success ratio at this level, which is one of the reasons I’m still employed); the big improvements would be at the portfolio level. Besides, many of these projects are not good fits for Kanban (many dependent tasks on a long critical path, instead of several independent items going through short workflows).
  • Our purchasing department and our IT department have lots of experience in hiring time-and-material or turnkey projects, but little maturity in hiring outsourcing, small-batch, SLA-based contracts.
  • Small batches would be classified as operational expenses, whereas big projects are classified as capital expenses. While opex is strictly controlled by the CFO at country and group level, capex is loosely controlled by the CFO, and the CIO, at both country and group level, has much more autonomy over it.
  • While the availability of key users to validate small changes is very hard to get, big projects may afford hiring temporary replacements and getting full-time availability of key users.
  • Since the prioritization is mostly based on political capital, big projects have the upper hand on reaching the ears and hearts of the executive committee.

All of that combines into a scheduling mechanism where the big projects are prioritized, and only some slots are reserved for treating medium-sized change requests, and even these slots often get eaten up as buffer by large projects. After the medium-sized requests starve for a long time, their cost of delay starts to kick in, and they rise enough on the food chain, to the point where someone on the executive committee changes them to expedite requests, which leads us to the next issue.

3 – Too many urgent requests

Besides the medium-sized items that starved so long that they became urgent, and the demand that’s hidden from IT until it’s too late, there’s a special cause of expedite requests – fixed-date legal requirements. People outside Brazil (or maybe India) wouldn’t see it as such a big deal, but here we have 27 states, and every pair of them has different tax rules for each category of product, and they change them several times during the fiscal year. To make matters worse, the Brazilian government implemented a really cool integrated system in which no truck can leave your company’s door without you sending an XML file to them with all the logistics and invoice information of the load, and getting back the approval XML from them, so, when rules change, either you implement the change by that date, or you cannot deliver anything to your customers.

Of course, this is nothing particular to company A or company B, but, when you have IT governance issues, and the tax and legal departments get used to having their requests expedited, they tend not to be especially pressed to give IT the longest notice possible on upcoming legal changes. At company A, this was solved by having a dedicated IT analyst working full-time with the tax department on anticipating the demand, and managing the analysis and prioritization of the resulting CRs.

What to do about it

When I was hired, the CIO had already started an internal development, on Microsoft SharePoint, to start registering the change requests. I’d have preferred to start our board in a simpler and more malleable way until our process got stable. However, since I considered the contexts similar enough to think I already knew the final answer, and since we had suffered from having the CRs on individual Excel spreadsheets at company A, I thought it would do no harm to have the Kanban board implemented on this tool (other than the physical board). So now, in order to put an item on the board, there are several fields that must be filled in and approved. While I do think we’d reach this point with additional maturity, in hindsight I’d strongly advise against doing it at the outset. Following the TPS advice on autonomation, keep it as simple as possible in the beginning, and automate when your practices become more stable.

Anyway, this isn’t a mortal sin, and we’ll probably relax the required fields just a bit to motivate people to keep the system up-to-date, and make the most of the board we already have.

The main action to be taken now is to work on the IT governance. It’s clear to me now that some level of IT governance is a pre-requisite to a successful Kanban implementation. Given this minimum level, Kanban can improve it dramatically, as my experience in company A, as well as the Microsoft XIT and Corbis cases in David’s book, show; without the minimum level, however, the system just seems not to be in the basin of attraction of even a local optimum.

Investing in improving the process at the SFA contractor, and investing in SLA-based contracts with other vendors, are definitely a way to go.

Regarding the under-the-radar items, we’ve changed the deployment procedure so that nothing can be deployed to the production environment without my approval or that of one of my peer managers, and one of the criteria for approval is to have a CR created in our system. There are still some items that are being created at the very last moment, only after the deployment is denied, but that’s the first step towards making everything visible on the board.

Big projects and limited space broke our co-location, but we’re trying hard to have at least the core team back together, and have meaningful stand-up meetings, which would further reduce the under-the-radar cases and guarantee that the board is up-to-date.

Regarding the inherent differences in batch sizes, it isn’t yet 100% clear what to do. One idea is to create different classes of service and try to reserve and defend some capacity for the normal, medium-sized, accelerating cost-of-delay CRs. We’ll probably experiment a lot around this as soon as we get the other major roadblocks out of the way.

There is a plan to have advanced posts in the divisions and the tax department to try to identify the demand before its cost of delay runs out of control, and to try to shape the demand better.


  • No two contexts are equal. Trying to apply the exact solution in two different places is a recipe for disaster.
  • There seem to be some pre-requisites for successful Kanban implementations. The Kanban dynamics may have good local optima attractors with big basins of attraction (hence the motto “Start with whatever you currently have”), but there are initial conditions outside these basins of attraction.
  • IT governance seems to be an important pre-requisite, at least for portfolio-level, whole-IT implementations. There may be related findings for smaller-scope implementations (project governance for project-level implementations, and so on).
  • Too many expedite items is always a smell. Look for a root cause.
  • Culture isn’t an act of God, nor should it be blamed as a whole. Look for the specific things in the context that do impact you, understand the driving forces in the environment that create and feed the undesirable thing, and analyze how these forces can be changed or at least worked around.