For a long time, how to measure the effectiveness of software development has been on the mind of every R&D manager, yet it remains an unsolved problem. The industry has kept exploring: from the early days of lines of code per capita, to function points per capita computed by formula, to story-point-based iteration velocity and per-capita throughput.

Using lines of code per capita as a key indicator runs against the logic that better programmers solve problems with more elegant, and often less, code; it is clearly unreasonable to equate the mental work of programming with the speed of laying bricks.

Function point counting, which derives a number from a complex formula based on the pages and interfaces to be modified after requirements analysis and design, looks objective but ignores the diversity of software development work. Channel-side applications have many user interfaces, so their function point counts tend to come out higher, while back-end development, platform development, data and report development, algorithm development, and other kinds of work follow very different patterns; even within front-end development, different frameworks lead to different effort. A handful of formulas cannot objectively measure a team's capacity. In addition, increasingly complex formulas depend on accurate up-front design, are hard for everyone to understand, and require people to spend dedicated time on the counting itself; such work creates no value and is therefore waste.

As agile development spread, story points became a tool for a team to collectively assess the complexity of fine-grained requirements, and some managers then tried to measure capacity as story points per capita. However, story points have no unit, different teams calibrate them against different baselines, and the estimation is inherently subjective, so neither story points nor iteration velocity makes a satisfactory key performance indicator.

After years of exploration, the DevOps community proposed four key metrics for measuring IT performance: delivery lead time, deployment frequency, deployment failure rate, and online-failure recovery time, referred to as the "4 Key Metrics". This is a good direction. In practice, however, we find there are more than four key indicators worth watching: the production defect rate, for example, is an essential outcome, and requirement throughput is often a major concern. The following are common R&D effectiveness indicators seen in practice, some of which are key result indicators that reflect final outcomes.

To observe and evaluate R&D effectiveness, we must first define what effectiveness is. In a nutshell, effectiveness is a team's ability to deliver value continuously and quickly. The purpose is delivering value; in R&D, the core capabilities behind it are responsiveness and robustness. Responsiveness itself can be observed along two dimensions: flow efficiency and resource efficiency. The former is the cycle time from a requirement being raised to its value being delivered to users; the latter is the amount delivered per unit of human resource per unit of time. The demands of innovation and agility make the former more important than the latter.

Therefore, to evaluate effectiveness, here are a few key principles:

No single metric can reasonably observe and evaluate a team's effectiveness; relying on one will produce side effects. For example, looking only at throughput drives the team to blindly split requirements or to sacrifice quality, while looking only at lead time may drive the team to restrict the intake of requirements.

Evaluate effectiveness against overall results as far as possible, rather than against the performance of a single stage. For example, the pass rate when work is first submitted to testing is often important for reflecting built-in quality during development, but it is not suitable as an effectiveness measure, because it does not reflect the team's overall outcome.

The raw data for effectiveness evaluation should be objective records taken from tools, requiring no manual calculation and no time spent on estimation, and it should be applied to all teams in the same way.

Given the diversity of software R&D work and its nature as mental work, observation of R&D effectiveness should pay more attention to a team's improvement trend than to absolute values compared horizontally across teams.

So how can we observe and evaluate effectiveness more reasonably and effectively? The most direct, and the most ideal, method is to learn to observe and analyze a set of core indicators together, for example by examining the trends of the 4 Key Metrics, or of the key result indicators in the chart above, at the same time. Some mature companies turn these key indicators into dashboards that let observers grasp the overall situation at a glance. This is just like data analysis for digital operations: only by comparing a set of data can we obtain reasonably reliable insights. I strongly recommend that every effectiveness manager and process improver learn to evaluate a team this way, with driving improvement as the goal.

However, this ideal approach places high demands on the observer, who must fully understand the meaning of each indicator and the logic connecting them. A set of core indicators is also not intuitive enough to show the macro trend of effectiveness, so the cognitive load is high. For some executives and outsiders in particular, it is not clear whether overall effectiveness has become better or worse. To solve this problem, I looked to some analogous solutions.

Countries need indicators that continuously reflect the overall state of an economy, such as the Consumer Price Index (CPI) and Purchasing Power Parity (PPP). These are composite indicators built from a basket of underlying measures according to some internal logic. Their benefits are:

Although a composite indicator does not explain where the root cause of a problem lies, it reflects the overall situation more intuitively;

Its changes combine the influence of multiple factors, so it can show how much each factor contributes to the overall evaluation;

It reduces the likelihood of one-sided behavior aimed at making a single metric look good.

So, in one practical case, we designed the following conceptual formula, which combines six factors into a Comprehensive Effectiveness Index (CEI) for R&D that can be computed weekly or monthly:

Overall Effectiveness = (Delivery Throughput × Deployment Frequency × Release Success Rate) / (Requirement Lead Time × Online Stability × Debt Backlog)

Delivery throughput reflects resource efficiency and usually means the number of requirements delivered per unit of time. It is the hardest of the six factors to calculate well, because it depends on the granularity of requirements: function points and story points both require manual estimation and carry the problems described above. We therefore use a naturally occurring approximation: story development time, i.e. the time from the start to the completion of development, produced automatically as stories are dragged across the Kanban board. Although this duration is affected by individual developers' efficiency, statistically it can approximate the size of a requirement. Development time is also affected by how many stories are worked on in parallel: for the same size, more parallelism means a longer duration. The per-capita delivery throughput is therefore calculated as follows:

Delivery Throughput = Number of Stories Delivered × (Average Story Development Time / Average Per-Capita Development WIP) / Team Size
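As a minimal sketch (not the author's tooling), delivery throughput could be computed from Kanban-exported story records roughly like this; the field names `dev_start` and `dev_end` and the function itself are illustrative assumptions:

```python
from datetime import datetime

def delivery_throughput(stories, avg_dev_wip_per_capita, team_size):
    """Approximate per-capita delivery throughput from Kanban story records.

    stories: list of dicts with ISO 'dev_start'/'dev_end' timestamps (assumed export format)
    avg_dev_wip_per_capita: average number of stories in development per person
    team_size: number of people on the team
    """
    durations = [
        (datetime.fromisoformat(s["dev_end"]) - datetime.fromisoformat(s["dev_start"])).days
        for s in stories
    ]
    avg_dev_time = sum(durations) / len(durations)  # statistical proxy for average story size
    # correct the size proxy for parallelism, then normalize by team size
    return len(stories) * (avg_dev_time / avg_dev_wip_per_capita) / team_size
```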

Deployment frequency is the per-capita number of deployments of release units. In theory, the larger the team, the more requirements it can deliver and the more frequently features should be released; to increase frequency, this metric also drives teams to split their deployment units. Since deployment frequency matters relatively less than throughput and lead time for the overall evaluation, its impact is dampened with a power function:

Deployment Frequency = (Number of Deployment-Unit Deployments / Team Size)^(1/e)
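A sketch of this dampening, assuming a simple count of deployments over the measurement window:

```python
import math

def deployment_frequency(num_deployments, team_size):
    """Per-capita deployment count, raised to the power 1/e so that it
    weighs less than throughput and lead time in the composite index."""
    return (num_deployments / team_size) ** (1 / math.e)
```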

Release success rate is simpler: it is the success rate of each production release, where a release counts as unsuccessful if it has to be rolled back or the new version causes a major failure. Since this indicator is a percentage, the higher it already is, the harder it is to raise further, so it enters the calculation through the following exponential function:

Release Success Rate = e^(Version Release Success Rate)
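In code the transformation could look like this; the rate is assumed to be expressed as a fraction between 0 and 1:

```python
import math

def release_success_factor(successful_releases, total_releases):
    """Map the release success rate (0..1) through e^x, so that gains made
    near the top of the range, which are harder to achieve, count for more."""
    rate = successful_releases / total_releases
    return math.exp(rate)
```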

Requirement lead time is the key indicator of flow efficiency: the length of the cycle from a requirement being confirmed to it going live, which measures how quickly the team responds to value. The unit of counting here is not the story but the feature, i.e. a user requirement that can be released independently.

Requirement Lead Time = Average Feature Lead Time
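Assuming each feature record carries a confirmation timestamp and a go-live timestamp, the average lead time is just the mean of the differences (a sketch; the field names are illustrative):

```python
from datetime import datetime

def average_feature_lead_time(features):
    """Average days from requirement confirmation to go-live, per feature.

    features: list of dicts with ISO 'confirmed_at' and 'released_at' timestamps
    """
    days = [
        (datetime.fromisoformat(f["released_at"]) - datetime.fromisoformat(f["confirmed_at"])).days
        for f in features
    ]
    return sum(days) / len(days)
```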

Online stability can be measured by combining several basic indicators that look at it from different angles, such as per-capita production defects, downtime, and online-failure recovery time. Since these values may be 0, since the smaller they are the harder they are to improve further, and since they fluctuate strongly, they are dampened with the following power function:

Online Stability = ((Number of Production Defects + 1) / Team Size × (Downtime + 1) × (Average Online-Failure Recovery Time + 1))^(1/e)
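One way to read this formula in code (a sketch under the assumption that downtime and recovery time are measured in hours; the source does not specify the units):

```python
import math

def online_stability(prod_defects, downtime_hours, avg_recovery_hours, team_size):
    """Instability factor: the +1 offsets guard against zero values, and the
    1/e exponent dampens the large fluctuations these inputs tend to have."""
    raw = ((prod_defects + 1) / team_size) * (downtime_hours + 1) * (avg_recovery_hours + 1)
    return raw ** (1 / math.e)
```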

The last factor, debt backlog, is one I believe needs to be added. "Debt" here refers to problems a team should have resolved in time but has not: the requirement backlog, the defect backlog, and technical debt. A team can accumulate a lot of debt while delivering quickly; if technical debt and the defect backlog are ignored, a period of apparently high throughput merely covers up the problem. As for the requirement backlog, even if the team itself believes its efficiency is high, from the business side's perspective it still cannot keep up with demand, so the efficiency they perceive remains low. The defect backlog is the set of unresolved defects; the requirement backlog is the set of requirements raised by the business that have not entered delivery within a certain time limit; technical debt is the code debt that is currently easy to quantify, for example from a Sonar scan, and can also include architectural debt where that can be measured. Given the differing importance of these backlogs, each is assigned a weight. The per-capita debt backlog is calculated as follows:

Debt Backlog = (Requirement Backlog × 50% + Defect Backlog × 30% + Technical Debt Volume / 10 × 20%) / Team Size
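A direct translation of the weighted sum, with the technical-debt volume assumed to come from something like a Sonar issue count divided by 10, as in the formula:

```python
def debt_backlog(requirement_backlog, defect_backlog, tech_debt_volume, team_size):
    """Per-capita weighted debt: requirements 50%, defects 30%, technical debt 20%."""
    weighted = (
        requirement_backlog * 0.5
        + defect_backlog * 0.3
        + (tech_debt_volume / 10) * 0.2
    )
    return weighted / team_size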

Finally, since Team Size appears in both the numerator and the denominator calculations, the Team Size terms can be cancelled against each other to simplify the final calculation formula.
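Putting the six factors together, a minimal sketch of the composite index might look like the following; the plain product-over-product combination mirrors the conceptual formula above and is an assumption, not the author's exact final formula (which also simplifies away the Team Size terms):

```python
def comprehensive_effectiveness_index(throughput, deploy_freq, release_factor,
                                      lead_time, instability, debt):
    """Composite index: the numerator factors push effectiveness up, while the
    denominator factors (lead time, instability, debt) pull it down."""
    return (throughput * deploy_freq * release_factor) / (lead_time * instability * debt)
```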

The following is an example of the comprehensive index curve, together with its source data, produced in practice from actual measurements. It lets us see intuitively how the team's overall effectiveness changes once the various factors are combined. A red dashed trend line is generated automatically in the chart to show whether overall effectiveness is trending better or worse over the period, and by how much. Because of the diversity of team work, the index values calculated for different types of teams may differ widely, so the curve is mainly used to compare a team against its own past, or horizontally between R&D teams whose work is similar in nature.
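The automatically generated trend line can be produced with an ordinary least-squares fit over the weekly (or monthly) index values; a sketch using numpy (an assumed dependency):

```python
import numpy as np

def trend_line(cei_values):
    """Fit a straight line to the CEI series; a positive slope means
    overall effectiveness is trending up over the period."""
    x = np.arange(len(cei_values))
    slope, intercept = np.polyfit(x, cei_values, 1)
    return slope, intercept, slope * x + intercept  # fitted values for plotting
```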

Tracking and observing this comprehensive indicator effectively solves the earlier problems of single indicators and manual counting, and it reflects how much the indicator factors from different dimensions influence overall effectiveness. So who is this indicator trend useful for? In practice, there are several application scenarios:

For senior managers, or those unfamiliar with effectiveness data analysis, it visually shows how the effectiveness of a team or the organization is changing and serves as a basis for communicating about R&D effectiveness;

For department and team leaders and coaches, it gives a quick read on overall changes in team effectiveness; when the curve fluctuates significantly, they can drill down into which factors are driving the fluctuation and take measures to improve;

When a department or team sets an effectiveness-improvement goal, for example as an OKR, the composite index can serve as the key result that measures goal attainment. This keeps the team from focusing one-sidedly on improving a single indicator: it must attend to the comprehensive result, improving individual indicators while ensuring that the other key indicators do not decline.

There is still plenty of room to improve the formula. Different enterprises and departments interpret effectiveness differently, or emphasize different key indicators, so the impact factors and weights in the formula can be adjusted accordingly. There may also be more scientifically sound ways to model how factors such as release success rate and production defects act on the final result; suggestions are welcome.