This article introduces the QRF team model, which separates product development from emergency response by giving each its own team. Important work and urgent work each get a dedicated path, which helps teams focus on what matters. Engineering Org Structures - The QRF Team Model [1]

As a team leader, I’ve found that a lot of startup product and engineering teams are struggling with agility.

There are real benefits to startups taking shortcuts early in their development, but the shortcuts come at a cost. Over time they build up so much technical, capability, knowledge, and organizational debt that only the engineering team can handle the ongoing errors, internal problems, and interruptions.

Requests come in from all sides and are expected to be handled quickly. Planned tasks are sometimes interrupted several days in a row, which badly disrupts engineers’ work. Even if each interruption takes only a few minutes, a single context switch can ruin an entire afternoon’s productivity.

How can engineering organizations handle these issues in a reasonable way?

The QRF model works very well in some environments and is a poor fit in others.

Divide the organizational unit into two disparate, independent teams:

Team 1’s charter is simple: build what is on the roadmap. Like a typical product development team, they work on high-value, high-priority roadmap projects at a predictable pace, using any suitable framework or methodology, such as Scrum or Kanban.

In short, they work for the company’s main goals.

Team 2’s charter is to work on anything that could interrupt Team 1. They act as a QRF, i.e. a quick reaction force, dealing with any issue that could interrupt the sprint. In short, they play a supporting role and keep Team 1 from being disturbed.

Other “agile” methods that are commonly employed have fundamental flaws and are therefore ineffective in some cases.

These include:

Which of the following things takes precedence:

The next billion-dollar idea will always win any argument about priorities because its value is an unknown, idealized, and easily manipulated prediction. However, focusing innovation only on new product ideas means that current customers go unserved.

It’s the paradoxical relationship between innovation and operations: a company needs both to thrive in the marketplace, but most arguments about product prioritization undervalue operations. When the alternative promises exponential value, few product managers are willing to pick projects with incremental value.

This means that from a “value” perspective, internal tools, bug fixes, automation, technical debt, and similar work rarely get prioritized. They usually have a known, smaller, quantifiable value that pales in comparison to unvalidated, idealized predictions.

Finally, no matter how many structured frameworks we use, prioritization is an art, not a science.

Some teams have tried to solve this problem by strictly adhering to the sprint commitments. Whenever an additional request arises, they say, “We’ve already planned this sprint, so we’ll put it in the next sprint.”

Unfortunately, when the next sprint arrives, there are 15 extra items on the to-do list.

This repeats until there is a large pile of to-dos, most of which never get done. The team then comes to be seen as inflexible by customers, by others in the company, and by the executive team.

Why does this fail?

This approach assumes that things can wait until the end of the sprint.

The truth is, some things can’t wait. Compliance requirements, new employee onboarding, minor configuration changes for large customers, and so on are all tied to other activities and must be completed within a fixed window.

Failure to meet these requirements within a short period of time may result in reputational damage or operational risks.

Another approach teams take is to set aside a fixed amount of time, such as 10% of the sprint or one day a week, to solve these problems.

Why does this fail?

This approach works when the volume of interruptions is low, but not for organizations with a heavy interrupt workload, because the work takes far more time than the allotment provides.

In addition, labeling this work a “10%” activity can make people see it as unimportant and distracting. Engineers may not be motivated to do it, so the tasks are handled less efficiently.

There’s also an implicit assumption that enough can actually be done in that time.

In many cases it can’t, and the team is back where it started.

The real problem here is that the pace of work of the team is slower than the environment requires.

If you receive 5 urgent requests per day but only take on new work every two weeks, a customer may have to wait up to 4 weeks for a resolution. In the startup world, 4 weeks can be an unacceptably long time. And that assumes the request even wins priority against the planned roadmap, which is usually not the case.

Eventually things pile up and customers are upset: “Why does it take a week to set up a configuration?”

That’s why the QRF approach works well: it forces work to move at a faster tempo [2].

The first and most important benefit is that interrupt tasks are handled by a dedicated execution stream. The QRF team may switch contexts frequently, but the organization as a whole ends up switching context far less on its important work.

This helps keep attention on the organization’s primary focus, which should be important, non-urgent work.

Ultimately, this is a system-level optimization paid for with a local cost.

The second benefit is increased organizational responsiveness to requests.

Many organizations follow some form of agile ritual that plans and commits work around goals every two weeks. This means that in the worst case, the theoretical time from a customer making a request to that request being fulfilled can be up to 4 weeks.
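The 4-week worst case can be checked with a little arithmetic. The sketch below is a minimal Python illustration, assuming that a request arriving just after a planning session waits one full cycle to be planned and is then delivered at the end of the following cycle. The function name and this simplified model are illustrative assumptions, not something the original article spells out.

```python
def worst_case_lead_time_days(cadence_days: int) -> int:
    """Worst-case days from request to delivery under a fixed planning cadence.

    Assumption: a request that just missed planning waits one full cycle
    to be planned, then ships at the end of the cycle it was planned into.
    """
    wait_for_next_planning = cadence_days
    time_to_deliver = cadence_days
    return wait_for_next_planning + time_to_deliver


print(worst_case_lead_time_days(14))  # 28 days, i.e. 4 weeks
```

Under the same assumption, shortening the cadence is the only way to shrink this bound, which is why a dedicated stream that reprioritizes daily can respond so much faster.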

The QRF team provides faster response times for such requests. I’ve seen teams achieve a lead time (from request to delivery) of only 3 days, which improves customer satisfaction and satisfaction with the service.

Maintenance is the most expensive aspect of the system lifecycle. Due to the high cost, it is often delayed, which in turn increases the risk and effort required to maintain the system.

QRF helps ensure that important maintenance work, such as bug fixes, operational tooling, and similar projects, does not get pushed aside.

QRF operates in two modes:

In “reactive” mode, the QRF team actively puts out fires and solves urgent, disruptive problems. It acts as the first line of defense against interruptions and provides support by focusing on problem solving.

In “standby” mode, the QRF team works on preventive “shift-left” projects or on anything that helps it react faster. These can be internal tools, improved monitoring and alerting, quality improvements, automation, logging, auditing, or observability. It’s up to the QRF team to decide how to use this time.

QRF’s most important job is to “shift left” on interruptions. Not only must the team react to and solve real problems, it must also work to keep those problems from reaching engineering in the first place.

Every company and problem looks different, but some common “shift left” solutions include:

Over time, as the types of interruptions shift to the left, the engineering organization as a whole has more room to focus on the execution of high-value projects.

QRF works closely with stakeholders across the organization, especially internal teams such as support, customer success, and operations. It’s important to stay in touch with these teams, who need to be aware of any changes to the workflow or solutions to the requests they may make.

There are several ways to measure the performance of a QRF team:

All of these are reasonable ways to assess whether QRF is effective.

The problem resolution rate is essentially the throughput of interrupt requests. In other words, without the QRF team, this is the number of interruptions the main team would have had to handle.

Categorizing completed projects can provide key insights into the composition and sources of the various disruptions that an organization may encounter:

Cycle time and lead time can be used to give requesters an expected completion time for a new request. For example, if the average lead time for the last 100 tasks is 3 days, then a requester can expect a new request to take about 3 days on average.
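As a rough illustration of how such an average could be computed, here is a minimal Python sketch. The `CompletedTask` record and its field names are hypothetical, and it assumes the common Kanban convention that lead time runs from request to delivery while cycle time runs from the start of work to delivery.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

SECONDS_PER_DAY = 86_400


@dataclass
class CompletedTask:
    """Hypothetical record of one finished QRF request."""
    requested_at: datetime
    started_at: datetime
    delivered_at: datetime


def lead_time_days(task: CompletedTask) -> float:
    # Request to delivery: what the requester actually experiences.
    return (task.delivered_at - task.requested_at).total_seconds() / SECONDS_PER_DAY


def cycle_time_days(task: CompletedTask) -> float:
    # Start of work to delivery: how long the team spent on it.
    return (task.delivered_at - task.started_at).total_seconds() / SECONDS_PER_DAY


def average_days(metric, tasks) -> float:
    return mean(metric(task) for task in tasks)


# Made-up sample data, for illustration only.
tasks = [
    CompletedTask(datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 3)),
    CompletedTask(datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 4)),
    CompletedTask(datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 5)),
]

print(average_days(lead_time_days, tasks))   # 3.0
print(average_days(cycle_time_days, tasks))  # 2.0
```

In practice the list would hold the most recent completed tasks (the last 100, say), so the average tracks current conditions rather than the team's whole history.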

Tasks shifted left is the number, or the categories, of discrete workflows that the QRF team has successfully prevented from recurring. It measures how proactive the team’s preventive work is. Each category may represent dozens or hundreds of future interruptions that have been prevented.

Interruptions prevented measures how well the QRF team gets ahead of problems before they arise. Every item shifted left (for example, a task now handled through an internal tool) is a problem that an engineer will not need to solve in the future.

Some companies have successfully implemented this model.

One of them, which I’ll call “PayCo,” had as its biggest problem more than 12 requests a day that only a few key engineers in the company knew how to handle. Those engineers were too busy with other things to complete the requests, so the backlog kept growing. Eventually the problems became large enough to seriously disrupt projects and demand immediate attention.

When I first joined, there were tons of tasks and requests. I knew it wasn’t sustainable, so I quickly put together a quick response team.

Initially it was a tough job. Every problem and request we encountered was new and took a lot of time to solve.

However, we clearly and carefully documented the problems and solutions we encountered, created detailed guides, and even wrote quick “populated” scripts that could be reused in the future. Along the way we identified and documented solutions for more than 40 types of requests.

From these categories, we identified key patterns:

We implemented a series of shift-left efforts and completely eliminated 25 types of requests, either by empowering requesters to solve the problem themselves or by showing them how to solve it.

In just two quarters, we went from processing several interrupt requests per day to receiving no more than 3 requests per week. In the end, the QRF team no longer needed to be the first point of contact for problems and only had to respond to problems that other teams couldn’t solve.

For QRF to be successful, the organization needs to make the following commitments.

This means that other teams cannot tell the QRF what to do. The QRF chooses its own work, based on how much interruption the work prevents and on the team’s own capacity.

The whole point of QRF is to react quickly, and a team loaded with work that must be finished cannot react quickly. The team should always be free to drop the work at hand and pick up whatever needs doing.

Issues must be addressed in direct collaboration with stakeholders and other departments, which may involve process changes, interface adjustments, or the introduction of tools. Any permission restriction simply adds waiting time and delay, which makes the team less responsive and the interruptions more urgent.

The team needs engineers who enjoy variety, like digging into problems, work well with stakeholders, and document and follow through on solutions without support. Otherwise, it will be difficult for them to influence change, gather information, and share broader solutions with the company.

QRF can only work effectively if it can easily identify requests that conform to its operating parameters and can identify long-term request patterns. This can only be done if the request is made in a consistent and visible manner.

In short, QRF cannot respond to requests that are not known.

If you don’t already have a formal request processing process, build one! [3]

One of the QRF’s jobs is to work out how to solve a problem so that it only needs to be solved once. That usually means writing down the steps and sharing them across the company, so that a playbook exists for that problem pattern.

If someone on the QRF team hates documentation, the team will end up solving the same problem over and over again.

Working effectively in the QRF operating model requires a level of familiarity that cannot be built in a matter of weeks. Team members need to be able to follow up on leads, communicate with stakeholders, and identify shift-left patterns. Rotating engineers every few weeks makes it impossible to build that experience.

The whole point of a rapid response mechanism is to provide a timely response, so the team needs to commit to a response service level agreement (SLA).

Something like “We’ll respond to all requests within 48 hours” helps prevent the QRF from turning into an unpredictable black hole. Note that a response does not necessarily mean a resolution, only that the requester knows whether the QRF will take the request on.
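A response SLA like this is easy to check mechanically. The following Python sketch is one hypothetical way to do it: the 48-hour figure comes from the example above, while the function names and the idea of tracking a `responded_at` timestamp per request are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# 48 hours, taken from the example commitment in the text.
RESPONSE_SLA = timedelta(hours=48)


def response_deadline(received_at: datetime) -> datetime:
    """Latest moment a first response can be sent and still meet the SLA."""
    return received_at + RESPONSE_SLA


def is_sla_breached(
    received_at: datetime,
    responded_at: Optional[datetime],
    now: datetime,
) -> bool:
    """True if the first response was late, or is still missing past the deadline.

    Only the first response matters here; resolution time is not measured.
    """
    deadline = response_deadline(received_at)
    if responded_at is not None:
        return responded_at > deadline
    return now > deadline


received = datetime(2024, 1, 1, 9, 0)
print(response_deadline(received))  # 2024-01-03 09:00:00
print(is_sla_breached(received, None, now=datetime(2024, 1, 4, 9, 0)))  # True
```

Keeping the check limited to the first response mirrors the distinction above: the commitment is about acknowledging requests predictably, not about how long the fix takes.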

It’s easy to think “Oh, it’s a bug team,” but that’s not true.

Yes, it usually starts out as a team fixing a lot of bugs, but QRF prioritizes differently. A bug team prioritizes bugs based on their importance and only deals with bugs.

QRF prioritizes based on the degree of disruption and inefficiency, and may handle items that are not bugs at all. Its job is to prevent interruptions, not to fix bugs. This could mean leaving some bugs in the system while it builds a long-term preventive solution for a different interruption.

Autonomy and gratitude.

The key to QRF is to allow unlimited autonomy, which appeals to many engineers. As long as the response SLA is met, the engineers on the team can basically decide for themselves what to do during standby time.

There won’t be much standby time at first, but eventually there is plenty of free time between one urgent request and the next that engineers can use to amplify their impact.

Due to the nature of the request, QRF is also a team that can interact directly with people in need of help. Some engineers find it incredibly motivating to deploy a fix or tool, and to have the opportunity to talk to the people who benefit the most from it immediately. While many product development teams talk about customer collaboration, too many continue to isolate engineers from users, making such opportunities the exception rather than the norm.

But it must be admitted that QRF is not for everyone. If engineers just want to bury their heads in the code, don’t let them join the QRF team.

No. Scrum’s prioritization cadence may be too long. QRF may adjust priorities daily, sometimes even several times a day, depending on the severity of the problem. That responsiveness is its strength and its promise. Scrum does the opposite, creating an almost immutable contract that states (ideally) that reprioritization should only take place at the end of the sprint.

It is recommended that QRF adopt a pull-based operating model, such as Kanban. Kanban works well because it helps limit work in progress, reduce batch size, and make work visible, and blocked work can be resolved by the team swarming on it together.

They work on preventive, shift-left, or other high-leverage work, which is at the discretion of the team members. The only requirement is to respond to and deal with issues promptly, keeping the queue as empty as possible.

This means not doing large-scale, months-long projects and ensuring that the tasks delivered are iterative, incremental, and can be abandoned or interrupted at any time.

In the unlikely event that the QRF truly has no tasks left to deal with, the team has done its job! It can then transition smoothly into an internal support team or a developer experience team, or merge into another team.

Purely from the point of view of engineering capacity, yes. However, that comparison is too simplistic: if you add the same capacity to an ordinary product team, you will not get the same results.

That’s because the key elements of the QRF operate in a completely different way from other engineering teams in the company. In the long run, a team dedicated to preventing problems reduces future interruptions more than one that gives them only partial attention, and the incentive of unlimited autonomy helps motivate and retain the engineers working in this model.

It depends on the specific situation: the QRF can be a single engineer on a small team, or a team of several engineers. The important part is to check the rate of interrupt requests and make sure the QRF is staffed well enough to keep requests from piling up.

If the QRF needs to be larger than the main team, that most likely points to larger, upstream, and possibly cross-functional organizational issues that need to be shifted left, such as consistently poor requirements, technical quality, or prioritization processes.

The QRF model can be very effective, but keep in mind that there is no panacea: whether it works depends on the team’s background and circumstances.

I found this model to be a huge success in the following environments:

This pattern may not work in the following environments:

Context matters. An article cannot solve your problem; at most it can offer a mental model to apply, and after evaluating your environment and requirements you may find that you need to modify the model. There is a good chance that the QRF model will solve your problem, but it may also be no help at all. Only you know what is best.

References:
[1] Engineering Org Structures - The QRF Team Model: https://betterprogramming.pub/engineering-org-structures-the-qrf-team-model-7b92031db33c
[2] Using tempo to avoid the chaos of agile methodologies: https://jgefroh.medium.com/using-tempo-to-avoid-the-chaos-of-agile-methodologies-cb213a84652c
[3] How to design an effective intake process: https://medium.com/swlh/how-to-design-an-effective-intake-process-cba0b98be4d4

Hello, I am Yu Fan. I did research and development at Motorola and now do technical work at Mavenir. I have always had a strong interest in communications, networks, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and other technologies. I like to read and think, and I believe in continuous learning and lifelong growth. You are welcome to get in touch so we can learn together. WeChat public account: DeepNoMind