Technical Debt Management

When a project survives long enough to become complicated.

Aug 22, 2024

Technical debt is the deferred cost associated with a particular implementation or design choice (e.g. adopting a specific framework). It is an unavoidable and necessary outcome of software development and can be managed effectively. Like debt in general, there is high quality and low quality technical debt depending on how well it is managed, the original function it served, and the continuing value, if any, it provides.

The term "technical debt" was adopted because the deferred cost (e.g. decreased productivity) tends to compound: as projects progress, many issues become more costly in terms of both the negative impact (e.g. reduced developer velocity) and cost-to-remediate. Therefore, the cost of technical debt may change over time depending on its "interest rate."

Teams can identify and track technical debt, periodically groom their technical debt backlog in order to re-assess cost, and scope initiatives that are focused on paying down technical debt into their quarterly planning process [1].

Technical debt is evidence a project lived long enough to become complicated: it can be a good problem to have and a sign of success. Generally speaking, it is the responsibility of a team’s tech lead to manage its technical debt.

Preventing unhealthy technical debt

Since hidden technical debt tends to compound faster than well-understood technical debt, one of the most impactful ways a team can manage their debt (i.e. mitigate risks) is to simply document and track it (see next section). Developers should also provide conservative estimates for their work to ensure code quality standards are met for ongoing contributions. This limits incoming technical debt, makes planning more predictable, and gives developers a chance to opportunistically pay down existing technical debt in the parts of the system that are being affected by the changes. Teams can also de-risk implementation plans with peer-reviewed design documents (e.g. RFCs) and use Architectural Decision Records (ADRs) to limit entropy due to misalignment and lack of shared context or standards [2]. A culture of preventative action and ongoing maintenance (“refactoring is how we write code”) that is seamlessly woven into development must be enabled by tech leads and engineering managers, as it takes additional planning, mentoring, and effort to incorporate these practices into a team’s process and culture. By adopting this stance of cleaning up code as it's being modified, developers are not only encouraged to improve the health of the code base, they also become more intentional and selective about introducing technical debt going forward.

Tracking technical debt

The most essential tactic for managing technical debt is to first make it visible. Developers should track technical debt like any other future work: creating tickets to group related tasks together, breaking down tasks into smaller tickets as needed, and referencing tickets in the code (e.g. FIXME(ABC-1234): This component should eventually be refactored to account for X, Y, Z). In their tickets, developers should estimate the impact of completing each tech debt ticket, weighing the ongoing cost of the debt against the value of paying it down in terms of value to the user, de-risking execution, and increasing developer productivity. These categories are based on the Flow Framework developed in Project to Product by Mik Kersten.

Value to the user: Anything that improves the user experience of an application delivers value to a user. Examples include: increasing stability and reliability, faster performance and responsiveness, increasing speed of delivery of new features, and delivering a more cohesive product (e.g. use of a consistent design language).
De-risking execution: Risks may be related to finance, security, governance, or other matters of compliance.
Increasing developer productivity: Removing impediments to delivery is critical to staying responsive to new circumstances such as changes in user needs or business priorities.

When using the above categories as framing, developers should be as specific and detailed as possible: "X impedes our ability to do Y, in Z ways. Modifying it would lead to improvements in A, B, and C." They should also, whenever possible, provide specific metrics for measuring the positive impact of paying down the debt. These metrics can include agile metrics like “flow velocity” (calculated by dividing the amount of work completed by the team during a sprint by the length of the sprint). When small, iterative changes are infeasible, developers should make the case for larger refactors in an RFC [3].
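As a rough illustration of the flow velocity calculation described above, the sketch below divides completed work by sprint length and compares two sprints. The function name, the use of story points as the unit of work, and the numbers are all assumptions made for the example, not part of the Flow Framework itself.

```python
def flow_velocity(completed_work: float, sprint_length_days: float) -> float:
    """Work completed during a sprint divided by the sprint's length."""
    if sprint_length_days <= 0:
        raise ValueError("sprint_length_days must be positive")
    return completed_work / sprint_length_days


# Hypothetical numbers: 30 story points completed before a refactor and
# 36 after, both measured over sprints of 10 working days.
before = flow_velocity(30, 10)
after = flow_velocity(36, 10)

# Relative change in velocity, suitable for quoting in a tech debt ticket.
improvement = (after - before) / before
print(f"velocity before: {before:.1f}, after: {after:.1f}, change: {improvement:.0%}")
```

A single before/after comparison is noisy, so a ticket would ideally cite an average over several sprints rather than one pair.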

Teams should add these tickets to an application-specific technical debt epic. They can then periodically (e.g. once/month) groom the tech debt for their applications, resulting in a prioritized backlog with cost-estimated tickets. Developers can use ticket labels, statuses, sprints, or quarterly epics to distinguish between tech debt tickets that are in a pre-planning phase vs tickets that are prioritized and ready for development.
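To keep the backlog and the code in sync during grooming, it can help to list every ticket reference that appears in the source. The following is a minimal sketch, assuming markers follow the FIXME(ABC-1234) style shown earlier; the regular expression, the ticket-key shape, and the helper name find_debt_markers are illustrative assumptions rather than an established convention.

```python
import re
from collections import defaultdict

# Matches markers such as "FIXME(ABC-1234): ..." or "TODO(ABC-1234): ...".
# The ticket-key shape (uppercase project key, dash, number) is an assumption.
MARKER = re.compile(r"\b(FIXME|TODO)\(([A-Z][A-Z0-9]*-\d+)\):?\s*(.*)")


def find_debt_markers(source: str, filename: str = "<source>"):
    """Return {ticket: [(filename, line_number, note), ...]} for one file's text."""
    found = defaultdict(list)
    for line_number, line in enumerate(source.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            _, ticket, note = match.groups()
            found[ticket].append((filename, line_number, note.strip()))
    return dict(found)


# Hypothetical file contents used only to demonstrate the helper.
example = """\
def render():
    # FIXME(ABC-1234): This component should eventually be refactored.
    pass  # TODO(ABC-1234): remove the legacy branch
"""
markers = find_debt_markers(example, "component.py")
for ticket, locations in markers.items():
    print(ticket, len(locations), "reference(s)")
```

Running a scan like this across a repository gives a quick view of which tickets are still referenced in code, and which markers point at tickets that have since been closed.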

Prioritizing technical debt

Generally speaking, we can prioritize technical debt along the cost and level-of-effort dimensions. This means we should:

Prioritize tech debt that is exhibiting high cost (e.g. producing a lot of bugs, reducing developer productivity, slowing down application performance).
Prioritize tech debt that has a low effort required to remediate.
Prioritize tech debt in projects that have longevity (i.e. not likely to be retired/deprecated/sunsetted any time soon).
Prioritize tech debt in projects that are highly leveraged (e.g. infrastructure).

If several of the above are true, the tech debt is a good candidate for prioritization.
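One simple way to apply the criteria above is to count how many of them each ticket meets and sort the backlog by that count. The sketch below does this with yes/no flags; the DebtTicket class, its field names, and the equal weighting of all four criteria are assumptions for illustration, and a team may well prefer weighted or numeric scores.

```python
from dataclasses import dataclass


@dataclass
class DebtTicket:
    key: str                # hypothetical ticket key
    high_cost: bool         # e.g. producing bugs, slowing developers down
    low_effort: bool        # cheap to remediate
    long_lived: bool        # project is not about to be retired
    highly_leveraged: bool  # e.g. infrastructure that other work depends on

    def criteria_met(self) -> int:
        """Number of prioritization criteria this ticket satisfies (0 to 4)."""
        return sum([self.high_cost, self.low_effort,
                    self.long_lived, self.highly_leveraged])


def prioritize(tickets):
    """Order tickets by how many criteria they meet, most first."""
    return sorted(tickets, key=lambda t: t.criteria_met(), reverse=True)


# Hypothetical backlog used only to demonstrate the ordering.
backlog = [
    DebtTicket("ABC-1", high_cost=True, low_effort=False,
               long_lived=False, highly_leveraged=False),
    DebtTicket("ABC-2", high_cost=True, low_effort=True,
               long_lived=True, highly_leveraged=False),
    DebtTicket("ABC-3", high_cost=False, low_effort=True,
               long_lived=True, highly_leveraged=True),
]
ranked = [t.key for t in prioritize(backlog)]
print(ranked)
```

Tickets that tie on the count keep their original backlog order, so the result is a starting point for discussion during grooming rather than a final ranking.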

The single most effective method a team can adopt to identify high priority technical debt is to hold incident retrospectives, i.e. “postmortems.”

Paying down technical debt

Technical debt can often be addressed in the course of meeting a team’s immediate needs. This reduces the overhead of paying down the debt by amortizing the cost of code changes (e.g. context switching, updating tests) across the related work items. This is the most efficient way to pay down technical debt but also comes with its own set of risks. The best way to mitigate these risks is to regularly track and groom technical debt, communicate clearly ahead of time the intention to pair the work together, and limit the scope of the technical debt that will be addressed. Developers should separate functional changes from refactors by making separate MRs whenever possible.

In addition to this fix-as-you-go approach, on-call engineers can be tasked with working on groomed tech debt tickets. Having the on-call engineer stand outside of (i.e. not contribute towards) a team’s active sprint can be a way of quickly paying down accumulated tech debt by having this individual focus exclusively on support issues (i.e. active incidents), bugs, UX debt, and technical debt.

We should be paying down technical debt regularly enough that sprints focused solely on technical debt are relatively rare.

Ownership

Generally speaking, developers and the tech lead are responsible for preventing, tracking, and continually paying down technical debt. The tech lead is responsible for hosting periodic review and grooming sessions of the team’s technical debt (e.g. once a month or every other month). Engineering managers are responsible for prioritizing larger technical debt initiatives as needed, based on input from the tech lead and developers on the team.

Footnotes

[1] Ugly code and controversial decisions don't necessarily qualify. Bugs are not really technical debt but they can be symptoms of underlying technical debt and should be linked with related issues to help estimate costs. Note that technical debt associated with stable code that is rarely modified will have a near-zero cost associated with it.
[2] Software has a kind of entropy (also called "bit rot") whereby it accumulates flaws and complexity by virtue of being changed and by being deployed in a changing environment. This can be conceived of as "unintended technical debt" but it's also helpful to consider it as an independent phenomenon.
[3] A rough outline of the RFC document and process is: describe the problem and opportunity, offer solutions with alternatives, propose a path forward, solicit input, and make modifications based on feedback.

jc foust
