Improving deployments in a legacy codebase
I have been working predominantly on legacy codebases. These are almost always monoliths, and deploying one of them to production is a ceremony involving multiple rituals. The challenges with most of them are:
- Tech debt – more specifically, old tech living alongside new tech
- A lack of automated testing with sufficient coverage to give you confidence
- An unreliable deployment process
Release trains are one way to deal with such situations. We have a release scheduled once every two weeks. We branch off of master at some point and call it our release candidate. This candidate is staged, verified (manually) and deployed. While a release train sounds like a good concept, it suffers from serious issues.
- The frequency of the release train depends on how fast and often your team of manual QA members can perform testing
- Every release needs them to run through a full site regression and test out new features, which can be very expensive
- If bugs are found in a release train, the train gets delayed, which impacts time-sensitive features and bug fixes
- If a rollback occurs, all the effort for this release train is essentially throwaway work
There are two ways to solve this specific problem. You can work on adding enough quality gates to improve the quality of your release candidates. Or you can work on increasing the frequency of your deployments. Both have their advantages and disadvantages.
Adding quality gates to your code is simple in the sense that it's easy to check for quality (whatever you deem that to be) in anything upcoming. The challenge is going back and retroactively fixing your existing code to fit that same standard. As an example, if you create a rule that prevents people from adding jQuery to your codebase, that's a good first step. You still have to go back and replace all existing instances of jQuery in the codebase.
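To make the "new code only" idea concrete, here is a minimal sketch of a check that looks only at the lines a change adds, so existing jQuery does not fail the build. It assumes you feed it a unified diff (for example, the output of git diff). The patterns and the function name are hypothetical and would need tuning for a real codebase; in practice a proper lint rule is likely the better tool.

```python
import re

# Hypothetical patterns that count as "using jQuery"; adjust to your codebase.
JQUERY_PATTERNS = [
    re.compile(r"""from\s+['"]jquery['"]"""),
    re.compile(r"""require\(\s*['"]jquery['"]\s*\)"""),
    re.compile(r"""(?<![\w$])(?:\$|jQuery)\s*\("""),
]


def find_new_jquery_usage(unified_diff: str) -> list[tuple[str, str]]:
    """Return (file, line) pairs for added lines that appear to use jQuery."""
    violations = []
    current_file = "<unknown>"
    for line in unified_diff.splitlines():
        if line.startswith("+++ "):
            # File header such as "+++ b/src/cart.js"; strip the "b/" prefix.
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
            continue
        if not line.startswith("+"):
            # Context and removed lines are legacy code; this gate ignores them.
            continue
        added = line[1:]
        if any(pattern.search(added) for pattern in JQUERY_PATTERNS):
            violations.append((current_file, added.strip()))
    return violations
```

A pre-commit hook or CI step could fail whenever this returns a non-empty list. The retroactive cleanup of existing usages still has to happen separately.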
Adding linting as a pre-commit hook is great. This way you can easily start linting and formatting any code that is being touched. Doing this incrementally, however, makes your PRs noisy, because formatting changes get mixed in with the actual change. The only way to avoid that is to run a full pass on your entire codebase that fixes all formatting issues. This is tedious to test because it touches the entire codebase, but once it's done you have clean PRs and a fully formatted, linted codebase.
In a large enough codebase, with multiple teams working at the same time, it is easier to rely on code coverage metrics as a pre-build step. Being able to prevent untested net-new code from entering the codebase is a great way to draw a line in the sand and improve quality from this point onward.
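A coverage gate on net-new code can be sketched roughly like this. It assumes you can already produce three inputs per file: the line numbers a change touches, the lines your coverage tool considers executable, and the lines it reports as covered. The function names and the 80% threshold are assumptions for illustration, not a recommendation.

```python
def new_code_coverage(
    changed_lines: dict[str, set[int]],
    executable_lines: dict[str, set[int]],
    covered_lines: dict[str, set[int]],
) -> float:
    """Fraction of changed, executable lines that are covered by tests."""
    total = 0
    hit = 0
    for path, lines in changed_lines.items():
        # Only lines the coverage tool can measure count toward the gate.
        measurable = lines & executable_lines.get(path, set())
        total += len(measurable)
        hit += len(measurable & covered_lines.get(path, set()))
    # A change with nothing measurable (docs, config) passes trivially.
    return 1.0 if total == 0 else hit / total


def passes_gate(
    changed_lines: dict[str, set[int]],
    executable_lines: dict[str, set[int]],
    covered_lines: dict[str, set[int]],
    threshold: float = 0.8,  # hypothetical default; pick what suits your team
) -> bool:
    """True when coverage of the new code meets the threshold."""
    ratio = new_code_coverage(changed_lines, executable_lines, covered_lines)
    return ratio >= threshold
```

Because only changed lines are measured, the overall coverage number of the legacy code does not block anyone; the gate only asks that what is added today comes with tests.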
One trick that can also help is to require developers to add unit tests for any file they touch, even if it means they are testing code they didn't necessarily write. This slows down the development process, so it needs to be done with consideration.
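One way to enforce this is a small check that compares the files touched in a change against the test files that exist. The sketch below assumes a naming convention where src/cart.js is tested by src/cart.test.js; that convention, the list of suffixes, and the function names are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical: file types that are expected to have a unit test.
SOURCE_SUFFIXES = {".js", ".jsx", ".ts", ".tsx"}


def expected_test_path(source: str) -> str:
    """Map "src/cart.js" to "src/cart.test.js" (assumed naming convention)."""
    path = PurePosixPath(source)
    return str(path.with_name(f"{path.stem}.test{path.suffix}"))


def files_missing_tests(touched: list[str], existing: set[str]) -> list[str]:
    """Return touched source files that have no matching test file."""
    missing = []
    for source in touched:
        path = PurePosixPath(source)
        if path.suffix not in SOURCE_SUFFIXES:
            continue  # not a source file we enforce tests for
        if path.stem.endswith(".test"):
            continue  # the touched file is itself a test
        if expected_test_path(source) not in existing:
            missing.append(source)
    return missing
```

This only checks that a test file exists, not that it is meaningful, so it works best alongside a coverage gate rather than instead of one.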
Bringing CI / CD to a legacy codebase is hard, especially when you lack end-to-end tests. Lucky for us, we have a separate team that works exclusively on adding these tests. In general, what we have noticed is that if developers plan to add E2E tests to their features, the gap becomes easier to bridge. Otherwise, you are dependent on either adding more automation engineers or being okay with untested paths.
One other idea that works great in scenarios like this is to run a sprint where the focus is exclusively on improving automation test coverage. In our team, we use a multitude of tools for automation. This is a result of it being a legacy codebase, but it gives us the flexibility to let automation engineers (SDETs) author tests in the frameworks that work best for them, while developers can use tools like Cypress that better suit their authoring experience.
The challenge with improving quality on a codebase is that it is a moving target. Unless there is ample communication across the team and everyone pitches in to share the workload, it can feel futile. Quality becomes especially difficult to enforce when some teams work solely on time-sensitive projects. We have tried a bunch of things, including sharing the responsibilities and reserving 20% of sprint capacity for tech debt and automation coverage improvements. Sometimes, with certain teams, the only way to get their attention is by preventing them from checking in code.
The other major issue with adding quality gates to an existing codebase is that it can feel odd, because a majority of your codebase may not adhere to the new standards you are setting. That shouldn't prevent you from beginning a culture of quality code and retrofitting your codebase (a huge effort) around these quality gates. Some of these tools did not exist when the original code was written, and in those cases it is an easier sell.
The best way to improve quality is to prevent low-quality code from entering the codebase. This means adding pre-commit hooks that can seem really stringent and difficult for developers to work with. As someone who cares a lot about developer experience, I find this hard to digest. However, if your team is working on removing jQuery from a codebase while another team is working in parallel adding more jQuery, you have a problem. The simplest and most correct way to go about any effort that spans the entire codebase is to add linting rules that prevent the addition of code that would undo your efforts. That should be the first step in tackling tech debt or quality issues.
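When a lint rule cannot be switched on everywhere at once, one alternative (not something described above, just a common pattern) is a "ratchet": record how many violations each file has today, fail the build if any count goes up, and lower the recorded numbers as cleanup progresses. A minimal sketch, assuming you already have a way to count jQuery usages per file; the function names and message format are hypothetical.

```python
def check_ratchet(baseline: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return one message per file whose usage count exceeds its baseline."""
    problems = []
    for path, count in sorted(current.items()):
        allowed = baseline.get(path, 0)  # files not in the baseline allow none
        if count > allowed:
            problems.append(
                f"{path}: {count} jQuery usages, baseline allows {allowed}"
            )
    return problems


def tighten_baseline(baseline: dict[str, int], current: dict[str, int]) -> dict[str, int]:
    """Lower the baseline wherever usage dropped, so gains cannot be undone."""
    tightened = {}
    for path, allowed in baseline.items():
        remaining = min(allowed, current.get(path, 0))
        if remaining > 0:
            tightened[path] = remaining  # fully cleaned files drop out
    return tightened
```

With this in place, the team removing jQuery sees the baseline shrink over time, and a team adding jQuery elsewhere gets a failing check instead of silently undoing that work.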
The other choice in improving the deployment experience with legacy codebases is to improve the frequency with which deployments occur.
The problem with release trains is people trying to get in at the last moment because, if they miss this train, they have to wait another two weeks for their changes to go live. This causes a lot of stress and delay when the dedicated train manager for that sprint is about to cut the RC.
Having more frequent release trains resolves this issue. If someone isn't able to check in code within the time frame, they can just go in the next train. It also makes it easy to introduce a nightly build and run automation against it to ensure no regressions have been introduced.
The challenge with increasing the number of deployments, however, is that you are limited by the number of manual QA folks you have, and by the burnout you would cause by having humans test RCs over and over. Hopefully this can serve as motivation to increase automation coverage.
There are options, which we have tried, where you outsource this kind of manual testing to a service that abstracts it away from you. While this sounds good in theory, in practice we ran into multiple issues with these services:
- Multiple false positives
- Inability to account for known issues (e.g. known environmental bugs)
- Having to translate test cases to a proprietary format per service
There is no silver bullet when it comes to legacy codebases, especially when you have multiple teams working in parallel on a monolith. Hopefully, some of the tips provided here help with dealing with these issues. Depending on your context, you will most likely need a balance of both: improving the quantity and the quality of your deployments.
Friendly reminder to never ship on a Friday.