The great stink in software pipelines

Greg Law Contributor

Greg Law is the co-founder and CTO at Undo.io, a software failure replay platform provider.

It’s the summer of 1858. London. The River Thames is overflowing with the smell of human and industrial waste. The exceptionally hot summer months have exacerbated the problem. But this did not just happen overnight. Failure to maintain an aging sewer system and a growing population that relied on it contributed to a powder keg of effluent, bringing about cholera outbreaks and shrouding the city in a smell that would not go away.

To this day, Londoners still speak of the Great Stink. Recurring cholera infections led to the dawn of the field of epidemiology, a subject in which we have all recently become amateur enthusiasts.

Fast forward to 2020 and you’ll see that modern software pipelines face a similar “Great Stink” due, in no small part, to the vast adoption of continuous integration (CI), the practice of merging all developers’ working copies into a shared mainline several times a day, and continuous delivery (CD), the ability to get changes of all types — including new features, configuration changes, bug fixes and experiments — into production, or into the hands of users, safely and quickly in a sustainable way.

While contemporary software failures won’t spread disease or emit the rancid smells of the past, they certainly reek of devastation, costing billions of dollars and wasting millions of developer hours each year.

This kind of waste is antithetical to the intent of CI/CD. Everyone is employing CI/CD to accelerate software delivery, yet the ever-growing backlog of intermittent and sporadic test failures is doing the exact opposite. It has become a growing sludge, fed with new failures faster than they can be resolved. This backlog must be cleared to get CI/CD pipelines back to their full capabilities.

What value is there in a system that, in an effort to accelerate software delivery, knowingly leaves a backlog of bugs that does the exact opposite? We did not arrive at these practices by accident, and their practitioners are neither lazy nor incompetent. So how did we get here, and what can we do to temper modern software development’s Great Stink?

Ticking time bombs

When you speak to software engineers, 91% of them admit to having “known, but unresolved” software defects in their backlog. Ask any developer whether they have an ever-growing backlog of tests that sometimes fail and no one knows why, and you’ll get a shame-faced nod. Again, these are not the lazy sort, so there must be a reason that nearly every single one is willing to accept potentially disastrous defects in their work. As it turns out, the most common reason provided is quite a good one: reproducibility.

Like the kitchen cockroach that scurried into a hole in the wall when the lights flicked on, these defects appeared somewhere in development, then disappeared into the dark. Engineers are sure they saw them, but aren’t able to find them in the same spot again and are not inclined to start punching holes in the wall. Besides, there is a deadline to meet, and that cockroach that scurried into the hole is almost certainly unrelated to the task at hand.

Of course, this has always been true of software development. However, the mass adoption of CI/CD and its subsequent increase of code and test volume and velocity means that, in our cockroach scenario, not only are more bugs showing up, but the house is also constantly under construction with more rooms and floors getting added every day. More walls, more bugs, less energy to go find them, but an even bigger problem.

Further yet, software engineers admit to regularly ignoring these defects and hoping they go away. But the reality is, they won’t. The tests you sweep under the rug are ticking time bombs just waiting to go off, and it will be your customer who feels the impact. For many, when the explosion happens, it’s dramatic. Considering that we live in a time when “every company is a software company,” these catastrophic failures will reach every corner of our economy.

Unclogging your pipeline

As software steadily becomes more advanced, the challenges associated with CI/CD will continue to grow. While many organizations have come to embrace CI, the continuous delivery (CD) side of the DevOps equation remains a challenge, with even the most advanced practitioners limited by the rate at which bugs, including security flaws, can be discovered and remediated.

According to IBM, bugs that actually make it into production are seven times more costly to fix than those discovered at other stages. The backlog, not caused by but certainly compounded by CI, sends more and increasingly complex software failures into production, leading to an even greater stink and even higher costs.

Of course, there may come a day when machine learning algorithms automate much of the testing process, including determining which types of tests to run; but for now, organizations can streamline the software debugging process by recording software failures — and capturing bugs in the act — at the point of build/test and in production.
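To make the idea of recording a failure at the point of test a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular failure replay product works: the names record_failures, failure_log and parse_quantity are hypothetical, and the sketch captures only the call arguments and the traceback, whereas a real replay tool records far more of the program's execution.

```python
import functools
import traceback


def record_failures(log):
    """Decorator factory: append a record to `log` whenever the wrapped call raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Capture the context while the failure is still "in the act".
                log.append({
                    "function": func.__name__,
                    "args": repr(args),
                    "kwargs": repr(kwargs),
                    "error": type(exc).__name__,
                    "message": str(exc),
                    "traceback": traceback.format_exc(),
                })
                raise  # re-raise so the failure stays visible to the test runner
        return wrapper
    return decorator


# Hypothetical in-memory store; a pipeline would more likely write to a file or service.
failure_log = []


@record_failures(failure_log)
def parse_quantity(text):
    """Hypothetical function under test."""
    return int(text)
```

When parse_quantity("12a") raises, the exception still propagates, but failure_log now holds the arguments and traceback that an engineer would otherwise have to reconstruct from memory.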

What comes next for improving defect resolution

Subsequent improvements in CI/CD will be about making defect resolution bounded, efficient and less skills-dependent. So how can engineering teams confidently deliver quality software on a scheduled, repeatable and automated basis?

The answer is a combination of strategic and tactical ways of addressing the blockages. Here are a few to note:

Prepare for failures. Failures are inevitable; they are going to happen. Agile development best practices are designed to help resolve them fast. Companies that build solutions into their pipeline to remove the guesswork in failure diagnosis will be best positioned to reduce mean time to resolution (MTTR), enabling them to unblock their pipelines and keep their customers happy.
Make test failures repeatable. I can’t emphasize enough the importance of a systematic, repeatable debugging workflow, one that can provide actionable, data-driven insight to assist in defect resolution. Where possible, use tests that are simple and deterministic; where you need more complex tests, ensure you have a means of software failure replay.
Evolve your CI best practices with the pace of your technology. While traditional methods of debugging like static and dynamic analysis are great, new platforms and solutions are available to better scale and adapt to the changing nature of software development.
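
As one small illustration of the "simple and deterministic" advice above, the Python sketch below routes all of a test's randomness through a single seed, so that any failing run can be repeated exactly by reusing that seed. The function names (shuffle_and_pick, run_trial, find_failing_seed) are hypothetical and chosen for illustration only, and the sketch assumes randomness is the only source of nondeterminism; timing, threads and network effects need other techniques.

```python
import random


def shuffle_and_pick(items, rng):
    """Hypothetical code under test: takes its randomness from an injected RNG."""
    pool = list(items)
    rng.shuffle(pool)
    return pool[0]


def run_trial(seed):
    """Run one trial whose only source of randomness is `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return shuffle_and_pick(range(100), rng)


def find_failing_seed(check, seeds):
    """Return the first seed whose trial result fails `check`, or None if all pass."""
    for seed in seeds:
        if not check(run_trial(seed)):
            return seed
    return None
```

If find_failing_seed reports a seed, logging that one number is enough to rerun the identical trial later with run_trial(seed), which turns a failure that appears only occasionally into one that can be reproduced on demand.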

Similar to the infamous Great Stink, software failures are unpleasant, to say the least. In his novel “Little Dorrit,” Charles Dickens said it best when he wrote that the Thames was “a deadly sewer … in the place of a fine, fresh river.”

Parliament in the 19th century waited too long to address its piping problem, which resulted in deaths, huge sums of money spent on short-term fixes and the overall embarrassment of a city that once instilled much pride in its citizens.

By learning from the past and anticipating the future, organizations in 2020 no longer need to learn the hard way and can better position themselves ahead of the inevitable software failure to avoid the catastrophic aftermath.