Software Development Practices

An outline of current tech industry best practices for how to run successful software development projects, covering development, deployment, and operations. Nearly everything on this list could support a full talk.

Org Structure

Traditional tech org structures feature some or all of these classic roles:

The emphasis on automation means that there are development aspects (if not outright developers) in each of these roles, and frequent communication and collaboration mean that they should be supporting each other:

Project Hygiene

The most effective way to share knowledge and to mentor programmers is to pair program regularly. The rule of thumb when pairing a senior and a junior is: if the senior already knows how to approach the problem, the junior drives. Otherwise, the senior drives. Consider starting all greenfield projects with pairing; the core design of the new system is then verified by multiple people, and you have multiple experts on it from the start.

Making pairing the norm has its issues; some people find it difficult to vocalize their internal monologue, and find pairing to cause extreme anxiety and to be very draining. The skeptical pushback against pairing is that it is paying twice as much for the same output. Setting aside that paired programmers are often more productive than they would be solo (eg. they can get immediate answers to questions, they keep each other from getting distracted), you also get mentoring/training, team building, and knowledge sharing for free.

Environments

You may encounter consultants or articles online mandating a certain set of environments for all software projects. These may include, but are not limited to:

These people are well-meaning, but wrong. Environments exist to fulfill needs:

Following a standard set of environments to bootstrap software development projects in your organization is still a Good Thing, but you should be mindful of the purpose and goals of each environment and be ready to pivot to other ways of meeting them if your initial setup stops doing so.

Examples:

Generally, developers will want to build and test locally, but the underlying need here is fast feedback. If your systems and components are small and this process is quick, it can make sense to distribute a development environment (eg. a centrally developed container or VM). If your project takes several minutes to build on commodity hardware, but an upstream build cluster can cut that to several seconds, then building a workflow around that cluster will be faster than clicking a build button in your IDE.

Staging environments are not production environments unless you monitor and maintain them as such, and people are generally unwilling to be bothered off-hours to fix something with no production impact. Because of this, staging environments scale poorly with the number of developers: if you have 5 developers who each break staging once a month, it's only broken about once a week. If you have 100 such developers, staging will be broken almost 5 times every working day.

Testing

It's nearly unanimous at this point that automated testing of some kind is important to software correctness. It is often possible to build a simple program without significant errors and without tests, but it's nearly impossible to maintain a program over time without introducing new errors unless you have automated tests. Testing can get dogmatic quickly, but methodologies generally vary in where and how they place emphasis within a more or less accepted taxonomy:

Unit tests ensure that functions and types defined in the code produce the correct output. If code makes remote calls to system infrastructure that may be difficult to replicate in the testing or development environment, then a fake localized version (called a mock) is used as a stand-in. The behaviour of the mock must also be verified as matching the behaviour of the real system; its "contract" must be tested via contract tests. Finally, programs and routines that take significant user input that is not pre-verified for correctness (parsers, compressors, et al.) should be "fuzz" tested: run repeatedly and rigorously against a battery of subtly incorrect input to ensure they do not crash.
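
As a rough sketch of what this looks like in practice, here is a minimal mocked unit test using Python's standard unittest and unittest.mock modules; the format_report function and the weather client are hypothetical stand-ins, not from any real codebase:

    import unittest
    from unittest import mock

    # Hypothetical code under test: formats a report using a remote weather service.
    def format_report(weather_client, city):
        temp = weather_client.current_temperature(city)  # remote call in production
        return f"{city}: {temp}C"

    class FormatReportTest(unittest.TestCase):
        def test_formats_city_and_temperature(self):
            # The mock stands in for the real weather service; a separate contract
            # test should verify the real service actually behaves this way.
            fake_client = mock.Mock()
            fake_client.current_temperature.return_value = 21
            self.assertEqual(format_report(fake_client, "Oslo"), "Oslo: 21C")
            fake_client.current_temperature.assert_called_once_with("Oslo")

    if __name__ == "__main__":
        unittest.main()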

Integration tests ensure that a software system operates correctly when paired with its deployment environment. In some cases, you can get decent integration tests by running unit tests with your mocks disabled. Your contract tests may be integration tests. This is sometimes called "functional testing" or "system testing", and the supposed differences between them are too subtle to be important.
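
One hedged way to get that "unit tests with mocks disabled" behaviour is to let an environment variable choose between the mock and the real client; the INTEGRATION variable and the weather module below are illustrative assumptions, not a standard convention:

    import os
    import unittest
    from unittest import mock

    def make_weather_client():
        # With INTEGRATION=1, talk to the real service; otherwise use a canned mock.
        if os.environ.get("INTEGRATION") == "1":
            from weather import RealWeatherClient  # hypothetical real client
            return RealWeatherClient()
        fake = mock.Mock()
        fake.current_temperature.return_value = 21
        return fake

    class WeatherClientTest(unittest.TestCase):
        def test_current_temperature_is_numeric(self):
            client = make_weather_client()
            # Passes against the mock (unit test) and, with INTEGRATION=1,
            # against the real service (integration test).
            self.assertIsInstance(client.current_temperature("Oslo"), (int, float))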

Finally, acceptance tests ensure that the system as a whole performs its intended function, not just individual aspects of the system. In smaller systems, these are sometimes not distinguishable from integration tests, and at the level of a command line tool this can be achieved with unit testing. Depending on the project, this may also be an extensive manual test, eg. a trial demonstrating that a blood sugar monitor is likely to meet FDA requirements.
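
At the command-line level, an acceptance test can be as simple as running the built tool the way a user would and checking what it prints; a minimal sketch, assuming a hypothetical wordcount binary that reads stdin:

    import subprocess
    import unittest

    class WordCountAcceptanceTest(unittest.TestCase):
        def test_counts_words_end_to_end(self):
            # Invoke the installed tool as a whole; no internal imports or mocks.
            result = subprocess.run(
                ["wordcount", "-"], input="one two three\n",
                capture_output=True, text=True, check=True,
            )
            self.assertEqual(result.stdout.strip(), "3")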

The most important part of testing is that you have a suite of tests that can be run automatically on new code changes that give you a good base level of confidence that things are functioning properly. This allows developers to see errors and fix them without going through the slower, more expensive testing tiers.

Automation

Automation is an iterative process, and knowing what is worth automating and what isn't is a process of discovery that will be different for every org and every project. As you understand the problem space better, you'll get better at making this decision. Things common to every software project are good candidates for automation up front:

However, building a culture of automation is important. A common approach to automation is to perform a simple calculus like:

If I only spend 7 minutes on this, and I only do it once a week, why spend 2 days automating it? It will take almost 3 years to pay off.
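
Worked out (assuming an 8-hour working day), that calculus is just:

    # Break-even point for automating a 7-minute weekly task with 2 days of work.
    automation_cost_minutes = 2 * 8 * 60   # two working days
    manual_cost_per_week = 7               # minutes spent each week
    weeks_to_pay_off = automation_cost_minutes / manual_cost_per_week
    print(f"pays off after {weeks_to_pay_off:.0f} weeks (~{weeks_to_pay_off / 52:.1f} years)")
    # -> pays off after 137 weeks (~2.6 years)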

It's important to acknowledge that this is actually a comparison of two estimates: the time it will take to automate something, and the amount of time that will be spent in the future doing that thing. Estimates are hard, but the farther out into the future they are the worse they get. Because of this, your time spent automating is likely a better estimate than your time spent manually doing some process.

If you develop a culture of automation, you will begin to see how many assumptions that first estimate of time spent relies upon. These processes change, and manual processes have other costs associated with them:

You will grow to know these likelihoods better the longer a project goes on. In other industries, this is known as experience.

If you start with one process, and a few months later there are 7, you're still only spending about 45 minutes a week. But how likely are you to forget one? How likely are you to remember to hand them all off when you go on holiday? What if some of them have to run in response to a system property, like normalized load reaching 2 or disk utilization reaching 75%? How much time are you spending tracking that?
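
A sketch of what watching such a trigger might look like, using only the Python standard library (Unix-only calls); run_cleanup is a hypothetical stand-in for the formerly manual process:

    import os
    import shutil
    import time

    def disk_utilization(path="/"):
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def run_cleanup():
        print("running cleanup")  # hypothetical manual process, now scripted

    # Poll the triggers instead of asking a person to watch a dashboard.
    while True:
        load_per_cpu = os.getloadavg()[0] / os.cpu_count()
        if load_per_cpu >= 2 or disk_utilization() >= 0.75:
            run_cleanup()
        time.sleep(300)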

Organizations that do not automate tend to develop and reinforce habits that make automation more difficult. People inevitably optimize around their workflows. If part of the 2 days spent automating your process involves adding an API to your component that makes other automation tasks much simpler, it may pay for itself many times over on tasks not yet dreamt of. On the other hand, if you get used to copying elements from some form into some other system, you will optimize around that workflow, and new employees will copy the workflows of the past instead of challenging them.
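
As an illustration of the "add an API" point, here is a minimal sketch that exposes a formerly manual task over HTTP using only the Python standard library; the /tasks/cleanup path and run_cleanup routine are made up for the example:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    def run_cleanup():
        return "cleanup complete"  # hypothetical formerly manual process

    class AutomationHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Exposing the task over HTTP lets schedulers, chatbots, and other
            # automation trigger it without a human clicking through a UI.
            if self.path == "/tasks/cleanup":
                body = run_cleanup().encode()
                self.send_response(200)
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), AutomationHandler).serve_forever()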

Operations

Answers to common questions about operations:

Should developers go on call?

Yes. Sometimes a developer must sacrifice features and bug fixes to improve stability and operability. If developers are not part of operations, they are not directly incentivized to meet an operations budget and are less likely to achieve this balance. The key here is having objective metrics by which to measure "operability" and "stability", and producing a budget. If a system is blowing the budget, features and bugfixes must halt until it is within budget.
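
One concrete way to make "blowing the budget" objective is an availability-style error budget in the spirit of the SRE book; the target and the downtime figure below are illustrative, not prescriptive:

    # Error budget for a 99.9% monthly availability target.
    slo = 0.999
    minutes_in_month = 30 * 24 * 60
    budget_minutes = (1 - slo) * minutes_in_month  # ~43 minutes of allowed downtime
    downtime_so_far = 55                           # minutes of downtime this month

    if downtime_so_far > budget_minutes:
        print("budget exhausted: pause feature work, focus on stability")
    else:
        print(f"{budget_minutes - downtime_so_far:.0f} budget minutes remaining")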

If a team is consistently failing over a long period of time, that's almost certainly a failure of management. If the individuals lack the skills, they should have been replaced. If the workload is too high, then any number of business processes are likely failing.

Should developers have production access?

Probably. This can vary, especially with respect to regulatory requirements. The arguments against allowing developers on prod boil down to a few risk assessments:

The first two are actually very different, but it's very hard to mitigate either a clumsy developer or a disgruntled employee by denying direct prod access when they already have the ability to deploy code to production. Better mitigation strategies exist for both, such as only allowing services access to their dependencies ("defense in depth").

The third is more directly solvable by limiting access, but in practice it can be difficult to determine what went wrong without poking around. It's not that uncommon for subtle behavioural corruption bugs to pop up (or be detected) only once every several dozen computer-years of operation. Having production access is the easiest way of debugging these rare issues, though it is not the only way. Denying developers production access is a successful strategy in very large organizations that are able to build a significant amount of infrastructure to make it less necessary.

In most other organizations (fewer than ~10k developers), it's enough to make it a priority to iteratively reduce the requirement for direct access to prod, while still allowing it for investigation. Verification outside of production is very expensive: if you are having performance issues that are affecting the availability of your service, it's far easier and less damaging to let a developer try out a new index or an experimental build with a small tweak than it would be to re-create the current conditions elsewhere to test it. Other methods that give developers the ability to "test in prod" against conditions that may be prohibitively expensive or difficult to reproduce elsewhere include canary deployments, tagged releases, and feature flags.
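
Feature flags are the simplest of these to sketch; a minimal hand-rolled version is below (real deployments usually use a flag service, and the flag name and user id are made up):

    import hashlib
    import os

    def flag_enabled(flag, user_id, rollout_percent):
        # Environment override acts as a kill switch; otherwise a stable
        # hash-based percentage rollout keeps each user's answer consistent.
        override = os.environ.get(f"FLAG_{flag.upper()}")
        if override is not None:
            return override == "1"
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) % 100 < rollout_percent

    # Try the new code path for 5% of users "in prod" before a full rollout.
    if flag_enabled("new_query_planner", user_id="user-42", rollout_percent=5):
        pass  # experimental code path
    else:
        pass  # existing code path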

For a lot of detailed information on operations, including more structured suggestions and frameworks that can get you off the ground, read the SRE book from Google, even if you have no plans of following its operational model.

Observability ("o11y")

This is a bit of a personal mission. Along with data storage, observability is probably my primary field of expertise, and I personally know many of the people who have written seminal articles on the subject. My company uses the term as one of its primary marketing pushes. So what is it?

Simply put, a system is "observable" if you can easily determine how it is behaving and why it is behaving that way. That is, observability is a property of your system that makes it understandable and fixable.

From an operations perspective, a lot of this will look like metrics and monitoring. What is the system state in our database cluster? How much memory do we use on average? Are there outliers? How do we know when we have to upsize a cluster, or when a new more performant release means we can downsize it?

From a development perspective, this means both determining metrics that can be used to understand what a system is doing and adding hooks to those systems to allow deeper runtime inspection.
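
For example, a minimal sketch of such hooks using the prometheus_client library (assuming it is installed; the metric names and handle_request/do_work functions are invented for the example):

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("app_requests_total", "Requests handled, by outcome", ["outcome"])
    LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

    @LATENCY.time()
    def handle_request(payload):
        try:
            result = do_work(payload)  # hypothetical application logic
            REQUESTS.labels(outcome="ok").inc()
            return result
        except Exception:
            REQUESTS.labels(outcome="error").inc()
            raise

    def do_work(payload):
        return payload

    if __name__ == "__main__":
        start_http_server(8000)  # exposes /metrics for scraping
        handle_request("hello")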

In my company's marketing literature, we tout our platform as offering the "Three Pillars of Observability." In her book "Distributed Systems Observability", Cindy Sridharan devotes a chapter to the concept. The pillars are logs, metrics, and traces.

It's true that these are valuable tools for observing systems, but designing observable systems means thinking carefully about what measurements may have value during operation. As Kelly Sommers memorably ranted: