An outline of current tech industry best practices for how to run successful software development projects, covering development, deployment, and operations. Nearly everything on this list could support a full talk.

### Org Structure

Traditional tech org structures feature some or all of these classic roles:

* support
* testing
* operations
* development

The emphasis on automation means that there are development aspects (if not outright developers) in each of these roles, and frequent communication and collaboration mean that they should be supporting each other:

- Support knows what problems customers are facing. In order to do their jobs better, they may need to develop tools to look up user data, automatically verify workflows, etc. These tools may feed into the testing automation suite. At its full expression, support is able to fix minor issues with the product and free developers to work on strategic aims.
- Testing as its own department is often absent from tech startups; writing automated tests is the job of the developer. In other industries (e.g. medical devices), developers may not understand requirements well enough to write appropriate tests. Testers should be focused on building and maintaining a rigorous suite of automated acceptance tests.
- Operations has become "DevOps" in an ambitious rebrand. The reality is that, for most companies, cloud services have eliminated the need to have a team of datacentre designers and people to rack servers, run cables, deal with hardware vendors, etc. Operations now builds and maintains the development and deployment tooling and environments, in addition to infrastructure-level projects like service discovery, configuration management, feature flags, etc.
- Development is focused on the product. They work intimately with operations to ensure that the product meets operability objectives (SLOs) and receive feedback from testing and support to ensure that production has a low defect rate.
Development may also build tools and APIs specifically for operations, support and testing.

### Project Hygiene

- use a revision control system (git, perforce, hg, .. something)
- have an established workflow for working on features/bugfixes and merging them:
  - make this workflow easier to use than not via tooling (positive reinforcement)
  - make this workflow painful to not follow (negative reinforcement)
  - make it possible to opt out for exceptional cases (e.g. force-push to remove accidentally committed sensitive data)
- automate tests, builds, code analysis tooling (test coverage, linting, automated error detection)
- use a code formatter (go fmt, [google-java-format](https://github.com/google/google-java-format), [python's 'black'](https://github.com/ambv/black), clang-format, etc)
- establish a system of code review
  - it's easy for this to devolve into requiring another person to briefly glance and then +1 each release; avoid this antipattern and iterate on your review process
- avoid bus-factor-1 projects with pairing, documentation, and scheduled demos/presentations
- _speed is a feature_
  - if part of your workflow is slow, it will impact _all_ of your developers

The most effective way to share knowledge and to mentor programmers is to pair program regularly. The rule of thumb when pairing a senior and a junior is: if the senior knows how to approach the problem already, the junior drives. Otherwise, the senior drives.

Consider starting off all greenfield projects via pairing; the core design of your new system will then have been verified by multiple people, and you have multiple experts up front.

Making pairing the norm has its issues; some people find it difficult to vocalize their internal monologue, and find that pairing causes extreme anxiety and is very draining.

The skeptical pushback against pairing is that it is paying twice as much for the same output. Setting aside that paired programmers are often more productive than they would be solo (e.g.
they can get immediate answers to questions, they keep each other from getting distracted), you also get mentoring/training, team building, and knowledge sharing for free.

### Environments

You may encounter consultants or articles online mandating a certain set of environments for all software projects. These may include, but are not limited to:

- a development environment
- a staging environment
- a build & test environment (CI/CD)
- a UAT environment
- a production environment

These people are well-meaning, but wrong. Environments exist to fulfill needs:

- a place for developers to work
- a place to build your system
- a place to run automated tests
- a place to sanity check releases
- a place to test how a release will integrate with the wider platform
- a place to run your platform for your users (for SaaS & online systems)

Following a standard set of environments to bootstrap software development projects in your organization is still a Good Thing, but you should be mindful of the purpose and goals of each environment and be ready to pivot to other methods of meeting them if your initial setup begins to fail to do so.

Examples:

Generally, developers will want to build/test locally, but the underlying need here is fast feedback. If your systems and your components are small and this process can be done quickly, then it can make sense to distribute a development environment (e.g. a centrally developed container or VM). If your project takes several minutes to build on commodity hardware, but having an upstream build cluster can cut that to several seconds, then building a workflow around that will be faster than clicking a build button in your IDE.

Staging environments are not production environments unless you monitor and maintain them as such, and people are generally unwilling to be bothered off-hours to fix something with no impact.
Because of this, they scale poorly with the count of developers; if you have 5 developers who each break staging once a month, then it's only broken about once a week. If you have 100 such developers, staging will be broken almost 5 times per working day.

### Testing

It's nearly unanimous at this point that automated testing of some kind is important to software correctness. It is often possible to build a simple program without significant errors and without tests, but it's nearly impossible to maintain a program over time without introducing new errors unless you have automated tests.

Testing can get dogmatic quickly, but methodologies generally vary only in where and how they place emphasis within a more or less accepted taxonomy:

* unit tests
* contract tests
* fuzzing
* integration tests (also: functional tests, system tests)
* acceptance tests

Unit tests ensure that functions and types defined in the code produce the correct output. If code is making remote calls to system infrastructure that may be difficult to replicate in the testing or development environment, then a fake localized version (called a Mock) is used as a stand-in. The behaviour of the mock must also be verified as matching the behaviour of that system; its "contract" must be tested via contract tests.

Finally, programs and routines that take significant user input that is not pre-verified for correctness (parsers, compressors, et al.) should be "fuzz" tested: run repeatedly and rigorously against a battery of subtly incorrect input to ensure they do not crash.

Integration tests ensure that a software system operates correctly when paired with its deployment environment. In some cases, you can get decent integration tests by running unit tests with your mocks disabled. Your contract tests may be integration tests. This is sometimes called "functional testing" or "system testing", and the supposed differences between them are too subtle to be important.
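
The relationship between a mock and its contract test, described above, can be sketched roughly as follows. This is a minimal illustration, not a prescribed design: the key-value store, `FakeKeyValueStore`, `check_store_contract`, and `greeting` are all hypothetical names invented for the example. The idea is that the same contract check runs against the fake in unit tests and against the real client wherever the real service is reachable, so the two cannot silently drift apart.

```python
class FakeKeyValueStore:
    """In-memory stand-in (the "mock") for a hypothetical remote key-value service."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Missing keys return None rather than raising.
        return self._data.get(key)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


def check_store_contract(store):
    """Behaviour that every implementation, real or fake, is assumed to share."""
    assert store.get("missing") is None
    store.put("user:1", "alice")
    assert store.get("user:1") == "alice"
    store.put("user:1", "bob")  # a second put overwrites the first
    assert store.get("user:1") == "bob"
    store.delete("user:1")
    assert store.get("user:1") is None
    store.delete("user:1")  # deleting twice must not raise
    return True


def greeting(store, user_id):
    """Code under test; it depends only on the store interface."""
    name = store.get(f"user:{user_id}")
    return f"hello, {name}" if name is not None else "hello, stranger"
```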
Finally, acceptance tests ensure that the system as a whole performs its intended function, not just individual aspects of the system. In smaller systems, these are sometimes not distinguishable from integration tests, and at the level of a command line tool this can be achieved with unit testing. Depending on the project, this may also be an extensive manual test, e.g. a trial to show that a blood sugar monitor will likely meet FDA requirements.

The most important part of testing is that you have a suite of tests that can be run automatically on new code changes and that gives you a good base level of confidence that things are functioning properly. This allows developers to see errors and fix them without going through the slower, more expensive testing tiers.

### Automation

Automation is an iterative process, and knowing what is worth automating and what isn't is a process of discovery that will be different for every org and every project. As you understand the problem space better, you'll get better at making this decision.

Things common to _every_ software project are good candidates for automation up front:

* CI/CD
* workflow automation (publishing new builds, generating changelogs, etc)
* deployment automation (in SaaS; deploying builds, rollbacks, migrations)
* operations automation (com/decom of servers, configuration management)

Beyond these, building a culture of automation is important. A common approach to automation is to perform a simple calculus like:

> If I only spend 7 minutes on this, and I do it once a week, why spend 2 days automating it? It will take almost 3 years to pay off.

It's important to acknowledge that this is actually a comparison of two estimates: the time it will take to automate something, and the amount of time that will be spent in the future doing that thing. Estimates are hard, but the farther out into the future they are the worse they get.
Because of this, your time spent automating is likely a better estimate than your time spent manually doing some process.

If you develop a culture of automation, you will begin to see how many assumptions that first estimate of time spent relies upon. These processes change, and manual processes have other costs associated with them:

- Will other people have to do this? Then the process requires documentation, education/training, etc.
- Is the frequency of this operation controllable?
  - If dataset size, user count, etc can change the frequency, it may take more of your time in future.
- Is it frequency-sensitive? Is it reliably schedulable? Backups and DB vacuums are classic automation tasks because they are repetitive but also because _forgetting_ to do them is harmful.
- Is it complex, repetitive, or failure prone?
  - A script is often more reliable
- Could it be time-sensitive in a failure scenario? How much will the delay cost your organization?
  - Computers are faster than humans
- Could similar tasks be automated like this? Is this a component of some other process? Can future automation be built upon this, compounding time savings/reliability?

You will grow to know these likelihoods better the longer a project goes on. In other industries, this is known as _experience_.

If you start with one process, and a few months later there are 7, you're still only spending about 50 minutes a week. But how likely are you to forget some? How likely are you to remember to hand them off if you go on holiday? What if you have to run them in response to some system property like normalized load reaching 2, or disk utilization reaching 75%? How much time are you spending tracking that?

Organizations that do not automate tend to develop and reinforce habits that make automation more difficult. Components inevitably optimize around their workflows.
If part of the 2 days to automate your process involves adding an API to your component that makes other automation tasks much simpler, this may pay for itself many times over during its lifetime on as yet undreamt-of tasks. On the other hand, if you get used to copying elements from some form into some other system, you will optimize around that workflow. New employees will copy the workflows of the past instead of challenging them.

### Operations

Answers to common questions about operations:

#### Should developers go on call?

Yes.

Sometimes a developer must sacrifice features and bug fixes to improve stability and operability. If developers are not part of operations, they are not directly incentivized to meet an operations budget and are less likely to achieve this balance.

The key here is having objective metrics by which to measure "operability" and "stability", and producing a budget. If a system is blowing the budget, features and bugfixes must halt until it is within budget.

If a team is consistently failing over a long period of time, that's almost certainly a failure in management. If the individuals lack the skills, then they should have been replaced. If the workload is too high, then any number of business processes are likely failing.

#### Should developers have production access?

Probably. This can vary, especially with respect to regulatory requirements.

The arguments against allowing developers on prod boil down to a few risk assessments:

- developers can mistakenly cause outages or damage data
- developers can _purposefully_ cause outages or damage data
- the state of the production environment becomes unknowable to operations

The first two are actually very different, but it's very hard to mitigate either a clumsy developer or a disgruntled attacker by denying direct prod access when they are still given the ability to deploy code to production.
Better mitigation strategies against these exist, such as only allowing services access to their dependencies ("security in depth").

The third is more directly solvable by limiting access; however, in practice it can be difficult to determine what went wrong without poking around. It's not that uncommon to have subtle behavioural corruption bugs only pop up (or be detected) once every several dozen _computer years_ of operation. Having production access is the easiest way of debugging these rare issues, though it is not the only way.

Denying developers production access is a successful strategy in very large organizations that are able to build a significant amount of infrastructure to make it less necessary. In most other organizations (fewer than ~10k developers), it's enough to make it a priority to iteratively reduce the requirement for direct access to prod, but allow it for investigation.

Verification is _very_ expensive. If you are having performance issues that are affecting the availability of your service, it's far easier and less damaging to allow a developer to try out adding a new index or an experimental build with a small tweak than it would be to re-create the current conditions to test it.

Other methods that give developers the ability to "test in prod" against conditions that may be prohibitively expensive or difficult to reproduce elsewhere include canary deployments, tagged releases, and feature flags.

For a lot of detailed information on operations, including more structured suggestions and frameworks that can get you off the ground, read the [SRE book](https://landing.google.com/sre/sre-book) from Google, even if you have no plans of following its operational model.

#### Observability ("o11y")

This is a bit of a personal mission.
Along with data storage, observability is probably my primary field of expertise, and I personally know many of the people who have written [seminal](https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c) articles [on the subject](https://www.vividcortex.com/blog/monitoring-isnt-observability). My company uses the term as one of its primary marketing pushes. So what is it?

Simply put, a system is "observable" if you can easily determine how it is behaving and _why_ it is behaving that way. That is, observability is a property of your system that makes it understandable and fixable.

From an operations perspective, a lot of this will look like metrics and monitoring. What is the system state in our database cluster? How much memory do we use on average? Are there outliers? How do we know when we have to upsize a cluster, or when a new, more performant release means we can downsize it?

From a development perspective, this means both determining metrics that can be used to understand what a system is doing and adding hooks to those systems to allow deeper runtime inspection.

- Index hit rate, queries per second; these are metrics. EXPLAIN ANALYZE is inspection.
- Request throughput and latency are metrics, but being able to pull up a list of active requests along with their durations and a cost estimation is part of observability.
- If a request has a latency of 700 msec, and it has hit 9 different services, do you know where it has spent most of its time?
- Do you _know_ that operations you expect to run in parallel are not running sequentially?

In my company's marketing literature, we tout our platform as offering the "Three Pillars of Observability." In her book "Distributed Systems Observability", Cindy Sridharan has [a chapter](https://www.oreilly.com/library/view/distributed-systems-observability/9781492033431/ch04.html) devoted to the concept.
The pillars are:

- logs/events
- traces
- metrics

It's true that these are valuable tools for observing systems, but _designing observable systems_ means carefully thinking about what measurements may have value during operation. As [Kelly Sommers](https://twitter.com/kellabyte) memorably ranted:

![](https://cdn-images-1.medium.com/max/1600/1*_X85QCeM60sP7se2GqneSA.png)
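
As a rough illustration of the kind of question traces are meant to answer (where did a 700 msec request spend its time, and did the calls we expected to run in parallel actually overlap?), here is a minimal sketch. The `Span` type, the service names, and the timings are all hypothetical; a real system would record spans with a tracing library rather than hand-written values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One timed unit of work within a request (hypothetical, simplified)."""
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def slowest(spans):
    """Return the span that accounts for the most wall-clock time."""
    return max(spans, key=lambda s: s.duration_ms)


def ran_in_parallel(a, b):
    """True if the two spans overlap in time, False if one finished first."""
    return a.start_ms < b.end_ms and b.start_ms < a.end_ms


# Made-up spans for a single request with 700 msec of total latency.
spans = [
    Span("auth", 0, 40),
    Span("profile", 40, 140),
    Span("recommendations", 40, 640),
    Span("render", 640, 700),
]
```

With these made-up numbers, `slowest(spans)` points at the `recommendations` span, and `ran_in_parallel` confirms that `profile` and `recommendations` overlapped while `auth` and `profile` ran one after the other.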