Overall Application Design & Development
These items cover the questions we should be asking ourselves when embarking on a new application or platform, or when reviewing a design before development begins.
Have we documented all conceivable component failure modes and combinations thereof?
The project should document the identified failure modes of the system, and that document should be updated as new failures are experienced during testing and in production. As a preliminary, we should agree on the resilience guarantees the system is expected to uphold in the face of failure and attack. For example, if a component expects messages from another, should it expect exactly-once or at-least-once semantics? Should it expect message lateness? What about message duplication, reordering or omission failures (infinitely late messages)? Is the system exposed to attackers who may wish to modify messages in flight or spoof messages? See this presentation for a comprehensive list of failure modes to expect.
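As an illustration of why the delivery semantics matter, here is a minimal sketch of a consumer that tolerates at-least-once delivery by ignoring duplicates. The names `Message`, `message_id` and `IdempotentConsumer` are hypothetical, and the sketch assumes the sender attaches a unique ID to every message and that the set of seen IDs fits in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # assumption: a unique ID assigned by the sender
    payload: str


@dataclass
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    seen_ids: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def handle(self, message: Message) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message.message_id in self.seen_ids:
            return False
        self.applied.append(message.payload)
        self.seen_ids.add(message.message_id)
        return True
```

A real system would also need to persist the seen IDs and bound their growth; this sketch only shows the deduplication step.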
Does our design tolerate these failure modes? And if not, have we undertaken a risk assessment to determine whether the risk is acceptable?
Sometimes recovering from failures is preferable to avoiding them in the first place, since avoiding failures by design can be more costly than recovering when they happen. Sometimes attempting to recover from failure can overcomplicate a design unnecessarily, or may not be possible at all. Remember that reliable messaging is almost never an imperative. Overall, it is acceptable in many cases to forgo fault tolerance if the associated costs have been sufficiently evaluated and the risk sufficiently mitigated.
Can the service survive failure without human administrative interaction?
A significant class of frequently observed failures can be recovered from without administrative interaction. Consider a VM/container that becomes unavailable due to cloud-provider maintenance. In this case a health check should fail and the coordination/scheduling layer should intervene and spawn a new VM/container in its place. This is not to say that manual administrative control should not be made available, but rather that common failures should be accounted for.
Are failure paths frequently tested?
Failure paths are just as important to test as success paths, although usually more difficult and time consuming. Distributed failure testing can happen at a slightly different cadence to the delivery cycle (similar to performance testing): it is not practically necessary to run failure tests per commit, but the more regular the better, and certainly before each release. Failure testing can take the form of automated tests that run against and analyse the system, such as Jepsen, and/or failure drills (e.g. failure fridays) where specific identified failure modes are manually triggered. Failure drills have the additional benefit of ‘training’ the team to respond to incidents in a controlled environment.
Are we targeting commodity hardware? (That is, our design does not require special h/w)
Large clusters of commodity servers are much less expensive than the small number of powerful servers they replace. Power consumption scales linearly with the number of servers but cubically with clock frequency, so high-performance servers carry high operational expenditure. In the cloud, RAM capacity adds cost; 2GiB (as may be required to host the JVM and its associated heap occupancy) doubles the cost per unit time (versus a C/C++/Go application). Bespoke hardware for media processing (transcode/encryption/segmentation) is far more expensive than the cloud equivalent.
Are we hosting all users on a single version of the software?
Two factors that make SaaS less expensive to develop and faster to evolve than most package-and-deliver products are a) the software only needs to target a single internal deployment and b) previous versions don’t have to be supported for a decade, as is the case for enterprise-targeted products. Enterprises are used to having significant influence over their vendors and to having complete control over when they deploy new versions (typically slowly). This drives up the cost of their operations and the cost of supporting them, since so many versions of the software need to be supported simultaneously. The most economic services don’t give customers control over the version they run, and only host one version. Holding this single-version software line requires care in not producing substantial user experience changes release-to-release, and a willingness to allow customers that need this level of control to either host internally or switch to an application service provider willing to provide this people-intensive multi-version support. Achieving single-version SaaS requires a good deal of foresight, care and design intuition, especially around the API and pricing model. It also requires finding the right customer set: one that is willing to participate in iterations to converge on the correct service offering.
Is multi-tenancy required? Can we support multi-tenancy without physical isolation?
Multi-tenancy comes with significant engineering complexity, particularly with regard to isolation and fairness. Isolation refers to the requirement that each tenant should have the impression that they are the sole consumer of the system - both from a security perspective and from a performance perspective. Fairness refers to the sharing of resources via e.g. throttling or queuing. Both of these require a great deal of time and energy to accomplish, and give rise to a class of testing that is also difficult. Answering the question of whether multi-tenancy can be leveraged from some underlying cloud provider or avoided altogether is an important exercise. Do not assume that providing multi-tenancy and fairness is an imperative; measure its importance against the customer needs and the team’s technical competence and experience.
Have we implemented (and automated) a quick service health check?
This is the services version of a build verification test. It’s a sniff test that can be run quickly on a developer’s system to ensure that the service isn’t broken in any substantive way. Not all edge cases are tested, but if the quick health check passes, the code can be checked in. Modern cluster schedulers use an HTTP endpoint to identify the health status of a service, and this can be repurposed to support other health status information.
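A quick health check of this kind could be sketched as follows, using only the Python standard library. The `/healthz` path, the check names in `run_checks` and the port are assumptions for illustration, not something a particular scheduler mandates; the stubbed checks would be replaced with real, cheap probes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_checks() -> dict:
    """Run cheap checks; each value is True when the check passes."""
    # Hypothetical checks: replace with real probes (database ping, config load, ...).
    return {"config_loaded": True, "database_reachable": True}


def health_response(checks: dict) -> tuple:
    """Map check results to an HTTP status code and a JSON body."""
    healthy = all(checks.values())
    status = 200 if healthy else 503
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": checks})
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_response(run_checks())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Block and serve the health endpoint; call this from the service entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping `health_response` separate from the handler means the same logic can be run as a local sniff test on a developer's system without starting a server.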
Do our developers work in the full environment? (Requires single server deployment)
For a single logical service (potentially consisting of smaller services), it is pleasant for developers to be able to deploy locally to their laptop. The iterative save-deploy-test cycle gives quick feedback to developers and is cheaper than providing multiple sandbox environments. Again this requires up-front design to identify the set of services (bounded context) to deploy and how to mock dependencies/service calls across this boundary.
Do we have few source code repositories hosting our software?
It is a mistake to segregate code and services at the VCS level - despite the software engineer’s intuition to modularise. If code dependencies within an organisation traverse code repositories then the overhead involved in managing code begins to increase, since each repository must be checked out at the correct version and built/deployed in a specific order. Large-scale refactorings across all related packages can be done very quickly if every package is maintained in one single repository. By contrast, changing an API which affects packages spread across multiple repositories means making a separate commit in every one of those affected repositories. Prefer the mono-repo approach taken by Facebook/Google.
Can we continue to operate in reduced capacity if services (components) we depend on fail?
As components get closer to the customer/consumer, reduced capacity planning becomes more important. Most services are formed of pods or sub-clusters of systems that work together to provide the service, where each pod is able to operate relatively independently. Each pod should be as close to 100% independent as possible, without inter-pod correlated failures. Global services, even with redundancy, are a central point of failure. Sometimes they cannot be avoided, but try to have everything that a cluster needs inside the cluster. Clusters close to the user should attempt to provide partial service - e.g. providing access to all free non-adult content on a video platform if the user identity and parental control service is returning failures.
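The video-platform example could look something like the sketch below. All names here (`Video`, `IdentityServiceError`, `fetch_entitlements` and the `paid`/`adult_allowed` keys) are hypothetical; the point is only that a failure of the identity service narrows what is served rather than failing the whole request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    title: str
    free: bool
    adult: bool


class IdentityServiceError(Exception):
    """Raised when the (hypothetical) identity/parental-control service fails."""


def visible_catalogue(videos, user_id, fetch_entitlements):
    """Return the videos a user may watch, degrading if entitlements are unavailable."""
    try:
        entitlements = fetch_entitlements(user_id)
    except IdentityServiceError:
        # Degraded mode: the user cannot be verified, so serve only content
        # that is safe to show to anyone.
        return [v for v in videos if v.free and not v.adult]
    return [
        v
        for v in videos
        if (v.free or entitlements["paid"])
        and (not v.adult or entitlements["adult_allowed"])
    ]
```

In practice the call to the identity service would also need a timeout, so that a slow dependency degrades the service in the same way as a failing one.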
For rare emergency human intervention, have we worked with operations to come up with recovery plans, and documented, scripted, and tested them?
Do we have an incident management centre for documenting incident responses (e.g. Atlassian Service Desk)? Have we scripted our runbooks (see Nurtch) for restarting/redeploying/recovering services? Are these tested in the same vein as application code?
Does each of our complexity-adding optimizations (if any) give at least an order of magnitude improvement?
Do not optimize at the expense of complexity unless absolutely necessary. Increasing compute resources is almost always preferable to increasing code complexity; compute resources are far cheaper than developer resources. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. "Measuring programming progress by lines of code is like measuring aircraft building progress by weight." - Bill Gates
Have we evaluated the effects that a microservice partition will have on end-to-end latency and throughput?
A service hop in a typical Java-based thread-per-request microservices architecture will result in a 25% (anecdotally) drop-off in throughput (i.e. requests/second) due to thread-scheduling inefficiencies in the container and thread-pool management. Reactor-based systems are typically more efficient (no anecdotal figure). Do not start with microservices; start with a well-organised monolith (I call this latent microservices) and spin out microservices later where needed.
Have we understood the load this service will put on any backend store / services? Have we measured and validated this load?
Different ‘clusters/bounded contexts’ within an overarching service need to be protected from one another. A core service can flood the reporting or backend API services if autoscaling is unbounded.
Have we enforced admission control at all levels (and between logical contexts within the overall application architecture)?
Once we understand the inter-service load limitations, we can enforce admission control. Message queueing and stream processing can effectively throttle and control message rates between clusters/bounded contexts, and can dampen the load spikes and troughs.
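One common way to enforce admission control at a service boundary is a token bucket, sketched below. This is one illustrative technique rather than the only option; the class name, the `rate` and `capacity` parameters and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """Admits up to `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last_refill = clock()

    def try_admit(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A caller that receives `False` can reject the request, queue it, or shed it, depending on what the downstream context can tolerate. This sketch is not thread-safe; a shared bucket would need a lock around `try_admit`.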
Have we understood the network design and reviewed it with networking specialists?
Network design is often one of the most hidden aspects of the overall solution from the architect/developer perspective, since it is typically owned by the site-reliability engineering team. In addition, the flow of information from SRE to the development team is normally poorer than in the opposite direction, since development teams have recently become more accustomed to the develop/demo mode and cadence. It is important that architects and developers have a working understanding of the network design since it directly affects the failure modes and attack vectors of the software system. Review the network design with the development teams and have it reviewed externally if possible.
Have we analysed throughput and latency and determined the most important metric for capacity planning?
Throughput and latency requirements have a direct impact on service design (micro-service design in particular) and therefore must be understood before delivery begins. High-throughput / low-latency services come at the cost of increased memory footprint since they require large caches, large connection pools, etc. This in turn drives up operational cost, so identifying and designing around this early in the project may pay dividends.
Is everything versioned? The goal is to run single-version software, but multiple versions will always exist during rollout and testing etc. Versions n and n+1 of all components need to peacefully co-exist.
During the build step, bake the VCS hash and the version coordinates into the build artifact and have that made available via some endpoint (such as HTTP).
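As a rough sketch, the baked-in coordinates could be surfaced like this. The environment variable names `APP_VERSION` and `APP_VCS_HASH` are hypothetical, and the sketch assumes the build pipeline sets them in the artifact or image (for example from the output of `git rev-parse HEAD`); a generated source file or manifest would work equally well.

```python
import json
import os


def version_info(env=os.environ) -> dict:
    """Read the version coordinates the build step is assumed to have baked in."""
    return {
        "version": env.get("APP_VERSION", "unknown"),
        "vcs_hash": env.get("APP_VCS_HASH", "unknown"),
    }


def version_body(env=os.environ) -> str:
    """JSON body to return from a version endpoint (e.g. an HTTP GET)."""
    return json.dumps(version_info(env), sort_keys=True)
```

Falling back to "unknown" rather than failing keeps the endpoint usable in local development builds that were not produced by the pipeline.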
Have we listed the durability and availability guarantees of the underlying cloud infrastructure to determine the overall availability for our applications (e.g. 4 9s)?
Work out the calculus of service availability for the cloud infrastructure components you rely on and use it to determine service-level objectives (SLOs) for the platform. Do not attempt to attain an SLO that is excessively high, since it will distort the design unnecessarily with no perceptible gain from the user perspective. Define an internal SLO that is higher than the public-external SLO by a factor of 10. E.g. aim for a public statement of 99.99% availability within the year but internally set this at 99.999% as an SLO. A service cannot be more available than the intersection of all its critical dependencies. If you have a critical dependency that does not offer enough 9s (a relatively common challenge!), you must employ mitigation to increase the effective availability of your dependency (e.g. via a capacity cache, failing open, graceful degradation in the face of errors, and so on). Map the dependency model between services and apply the Rule of 9.
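The arithmetic behind this can be sketched in a few lines. The sketch assumes every dependency is critical and that failures are independent, so availabilities multiply; real dependencies often fail in correlated ways, so treat the result as a rough upper bound. The function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_availability(dependencies):
    """Upper bound on availability when every listed dependency is critical.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(dependencies)


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes per (non-leap) year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

For example, `serial_availability([0.9999, 0.9999, 0.999])` is roughly 0.9988, which is below three 9s even though two of the three dependencies offer four 9s. `downtime_minutes_per_year(0.9999)` is about 52.6 minutes.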
Read and appreciate 10.
References and Further Reading
Specifically this and this.