Auditing, Monitoring and Alerting
Engineers can’t (easily) instrument a service once it is deployed. Make a substantial effort during development to ensure that performance data, health data, throughput data, etc. are all produced by every component in the system.
Do we have automated testing that takes a customer view of the service?
I have previously referred to this as synthetic testing. The idea is that you have a special user classification responsible for continuing to exercise a significant proportion of the use cases on a regular basis. This may entail creating a special tenancy or administrative user with superuser access rights to the system. The client performs a set of use-case interactions and pushes a report to an object store such as an S3 bucket. If the report arrives as a failure, or does not arrive within N minutes, then an alert is triggered. This can be considered live or continuous smoke testing, and it ensures that support and reliability engineers are the first to know of incidents (rather than waiting for customers to phone in). This does not remove the need for ongoing component testing and other types of infrastructure testing.
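The alert-on-missing-report logic above can be sketched as a small watchdog check. This is a minimal illustration, not a real monitoring product: the report shape, the "pass" status value, and the 15-minute deadline are all assumptions.

```python
import time

# Hypothetical deadline for the synthetic-test report (an assumption).
REPORT_DEADLINE_MINUTES = 15

def should_alert(report, now_epoch):
    """Alert if the report is missing, reports a failure, or arrived late."""
    if report is None:
        return True                      # report never arrived
    if report.get("status") != "pass":
        return True                      # one or more use cases failed
    age_minutes = (now_epoch - report["uploaded_at"]) / 60
    return age_minutes > REPORT_DEADLINE_MINUTES

# Example: a passing report uploaded 5 minutes ago does not alert.
fresh = {"status": "pass", "uploaded_at": time.time() - 5 * 60}
print(should_alert(fresh, time.time()))   # False
```

In practice the watchdog would poll the object store (e.g. the S3 bucket mentioned above) rather than receive the report in memory.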
Have we instrumented every customer interaction that flows through the system? Are we reporting anomalies?
Synthetic testing (as outlined in the previous point) is quick and has good coverage but is seldom exhaustive. Each role should report performance, health and throughput data, with alerts raised when these metrics deviate beyond a threshold.
Do we have sufficient data to understand the normal operating behaviour?
Collecting normal operating data allows us to understand what normal behaviour looks like so that we can detect anomalous behaviour, and it allows us to understand user interactions so that we can perform capacity, longevity and performance testing. However, be aware of the storage costs of this data, and attempt to compress it (binary or temporal compression) before storing it on cost-effective durable storage (such as Amazon Glacier).
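One simple way to use a normal-operating baseline for anomaly detection is a standard-deviation test. This is a hedged sketch: the window of history, the three-sigma threshold, and the sample values are illustrative assumptions.

```python
import statistics

def is_anomalous(history, sample, threshold=3.0):
    """Flag a sample that deviates more than `threshold` standard
    deviations from the historical baseline."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > threshold

baseline = [100, 102, 98, 101, 99, 100, 103, 97]   # e.g. requests/sec
print(is_anomalous(baseline, 101))  # False: within normal variation
print(is_anomalous(baseline, 250))  # True: well outside the baseline
```

Real systems often use rolling windows and seasonality-aware models, but even this crude check is only possible if normal operating data has been collected.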
Are we tracking the alerts:trouble-ticket ratio (goal is near 1:1)?
This metric tells us how automated our recovery systems are. If each alert creates a new trouble-ticket, then root causes are not being eliminated. If alerts create no new trouble-tickets, then the alerts are noise and will start to be ignored.
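The ratio check can be expressed directly. The tolerance below is an illustrative assumption; the goal of "near 1:1" comes from the point above.

```python
def ratio_health(alerts, tickets, tolerance=0.2):
    """Interpret the alerts:trouble-ticket ratio; the goal is near 1:1."""
    if tickets == 0:
        return "noise: alerts with no tickets will be ignored"
    ratio = alerts / tickets
    if abs(ratio - 1.0) <= tolerance:
        return "ok"
    return "investigate: ratio %.2f deviates from 1:1" % ratio

print(ratio_health(100, 100))  # ok
print(ratio_health(300, 100))  # investigate: ratio 3.00 deviates from 1:1
```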
Do we have sufficient instrumentation to detect throughput and latency issues?
Throughput and latency of core service use cases should be recorded and analysed. Are there any patterns we can identify that cause increased latency? Does latency increase during scaling events or during sharding? Does throughput reduce the deeper we go into core services? For each service, a metric should emerge for capacity planning, such as user requests per second per system, concurrent online users per system, or some related metric that maps the relevant workload to resource requirements.
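Once such a metric emerges, capacity planning becomes arithmetic. A sketch, assuming a hypothetical per-instance throughput ceiling and a 70% headroom target (both values are assumptions for illustration):

```python
import math

def instances_required(peak_requests_per_sec, per_instance_rps, headroom=0.7):
    """Size the fleet so each instance runs at `headroom` of its ceiling."""
    usable = per_instance_rps * headroom
    return math.ceil(peak_requests_per_sec / usable)

# e.g. a 5,000 req/s peak on instances measured at 400 req/s each:
print(instances_required(5000, 400))  # 18
```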
Do we have performance counters for all operations? (at least latency and number ops/sec data)
Record at least the latency of operations and the number of operations per second. Sudden swings in these values are a huge red flag.
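A minimal in-memory counter for these two values might look like the following. A real service would export these figures to its metrics pipeline rather than hold them in process memory; the class and its API are illustrative.

```python
import time

class OpCounter:
    """Tracks mean latency and operations per second for one operation."""

    def __init__(self):
        self.count = 0
        self.total_latency = 0.0
        self.started = time.time()

    def record(self, latency_seconds):
        self.count += 1
        self.total_latency += latency_seconds

    def mean_latency(self):
        return self.total_latency / self.count if self.count else 0.0

    def ops_per_sec(self):
        elapsed = max(time.time() - self.started, 1e-9)
        return self.count / elapsed

c = OpCounter()
for latency in (0.010, 0.012, 0.008):
    c.record(latency)
print(round(c.mean_latency(), 3))  # 0.01
```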
Is every operation audited?
Every time somebody does something, especially something significant, log it. This serves two purposes: first, the logs can be mined to find out what users are doing (e.g. the kinds of queries they run), and second, it helps in debugging a problem once it is found.
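Structured audit entries make both purposes (mining and debugging) easier. A sketch, with field names that are assumptions rather than any standard schema:

```python
import json
import time

def audit(user, action, target, **details):
    """Emit a structured audit record for one user action."""
    entry = {
        "ts": time.time(),
        "user": user,          # an individual account, never a shared login
        "action": action,
        "target": target,
        "details": details,
    }
    print(json.dumps(entry))   # in practice: append to a durable audit log
    return entry

rec = audit("alice", "query", "orders", filter="status=shipped")
```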
Do we have individual accounts for everyone who interacts with the system?
Part of the importance of auditing is attribution. If users are sharing accounts then auditing loses its value and the system will lose standards compliance.
Are we tracking all fault-tolerant mechanisms to expose failures they may be hiding?
Fault tolerance may have the side effect of masking underlying issues. Ensure that when a fault-tolerance provision kicks in, it is logged and potentially alerted upon. Do not use fault tolerance as a safety net for failures - failures should not occur even if fault-tolerance mechanisms are doing their job effectively.
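A retry helper is a common fault-tolerance mechanism that silently hides failures unless it logs them. A sketch of one that never hides what it absorbs (the attempt count is an illustrative default):

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("fault-tolerance")

def with_retries(fn, attempts=3):
    """Retry fn, logging every masked failure so it can be counted
    and alerted on rather than disappearing."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise
```

Because every absorbed failure is logged at WARNING, a rising count of these messages exposes a degrading dependency even while retries keep the service nominally healthy.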
Do we have per-entity / entity-specific audit logs?
Include the specific business entity (document, account, catalogue item) as a log field, especially if the entity has a lifecycle. This can be used later to trace that entity's path through the system. For example, for a video transcode, audit each phase of its lifecycle (e.g. ingest, verification, transcode, quality checking, export), including the identifier for the job. It is easy to identify partial executions of the lifecycle if this data is collated against the entity.
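The video-transcode example can be sketched as follows. The phase names come from the text above; the in-memory log and job identifiers are assumptions for illustration.

```python
# Lifecycle phases for the video-transcode example.
PHASES = ["ingest", "verification", "transcode", "quality_checking", "export"]

audit_log = []

def audit_phase(job_id, phase):
    """Record one lifecycle phase against the entity's identifier."""
    audit_log.append({"job_id": job_id, "phase": phase})

def incomplete_jobs(log):
    """Jobs whose recorded phases stop short of the full lifecycle."""
    seen = {}
    for entry in log:
        seen.setdefault(entry["job_id"], set()).add(entry["phase"])
    return [job for job, phases in seen.items() if phases != set(PHASES)]

for phase in PHASES:
    audit_phase("job-1", phase)       # job-1 completes every phase
for phase in PHASES[:2]:
    audit_phase("job-2", phase)       # job-2 stalls after verification

print(incomplete_jobs(audit_log))     # ['job-2']
```

Collating the phases against the job identifier is what makes the partial execution of `job-2` trivially visible.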
Are we keeping historical performance and log data?
One of the difficult jobs in performance testing is target setting. Historical data from the service and similar services can help form expectations around how a service should perform, and thus highlight potential performance improvements if the system falls short of these expectations. We may also want to ensure that performance is trending upwards after engineering effort is invested in performance improvements. Again, having historical data is critical for quantifying these results.
Is logging configurable without needing to redeploy?
Sometimes we need to elevate logging temporarily within a service - we should be able to do so without redeploying. Sometimes we need to elevate logging for a specific request - we should consider supporting additional fields (or signed payloads) that enable trace logging for that request. Consider signals such as SIGUSR2 to trigger thread dumps for non-VM execution environments.
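Signal handlers can also toggle the log level at runtime without a redeploy. A sketch, assuming a POSIX environment; the choice of SIGUSR1 (leaving SIGUSR2 free for thread dumps) and the logger name are illustrative:

```python
import logging
import signal

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("service")

def toggle_debug(signum, frame):
    """Flip between INFO and DEBUG when the signal arrives."""
    new_level = (logging.DEBUG
                 if log.getEffectiveLevel() > logging.DEBUG
                 else logging.INFO)
    log.setLevel(new_level)
    log.warning("log level now %s", logging.getLevelName(new_level))

signal.signal(signal.SIGUSR1, toggle_debug)
```

An operator can then run `kill -USR1 <pid>` to turn verbose logging on, and again to turn it off, with no restart.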
Are we exposing suitable health information for monitoring?
At a minimum, services should expose an HTTP endpoint returning 200, or non-200 in the case of failures. If the endpoint also uses the same internal messaging as normal application messaging to derive the health status, then a slow response can also be an indicator of health problems. The service may additionally report on its own perspective, which can be useful for diagnostic purposes. For example, if a service achieves redundancy through a master-slave setup, having each component report whether it is master or slave, as well as what it believes its master/slave peers are, can help diagnose deadlock situations (e.g. a phantom master).
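Deriving such a response can be sketched as a pure function, separate from whatever HTTP framework serves it. The field names and the 503 failure status are illustrative assumptions:

```python
def health_response(internal_ping_ok, role, peers):
    """Return (http_status, body) for a hypothetical /health endpoint."""
    status = 200 if internal_ping_ok else 503
    body = {
        "healthy": internal_ping_ok,
        "role": role,      # e.g. "master" or "slave"
        "peers": peers,    # who this node believes its peers are
    }
    return status, body

status, body = health_response(True, "master", {"slave": "node-2"})
print(status)   # 200
```

Comparing the `role` and `peers` fields across nodes is what surfaces the phantom-master case: two nodes both reporting `"master"` is immediately visible.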
Is every error that we report actionable?
Error reports should be as helpful as possible without exposing implementation details or inappropriate abstractions. Take time to consider the probable causes of errors and make helpful suggestions for remediation. Communicate low-level details via error codes documented internally.
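One way to combine a stable internal code with a remediation hint, sketched below; the error code, catalogue, and message wording are all hypothetical:

```python
# Hypothetical internal catalogue mapping codes to remediation hints.
ERROR_CATALOG = {
    "E1042": "Upload exceeds the per-file size limit; split the file "
             "or request a limit increase.",
}

class ServiceError(Exception):
    """An actionable error: stable code plus a remediation suggestion,
    with low-level detail kept behind the internally documented code."""

    def __init__(self, code):
        self.code = code
        hint = ERROR_CATALOG.get(code, "Contact support quoting this code.")
        super().__init__(f"[{code}] {hint}")

try:
    raise ServiceError("E1042")
except ServiceError as err:
    message = str(err)

print(message)
```

The customer sees a suggestion they can act on; support staff can look up `E1042` internally for the implementation-level cause.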
Can we snapshot system state for debugging outside of production?
Have we provisioned the production environment with the necessary tooling to support dumping and snapshotting? For example, utility applications such as jmap allow production state to be snapshotted and exported from the system for offline analysis.
Are we recording all of the licenses that our software is using and assuring our compliance with their restrictions?
Be diligent in applying license headers to all authored code (especially publicly hosted/open-source files). Include a LICENSE file in distributed source along with a THIRD-PARTY file containing all licenses of third-party dependencies.
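For Python services, the contents of such a THIRD-PARTY file can be drafted from installed package metadata. A sketch only: real dependency sets should come from your build tool, and this walks whatever happens to be installed in the current environment.

```python
import importlib.metadata

def third_party_text():
    """Collect each installed distribution's declared license into
    text suitable for a THIRD-PARTY file."""
    lines = []
    for dist in importlib.metadata.distributions():
        name = dist.metadata.get("Name", "unknown")
        license_name = dist.metadata.get("License") or "see project metadata"
        lines.append(f"{name}: {license_name}")
    return "\n".join(sorted(lines))

# with open("THIRD-PARTY", "w") as fh:
#     fh.write(third_party_text())
```

Output like this still needs human review - declared license metadata is often missing or imprecise, and compliance obligations depend on how each dependency is distributed.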