From "Why SRE Documents Matter" - Shylaja Nukala and Vivek Rau. 2018
What is it? What does it do? Describe at a high level the functionality provided to clients (end users, components, etc.).
Explain how the architecture works. Describe the data flows between components. Consider adding a system diagram with critical dependencies, and request and data flows.
Clients and Dependencies
List any upstream clients (owned by other teams) that rely on it and downstream services (owned by other teams) that it relies on. (These can also be shown in the system diagram.)
Code and Configs
Explain the production setup. Where does it run? List binary names, jobs, data centers, and config file setup, or point to the canonical location of these. Also provide code location and build info if relevant.
List and describe the configuration files, changes, and ports needed to operate this product or service.
Address the following: What configuration files have been modified for this product or service? How is the configuration handled?
Address the following: What daemons and other processes must be running to carry out the service? What control scripts were created to manage this service?
List and describe the log files created by or within the component and the monitoring running against it. Address the following: What log files are generated by the component? What does each file contain? What recommendations do you have for examining these log files? What aspects of the component must be monitored to ensure reliable service?
Dashboards and Tools
Link to the relevant dashboards and tools.
List the capacity of a single instance, of each data center (DC), and of the service globally, in terms of QPS, bandwidth, and latency.
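As an illustration of how the three levels of capacity relate, here is a minimal sketch that rolls a single-instance QPS limit up to per-DC and global totals. The data center names, instance counts, and the 500 QPS figure are hypothetical, and the sketch assumes capacity scales linearly with instance count, which a real service should confirm by load testing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str        # hypothetical DC label
    instances: int   # number of serving instances in this DC


def capacity_summary(qps_per_instance: float, data_centers: list[DataCenter]) -> dict:
    """Roll a single-instance QPS limit up to per-DC and global totals.

    Assumes linear scaling: each added instance contributes the same QPS.
    """
    per_dc = {dc.name: dc.instances * qps_per_instance for dc in data_centers}
    return {
        "per_instance_qps": qps_per_instance,
        "per_dc_qps": per_dc,
        "global_qps": sum(per_dc.values()),
    }


# Hypothetical fleet: 10 instances in one DC, 6 in another.
summary = capacity_summary(500, [DataCenter("dc-a", 10), DataCenter("dc-b", 6)])
print(summary)
```

The same roll-up can be repeated for bandwidth; latency does not add up this way and is better reported as a measured percentile per level.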
Give availability targets.
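To make an availability target concrete for readers, it can help to state how much downtime it permits. The sketch below converts a target into allowed downtime over a window; the 90-day quarter and the 99.9% figure are assumptions chosen for illustration, not values from the template.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 90) -> float:
    """Return the minutes of full downtime a target permits over the window.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability_target) * window_minutes


# Hypothetical target: 99.9% over a 90-day quarter.
budget = allowed_downtime_minutes(0.999)
print(f"Allowed downtime: {budget:.1f} minutes per quarter")
```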
Add links to procedures. These could include load testing, updates/pushes/flag flips, etc. Link to alert documentation in the alerts playbook.
Link to design docs on the component or related components, typically written by developer teams, and other related information.
The title should be the name of the alert (e.g., Generic Alert_AlertTooGeneric).
Address the following: What does this alert mean? Is it a paging or an email-only alert? What factors contributed to the alert? What parts of the service are affected? What other alerts accompany this alert? Who should be notified?
Indicate the reason for the severity (email or paging) of the alert and the impact of the alerted condition on the system or service.
Provide specific instructions on how to verify that the condition is ongoing.
List and describe debugging techniques and related information sources. Include links to relevant dashboards. Include warnings. Address the following: What shows up in the logs when this alert fires? What debug handlers are available? What are some useful scripts or commands? What sort of output do they generate? What are some additional tasks that need to be done after the alert is resolved?
List and describe possible solutions for addressing this alert. Address the following: How do I fix the problem and stop this alert? What commands should be run to reset things? Who should be contacted if this alert happened due to user behavior? Who has expertise at debugging this issue?
List and describe paths of escalation. Identify whom to notify (person or team) and when. If there is no need to escalate, indicate that.
Provide links to relevant related alerts, procedures, and overview documentation.
Describe the services that the team is responsible for.
Actual service demand from the prior six to eight quarters, expressed in the metric most relevant to the service (for example, QPS or daily active users).
Current demand forecast for the next eight quarters.
Capacity plan sufficient to meet forecast demand at the required redundancy level. Highlight any shortfalls in the plan and any risks to it.
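One way to surface shortfalls is to compare each quarter's forecast against the capacity that remains at the required redundancy level. The sketch below assumes N+1 redundancy, meaning the service must carry forecast demand with its largest data center unavailable; that policy, the quarter labels, and all numbers are hypothetical.

```python
def n_plus_one_capacity(per_dc_qps: list[float]) -> float:
    """Capacity left if the largest data center is lost (N+1 assumption)."""
    return sum(per_dc_qps) - max(per_dc_qps)


def find_shortfalls(
    forecast_qps: dict[str, float],
    planned_per_dc_qps: dict[str, list[float]],
) -> dict[str, float]:
    """Return, per quarter, how far forecast demand exceeds usable capacity."""
    shortfalls = {}
    for quarter, demand in forecast_qps.items():
        usable = n_plus_one_capacity(planned_per_dc_qps[quarter])
        if demand > usable:
            shortfalls[quarter] = demand - usable
    return shortfalls


# Hypothetical forecast and plan: three DCs of 5000 QPS each in both quarters.
forecast = {"Q1": 9000, "Q2": 12000}
plan = {"Q1": [5000, 5000, 5000], "Q2": [5000, 5000, 5000]}
shortfalls = find_shortfalls(forecast, plan)
print(shortfalls)
```

Quarters absent from the result meet forecast demand under the assumed redundancy policy; quarters present are the ones the review should highlight.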
The capacity plan must include an overlay with two to four previous quarterly forecasts, so that readers can assess forecast stability and accuracy over time.
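The overlay can be backed by a small table of forecast error, so readers see how far each earlier forecast landed from actual demand. The following sketch computes signed percentage error per quarter for each past forecast; the labels and demand figures are invented for illustration.

```python
def forecast_errors(
    actual: dict[str, float],
    forecasts: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Signed percentage error of each earlier forecast against actual demand.

    forecasts maps a forecast label to {quarter: predicted demand}.
    Positive values mean the forecast was too high.
    """
    errors = {}
    for label, predicted in forecasts.items():
        errors[label] = {
            quarter: (value - actual[quarter]) * 100 / actual[quarter]
            for quarter, value in predicted.items()
            if quarter in actual  # skip quarters with no actuals yet
        }
    return errors


# Hypothetical actual demand (QPS) and two earlier forecasts.
actual = {"2018Q1": 1000, "2018Q2": 1200}
forecasts = {
    "made-2017Q3": {"2018Q1": 1100, "2018Q2": 1500},
    "made-2017Q4": {"2018Q1": 1050, "2018Q2": 1260, "2018Q3": 1400},
}
errors = forecast_errors(actual, forecasts)
for label, by_quarter in errors.items():
    print(label, by_quarter)
```

Errors that shrink as the forecast date approaches the target quarter suggest a stable forecasting process; errors that swing in sign from one forecast to the next suggest the opposite.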
SLA Performance / Availability
All SRE-supported services are required to have a written SLA and to assess their performance relative to the SLA at least quarterly.
The SLA section must contain measurement of quarterly performance against SLA for the service's major components, and a link to the team's written SLA.
Contributing Incidents (Optional)
List three to five top incidents or outages for the quarter.
List top achievements for the quarter.
SLA Modifications (Recommended)
Recent changes to the SLA.
Service Details (Recommended)
May include service growth, latency stats, etc.
Team Info (Optional)
May include team staffing and status, projects, and oncall stats.
Data Sources (Required)
Describe the data sources used to derive availability numbers and the methods used to calculate them, and provide links to relevant dashboards.
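Stating the calculation method explicitly removes ambiguity about what the availability number means. As one possible method, the sketch below computes request-based availability (successful requests divided by total requests) and checks it against a target. The choice of a request-based measure, the counts, and the 99.9% target are all assumptions; a team might equally use time-based or probe-based measurement.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Request-based availability: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests


def meets_target(successful_requests: int, total_requests: int, target: float) -> bool:
    """True if measured availability is at or above the target fraction."""
    return availability(successful_requests, total_requests) >= target


# Hypothetical quarterly counts taken from a request log or monitoring system.
measured = availability(999_500, 1_000_000)
print(f"Availability: {measured:.4%}, meets 99.9% target: {meets_target(999_500, 1_000_000, 0.999)}")
```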
Who Are We
Add a sentence describing the technology environment (~1 line), the customers and offering of the team, as well as the scope of your team's SRE engagement or special expertise.
Describe the (group of) services your team supports to further define your team's scope.
How Do We Invest Our Time
Deciding the scope of your work helps define a roadmap for achieving and maintaining your goals in the long run.
Communicate your team values in a clear manner. They will influence how team members interact with each other and how your team is perceived by others.
Copyright © 2018 held by owner/author. Publication rights licensed to ACM.