December 3, 2019
Volume 17, issue 5


The Reliability of Enterprise Applications

Understanding enterprise reliability

Sanjay Sha

Enterprise reliability is a discipline that ensures applications will deliver the required business functionality in a consistent, predictable, and cost-effective manner without compromising core aspects such as availability, performance, and maintainability.

While achieving a high level of reliability is a common goal of most enterprises, reliability engineering involving third-party applications can be a complex landscape. First-party software affords the luxury of building a modular and extensible application that integrates seamlessly with an enterprise's IT ecosystem. Third-party software doesn't always have the same flexibility. Incorporating an off-the-shelf enterprise application within an existing IT ecosystem, without compromising functionality and reliability, is a classic engineering and philosophical problem that the CIO's office has to deal with all the time.

Despite this complexity, enterprises still pursue and select third-party software to power their business verticals such as HR (human resources), legal, and finance, since it makes economic sense to pay for an enterprise application rather than building the software in-house. Enterprises sometimes base their buying decisions only on the required business functionality, however, and tend to overlook the application's overall reliability. This can compromise the availability and supportability of the application and increase the cost of managing it in the long run.

This article describes a core set of principles and engineering methodologies that enterprises can apply to deliver highly reliable and cost-efficient applications and that can help them navigate the complex environment of enterprise reliability.

Terminology

The following terminology is used throughout:

Business owner — The owner or person leading a business vertical such as legal, finance, or HR; business owner and customer are used interchangeably.

Customer — The business owner of a business vertical such as legal, finance, or HR.

Enterprise applications — Software owned by an external company, also known as third-party software.

SLO (service-level objective) — A quantifiable objective that measures the effectiveness of business functionality.

SRE (site reliability engineering) — The enterprise's support organization. Some enterprises may not have a dedicated SRE team; instead, they have support teams such as DevOps and system or IT admins. The principles and methodologies outlined here are generic enough that they can be applied to any support organization.

User — An employee of an organization who represents the consumer of an enterprise application.

Vendor — A provider of third-party software.

Reliability Axioms

Reliability axioms are a set of principles that emphasize the values and behaviors that help foster and maintain the culture of enterprise reliability.

Culture = Reliability Axioms (Values) × Reliability Engineering (Behaviors)

These five core axioms define enterprise reliability and form the basis of this article: (1) focus on the customer; (2) select the right vendor; (3) invest in a common application platform; (4) engineer reliability to be cost effective; and (5) build an engineering-centric support organization (or SRE).

Focus on the Customer

Customer objectives determine the reliability of an application. Having a well-defined set of customer objectives is foundational, as these translate into tangible and measurable goals. These goals, also known as SLOs, drive the overall reliability posture of an application such as availability, performance, data integrity, monitoring, and responding to incidents. SLOs ensure that an application is engineered to meet the precise needs of the customer.

Select the right Vendor

The choice of vendor impacts the reliability of the core application. Choosing an enterprise application involves much more than just buying software that meets the business functionality. It involves partnering with a vendor that thinks and builds software with similar principles and values to the enterprise (e.g., secure by design, scalable software components, open APIs for extensibility, and ease of support and maintenance).

Invest in a common platform

The overall reliability of an application is the sum total of the reliability of the business's core application and all its dependencies. Transforming the baseline dependencies into a common platform can help standardize and bring consistency in how an application is managed. Using a common platform can drastically decrease technical silos and increase overall reliability and efficiency. A common platform could mean having a shared deployment manager, CI/CD (continuous integration/continuous delivery) frameworks, or shared service management workflows for monitoring, logging, backups, etc.

Engineer reliability to be cost effective

Over-engineering reliability breaks the ROI (return on investment) curve. Reliability is a function of how mature an application is and, as a result, its overall availability. Imagine you have a service with a 99.9 percent SLO. Adding an extra nine (99.99 percent) sublinearly increases the availability of your service, as shown in figure 1.


While improving the reliability of your service from 43.2 minutes of downtime per month to 4.32 minutes can be tempting, it can represent a significant engineering feat with a hefty price tag. Therefore, when specifying the required availability of a service, the decision should be based on the business requirements: "How available (how many nines) does my system need to be, in order to meet the business objectives?"
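The downtime figures above follow directly from the SLO. The following is a minimal sketch, assuming a 30-day month as the figures in the text do; the function name `allowed_downtime_minutes` is illustrative and does not come from the article.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Maximum downtime (in minutes) that still meets the SLO over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


# Print the downtime allowance for each additional nine.
for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% SLO allows {allowed_downtime_minutes(slo):.3f} minutes of downtime")
```

Under that assumption, a 99.9 percent SLO works out to 43.2 minutes and a 99.99 percent SLO to 4.32 minutes, matching the numbers quoted above.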

Build an engineering-centric support organization

Application reliability is preserved by SREs. Designing a perfect application doesn't guarantee a high-quality production experience, at least not without the support of an SRE organization. Both the application and the IT ecosystem where it runs change constantly, with developers pushing new code, vendors publishing new security patches, or infrastructure teams updating the software of the underlying platform.

Reliability is not a "build once and forget for life" construct; it is a continuous process of maintaining and upholding its principles and methodologies. Enterprises that recognize the need and invest in developing SRE skills stand out from the rest because they recognize that without these skills, enterprise reliability cannot be sustained.

Designing Enterprise Reliability Engineering

Designing for enterprise reliability is a multidimensional problem that spans multiple entities: customer, vendor, platform engineering, cost, and the SRE organization. The rest of this article expands on these axioms and describes the behaviors, principles, and methodologies that influence and shape the discipline of enterprise reliability.

Customer Objectives

"If you don't understand your customer objectives, then you do not need to exist as an org." Whether you are a traditional IT organization or a mature SRE org, this fundamental principle holds true.

Translate customer objectives to SLOs

In an enterprise setting, a typical customer is the owner of a business vertical such as legal, finance, or HR trying to accomplish a specific business goal. Having a well-defined set of business objectives lays the foundation for developing concrete functional requirements, allowing you to effectively translate those requirements into quantifiable and measurable outcomes, also known as SLOs.

Defining SLOs early on leads to a better design and implementation of the overall system. Arriving at a clear set of measurable SLOs, however, is an exhaustive process with a lot of considerations (e.g., what is technically feasible vs. infeasible, expensive vs. cost effective, reliable vs. fragile). Closely involving the customer and vendor throughout this process is crucial, as it develops a shared understanding of requirements, constraints, and tradeoffs, and helps reconcile the gap between aspirational and achievable SLOs.

Documenting the SLOs, including a strong rationale for the established targets and thresholds (e.g., 99.9 percent uptime), is key, as this becomes the contract among all the parties (SRE team, software vendor, and customer). This rigor also creates a culture of transparency and openness to inform how the system should be designed and how the service should operate. For a deep dive into engineering SLOs effectively, refer to the SLO chapter in the SRE book.¹
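One way to make such a documented SLO concrete is to record it as structured data that all parties can review. The sketch below is only an illustration: the class `SloRecord`, its field names, and the example values are hypothetical and are not prescribed by the article or the SRE book. The check that a target must be below 100 percent is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SloRecord:
    """One documented SLO; every field name here is illustrative."""
    service: str
    indicator: str          # what is measured
    target_percent: float   # e.g., 99.9
    window_days: int        # measurement window
    rationale: str          # why this target was chosen
    parties: Tuple[str, ...] = ("SRE team", "software vendor", "customer")

    def __post_init__(self) -> None:
        # Assumption of this sketch: a target of 100 percent is never accepted.
        if not 0.0 < self.target_percent < 100.0:
            raise ValueError("target_percent must be strictly between 0 and 100")
        # The rationale is part of the contract, so it may not be left blank.
        if not self.rationale.strip():
            raise ValueError("an SLO needs a written rationale")


# Hypothetical example record for an inventory system.
inventory_slo = SloRecord(
    service="inventory-checkout",
    indicator="successful checkouts / total checkouts",
    target_percent=99.9,
    window_days=30,
    rationale="Hypothetical: agreed with the business owner after a tradeoff review.",
)
```

Rejecting a record without a rationale mirrors the point above: the target alone is not the contract; the reasoning behind it is part of what the parties agree to.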

Empathy toward Customer and Vendor is key

Customers (business owners) may not always have the same level of understanding of the problem space. Their approach could be purely business driven, and they may expect the application never to go down. Likewise, the vendor may not entirely understand how the IT ecosystem is designed and cannot operate independently to deliver the system. The SRE team should become a true partner to bring alignment between the customer and vendor and develop a shared understanding of the overall objective, specific requirements, and constraints of the domain.

Given the nature of third-party domains, it may be hard to find a perfect system that meets 100 percent of the business functionality, as there are many variables in the equation (e.g., third-party software, hardware, cost, and vendor). Therefore, working closely with the customers in developing a set of detailed requirements and distinguishing core vs. optional requirements helps with the tradeoff analysis—for example, if the application has constraints, evaluating their impact on business objectives or revisiting and adjusting customer requirements without compromising the business objectives, or finding a new vendor altogether.

Taking customers through this journey from beginning to end helps them better understand the space and weigh in on all important considerations, ultimately allowing them to make effective business-driven decisions.

SLOs as a means to customer happiness

Solving for customer happiness based on objective goals is key; it is better to cater to functionality based on the customers' objectives in a measurable way (SLOs). Customers have only one fundamental criterion: Is the system able to translate business objectives into business functionality in a cost-effective and reliable manner?

Having this objective view creates a transparent and blameless culture. The key point to remember, however, is that SLOs are not fixed for life: As business needs evolve, the system SLOs need to be revisited. Therefore, having a strong discipline of revisiting the SLO agreements periodically with the customer helps tackle these changes and adjust the scope and expectations as business needs evolve.

Vendor Selection

Enterprise-application engineering with a vendor is a long-term investment that goes beyond the application itself. Therefore, it's important to select a vendor that aligns with the values and principles of the enterprise—for example, software design discipline (scale and performance), data security and privacy management, use of open standards, and ease of operations and maintenance.

To ensure that a vendor meets its requirements, an enterprise needs a rigorous evaluation and validation process. Two distinct sets of evaluations determine and shape the reliability of an application:

• Functional evaluation: represents the business functionality required by the customer.

• Infrastructure evaluation: represents the application's IT requirements.

Functional Evaluation

Functional requirements are derived directly from customer objectives and form the basis of the evaluation process. Each functional requirement has a set of key functional characteristics. The goal of the evaluation process is to do an in-depth analysis of these characteristics and assess the feasibility of third-party software.

To understand this, consider the following scenario. Assume that your enterprise is evaluating a third-party IT inventory system to manage your corporate IT asset information. One of your business objectives is to predict the supply and demand for your inventory in realtime. This could result in a requirement for a centralized global inventory database that updates in realtime every time a checkout happens.

Based on this scenario, let's analyze the core characteristics that a functional evaluation should delve into.

Functional specification

Does the vendor understand the functional requirement and the expected outcome? In the scenario just described, the functional requirement is to maintain a global inventory database for all asset information. The expected outcome is the ability to track asset information and update the global inventory database in realtime.

Dependencies and constraints

Does the vendor need to be aware of any core dependencies or constraints? For example, does the global inventory database depend on any external entities? Is a centralized database required for reads and writes, or is a distributed setup required? What are the pros and cons of both approaches?

Functional interfaces

Does the vendor understand all the end-to-end functional interfaces involved in this requirement? For example, does the inventory database have any reporting interfaces? How does the admin interact with the database? How do the users interact with the database when they do a checkout? What is the end-to-end flow?

Geographic requirements

Does the enterprise have a presence across the globe? Will users access this inventory system from different regions? What are the specific performance and latency requirements for these users?

Scale and load requirements

How many users are going to use the inventory system, both globally and per region? What are the QPS (queries per second) or load requirements for these users? Are there any peak or off-peak volume requirements or considerations?

Security requirements

Does the vendor understand the security posture of the system? Are there any specific access restrictions based on user type (e.g., admin vs. normal user)? What is the authentication and authorization mechanism? Does the application depend on a centralized authorization service such as LDAP (Lightweight Directory Access Protocol) or AD (Active Directory)? Is there a single sign-on dependency?

Compliance requirements

Does the vendor understand and meet the compliance requirements for this application?

Error- and exception-handling requirements

Does the vendor understand the key failure modes based on the design of the system? How does the vendor's software handle exceptions (e.g., request timeouts, retries during write failures, and connection resets)?

Release management

What software release management discipline does the vendor use? What is the release cycle? How are changes tested before being released to the customer? What is the QA/qualification process?

Load and performance testing and functional validation

Does the vendor have a holistic testing plan that covers the end-to-end workflow, and does it include all the edge cases? What is the testing plan for measuring load and performance?

Infrastructure Evaluation

Infrastructure requirements create the foundation for the whole application. Therefore, ensuring the end-to-end reliability of this base layer is critical.

Every enterprise is unique and has its own set of infrastructure requirements and constraints. When evaluating an enterprise application, you want to ensure that the vendor can comply with the requirements of the enterprise's IT ecosystem. For example, suppose your enterprise has fully adopted virtualization for internal efficiencies and other business reasons. In this case, the vendor's application should be compatible with and supported on VMware. Otherwise, the application could become a nonstandard model in your IT organization, driving up costs related to infrastructure, licensing, hardware, and support.

Following are a set of key infrastructure requirements to ensure that a vendor's software is compatible with an enterprise's IT ecosystem.

Core infrastructure

The vendor must meet an enterprise's hardware, software, and operating-system requirements. This includes the specific hardware models, enterprise databases, and software and operating-system versions that the IT team supports.

Networking

The vendor must meet the authentication and authorization requirements of the network—for example, LDAP or AD, or single sign-on requirements.

Infrastructure security

The vendor should understand and meet the enterprise's security policies related to access management, perimeter security, and data encryption.

Infrastructure sizing

The enterprise should derive a concrete sizing plan including the number of environments and compute and storage requirements based on its functional requirements, and evaluate the vendor closely to ensure that its software can scale and meet those sizing needs.

High availability and disaster recovery

The SRE team should have a clear understanding of reliability requirements based on SLOs and customer objectives. Deciding on the high-availability design (such as active-active or active-passive), the disaster-recovery requirements and strategy, and the targets for data recovery (recovery point objective) and service restore (recovery time objective) is critical when engaging the vendor. The enterprise must ensure that the vendor's application can meet its requirements, or that the vendor is willing to collaborate with the SRE team to provide the needed reliability.

Data management

The vendor should have a clear data management discipline and methodologies when it comes to data integrity, backup, recovery, and retention. Does the vendor have a strong data security discipline such as encryption of data both in transit and at rest?

Integrations

Make a list of all the dependent systems and necessary integrations that the IT ecosystem requires—for example, authentication services such as LDAP or AD; corporate mail service and the necessary integrations; and service management workflows such as centralized backups, monitoring, and logging.

Operability

Ensure that the vendor has a strong discipline of software updates/upgrades, clearly defined maintenance windows, etc.

These requirements provide an overview of the core aspects and characteristics you should evaluate when choosing an enterprise application. Note that this is not an exhaustive list, and requirements may vary among enterprises.

Functional and infrastructure requirements can heavily influence the design and delivery of an application. Therefore, evaluating the feasibility of these requirements is a crucial step in engineering the reliability of an application.

Common Application Platform

Most enterprises rely on third-party software to support the operations and needs of their business verticals (figure 2). Running different third-party applications, however, can lead to a large number of disparate systems within an enterprise. Not having a common baseline across applications makes maintaining the reliability and efficiency of service more difficult over time. This creates a lot of overhead for the SRE team and increases the organization's operational costs.

A common platform provides a standard operating environment in which to run all of a system's applications, enhancing the overall reliability and efficiency of an enterprise. The key principle of implementing a common platform is to identify, build, and enforce a set of shared modules and standards that can be reused across the applications that support the business verticals.

On the other hand, overengineering a common platform can have a negative impact. If a platform has too many standards in place or becomes too rigid, an enterprise's delivery and execution speed can decrease significantly.

The goal is to develop a strategy that allows enterprises to find the right balance between optimizing for reliability and maintaining the development speed needed to deliver and support business functionality. Finding this balance requires a careful analysis of the tradeoffs and net benefits.

Common Platform Layout

An application platform consists of a set of modules that can be grouped into three main categories (figure 3):

• Infrastructure deployment modules.

• Application management modules.

• Common service modules.

Infrastructure Deployment Modules

Infrastructure deployment modules provide intent-based deployment of an end-to-end application environment based on a set of resource requirements such as CPU, memory, operating systems, and the number of instances. This mechanism is highly efficient since the workflows only need to be configured once and can be triggered as needed. It also provides a standardized, consistent, and predictable environment, which improves overall reliability.

Many enterprises are already embracing open-source technologies to help them manage the underlying infrastructure of their applications. Tools such as Terraform provide abstractions to handle the provisioning and deployment of end-to-end environments agnostic to the underlying platform (e.g., on premises vs. cloud).

Application Management Modules

Application management modules handle critical workflows during the life of an application. A few examples of these workflows include:

• Configuration management workflows to deploy application configuration.

• Release management workflows to manage software releases and rollbacks.

• Security management workflows to manage secrets and certificate deployments.

Software solutions such as Puppet, Chef, and Ansible provide frameworks and solutions for enterprises to orchestrate these workflows across their applications.

Common Service Modules

Common service modules manage the standardized workflows that can be shared across all applications, such as logging, monitoring, and reporting. This layer can also include custom service modules for the specific needs of an enterprise, such as a custom web front end or a single sign-on service.

Some examples of common service modules include:

• Monitoring module to collect and publish metrics for reporting and alerting.

• Backup module to execute backups, retention, and recovery.

• Log collection module to securely ship logs to a centralized log service.

• Custom WebLogic/Tomcat as a service, offering middleware capabilities.

• Managed DBaaS (database as a service) module to manage database workflows.

Combining infrastructure deployment, application management, and common service modules creates a platform that enables enterprises to move away from managing monolithic applications and into a new realm of modular, extensible, and reusable applications.

Cost Engineering

When enterprises opt for third-party software, they are making a cost- and ROI-based decision to use a "reliable" enterprise application that delivers the business functionality in a cost-effective manner. Determining the right reliability-to-cost tradeoff that sustains the ROI curve is the crux of cost engineering.

Reliability-to-Cost tradeoff

Figure 4 illustrates how reliability (the number of nines) directly influences the overall availability or reduction in downtime. The reduction with each additional nine is sublinear. While it is extremely tempting to add a nine, it is important to recognize that engineering an additional nine can be expensive, and overengineering reliability produces diminishing ROI. To understand this, let's look at the following scenario.

Enterprise ABC is looking for a third-party sales application that can provide market analysis and insights. The sales team predicts they can generate an average of $600/hour of revenue by leveraging those insights. Their revenue target per quarter is approximately $1.2 million. What is the required uptime (availability SLO) for this application?

If the application was available 100 percent of the time, the maximum revenue would be:

Net revenue = hours in a quarter (3 months × 30 days × 24 hours = 2,160) × earnings per hour ($600)

$1,296,000 (~$1.29M) = 2,160 hours in a quarter × $600 per hour

The net revenue (~$1.29 million) clearly exceeds the target revenue of $1.2 million, but 100 percent availability is infeasible. Figure 5 illustrates how to choose the perfect availability SLO that meets the ROI.

Here are the key conclusions reached in this scenario:

1. A 90 percent availability SLO generates ~$1.16 million in revenue, which falls short of the target revenue of $1.2 million. This SLO is not feasible.

2. A 95 percent availability SLO generates ~$1.23 million in revenue, which comfortably meets (slightly exceeds) the revenue objective of $1.2 million. This SLO is feasible.

3. A 99 percent availability SLO generates ~$1.28 million in revenue, which far exceeds the revenue objective of $1.2 million, but it comes with additional overhead:

• A 95 percent SLO guarantees no more than 36 hours downtime per month and still comfortably meets the target revenue.

• In contrast, a 99 percent SLO guarantees no more than 7.2 hours downtime per month, but the cost of engineering and support can be higher.

• As long as the cost to engineer a 99 percent SLO does not exceed $80,000 (the difference between ~$1.28 million and the $1.2 million target), this is a viable option.

4. The net revenue growth for each additional nine provides diminishing returns (delta revenue)—for example, between 99.99 and 99.999 percent:

• There is a significant reduction in downtime per month from 4.32 minutes to 25.92 seconds, but the revenue increase is only $116.64.

• To choose a 99.999 percent SLO, the added engineering cost should be less than $116.64.
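The arithmetic behind these conclusions can be reproduced with a few lines of code. This is a sketch under the scenario's own figures (2,160 hours per quarter, $600 per hour, a $1.2 million target) and the simplifying assumption, implicit in the scenario, that revenue scales linearly with uptime. The constant and function names are illustrative.

```python
HOURS_PER_QUARTER = 3 * 30 * 24   # 2,160 hours, as in the scenario
REVENUE_PER_HOUR = 600            # dollars
TARGET_REVENUE = 1_200_000        # dollars per quarter


def quarterly_revenue(slo_percent: float) -> float:
    """Revenue if the application is up exactly slo_percent of the quarter."""
    return HOURS_PER_QUARTER * REVENUE_PER_HOUR * slo_percent / 100.0


# Walk up the SLO ladder and show the revenue gained by each step.
previous = None
for slo in (90.0, 95.0, 99.0, 99.9, 99.99, 99.999):
    revenue = quarterly_revenue(slo)
    verdict = "meets target" if revenue >= TARGET_REVENUE else "misses target"
    delta = "" if previous is None else f", delta ${revenue - previous:,.2f}"
    print(f"{slo}%: ${revenue:,.2f} ({verdict}{delta})")
    previous = revenue
```

The delta column makes the diminishing returns visible: the step from 99.99 to 99.999 percent adds only $116.64 per quarter, which is the ceiling on what that extra nine is worth in this scenario.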

Account for Application dependencies

To design a system with a 99.9 percent SLO, the rule of thumb is to have all critical dependent systems provide an additional nine (i.e., 99.99). This means you have to factor in the reliability investment (additional cost) for your application and all of its critical dependencies, because a system is only as available as the sum of its dependencies.²
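A rough way to see why the extra nine matters is to model the application as requiring every critical dependency to be up at the same time. The sketch below assumes the dependencies fail independently, so their availabilities multiply; that independence assumption and the example figures are this sketch's own, not the article's.

```python
from math import prod
from typing import Sequence


def composite_availability(own: float, dependencies: Sequence[float]) -> float:
    """Availability (as a fraction) of a service that needs all dependencies up.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    return own * prod(dependencies)


# Hypothetical service that is itself 99.95 percent available.
with_extra_nine = composite_availability(0.9995, [0.9999, 0.9999, 0.9999])
without_extra_nine = composite_availability(0.9995, [0.999, 0.999, 0.999])

print(f"dependencies at 99.99%: {with_extra_nine:.4%}")
print(f"dependencies at 99.9%:  {without_extra_nine:.4%}")
```

With three dependencies at 99.99 percent, the composite stays above 99.9 percent; with the same dependencies at 99.9 percent, it falls below the target even though the application itself has not changed.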

Choose an SLO that fits the ROI curve

The ideal SLO is one that delivers the required functionality with a degree of reliability that fits within the ROI curve. In the previous scenario, the best SLO would be 95 percent, because it is the least expensive option that meets the business goal ($1.2 million).
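The selection rule can be sketched as a small function. It assumes that engineering cost rises with every step up in availability, so the lowest SLO that meets the revenue target is also the least expensive one. The function name and signature are illustrative, and the revenue model is the linear one used in the scenario.

```python
from typing import Iterable, Optional


def lowest_slo_meeting_target(
    candidates: Iterable[float],
    revenue_at_full_uptime: float,
    target_revenue: float,
) -> Optional[float]:
    """Return the smallest candidate SLO (percent) whose revenue meets the target.

    Returns None when no candidate is sufficient.
    """
    for slo in sorted(candidates):
        if revenue_at_full_uptime * slo / 100.0 >= target_revenue:
            return slo
    return None


# Scenario figures: $1,296,000 at full uptime, $1.2 million target.
best = lowest_slo_meeting_target([90.0, 95.0, 99.0, 99.9], 1_296_000, 1_200_000)
print(best)
```

With the scenario's numbers, the function selects 95 percent, in line with the conclusion above.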

Overengineering reliability produces diminishing ROI

The previous scenario shows that increasing the availability of a service does not always translate into significant revenue growth. In fact, with each additional nine, the benefit of engineering the reliability increases only sublinearly, breaking the ROI curve.

Preserving Enterprise Reliability

Reliability is not just a systems design problem. You can have the world's best-designed system, but without proper rigor and discipline, preserving core aspects of the system such as availability, performance, and security can become extremely difficult. Reliability is a responsibility that should be shared across all teams that are involved in the system, including vendors, development, and SRE. The SRE teams are ultimately accountable, however, since they are responsible for achieving their SLOs. During the lifecycle of an application there are a few critical junctures where maintaining proper rigor can translate into preserving the reliability of the service.

Design for standardization and uniformity

Reliability is preserved when you recognize the importance of uniformity and invest in standardization. One of the challenges of enterprise applications is that there is no agreement or consensus among vendors on common standards around software technologies, operating systems, and workflow orchestration methodologies, such as release management and patch management. Each vendor provides its own flavor.

The role of SRE is to publish common standards for the portfolio of tools and technology that they support (the base operating system, release management, and configuration frameworks) and the minimum operational maturity they expect from the vendor (e.g., automated installs and seamless patching workflows).

Mature enterprises that rely on multiple software vendors recognize the importance of having a baseline ecosystem and strong operational maturity. They not only consider business functionality, but also account for ecosystem maturity when looking for third-party applications.

Change Management

Change is powerful. You can build a highly reliable system, but one small change (a bad config push or a software bug) can compromise the reliability of the entire system. Preserving reliability comes from having a change-management rigor with a set of checks and balances that can detect, prevent, or minimize the impact of problems. SRE should be responsible for maintaining this rigor. Consider the following checks and balances.

Measure, Monitor, and Alert

Measure, monitor, and introduce thresholds to alert for everything that is on the critical path of your SLO. This provides the ability to proactively detect and fix issues.
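As a minimal illustration of a threshold on an SLO's critical path, the sketch below derives an availability indicator from request counts and flags when it drops below the objective. The function names, the request-based indicator, and the choice to treat a window with no traffic as healthy are all assumptions of this sketch; a real deployment would rely on the enterprise's monitoring platform.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Percentage of requests served successfully in the measurement window."""
    if total_requests == 0:
        # Assumption: a window without traffic counts as meeting the objective.
        return 100.0
    return 100.0 * good_requests / total_requests


def should_alert(good_requests: int, total_requests: int, slo_percent: float) -> bool:
    """True when the measured indicator has fallen below the SLO threshold."""
    return availability_sli(good_requests, total_requests) < slo_percent


# 9,985 of 10,000 requests succeeded: 99.85 percent against a 99.9 percent SLO.
print(should_alert(9_985, 10_000, 99.9))
```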

Streamlined Change and Release Management

Require all changes to go through validation and regression testing. This should be enforced as a strong requirement across all teams that introduce changes.

Dedicated canary environment

Every critical production application should have a dedicated canary environment as a prerequisite. It should be an exact replica of the production environment. This allows for testing user-facing impact such as load and performance.

Phased rollouts

Phased rollouts help reveal unforeseen issues (those not uncovered by tests) that are discovered only in production. This provides the agility to roll back the changes quickly and minimize the impact.
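The control flow of a phased rollout can be sketched as a loop that deploys one phase at a time, checks health, and reverts everything deployed so far on the first failure. The phase names and the `deploy`, `healthy`, and `rollback` callbacks are hypothetical stand-ins for whatever release tooling an enterprise actually uses.

```python
from typing import Callable, List, Sequence


def phased_rollout(
    phases: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy phase by phase; on the first unhealthy phase, undo everything."""
    completed: List[str] = []
    for phase in phases:
        deploy(phase)
        completed.append(phase)
        if not healthy(phase):
            # Undo in reverse order so the most recent change is reverted first.
            for done in reversed(completed):
                rollback(done)
            return False
    return True


# Usage with stand-in callbacks that only record what happened.
log: List[str] = []
ok = phased_rollout(
    phases=["canary", "region-1", "region-2"],
    deploy=lambda p: log.append(f"deploy {p}"),
    healthy=lambda p: p != "region-1",  # simulate a failure in the second phase
    rollback=lambda p: log.append(f"rollback {p}"),
)
print(ok, log)
```

Because the simulated failure occurs in the second phase, the third phase is never touched, which is the property that limits the impact of a bad change.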

Rollbacks and Restores

Another key discipline is to ensure that every change can be rolled back. It is particularly important to understand the dependency graph of the change and ensure an atomic rollback. This is difficult in complex systems, but in such cases having a clear restore point is key for most critical changes.
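One way to reason about the dependency graph of a change is to compute the order in which its parts must be applied and then roll them back in the opposite order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the change names are hypothetical, and a real rollback would also need to handle failures of the rollback steps themselves.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def rollback_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return the changes in the order they should be rolled back.

    `depends_on` maps each change to the changes that must be applied before it.
    Dependents are rolled back before the things they depend on.
    """
    apply_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(apply_order))


# Hypothetical change set: the front end needs the new config, which needs the schema.
changes = {
    "schema": set(),
    "app-config": {"schema"},
    "frontend": {"app-config"},
}
print(rollback_order(changes))
```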

Error budgets

Error budgets are a simple concept. Every service has a target SLO, and if it exceeds that SLO, then that positive delta of uptime becomes the budget to use in pushing any changes or releases. This is a powerful concept explained in depth in the SRE book.¹ Sharing this rigor with your application development team is a good way to ensure service reliability.
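A common way to quantify the budget is to take the downtime the SLO permits over a window and subtract the downtime already consumed; whatever remains is available for changes and releases. The sketch below assumes a 30-day window and a downtime-based budget, and the function names are illustrative.

```python
def error_budget_remaining_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Minutes of downtime still permitted by the SLO in the current window."""
    window_minutes = window_days * 24 * 60
    budget = window_minutes * (100.0 - slo_percent) / 100.0
    return budget - downtime_minutes


def can_release(slo_percent: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Allow risky changes only while some budget is left."""
    return error_budget_remaining_minutes(slo_percent, downtime_minutes, window_days) > 0


# A 99.9 percent SLO permits 43.2 minutes per 30 days; 10 minutes are already used.
print(error_budget_remaining_minutes(99.9, 10.0))
print(can_release(99.9, 50.0))
```

Once the consumed downtime exceeds the allowance, `can_release` returns False, which is the signal to pause feature releases and focus on reliability work.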

Outages and Incidents

No matter how reliable a system is, you should anticipate and prepare for a disaster. Rather than solving for no outages, which is impractical, the focus should be on effectively managing the outage (minimizing downtime) and learning from it, so the same patterns don't repeat.

Resiliency testing

The goal here is to stress test application resiliency by breaking the system, observing the effects of the breakage, and subsequently improving the reliability of the application.

Incident preparedness

The SRE team should periodically run fire drills to practice incident management that involves extensive coordination with partner teams, timely communication to stakeholders, and restoring the service as soon as possible. Responding to and handling an actual incident without this preparation can reduce the speed and effectiveness of restoring the service.

Learning from Outages

A repeated outage is not an outage anymore; it is a mistake. For every outage there should be a thorough post-mortem that clearly identifies the root cause of the outage and focuses on what went wrong and what can be improved going forward. It is critical for enterprises to foster a blameless post-mortem culture that focuses on improving the reliability of the application.

The future of Enterprise Reliability

Over the past few years, cloud platform providers have increasingly focused on enterprises, offering a suite of secure, reliable, and cost-effective products that range from highly scalable compute, storage, and networking services to modern managed offerings such as containers as a service (Kubernetes), serverless, and DBaaS (database as a service). In addition, cloud providers are delivering advanced services in the realms of AI (artificial intelligence), ML (machine learning), and big data, opening a wide range of possibilities for enterprises to rethink and transform their business verticals.

This shift represents a tremendous opportunity for enterprises to embrace and adopt the cloud. Undertaking such a large-scale migration, however, introduces a new challenge: How can enterprises adapt and rapidly evolve without reducing their reliability?

Cloud Migration Strategy

Enterprises typically have complex business requirements, so a lift-and-shift strategy to migrate 100 percent of their workloads to a single cloud provider may not be feasible. A hybrid cloud environment provides the flexibility for workloads to operate seamlessly across both public and private cloud environments. This approach greatly simplifies the cloud adoption strategy and provides a controlled environment that ensures a predictable level of reliability throughout the transition to the cloud.

Enterprises that thoughtfully embrace the hybrid cloud strategy face less risk to overall reliability and have a faster path to cloud transformation. Investing in a common application platform, coupled with the adoption of technologies such as Kubernetes (https://kubernetes.io/), Istio (https://istio.io/), and serverless computing (https://en.wikipedia.org/wiki/Serverless_computing), provides the flexibility to operate workloads independently of the cloud provider. Technologies such as the GCP (Google Cloud Platform) Anthos platform (https://cloud.google.com/anthos/) can also help enterprises expedite their transition to the cloud in a reliable and efficient manner.

VEC Ecosystem

Developing a strong relationship among vendors, enterprises, and cloud providers is pivotal to the future of enterprise reliability. Cloud providers need to motivate software vendors, through partnership programs, to modernize third-party software by embracing cloud-based technologies and building certified multicloud-compliant software offerings. This VEC (vendor-enterprise-cloud) ecosystem, coupled with the technological shift, will bring a rapid transformation that reshapes the enterprise domain.

Maintaining enterprise reliability is a continuous process that is at a crucial moment with the advent of the cloud. The next decade will be the era of large-scale enterprise transformations leveraging cloud capabilities, and only those enterprises that grasp the discipline of reliability engineering will be able to transform successfully into the realm of cloud-based enterprise computing.

References

1. Jones, C., Wilkes, J., Murphy, N., Smith, C. 2016. Service-level objectives. In Site Reliability Engineering, ed. Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy. O'Reilly Media; https://landing.google.com/sre/sre-book/chapters/service-level-objectives/.

2. Treynor, B., Dahlin, M., Rau, V., Beyer, B. 2017. The calculus of service availability. acmqueue 15(2); https://queue.acm.org/detail.cfm?id=3096459.

Related articles

Toward Software-defined SLAs
Enterprise computing in the public cloud
Jason Lango
https://queue.acm.org/detail.cfm?id=2560948

Enterprise Software as Service
Online services are changing the nature of software.
Dean Jacobs
https://queue.acm.org/detail.cfm?id=1080875

Why Cloud Computing Will Never Be Free
The competition among cloud providers may drive prices downward, but at what cost?
Dave Durkee
https://queue.acm.org/detail.cfm?id=1772130

Sanjay Sha is an SRE Manager at Google. He is a long-time Googler with more than 14 years' experience running several large-scale systems at Google. He leads the Enterprise domain, managing SRE teams supporting Google's key business verticals. He is currently working on the Corp to Cloud initiative to run Google's internal enterprise workloads on GCP.

Originally published in Queue vol. 17, no. 5.