Diving Deep on S3 Consistency

I recently wrote about the Amazon S3 and how it’s evolved over the last 15 years since we launched the service in 2006 as “storage space for the Internet.” We built S3 because we knew customers wanted to save backups, videos, and images for applications like e-commerce sites. Our main design priorities at the time were safety, elasticity, reliability, durability, performance and cost, because that’s what customers told us was most important to them for these types of applications. And that is still true today. But over the years, the S3 has also become the repository used for analysis and machine learning on massive data lakes. Instead of just saving images to e-commerce sites, these data lakes serve the data for applications such as satellite image analysis, vaccine research, and autonomous truck and car development.

To provide storage for such a wide range of uses requires constant capacity development. And this is what I think is one of the most interesting things about the S3. It has fundamentally changed how customers use storage space. Before S3, customers were stuck for 3-5 years with the capacity and capabilities of the expensive local storage system they purchased for their data center. If you wanted more capacity or new features, you would buy a new local storage device and then have to migrate data between storage arrays. With S3’s pay-as-you model for capacity and constant innovation for new opportunities, S3 changed the game for companies that could now develop their data usage without making major changes to their applications.

One of these exciting new innovations is the S3 Strong Consistency, and that’s what I would like to dive into today.

Consistency, consistent

Consistency models are a distributed system concept that defines the rules for the order and visibility of updates. They come with a continuum of trade-offs that allow architects to optimize a distributed system for the key components.

For S3, we built caching technology into our metadata subsystem that optimized for high availability, but one of the consequences of this design decision was that under extremely rare circumstances, we would exhibit any consistency on write. In other words, the system would almost always be available, but sometimes an API call would return an older version of an object that had not yet been fully deployed in all the nodes of the system. Final consistency was appropriate when serving site images or backups in 2006.

Fast forward 15 years later to today. The S3 has well over 100 trillion objects and serves tens of thousands of requests every second. Over the years, customers have found many new application cases for S3. Eg. Uses tens of thousands of customers S3 for data lakes where they perform analytics, creating new insights (and competitive advantages) for their businesses on a scale that was impossible a few years ago. Customers also use S3 to store petabytes of data to train machine learning models. The vast majority of these interactions with storage are done using application code. These computing applications often require strong consistency – objects must be the same across all nodes in parallel – and therefore customers create their own application code to track consistency outside of S3 for their S3 usage. Customers loved the S3’s resilience, cost, performance, operating profile and simplicity of the programming model, and since it was important for their application to have a strong consistency in storage, they added it themselves in the application code to take advantage of the S3. As an example, Netflix opened s3mper open source, which used Amazon DynamoDB as a consistent store to identify the rare instances where S3 would earn an inconsistent response. The Cloudera and Apache Hadoop communities worked on S3Guard, which similarly provided a separate view for applications to mitigate rare instances of inconsistency.

While customers could use metadata tracking systems to add strong consistency to their applications’ use of S3, it was additional infrastructure that needed to be built and managed. Remember that 90% of our schedule at AWS comes directly from customers, and customers asked us if we could change the S3 to avoid having to run extra infrastructure. We thought back to the central design principle of simplicity. That was true in 2006 and remains true today as a cornerstone in how we think about building S3 features. And then we started thinking about how to change the consistency model for S3. We knew it would be difficult. The consistency model is baked into the core infrastructure of S3.

We thought of strong consistency the same way we think about all the decisions we make: starting with the customer. We considered approaches that would have required a balancing of costs within which objects had consistency or performance. We would not make any of these trade-offs. So we kept working towards a higher bar:
we wanted a strong consistency at no extra cost, applied to all new and existing objects and without compromising on performance or availability.

Other providers compromise, such as making strong consistency an opt-in option for a bucket or account over all storage, implementing consistency with cross-region dependencies that undermine the regional availability of a service or other constraints. If we wanted to change this basic concept of consistency and stay true to our S3 design principles, we had to make strong consistency the standard of every request, for free, with no impact on performance, and be true to our reliability model. This made a tough engineering problem much more difficult, especially on the S3’s scale.

S3’s metadata subsystem

Per-object metadata is stored in a discrete S3 subsystem. This system is on the data path for GET, PUT and DELETE requests and is responsible for handling LIST and HEAD requests. The core of this system is a persistence level that stores metadata. Our persistence level uses a caching technology designed to be highly resistant. S3 requests should still succeed even if the cache-supporting infrastructure is degraded. This meant that in rare cases, one can write through part of the cache infrastructure, while readings end up asking another. This was the primary source of S3s eventual consistency.

An early consideration to deliver strong consistency was to bypass our caching infrastructure and send requests directly to the persistence layer. But this would not meet our bar for no compromises on performance. We needed to keep the cache. To keep the values ​​properly synchronized across cores, CPUs implement cache coherence protocols. And this is what we needed here: A cache coherence protocol for our metadata caches that allowed strong consistency for all requests.

Cache context

Our strong consistency history required us to make our metadata cache highly consistent. This was a high order in the S3 scale, and we wanted to make that change while respecting the experience learned for scale, resilience, and operations for our metadata systems.

We had introduced new replication logic in our persistence level that acts as a building block for our event message delivery system at least once and our replication time function. This new replication logic allows us to reason about the “sequence of operations” per object in S3. This is the core of our cache coherence protocol.

We introduced a new component of the S3 metadata subsystem to understand if the cache’s perception of an object’s metadata was outdated. This component acts as a witness to write, notified each time an object changes. This new component acts as a reading barrier during read operations so that the cache can learn if its view of an object is outdated. The cached value can be displayed if it is not obsolete or invalidated and read from the persistence level if it is obsolete.

This new design presented us with challenges along two dimensions. First, the cache coherence protocol itself had to be correct. Strong consistency must always be strong without exception. Second, customers love the high availability of the S3, so our design for the new witness component must ensure that it does not reduce the availability that the S3 is designed to deliver.

High availability

Witnesses are popular in distributed systems because they often only need to track a small bit of state, in memory, without having to go to disk. This allows them to achieve extremely high query processing rates with very low latency times. And that’s what we did here. We can continue to scale this fleet out as S3 continues to grow.

In addition to extremely high capacity, we built this system to exceed S3’s high availability requirements and leveraged our learning by using large systems for 15 years. As I have long said, everything fails, all the time, and as such we have designed the system assuming that individual hosts / servers will fail. We built automation that can quickly respond to load concentration and individual server errors. Because the consistency witness tracks minimal condition and only in memory, we are able to quickly replace them without waiting for long state transfers.


It is important that strong consistency is implemented correctly so that there are no edge issues that break the consistency. S3 is a massively distributed system. Not only must this new cache coherence protocol be correct in the normal case, but in all cases. This must be correct when simultaneous writing to the same object was in progress. Otherwise, we would potentially see values ​​”flicker” between old and new. It must be correct when a single object sees very high simultaneity on GET, LIST, PUT and DELETE while versioning is enabled and has a deep version stack. There are innumerable entanglements of operations and interstates, and to our extent, even if something only happens once in a billion requests, it means that it happens several times a day within S3.

Common testing techniques such as unit testing and integration testing are valuable, necessary tools in any production system. But they are not enough when you have to build a system with such a high bar of accuracy. We want a system that is “provably correct”, not just “probably correct”. So for a strong consistency, we used a variety of techniques to ensure that what we built is correct and remains correct as the system evolves. We used integration testing, deductive evidence for our proposed cache coherence algorithm, model checking to formalize our consistency design and demonstrate its correctness, and we extended our model checking to examine the actual executable code.

These verification techniques were a lot of work. They were actually more work than the implementation itself. But we put this rigor in the design and implementation of S3’s strong consistency, because that’s what our customers need.


We built the S3 on the design principles we called out when we launched the service in 2006, and every time we review a design for a new feature or micro-service in the S3, we go back to the same principles. Offering strong consistency as standard at no extra cost to customers and with high performance was a huge challenge. S3 draws on the experience of over 15 years of large-scale and across millions of customer cloud storage operations to innovate capabilities not available elsewhere, and we leveraged this experience to add strong consistency to the high availability that S3’s customers have come to appreciate. And by leveraging a variety of testing and verification techniques, we were able to deliver the accuracy required by customers from a highly consistent system. And most importantly, we were able to do it in a way that was transparent to customers and remained true to the core values ​​of S3.


Leave a Reply

Your email address will not be published.