
Understanding DynamoDB

In the AWS family of services, DynamoDB is the data-store equivalent of Lambda. We saw in the previous section how Lambda gives you the benefits of serverless compute in exchange for some specific limitations, and there are similar trade-offs with DynamoDB. Like Lambda, DynamoDB gives you high performance, automatic scaling, and on-demand pricing. The trade-offs compared to traditional relational databases are in the areas of consistency, relational integrity, and the ability to use a language like SQL to perform cross-table queries.

In this first lesson we’ll cover the underlying reasons for these trade-offs. We’ll start by looking at the strategies you might use to scale a relational database to handle high traffic volumes.

Scaling Relational Databases

Traditional relational databases are able to maintain consistency and enforce relational integrity because, at their heart, they have a single process that coordinates updates. It’s because organizations want to maintain this consistency and integrity that they often scale high-volume databases vertically on very large servers.

Strategy 1: Read Replicas

If, however, you have a database with traffic volumes that can’t be handled by vertical scaling alone, then there are two basic strategies you might adopt to distribute the database workload horizontally. The first and most common strategy is to set up one or more read replicas, as shown in the diagram below:

This diagram depicts a traditional application server on EC2 that accesses one database instance for writes and a replicated instance for reads. If, as is usually the case, your database serves proportionately more read than write operations, then you might add multiple read replicas to balance the traffic load. The overarching goal is to distribute the compute load across multiple database servers.

Now, a key point with this strategy is that to properly scale the write performance on the master database, the replication should be asynchronous. Otherwise write operations get slower with each added replica, as they block until all the replicas are updated. But as soon as you move to asynchronous replication you have also moved to an “eventual” consistency model, meaning that newly written data will not immediately be visible across all the replicas. For some period of time following a write operation, different client applications will read different versions of the data, depending on which replica they are accessing.
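To make the read/write split concrete, here’s a minimal sketch in Python of how application code might route traffic in this setup. It assumes PostgreSQL accessed through psycopg2, and the endpoint names for the primary and replica are hypothetical:

```python
# A minimal sketch of read/write splitting, assuming PostgreSQL via
# psycopg2. The endpoint names below are hypothetical.
import psycopg2

PRIMARY_DSN = "host=db-primary.internal dbname=app user=app"
REPLICA_DSN = "host=db-replica-1.internal dbname=app user=app"

def save_order(order_id, total):
    # Writes always go to the primary instance.
    with psycopg2.connect(PRIMARY_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (id, total) VALUES (%s, %s)",
                (order_id, total),
            )

def load_order(order_id):
    # Reads go to a replica. With asynchronous replication, an order
    # written moments ago may not be visible here yet -- this is the
    # "eventual" consistency described above.
    with psycopg2.connect(REPLICA_DSN) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, total FROM orders WHERE id = %s",
                (order_id,),
            )
            return cur.fetchone()
```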

Strategy 2: Partitioned Data

Another strategy you might apply to distribute traffic volume across multiple database servers is to partition data horizontally, a technique often called sharding. Instead of all the records for a table living in one database, you define some criterion that determines in which one of multiple partitioned databases each record will be stored.

The diagram above depicts this second strategy: the application on the server now has two physical databases to which it must read and write what is, from the application’s point of view, a single logical table. Unlike with asynchronously updated read replicas, with this strategy you don’t lose any consistency, because you’re reading and writing the same database for a given record. What you do lose is the ability to enforce relational integrity for the table, and to construct SQL queries that incorporate data across the partitions.
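Here’s a minimal sketch of what application-side partitioning might look like. The partition names are hypothetical stand-ins for real database connections:

```python
# A minimal sketch of application-side partitioning (sharding).
import zlib

PARTITIONS = ["customers-db-0", "customers-db-1"]  # hypothetical databases

def partition_for(customer_id: str) -> str:
    # Use a stable hash (not Python's per-process hash()) so a given
    # customer always routes to the same partition.
    index = zlib.crc32(customer_id.encode("utf-8")) % len(PARTITIONS)
    return PARTITIONS[index]

# All reads and writes for one customer hit one partition, so there is no
# replication lag -- but a query across all customers must now fan out
# over every partition in application code.
print(partition_for("customer-42"))
```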

DynamoDB as a Distributed Data Service

DynamoDB achieves the performance that it does by being a highly distributed data service. The implementation of DynamoDB, under the covers, likely bears no resemblance to that of a relational database. But fundamentally, it’s able to be distributed in this manner by applying the two strategies we just outlined for scaling relational databases. First, table data is partitioned across multiple compute and storage nodes. Then, the data within these partitions is additionally replicated across multiple nodes.

The diagram below depicts the upside to how DynamoDB distributes data. Applying the scaling strategies outlined above to a relational database requires some combination of infrastructure changes, data migrations, and additional application logic. The magic of DynamoDB lies in how client code always reads and writes a single logical table, while DynamoDB transparently partitions and replicates the data in that table across nodes, on the fly.

To summarize, with DynamoDB, you gain the benefits of a distributed data service that scales up to very high performance for very large datasets, without any extra work on your part. As you would expect based on what we’ve outlined, you have two major trade-offs in exchange. First, to achieve maximum performance, you're sacrificing consistency. Second, you lose the ability to join tables to enforce relational integrity or to perform cross-table queries.

These trade-offs don’t always present an obstacle. If you’re building a content management system, for example, having a brief period of inconsistency in the data is likely not an issue. But you wouldn’t build anything involving, let’s say, financial transactions on this basis. For some use cases, “eventual” consistency is simply not workable. There are also a lot of applications with complex data models where the lack of cross-table querying capability is a problem. It’s a lot easier to define views that join tables in SQL than to write code that does the same, as you must do for DynamoDB.

The sample code for this course is a good fit for DynamoDB. The data model is not complex, the eventual consistency model is acceptable, and the envisioned traffic volumes benefit from the scaling and performance that DynamoDB offers. The service isn’t a solution to every data problem, but like Lambda, it’s a very good solution to certain problems.

Enabling On-Demand Pricing

As with Lambda, an advantage of having mediated access to dynamically allocated compute resources is that it enables on-demand pricing. You do have the option to pay for provisioned capacity with DynamoDB, but as we noted with Lambda, on-demand pricing lets you deploy low-volume environments that cost almost nothing. Not surprisingly, for DynamoDB the on-demand pricing model includes costs for storage and for read/write operations.
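As a sketch, here’s how you might create a table with on-demand pricing using boto3, the AWS SDK for Python. The table and attribute names are hypothetical:

```python
# A minimal sketch of creating a DynamoDB table with on-demand pricing.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "order_id", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "order_id", "KeyType": "HASH"},
    ],
    # On-demand pricing: you pay per read/write request plus storage,
    # rather than for pre-provisioned capacity units.
    BillingMode="PAY_PER_REQUEST",
)
```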

Going Schema-less

With DynamoDB, you specify either a single partition key or a combination of partition and range keys when defining a table. Once deployed, these table keys can’t be changed. On the other hand, there’s no fixed schema for non-key attributes, so, depending on the client interface used, you can store different data structures in a single table, or augment the attributes stored in a table over time, without having to alter any table schemas. Note that there are predefined data types for attributes, but these include document types (maps and lists) that correspond to JSON, which allows a lot of flexibility in the data you can store.
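For example, here’s a minimal sketch using the boto3 Table resource that stores two differently shaped items in the same table. The “orders” table and its attributes are hypothetical; only the key attribute is fixed:

```python
# A minimal sketch of schema-less item storage in DynamoDB.
import boto3

table = boto3.resource("dynamodb").Table("orders")

# Two differently shaped items in one table -- no schema changes needed.
table.put_item(Item={"order_id": "o-1001", "total": 42, "currency": "USD"})
table.put_item(Item={
    "order_id": "o-1002",
    "status": "shipped",
    # Nested maps and lists correspond to JSON documents.
    "shipment": {"carrier": "UPS", "tracking": "1Z999AA10123456784"},
})
```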

Using Global Secondary Indexes

We’ll dive into this topic further in the next lesson, but we’ll note here that DynamoDB gives you a fairly limited repertoire of table-querying capabilities. With a single table key, you can only look up individual items by their key. With two table keys, you can look up individual items, or, for a given value of the first key, query for a range of items along the dimension of the second key. You can scan tables for any attribute, but doing so is slow and expensive in terms of capacity units used. To perform alternative queries on a table, you have the option to define Global Secondary Indexes, which function like materialized views of the source table, with different keys applied. Queries against these indexes are still subject to the same limitations that apply to table keys generally. Be aware that secondary indexes also increase the capacity units used for storage and write operations, because they contain duplicate data.
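As a sketch, here’s how a Global Secondary Index might be defined and queried with boto3. The names are hypothetical: an “orders” table keyed on order_id, with an index keyed on customer_id so the table can also be queried by customer:

```python
# A minimal sketch of defining and querying a Global Secondary Index.
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="orders",
    AttributeDefinitions=[
        {"AttributeName": "order_id", "AttributeType": "S"},
        {"AttributeName": "customer_id", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[{
        "IndexName": "orders-by-customer",
        "KeySchema": [{"AttributeName": "customer_id", "KeyType": "HASH"}],
        # Duplicate all attributes into the index -- this is what drives
        # the extra storage and write capacity costs noted above.
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)

# Query the index rather than the base table by passing IndexName.
response = client.query(
    TableName="orders",
    IndexName="orders-by-customer",
    KeyConditionExpression="customer_id = :c",
    ExpressionAttributeValues={":c": {"S": "c-42"}},
)
```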

Avoiding Hot Partitions

Like cold starts with Lambda, a frequent topic of discussion with DynamoDB is what’s called hot partitions. Remember that DynamoDB’s performance partly hinges on partitioning data across multiple compute and storage nodes. This partitioning process applies a hashing function to the first table key (thus called the “hash” key or “partition” key) to determine how the data will be distributed. If you use two keys in a table and the hash key doesn’t have a good distribution of values, then too much data will end up concentrated in one node. This can be a particular issue if you’re paying for provisioned capacity, since you’ll have to scale capacity evenly across all nodes based on peak usage in the one “hot” partition.

In a nutshell, you need to model your data, and therefore the keys used in that data, to take proper advantage of how DynamoDB works. Where possible, use GUIDs for hash keys; where a hash key is combined with a range key, choose hash-key values with a good distribution.
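For instance, here’s a minimal sketch in Python of writing items keyed by a GUID, using boto3 and a hypothetical “events” table:

```python
# A minimal sketch of using GUIDs as hash keys to avoid hot partitions.
import uuid
import boto3

table = boto3.resource("dynamodb").Table("events")

def record_event(payload):
    # A random GUID hashes uniformly across partitions. Contrast with,
    # say, today's date as the hash key, which would concentrate every
    # write for the day in a single "hot" partition.
    event_id = str(uuid.uuid4())
    table.put_item(Item={"event_id": event_id, **payload})
    return event_id
```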