Multi-region Serverless Architecture: Aurora Global

By Sander van de Graaf and Daniel Jakobsson

In our previous blog post, we explained our plans and architecture for our multi-region rollout. If you want more information on the specifics, we recommend you read that post first. In this post, we'll describe the steps we took to make our application support multi-region Aurora.

A core part of our application is our MySQL database (RDS) that powers our Django application and some parts of our internal APIs.

This database contains most company details, blocklists, and some of the “hot” data we use for indexing, querying, moderation, and training for our ML models. This data is maintained by our various staff members and needs to be available to render some of the pages on our public sites (e.g. downdetector.com).

We mentioned the lookup service in our previous post, which acts as a cache of MySQL data in DynamoDB for our Lambdas, to reduce cold start times and improve performance.

We need our database to be available in our secondary region for our application to work. Fortunately, AWS has a solution for this: MySQL RDS Aurora Multi-Region.

Aurora multi-region takes care of syncing DB writes to another region over the AWS internal backbone, which works out great for us: we don’t have to think about setting up VPC peering between regions.

Aurora has a feature called "write forwarding" that saves us from writing our own mechanism to transmit writes from our secondary region to the primary region.

Simplified multi-region architecture using write forwarding

Aurora write forwarding

In a nutshell, Aurora write forwarding makes your application believe it's writing to a local endpoint in its own region, while the write actually happens on the primary cluster in the primary region. In our case, a client writing in us-west-2 receives a local confirmation that the write was done, and AWS takes care of writing it to the other region over the AWS network.

Endpoints

In our case, both regions will have a writer and a reader endpoint, and the application will be none the wiser about what is happening behind the scenes. Each region will have its own local endpoints:

Endpoints for us-west-2:

  • writer.rds.us-west-2.downdetector.abc
  • reader.rds.us-west-2.downdetector.abc

Endpoints for eu-west-1:

  • writer.rds.eu-west-1.downdetector.abc
  • reader.rds.eu-west-1.downdetector.abc

These endpoints then CNAME to the actual local Aurora endpoints.

Write forwarding consistency

There are different consistency levels for write forwarding, with different considerations. Choose whichever works for your application:

Eventual

With eventual consistency, the local client receives a successful write confirmation directly after a commit, and the AWS layer syncs the write to the primary region in the background. This results in fast writes on the local endpoint (we don’t have to wait for the write to be persisted in the primary region), but different values can be returned globally in the meantime. The data will eventually be consistent (hence the name). There is a potential for data loss here: if the primary region is lost before a write has been synced, that write is gone.

Session

When the value is set to Session, any write waits until it has been successfully written to the primary region, which can slow down writes. Any selects during the same session wait for the successful write signal before returning results, giving consistent reads of values locally inside the session (but sometimes slower select queries). Other sessions can read different values during the same timeframe, but will eventually become consistent as well.

Global

When setting the value to Global, all writes need to be confirmed by all secondary clusters globally. This will result in even slower writes, but all sessions will receive the same values across the globe.

We chose eventual consistency, as we are not too worried about losing some data, and our application was already designed to handle eventual consistency for all of our data. This largely depends on your application and use case though, so think your decision through.
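
To make the tradeoff concrete, here is a minimal sketch of the read-after-write anomaly you accept with eventual consistency (the companies table is hypothetical; conn and cursor are assumed to belong to a session on the secondary cluster):

def demonstrate_eventual(conn, cursor):
    # The commit returns as soon as the local cluster acknowledges it;
    # Aurora forwards the write to the primary region in the background.
    cursor.execute("UPDATE companies SET name = 'Acme' WHERE id = 1")
    conn.commit()

    # An immediate read may still return the old value, because the
    # forwarded write has not necessarily replicated back yet.
    cursor.execute("SELECT name FROM companies WHERE id = 1")
    return cursor.fetchone()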

Using write forwarding

To make use of write forwarding, the feature needs to be enabled on the secondary cluster you want to use it on. Once that's done, each SQL session needs to set a variable during the init phase of the connection (or at least before doing any other activity in the session). The SQL command to do this is:

SET SESSION aurora_replica_read_consistency='EVENTUAL'

Unfortunately, you can only run this command on the secondary clusters. Running it on the primary cluster will result in an error and a non-working connection.
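
For illustration, a minimal sketch of such a session using mysql-connector-python; the blocklist table is hypothetical, and the endpoint names follow the scheme above:

# write_forwarding_example.py
import os

import mysql.connector

# Connect to the local writer endpoint in the secondary region.
conn = mysql.connector.connect(
    host=f"writer.rds.{os.environ['AWS_REGION']}.downdetector.abc",
    user=os.environ["DATABASE_USERNAME"],
    password=os.environ["DATABASE_PASSWORD"],
    database="downdetector",
)

cursor = conn.cursor()
# Must run before any other statement in the session; errors on the primary cluster.
cursor.execute("SET SESSION aurora_replica_read_consistency = 'EVENTUAL'")
# This write is transparently forwarded to the primary region by Aurora.
cursor.execute("INSERT INTO blocklist (domain) VALUES (%s)", ("example.com",))
conn.commit()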

In our case, we need to store a flag somewhere that holds the name of the current primary region. There are several options available (SSM, DynamoDB, environment variables, etc.). In the end we settled on DNS.

Utilizing DNS records

Why DNS? We think it is a good tradeoff: it is easy to use, it scales well (DNS easily handles tons of requests), it has built-in caching, and by using the RFC 7208 (SPF) key/value format, we can add a bit more key/value information if we want to, with only one lookup.

Our DNS TXT record contains the following at this moment:

"region=eu-west-1;max-age=10"

This means "the primary region is eu-west-1" and "cache this value locally for 10 seconds". As long as the local region does not equal that region value, the application will set the read_consistency variable. We use the max-age value to have the application cache the record locally for X seconds (on top of DNS caching). With this, we can easily do maintenance on the record if needed in the future, by temporarily setting the value higher. We can also use it to tweak our RTO targets, as lowering the value means our application picks up primary region changes faster.

We use this DNS record in the application database connectors to decide whether to append the flag to the SQL session, which also gives us the flexibility to switch primary regions and/or perform a failover.

Application database connectors

For Django, this is fairly easy to do in the database settings section of settings.py and using a database router for reads and writes.

Using django.db.backends.mysql

If you’re using django.db.backends.mysql for your backend, this is fairly easy, as it supports the init_command setting.

# settings.py
DATABASES = {
    "writer": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "downdetector",
        "USER": os.environ.get("DATABASE_USERNAME"),
        "PASSWORD": os.environ.get("DATABASE_PASSWORD"),
        "HOST": f"writer.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
    },
    "writer-with-forwarding": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "downdetector",
        "USER": os.environ.get("DATABASE_USERNAME"),
        "PASSWORD": os.environ.get("DATABASE_PASSWORD"),
        "HOST": f"writer.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
        # init_command runs during connection setup, before any queries.
        "OPTIONS": {"init_command": "SET SESSION aurora_replica_read_consistency='EVENTUAL'"},
    },
    "reader": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "downdetector",
        "USER": os.environ.get("DATABASE_USERNAME"),
        "PASSWORD": os.environ.get("DATABASE_PASSWORD"),
        "HOST": f"reader.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
    },
}
# dbrouter.py
import os

import dns.resolver
from cached_property import cached_property_with_ttl

AURORA_TXT_MAX_AGE = int(os.environ.get("AURORA_TXT_MAX_AGE", 60))


class ReaderWriterRouter(object):
    """
    A router that sends reads to the local reader endpoint and writes to the
    local writer endpoint, with or without write forwarding.
    """

    def db_for_read(self, model, **hints):
        return "reader"

    def db_for_write(self, model, **hints):
        if self._needs_write_forwarding:
            return "writer-with-forwarding"
        return "writer"

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name, **hints):
        return db == "default"

    @cached_property_with_ttl(ttl=AURORA_TXT_MAX_AGE)
    def _needs_write_forwarding(self):
        primary_region = None
        global AURORA_TXT_MAX_AGE

        # You will want to implement a proper error handling procedure here
        # (alarm, alert) or have a proper fallback case.
        #
        # Get the current primary region from the DNS record.
        try:
            answers = dns.resolver.resolve(
                "primary_region.rds.downdetector.abc",
                "TXT",
            )

            # Turn the TXT record data into a dict of key/values, split by ";",
            # e.g. "region=eu-west-1;max-age=10".
            txt_record = answers[0].strings[0].decode("ascii")
            data = dict(kv.split("=", 1) for kv in txt_record.split(";"))

            # Set the current primary region.
            primary_region = data["region"]

            # If the cache TTL changed, update the internal TTL.
            if "max-age" in data and int(data["max-age"]) != AURORA_TXT_MAX_AGE:
                AURORA_TXT_MAX_AGE = int(data["max-age"])
        except Exception:
            # Default to no forwarding and reset the TTL.
            AURORA_TXT_MAX_AGE = 60
            return False

        # If this is the primary region, we don't need write forwarding.
        return primary_region != os.environ.get("AWS_REGION")

Example implementation of a Django DB router for Aurora multi-region
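
One thing the snippet above does not show: the router still needs to be registered in settings.py. The dotted path below is an assumption based on the file name above:

# settings.py
DATABASE_ROUTERS = ["dbrouter.ReaderWriterRouter"]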

Using mysql.connector.django

When using mysql.connector.django (the pure Python client), things are a bit different. Unfortunately, at the time of writing, mysql-connector-python does not support the init_command option (it is ignored), so you have to handle this yourself, for example by running the SQL command from Django's connection_created signal, as sketched below. Another option is to extend the MySQLConnectionAbstract class and add support for init_command there.
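
A minimal sketch of the signal-based workaround, assuming the database aliases from the settings above:

# signals.py
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def set_read_consistency(sender, connection, **kwargs):
    # Only the forwarding alias needs the session variable set.
    if connection.alias == "writer-with-forwarding":
        with connection.cursor() as cursor:
            cursor.execute("SET SESSION aurora_replica_read_consistency = 'EVENTUAL'")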

Using SQLAlchemy

With SQLAlchemy, you can use the connect event to run the initial command:

@event.listens_for(engines["reader-with-write-forwarding"], "connect", insert=True)
def db_connect(conn, *args, **kwargs):
    """
    On the database connect event, sets aurora_replica_read_consistency to
    the desired level (defaults to EVENTUAL).
    """
    try:
        conn.cmd_query(
            query=f"SET SESSION aurora_replica_read_consistency='{os.environ.get('AURORA_CONSISTENCY', 'EVENTUAL')}'"
        )
        conn.cmd_query(query="SET AUTOCOMMIT=0;")
    except Exception as e:
        logger.exception(e)
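
For context, the engines mapping referenced above could look something like this; the key name and connection URL are assumptions:

# engines.py
import os

from sqlalchemy import create_engine

region = os.environ.get("AWS_REGION", "eu-west-1")

engines = {
    "reader-with-write-forwarding": create_engine(
        f"mysql+mysqlconnector://{os.environ['DATABASE_USERNAME']}:"
        f"{os.environ['DATABASE_PASSWORD']}"
        f"@writer.rds.{region}.downdetector.abc/downdetector",
        pool_pre_ping=True,
    ),
}

The mysqlconnector dialect matters here: the connect listener above calls cmd_query, which is specific to mysql-connector-python connection objects.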

Failover

Aurora regional failover happens when the Aurora service is unhealthy in a particular region. One of the secondary regions needs to be promoted to become the new primary region.

Process

We have automated the failover process so we can regularly run failover tests and know that our process is in order. The steps for a failover with zero data loss are:

  • Pause all writes (we pause all of our kinesis integrations)
  • Select a new region, and promote it to primary
  • Update our DNS records to point to the new primary region
  • Enable write forwarding on the previous primary region (this is not enabled by default)
  • Unpause all writes

This process takes a couple of minutes to run, but after that, all writes happen in the new primary region.
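
For those automating something similar, here is a sketch of the promotion and DNS update steps using boto3's managed failover API; the cluster identifiers and hosted zone ID are placeholders:

# failover.py
import boto3

GLOBAL_CLUSTER = "downdetector-global"  # placeholder
TARGET_CLUSTER_ARN = "arn:aws:rds:us-west-2:123456789012:cluster:downdetector-us-west-2"  # placeholder
HOSTED_ZONE_ID = "Z0000000000000"  # placeholder


def promote_region(new_region):
    # Promote the chosen secondary cluster to primary.
    boto3.client("rds").failover_global_cluster(
        GlobalClusterIdentifier=GLOBAL_CLUSTER,
        TargetDbClusterIdentifier=TARGET_CLUSTER_ARN,
    )

    # Point the primary_region TXT record at the new region.
    boto3.client("route53").change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "primary_region.rds.downdetector.abc",
                    "Type": "TXT",
                    "TTL": 10,
                    "ResourceRecords": [{"Value": f'"region={new_region};max-age=10"'}],
                },
            }]
        },
    )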

Conclusion

Adopting Aurora multi-region was quite an undertaking, but it solves all of our core data syncing issues, gives us more resilience to outages, and adds flexibility to our application logic.