By Sander van de Graaf and Daniel Jakobsson
In our previous blog post, we explained our plans and architecture for our multi-region rollout. If you want more information on the specifics, we recommend you read that post first. In this post, we'll describe the steps we needed to take to make our application support multi-region Aurora.
A core part of our application is our MySQL database (RDS) that powers our Django application and some parts of our internal APIs.
This database contains most company details, blocklists, and some of the “hot” data we use for indexing, querying, moderation, and training our ML models. This data is maintained by our various staff members and needs to be available to render some of the pages on our public sites (e.g. downdetector.com).
We mentioned the lookup service in our previous post, which acts as a MySQL cache in DynamoDB for our Lambdas, to reduce cold startup times and improve performance.
We need our database to be available in our secondary region for our application to work. Fortunately, AWS has a solution for this: multi-region Aurora MySQL (Aurora Global Database).
Aurora multi-region takes care of syncing DB writes to another region over the AWS internal backbone, which works out great for us, as we now don’t have to think about setting up VPC peerings between regions.
Aurora has a feature called "write forwarding", which saves us from writing our own mechanism to transmit writes from our secondary region to the primary region.
Simplified multi-region architecture using write forwarding
Aurora write forwarding
In a nutshell, Aurora write forwarding makes your application believe it's writing to a local endpoint in its own region, while the write actually happens on the primary cluster in the primary region. In our case, a client writing in us-west-2 will receive a local confirmation that the write was done, and AWS takes care of writing the data to the other region over the AWS network.
Endpoints
In our case, both regions will have a writer and reader endpoint, and the application will be none the wiser about what is happening behind the scenes. Each region will have a local endpoint:
Endpoints for us-west-2:
- writer.rds.us-west-2.downdetector.abc
- reader.rds.us-west-2.downdetector.abc
Endpoints for eu-west-1:
- writer.rds.eu-west-1.downdetector.abc
- reader.rds.eu-west-1.downdetector.abc
These endpoints will then CNAME to the actual local Aurora endpoints.
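For illustration, the records in us-west-2 might look something like this (the Aurora cluster identifiers below are hypothetical, not our real ones):

writer.rds.us-west-2.downdetector.abc  CNAME  downdetector.cluster-abc123.us-west-2.rds.amazonaws.com
reader.rds.us-west-2.downdetector.abc  CNAME  downdetector.cluster-ro-abc123.us-west-2.rds.amazonaws.com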
Write forwarding consistency
There are different consistency levels for write forwarding, with different considerations. Choose whichever works for your application:
Eventual
With eventual consistency, the local client receives a successful write confirmation directly after a commit, and the AWS layer syncs the write to the primary region. This results in fast writes on the local endpoint (we don’t have to wait for the write to actually be written in the primary region), but it can lead to different values being returned globally. The data will eventually be consistent (hence the name). There is a potential for data loss here: if the primary region is lost while data is still being synced, writes that have not yet been replicated can be lost.
Session
When the value is set to Session, any write will wait until it has been successfully written to the primary region, which can slow down writes. Any selects during the same session will wait for the successful write signal before returning results, resulting in consistent reads locally within the session (but sometimes slower select queries). Other sessions could read different values during the same timeframe, but the data will eventually be consistent for all sessions.
Global
When setting the value to Global, all writes need to be confirmed by all secondary clusters globally. This results in even slower writes, but all sessions will receive the same values across the globe.
We chose eventual consistency, as we are not too worried about losing some data, and our application has already been designed to handle eventual consistency for all of our data. This largely depends on your application and use case though, so think carefully about which level fits yours.
Using write forwarding
To make use of write forwarding, the feature needs to be enabled on the secondary cluster that you want to use it on. Once enabled, each SQL session needs to set a flag during the init phase of the connection (or at least before doing any other activity in the session). The SQL command to do this is:
SET SESSION aurora_replica_read_consistency='EVENTUAL'
Unfortunately, you can only run this command on the secondary clusters. Running this on the primary cluster will result in an error, and a non-working connection.
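Enabling the feature itself is a one-time change on the secondary cluster. A minimal boto3 sketch (the cluster identifier below is hypothetical) could look like this:

# Sketch: turn on global write forwarding for the secondary cluster
import boto3

rds = boto3.client("rds", region_name="us-west-2")
rds.modify_db_cluster(
    DBClusterIdentifier="downdetector-secondary",  # hypothetical identifier
    EnableGlobalWriteForwarding=True,
    ApplyImmediately=True,
)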
In our case, we need to store a flag somewhere that records which region is currently the primary. There are several options available (SSM, DynamoDB, environment variables, etc.). In the end we settled on DNS.
Utilizing DNS records
Why DNS? We think it is a good tradeoff: it is easy to use, it scales well (DNS can easily handle tons of requests), it has built-in caching, and by using the RFC 7208 (SPF) key/value format we can add a bit more key/value information if we want to, with only one lookup.
Our DNS TXT record currently contains the following:
"region=eu-west-1;max-age=10"
This means the primary region is eu-west-1, and "locally cache this value for 10 seconds". As long as the local region does not equal that region value, the application will set the read_consistency variable.
We use the max-age value to have the application cache the value locally for that many seconds (on top of DNS caching). With this, we can easily do maintenance on the record in the future by temporarily setting it to a higher value. We can also use this to tweak our RTO targets, as lowering this value means that our application will pick up primary region changes faster.
We use this DNS record in the application database connectors to decide whether or not to append the flag to the SQL session, which also gives us the flexibility to switch primary regions and/or perform a failover.
Application database connectors
For Django, this is fairly easy to do in the database settings section of settings.py, combined with a database router for reads and writes.
Using django.db.backends.mysql
If you’re using django.db.backends.mysql as your backend, this is straightforward, as it supports the init_command setting.
# settings.py
DATABASES = {
"writer": {
"ENGINE": "django.db.backends.mysql",
"NAME": "downdetector",
"USER": os.environ.get("DATABASE_USERNAME"),
"PASSWORD": os.environ.get("DATABASE_PASSWORD"),
"HOST": f"writer.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
},
"writer-with-forwarding": {
"ENGINE": "django.db.backends.mysql",
"NAME": "downdetector",
"USER": os.environ.get("DATABASE_USERNAME"),
"PASSWORD": os.environ.get("DATABASE_PASSWORD"),
"HOST": f"writer.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
"OPTIONS": {"init_command": "SET SESSION aurora_replica_read_consistency='EVENTUAL’"},
},
"reader": {
"ENGINE": "django.db.backends.mysql",
"NAME": "downdetector",
"USER": os.environ.get("DATABASE_USERNAME"),
"PASSWORD": os.environ.get("DATABASE_PASSWORD"),
"HOST": f"reader.rds.{os.environ.get('AWS_REGION')}.downdetector.abc",
}
}
# dbrouter.py
import os

import dns.resolver
from cached_property import cached_property_with_ttl

AURORA_TXT_MAX_AGE = int(os.environ.get("AURORA_TXT_MAX_AGE", 60))


class ReaderWriterRouter(object):
    """
    A router that sends reads to the reader endpoint and writes to the writer
    endpoint, enabling write forwarding when we are not in the primary region.
    """

    def db_for_read(self, model, **hints):
        return "reader"

    def db_for_write(self, model, **hints):
        if self._needs_write_forwarding:
            return "writer-with-forwarding"
        return "writer"

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name, **hints):
        return db == "default"

    @cached_property_with_ttl(ttl=AURORA_TXT_MAX_AGE)
    def _needs_write_forwarding(self):
        primary_region = None
        global AURORA_TXT_MAX_AGE
        # You will want to implement a proper error handling procedure here
        # (alarm, alert, or a proper fallback case).
        #
        # Get the current primary region from the DNS record.
        try:
            answers = dns.resolver.resolve(
                "primary_region.rds.downdetector.abc",
                "TXT",
            )
            # Turn the TXT record data into a dict of key/values, split by ";"
            # example: foo=bar;foobar=true;max-age=10
            txt_record = answers[0].strings[0].decode("ascii")
            data = {kv.split("=")[0]: kv.split("=")[1] for kv in txt_record.split(";")}
            # Set the current primary region.
            primary_region = data["region"]
            # If the cache TTL changed, update the internal TTL.
            if "max-age" in data and int(data["max-age"]) != AURORA_TXT_MAX_AGE:
                AURORA_TXT_MAX_AGE = int(data["max-age"])
        except Exception:
            # Default to False (no write forwarding) and reset the TTL.
            AURORA_TXT_MAX_AGE = 60
            return False
        # If this is the primary region, we don't need write forwarding.
        return primary_region != os.environ.get("AWS_REGION")
Example implementation of a Django DB router for Aurora multi-region
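For the router to take effect, Django also needs to be pointed at it via the DATABASE_ROUTERS setting. The dotted path below is an assumption; adjust it to wherever dbrouter.py lives in your project:

# settings.py -- register the router (the module path is hypothetical)
DATABASE_ROUTERS = ["downdetector.dbrouter.ReaderWriterRouter"]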
Using mysql.connector.django
When using mysql.connector.django (the pure-Python client), things are a bit different. Unfortunately, at the time of writing, mysql-connector-python does not support the init_command option (it is ignored), so you will have to handle this yourself, for example with Django's connection_created signal, running the SQL command whenever a new connection is made. Another option is to extend the MySQLConnectionAbstract class and add support for init_command there.
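As a rough illustration of the signal approach, a minimal sketch (not from our actual setup; it assumes the same database aliases as the Django settings above) could look like this:

# Sketch only: run the init command on every new connection via Django's
# connection_created signal, since mysql.connector.django ignores init_command.
from django.db.backends.signals import connection_created
from django.dispatch import receiver


@receiver(connection_created)
def set_read_consistency(sender, connection, **kwargs):
    # Only touch the alias that talks to the write-forwarding endpoint.
    if connection.alias == "writer-with-forwarding":
        with connection.cursor() as cursor:
            cursor.execute("SET SESSION aurora_replica_read_consistency='EVENTUAL'")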
Using SQLAlchemy
With SQLAlchemy you can listen for the connect event to run the init command:
# `engines` is the application's dict of SQLAlchemy engines, keyed by alias
import logging
import os

from sqlalchemy import event

logger = logging.getLogger(__name__)


@event.listens_for(engines["reader-with-write-forwarding"], "connect", insert=True)
def db_connect(conn, *args, **kwargs):
    """
    On the database connect event, sets aurora_replica_read_consistency to
    the desired level (defaults to EVENTUAL).
    """
    try:
        conn.cmd_query(
            query=f"SET SESSION aurora_replica_read_consistency='{os.environ.get('AURORA_CONSISTENCY', 'EVENTUAL')}'"
        )
        conn.cmd_query(query="SET AUTOCOMMIT=0;")
    except Exception as e:
        logger.exception(e)
Failover
Aurora regional failover happens when the Aurora service is unhealthy in a particular region. One of the secondary regions needs to be promoted to become the new primary region.
Process
We have automated the failover process so that we can regularly run failover tests and know that our process is in order. The steps for a failover with zero data loss are:
- Pause all writes (we pause all of our kinesis integrations)
- Select a new region, and promote it to primary
- Update our DNS records to point to the new primary region
- Enable write forwarding on the previous primary region (this is not enabled by default)
- Unpause all writes
This process takes a couple of minutes to run, but after that, all writes happen in the new primary region.
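As an illustration of the DNS update step, here is a minimal sketch assuming the TXT record lives in Route 53 (the hosted zone ID and function name are hypothetical):

# Sketch: repoint the primary_region TXT record at the newly promoted region
import boto3

route53 = boto3.client("route53")


def set_primary_region(hosted_zone_id, new_region, max_age=10):
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Comment": f"Failover: promote {new_region} to primary",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "primary_region.rds.downdetector.abc",
                        "Type": "TXT",
                        "TTL": 10,
                        "ResourceRecords": [
                            # TXT record values must be wrapped in double quotes
                            {"Value": f'"region={new_region};max-age={max_age}"'}
                        ],
                    },
                }
            ],
        },
    )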
Conclusion
Adopting Aurora multi-region was quite an undertaking, but it solves all of our core data syncing issues, gives us more resilience to outages, and adds flexibility in our application logic.