AWS Database Blog
Performing Parallel Queries and Phrase Searching with Amazon RDS for PostgreSQL 9.6.1
Jeff Levine is a solutions architect for Amazon Web Services.
As a solutions architect for Amazon Web Services, one of my responsibilities is to help our customers learn about new features and capabilities of our services. Amazon RDS makes it possible to set up, operate, and scale relational databases in the cloud without having to focus on time-consuming administrative tasks. Amazon RDS recently announced support for PostgreSQL 9.6.1. In this blog post, I will discuss two of the features included in this release: parallel queries and phrase searches.
Parallel Queries
Parallel queries make it possible to use multiple CPU cores to speed up the execution of queries by creating multiple worker processes. Each worker process performs the query on a subset of data. The results are then combined to produce the final query result. The number of worker processes is determined by the value of the parameter max_parallel_workers_per_gather. The default value of this parameter is 0, which disables parallelism.
In the examples that follow, I’ve created a VPC with a PostgreSQL RDS database on a db.m4.large instance class with 2 vCPUs and 8 GB RAM. I’m going to use pgcli, a command line tool for PostgreSQL, to interact with the database named pg961db to create a table.

The create table statement uses the generate_series function to create a set of integers from 1 to 100 million that serve as identifiers for the rows of the table. I am also using the random() and floor() functions to generate a column with a random number from 1 to 9. I then use the select statement to show the first three rows of the table. Now I will run a standard query:
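The screenshot for this step isn't reproduced here; a minimal SQL sketch of what it describes, using assumed names (test_table, id, category), might look like this:

create table test_table as
select id,
       floor(random() * 9 + 1)::int as category   -- random integer from 1 to 9
from generate_series(1, 100000000) as id;         -- row identifiers 1 to 100 million

select * from test_table limit 3;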

In the preceding screenshot, max_parallel_workers_per_gather is set to 0 to disable parallel queries, and then a select is performed. Notice that the query plan calls for a sequential scan (denoted by Seq Scan). The total query time is approximately 44 seconds. Now let’s see what happens when parallel queries are enabled:

In this example, max_parallel_workers_per_gather is set to 2 to enable parallel queries. The output from explain shows that two workers have been launched and that the total processing time was reduced to approximately 29 seconds, a 34% reduction in the time required to perform the same query without parallel workers.
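The screenshots aren't reproduced here either; assuming the hypothetical test_table above, the two runs would look roughly like the following sketch (the exact query in the original screenshots may differ):

set max_parallel_workers_per_gather = 0;
explain analyze select avg(category) from test_table;   -- plan shows Seq Scan, roughly 44 seconds

set max_parallel_workers_per_gather = 2;
explain analyze select avg(category) from test_table;   -- plan shows a Gather node with 2 workers, roughly 29 seconds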
This example shows how parallel queries can reduce the processing time of queries using multiple workers in a divide and conquer manner. This is especially useful when processing aggregate functions such as avg() in the preceding example. The resulting increase in efficiency will depend on a number of factors, including the data itself, the specific query, and the database instance class.
Phrase Searches
PostgreSQL 9.6.1 also introduced the ability to perform phrase searches, in which a search query consists of an ordered set of lexemes rather than, as with standard searches, an unordered set. Consider these standard search examples:

In the first example, the search for “Amazon Web Services” in “Amazon Web Services has over 90 services” yields the expected result of True. However, the search for “Amazon Web Services” in “The Web Travel Services Agency offers tours of the Amazon” also yields a result of True even though it is likely not what we wish to see. To understand why this happens, see how the plainto_tsquery() function parses query strings:
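In SQL, these two standard searches might look like the following sketch using to_tsvector and plainto_tsquery:

select to_tsvector('english', 'Amazon Web Services has over 90 services')
       @@ plainto_tsquery('english', 'Amazon Web Services');   -- true, as expected

select to_tsvector('english', 'The Web Travel Services Agency offers tours of the Amazon')
       @@ plainto_tsquery('english', 'Amazon Web Services');   -- also true, probably not what we want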

The parser first applies the English language processing rules to the string “Amazon Web Services” by normalizing the case of each word and extracting the root of each token before converting it to a lexeme (hence, the change of “services” to “servic”). The resulting query string consists of the lexemes separated by the intersection operator “&.” A match occurs if all of the lexemes are found, regardless of order. PostgreSQL 9.6.1 includes support for phrase search query strings using the phraseto_tsquery() function, as shown here:

The query string contains the same three lexemes, but with a new ordering operator <->, which means “is followed by.” For a match to occur, the three lexemes must appear in the correct order. This capability makes it much easier to refine searches and produce the most desirable results.
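A minimal SQL sketch contrasting the two parsers (the outputs shown as comments follow the English text search configuration described above):

select plainto_tsquery('english', 'Amazon Web Services');
-- 'amazon' & 'web' & 'servic'

select phraseto_tsquery('english', 'Amazon Web Services');
-- 'amazon' <-> 'web' <-> 'servic'

select to_tsvector('english', 'The Web Travel Services Agency offers tours of the Amazon')
       @@ phraseto_tsquery('english', 'Amazon Web Services');   -- now false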
Amazon RDS for PostgreSQL version 9.6.1 brings the latest capabilities of PostgreSQL to the cloud as a managed service. The parallel query feature can increase the performance of searches by using additional CPU capacity to run multiple worker processes. The phrase search capability provides for a more tailored search experience where the specific order of terms is required. It’s easy to get started. Use our Getting Started Guide to create your PostgreSQL database in just a few clicks.
Streaming Changes in a Database with Amazon Kinesis
Emmanuel Espina is a software development engineer at Amazon Web Services.
In this blog post, I will discuss how to integrate a central relational database with other systems by streaming its modifications through Amazon Kinesis.
The following diagram shows a common architectural design in distributed systems. It includes a central storage referred to as a “single source of truth” and several derived “satellite” systems that consume this central storage.

You could use this design architecture and have a relational database as the central data store, taking advantage of the transactional capabilities of this system for maintaining the integrity of the data. A derived system in this context could be a full-text search system that observes this single source of truth for changes, transforms and filters those modifications, and finally updates its internal indexes. Another example could be a columnar store that is more appropriate for OLAP queries. In general, any system that requires taking action upon modification of individual rows of the central relational system is a good candidate to become a derived data store.
A naive implementation for these kinds of architectures will have the derived systems issuing queries periodically to retrieve modified rows, essentially polling the central database with a SELECT-based query.
A better implementation for this architecture is one that uses an asynchronous stream of updates. Because databases usually have a transaction log where all row changes are stored, exposing this stream of changes to external observer systems lets those systems attach to it and start processing and filtering row modifications. I will show a basic implementation of this scheme using MySQL as the central database and Amazon Kinesis as the message bus.
Normally, the MySQL binlog is exposed to read replicas that read all of the changes on the master and then apply them locally. In this post, I am going to create a generalized read replica that will publish changes to an Amazon Kinesis stream instead of applying the modifications to a local database.

One important detail of this method is that the consumers won’t receive SQL queries. Those can be exposed too, but in general observers won’t be very interested in SQL unless they maintain a SQL-compatible replica of the data themselves. Instead, they will receive modified entities (rows) one by one. The benefits of this approach are that consumers do not need to understand SQL and the single source of truth does not need to know who will be consuming its changes. That means that different teams can work without coordinating among themselves on the required data format. Even better, given the ability of Amazon Kinesis clients to read from a specific point in time, each consumer can process messages at its own pace. This is why a message bus is one of the least coupled ways to integrate your systems.
In the example used in this post, the rows fetcher is a regular Python process that will attach to the central database, simulating a read replica.
The database can be either Amazon RDS or any installation of MySQL. In the case of RDS, the fetcher process must be installed on a different host (for example, EC2) because it is not possible to install custom software on RDS instance hosts. For external installations, the fetcher process can be installed on the same host as the database.
Prepare the master MySQL instance
The MySQL master (the single source of truth) must be configured as if it were a master for regular replication. Binlogs must be enabled and working in ROW format to receive individual modified rows. (Otherwise, you would end up with SQL queries only.) For information, see The Binary Log on the MySQL site.
To enable the binlog, add these two lines to your my.cnf configuration file:
log_bin=<path to binlog>
binlog_format=ROW
It is possible to get row-based logging by setting the transaction isolation level to READ-COMMITTED at the global or session level for all connections (for example, using init_connect or a database API like JDBC).
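For example, a minimal way to set this at the global level from a SQL session (a sketch; session-level and init_connect variants work similarly) is:

SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;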
If you are using RDS (MySQL 5.6 or later), things are easy! You can create the required configuration by enabling periodic backups (binlogs are disabled if backups are not enabled) and updating the binlog_format parameter group variable to ROW. (You can do this from the RDS Dashboard under Parameter Groups.)

Add permissions
If you are using the default user created by RDS, you might already have these permissions. If not, you’ll need to create a user with REPLICATION SLAVE permissions. For information, see Creating a User for Replication.
mysql> CREATE USER 'repl'@'%.mydomain.com' IDENTIFIED BY 'slavepass';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%.mydomain.com';
Create an Amazon Kinesis stream
You need an Amazon Kinesis stream and boto3 client credentials. For information about client credentials, see the Boto 3 documentation.
Open the Amazon Kinesis console and choose Create Stream.

Enter the name of your stream and the number of shards. In this example, there is a single shard.

After a few minutes, your stream will be ready to accept row modifications!
Assign permissions to your CLI user
You can use the AWS Key Management Service to give permissions to the CLI user that will be accessing this stream.

In this example, that user is KinesisRDSIntegration. You can create a user or use an existing one, but you need to add permissions for writing to the Amazon Kinesis stream.

You can create a policy specific for your stream. This example uses a standard policy that gives complete access to Amazon Kinesis.

Connecting to the master and publishing changes
To install libraries required by the Python publisher, run the following command:
pip install mysql-replication boto3
For more detailed instructions, see:
https://github.com/noplay/python-mysql-replication
https://boto3.readthedocs.io/en/latest/guide/quickstart.html
Here is the Python script that performs the magic. Remember to replace the <HOST>, <PORT>, <USER>, <PASSWORD> and <STREAM_NAME> variables with the values for your configuration.
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)


def main():
    kinesis = boto3.client("kinesis")

    # Attach to the master as if this process were a read replica
    stream = BinLogStreamReader(
        connection_settings={
            "host": "<HOST>",
            "port": <PORT>,
            "user": "<USER>",
            "passwd": "<PASSWORD>"},
        server_id=100,
        blocking=True,
        resume_stream=True,
        only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent])

    # Publish every modified row to the Kinesis stream as a JSON record
    for binlogevent in stream:
        for row in binlogevent.rows:
            event = {"schema": binlogevent.schema,
                     "table": binlogevent.table,
                     "type": type(binlogevent).__name__,
                     "row": row
                     }

            kinesis.put_record(StreamName="<STREAM_NAME>", Data=json.dumps(event), PartitionKey="default")
            print(json.dumps(event))


if __name__ == "__main__":
    main()
This script will publish each modified row as an Amazon Kinesis record, serialized in JSON format.
Consuming the messages
Now you are ready to consume the modified records. Any consumer code would work. If you use the code in this post, you will get messages in this format:
{"table": "Users", "row": {"values": {"Name": "Foo User", "idUsers": 123}}, "type": "WriteRowsEvent", "schema": "kinesistest"}{"table": "Users", "row": {"values": {"Name": "Bar user", "idUsers": 124}}, "type": "WriteRowsEvent", "schema": "kinesistest"}{"table": "Users", "row": {"before_values": {"Name": "Foo User", "idUsers": 123}, "after_values": {"Name": "Bar User", "idUsers": 123}}, "type": "UpdateRowsEvent", "schema": "kinesistest"}
Summary
In this blog post, I have shown how to expose a database's stream of record changes by using a fake read replica and Amazon Kinesis. Many data-oriented companies are using architectures similar to this. The example provided in this post, while not ready for a real production environment, can be used to experiment with this integration style and improve the scaling capabilities of your enterprise architecture. The most complex part is probably what is already solved behind the scenes by Amazon Kinesis. You only need to provide the glue!
Additional resources
What every software engineer should know about real-time data’s unifying abstraction
All aboard the Databus: LinkedIn’s scalable consistent change data capture platform
Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?
Jon Handler (@_searchgeek) is a principal solutions architect at Amazon Web Services.
Welcome to the first in a series of blog posts about Elasticsearch and Amazon Elasticsearch Service, where we will provide the information you need to get started with Elasticsearch on AWS.
How many instances will you need?
When you create an Amazon Elasticsearch Service domain, this is one of the first questions to answer.

To determine the number of data nodes to deploy in your Elasticsearch cluster, you’ll need to test and iterate. Start by setting the instance count based on the storage required to hold your indices, with a minimum of two instances to provide redundancy.
Storage Needed = Source Data x Source:Index Ratio x (Replicas + 1)
First, figure out how much source data you will hold in your indices. Then, apply a source-data to index-size ratio to determine base index size. Finally, multiply by the number of replicas you are going to store plus one (replica count is 0-based) to get the total storage required. As soon as you know the storage required, you can pick a storage option for the data nodes that dictates how much storage you will have per node. To get the node count, divide the total storage required by the storage per node.
Instances Needed = Storage Needed / Storage per data node
As you send data and queries to the cluster, continuously evaluate the resource usage and adjust the node count based on the performance of the cluster. If you run out of storage space, add data nodes or increase your Amazon Elastic Block Store (Amazon EBS) volume size. If you need more compute, increase the instance type, or add more data nodes. With Amazon Elasticsearch Service, you can make these changes dynamically, with no down time.
Determine how much source data you have
To figure out how much storage you need for your indices, start by figuring out how much source data you will be storing in the cluster. In the world of search engines, the collection of source data is called the corpus. Broadly speaking, there are two kinds of workloads AWS customers run:
- Single index workloads use an external “source of truth” repository that holds all of the content. You write scripts to put the content into the single index for search, and that index is updated incrementally as the source of truth changes. These are commonly full-text workloads like website, document, and e-commerce search.
- Rolling index workloads receive data continuously. The data is put into a changing set of indices, based on a timestamp and an indexing period (usually one day). Documents in these indices are not usually updated. New indices are created each day and the oldest index is removed after some retention period. These are commonly for analytics use cases like log analytics, time-series processing, and clickstream analytics.

If you have a single index workload, you already know how much data you have. Simply check your source of truth for how much data you’re storing, and use that figure. If you are collecting data from multiple sources (such as documents and metadata), sum up the size of all data sources to get the total.
If you have a rolling index workload, you’ll need to calculate how much data you will be storing, based on a single time period and a retention length. A very common case is to store the logs generated every 24 hours (the time period) for two weeks (the retention period). If you don’t already know how much log data you’re generating daily, you can get a rough estimate based on 256 bytes per log line times the number of log lines you’re generating daily. Multiply your daily source data size by the number of days in the retention period to determine the total source data size.
How much index space?

The amount of storage space you’ll use for your index depends on a number of factors. As you send your documents to Elasticsearch, they are processed to create the index structures to support searching them. The on-disk size of these index structures depends on your data and the schema you set up. In practice, and using the default settings, the ratio of source data to index size is usually approximately 1:1.1.
For all practical purposes, and remembering to leave 10% overhead, you can use the source data size as the required index storage size.
Replicas increase the index size
Elasticsearch allows you to set (and change dynamically) a number of replicas for your index. The most important reason to use a replica is to create redundancy in the cluster. For production workloads and for all cases where you cannot tolerate data loss, we recommend using a single replica for redundancy. You might need more replicas to increase query processing capacity. We’ll cover that in a future post. You can have node-level redundancy only if you have more than one node. A single node, even with a replica, will not provide high availability.
Each replica is a full copy of the index, at the shard level. As such, it uses the same storage as the primary copy of the index. If you are using one replica, double the amount of storage for the index.
What’s the storage per instance?
When you configure your Amazon Elasticsearch Service domain, you choose your storage option: instance (ephemeral) storage or EBS storage. If you choose instance storage, then the storage per data node is already set based on your instance type selection. If you choose EBS storage, you can configure the amount of storage per instance, up to the Amazon Elasticsearch Service EBS storage limit for that instance type.
For example, if you choose to use m3.medium.elasticsearch instances and choose instance store as your storage option, each node will have 4 GB of SSD storage. If you choose to use EBS as your storage, you can attach up to 100 GB to each m3.medium.elasticsearch instance.
The amount of usable storage per instance is less than the total storage available. The service files and operating system files take 3% of the storage on an m3.medium (less on larger instances). The service also reserves 20% of the disk, up to a maximum of 20 GB. What this means, especially for smaller instance types, is that if your computed storage is close to the boundary, opt for an additional data node.
Putting it all together
Let’s work through two examples.
The first example is a single index workload that represents a product catalog for an e-commerce website. The company has a catalog of 100,000 products that take up 1 GB of storage in its database. We multiply this 1 GB by the compression ratio (1.0) to get 1 GB of index size. The company will have one replica, so the total storage required is 2 GB. Because they have 4 GB available for each m3.medium.elasticsearch instance, the company could use just one node. However, they would not have anywhere to deploy a redundant replica, so they choose two m3.medium instances.
The second example is a dynamic index workload. The same company ingests data from the Twitter firehose to do brand sentiment analysis and improve their rank function for their product search. They download 100 GB of Twitter data each day and retain it for seven days. We multiply this 100 GB by the compression ratio (1.0) to get 100 GB of index daily. The company will have one replica of this data, yielding 200 GB of daily index, which they will retain for seven days. Multiplying 200 GB by seven days, the company will need 1,400 GB of storage. They choose m3.large.elasticsearch instances, to which they will attach 512 GB, General Purpose SSD (gp2) EBS volumes. At a minimum, they need three of these volumes, but decide on four m3.large.elasticsearch instances to provide additional storage.
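The arithmetic in these two examples can be captured in a tiny helper; the function below is purely illustrative and not part of any AWS SDK:

import math

def instances_needed(source_gb, ratio, replicas, storage_per_node_gb, retention_days=1):
    # Storage Needed = Source Data x Source:Index Ratio x (Replicas + 1), per day of retention
    storage_needed = source_gb * ratio * (replicas + 1) * retention_days
    return storage_needed, math.ceil(storage_needed / storage_per_node_gb)

# Example 1: 1 GB catalog, ratio 1.0, 1 replica, 4 GB per m3.medium.elasticsearch node
print(instances_needed(1, 1.0, 1, 4))        # 2 GB needed; 1 node by storage, but 2 chosen for redundancy

# Example 2: 100 GB/day, ratio 1.0, 1 replica, 7-day retention, 512 GB EBS per node
print(instances_needed(100, 1.0, 1, 512, 7)) # 1,400 GB needed; 3 nodes minimum, 4 chosen for headroom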
How many instances?
It’s easy to get started with Elasticsearch using Amazon Elasticsearch Service, but there are some choices to make up front. One of your first decisions is how many data nodes you’ll need. You can calculate the number of nodes by determining how much data you want to store, multiplying by a compression ratio (usually 1), and then multiplying by the number of replicas plus one. This yields the total on-disk size of the index or indices. Divide that by the per-node storage amount to get the total number of nodes required.
Introducing Cost Allocation Tags for Amazon DynamoDB
Nitin Sagar is a senior product manager for DynamoDB.
You can now add tags to your Amazon DynamoDB tables. Tags are simple, user-customizable key and value pairs that are supported by many AWS services. Tagging for DynamoDB provides fine-grained visibility into your DynamoDB bill. You can assign tags to your tables and see cost breakouts per tag.
Let’s consider a real-life scenario in which you have multiple DynamoDB tables for your different development environments—development, staging, and production. You can assign tags to your DynamoDB tables, representing your various environments by adding a tag and setting the tag key to Environment, and the tag value to Development, Staging, or Production.
Let’s look at how to do this using the DynamoDB console. Before you begin, ensure you have the necessary permissions for the following API operations – ListTagsOfResource and TagResource.
- Sign in to the AWS Management Console and open the DynamoDB console at https://console.aws.amazon.com/dynamodb/.
- Choose Tables, and then choose the table that you want to modify.
- In the Settings tab, choose Tags from the navigation menu.
- In the Add Tags section, type Environment for Key and Development for Value. Choose Apply Changes.

By default, a newly added tag key is inactive for the billing pipeline. You can activate cost allocation tags in the Billing console by following these steps:
- Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
- Choose Cost Allocation Tags from the navigation menu.
- In the User-Defined Cost Allocation Tags section, select the check box next to the Environment tag key and choose Activate.

When you have activated your cost allocation tags, the AWS Cost Explorer provides an easy way to see a breakdown of your AWS costs by tagged AWS resources:
- Sign in to the AWS Management Console and open the Billing console at https://console.aws.amazon.com/billing/.
- Choose Cost Explorer from the navigation menu. Then choose Launch Cost Explorer.
- From the menu at top left, choose Monthly costs by service. Select the appropriate time range in the Time range section from the right menu.
- In the Filtering section, choose Tag from Filter by.
- Choose Environment in the tag key autocomplete field and choose Development in the tag value autocomplete field. Choose Apply.

Your costs are now filtered by the selected tag (Environment=Development). Costs are shown starting from when you applied your tag to your AWS resources (January 15 onward in our example).

You can add up to 50 tags to your DynamoDB tables by using the DynamoDB Management Console, AWS CLI, or AWS SDK. Global secondary indexes (GSIs) and local secondary indexes (LSIs) associated with base DynamoDB tables are automatically tagged with the same tags that are used for the corresponding DynamoDB base tables. Tagging support for DynamoDB is available in all AWS regions.
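If you prefer the SDK to the console, tagging a table programmatically is straightforward. A minimal boto3 sketch follows; the region, account ID, and table name in the ARN are placeholders:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
table_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"

# Add the Environment=Development cost allocation tag to the table
dynamodb.tag_resource(
    ResourceArn=table_arn,
    Tags=[{"Key": "Environment", "Value": "Development"}])

# Verify the tags currently assigned to the table
print(dynamodb.list_tags_of_resource(ResourceArn=table_arn)["Tags"])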
You can read more about implementing a consistent and effective tagging strategy for your AWS resources by visiting the AWS Tagging Strategies page.
Introducing AWS Schema Conversion Tool Version 1.0.502
Eran Schitzer is a product manager in Amazon Web Services.
We are pleased to announce the recent release of version 1.0.502 of the AWS Schema Conversion Tool (AWS SCT). Two key capabilities included with this release are the use of Amazon Aurora with PostgreSQL compatibility as a target and integration with AWS Database Migration Service (AWS DMS).
Amazon Aurora with PostgreSQL compatibility as a target
Following the announcement of Amazon Aurora with PostgreSQL compatibility preview at re:Invent, you can now use the AWS SCT to convert your schema to use with a PostgreSQL-compatible Aurora DB cluster.
The PostgreSQL-compatible edition of Amazon Aurora delivers up to two to three times the performance of PostgreSQL, along with other PostgreSQL capabilities valuable to customers—user-defined functions and data types, complex SQL support, NoSQL and JSON support, and broad application development language support. At the same time, it takes advantage of the scalability, durability, and security capabilities of Amazon Aurora, all for one-tenth the cost of commercial grade databases such as Oracle Database and Microsoft SQL Server. To sign up for the preview, see the preview webpage.
For more information about converting database schemas, see Converting Database Schema to Amazon RDS by Using the AWS Schema Conversion Tool.
Integration with DMS
After converting a schema with SCT, you can have the associated endpoints and tasks created automatically on a DMS instance. You can run and monitor tasks from either the DMS console or from SCT.
The following procedure shows how to create a DMS task from SCT. Once the schema conversion is complete, the DMS task created through the SCT will migrate the data to the target database engine.
To create and access DMS tasks from SCT, you’ll need to define at least one service profile. You can define one or more service profiles to let SCT access your resources running on AWS, including DMS, RDS and AWS Lambda.
- Open SCT, choose Settings, and then choose Global settings.
- On the AWS Service Profiles tab, type a name for Profile name, enter your AWS access key and AWS secret key, and choose the region where the profile will be created.

- Choose Save to save the profile.
- Choose Current project settings, and then choose the Project Environment tab. For AWS Service Profile, choose the default profile for the current project, and then choose OK.
Once you have defined an AWS service profile, you can define, access, and execute DMS tasks from SCT.
- Choose the converted schema you’d like to migrate, open the context (right-click) menu for it, and choose Create DMS Task.

- In the Create DMS task window, enter the task details—task name, replication instance and source and target endpoints (you can also create new endpoints). You can change the defaults, if needed, for the rest of the parameters.
- Choose Create.
At this point, a DMS task is being created in AWS, as shown by this screenshot from the DMS console. 
You can view, start, stop, resume, delete, refresh, and show logs of the new DMS task in the DMS tasks list in both SCT and DMS.

For more information about version 1.0.502, see AWS Schema Conversion Tool User Guide. You can download SCT from here.
Auditing an Amazon Aurora Cluster
Sirish Chandrasekaran is a product manager at Amazon Web Services.
At re:Invent, we announced a number of new features for Amazon Aurora, the MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
Today we are happy to announce that our advanced auditing feature is now available to all customers.
What is advanced auditing, and why is it important?
Auditing refers to logging a specified set of events and publishing these logs in a way that can be consumed either manually or by another application. These logs can form the bedrock of your governance and regulatory compliance programs. Examples of advanced auditing include log analysis, auditing of user actions (both past events and near real-time threat detection), and support for configurable alarms set on security-related events. Aurora advanced auditing is designed to provide this functionality with minimal impact on database performance.
Getting started with advanced auditing
To get started, enable advanced auditing and then take a look at the audit logs.
Enabling advanced auditing
You can enable and configure advanced auditing by setting the parameters listed following in the DB cluster parameter group. Modifying these parameters doesn’t require a DB cluster restart, and the behaviors are the same as for Aurora DB instance parameters.
You use the server_audit_logging parameter to enable or disable auditing, and the server_audit_events parameter to specify what events to log.
You use the server_audit_excl_users and server_audit_incl_users parameters to specify who gets audited:
- If server_audit_excl_users and server_audit_incl_users are empty (the default), all users are audited.
- If you add users to server_audit_incl_users and leave server_audit_excl_users empty, then only those users added to server_audit_incl_users are audited.
- If you add users to server_audit_excl_users and leave server_audit_incl_users empty, then only those users added to server_audit_excl_users are not audited, and all other users are.
- If you add the same users to both server_audit_excl_users and server_audit_incl_users, then those users are audited because server_audit_incl_users is given higher priority.
The following list describes each of the advanced auditing parameters in more detail.
server_audit_logging enables or disables audit logging. It defaults to OFF; set it to ON to enable logging.
- Scope: Global
- Dynamic: Yes
- Data type: Boolean
- Default value: OFF (disabled)
server_audit_events contains the comma-delimited list of events to log. There should be no white space between the list elements.
- Scope: Global
- Dynamic: Yes
- Data type: String
- Default value: Empty string
- Valid values: You can log any combination of the following events:
- CONNECT — Logs successful connections, failed connections, and disconnections. This value includes the user information.
- QUERY — Logs all query text and query results in plain text, including queries that fail due to syntax or permission errors.
- QUERY_DCL — Similar to Query, but returns only DCL-type queries (GRANT, REVOKE, and so on).
- QUERY_DDL — Similar to Query, but returns only DDL-type queries (CREATE, ALTER, and so on).
- QUERY_DML — Similar to Query, but returns only DML-type queries (INSERT, UPDATE, and so on).
- TABLE — Logs the tables that were affected by query execution.
server_audit_excl_users contains the comma-delimited list of users whose activity isn’t logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. Even if a user is listed in server_audit_excl_users, that user’s actions are logged if that user is also specified in server_audit_incl_users, because that setting has higher priority than server_audit_excl_users.
- Scope: Global
- Dynamic: Yes
- Data type: String
- Default value: Empty string
server_audit_incl_users contains the comma-delimited list of users whose activity is logged. There should be no white space between the list elements. Connect and disconnect events aren’t affected by this variable; they are always logged if logging for them is specified. If a user is listed in server_audit_incl_users, that user is logged even if that user is also specified in server_audit_excl_users, because server_audit_incl_users has higher priority.
- Scope: Global
- Dynamic: Yes
- Data type: String
- Default value: Empty string
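If you prefer the AWS CLI to the console, you can set these parameters on the DB cluster parameter group directly. The following is a sketch with a placeholder parameter group name and a single logged event type:

aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-aurora-cluster-params \
    --parameters "ParameterName=server_audit_logging,ParameterValue=ON,ApplyMethod=immediate" \
                 "ParameterName=server_audit_events,ParameterValue=QUERY,ApplyMethod=immediate"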
Viewing audit logs
You can view the audit logs by using the AWS Management Console. On the Instances page, select and expand the DB cluster, then choose Logs.

If you are familiar with the MariaDB Audit Plugin, you will notice a few differences in Aurora’s approach to auditing.
First, logs are presented somewhat differently:
- The Aurora advanced auditing time stamp is formatted in Unix time format.
- Log events are written to multiple files, and the log records aren’t in sequential order. Customers can concatenate and sort the files using the timestamp and query_id fields to produce a single file with ordered events. You can do this in Unix as follows: cat audit.log.* | sort -t"," -k1,1 -k6,6
- The number of files might vary based on DB instance size.
- File rotation is fixed at 100 MB and isn’t configurable.
Also, enabling Aurora advanced auditing after migration from MySQL works differently. Audit log configuration is set using the parameter group for the DB cluster.
How does Aurora implement advanced auditing?
Although auditing capability is commonly available in both commercial databases and some open source databases, auditing in these databases typically has a significant impact on performance, especially when the load is high. One of our Aurora implementation’s main goals is to provide users with a rich set of information but without sacrificing performance.
Maintaining performance
To understand how we achieve our performance objective, let’s compare our implementation of advanced auditing to that of the MariaDB Audit Plugin. We use this as our comparison point because MySQL Community Edition doesn’t have a native audit log, and the MariaDB Audit Plugin has emerged as the most popular option to fill this void in the open source community.
MariaDB Audit Plugin uses a single thread with a single mutex for processing and writing each event. Although the design strictly preserves the order of events, it can lead to a performance drop due to the bottleneck in writing logs. If we had used a similar approach for Aurora, the performance impact would have been even more significant due to the much higher scalability and expected throughput of the engine.
To maintain our high-performance advantage, we redesigned the event processing and event writing logic. On the input side, we used a latch-free queue to store audit events without blocking other threads. On the output side, we use multiple threads writing events from the latch-free queue to multiple files. The files can be post-processed to present a complete audit log with events in order.

Log format
The audit log is stored separately on each instance’s local (ephemeral) storage. Each Aurora instance spreads log writes across four log files at a time:
- Encoding: UTF-8
- File name pattern: audit.log.[0-3].[%Y-%m-%d-%H-%M][-rotation]
- Location: /rdsdbdata/log/audit/ (on each host)
- Rotation: Maximum size 100 MB per log file, currently not customer-configurable. When the largest of the four log files reaches 100 MB, the system rotates to a new set of log files.
- Clean-up: The system will clean up older audit files to free up space past a certain space consumption and/or age.
- Log format: [timestamp],[serverhost],[username],[host],[connectionid],[queryid],[operation],[database],[object],[retcode]
| Parameter | Description |
|---|---|
| timestamp | The Unix time stamp for the logged event with second precision. |
| serverhost | The name of the instance that the event is logged for. |
| username | The connected user. |
| host | The host that the user connected from. |
| connectionid | The connection ID number for the logged operation. |
| queryid | The query ID number, which can be used for finding the relational table events and related queries. For TABLE events, multiple lines are added. |
| operation | The recorded action type. Possible values are CONNECT, QUERY, READ, WRITE, CREATE, ALTER, RENAME, and DROP. |
| database | The active database, as set by the USE command. |
| object | For QUERY events, this is the executed query. For TABLE events, this is the table name. |
| retcode | The return code of the logged operation. |
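For orientation, a purely hypothetical QUERY event in this format (every value below is invented) might look like:

1479988861,ip-10-0-1-15,auditeduser,10.0.2.25,1096,7789,QUERY,mydatabase,'select count(*) from orders',0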
How does our approach compare to the alternatives?
As mentioned above, many databases offer audit log capability but suffer from poor performance when auditing is enabled. We compared the performance of Aurora against MySQL 5.7 with the MariaDB Audit Plugin for a select-only workload on an 8xlarge instance. As the following results show, MySQL performance degrades significantly when auditing is turned on, whereas Aurora sees only a modest reduction in performance. There’s a 65% reduction in throughput for MySQL 5.7 as opposed to just 15% for Aurora. In effect, Aurora’s performance edge over MySQL 5.7 more than doubles to over an order of magnitude when auditing is turned on.

Advanced auditing is available now, and you can start using it today! You can learn more about this feature in the advanced auditing documentation.
Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python
Amit Sharma (@amitksh44) is a solutions architect at Amazon Web Services.
You can use Amazon S3 to implement a data lake architecture as the single source of truth for all your data. Taking this approach not only allows you to reliably store massive amounts of data but also enables you to ingest the data at a very high speed and do further analytics on it. Ease of analytics is important because as the number of objects you store increases, it becomes difficult to find a particular object—one needle in a haystack of billions.
Objects in S3 contain metadata that identifies those objects along with their properties. When the number of objects is large, this metadata can be the magnet that allows you to find what you’re looking for. Although you can’t search this metadata directly, you can employ Amazon Elasticsearch Service to store and search all of your S3 metadata. This blog post gives step-by-step instructions about how to store the metadata in Amazon Elasticsearch Service (Amazon ES) using Python and AWS Lambda.
Using S3 event notifications and Lambda triggers
In this post, we use S3 event notifications and Lambda triggers to maintain metadata for S3 objects in Amazon ES. S3 notification enables you to receive notifications when certain events happen in your bucket. These events can be for any action in an S3 bucket, such as PUT, COPY, POST, DELETE, and so on. More details about S3 event notifications are available in the AWS documentation.
S3 event notifications integrate with Lambda using triggers. Using this integration, you can write Lambda functions that process Amazon S3 events. To do this, in Amazon S3 you add a bucket notification configuration that identifies the type of event that you want Amazon S3 to publish and the Lambda function that you want to invoke.

High-level flow between S3 and Lambda
Putting it together
To put all these parts together, you can take the following steps.
Configuring AWS Lambda with Amazon S3
To configure Lambda with S3, start by choosing AWS Lambda on the console.

If this is the first time you’ve created a Lambda function, choose Get Started Now.

Choose Configure triggers.

On the next page, you should be able to select the triggers you want to work with.

Choose the S3 bucket and the type of event that you want to capture. You can leave the Prefix and Suffix fields blank or, based on your use case, fill them in.
For example, if you expect all files to come in a folder called /appTier/appServer1, you can use that path as the Prefix value. Similarly, if you expect the files to arrive with a certain suffix like .log, .jpg, .avi, and so on, you can use that in the Suffix field. Events are triggered for an object only if both the Prefix and Suffix fields are matched. Also, select the Enable Trigger check box.

Next, provide a name and description and choose Python 2.7 as the run-time environment. Because we are going to upload the code separately, choose Upload a .ZIP file for Code entry. Leave the handler information as the default: lambda_function.lambda_handler.

Now, let’s create the AWS Identity and Access Management (IAM) roles and related permissions so that our Lambda function can access the AWS resources we need. To do this, choose Create a new role from template(s), give a name to this new role, and for Policy templates, choose S3 object read-only-permission.


In Advanced settings, leave the Memory, Timeout, and VPC settings as the default. Choosing Next creates the Lambda function and also associates the right permissions in S3 so that S3 can invoke this Lambda function. You can verify this in the S3 console. To do this, go to the properties of the S3 bucket you specified earlier and to the Events section, as shown following:

Choose the modify icon to see the details and verify the name of the Lambda function.
Creating the Amazon ES domain
Now, let’s create the Amazon ES domain. Go to Services, and choose Elasticsearch Service in Analytics:

Choose the Get Started button on the front page and type a name for your domain (I chose my-es-cluster):

As shown following, choose an instance type and an instance count (both can be changed later if necessary). We recommend choosing m3.medium or larger if you are planning to put this feature into production. Alternatively, t2.micro is a good choice if you are creating a development environment or a small proof of concept.
For storage, you have choices between instance-based storage and various types of Amazon EBS volumes (General Purpose, Provisioned IOPS and Magnetic). Start with a General Purpose EBS volume and monitor the overall performance with the FreeStorageSpace, JVMMemoryPressure, and CPUUtilization metrics and metrics about query response times before changing the storage type. For a good reference to handling errors and mitigations, see the AWS documentation.
An important question is: How much storage do you need? For example, if every object uploaded to S3 has metadata sized 1 KB and you expect 10 million objects, you should provision a total of at least 20 GB: 10 GB for the primary instance and an additional 10 GB for the replica. For a more detailed discussion on scaling and capacity planning for Elasticsearch, see the Elasticsearch documentation.

Next, set the access policy. I chose to make mine wide open in order to simplify testing, but don’t do this for your cluster. I could also have used one of the IP-based or user-based templates in the wizard to create a more restrictive policy. For more details on controlling access to your cluster, see this blog post.

Finally, review the settings and choose Confirm and create. That’s it! The cluster will be created in a few minutes.

Creating the Lambda function
Now comes the main code that will actually push the metadata coming from every trigger generated by object creation events. Remember that Lambda has been configured with an execution role that has read-only permissions to read from S3. At a high level, the Python code does the following:
- Reads the metadata from S3 event
- Connects to the Amazon ES domain endpoint
- Creates an index if one has not already been created
- Writes the metadata into Amazon ES
To connect to Amazon ES, the Python code uses a few specific libraries such as Elasticsearch, RequestsHttpConnection, and urllib. We are going to upload the code to the Lambda function, so you can download these packages into a specific folder by using the following commands. But first, make sure pip is installed—find steps to do this on the pip website. Note that the sample code available for download includes all the required libraries, so this step is optional and given here mainly for your understanding:
pip install requests -t /path/to/project-dir
pip install Elasticsearch -t /path/to/project-dir
pip install urllib3 -t /path/to/project-dir
Make sure these libraries are now available in the current directory. Now we are ready to look at the code.
The following function connects to Amazon ES:
from elasticsearch import Elasticsearch, RequestsHttpConnection

def connectES(esEndPoint):
    print('Connecting to the ES Endpoint {0}'.format(esEndPoint))
    try:
        esClient = Elasticsearch(
            hosts=[{'host': esEndPoint, 'port': 443}],
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection)
        return esClient
    except Exception as E:
        print("Unable to connect to {0}".format(esEndPoint))
        print(E)
        exit(3)
This function takes the domain endpoint as an argument and returns the Elasticsearch client instance. Be sure to use your domain’s endpoint to declare esClient:
esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
The following function creates an Amazon ES index:
def createIndex(esClient):
    try:
        res = esClient.indices.exists('metadata-store')
        print("Index Exists ... {}".format(res))
        if res is False:
            esClient.indices.create('metadata-store', body=indexDoc)
        return 1
    except Exception as E:
        print("Unable to Create Index {0}".format("metadata-store"))
        print(E)
        exit(4)
Note that this function takes esClient as an instance of the Elasticsearch client returned by the connectES function. Also note that ‘metadata-store’ and ‘indexDoc’ are the name and mapping of the index we are trying to create. The ‘indexDoc’ mapping is defined following:
indexDoc = {
    "dataRecord": {
        "properties": {
            "createdDate": {
                "type": "date",
                "format": "dateOptionalTime"
            },
            "objectKey": {
                "type": "string"
            },
            "content_type": {
                "type": "string"
            },
            "content_length": {
                "type": "long"
            },
            "metadata": {
                "type": "string"
            }
        }
    },
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}
We are storing five fields:
- createdDate
- objectKey
- content_type
- content_length
- metadata
As part of this, there are a couple of important points to consider.
First, it’s important to plan your shards. The best number of primary and replica shards depends upon multiple things such as instance sizes, amount of data, frequency of new data being generated and old data being purged, query types, and so on. To give an example, for time-series data (for example, Logfile) you can maintain different indexes per hour, per day, and per week depending upon the speed of data being generated—we recommend daily indexes in most cases. Because older logs are less likely to be queried, you can re-index those to lower primary shard numbers or else drop the entire index. A more detailed discussion is provided in the Elasticsearch documentation.
Also, consider using bulk indexing. The preceding code sample works fine for a lot of use cases with low to moderate traffic—for example, up to 100 PUTs per second on S3 with 1 KB of metadata. However, for higher traffic volumes we recommend using larger instances and, instead of indexing every document individually, using the _bulk index API call to efficiently load the data into the Elasticsearch cluster. In a follow-up blog post, we will give architectural patterns and recommendations on how to do _bulk indexing efficiently and cost-effectively.
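As a rough illustration of that idea, the following sketch uses the elasticsearch-py bulk helper; the batching function and its arguments are assumptions for this example, not code from the downloadable sample:

from elasticsearch.helpers import bulk

def indexDocElementsBulk(esClient, docs):
    # docs is a list of metadata dictionaries accumulated from many S3 events
    actions = [{
        "_index": "metadata-store",
        "_type": "images",
        "_source": doc
    } for doc in docs]
    # One _bulk request instead of one index call per document
    return bulk(esClient, actions)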
For a detailed explanation about shard settings as part of the cluster planning, refer to the Elasticsearch documentation.
Following is the function that actually writes metadata into Elasticsearch:
def indexDocElement(esClient, key, response):
    try:
        indexObjectKey = key
        indexcreatedDate = response['LastModified']
        indexcontent_length = response['ContentLength']
        indexcontent_type = response['ContentType']
        indexmetadata = json.dumps(response['Metadata'])
        retval = esClient.index(index='metadata-store', doc_type='images', body={
            'createdDate': indexcreatedDate,
            'objectKey': indexObjectKey,
            'content_type': indexcontent_type,
            'content_length': indexcontent_length,
            'metadata': indexmetadata
        })
    except Exception as E:
        print("Doc not indexed")
        print("Error: ", E)
        exit(5)
This function takes esClient, an S3 object key, and the complete response of the S3.get_object function. This response contains the actual metadata. The elements in response are indexed by calling esClient.index. The document ID is autogenerated by Elasticsearch. You can see all the index options in the Elasticsearch documentation.
Finally, following is the main Lambda handler code that calls all these functions at the invocation when it is triggered:
import json
import urllib

import boto3

# S3 client used to fetch the object's metadata; defined at module level
s3 = boto3.client('s3')

def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")
    createIndex(esClient)

    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print(response)
        print("KEY: " + key)
        print("CONTENT TYPE: " + response['ContentType'])
        print("Metadata : " + json.dumps(response['Metadata']))
        print("Custom 1: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom1'])
        print("Custom 2: " + response['ResponseMetadata']['HTTPHeaders']['x-amz-meta-custom2'])
        indexDocElement(esClient, key, response)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e
You can also download the entire handler code from here for the index creation.
To verify that the metadata has been entered into Elasticsearch, you can use Kibana and search using the standard Elasticsearch API calls and queries. For example, you can query by object name as shown following:
bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:YOURFILENAME
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 9.516893,
    "hits" : [ {
      "_index" : "metadata-store",
      "_type" : "images",
      "_id" : "AVgGSFxdQ43eQcLduwj9",
      "_score" : 9.516893,
      "_source" : {
        "content_length" : 61194,
        "objectKey" : "YOURFILENAME",
        "metadata" : "{\"custom1\": \"banana\", \"custom2\": \"shake\"}",
        "content_type" : "application/octet-stream",
        "createdDate" : "2016-10-27T13:15:54+00:00"
      }
    } ]
  }
}
Following is a screenshot of Kibana after indexing a few documents:

Deleting metadata when an S3 object is deleted
To delete the related metadata when you delete an S3 object, follow the same steps as listed preceding—except that at event type selection, choose Object Removed Event as shown following:

The rest of the steps remain the same. Create an additional trigger for object removal for a total of two triggers and two Lambda functions for two different types of events—object PUT, COPY, or POST and object DELETE.
Following is the main handler code:
def lambda_handler(event, context):
    esClient = connectES("search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com")

    # Get the object key from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        clearMetaData(esClient, key)
        return 'Removed metadata for ' + key
    except Exception as e:
        print(e)
        print('Error removing object metadata from Elasticsearch Domain.')
        raise e
The clearMetaData function is defined as following:
def clearMetaData(esClient, key):
    try:
        retval = esClient.search(index='metadata-store', doc_type='images', q='objectKey:' + key, fielddata_fields='_id')
        total = retval['hits']['total']
        count = 0
        while (count < total):
            docId = retval['hits']['hits'][count]['_id']
            print("Deleting: " + docId)
            removeDocElement(esClient, docId)
            count = count + 1
        return 1
    except Exception as E:
        print("Removing metadata failed")
        print("Error: ", E)
        exit(5)
This function searches the domain for the given S3 object name and calls another function, removeDocElement, with the document ID as an argument that is unique in the domain. The removeDocElement is defined as following:
def removeDocElement(esClient, docId):
    try:
        retval = esClient.delete(index='metadata-store', doc_type='images', id=docId)
        print("Deleted: " + docId)
        return 1
    except Exception as E:
        print("DocId delete command failed at Elasticsearch.")
        print("Error: ", E)
        exit(5)
This code deletes all the references to that S3 key by using the unique document ID. You can confirm the deletion from Elasticsearch index by using following command:
bash# curl -XGET https://search-domainname-yourDomainEndpoint.REGION.es.amazonaws.com/metadata-store/images/_search?pretty\&q=objectKey:train.csv
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
Following is the Amazon CloudWatch monitoring snapshot for the Elasticsearch cluster. You can see a number of metrics, such as searchable documents, free storage space, cluster health, and so on. These metrics can help you decide how to scale the cluster from both a compute and a storage perspective. For example, by monitoring FreeStorageSpace or CPUUtilization, you can decide whether to scale the Elasticsearch cluster nodes out or up.

You can also download the entire handler code from here for the index deletion.
Thanks for exploring these technologies with me. For deeper information, take a look at Amazon Elasticsearch Service and AWS Lambda. Let me know in the comments below how this post works for you!
A Large-Scale Cross-Regional Migration of NoSQL DB Clusters
Andrey Zaychikov is a solutions architect at Amazon Web Services.
In this blog post, I will share experiences from a cross-regional migration of large-scale NoSQL database clusters (from Europe to the U.S.). The migration was a joint project implemented by the Ops team of our reference customer, FunCorp, and AWS Solutions Architects.
“Our flagship product, iFunny, is one of the most popular entertaining apps for American youth,” said Roman Skvazh, CTO of FunCorp. “For the last five years, the service has gained a considerable audience. Our mobile apps are utilized by more than 3.5 million active users per day and are operated in all time zones. A back end of the app is implemented on Amazon Web Services and its scope is impressive.”
What Should We Migrate?
The iFunny product uses the following database services running on AWS:
- Apache Cassandra cluster – 25 nodes, 18 TB of data
- Apache Cassandra cluster – 18 nodes, 16 TB of data
- MongoDB cluster – 5 TB of data, distributed across 8 shards; each shard is a replica set with a master and two slaves
- MongoDB cluster – 150 GB of data, distributed across 4 shards; each shard is a replica set with a master and two slaves
- Elasticsearch cluster – search index of 1 TB
- Redis – one master and two slaves, with 15 GB, 10 GB, and 1 GB of data and very high write speeds
“Through the efforts of our DevOps and Back End teams, and together with AWS SA, we were able to move all of our infrastructure from one AWS region to another with no downtime and no major changes in the application.”
Why Should We Migrate?
If you’ve ever asked yourself the question, “Why migrate everything?” consider FunCorp’s reason: In the beginning, FunCorp selected an AWS region based on the Russian market, but the iFunny app quickly gained enormous popularity in the United States. Up to a certain point, it was convenient to have a backend located on a different continent. However, the high networking latency began to affect the user experience. For this reason, a migration became one of the integral parts of FunCorp’s global UX improvement program.
FunCorp had some constraints and conditions:
- Zero downtime
- No changes in the application and database structure
- Minimal use of third-party tools
How Should We Migrate?
At first, we planned to implement the migration project using each database's own geo-distributed replication capabilities (Cassandra, MongoDB, Amazon Elasticsearch Service, and Redis), but we faced several significant problems unique to each database.
MongoDB
The first issue had to do with mongos, the MongoDB routing service for sharded configurations. Replication between the European cluster and the U.S. cluster ran smoothly and was implemented by using the standard features of MongoDB, but after we switched users to the new back end, the performance of mongos in the U.S. decreased considerably. The problem was that a geo-distributed MongoDB cluster supports only one set of configuration servers, which were located in Europe at the time. The following diagram illustrates the initial setup.

Mongos had to call the configuration servers, which were located overseas, for virtually every operation. The good news is that mongos keeps its configuration cached. However, it's not possible to increase the number of chunks or to rebalance data between them by relying on the cache alone, without connecting to the configuration servers. Therefore, the configuration server migration had to be done quickly.
The final migration algorithm looked like this (a minimal command sketch follows the list):
- We created instances in the target region beforehand, with a configuration identical to the current one, and incorporated them into the existing replica sets of the initial clusters. Then we waited for data replication to complete.
- Just before the start of the migration process, we switched off mongos balancing. As a result, the mongos processes in the target region used a cached version of the configuration.
- During the migration, we removed the data-bearing instances in the source region from the replica sets, one by one.
- That left only instances in the target region in each replica set, and new primary servers were elected from among them.
- Finally, we migrated the replica set configuration to the new region.
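As an illustration, the balancer switch and the replica set membership changes in these steps can be driven from the mongo shell. The following is only a sketch; the host names are placeholders, and your mongos and replica set topology will differ:

# On a mongos router: disable chunk balancing before the switch-over
mongo --host mongos.example.internal --eval 'sh.stopBalancer()'

# On each replica set primary: add a prepared node in the target region,
# then remove a source-region node once its data is no longer needed
mongo --host primary.example.internal --eval 'rs.add("node1.use1.example.internal:27017")'
mongo --host primary.example.internal --eval 'rs.remove("node1.euw1.example.internal:27017")'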
Cassandra
To migrate the Cassandra clusters, we created an additional Cassandra data center with a standard Ec2Snitch in the target region for each geo-distributed cluster and connected it to the existing cluster through a software VPN tunnel. After that, we started the nodes and began the replication process. The VPN tunnel was essential because it allowed us to enable interaction between the geo-distributed data centers without changing the snitch from Ec2Snitch to Ec2MultiRegionSnitch. In our case, that change would have meant not only swapping the snitch itself, but also manual cluster scaling, maintaining address lists in security groups, node-to-node communication over public IPs, and many other things. We thought most of the problems were solved.
Unfortunately, just after we started the replication process, the performance of the original Cassandra cluster dropped significantly.

The problem was that Cassandra started replicating data to the new DC simultaneously, from all of the nodes in the source region to all of the nodes in the destination region, in accordance with our replication factor and consistency level. To solve the performance degradation, we stopped the replication processes and instead rebuilt the nodes of the target cluster piece by piece, two to three nodes at a time.
By taking this approach, we replicated both clusters without interrupting the work process and with no significant performance degradation.
After the replication process was complete, we were able to repair all of the nodes in the target cluster. After that, we switched the applications from the source DC to the new DC. Finally, we excluded the source DC from the cluster, and then terminated that DC.
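To illustrate the node-by-node approach, the rebuild of each new node and the subsequent repair can be driven with nodetool. This is only a sketch; the source data center name is a placeholder, and the commands run on the nodes in the target region:

# On a new node in the target region: stream data from the source data center
# (run on only two or three nodes at a time to limit load on the source cluster)
nodetool rebuild -- eu-west-dc

# After replication is complete: repair each node in the target data center
nodetool repair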
Redis
We took a similar approach for the Redis migration. We created new clusters, connected them to the existing one in the source region, and then started replication. The volume of data to be migrated was not an issue, but the data was changing at a very high rate. As a result, the data could not be replicated within the time defined by the cluster replication window, and all of the data in the target cluster was invalidated.
After several weeks of research, we found a very simple solution to this problem with Redis replication. We created an SSH tunnel with compression and with ports forwarded to localhost:
ssh -C -L 6280:localhost:6379 $MASTER_REDIS
Next, we told the slaves to synchronize with localhost instead of the master:
redis-cli slaveof localhost 6280
Done! Now the Redis replication was running successfully. Because of compression, the replication lag did not increase and never approached a critical threshold in the replication window.
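To keep an eye on the replication state while the tunnel is in use, you can query the slave directly; master_link_status should report up, and the master and slave replication offsets should converge:

# Run on the slave to check link status and replication offsets
redis-cli info replication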
Elasticsearch
Our initial plan was to avoid migrating the Elasticsearch search index and to instead re-create it in the destination region. Unfortunately, the procedure for creating the index was too complicated: it required engagement from our development teams, which was outside the constraints and conditions of the project.
Fortunately, we found an Elasticsearch plugin that allowed us to take backups, including incremental ones. You can find a version of the plugin here. (To find the right version, see the README.md.)
The use of the plugin greatly simplified the migration by (a minimal sketch of the underlying API calls follows this list):
- Creating an Elasticsearch cluster in the destination region.
- Creating an Amazon S3 bucket, with versioning and cross-region replication enabled.
- Creating a snapshot of the main data set and writing it to S3. The snapshot was transferred to the new region automatically.
- Restoring data from the snapshot in the target region.
- Repeating the preceding steps for the incremental backups. (In our case, the process of applying the incremental backups took 12 minutes.)
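The snapshot and restore steps use the standard Elasticsearch snapshot API that the plugin exposes. The following is only a sketch; the cluster endpoints, repository name, bucket names, and regions are placeholders, and the exact repository settings depend on the plugin version you install:

# Register the S3 bucket as a snapshot repository on the source cluster
curl -XPUT 'http://source-es:9200/_snapshot/s3_repo' -d '{
  "type": "s3",
  "settings": { "bucket": "my-es-snapshots", "region": "eu-west-1" }
}'

# Take a snapshot (subsequent snapshots into the same repository are incremental)
curl -XPUT 'http://source-es:9200/_snapshot/s3_repo/snapshot_1?wait_for_completion=true'

# On the target cluster, register the replicated bucket and restore
curl -XPUT 'http://target-es:9200/_snapshot/s3_repo' -d '{
  "type": "s3",
  "settings": { "bucket": "my-es-snapshots-replica", "region": "us-east-1" }
}'
curl -XPOST 'http://target-es:9200/_snapshot/s3_repo/snapshot_1/_restore'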
The Results of the Project
The project was successfully implemented: the back-end migration reduced latency by at least 40% for end users accessing the iFunny mobile back-end API.
However, it took us more than two months of experimenting and another month and a half for the actual migration.
Conclusion
Any database migration is a complex process. Ambiguous situations, such as the non-standard behavior of a database engine or a process that puts a significant load on the network, can always occur.
The best and safest way to migrate, free of data loss or similar surprises, is to test with real data and real configurations, and only afterward to proceed with transferring the production database to its new location. Even if your application is not cloud-native, using cloud services allows you to experiment and to reproduce your existing use case in a geographically distributed environment, at the real scale of your clusters and data.
Cross-Engine Database Replication Using AWS Schema Conversion Tool and AWS Database Migration Service
Prahlad Rao is a solutions architect at Amazon Web Services.
Customers often replicate databases hosted on-premises to the AWS Cloud as a means to migrate database workloads to AWS, or to enable continuous replication for database backup and disaster recovery. You can both migrate and perform continuous replication on the AWS Cloud with AWS Database Migration Service (AWS DMS).
Although moving from on-premises to the cloud using the same database engine is a common pattern, customers increasingly use different engines for source and target when replicating. This pattern is especially common when the source is a commercial database engine and the target is open source, because such an approach can save on licensing costs. In other use cases, the target database might serve as a read-only copy for a reporting application that is compatible with a certain database engine, or the application team might simply be comfortable using an open source database for their needs. Or the customer might be migrating from a commercial database to an open source database on the AWS Cloud for cost and flexibility reasons. This blog post walks you through the process of replicating a Microsoft SQL Server database to PostgreSQL using the AWS Schema Conversion Tool (AWS SCT) and AWS DMS.
AWS DMS helps you migrate databases to AWS with relative ease and security. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. The service supports homogenous migrations (like to like databases) and also heterogeneous migrations (between database platforms such as Oracle to Amazon Aurora or SQL Server to MySQL). AWS DMS can also be used for continuous data replication with high availability.
AWS SCT makes heterogeneous database migrations easier by automatically converting source database schema. AWS SCT also converts the majority of custom code, including views and functions, to a format compatible with the target database.
Heterogeneous database replication is a two-step process:
- Use AWS SCT to convert the source database schema (from SQL Server) to a format compatible with the target database, in this case PostgreSQL.
- Replicate data between source and target using AWS DMS.

Step 1: Convert schema from source to target database
AWS Schema Conversion Tool is a stand-alone application that provides a project-based user interface and is available for Microsoft Windows, Mac OS X, and Linux. You can find detailed installation instructions in the AWS documentation, so I won't go into all the details, but the high-level steps are outlined following:
- Download and install the AWS SCT tool for your operating system.
- Install the required JDBC drivers for source and target database engines on the system that has AWS SCT installed.
- Once the required JDBC drivers have been downloaded, provide the location of the driver files to SCT as shown following, under Settings, Global Settings, Drivers.

- Create an SCT project and connect to the source and target databases. SCT needs access to both the source and target database engines. You can use an existing database user that already has the required permissions, or create a new user for SCT to access the schema. The user should have the permissions needed by SCT to read the schema, along with read-only access to the views and system tables.
- Choose New Project from the File menu. Under source and target database engine, select Microsoft SQL Server and PostgreSQL respectively.

- Connect to the source and target database engines. Our source database, SQL Server 2014, is on an Amazon EC2 instance, and the target database is Amazon RDS for PostgreSQL. Although we've used an EC2 instance as the source for the Microsoft SQL Server database in this blog post to demonstrate functionality, you can use the same procedure to replicate a source database hosted in your own data center. Make sure the target database is created ahead of time. Choose Connect to Microsoft SQL Server, fill in the required details, and choose Test Connection.

- For the target database, choose Connect to PostgreSQL, fill in the required details, and choose Test Connection.

Now that the source and target databases are accessible from AWS SCT, we can begin converting the schema into a format compatible with the target database. Note that the schema on the source database isn't modified; rather, a new schema with a compatible structure is created on the target database. Meanwhile, users can continue to access the source database without any interruption.

Select the source database on the left and right-click it to refresh, ensuring we are connected to the latest schema set. In our case, it's AdventureWorks2014. You will also notice the schemas listed in the database. With SCT, you can replicate specific schemas or an entire database, giving you the flexibility to migrate or replicate part of the dataset from source to target. This flexibility is useful if you want to fan out from one source database into multiple target databases. We will replicate the HumanResources schema and the dataset associated with that schema here.
Select the HumanResources schema, right-click, and choose Convert Schema.

The schema will now be converted. Notice the new schema on the target PostgreSQL database on the right, following.

Expand the source and target database schemas, and notice that SCT has used the appropriate format on the target database. Although the source SQL Server schema lists Tables, Views, and Procedures, the target PostgreSQL schema lists Tables, Views, and Functions.

For schema objects that SCT could not convert automatically, you can create the appropriate objects manually on the target. Letting SCT convert the rest avoids the need to manually create the entire schema structure on the target database.
The database migration assessment report provides important information about schema conversion. It summarizes all conversion tasks and details the action items for the schema that can’t be converted to the target DB engine. The report can also help you analyze requirements for migrating to the cloud and for changing license type. You can find more details on creating this assessment report in the AWS documentation, and also in this nice blog post.
At this point, the converted schema exists only in the SCT view of the target database and has not been applied yet. Let's go ahead and apply it to the target database. To do this, select the schema on the target database, right-click, and choose Apply to database.

Let’s confirm that the schema has been applied on the target by logging into the target RDS PostgreSQL database. I’m using a psql client on my Mac to access the target database:
psql -h <database host name RDS endpoint> -p 5432 -U <username> <database name>

Notice the adventureworks2014_humanresources schema created by SCT. Let's also verify the tables that are part of the schema and look at the structure of one table:
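For example, the following psql meta-commands list the tables in the new schema and describe one of them (the table name shown is just an example):

\dt adventureworks2014_humanresources.*
\d adventureworks2014_humanresources.department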

Now that we’ve migrated schema from source to target, it’s time to use DMS to actually replicate data from the source to the target database.
Step 2: Replicate data from source to target database
To replicate databases using AWS Database Migration Service, we need to provision a replication instance that performs the replication tasks and moves data from the source to the target database. The required size of the instance varies depending on the amount of data to replicate or migrate. You can find additional information about choosing a replication instance in the AWS documentation. We'll use a dms.t2.medium instance for our replication.
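If you prefer to provision the replication instance from the AWS CLI instead of the console, a call along the following lines could work; the identifier, storage size, subnet group, and security group shown here are placeholders for your environment:

aws dms create-replication-instance \
  --replication-instance-identifier sql-to-postgres-replication \
  --replication-instance-class dms.t2.medium \
  --allocated-storage 50 \
  --replication-subnet-group-identifier my-dms-subnet-group \
  --vpc-security-group-ids sg-0123456789abcdef0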
On the DMS console, choose Replication Instances and Create replication instance as follows. The replication instance should be able to connect to both source and target databases.

When the replication instance has been created, create source and target endpoints to connect to the source and target databases, respectively. The source is SQL Server on an EC2 instance:

Similarly, set target endpoint details to our RDS PostgreSQL database:

Make sure to test the connection for both endpoints. The replication instance must be able to connect successfully to both the source and target database endpoints for the replication to succeed.
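The endpoints can also be created and tested from the AWS CLI. The following is only a sketch; the server names, credentials, database names, and ARNs are placeholders:

# Source endpoint: SQL Server on EC2
aws dms create-endpoint \
  --endpoint-identifier sqlserver-source \
  --endpoint-type source \
  --engine-name sqlserver \
  --server-name <sql-server-host> \
  --port 1433 \
  --username <username> \
  --password <password> \
  --database-name AdventureWorks2014

# Target endpoint: Amazon RDS for PostgreSQL
aws dms create-endpoint \
  --endpoint-identifier postgres-target \
  --endpoint-type target \
  --engine-name postgres \
  --server-name <rds-endpoint> \
  --port 5432 \
  --username <username> \
  --password <password> \
  --database-name <database-name>

# Verify that the replication instance can reach each endpoint
aws dms test-connection \
  --replication-instance-arn <replication-instance-arn> \
  --endpoint-arn <endpoint-arn>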
Finally, we can create a replication task to replicate data from the source to target database, and also to replicate changes as they happen on the source database.
With AWS services, you can provision not only through the console but also with the AWS Command Line Interface (AWS CLI), which you can use to automate and script the migration process. Let me show you how easy it is to create the task with the CLI.
Because we've already migrated the schema to the target database, we need to make sure to specify that same schema, the one created on the PostgreSQL database. Before we run the create-replication-task AWS CLI command, we need to specify the task settings and table mappings files using JSON, as shown following. The task_settings.json file specifies the target schema to use when replicating data, along with other parameters that determine how large objects are handled. Because we're replicating data here, not just migrating it, we need to ensure ApplyChangesEnabled is set to true, which keeps the target tables up-to-date by applying changes using change data capture (CDC). We also set TargetTablePrepMode to DO_NOTHING, which means the data and metadata of the existing target tables are not affected. The task_settings.json file for our replication task follows:
{
  "TargetMetadata": {
    "TargetSchema": "adventureworks2014_humanresources",
    "SupportLobs": true,
    "FullLobMode": false,
    "LobChunkSize": 64,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32
  },
  "FullLoadSettings": {
    "FullLoadEnabled": false,
    "ApplyChangesEnabled": true,
    "TargetTablePrepMode": "DO_NOTHING",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedChangesApplied": false,
    "StopTaskCachedChangesNotApplied": false,
    "ResumeEnabled": false,
    "ResumeMinTableSize": 100000,
    "ResumeOnlyClusteredPKTables": true,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 10000
  },
  "Logging": {
    "EnableLogging": false
  },
  "ControlTablesSettings": {
    "ControlSchema": "",
    "HistoryTimeslotInMinutes": 5,
    "HistoryTableEnabled": false,
    "SuspendedTablesTableEnabled": false,
    "StatusTableEnabled": false
  },
  "StreamBufferSettings": {
    "StreamBufferCount": 3,
    "StreamBufferSizeInMB": 8
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  }
}
You can find additional details on the task_settings.json file in the AWS documentation.
The table_mappings.json file specifies tables from a particular source schema that you want to replicate to the target endpoint. Table mappings offer a lot of flexibility to migrate specific tables and datasets to a target. Migration can also be rule-based on datasets that you want to filter when replicating data. These approaches provide powerful capabilities to granularly replicate datasets across database engines. For additional details, see the AWS documentation.
In our case, because we're only replicating HumanResources data, we specify that in our table_mappings.json file. The table-name value "%" tells DMS to replicate all tables in that particular schema.
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "HumanResources",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}
Now that we have the required settings, we can use create-replication-task to replicate data. Because we're replicating changes as they happen on the source database, full-load-and-cdc should be selected as the migration type in our command. Doing this ensures that both the initial load and ongoing changes are replicated.
aws dms create-replication-task \
  --replication-task-identifier sql-replicate \
  --source-endpoint-arn arn:aws:dms:us-west-2:904672585901:endpoint:JT6WECWJJ4YSO7AKSISBJEJD6A \
  --target-endpoint-arn arn:aws:dms:us-west-2:904672585901:endpoint:CA2UW3DRV5KI3FG45JW45ZEZBE \
  --replication-instance-arn arn:aws:dms:us-west-2:904672585901:rep:LNQ6KQIT52DIAWLFQU5TCNXY4E \
  --migration-type full-load-and-cdc \
  --replication-task-settings 'file://task_settings.json' \
  --table-mappings 'file://table_mappings.json'
Let’s verify task creation on the AWS Management Console:

The replication task is ready; let’s start it.
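If you are following the CLI route, starting the task looks like this (the task ARN is a placeholder):

aws dms start-replication-task \
  --replication-task-arn <replication-task-arn> \
  --start-replication-task-type start-replication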
Once the task runs, the initial load is completed first, followed by any changes as they happen on the source database. On the task tab, choose Table statistics to verify the tables and rows that were replicated to the target database, with additional details; in our case, the HumanResources data.

Let’s verify data on the target PostgreSQL database:


Now that the source and target databases are being replicated, we’ll go ahead and insert a record into our source SQL Server database. Let’s insert Operations as Department and Group Name. Then we’ll verify changes are propagated to the target PostgreSQL database:
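For reference, an insert like this can be issued with sqlcmd; the host and credentials are placeholders, and the column values match the example described above:

sqlcmd -S <sql-server-host> -U <username> -P <password> -d AdventureWorks2014 \
  -Q "INSERT INTO HumanResources.Department (Name, GroupName) VALUES ('Operations', 'Operations');"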

On the DMS console, under Tasks, Table Statistics, we immediately notice an entry in the Inserts column for the Department table, as follows:

Let's verify that the data is in fact on the target database. The new row has been replicated to the target database; notice Operations in the last row.

Now let's modify the Department table on the source SQL Server database to add a new column, verifying that data definition language (DDL) updates are also replicated. We'll add a new column, GroupNumber:
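The DDL change on the source is a simple ALTER TABLE; for example (host and credentials are placeholders):

sqlcmd -S <sql-server-host> -U <username> -P <password> -d AdventureWorks2014 \
  -Q "ALTER TABLE HumanResources.Department ADD GroupNumber INT NULL;"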

Let's confirm the DDL update on the DMS console. Notice that the DDL column has incremented to 1:

Finally, let’s confirm on the target PostgreSQL database:

The target database has a new column, GroupNumber, added to the Department table.
Summary
By using the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS), you can not only migrate databases across heterogeneous engines more easily but also perform continuous data replication. You can work with different database engines across source and target databases, giving you flexibility and lowering your cost to run your favorite database engines on the AWS Cloud, without complex migration tools or manual schema conversion methods.
Amazon DynamoDB at AWS re:Invent 2016—Wrap-Up
Vrutik Ghai is a product manager at Amazon Web Services.
We wrapped up an exciting AWS re:Invent. It was great to interact with current and future Amazon DynamoDB customers and hear their feedback and suggestions.
Multiple re:Invent breakout sessions highlighted DynamoDB. These sessions consisted of deep dives, best practices, and customer talks with real-life examples from industries like gaming, adtech, IoT, and others.
In case you missed a session, following are links to the session recordings, along with session abstracts to give you an idea of what each session is about. We hope you find these videos useful as you leverage the performance and flexibility of DynamoDB for your applications.
- AWS Database State of the Union (DAT320)
Speaker:
Raju Gulabani, VP Database Services, AWS
Session Abstract:
In this session, Raju Gulabani, vice president of AWS Database Services (AWS), discusses the evolution of database services on AWS and the new database services and features we launched this year, and shares our vision for continued innovation in this space. We are witnessing unprecedented growth in the amount of data collected, in many different shapes and forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of such database and other data services like Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR to process, store, manage, and analyze data. In this session, we provide an overview of AWS database services and discuss how our customers are using these services today.
- Introduction to Managed Database Services on AWS (DAT307)
Speakers:
Steve Hunt, Director of Infrastructure, FanDuel
Alan Murray, Director of Architecture, FanDuel
Robin Spira, CTO, FanDuel
Darin Briskman, AWS Database Services
Session Abstract:
In this session, we look at questions such as: Which database is best suited for your use case? Should you choose a relational database or NoSQL or a data warehouse for your workload? Would a managed service like Amazon RDS, Amazon DynamoDB, or Amazon Redshift work better for you, or would it be better to run your own database on Amazon EC2? FanDuel has been running its fantasy sports service on Amazon Web Services (AWS) since 2012. We learn best practices and insights from FanDuel’s successful migrations from self-managed databases on EC2 to fully managed database services.
- Deep Dive on Amazon DynamoDB (DAT304)
Speaker:
Rick Houlihan, Principal TPM, DBS NoSQL
Session Abstract:
In this session, we explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.
- Migrating from RDBMS to NoSQL: How PlayStation Network Moved from MySQL to Amazon DynamoDB (DAT318)
Speakers:
Nate Slater, Senior Manager, AWS Solutions Architecture
Benedikt Neuenfeldt, Architect, SIE Inc.
Aki Kusumoto, VP of NPS Development Department, SIE Inc.
Session Abstract:
In this session, we talk about the key differences between a relational database management system (RDBMS) and nonrelational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You’ll learn strategies for migrating from an RDBMS to DynamoDB through a five-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
- Migrating a Highly Available and Scalable Database from Oracle to Amazon DynamoDB (ARC404)
Speaker:
Shreekant Mandke, Software Development Manager, Amazon Marketplace
Session Abstract:
In this session, we share how an Amazon.com team that owns a document management platform managing billions of critical customer documents for Amazon.com migrated from a relational to a nonrelational database. Initially, the service was built as an Oracle database. As it grew, the team discovered the limits of the relational model and decided to migrate to a nonrelational database. They chose Amazon DynamoDB for its built-in resilience, scalability, and predictability. We provide a template that customers can use to migrate from a relational data store to DynamoDB. We also provide details about the entire process: design patterns for moving from a SQL schema to a NoSQL schema; mechanisms used to transition from an ACID (Atomicity, Consistency, Isolation, Durability) model to an eventually consistent model; migration alternatives considered; pitfalls in common migration strategies; and how to ensure service availability and consistency during migration.
- How Toyota Racing Development Makes Racing Decisions in Real Time with AWS (DAT311)
Speakers:
Jason Chambers, Toyota Racing Development
Philip Loh, Toyota Racing Development
Martin Sirull, AWS
Session Abstract:
In this session, we learn how Toyota Racing Development (TRD) developed a robust and highly performant real-time data analysis tool for professional racing. Learn how TRD structured a reliable, maintainable, decoupled architecture built around Amazon DynamoDB as both a streaming mechanism and a long-term persistent data store. In racing, milliseconds matter and even moments of downtime can cost a race. We see how TRD used DynamoDB together with Amazon Kinesis and Amazon Kinesis Firehose to build a real-time streaming data analysis tool for competitive racing.
- Streaming ETL for RDS and DynamoDB (DAT315)
Speakers:
Greg Brandt, Liyin Tang, Airbnb
Session Abstract:
In this session, Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, discuss the design and architecture of Airbnb’s streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb’s data warehouse, using a system called SpinalTap. We also discuss how we leverage Apache Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
- How DataXu Scaled Its Attribution System to Handle Billions of Events per Day with Amazon DynamoDB (DAT312)
Speakers:
Padma Malligarjunan, AWS
Yekesa Kosuru, DataXu
Rohit Dialani, DataXu
Session Abstract:
“Attribution” is the marketing term of art for allocating full or partial credit to individual advertisements that eventually lead to a purchase, sign up, download, or other desired consumer interaction. In this session, DataXu shares how they used DynamoDB at the core of their attribution system to store terabytes of advertising history data. The system is cost-effective and dynamically scales from 0 to 300K requests per second on demand with predictable performance and low operational overhead.
- Cross-Region Replication with Amazon DynamoDB Streams (DAT201)
Speakers:
Carl Youngblood, Lead Engineer, Under Armour
Prahlad Rao, Solutions Architect, AWS
Session Abstract:
In this session, Carl Youngblood, Lead Engineer of Under Armour, shares the keys to success as Under Armour implemented cross-region replication with Amazon DynamoDB Streams. The session also includes a quick recap of DynamoDB and its features.
- Building Real-Time Campaign Analytics Using AWS Services (DAT310)
Speakers:
Radhika Ravirala, Solutions Architect, AWS
Nabil Zaman, Software Engineer, Quantcast
Session Abstract:
In this session, we talk about how Quantcast used AWS services including DynamoDB to implement real-time campaign analytics. Quantcast provides its advertising clients the ability to run targeted ad campaigns reaching millions of online users. The real-time bidding for campaigns runs on thousands of machines across the world. When Quantcast wanted to collect and analyze campaign metrics in real-time, they turned to AWS to rapidly build a scalable, resilient, and extensible framework. Quantcast used Amazon Kinesis streams to stage data, Amazon EC2 instances to shuffle and aggregate the data, and Amazon DynamoDB and Amazon ElastiCache for building scalable time-series databases. With Elastic Load Balancing and Auto Scaling groups, they are able to set up distributed microservices with minimal operation overhead. This session discusses their use case, how they architected the application with AWS technologies integrated with their existing home-grown stack, and the lessons they learned.
- How Fulfillment by Amazon (FBA) and Scopely Improved Results and Reduced Costs with a Serverless Architecture (DAT309)
Speakers:
Vlad Vlasceanu, Ganesh Subramaniam & Brandon Cuff, AWS
Session Abstract:
In this session, we share an overview of leveraging serverless architectures to support high performance data intensive applications. Fulfillment by Amazon (FBA) built the Seller Inventory Authority Platform (IAP) using Amazon DynamoDB Streams, AWS Lambda functions, Amazon Elasticsearch Service, and Amazon Redshift to improve results and reduce costs. Scopely shares how they used a flexible logging system built on Amazon Kinesis, Lambda, and Amazon Elasticsearch Service to provide high-fidelity reporting on hotkeys in Memcached and DynamoDB, and drastically reduce the incidence of hotkeys. Both of these customers are using managed services and serverless architecture to build scalable systems that can meet the projected business growth without a corresponding increase in operational costs.
- 6 Million New Registrations in 30 Days: How the Chick-fil-A One App Scaled with AWS (DAT313)
Speakers:
Chris Taylor, Director, Customer Experience Architecture, Chick-fil-A
Andrew Baird, Solutions Architect, AWS
Session Abstract:
In this session, Chris Taylor from Chick-fil-A shares how they managed to scale using AWS services. Chris leads the team providing back-end services for the massively popular Chick-fil-A One mobile app that launched in June 2016. Chick-fil-A follows AWS best practices for web services and leverages numerous AWS services, including AWS Elastic Beanstalk, Amazon DynamoDB, AWS Lambda, and Amazon S3. This was the largest technology-dependent promotion in Chick-fil-A history. To ensure their architecture would perform at unknown and massive scale, Chris worked with AWS Support through an AWS Infrastructure Event Management (IEM) engagement and leaned on automated operations to enable load testing before launch.
- How Telltale Games Migrated Its Story Analytics from Apache CouchDB to Amazon DynamoDB (DAT316)
Speakers:
Zac Litton, VP of Engineering, Telltale Games
Greg McConnel, Solutions Architect, AWS
Session Abstract:
In this session, you’ll learn about Telltale Games’ migration from Apache CouchDB to Amazon DynamoDB, the challenges of adjusting capacity to handle spikes in database activity, and how Telltale Games streamlined its analytics storage to provide new perspectives of player interaction to improve its games. Every choice made in Telltale Games titles influences how your character develops and how the world responds to you. With millions of users making thousands of choices in a single episode, Telltale Games tracks this data and leverages it to build more relevant stories in real time as the season is developed.
- Capturing Windows of Opportunity: Real-Time Analytics for Less Than $1000? (DAT208)
Speakers:
Craig Stires, Head of Big Data & Analytics APAC, AWS
Session Abstract:
In this session, we look at how some AWS customers are using real-time analytics to capture windows of opportunity: a telco with a major promotion, an advertising retargeter with global demands, and a personal IoT provider with a lifestyle solution. We dig deeper into their architecture and look for common patterns that can be used to build a real-time analytics platform in a cost-optimized way. We even see how a light-load, real-time analytics system can be built for less than $1000.