AWS Big Data Blog
Implement Serverless Log Analytics Using Amazon Kinesis Analytics
Applications log a large amount of data that—when analyzed in real time—provides significant insight into your applications. Real-time log analysis can be used to ensure security compliance, troubleshoot operational events, identify application usage patterns, and much more.
Ingesting and analyzing this data in real time can be accomplished by using a variety of open source tools on Amazon EC2. Alternatively, you can use a set of simple, managed AWS services to perform serverless log analytics. The Amazon Kinesis platform includes the following managed services:
- Amazon Kinesis Streams lets you collect, store, and process terabytes of streaming data per hour on AWS at low cost.
- Amazon Kinesis Firehose loads streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service.
- Amazon Kinesis Analytics lets you analyze streaming data in near real time by writing SQL queries, so you don't have to build and monitor your own stream-processing infrastructure. Analytics also allows your SQL queries to reference metadata stored in S3 for real-time enrichment.
In this post, I show you how to implement a solution that analyzes streaming Apache access log data from an EC2 instance aggregated over 5 minutes. The solution helps you understand where requests to your applications are coming from. If the source is an unknown application or if a particular source application is trying to clog your application, you can contact the application owner.
This solution addresses the following challenges:
- You do not want to maintain (patch or upgrade) log-analysis software or servers, and you want your log analytics to scale on demand by default, so all components are managed services.
- Apache logs record the requester's IP address or host name. However, that information isn't useful on its own in the cloud, where servers are fungible and hosts change constantly as they scale or heal automatically. So you maintain a flat file list of servers in an S3 bucket that can be updated by Auto Scaling policies and mapped to the streaming log data.
Architecture
The following diagram shows how this solution works.

- Application nodes run Apache applications and write Apache logs locally to disk. The Amazon Kinesis agent on the EC2 instance ingests the log stream into the Amazon Kinesis stream.
- The log input stream from various application nodes is ingested into the Amazon Kinesis stream.
- Metadata about the machine or application is stored in flat files in an S3 bucket. It maps host IP addresses to the application name and contact.
- The Analytics application processes streaming logs over tumbling windows by adding referenced machine metadata from S3.
- The output stream, which is the result of the aggregated responses from the Analytics application, is written into the Amazon Kinesis stream.
- The Lambda function consumes the aggregated results from the destination stream, processes them, and publishes them to Amazon CloudWatch (a minimal sketch of this function follows the list). It is event driven: as soon as new records are pushed to the destination stream, they are processed in batches of 200 records.
- The CloudWatch dashboard is used to view response trends.
- Alarms on aggregated data are generated when specified thresholds are reached.
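The following is a minimal sketch of the Lambda consumer described above, assuming the aggregated records arrive as JSON documents; the field names, metric name, and CloudWatch namespace are illustrative assumptions rather than values from the post.

```python
# Sketch of a Lambda function that reads aggregated Kinesis records and publishes
# CloudWatch metrics. Field names ("application", "requestCount"), the metric name,
# and the namespace are hypothetical.
import base64
import json

import boto3

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    metric_data = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        metric_data.append({
            "MetricName": "RequestCount",                       # hypothetical metric name
            "Dimensions": [{"Name": "Application",
                            "Value": payload["application"]}],  # hypothetical field
            "Value": float(payload["requestCount"]),            # hypothetical field
            "Unit": "Count",
        })

    # Chunk the metric data so each PutMetricData call stays well within per-call limits.
    for i in range(0, len(metric_data), 20):
        cloudwatch.put_metric_data(Namespace="ServerlessLogAnalytics",
                                   MetricData=metric_data[i:i + 20])

    return {"recordsProcessed": len(metric_data)}
```

With this shape, CloudWatch alarms can then be defined on the published metric to cover the alerting step described above.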
Month in Review: January 2017
Another month of big data solutions on the Big Data Blog!
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
NEW POSTS
Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom’s data pipeline and Amazon ML.
Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.
Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.
Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.
Run Mixed Workloads with Amazon Redshift Workload Management
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You’ll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.
Secure Amazon EMR with Encryption
In the last few years, there has been a rapid rise in enterprises adopting the Apache Hadoop ecosystem for critical workloads that process sensitive or highly confidential data. Because these workloads are so critical, enterprises enforce organization- or industry-wide policies as well as regulatory and compliance policies. Such policy requirements are designed to protect sensitive data from unauthorized access.
A common requirement within such policies is encrypting data at rest and in flight. Amazon EMR uses "security configurations" to make it easy to specify the encryption keys and certificates, ranging from AWS Key Management Service to supplying your own custom encryption materials provider.
You create a security configuration that specifies encryption settings and then use the configuration when you create a cluster. This makes it easy to build the security configuration one time and use it for any number of clusters.
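As a rough illustration of that flow with boto3, the sketch below creates a security configuration and then references it by name when launching a cluster; the KMS key ARN, certificate location, and cluster settings are placeholders, not values from this post.

```python
# Sketch: create an EMR security configuration once, then reuse it for clusters.
# The key ARN, S3 paths, and cluster settings are placeholders.
import json

import boto3

emr = boto3.client("emr")

security_config = {
    "EncryptionConfiguration": {
        "EnableAtRestEncryption": True,
        "EnableInTransitEncryption": True,
        "AtRestEncryptionConfiguration": {
            "S3EncryptionConfiguration": {
                "EncryptionMode": "SSE-KMS",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder
            },
            "LocalDiskEncryptionConfiguration": {
                "EncryptionKeyProviderType": "AwsKms",
                "AwsKmsKey": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE",  # placeholder
            },
        },
        "InTransitEncryptionConfiguration": {
            "TLSCertificateConfiguration": {
                "CertificateProviderType": "PEM",
                "S3Object": "s3://my-bucket/certs/my-certs.zip",  # placeholder
            }
        },
    }
}

# Create the security configuration one time...
emr.create_security_configuration(
    Name="emr-encryption-config",
    SecurityConfiguration=json.dumps(security_config),
)

# ...then reference it by name whenever you launch a cluster
# (cluster settings below are illustrative and trimmed for brevity).
emr.run_job_flow(
    Name="encrypted-cluster",
    ReleaseLabel="emr-5.3.0",
    Instances={"InstanceCount": 3, "MasterInstanceType": "m4.large",
               "SlaveInstanceType": "m4.large"},
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    SecurityConfiguration="emr-encryption-config",
)
```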

In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.
Run Mixed Workloads with Amazon Redshift Workload Management
Mixed workloads run batch and interactive workloads (short-running and long-running queries or reports) concurrently to support business needs or demand. Typically, managing and configuring mixed workloads requires a thorough understanding of access patterns, how system resources are being used, and performance requirements.
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time.
Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs.
This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
Workload concepts
You can use WLM to define the separation of business concerns and to prioritize the different types of concurrently running queries in the system:
- Interactive: Software that accepts input from humans as it runs. Interactive software includes most popular programs, such as BI tools or reporting applications.
- Short-running, read-only user queries such as Tableau dashboard query with low latency requirements.
- Long-running, read-only user queries such as a complex structured report that aggregates the last 10 years of sales data.
- Batch: Execution of a series of jobs in a program without manual intervention (non-interactive), run against a set, or “batch”, of inputs rather than a single input.
- Batch queries include bulk INSERT, UPDATE, and DELETE transactions, for example, ETL or ELT programs.
Amazon Redshift Workload Management
Amazon Redshift is a fully managed, petabyte-scale, columnar, massively parallel data warehouse that offers scalability, security, and high performance. Amazon Redshift provides an industry-standard JDBC/ODBC driver interface, which allows customers to connect their existing business intelligence tools and reuse existing analytics queries.
Amazon Redshift is a good fit for any type of analytical data model, for example, star and snowflake schemas, or simple de-normalized tables.
Managing workloads
Amazon Redshift Workload Management allows you to manage workloads of various sizes and complexity for specific environments. Parameter groups contain WLM configuration, which determines how many query queues are available for processing and how queries are routed to those queues. The default parameter group settings are not configurable. Create a custom parameter group to modify the settings in that group, and then associate it with your cluster. The following settings can be configured:
- How many queries can run concurrently in each queue
- How much memory is allocated among the queues
- How queries are routed to queues, based on criteria such as the user who is running the query or a query label
- Query timeout settings for a queue
When the user runs a query, WLM assigns the query to the first matching queue and executes rules based on the WLM configuration. For more information about WLM query queues, concurrency, user groups, query groups, timeout configuration, and queue hopping capability, see Defining Query Queues. For more information about the configuration properties that can be changed dynamically, see WLM Dynamic and Static Configuration Properties.
For example, the WLM configuration in the following screenshot has three queues to support ETL, BI, and other users. ETL jobs are assigned to the long-running queue and BI queries to the short-running queue. Other user queries are executed in the default queue.
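To make a queue setup along those lines concrete, the following boto3 sketch applies a three-queue WLM configuration to a custom parameter group; the group name, user groups, and concurrency and memory numbers are assumptions for illustration, not values from the screenshot.

```python
# Sketch: apply a three-queue WLM configuration (ETL, BI, default) to a custom
# parameter group. Names and numbers are placeholders.
import json

import boto3

redshift = boto3.client("redshift")

wlm_config = [
    {   # long-running queue for ETL jobs
        "user_group": ["etl_users"],          # hypothetical user group
        "query_concurrency": 2,
        "memory_percent_to_use": 40,
    },
    {   # short-running queue for BI queries
        "user_group": ["bi_users"],           # hypothetical user group
        "query_concurrency": 10,
        "memory_percent_to_use": 40,
        "max_execution_time": 60000,          # cancel queries running longer than 60 seconds
    },
    {   # default queue for all other user queries
        "query_concurrency": 5,
    },
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName="custom-wlm-params",   # a custom (non-default) parameter group
    Parameters=[{
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(wlm_config),
    }],
)
```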

Converging Data Silos to Amazon Redshift Using AWS DMS
Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.
This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.
AWS service overview
Here’s a brief overview of AWS services that help with data convergence.
AWS DMS
With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:
- Amazon Aurora
- PostgreSQL
- MySQL
- MariaDB
- Oracle
- SAP ASE
- SQL Server
DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.
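As a rough sketch of what this looks like with boto3, the following creates a source endpoint, an Amazon Redshift target endpoint, and a full-load-plus-CDC replication task; all identifiers, hostnames, and credentials are placeholders, and a replication instance is assumed to exist already.

```python
# Sketch: wire a MySQL source and a Redshift target into a DMS replication task.
# Identifiers, hostnames, and credentials are placeholders.
import json

import boto3

dms = boto3.client("dms")

source = dms.create_endpoint(
    EndpointIdentifier="mysql-source",        # placeholder
    EndpointType="source",
    EngineName="mysql",
    ServerName="mysql.example.com",           # placeholder
    Port=3306,
    Username="dms_user",                      # placeholder
    Password="...",                           # placeholder
    DatabaseName="sportsdb",                  # placeholder
)

target = dms.create_endpoint(
    EndpointIdentifier="redshift-target",     # placeholder
    EndpointType="target",
    EngineName="redshift",
    ServerName="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",  # placeholder
    Port=5439,
    Username="rs_user",                       # placeholder
    Password="...",                           # placeholder
    DatabaseName="converged",                 # placeholder
)

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-redshift",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:EXAMPLE",  # placeholder
    MigrationType="full-load-and-cdc",        # initial load plus ongoing replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```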
Amazon QuickSight
Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.
QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.
Amazon Redshift
Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.
Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.
Walkthrough
Assume that you have an events company specializing in sports and that you have built a MySQL database that holds data for the players and the sporting events. Customer and ticket information is stored in another database, in this case PostgreSQL, which is updated when customers purchase tickets from your website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.
These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or another cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn't complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.
Architecture
In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.

Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning

Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. For more details about the workshop focus, see Challenges and opportunities in machine learning below.
DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios.
We’re soliciting research papers that describe preliminary and ongoing research results. We’re also looking for reports from industry describing end-to-end ML deployments. Submissions can either be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format.
Register and submit: https://cmt3.research.microsoft.com/DEEM2017/ (account needed)
Submission Deadline: February 1, 2017
Notification of Acceptance: March 1, 2017
Final papers due: March 20, 2017
Workshop: May 14th, 2017
Follow us on Twitter: @deem_workshop.
Challenges and opportunities in machine learning
Applying machine learning (ML) in real-world scenarios is challenging. In recent years, the database community has focused on creating systems and abstractions for efficiently training ML models on large datasets. But model training is only one of many steps in an end-to-end ML application. Many orthogonal data management problems arise from the large-scale use of ML. The data management community needs to focus on these problems.
For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of the ML model to use needs to be chosen. For that, a set of popular approaches such as linear models, decision trees, and deep neural networks often must be analyzed, evaluated, and interpreted.
The prediction quality of such ML models depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions while allowing for the lifecycle of models (that become stale over time) to be managed.
As a further complication, the resulting systems need to take the target audience of ML applications into account. This audience is heterogeneous, ranging from analysts without programming skills who may prefer an easy-to-use, cloud-based solution, to teams of data processing experts and statisticians who develop and deploy custom-tailored algorithms.
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research to discuss data management issues in ML application scenarios. This workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
Questions? Please send them to [email protected]

Create a Healthcare Data Hub with AWS and Mirth Connect
As anyone visiting their doctor may have noticed, gone are the days of physicians recording their notes on paper. Physicians are more likely to enter the exam room with a laptop than with paper and pen. This change is the byproduct of efforts to improve patient outcomes, increase efficiency, and drive population health. Pushing for these improvements has created many new data opportunities as well as challenges. Using a combination of AWS services and open source software, we can use these new datasets to work towards these goals and beyond.
When you get a physical examination, your doctor’s office has an electronic chart with information about your demographics (name, date of birth, address, etc.), healthcare history, and current visit. When you go to the hospital for an emergency, a whole new record is created that may contain duplicate or conflicting information. A simple example would be that my primary care doctor lists me as Joe whereas the hospital lists me as Joseph.
Providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient.
In this post, I demonstrate the capability to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.
Overview of Mirth Connect
Using open source technologies on AWS, you can build a system that transforms, stores, and processes this data as needed. The system can scale to meet the ever-increasing demands of modern medicine. The project that ingests and processes this data is called Mirth Connect.
Mirth Connect is an open source, cross-platform, bidirectional, healthcare integration engine. This project is a standalone server that functions as a central point for routing and processing healthcare information.
Running Mirth Connect on AWS provides the necessary scalability and elasticity to meet the current and future needs of healthcare organizations.
Healthcare data hub walkthrough
Healthcare information comes from various sources and can generate large amounts of data:
- Health information exchange (HIE)
- Electronic health records system (EHR)
- Practice management system (PMS)
- Insurance company systems
- Pharmacy systems
- Other source systems that can make data accessible
Messages typically require some form of modification (transformation) to accommodate ingestion and processing in other systems. Using another project, Blue Button, you can dissect large healthcare messages and locate the sections/items of interest. You can also convert those messages into other formats for storage and analysis.
Data types
The examples in this post focus on the following data types representing information made available from a typical healthcare organization:
- HL7 (Health Level Seven) version 2 messages
- CDA (Clinical Document Architecture)/CCD (Continuity of Care Document)
- DICOM (Digital Imaging and Communications in Medicine)
- CSV (comma-separated values)
HL7 version 2 messages define both a message format and communication protocol for health information. They are broken into different message types depending on the information that they transmit.
There are many message types available, such as ordering labs, prescription dispensing, billing, and more. During a routine doctor visit, numerous messages are created for each patient. This provides a lot of information but also a challenge in storage and processing. For a full list of message types, see Data Definition Tables, section A6. The two types used for this post are:
- ADT A01 (patient admission and visit notification); view a sample HL7 ADT A01 message
- SIU S12 (new appointment booking)
As the sample message shows, HL7 v2 text is formatted as delimited data, where the delimiters are defined in the first line of the message, called the MSH segment. Mirth Connect can parse these messages and communicate using the standard HL7 network protocol.
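Because the sample message itself isn't reproduced here, the following Python sketch uses a made-up two-segment message to show how the field separator declared in the MSH segment drives parsing; the message content and field values are invented for illustration.

```python
# Sketch: split an HL7 v2 message into segments and fields using the field
# separator declared right after "MSH". The message below is invented.
SAMPLE_HL7 = (
    "MSH|^~\\&|SENDING_APP|SENDING_FAC|RECEIVING_APP|RECEIVING_FAC|"
    "20170101120000||ADT^A01|MSG00001|P|2.3\r"
    "PID|1||123456^^^HOSP^MR||DOE^JOHN||19700101|M\r"
)


def parse_hl7(message: str):
    """Split an HL7 v2 message into segments and fields using its declared delimiter."""
    segments = [s for s in message.split("\r") if s]
    # The field separator is the character immediately after "MSH" (almost always "|").
    field_sep = segments[0][3]
    parsed = {}
    for segment in segments:
        fields = segment.split(field_sep)
        parsed.setdefault(fields[0], []).append(fields[1:])
    return parsed


if __name__ == "__main__":
    msg = parse_hl7(SAMPLE_HL7)
    print(msg["MSH"][0][7])  # message type of the invented sample, e.g. "ADT^A01"
```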

Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
This is a guest post by Tom Talpir, Software Developer at ironSource. ironSource is an Advanced AWS Partner Network (APN) Technology Partner and an AWS Big Data Competency Partner.
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave?
Finding these users can be difficult, mainly because most churn happens within the first few minutes or hours of a user’s gameplay. However, machine learning (ML) can make this possible by providing insights to help developers identify these users and engage with them to decrease the churn rate.
Upopa is a gaming studio that creates cool games (that you should definitely check out), and they were a great fit for our new project: leveraging Amazon Machine Learning (Amazon ML) to give game developers the ability to predict the future actions of their players and ultimately reduce churn, without having to learn complex ML algorithms.

Upopa sends all their data to Amazon Redshift, using ironSource Atom, a data flow management solution that allows developers to send data from their application into many different types of data targets (including Amazon Redshift, Amazon S3, Amazon Elasticsearch Service, and other relational databases) with great ease.
Amazon ML turned out to be the right solution for Upopa, because it integrates easily with Amazon Redshift and makes everything much easier with visualization tools and wizards that guide you through the process of creating ML models.
Month in Review: December 2016
Another month of big data solutions on the Big Data Blog.
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
Implementing Authorization and Auditing using Apache Ranger on Amazon EMR
Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform. Features include centralized security administration, fine-grained authorization across many Hadoop components (Hadoop, Hive, HBase, Storm, Knox, Solr, Kafka, and YARN) and central auditing. In this post, walk through the steps to enable authorization and audit for Amazon EMR clusters using Apache Ranger.
Amazon Redshift Engineering’s Advanced Table Design Playbook
Amazon Redshift is a fully managed, petabyte scale, massively parallel data warehouse that offers simple operations and high performance. In practice, the best way to improve query performance by orders of magnitude is by tuning Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings and configuring tables for data durability and recovery purposes.
- Part 1: Preamble, Prerequisites, and Prioritization
- Part 2: Distribution Styles and Distribution Keys
- Part 3: Compound and Interleaved Sort Keys
- Part 4: Compression Encodings
- Part 5: Table Data Durability
Interactive Analysis of Genomic Datasets Using Amazon Athena
In this post, learn to prepare genomic data for analysis with Amazon Athena. We'll demonstrate how Athena is well-adapted to address common genomics query paradigms using the Thousand Genomes dataset, a seminal genomics study, hosted on Amazon S3. Although this post is focused on genomic analysis, similar approaches can be applied to any discipline where large-scale, interactive analysis is required.
Joining and Enriching Streaming Data on Amazon Kinesis
In this blog post, learn three approaches for joining and enriching streaming data on Amazon Kinesis Streams by using Amazon Kinesis Analytics, AWS Lambda, and Amazon DynamoDB.
Using SaltStack to Run Commands in Parallel on Amazon EMR
SaltStack is an open source project for automation and configuration management. It started as a remote execution engine designed to scale to many machines while delivering high-speed execution. You can now use the new bootstrap action that installs SaltStack on Amazon EMR. It provides a basic configuration that enables selective targeting of the nodes based on instance roles, instance groups, and other parameters.
Building an Event-Based Analytics Pipeline for Amazon Game Studios’ Breakaway
Amazon Game Studios’ new title Breakaway is an online 4v4 team battle sport that delivers fast action, teamwork, and competition. In this post, learn the technical details of how the Breakaway team uses AWS to collect, process, and analyze gameplay telemetry to answer questions about arena design.
Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
Air travel can be stressful due to the many factors that are simply out of airline passengers’ control. As passengers, we want to minimize this stress as much as we can. We can do this by using past data to make predictions about how likely a flight will be delayed based on the time of day or the airline carrier.
In this post, we generate a predictive model for flight delays that can be used to help us pick the flight least likely to add to our travel stress. To accomplish this, we will use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models. This solution gives a good example of combining multiple AWS services to build a sophisticated analytical application in the AWS Cloud.
Architecture
At a high level, our solution includes the following steps:
Step 1 is to ingest datasets:
- We will download publicly available Federal Aviation Administration (FAA) flight data and National Oceanic and Atmospheric Administration (NOAA) weather datasets and stage them in Amazon S3.
- Note: A typical big data workload consists of ingesting data from disparate sources and integrating them. To mimic that scenario, we will store the weather data in an Apache Hive table and the flight data in an Amazon Redshift cluster.
Step 2 is to enrich data by using ETL:
- We will transform the maximum and minimum temperature columns from Celsius to Fahrenheit in the weather table in Hive by using a user-defined function in Spark.
- We enrich the flight data in Amazon Redshift to compute and include extra features and columns (departure hour, days to the nearest holiday) that will help the Amazon Machine Learning algorithm’s learning process.
- We then combine both datasets in the Spark environment by using the spark-redshift package to load data from the Amazon Redshift cluster into Spark running on an Amazon EMR cluster, and we write the enriched data back to an Amazon Redshift table using the same package (as sketched below).
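The following PySpark sketch outlines that round trip; the JDBC URL, table names, join keys, S3 temp directory, and IAM role are placeholders, and the spark-redshift connector is assumed to be available on the EMR cluster.

```python
# Sketch: read flight data from Redshift, join it with the Hive weather table,
# and write the enriched result back to Redshift via spark-redshift.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("flight-delay-etl")
         .enableHiveSupport()
         .getOrCreate())

redshift_url = "jdbc:redshift://my-cluster:5439/flights?user=USER&password=PASS"  # placeholder
tempdir = "s3://my-bucket/redshift-temp/"                                         # placeholder
iam_role = "arn:aws:iam::111122223333:role/redshift-copy-role"                    # placeholder

# Load the flight data from Amazon Redshift into Spark.
flights = (spark.read
           .format("com.databricks.spark.redshift")
           .option("url", redshift_url)
           .option("dbtable", "flight_data")            # placeholder table
           .option("tempdir", tempdir)
           .option("aws_iam_role", iam_role)
           .load())

# Read the weather data from the Hive table and join it with the flight data.
weather = spark.sql("SELECT * FROM weather")            # placeholder Hive table
enriched = flights.join(weather, on=["flight_date", "origin"], how="left")  # placeholder keys

# Write the enriched dataset back to an Amazon Redshift table.
(enriched.write
 .format("com.databricks.spark.redshift")
 .option("url", redshift_url)
 .option("dbtable", "flight_data_enriched")             # placeholder table
 .option("tempdir", tempdir)
 .option("aws_iam_role", iam_role)
 .mode("overwrite")
 .save())
```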
Step 3 is to perform predictive analytics:
- In this last step, we use Amazon Machine Learning to create and train a ML model using Amazon Redshift as our data source. The trained Amazon ML model is used to generate predictions for the test dataset, which are output to an S3 bucket.
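As a rough sketch of this step with boto3's Amazon ML client, the following creates a Redshift-backed data source, trains a binary classification model (delayed or not delayed), and runs a batch prediction to S3; the IDs, SQL query, schema URI, role, and bucket paths are placeholders.

```python
# Sketch: Amazon ML training and batch prediction using Amazon Redshift as the
# data source. All identifiers and locations are placeholders.
import boto3

ml = boto3.client("machinelearning")

# Create a data source that pulls the enriched training data from Amazon Redshift.
ml.create_data_source_from_redshift(
    DataSourceId="flight-delay-training",                           # placeholder
    DataSpec={
        "DatabaseInformation": {"DatabaseName": "flights",          # placeholder
                                "ClusterIdentifier": "my-cluster"}, # placeholder
        "SelectSqlQuery": "SELECT * FROM flight_data_enriched",     # placeholder
        "DatabaseCredentials": {"Username": "ml_user", "Password": "..."},  # placeholder
        "S3StagingLocation": "s3://my-bucket/aml-staging/",         # placeholder
        "DataSchemaUri": "s3://my-bucket/aml-staging/flight.schema",  # placeholder
    },
    RoleARN="arn:aws:iam::111122223333:role/aml-redshift-role",     # placeholder
    ComputeStatistics=True,
)

# Train a binary classification model on that data source.
ml.create_ml_model(
    MLModelId="flight-delay-model",                    # placeholder
    MLModelType="BINARY",
    TrainingDataSourceId="flight-delay-training",
)

# Score the test data source and write predictions to an S3 bucket.
ml.create_batch_prediction(
    BatchPredictionId="flight-delay-predictions",      # placeholder
    MLModelId="flight-delay-model",
    BatchPredictionDataSourceId="flight-delay-test",   # placeholder, created the same way
    OutputUri="s3://my-bucket/aml-output/",            # placeholder
)
```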
