AWS Blog
New – USASpending.gov on an Amazon RDS Snapshot
My colleague Jed Sundwall runs the AWS Public Datasets program. He wrote the guest post below to tell you about an important new dataset that is available as an Amazon RDS Snapshot. In the post, Jed introduces the dataset and shows you how to create an Amazon RDS DB Instance from the snapshot.
— Jeff;

I am very excited to announce that, starting today, the entire public USAspending.gov database is available for anyone to copy via Amazon Relational Database Service (RDS). USAspending.gov includes data on all spending by the federal government, including contracts, grants, loans, employee salaries, and more. The data is available via a PostgreSQL snapshot, which provides bulk access to the entire USAspending.gov database, and is updated nightly. At this time, the database includes all USAspending.gov data for the second quarter of fiscal year 2017, and data going back to the year 2000 will be added over the summer. You can learn more about the database and how to access it on its AWS Public Dataset landing page.
Through the AWS Public Datasets program, we work with AWS customers to experiment with ways that the cloud can make data more accessible to more people. Most of our AWS Public Datasets are made available through Amazon S3 because of its tremendous flexibility and ability to scale to serve any volume of any kind of data files. What’s exciting about the USAspending.gov database is that it provides a great example of how Amazon RDS can be used to share an entire relational database quickly and easily. Typically, sharing a relational database requires extract, transform, and load (ETL) processes that require redundant storage capacity, time for data transfer, and often scripts to migrate your database schema from one database engine to another. ETL processes can be so intimidating and cumbersome that they’re effectively impossible for many people to carry out.
By making their data available as a public Amazon RDS snapshot, the team at USAspending.gov has made it easy for anyone to get a copy of their entire production database for their own use within minutes. This will be useful for researchers and businesses who want to work with real data about all US Government spending and quickly combine it with their own data or other data resources.
Deploying the USAspending.gov Database Using the AWS Management Console
Let’s go through the steps involved in deploying the database in your AWS account using the AWS Management Console.
- Sign in to the AWS Management Console and select the US East (N. Virginia) region in the menu bar.
- Open the Amazon RDS Console and choose Snapshots in the navigation pane.
- In the filter for the search bar, select All Public Snapshots and search for 515495268755:

- Select the snapshot named arn:aws:rds:us-east-1:515495268755:snapshot:usaspending-db.
- Select Snapshot Actions -> Restore Snapshot. Select an instance size, and enter the other details, then click on Restore DB Instance.
- You will see that a DB Instance is being created from the snapshot, within your AWS account.
- After a few minutes, the status of the instance will change to Available.
- You can see the endpoint for your database on the main page along with other useful info:

Deploying the USAspending.gov Database Using the AWS CLI
You can also install the AWS Command Line Interface (CLI) and use it to create a DB Instance from the snapshot. Here’s a sample command:
$ aws rds restore-db-instance-from-db-snapshot --db-instance-identifier my-test-db-cli \
--db-snapshot-identifier arn:aws:rds:us-east-1:515495268755:snapshot:usaspending-db \
--region us-east-1
This will give you an ARN (Amazon Resource Name) that you can use to reference the DB Instance. For example:
$ aws rds describe-db-instances \
--db-instance-identifier arn:aws:rds:us-east-1:917192695859:db:my-test-db-cli
This command will display the Endpoint.Address that you use to connect to the database.
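If you are scripting the entire flow, you can also wait for the restored instance to become available and then pull the endpoint directly. Here is a minimal sketch, reusing the my-test-db-cli identifier from the restore command above:
$ aws rds wait db-instance-available --db-instance-identifier my-test-db-cli --region us-east-1
$ aws rds describe-db-instances --db-instance-identifier my-test-db-cli --region us-east-1 \
--query 'DBInstances[0].Endpoint.[Address,Port]' --output text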
Connecting to the DB Instance
After following the AWS Management Console or AWS CLI instructions above, you will have access to the full USAspending.gov database within this Amazon RDS DB instance, and you can connect to it using any PostgreSQL client using the following credentials:
- Username: root
- Password: password
- Database: data_store_api
If you use psql, you can access the database using this command:
$ psql -h my-endpoint.rds.amazonaws.com -U root -d data_store_api
You should change the database password after you log in:
ALTER USER "root" WITH ENCRYPTED PASSWORD '{new password}';
If you can’t connect to your instance but think you should be able to, you may need to check your VPC Security Groups and make sure inbound and outbound traffic on the port (usually 5432) is allowed from your IP address.
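If you manage your security groups from the CLI, a rule like the following opens the PostgreSQL port to a single address. The security group ID and IP address below are placeholders that you would replace with your own:
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
--protocol tcp --port 5432 --cidr 203.0.113.25/32 --region us-east-1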
Exploring the Data
The USAspending.gov data is very rich, so it will be hard to do it justice in this blog post, but hopefully these queries will give you an idea of what’s possible. To learn about the contents of the database, please review the USAspending.gov Data Dictionary.
The following query will return the total amount of money the government is obligated to pay for contracts awarded by NASA that include “Mars” or “Martian” in the description of the award:
select sum(total_obligation) from awards, subtier_agency
where (awards.description like '% MARTIAN %' OR awards.description like '% MARS %')
AND subtier_agency.name = 'National Aeronautics and Space Administration';
As I write this, the result I get for this query is $55,411,025.42. Note that the database is updated nightly and will include more historical data in the coming months, so you may get a different result if you run this query.
Now, here’s the same query, but looking for awards with “Jupiter” or “Jovian” in the description:
select sum(total_obligation) from awards, subtier_agency
where (awards.description like '%JUPITER%' OR awards.description like '%JOVIAN%')
AND subtier_agency.name = 'National Aeronautics and Space Administration';
The result I get is $14,766,392.96.
Questions & Comments
I’m looking forward to seeing what people can do with this data. If you have any questions about the data, please create an issue on the USAspending.gov API’s issue tracker on GitHub.
— Jed
Event: AWS Community Day in San Francisco
Spring is in the air, new technologies are budding, and the community is buzzing.
It’s also springtime in the city by the bay, and AWS is thrilled to announce an exciting event: AWS Community Day.

AWS Community Day is a community-led event in San Francisco where AWS Community Heroes, user group leaders, and other AWS enthusiasts will come together to deliver a full day of technical sessions on the latest in cloud computing.
During the event, you will get an opportunity to learn about the latest cloud computing trends, optimization best practices, and practical insights into securing your infrastructure. Additionally, you will have the chance to discuss approaches to building healthy AWS meetups and community knowledge-sharing sessions.
To learn more about AWS Community Day in San Francisco, take a look at the blog post written by Community Hero Eric Hammond, found here: https://alestic.com/2017/05/aws-community-day-san-francisco/.
Don’t miss this great event; register today to take part in AWS Community Day.
– Tara
Amazon Chime Update – Use Your Existing Active Directory, Claim Your Domain
I first told you about Amazon Chime this past February (Amazon Chime – Unified Communications Service) and described how I connect and collaborate with people all over the world.
Since the launch, Amazon Chime has quickly become the communication tool of choice within the AWS team. I participate in multiple person-to-person and group chats throughout the day, and frequently “Chime In” to Amazon Chime-powered conferences to discuss upcoming launches and speaking opportunities.
Today we are adding two new features to Amazon Chime: the ability to claim a domain as your own and support for your existing Active Directory.
Claiming a Domain
Claiming a domain gives you the authority to manage Amazon Chime usage for all of the users in the domain. You can make sure that new employees sign up for Amazon Chime in an official fashion and you can suspend accounts for employees that leave the organization.
To claim a domain, you assert that you own a particular domain name and then back up the assertion by adding a TXT record to your domain’s DNS configuration. You must do this for each domain and subdomain that your organization uses for email addresses.
Here’s how I would claim one of my own domains:

After I click on Verify this domain, Amazon Chime provides me with the record for my DNS:

After I do this, the domain’s status will change to Pending Verification. Once Amazon Chime has confirmed that the new record exists as expected, the status will change to Verified and the team account will become an enterprise account.
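If your DNS is hosted in Amazon Route 53, you can publish the verification record from the CLI. Here is a minimal sketch; the hosted zone ID, record name, and record value are placeholders, and you should use the exact name and value that Amazon Chime displays for your domain:
$ aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE12345 \
--change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"example.com","Type":"TXT","TTL":300,"ResourceRecords":[{"Value":"\"chime-verification-token\""}]}}]}'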
Active Directory Support
This feature allows your users to sign in to Amazon Chime using their existing Active Directory identity and credentials. After you have set it up, you can enable and take advantage of advanced AD security features such as password rotation, password complexity rules, and multi-factor authentication. You can also control the allocation of Amazon Chime’s Plus and Pro licenses on a group-by-group basis (check out Plans and Pricing to learn more about each type of license).
In order to use this feature, you must be using an Amazon Chime enterprise account. If you are using a team account, follow the directions at Create an Enterprise Account before proceeding.
Then you will need to set up a directory with the AWS Directory Service. You have two options at this point:
- Use the AWS Directory Service AD Connector to connect to your existing on-premises Active Directory instance.
- Use Microsoft Active Directory, configured for standalone use; a short CLI sketch for this option follows the list. Read How to Create a Microsoft AD Directory for more information on this option.
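For the second option, the directory can also be created from the CLI. This is a minimal sketch, assuming a VPC with subnets in two Availability Zones; the domain name, password, and resource IDs are placeholders:
$ aws ds create-microsoft-ad --name corp.example.com --password 'Str0ngP@ssw0rd!' \
--description "Directory for Amazon Chime" \
--vpc-settings VpcId=vpc-0123456789abcdef0,SubnetIds=subnet-0123456789abcdef0,subnet-0fedcba9876543210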
After you have set up your directory, you can connect to it from within the Amazon Chime console by clicking on Settings and Active directory and choosing your directory from the drop-down:

After you have done this you can select individual groups within the directory and assign the appropriate subscriptions (Plus or Pro) on a group-by-group basis.
With everything set up as desired, your users can log in to Amazon Chime using their existing directory credentials.
These new features are available now and you can start using them today!
If you would like to learn more about Amazon Chime, you can watch the recent AWS Tech Talk: Modernize Meetings with Amazon Chime:
Here is the presentation from the talk:
— Jeff;
EC2 Price Reductions – Reserved Instances & M4 Instances
As AWS grows, we continue to find ways to make it an even better value. We work with our suppliers to drive down costs while also finding ways to build hardware and software that is increasingly more efficient and cost-effective.
In addition to reducing our prices on a regular and frequent basis, we also give customers options that help them to optimize their use of AWS. For example, Reserved Instances (first launched in 2009) allow Amazon EC2 users to obtain a significant discount when compared to On-Demand Pricing, along with a capacity reservation when used in a specific Availability Zone.
Our customers use multiple strategies to purchase and manage their Reserved Instances. Some prefer to make an upfront payment and earn a bigger discount; some prefer to pay nothing upfront and get a smaller (yet still substantial) discount. In the middle, others are happiest with a partial upfront payment and a discount that falls in between the two other options. In order to meet this wide range of preferences we are adding 3 Year No Upfront Standard Reserved Instances for most of the current generation instance types. We are also reducing prices for No Upfront Reserved Instances, Convertible Reserved Instances, and General Purpose M4 instances (both On-Demand and Reserved Instances). This is our 61st AWS Price Reduction.
Here are the details (all changes and reductions are effective immediately):
New No Upfront Payment Option for 3 Year Standard RIs – We previously offered a no upfront payment option with a 1 year term for Standard RIs. Today, we are adding a No Upfront payment option with a 3 year term for C4, M4, R4, I3, P2, X1, and T2 Standard Reserved Instances.
Lower Prices for No Upfront Reserved Instances – We are lowering the prices for No Upfront 1 Year Standard and 3 Year Convertible Reserved Instances for the C4, M4, R4, I3, P2, X1, and T2 instance types by up to 17%, depending on instance type, operating system, and region.
Here are the average reductions for No Upfront Reserved Instances for Linux in several representative regions:
| Instance Type | US East (Northern Virginia) | US West (Oregon) | EU (Ireland) | Asia Pacific (Tokyo) | Asia Pacific (Singapore) |
| --- | --- | --- | --- | --- | --- |
| C4 | -11% | -11% | -10% | -10% | -9% |
| M4 | -16% | -16% | -16% | -16% | -17% |
| R4 | -10% | -10% | -10% | -10% | -10% |
Lower Prices for Convertible Reserved Instances – Convertible Reserved Instances allow you to change the instance family and other parameters associated with the RI at any time; this allows you to adjust your RI inventory as your application evolves and your needs change. We are lowering the prices for 3 Year Convertible Reserved Instances by up to 21% for most of the current generation instances (C4, M4, R4, I3, P2, X1, and T2).
Here are the average reductions for Convertible Reserved Instances for Linux in several representative regions:
| Instance Type | US East (Northern Virginia) | US West (Oregon) | EU (Ireland) | Asia Pacific (Tokyo) | Asia Pacific (Singapore) |
| --- | --- | --- | --- | --- | --- |
| C4 | -13% | -13% | -5% | -5% | -11% |
| M4 | -19% | -19% | -17% | -15% | -21% |
| R4 | -15% | -15% | -15% | -15% | -15% |
Similar reductions will go into effect for nearly all of the other regions as well.
Lower Prices for M4 Instances – We are lowering the prices for M4 Linux instances by up to 7%.
Visit the EC2 Reserved Instance Pricing Page and the EC2 Pricing Page, or consult the AWS Price List API for all of the new prices.
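If you want to pull prices programmatically, the Price List API can also be queried from the CLI (assuming a recent version of the AWS CLI). Here is a hedged sketch that looks up pricing records for m4.large Linux instances with shared tenancy; the API endpoint lives in us-east-1, and the filter fields shown here are case-sensitive:
$ aws pricing get-products --service-code AmazonEC2 --region us-east-1 \
--filters Type=TERM_MATCH,Field=instanceType,Value=m4.large \
Type=TERM_MATCH,Field=operatingSystem,Value=Linux \
Type=TERM_MATCH,Field=tenancy,Value=Shared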
Learn More
The following blog posts contain additional information about some of the improvements that we have made to the EC2 Reserved Instance model:
- New – Instance Size Flexibility for EC2 Reserved Instances – This post introduces the ability to use a single RI for multiple instance sizes within an instance family and region.
- EC2 Reserved Instance Update – Convertible RIs and Regional Benefit – This post introduces the regional benefit (waiving the capacity reservation in exchange for the ability to run an instance in any AZ within the region) and the Convertible RIs that allow you to change the instance family and other parameters.
- Simplifying the EC2 Reserved Instance Model – This post introduces the 3 payment options (All Upfront, Partial Upfront, and No Upfront).
You can also read AWS Pricing and the Reserved Instances FAQ to learn more.
— Jeff;
Roundup of AWS HIPAA Eligible Service Announcements
At AWS we have had a number of HIPAA eligible service announcements. Patrick Combes, the Healthcare and Life Sciences Global Technical Leader at AWS, and Aaron Friedman, a Healthcare and Life Sciences Partner Solutions Architect at AWS, have written this post to tell you all about it.
-Ana
We are pleased to announce that the following AWS services have been added to the BAA in recent weeks: Amazon API Gateway, AWS Direct Connect, AWS Database Migration Service, and Amazon SQS. All four of these services facilitate moving data into and through AWS, and we are excited to see how customers will be using these services to advance their solutions in healthcare. While we know the use cases for each of these services are vast, we wanted to highlight some ways that customers might use these services with Protected Health Information (PHI).
As with all HIPAA-eligible services covered under the AWS Business Associate Addendum (BAA), PHI must be encrypted while at rest or in transit. We encourage you to reference our HIPAA whitepaper, which details how you might configure each of AWS’ HIPAA-eligible services to store, process, and transmit PHI. And of course, for any portion of your application that does not touch PHI, you can use any of our 90+ services to deliver the best possible experience to your users. You can find some ideas on architecting for HIPAA on our website.
Amazon API Gateway
Amazon API Gateway is a web service that makes it easy for developers to create, publish, monitor, and secure APIs at any scale. With PHI now able to securely transit API Gateway, applications such as patient/provider directories, patient dashboards, medical device reports/telemetry, HL7 message processing and more can securely accept and deliver information to any number and type of applications running within AWS or client presentation layers.
One area where we are particularly excited to see customers leverage Amazon API Gateway is the exchange of healthcare information. The Fast Healthcare Interoperability Resources (FHIR) specification will likely become the next-generation standard for how health information is shared between entities. With strong support for RESTful architectures, FHIR can be easily codified within an API on Amazon API Gateway. For more information on FHIR, our AWS Healthcare Competency partner, Datica, has an excellent primer.
AWS Direct Connect
Some of our healthcare and life sciences customers, such as Johnson & Johnson, leverage hybrid architectures and need to connect their on-premises infrastructure to the AWS Cloud. Using AWS Direct Connect, you can establish private connectivity between AWS and your datacenter, office, or colocation environment, which in many cases can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.
In addition to a hybrid-architecture strategy, AWS Direct Connect can assist with the secure migration of data to AWS, which is the first step to using the wide array of our HIPAA-eligible services to store and process PHI, such as Amazon S3 and Amazon EMR. Additionally, you can connect to third-party/externally-hosted applications or partner-provided solutions as well as securely and reliably connect end users to those same healthcare applications, such as a cloud-based Electronic Medical Record system.
AWS Database Migration Service (DMS)
To date, customers have migrated over 20,000 databases to AWS through the AWS Database Migration Service. Customers often use DMS as part of their cloud migration strategy, and now it can be used to securely and easily migrate your core databases containing PHI to the AWS Cloud. As your source database remains fully operational during the migration with DMS, you minimize downtime for these business-critical applications as you migrate your databases to AWS. This service can now be utilized to securely transfer such items as patient directories, payment/transaction record databases, revenue management databases and more into AWS.
Amazon Simple Queue Service (SQS)
Amazon Simple Queue Service (SQS) is a message queueing service for reliably communicating among distributed software components and microservices at any scale. One way that we envision customers using SQS with PHI is to buffer requests between application components that pass HL7 or FHIR messages to other parts of their application. You can leverage features like SQS FIFO queues to ensure that messages containing PHI are delivered in the order they are received and remain available until a consumer processes and deletes them. This is important for applications that handle patient record updates or process payment information in a hospital.
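As a quick illustration, here is a minimal sketch of creating a FIFO queue and sending a message to it from the CLI; the queue name, account ID, message body, and message group ID are placeholders:
$ aws sqs create-queue --queue-name hl7-messages.fifo \
--attributes FifoQueue=true,ContentBasedDeduplication=true
$ aws sqs send-message \
--queue-url https://sqs.us-east-1.amazonaws.com/123456789012/hl7-messages.fifo \
--message-body "example HL7 payload" --message-group-id patient-12345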
Let’s get building!
We are beyond excited to see how our customers will use our newly HIPAA-eligible services as part of their healthcare applications. What are you most excited for? Leave a comment below.
AWS Hot Startups – April 2017
Spring is here, the flowers are blooming and Tina Barr is back with more great startups for you to check out!
-Ana
Welcome back to another month of hot AWS-powered startups! Today we have three exciting startups:
- Beekeeper – simplifying employee communication in the workplace.
- Betterment – making investing easier for everyone.
- ClearSlide – a leading sales engagement platform.
Be sure to check out our March hot startups in case you missed them.
Beekeeper (Zurich, Switzerland)
Flavio Pfaffhauser and Christian Grossmann, both graduates of ETH Zurich, were passionate about building a technology that would connect and bring people together. What started as a student’s social community soon turned into Beekeeper – a communication platform for the workplace that allows employees to interact wherever they are. As Flavio and Christian learned how to build a social platform that engaged people properly, businesses began requesting a platform that could be adapted to their specific processes and needs. The platform started with the concept of helping people feel as if they are sitting right next to each other, whether they’re at a desk or in the field. Founded in 2012, Beekeeper is focused on improving information sharing, communication and peer collaboration, and the company strongly believes that listening to employees is crucial for organizations.
The “Mobile First, Desktop Friendly” platform has a simple and intuitive interface that easily integrates multiple operating systems into one ecosystem. The interface can be styled and customized to match a company’s brand and identity. Employees can connect with their colleagues anytime and anywhere with private and group chats, video and file sharing, and feedback surveys. With Beekeeper’s analytical dashboard leadership teams can identify trending topics of discussion and track employee engagement and app usage in real-time. Beekeeper is currently connecting users in 137 countries across industries including hospitality, construction, transportation, and more.
Beekeeper likes using AWS because it allows their engineers to focus on the things that really matter: solving customer issues. The company builds its infrastructure using services like Amazon EC2, Amazon S3, and Amazon RDS, all of which allow the technical teams to offload administrative tasks. Amazon Elastic Transcoder and Amazon QuickSight are used to build analytical dashboards, and Amazon Redshift is used for data warehousing.
Check out the Beekeeper blog to keep up with their latest news!
Betterment (New York, NY)
Betterment is on a mission to make investing easier and more accessible for everyone, no matter their financial goal. In 2008, Jon Stein founded Betterment with the intent to reinvent the industry and save future investors from making the same common mistakes he had been making. At that time, most people only had a couple of options when it came to investing their money – either do it yourself or hire another person to do it for you. Unfortunately, financial advisors are sometimes paid to recommend certain investments even if it’s not what is best for their clients. Betterment only chooses investments that are in their customers’ best interest and align with their financial goals. Today, they are the largest independent online investment advisor, managing more than $8 billion in assets for over 240,000 customers.
Betterment uses technology to make investing easier and more efficient, while also helping to increase after-tax returns. They offer a wide range of financial planning services that are personalized to their customers’ life goals. To start an investment plan, customers can input their age, retirement status, and annual income, and Betterment will recommend how much money to invest and which type of account is the right choice. They will then invest and manage that money in a way that many traditional investment services can’t, and at a lower cost.
The engineers at Betterment are constantly working to build industry-changing technology as quickly as possible to help customers maximize their money. AWS gives Betterment the flexibility to easily provision infrastructure and offload functions to various services that once required entire teams to manage. When they first started in the cloud, Betterment was using standard implementations of Amazon EC2, Amazon RDS, and Amazon S3. Since they’ve gone all in with AWS, they have been leveraging services like Amazon Redshift, AWS Lambda, AWS Database Migration Service, Amazon Kinesis, Amazon DynamoDB, and more. Today, they are using over 20 AWS services to develop, test, and deploy features and enhancements on a daily basis.
Learn more about Betterment here.
ClearSlide (San Francisco, CA)
ClearSlide is one of today’s leading sales engagement platforms, offering a complete and integrated tool that makes every customer interaction successful. Since their founding in 2009, ClearSlide has looked for ways to improve customer experiences and has developed numerous enablement tools for sales leaders and teams, marketing, customer support teams, and more. The platform puts content, communication channels, and insights at their customers’ fingertips to help drive better decisions and manage opportunities. ClearSlide serves thousands of companies including Comcast, the Sacramento Kings, and The Economist, and so far their customers have generated over 750 million minutes of engagement!
ClearSlide offers a solution for all parts of the sales process. For sales leaders, ClearSlide provides engagement dashboards to improve deal visibility, coaching, and sales forecast accuracy. For marketing and sales enablement teams, they guide sellers to the right content, at the right time, in the right context, and provide insight to maximize content ROI. For sales reps, ClearSlide integrates communications, content, and analytics in a single platform experience. Communications can be made across email, in-person or online meetings, web, or social. Today, ClearSlide customers report a 10-20% increase in closed deals, 25% decrease in onboarding time for new reps, and a 50-80% reduction in selling costs.
ClearSlide uses a range of AWS services, but Amazon EC2 and Amazon RDS have made the biggest impact on their business. EC2 enables them to easily scale compute capacity, which is critical for a fast-growing startup. It also provides consistency during deployment – from development and integration to staging and production. RDS reduces overhead and allows ClearSlide to scale their database infrastructure. Since AWS takes care of time-consuming database management tasks, ClearSlide sees a reduction in operations costs and can focus on being more strategic with their customers.
Watch this video to learn how LiveIntent reduced sales cycles by 22% using ClearSlide. Get all the latest updates by following them on Twitter!
Thanks for checking out another month of awesome AWS-powered startups!
-Tina
New – Server-Side Encryption for Amazon Simple Queue Service (SQS)
As one of the most venerable members of the AWS family of services, Amazon Simple Queue Service (SQS) is an essential part of many applications. Presentations such as Amazon SQS for Better Architecture and Massive Message Processing with Amazon SQS and Amazon DynamoDB explain how SQS can be used to build applications that are resilient and highly scalable.
Today we are making SQS even more useful by adding support for server-side encryption. You can now choose to have SQS encrypt messages stored in both Standard and FIFO queues using an encryption key provided by AWS Key Management Service (KMS). You can choose this option when you create your queue and you can also set it for an existing queue.
SSE encrypts the body of the message, but does not touch the queue metadata, the message metadata, or the per-queue metrics. Adding encryption to an existing queue does not encrypt any backlogged messages. Similarly, removing encryption leaves backlogged messages encrypted.
Creating an Encrypted Queue
The newest version of the AWS Management Console allows you to choose between Standard and FIFO queues using a handy graphic:

You can set the attributes for the queue and the optional Dead Letter Queue:

And you can now check Use SSE and select the desired key:

You can use the AWS-managed Customer Master Key (CMK) which is unique for each customer and region, or you can create and manage your own keys. If you choose to use your own keys, don’t forget to update your KMS key policies so that they allow for encryption and decryption of messages.
You can also configure the data key reuse period. This interval controls how often SQS refreshes cryptographic information from KMS, and can range from 1 minute up to 24 hours. Using a shorter interval can improve your security profile, but will increase your KMS costs.
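If you prefer to do this from the CLI, here is a minimal sketch that creates a Standard queue with SSE enabled, using the AWS-managed CMK for SQS and a 5 minute data key reuse period; the queue name is a placeholder:
$ aws sqs create-queue --queue-name my-encrypted-queue \
--attributes KmsMasterKeyId=alias/aws/sqs,KmsDataKeyReusePeriodSeconds=300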
To learn more, read the SQS SSE FAQ and the documentation for Server-Side Encryption.
Available Now
Server-side encryption is available today in the US West (Oregon) and US East (Ohio) Regions, with support for others in the works.
There is no charge for the use of encryption, but you will be charged for the calls that SQS makes to KMS. To learn more about this, read How do I Estimate My Customer Master Key (CMK) Usage Costs.
— Jeff;
Use Amazon WorkSpaces on Your Samsung Galaxy S8/S8+ With the New Samsung DeX
It is really interesting to watch as technology evolves and improves. For example, today’s mobile phones offer screens with resolution that rivals a high-end desktop, along with multiple connectivity options and portability.
Earlier this week I had the opportunity to get some hands-on experience with the brand-new Samsung Galaxy S8+ phone and a unique new companion device called the Samsung DeX Station. I installed the Amazon WorkSpaces client for Android tablet on the phone, entered the registration code for my WorkSpace, and logged in. You can see all of this in action in my new video:
DeX includes USB connectors for your keyboard and mouse, and can also communicate with them using Bluetooth. It also includes a cooling fan, a fast phone charger, plus HDMI and Ethernet ports (You can also use your phone’s cellular or Wi-Fi connections).
Bring it all together and you can get to your cloud-based desktop from just about anywhere. Travel light, use the TV / monitor in your hotel room, and enjoy full access to your corporate network, files, and other resources.
— Jeff;
PS – If you want to know more about my working environment, check out I Love my Amazon WorkSpace.
File Interface to AWS Storage Gateway
I should probably have a blog category for “catching up from AWS re:Invent!” Last November we made a really important addition to the AWS Storage Gateway that I was too busy to research and write about at the time.
As a reminder, the Storage Gateway is a multi-protocol storage appliance that fits in between your existing applications and the AWS Cloud. Your applications and your client operating systems see the gateway (depending on the configuration) as a file server, a local disk volume, or a virtual tape library (VTL). Behind the scenes, the gateway uses Amazon Simple Storage Service (S3) for cost-effective, durable, and secure storage. Storage Gateway caches data locally and uses bandwidth management to optimize data transfers.
Storage Gateway is delivered as a self-contained virtual appliance that is easy to install, configure, and run (read the Storage Gateway User Guide to learn more). It allows you to take advantage of the scale, durability, and cost benefits of cloud storage from your existing environment. It reduces the process of moving existing files and directories into S3 to a simple drag and drop (or a CLI-based copy).
As is the case with many AWS services, the Storage Gateway has gained many features since we first launched it in 2012 (The AWS Storage Gateway – Integrate Your Existing On-Premises Applications with AWS Cloud Storage). At launch, the Storage Gateway allowed you to create storage volumes and to attach them as iSCSI devices, with options to store either the entire volume or a cache of the most frequently accessed data in the gateway, all backed by S3. Later, we added support for Virtual Tape Libraries (Create a Virtual Tape Library Using the AWS Storage Gateway). Earlier this year we added read-only file shares, user permission squashing, and scanning for added and removed objects.
New File Interface
At AWS re:Invent we launched a third option, and that’s what I’d like to tell you about today. You can now use the Storage Gateway as a virtual file server that you can mount on your on-premises servers and desktops. After you set it up in your data center or in the cloud, your configured buckets will be available as NFS mount points. Your application simply reads and writes files and directories over NFS; behind the scenes, the gateway turns these operations into object-level requests on your S3 buckets, where they are accessible natively (one S3 object per file). To create a file gateway, you simply visit the Storage Gateway Console, click on Get started, and choose File gateway:

Then choose your host platform: VMware ESXi or Amazon EC2:

I expect many of our customers to host the Storage Gateway on premises and to use it as a permanent or temporary bridge to the cloud. Use cases for this option include simplified backups, migration, archiving, analytics, storage tiering, and compute-intensive cloud-based processing. Once the data is in the cloud, you can take advantage of many features of S3 including multiple storage tiers (Infrequent Access and Glacier are great for archiving), storage analytics, tagging, and the like.
I don’t have much data on-premises so I’m going to run the Storage Gateway on an EC2 instance for this post. I launched the instance and set it up per the instructions on the screen, taking care to create the proper inbound security group rules (port 80 for HTTP access and port 2049 for NFS). I added 150 GiB of General Purpose SSD storage to be used as a cache:

After the instance launched I captured its public IP address and used it to connect to my newly launched gateway:

I set the time zone and assigned a name to my gateway and clicked on Activate gateway:

Then I configured the local storage as a cache, and clicked on Save and continue:

My gateway was up and running, and I could see it in the console:

Next, I clicked on Create file share to create an NFS share and associate it with an S3 bucket:

As you can see, I had the opportunity to choose my storage class (Standard or Standard – Infrequent Access), in accordance with my needs and my use case. The gateway needs to be able to upload files into my bucket; clicking on Create a new IAM role will create a role and a policy (read Granting Access to an Amazon S3 Destination to learn more).
I review my settings and click on Create file share:

By the way, Root squash is a feature of the AWS Storage Gateway, not a vegetable. When enabled (as it is by default) files that arrive as owned by root (user id 0) are mapped to user id 65534 (traditionally known as nobody). I can also set up default permissions for new files and new directories.
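The same kind of file share can also be created from the CLI. Here is a hedged sketch; the client token, gateway ARN, and role ARN are placeholders, while the bucket matches the one used in this walkthrough:
$ aws storagegateway create-nfs-file-share --client-token my-share-token-1 \
--gateway-arn arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-12A3456B \
--location-arn arn:aws:s3:::jbarr-gw-1 \
--role arn:aws:iam::123456789012:role/StorageGatewayS3Access \
--default-storage-class S3_STANDARD --squash RootSquash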
My new share is visible in the console, and available for use within seconds:

The console displays the appropriate mount commands for Linux, Microsoft Windows, and macOS. Those commands use the private IP address of the instance; in many cases you will want to use the public address instead (needless to say, you should exercise extreme care when you create a public NFS share, and maintain close control over the IP addresses that are allowed to connect).
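For reference, mounting the share from a Linux client looks roughly like this; the gateway address and local mount point are placeholders, and the export name matches the bucket:
$ sudo mkdir -p /mnt/jbarr-gw-1
$ sudo mount -t nfs -o nolock 203.0.113.10:/jbarr-gw-1 /mnt/jbarr-gw-1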
I flipped over to the S3 console and inspected the bucket (jbarr-gw-1), finding it empty, as expected:

Then I turned to my EC2 instance, mounted the share, and copied some files to it:

I returned to the console and found a new folder (jeff_code) in my bucket, as expected. I ventured inside and found the files that I had copied to the share:

As you can see, my files are copied directly into S3 and are simply regular S3 objects. This means that I can use my existing S3 tools, code, and analytics to process them. For example:
- Analytics – The new S3 metrics and analytics can be used to analyze the entire bucket or any directory tree within it:

- Code – AWS Lambda and Amazon Rekognition can be used to process uploaded images; see Serverless Photo Recognition for some ideas and some code. I could also use Amazon Elasticsearch Service to index some or all of the files or Amazon EMR to process massive amounts of data.
- Tools – I can process the existing objects in the bucket and I can also create new ones using the S3 APIs. Any code or script that creates or removes objects should call the RefreshCache function to synchronize the contents of any gateways attached to the bucket (I can create a multi-site data distribution workflow by pointing multiple read-only gateways at the same bucket); see the CLI sketch after this list. I can also make use of existing, file-centric backup tools by using the share as the destination for my backups.
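Here is a minimal sketch of calling RefreshCache from the CLI; the file share ARN is a placeholder for the ARN shown in the Storage Gateway Console:
$ aws storagegateway refresh-cache \
--file-share-arn arn:aws:storagegateway:us-east-1:123456789012:share/share-ABC1234D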
The gateway stores all of the file metadata (owner, group, permissions, and so forth) as S3 metadata:

Storage Gateway Resources
Here are some resources that will help you to learn more about the Storage Gateway:
Presentation – Deep Dive on the AWS Storage Gateway:
White Paper – File Gateway for Hybrid Architectures – Overview and Best Practices:
Recent Videos:
- Deep Dive on the AWS Storage Gateway – AWS Online Tech Talk.
- Introducing the New AWS Storage Gateway – re:Invent 2016.
- Using the AWS Storage Gateway Virtual Tape Library with Veritas Backup Exec.
- Getting Started with the Hybrid Cloud: Enterprise Backup and Recovery – re:Invent 2016.
Available Now
This cool AWS feature has been available since last November!
— Jeff;
AWS Enables Consortium Science to Accelerate Discovery
My colleague Mia Champion is a scientist (check out her publications), an AWS Certified Solutions Architect, and an AWS Certified Developer. The time that she spent doing research on large datasets gave her an appreciation for the value of cloud computing in the bioinformatics space, which she summarizes and explains in the guest post below!
— Jeff;
Technological advances in scientific research continue to enable the collection of exponentially growing datasets that are also increasing in the complexity of their content. The global pace of innovation is now also fueled by the recent cloud-computing revolution, which provides researchers with a seemingly boundless, scalable, and agile infrastructure. Now, researchers can remove the hindrances of having to own and maintain their own sequencers, microscopes, compute clusters, and more. Using the cloud, scientists can easily store, manage, process, and share datasets for millions of patient samples, with gigabytes and more of data for each individual. As American physicist John Bardeen once said: “Science is a collaborative effort. The combined results of several people working together is much more effective than could be that of an individual scientist working alone.”
Prioritizing Reproducible Innovation, Democratization, and Data Protection
Today, we have many individual researchers and organizations leveraging secure, cloud-enabled data sharing on an unprecedented scale and producing innovative, customized analytical solutions using the AWS cloud. But can secure data sharing and analytics be done on such a collaborative scale as to revolutionize the way science is done across a domain of interest, or even across disciplines of science? Can building a cloud-enabled consortium of resources remove the analytical variability that leads to diminished reproducibility, which has long plagued the interpretability and impact of research discoveries? The answers to these questions are ‘yes’, and initiatives such as the Neuro Cloud Consortium, The Global Alliance for Genomics and Health (GA4GH), and The Sage Bionetworks Synapse platform, which powers many research consortiums including the DREAM challenges, are starting to put into practice model cloud initiatives that will not only provide impactful discoveries in the areas of neuroscience, infectious disease, and cancer, but are also revolutionizing the way in which scientific research is done.
Bringing Crowd Developed Models, Algorithms, and Functions to the Data
Collaborative projects have traditionally allowed investigators to download datasets such as those used for comparative sequence analysis or for training a deep learning algorithm on medical imaging data. Investigators were then able to develop and execute their analysis using institutional clusters, local workstations, or even laptops:

This method of collaboration is problematic for many reasons. The first concern is data security, since dataset download essentially permits “chain-data-sharing” with any number of recipients. Second, analytics performed in compute environments that are not templated at some level introduce the risk of variable results that cannot be reproduced by a different investigator, or even by the same investigator using a different compute environment. Third, the required data dump, processing, and then re-upload or distribution to the collaborative group is highly inefficient and dependent upon each individual’s networking and compute capabilities. Overall, traditional methods of scientific collaboration compromise security and hamper time to discovery.
Using the AWS cloud, collaborative researchers can share datasets easily and securely by taking advantage of Identity and Access Management (IAM) policy restrictions for user bucket access as well as S3 bucket policies or Access Control Lists (ACLs). To streamline analysis and ensure data security, many researchers are eliminating the need to download datasets entirely by leveraging resources that move the analytics to the data source and/or by using remote API requests to access a shared database or data lake. One way our customers are accomplishing this is to leverage Docker container technology to provide collaborators with a way to submit algorithms or models for execution on the system hosting the shared datasets:

Docker container images have all of the application’s dependencies bundled together, and therefore provide a high degree of versatility and portability, which is a significant advantage over other executable-based approaches. In the case of collaborative machine learning projects, each Docker container will contain applications, the language runtime, packages and libraries, as well as any of the more popular deep learning frameworks commonly used by researchers, including MXNet, Caffe, TensorFlow, and Theano.
A common feature in these frameworks is the ability to leverage a host machine’s Graphical Processing Units (GPUs) for significant acceleration of the matrix and vector operations involved in machine learning computations. As such, researchers with these objectives can leverage EC2’s new P2 instance types in order to power execution of submitted machine learning models. In addition, GPUs can be mounted directly to containers using the NVIDIA Docker tool and appear at the system level as additional devices. By leveraging Amazon EC2 Container Service and the EC2 Container Registry, collaborators are able to execute analytical solutions submitted to the project repository by their colleagues in a reproducible fashion, as well as continue to build on their existing environment. Researchers can also architect a continuous deployment pipeline to run their Docker-enabled workflows.
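As a small, concrete illustration of the GPU pass-through described above, here is a minimal sketch of running the standard NVIDIA CUDA image on a GPU-equipped instance (such as a P2) with the NVIDIA Docker tool; the image and command shown are simply a smoke test that confirms the container can see the host’s GPUs:
$ nvidia-docker run --rm nvidia/cuda nvidia-smi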
In conclusion, emerging cloud-enabled consortium initiatives serve as models for the broader research community for how cloud-enabled community science can expedite discoveries in Precision Medicine while also providing a platform where data security and discovery reproducibility is inherent to the project execution.
— Mia D. Champion, Ph.D.


