Posted:
By the time you are done reading this blog post, Google Cloud Platform customers will have processed hundreds of millions of messages and analyzed thousands of terabytes of data utilizing Cloud Dataflow, Cloud Pub/Sub, and BigQuery. These fully-managed services remove the operational burden found in traditional data processing systems. They enable you to build applications on a platform that can scale with the growth of your business and drive down data processing latency, all while processing your data efficiently and reliably.


Every day, customers use Google Cloud Platform to execute business-critical big data processing workloads, including: financial fraud detection, genomics analysis, inventory management, click-stream analysis, A/B user interaction testing and cloud-scale ETL.


Today we are removing our “beta” label and making Cloud Dataflow generally available. Cloud Dataflow is specifically designed to remove the complexity of developing separate systems for batch and streaming data sources by providing a unified programming model. Based on more than a decade of Google innovation, including MapReduce, FlumeJava, and MillWheel, Cloud Dataflow is built to free you from the operational overhead of large-scale cluster management and optimization.
Cloud Dataflow provides a unified computation model for batch and streaming processing


With Cloud Dataflow GA you get:
  1. A fully managed, fault tolerant, highly available, SLA-backed service for batch and stream processing.



"We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost."
Sudhir Hasbe, Director of Software Engineering, Zulily.com

“The current iteration of Qubit’s real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google’s MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow - which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API.”
Jibran Saithi, Lead Architect, Qubit


  2. A comprehensive model for balancing correctness, latency, and cost when dealing with unordered data at massive scale. These concepts power key elements of the Cloud Dataflow programming model.


"Streaming Google Cloud Dataflow perfectly fits requirements of time series analytics platform at Wix.com, in particular, its scalability, low latency data processing and fault-tolerant computing. Wide range of data collection transformations and grouping operations allow to implement complex stream data processing algorithms."
Gregory Bondar, Ph.D., Sr. Director of Data Services Platform, Wix.com


  3. Great performance. Cloud Dataflow is 2-3x faster and cheaper than Hadoop when evaluating classic MapReduce-based pipelines, such as PageRank and WordCount. And with dynamic work rebalancing, Cloud Dataflow effectively optimizes resource utilization, providing additional performance gains without requiring manual intervention.


  4. An extensible SDK. We have expanded our technology partner, 3rd-party connector, and service provider integration efforts, including Tamr, Salesforce, ClearStory, springML, Cloudera, and data Artisans. We also continue to support alternate runner enablement for Apache Spark and Apache Flink.



"We're excited  to collaborate with Google Cloud Platform on integrations with Salesforce Wave. The integrations with Google Cloud Dataflow further enable Wave to deliver insights to business users. Businesses can now use vast, diverse datasets like machine-generated data to derive customer insights in near-real-time."
Olivier Pin, VP of Product Management, Wave Analytics, Salesforce.com


"Tamr and Google Cloud Dataflow are simplifying how people access and use crucial data and distributed computing assets in the enterprise. The combination of Cloud Dataflow and Tamr running on Google Cloud Platform enables organizations to connect and enrich their enterprise data at internet scale."
Andy Palmer, co-founder and CEO of Tamr, Inc.




Cloud Dataflow seamlessly integrates with Google Cloud Platform, third party services & data stores


  5. Native Google Cloud Platform integration for Cloud Storage, Cloud Datastore, BigQuery, and Cloud Pub/Sub. You now get full query support for our BigQuery source. Our integration with Cloud Pub/Sub now provides source timestamp processing in addition to arrival time processing. Source timestamps, when combined with flexible Windowing and Triggering primitives, enable developers to produce more accurate windows of data output, as sketched below.
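
To make item 5 concrete, here is a minimal sketch of a streaming pipeline built with the Cloud Dataflow SDK for Java that reads from Cloud Pub/Sub using source timestamps and then windows the data by event time. The topic name and the "ts" timestamp attribute are hypothetical; this is an illustrative sketch, not a definitive implementation.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubsubEventTimeCounts {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // run as a streaming pipeline

    Pipeline p = Pipeline.create(options);

    p.apply(PubsubIO.Read
            .topic("projects/my-project/topics/events")  // hypothetical topic
            .timestampLabel("ts"))                        // use the source timestamp attribute
     // Window by event (source) time rather than arrival time.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
     // Count occurrences of each message per one-minute window.
     .apply(Count.<String>perElement());
     // ...a sink such as BigQueryIO.Write would normally follow here.

    p.run();
  }
}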



"We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark"
Paul Clarke, Director of Technology, Ocado


A decade of internal innovation also stands behind today’s general availability of Google Cloud Pub/Sub. Delivering over a trillion messages for our alpha and beta customers has helped tune our performance, refine our v1 API, and ensure a stable foundation for Cloud Dataflow’s streaming ingestion, Cloud Logging’s streaming export, Gmail’s Push API, and Cloud Platform customers streaming their own production workloads —  at rates up to 1 million message operations per second.


Such diverse scenarios demonstrate how Cloud Pub/Sub is designed to deliver real-time and reliable messaging — in one global, managed service that helps you create simpler, more robust, and more flexible applications.


Cloud Pub/Sub connects your services to each other, to other Google APIs, and to third parties.


Cloud Pub/Sub can help integrate applications and services reliably, as well as analyze big data streams in real-time. Traditional approaches require separate queueing, notification, and logging systems, each with their own APIs and tradeoffs between durability, availability, and scalability. Cloud Pub/Sub addresses a broad range of scenarios with a single API, a managed service that eliminates those tradeoffs, and remains cost-effective as you grow, with pricing as low as 5¢ per million message operations for sustained usage.


General availability is a key milestone, though hardly the end of the road. We are continuing to innovate with the alpha release of the gcloud pubsub tool and today’s beta release of our new Identity and Access Management (IAM) APIs and Permissions Editor in the Google Developers Console. These improvements allow users to control access down to the level of particular operations on specific topics and subscriptions. IAM ACLs make it easier to connect multiple Cloud Platform projects, either within the same organization or to third-party services.


Get Started
We’re looking forward to this next step for Google Cloud Platform as we continue to help developers and businesses everywhere benefit from Google’s technical and operational expertise in big data. Please visit Cloud Dataflow and Cloud Pub/Sub to learn more and contact us with your feedback, ideas for new connectors, or even new public data feeds we can help you share.

- Posted by Eric Schmidt (not that Eric), PM Cloud Dataflow & Rohit Khare, PM Cloud Pub/Sub

Posted:
Big data processing can take place in many contexts. Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost. The best deployment option often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs — from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.

But in all these cases, what remains true is that you need an easy-to-use, powerful and flexible programming model that makes developers productive. And no one wants to have to rewrite their algorithm for a specific runtime.

We believe the Dataflow programming model, based on years of experience at Google, can provide maximum developer productivity and seamless portability. That's why in December we open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing. This allows the same program to execute either in stream or batch mode.

Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. There are currently three runners available to allow Dataflow programs to execute in different environments:

  • Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.
  • Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.
  • Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this blog post.
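
Because the runner is chosen through pipeline options rather than in the pipeline code, the same program can move between these environments unchanged. Here is a minimal, illustrative sketch using the Dataflow SDK for Java; the bucket paths are hypothetical, and the exact class name and setup for the Cloudera Spark runner are documented in its GitHub repo rather than assumed here.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class RunnerAgnosticCopy {
  public static void main(String[] args) {
    // Select the execution environment on the command line, for example:
    //   --runner=DirectPipelineRunner                     (local machine)
    //   --runner=DataflowPipelineRunner --project=... \
    //       --stagingLocation=gs://my-bucket/staging      (Cloud Dataflow service)
    // The Spark runner from Cloudera Labs is selected the same way, using the
    // runner class that it provides.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))       // hypothetical input
     .apply(TextIO.Write.to("gs://my-bucket/output/copy"));   // hypothetical output
    p.run();
  }
}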

We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises. Please join us – whether by using the Dataflow SDK (deploying via one of the three runners listed above) for your own data processing pipelines, or by creating a new Dataflow runner for your favorite big data runtime.

-Posted by William Vambenepe, Product Manager

Posted:

Fun fact: around 170 million taxi journeys take place across New York City every year, generating vast amounts of information each time someone steps into and out of one of those bright yellow cabs. How much information exactly? Being a not-so-secret maps enthusiast, I made it my challenge to visualize a NYC taxi dataset on Google Maps.

Anyone who’s tried to put a large amount of data points on a map knows about the difficulties one faces when working with big geolocation data. That's why I want to share with you how I used Cloud Dataflow to spatially aggregate every single pick-up and drop-off location with the objective of painting the whole picture on a map. For background info, Google Cloud Dataflow is now in alpha stage and can help you gain insight into large geolocation datasets. You can try experimenting with it by applying for the alpha program or learn more with yesterday's update.

When I first sat down to think through this data visualization, I knew I needed to create a thematic map, so I built a simple pipeline to geofence all 340 million pick-up and drop-off locations against 342 different polygons, the result of converting the NYC neighbourhood tabulation areas into single-part polygons. You can find the processed data in this public BigQuery table. (In order to access BigQuery you need at least one project listed in your Google Developers Console. After creating a project you can access the table by following this link.)
Thematic map showing the distribution of taxi pick-up locations in NYC in 2013. Midtown South is New Yorkers’ favourite area to get a cab with almost 28 million trips starting there, which is roughly 1 trip per second. You can find an interactive map here.

This open data, released by the NYC Taxi & Limo Commission, has been the foundation for some beautiful visualizations. By utilizing the power of Google Cloud Platform's tools, I’ve been able to spatially aggregate the data using Cloud Dataflow, and then do ad hoc querying on the results using BigQuery, to gain fast and comprehensive insight into this immense dataset.

With the Google Cloud Dataflow SDK, which parallelizes the data transformations across multiple Cloud Platform instances, I was able to build, test and run the whole processing pipeline in a couple of days. The actual processing, distributed across five workers, took slightly less than two hours.

The pipeline’s architecture is extremely simple. Since Cloud Dataflow offers a BigQuery reader and writer, most of the heavy lifting is already taken care of. The only thing I had to provide was the geofencing function that could be parallelised across multiple instances. For a detailed description on how to do complex geofencing using open source libraries see this post on the Google Developers Blog.
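
For orientation, here is a simplified sketch of what such a pipeline looks like with the Dataflow SDK for Java: a BigQuery read, a geofencing ParDo, and a BigQuery write. The table names are hypothetical, the DoFn below is only a stand-in (a fuller geofencing sketch appears a few paragraphs further down), and the destination table is assumed to already exist.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class TaxiGeofencePipeline {
  /** Placeholder DoFn; the spatial-index version is sketched further below. */
  static class GeofenceFn extends DoFn<TableRow, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
      // The real version tags each row with its pick-up and drop-off neighbourhood.
      c.output(c.element());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(BigQueryIO.Read.from("my-project:nyc_taxi.trips_2013"))   // hypothetical source table
     .apply(ParDo.of(new GeofenceFn()))
     .apply(BigQueryIO.Write
         .to("my-project:nyc_taxi.trips_geofenced")                   // hypothetical output table
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}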

When executing the pipeline, Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass and deploys the result to multiple Google Compute Engine instances. At the time of deploying the pipeline you can read in files from Google Cloud Storage that contain data you need for your transformations, e.g., shapefiles or GeoJSON formats. Alternatively you can call an external API to load in the geofences you want to test against.

I utilized an API I built on App Engine which exposes a list of geofences stored in Datastore. Using the Java Topology Suite, I created a spatial index, maintained in a class variable in the memory of each instance, for fast query access.
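
Here is a sketch of how that per-worker spatial index might look, using a JTS STRtree held in a static field so it is built once per worker JVM. The TableRow field names and the loadPolygons() helper are hypothetical; loading the actual polygons (from Cloud Storage, or from the App Engine geofence API mentioned above) is left as a placeholder.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.index.strtree.STRtree;
import java.util.Collections;
import java.util.List;

public class SpatialIndexGeofenceFn extends DoFn<TableRow, TableRow> {
  private static STRtree index;  // one spatial index per worker JVM, not serialized with the DoFn
  private static final GeometryFactory GEOMETRY_FACTORY = new GeometryFactory();

  private static synchronized STRtree getIndex() {
    if (index == null) {
      index = new STRtree();
      for (Geometry polygon : loadPolygons()) {
        index.insert(polygon.getEnvelopeInternal(), polygon);
      }
      index.build();
    }
    return index;
  }

  @Override
  public void processElement(ProcessContext c) {
    TableRow trip = c.element();
    double lon = Double.parseDouble(trip.get("pickup_longitude").toString());  // hypothetical field names
    double lat = Double.parseDouble(trip.get("pickup_latitude").toString());
    Point pickup = GEOMETRY_FACTORY.createPoint(new Coordinate(lon, lat));

    // Query the index for candidate polygons, then run an exact containment test.
    @SuppressWarnings("unchecked")
    List<Geometry> candidates = getIndex().query(pickup.getEnvelopeInternal());
    for (Geometry polygon : candidates) {
      if (polygon.contains(pickup)) {
        TableRow tagged = trip.clone();
        tagged.set("pickup_area", polygon.getUserData());  // area name attached when the polygon was loaded
        c.output(tagged);
        return;
      }
    }
  }

  private static List<Geometry> loadPolygons() {
    // Placeholder: parse the neighbourhood polygons (e.g. GeoJSON) into JTS geometries,
    // calling setUserData(name) on each so the area name travels with the polygon.
    return Collections.emptyList();
  }
}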

Distributed across five workers, Cloud Dataflow was able to process an average of 25,000 records per second, each record having two locations, ploughing through more than 170 million table rows in just under two hours. The number of workers can be flexibly assigned at deployment time: the more workers you use, the more records can be processed in parallel and the faster your pipeline executes.
The interactive Cloud Dataflow graph of your pipeline helps you monitor and debug the pipeline in the Google Developers Console, right in the browser.
With the data preprocessed and written back into BigQuery, we were then able to run super-fast queries over the whole table, answering questions like "where do the best-paid trips start from?".

Unsurprisingly, they start from JFK airport, with an average fare of $46 and an average tip of 20.7%*. Okay, this is probably not a secret, but did you know that, even though the average fare from LGA airport is $15 less, there are roughly 800,000 more trips starting from LGA? And with 22.2%*, passengers from LGA airport actually tip best. *As cash tips aren’t reported, only 52% of trips have a tip recorded, so the tip figures could be inaccurate.

Most of the taxi trips start in Midtown-South (28 million) with an average fare of $11. Carnegie Hill in the Upper East Side comes fourth with 12 million pick-ups; however, these trips are fairly short. Journeys that start there mostly stay in the Upper East Side and therefore only generate an average fare of $9.80. Here's an interactive map visualizing where people went, what they paid on average and how they tipped, along with some other visualizations of how people tip depending on where their trip starts:

The processed data is publicly available in this BigQuery table. You can find some interesting queries to run against this data in this gist.

Though NYC taxi cab journeys may not seem to amount to much, they actually conceal a ton of information, which Google Cloud Dataflow, as a powerful big data tool, helped reveal by making big data processing easy and affordable. Maybe I'll try London's black cabs next.

- Posted by Thorsten Schaeff, Sales Engineer Intern

Posted:
The value of data lies in analysis -- and the intelligence one generates from it. Turning data into intelligence can be very challenging as data sets become large and distributed across disparate storage systems. Add to that the increasing demand for real-time analytics, and the barriers to extracting value from data sets become a huge challenge for developers.

In June 2014, we announced a significant step toward a managed service model for data processing. Aimed at relieving operational burden and enabling developers to focus on development, Google Cloud Dataflow was unveiled. We created Cloud Dataflow, which is currently an alpha release, as a platform to democratize large-scale data processing by enabling easier and more scalable access to data for data scientists, data analysts and data-centric developers. Regardless of role or goal, users can discover meaningful results from their data via simple and intuitive programming concepts, without the extra noise from managing distributed systems.

Today, we are announcing the availability of the Cloud Dataflow SDK as open source. This will make it easier for developers to integrate with our managed service while also forming the basis for porting Cloud Dataflow to other languages and execution environments.

We’ve learned a lot about how to turn data into intelligence as the original FlumeJava programming models (basis for Cloud Dataflow) have continued to evolve internally at Google. Why share this via open source? It’s so that the developer community can:
  • Spur future innovation in combining stream and batch based processing models: Reusable programming patterns are a key enabler of developer efficiency. The Cloud Dataflow SDK introduces a unified model for batch and stream data processing. Our approach to temporal based aggregations provides a rich set of windowing primitives allowing the same computations to be used with batch or stream based data sources. We will continue to innovate on new programming primitives and welcome the community to participate in this process.
  • Adapt the Dataflow programming model to other languages: As the proliferation of data grows, so do programming languages and patterns. We are currently building a Python 3 version of the SDK, to give developers even more choice and to make Dataflow accessible to more applications.
  • Execute Dataflow on other service environments: Modern development, especially in the cloud, is about heterogeneous services and composition. Although we are building a massively scalable, highly reliable, strongly consistent managed service for Dataflow execution, we also embrace portability. As Storm, Spark, and the greater Hadoop family continue to mature, developers are challenged with bifurcated programming models. We hope to relieve developer fatigue and enable choice in deployment platforms by supporting execution and service portability.

We look forward to collaboratively building a system that enables distributed data processing for users from all backgrounds. We encourage developers to check out the Dataflow SDK for Java on GitHub and contribute to the community.

Interested in adding to the Cloud Dataflow conversation? Here’s how:


- Posted by Sam McVeety, Software Engineer

Posted:
In today's world, information is being generated at an incredible rate. However, unlocking insights from large datasets can be cumbersome and costly, even for experts.

It doesn’t have to be that way. Yesterday, at Google I/O, you got a sneak peek of Google Cloud Dataflow, the latest step in our effort to make data and analytics accessible to everyone. You can use Cloud Dataflow:

  • for data integration and preparation (e.g. in preparation for interactive SQL in BigQuery)
  • to examine a real-time stream of events for significant patterns and activities
  • to implement advanced, multi-step processing pipelines to extract deep insight from datasets of any size

In these cases and many others, you use Cloud Dataflow’s data-centric model to easily express your data processing pipeline, monitor its execution, and get actionable insights from your data, free from the burden of deploying clusters, tuning configuration parameters, and optimizing resource usage. Just focus on your application, and leave the management, tuning, sweat and tears to Cloud Dataflow.

Cloud Dataflow is based on a highly efficient and popular model used internally at Google, which evolved from MapReduce and successor technologies like Flume and MillWheel. The underlying service is language-agnostic. Our first SDK is for Java, and allows you to write your entire pipeline in a single program using intuitive Cloud Dataflow constructs to express application semantics.

Cloud Dataflow represents all datasets, irrespective of size, uniformly via PCollections (“parallel collections”). A PCollection might be an in-memory collection, read from files on Cloud Storage, queried from a BigQuery table, read as a stream from a Pub/Sub topic, or calculated on demand by your custom code.
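
As a quick illustration of the point above, the following sketch (using the Dataflow SDK for Java) constructs PCollections from several kinds of sources. The bucket, table and topic names are hypothetical, and the Pub/Sub source would additionally require running the pipeline in streaming mode.

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class PCollectionSources {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // An in-memory collection.
    PCollection<String> fromMemory = p.apply(Create.of("alpha", "beta", "gamma"));
    // Lines of files on Cloud Storage.
    PCollection<String> fromFiles = p.apply(TextIO.Read.from("gs://my-bucket/logs/*"));
    // Rows of a BigQuery table.
    PCollection<TableRow> fromBigQuery =
        p.apply(BigQueryIO.Read.from("my-project:my_dataset.my_table"));
    // Messages arriving on a Pub/Sub topic (requires a streaming pipeline).
    PCollection<String> fromPubsub =
        p.apply(PubsubIO.Read.topic("projects/my-project/topics/my-topic"));
    // ...transforms over these collections, and a p.run(), would follow.
  }
}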

Because PCollections can be arbitrarily large, Cloud Dataflow includes a rich library of PTransforms (“parallel transforms”), which you can customize with your own application logic. For example, ParDo (“parallel do”) runs your code over each element in a PCollection independently (like both the Map and Reduce functions in MapReduce or WHERE in SQL), and GroupByKey takes a PCollection of key-value pairs and groups together all pairs with the same key (like the Shuffle step of MapReduce or GROUP BY and JOIN in SQL). In addition, anyone can define new custom transformations by composing other transformations -- this extensibility lets you write reusable building blocks which can be shared across programs. Cloud Dataflow provides a starter set of these composed transforms out of the box, including Count, Top, and Mean.
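
To make these transforms concrete, here is a minimal, illustrative word-count sketch: a ParDo splits lines into words, and the composed Count transform (built from ParDo, GroupByKey and a combiner) tallies them. The input and output paths are hypothetical.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCountSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/*"))   // hypothetical input
     // ParDo: run user code over every element independently.
     .apply(ParDo.of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
          for (String word : c.element().split("[^a-zA-Z']+")) {
            if (!word.isEmpty()) {
              c.output(word);
            }
          }
        }
      }))
     // Count: a composed transform that groups by word and sums the occurrences.
     .apply(Count.<String>perElement())
     // Another ParDo to format the key-value pairs for output.
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
        @Override
        public void processElement(ProcessContext c) {
          c.output(c.element().getKey() + ": " + c.element().getValue());
        }
      }))
     .apply(TextIO.Write.to("gs://my-bucket/output/wordcounts"));  // hypothetical output

    p.run();
  }
}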

Writing in this modular, high-level style naturally leads to pipelines that make multiple logical passes over the same data. Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass. However, this doesn't turn the system into a black box: as you can see below, Cloud Dataflow’s monitoring UI uses the building block concept to show you the pipeline as you wrote it, not as the system chooses to execute it.

Code snippet and monitoring UI from the Cloud Dataflow demo in the I/O keynote.

The same Cloud Dataflow pipeline may run in different ways, depending on the data sources. As you start designing or debugging, you can run against data local to your development environment. When you’re ready to scale up to real data, that same pipeline can run in parallel batch mode against data in Cloud Storage or in distributed real-time processing mode against data coming in via a Pub/Sub topic. This flexibility makes it trivial to transition between different stages in the application development lifecycle: to develop and test applications, to adapt an existing batch pipeline to track time-sensitive trends, or to fix a bug in a real-time pipeline and backfill the historical results.
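
A small sketch of what that flexibility can look like in code: the source is chosen by a flag, and everything downstream stays the same. The option names, paths and topic below are hypothetical, and this is illustrative rather than the canonical way to structure such a pipeline.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.joda.time.Duration;

public class SameLogicBatchOrStreaming {
  // Hypothetical custom option for the batch input location.
  public interface Options extends StreamingOptions {
    @Default.String("gs://my-bucket/events/*")
    String getInputPath();
    void setInputPath(String value);
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);

    // Only the source changes between modes; pass --streaming to read from Pub/Sub.
    PCollection<String> events = options.isStreaming()
        ? p.apply(PubsubIO.Read.topic("projects/my-project/topics/events"))
        : p.apply(TextIO.Read.from(options.getInputPath()));

    // The same downstream logic runs against historical files or the live stream.
    events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.<String>perElement());

    p.run();
  }
}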

When you use Cloud Dataflow, you can focus solely on your application logic and let us handle everything else. You should not have to choose between scalability, ease of management and a simple coding model. With Cloud Dataflow, you can have it all.

If you’d like to be notified of future updates about Cloud Dataflow, please join our Google Group.

-Posted by Frances Perry, Software Engineer