Google Cloud Platform

Managed & Unified

Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.

Fully Managed

The managed service transparently handles resource lifetimes and can dynamically provision resources to minimize latency while maintaining high utilization. Dataflow resources are allocated on demand, providing you with nearly limitless capacity to solve your big data processing challenges.

Unified Programming Model

Dataflow provides programming primitives, such as powerful windowing and correctness controls, that can be applied across both batch and stream-based data sources. Dataflow effectively eliminates the cost of switching between batch and continuous stream processing by enabling developers to express their computational requirements regardless of data source.
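As a hedged illustration of the windowing idea described above (plain Java, not the Dataflow SDK; all names here are hypothetical), fixed windows simply bucket timestamped elements by the start of their window, and the same assignment logic applies whether elements arrive as a bounded batch or an unbounded stream:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowSketch {
    static final long WINDOW_MS = 60_000; // one-minute fixed windows

    // Assign a timestamp (in milliseconds) to the start of its fixed window.
    static long windowStart(long timestampMillis) {
        return timestampMillis - (timestampMillis % WINDOW_MS);
    }

    // Count elements per window. The assignment step is identical whether
    // the timestamps come from a finite batch or arrive incrementally.
    static Map<Long, Integer> countPerWindow(List<Long> timestamps) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : timestamps) {
            counts.merge(windowStart(t), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> batch = List.of(1_000L, 59_000L, 61_000L);
        System.out.println(countPerWindow(batch)); // {0=2, 60000=1}
    }
}
```

In the actual Dataflow model the windowing strategy is declared once on the pipeline, and the service applies it uniformly to batch and streaming inputs.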

Integrated & Open Source

Built upon services like Google Compute Engine, Dataflow is an operationally familiar compute environment that seamlessly integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery. The open source Java-based Cloud Dataflow SDK enables developers to implement custom extensions and to extend Dataflow to alternate service environments.

Partnerships & Integrations

Google Cloud Platform partners and third-party developers have built integrations with Dataflow to quickly and easily enable data processing tasks of any size. Integrations use the open APIs provided by Dataflow.

ClearStory

Cloudera

DataArtisans

Salesforce


SpringML

tamr

Dataflow Features

Reliable execution for large-scale data processing

Resource Management
Cloud Dataflow fully automates management of required processing resources. No more spinning up instances by hand.
On Demand
All resources are provided on demand, enabling you to scale to meet your business needs. No need to buy reserved compute instances.
Intelligent Work Scheduling
Automated, optimized work partitioning that can dynamically rebalance lagging work. No more chasing down “hot keys” or pre-processing your input data.
Auto Scaling
Horizontal auto-scaling of worker resources to meet throughput requirements results in better overall price-to-performance.
Unified Programming Model
The Dataflow API enables you to express MapReduce-like operations, powerful data windowing, and fine-grained correctness control regardless of data source.
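The "MapReduce-like operations" mentioned above are a per-element transform followed by a grouped aggregation. As a hedged, SDK-free sketch of that shape in plain Java (the class and method names are illustrative assumptions, not Dataflow API), a word count maps lines to words and then reduces counts per key:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Map each line to lowercase words, then reduce by summing a count per
    // word -- the transform-then-aggregate pattern that the Dataflow model
    // expresses uniformly over batch and streaming inputs.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not to be");
        System.out.println(wordCount(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```

In Dataflow itself, the map and reduce stages would be pipeline transforms executed by the managed service rather than local loops.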
Open Source
Developers wishing to extend the Dataflow programming model can fork and/or submit pull requests against the Java-based Cloud Dataflow SDK. Dataflow pipelines can also run on alternative runtimes like Spark and Flink.
Monitoring
Integrated into the Google Cloud Platform Console, Cloud Dataflow provides statistics such as pipeline throughput and lag, as well as consolidated worker log inspection—all in near-real time.
Integrated
Integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing, and can be extended to interact with other sources and sinks such as Apache Kafka and HDFS.
Reliable & Consistent Processing
Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern, or pipeline complexity.

Java is a registered trademark of Oracle and/or its affiliates.