Cloud Dataflow
A fully managed cloud service and programming model for batch and streaming big data processing.
Managed & Unified
Dataflow is a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation. Cloud Dataflow frees you from operational tasks like resource management and performance optimization.
Fully Managed
The managed service transparently handles resource lifetimes and can dynamically provision resources to minimize latency while maintaining high utilization efficiency. Dataflow resources are allocated on demand, providing you with nearly limitless capacity to solve your big data processing challenges.
Unified Programming Model
Dataflow provides programming primitives such as powerful windowing and correctness controls that can be applied across both batch and stream-based data sources. Dataflow effectively eliminates the cost of switching programming models between batch and continuous stream processing by enabling developers to express their computational requirements regardless of data source.
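As a minimal sketch of this unified model, the pipeline below applies a windowed count to an unbounded Pub/Sub source using the Java-based Cloud Dataflow SDK; the project and topic names are placeholders, and the SDK must be on the classpath. The key point is that the same Window and Count transforms apply unchanged to a bounded source such as TextIO.Read.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WindowedCounts {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(PubsubIO.Read.topic("projects/my-project/topics/events")) // placeholder topic
     // One-minute fixed windows; the identical transform works on a bounded
     // (batch) source without any code change.
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
     // Per-window element counts; a sink (e.g. a BigQuery write) would
     // complete the pipeline.
     .apply(Count.<String>perElement());

    p.run();
  }
}
```

Because the source, windowing, and aggregation are independent transforms, swapping the unbounded read for a bounded one is the only change needed to move between streaming and batch.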
Integrated & Open Source
Built upon services like Google Compute Engine, Dataflow is an operationally familiar compute environment that seamlessly integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery. The open source Java-based Cloud Dataflow SDK enables developers to implement custom extensions and to extend Dataflow to alternate service environments.
Partnerships & Integrations
Google Cloud Platform partners and third-party developers have built integrations with Dataflow to quickly and easily enable powerful data processing tasks at any scale. Integrations are built on the open APIs provided by Dataflow.
ClearStory
Cloudera
DataArtisans
Salesforce
SpringML
tamr
Dataflow Features
Reliable execution for large-scale data processing
- Resource Management
- Cloud Dataflow fully automates management of required processing resources. No more spinning up instances by hand.
- On Demand
- All resources are provided on demand, enabling you to scale to meet your business needs. No need to buy reserved compute instances.
- Intelligent Work Scheduling
- Automated and optimized work partitioning which can dynamically rebalance lagging work. No more chasing down “hot keys” or pre-processing your input data.
- Auto Scaling
- Horizontal auto-scaling of worker resources to meet optimal throughput requirements yields better overall price-to-performance.
- Unified Programming Model
- The Dataflow API enables you to express MapReduce-like operations, powerful data windowing, and fine-grained correctness control regardless of data source.
- Open Source
- Developers wishing to extend the Dataflow programming model can fork the Java-based Cloud Dataflow SDK or submit pull requests against it. Dataflow pipelines can also run on alternate runtimes like Spark and Flink.
- Monitoring
- Integrated into the Google Cloud Platform Console, Cloud Dataflow provides statistics such as pipeline throughput and lag, as well as consolidated worker log inspection—all in near-real time.
- Integrated
- Integrates with Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery for seamless data processing. It can also be extended to interact with other sources and sinks like Apache Kafka and HDFS.
- Reliable & Consistent Processing
- Cloud Dataflow provides built-in support for fault-tolerant execution that is consistent and correct regardless of data size, cluster size, processing pattern or pipeline complexity.
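The fixed-window primitive mentioned in the features above can be illustrated with SDK-independent arithmetic: each element's event timestamp maps to the start of the window that contains it. The one-minute window size and the timestamps below are illustrative assumptions, not values from the service.

```java
public class FixedWindowSketch {
    // Maps an event timestamp (ms since epoch) to the start of its fixed window.
    public static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // Events at 0:59.999 and 1:05.000 land in different one-minute windows.
        System.out.println(windowStart(59_999L, oneMinute)); // prints 0
        System.out.println(windowStart(65_000L, oneMinute)); // prints 60000
    }
}
```

Grouping by this window-start value is what lets the same aggregation produce one result per minute over a stream, or one result per historical minute over a batch file.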
