An elastic batch- and stream-processing stack with Pravega and Apache Flink

Thursday, April 19
11:50 AM - 12:30 PM
Room II

Stream processing is a popular paradigm that is becoming more relevant as many applications provide low-latency response times and new application domains emerge that naturally demand data to be processed in motion. One particularly attractive characteristic of the stream-processing paradigm is that it conceptually unifies batch processing (bounded, static historic data) and continuous near-real-time data processing (unbounded streaming event data).

Implementing a unified batch and streaming data architecture is not seamless in practice: near-real-time event data and bulk historic data use different storage systems (message queues or logs vs. file systems or object stores). Consequently, running the same analysis now and at some arbitrary time in the future (e.g., months, possibly years ahead) means dealing with different data sources and APIs. Few systems are capable of handling both near-real-time streaming workloads and large batch workloads at the same time. In addition, streaming workloads tend to be inherently dynamic, requiring both storage and compute to adjust continuously for maximum resource efficiency.

In this talk, we present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” while dynamically accommodating workload variations in a novel way. Pravega enables the ingestion capacity of a stream to grow and shrink according to workload and sends signals downstream to enable Flink to scale accordingly.
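The scaling idea described above can be illustrated with a toy model. The sketch below is not Pravega's or Flink's actual algorithm or API; the function names, the per-segment target rate, and the parallelism cap are all hypothetical values chosen for illustration. It only shows the general principle: the number of stream segments follows the observed ingest rate, and the downstream reader parallelism follows the segment count.

```python
import math


def target_segments(events_per_sec: float,
                    target_rate_per_segment: float,
                    min_segments: int = 1) -> int:
    """Hypothetical policy: enough segments that each one stays at or
    below the target event rate, never fewer than min_segments."""
    if target_rate_per_segment <= 0:
        raise ValueError("target_rate_per_segment must be positive")
    needed = math.ceil(events_per_sec / target_rate_per_segment)
    return max(min_segments, needed)


def reader_parallelism(segments: int, max_parallelism: int) -> int:
    """Hypothetical downstream reaction: one reader per segment,
    capped by the resources available to the processing job."""
    return min(segments, max_parallelism)


if __name__ == "__main__":
    # Simulated workload: the ingest rate rises and then falls again.
    for rate in (500, 4200, 900):
        segs = target_segments(rate, target_rate_per_segment=1000)
        readers = reader_parallelism(segs, max_parallelism=8)
        print(f"rate={rate}/s segments={segs} readers={readers}")
```

In this model the storage layer decides how many segments a stream needs, and the compute layer reacts to that decision rather than measuring load on its own, which mirrors the "signals downstream" relationship described in the abstract.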

Pravega offers permanent stream storage, exposing an API that enables applications to access data either in near-real time or at any arbitrary time in the future in a uniform fashion. Apache Flink’s SQL and streaming APIs provide a common interface for processing continuous near-real-time data, sets of historic data, or combinations of both. A deep integration between these two systems gives end-to-end exactly-once semantics for pipelines of streams and stream processing and lets both systems jointly scale and adjust automatically to changing data rates.

SPEAKERS

Stephan Ewen
Co-founder, CTO
data Artisans
Stephan Ewen is a PMC member and one of the original creators of Apache Flink, and co-founder and CTO of data Artisans (data-artisans.com). He holds a Ph.D. from the Berlin University of Technology.
Flavio Junqueira
Engineering Lead
Pravega by DellEMC
Flavio Junqueira leads the Pravega team at DellEMC. He holds a PhD in computer science from the University of California, San Diego and is interested in various aspects of distributed systems, including distributed algorithms, concurrency, and scalability. Previously, Flavio held a software engineer position with Confluent and research positions with Yahoo! Research and Microsoft Research. Flavio has contributed to a few important open-source projects. Most of his current contributions are to the Pravega open-source project, and previously he started and contributed to Apache projects such as Apache ZooKeeper and Apache BookKeeper. Flavio co-authored the O’Reilly book "ZooKeeper: Distributed Process Coordination."