ORC improvement in Apache Spark 2.3

Wednesday, April 18
4:00 PM - 4:40 PM
Convention Hall I - C

Apache Spark 2.3, released in February 2018, is the fourth release in the 2.x line and brings many improvements. One of the most notable is ORC support: Spark 2.3 adds a native ORC file format implementation built on Apache ORC 1.4.1, the latest release at the time. Users can switch between the “native” and “hive” ORC file formats; the Hive format is the implementation that was used through Spark 2.2.
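As a rough illustration of switching implementations, the sketch below sets the `spark.sql.orc.impl` option when building a session. It assumes PySpark 2.3 or later is installed; the helper name `build_session` and the application name are hypothetical and only serve the example.

```python
# Settings for choosing the ORC implementation in Spark 2.3.
# "native" selects the new reader/writer; "hive" selects the pre-2.3 one.
NATIVE_ORC_CONF = {
    "spark.sql.orc.impl": "native",
    # Vectorized reading applies to the native implementation only.
    "spark.sql.orc.enableVectorizedReader": "true",
}

HIVE_ORC_CONF = {
    "spark.sql.orc.impl": "hive",
}


def build_session(conf, app_name="orc-impl-demo"):
    """Create a SparkSession with the given settings (hypothetical helper)."""
    # Imported lazily so the dictionaries above can be inspected without Spark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session(NATIVE_ORC_CONF)` would then make `spark.read.orc(...)` and `df.write.orc(...)` use the native implementation. The same keys can also be passed with `--conf` on `spark-submit`.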

In this talk, I'll cover three key changes. The first is performance: the new native ORC implementation is 2x to 11x faster on the 10TB TPC-DS benchmark. Vectorized query execution over ORC files greatly improves Spark's ORC query performance. In particular, ORC filter pushdown can be faster than Parquet's thanks to in-file indexes. Second, as part of native ORC support, Spark 2.3 can automatically convert Hive ORC tables into Spark ORC data sources. This resolves several existing ORC issues, and Spark 2.4 will enable it by default. Last but not least, Spark 2.3 officially supports Structured Streaming over ORC data sources, so you can create a streaming Dataset over ORC files.
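The following is a minimal sketch of how filter pushdown, Hive table conversion, and ORC streaming might be used from PySpark 2.3 or later. The schema, the input directory, and the function name `start_orc_stream` are assumptions made for illustration and do not come from the talk.

```python
# Settings related to the changes described above (both are off by default in Spark 2.3).
ORC_FEATURE_CONF = {
    "spark.sql.orc.filterPushdown": "true",        # push predicates down to the ORC reader
    "spark.sql.hive.convertMetastoreOrc": "true",  # read Hive ORC tables as Spark ORC data sources
}

# Hypothetical schema, written as a DDL string; file-based streaming sources need one up front.
EVENT_SCHEMA = "id LONG, name STRING"


def start_orc_stream(spark, input_dir, schema=EVENT_SCHEMA):
    """Start a streaming query that reads new ORC files from input_dir (illustrative helper)."""
    for key, value in ORC_FEATURE_CONF.items():
        spark.conf.set(key, value)

    # Each new ORC file that appears in input_dir is treated as new streaming input.
    events = spark.readStream.schema(schema).orc(input_dir)

    # Print incoming rows to the console; a real job would use a durable sink.
    return events.writeStream.format("console").outputMode("append").start()
```

With an existing session, `start_orc_stream(spark, "/path/to/orc/dir")` returns a streaming query handle, and calling `awaitTermination()` on it blocks until the stream stops.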

SPEAKERS

Dongjoon Hyun
Principal Software Engineer
Hortonworks
Dongjoon Hyun is an Apache REEF PMC member and committer. Currently, he works for Hortonworks and is focusing on Apache Spark and Apache ORC.