Fast Access to your Complex Data: Avro, JSON, ORC, and Parquet

Fast Access to your Complex Data: Avro, JSON, ORC, and Parquet

Thursday, May 23
4:50 PM - 5:30 PM
Marquis Salon 13

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so.

The use cases that we’ve examined are:
* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data
* schema evolution

While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.

Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from the Apache ORC project.

SPEAKERS

Owen O'Malley
Co-founder & Technical Fellow
Cloudera
Owen O'Malley is a co-founder and technical fellow at Hortonworks, a rapidly growing company (25 to 1,000 employees in 5 years), which develops the completely open source Hortonworks Data Platform (HDP). HDP includes Hadoop and the large ecosystem of big data tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006 at Yahoo, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. In the last 8 years, he has been the architect of MapReduce, Security, and now Hive. Recently he has been driving the development of the ORC file format and adding ACID transactions to Hive. Before working on Hadoop, he worked on Yahoo Search's WebMap project, which was the original motivation for Yahoo to work on Hadoop.  Prior to Yahoo, he wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He received his PhD in Software Engineering from University of California, Irvine.