Interactive Analytics with Apache Hive and Druid
Traditional, rigid OLAP solutions can no longer handle Big Data Business Intelligence (BI) due to exponential growth in data volume, variety, and hierarchical complexity.
To overcome these limitations and bring the benefits of the Hadoop platform to the business, we propose a hybrid solution that combines Druid, a fast columnar data store, with Apache Hive, a traditional big data SQL engine.
Druid excels at sub-second analytics because it combines the best qualities of a column store with inverted indexing and bitmap indexes, which minimizes I/O costs and enables fast filter pruning for analytical queries.
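To make the filter-pruning idea concrete, here is a minimal sketch, in plain Python, of how a bitmap-backed inverted index resolves a conjunctive filter by ANDing per-value bitmaps instead of scanning row data. This is illustrative only; the rows, dimension names, and helper function are invented for the example and do not reflect Druid's actual implementation.

```python
# Illustrative sketch of bitmap-index filter pruning (not Druid's real code):
# one bitmap per (dimension, value), ANDed together to resolve a conjunction.
rows = [
    {"country": "US", "browser": "chrome"},
    {"country": "DE", "browser": "firefox"},
    {"country": "US", "browser": "firefox"},
    {"country": "FR", "browser": "chrome"},
]

# Build an inverted index: (dimension, value) -> bitmap of matching row ids.
index = {}
for i, row in enumerate(rows):
    for dim, value in row.items():
        index[(dim, value)] = index.get((dim, value), 0) | (1 << i)

def matching_rows(*predicates):
    """AND the bitmaps for each (dimension, value) predicate."""
    bitmap = ~0  # start with all rows selected
    for pred in predicates:
        bitmap &= index.get(pred, 0)
    return [i for i in range(len(rows)) if bitmap >> i & 1]

print(matching_rows(("country", "US"), ("browser", "firefox")))  # [2]
```

The filter never touches the row payloads: each predicate is answered by a single bitmap lookup, and the intersection is one bitwise AND, which is why such indexes prune work so cheaply.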
Druid, however, has serious limitations, the most important being the lack of joins, incomplete SQL support, and the absence of the transactional consistency model of traditional databases.
The combined architecture is a major breakthrough as opposed to traditional systems like Impala or Spark SQL, which rely on columnar storage to provide high-throughput aggregation but do not deal well with finding the "needles in the haystack."
Integrating Druid with Hive allowed us to overcome these big data BI challenges and achieve impressive speedups, ranging from 10x to 100x, by offloading much of the high-volume drill-down, slice-and-dice workload to Druid.
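As a rough sketch of what this offloading looks like on the Hive side, the example below materializes an aggregate as a Druid-backed table via Hive's Druid storage handler; all table, column, and property values here are illustrative, and the exact timestamp type required for the `__time` column varies across Hive versions.

```sql
-- Illustrative sketch: store a Hive aggregate as a Druid datasource.
-- Names and granularity settings are examples, not a prescribed setup.
CREATE TABLE druid_sales
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
  "druid.segment.granularity" = "MONTH",
  "druid.query.granularity"   = "DAY"
)
AS
SELECT
  CAST(sale_ts AS timestamp) AS `__time`,  -- Druid requires a time column
  store_id,
  product_id,
  SUM(amount) AS total_amount
FROM sales
GROUP BY sale_ts, store_id, product_id;

-- Drill-down queries against druid_sales can then be pushed down to Druid.
SELECT store_id, SUM(total_amount)
FROM druid_sales
WHERE `__time` >= '2017-01-01 00:00:00'
GROUP BY store_id;
```

The point of the design is that Hive keeps its role as the general SQL layer (joins, full SQL), while filters and aggregations over the Druid-backed table are answered by Druid's indexes at interactive speed.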
In this talk we will present and analyze the performance of the proposed architecture side by side with existing solutions, through the lens of concrete use cases and the TPC-H star schema benchmark.
This session is an intermediate-level talk in our Data Processing and Warehousing track. It focuses on Apache Hadoop, Apache Hive, and Druid, and is geared towards architect and developer/engineer audiences.