What's the Hadoop-la about Kubernetes?

Tuesday, June 19
4:50 PM - 5:30 PM
Meeting Room 211A/B/C/D

There is increasing interest in using Kubernetes, the open-source container orchestration system, for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle both cloud-native stateless applications and stateful Big Data applications. However, orchestrating stateful, multi-service Big Data clusters brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.

Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are typically microservices or containerized applications that don't "store" data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates for stateless deployment, since HTTP is itself stateless. Stateless workloads have no dependency on local container storage.
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop and Spark are great examples, as are, to a lesser extent, NoSQL platforms such as Cassandra and MongoDB and relational databases such as PostgreSQL and MySQL. All of these require some form of persistent storage that will survive service restarts.
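In Kubernetes terms, the persistent-storage requirement is usually met with a StatefulSet and per-replica volume claims. The following is a minimal, illustrative sketch (the image name, mount path, and storage size are placeholders, not details from this session):

```yaml
# Minimal StatefulSet sketch: each replica gets its own
# PersistentVolumeClaim (from volumeClaimTemplates) that
# survives pod restarts and rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: datanode
spec:
  serviceName: datanode
  replicas: 3
  selector:
    matchLabels:
      app: datanode
  template:
    metadata:
      labels:
        app: datanode
    spec:
      containers:
        - name: datanode
          image: example/hadoop-datanode:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /hadoop/dfs/data         # placeholder path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi                        # placeholder size
```

Unlike a Deployment, a StatefulSet gives each pod a stable identity (datanode-0, datanode-1, ...) and reattaches the same volume to the same identity after a restart.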

There are several attributes of stateful, multi-service Big Data applications that need to be considered. Hadoop and Spark are not quite monolithic applications, but they come close: each consists of multiple cooperating services with dynamic APIs. Start-up and tear-down ordering requirements, with different sets of services running on different hosts (nodes), create tricky service interdependencies that impact scalability. There is also a great deal of configuration (i.e., state), such as host names, IP addresses, ports, and service-specific settings, that must be maintained to run fault-tolerant clusters.
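The host-name and port dependencies mentioned above are one reason stateful pods need stable network identities. In Kubernetes this is commonly handled with a headless Service; a brief sketch, with illustrative names not taken from the session:

```yaml
# Headless Service sketch: clusterIP None means no load-balanced
# virtual IP; instead each pod of the associated StatefulSet gets a
# stable DNS name (e.g. datanode-0.datanode), which other services
# can safely put in their configuration files.
apiVersion: v1
kind: Service
metadata:
  name: datanode
spec:
  clusterIP: None
  selector:
    app: datanode
  ports:
    - name: ipc
      port: 9867   # example port; Hadoop services each expect specific ports
```

Stable per-pod DNS names let configuration that references specific hosts survive pod rescheduling, which a normal load-balanced Service address would not provide.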

This session will highlight the key gaps and considerations based on a real-life implementation of Big Data cluster orchestration on Kubernetes. Focus areas include:
- Full cluster lifecycle management
- Big Data application support (i.e., no application modifications required)
- Management of storage and networking resources
- Integration and conformance with existing enterprise services (e.g. LDAP / AD, SSO, TLS)
- Multi-tenancy, multiple clusters with different versions, auditing, monitoring, etc.
- Data locality and performance

This session will detail technical configurations and customizations required to run Hadoop distributions on Kubernetes. It will also detail the gaps in comparison to the standard deployment of Hadoop on physical servers or virtual machines.


Anant Chintamaneni
VP, Products
Anant Chintamaneni brings more than 16 years of experience in SaaS, Analytics, and Big Data to his role of vice president of Products at BlueData. Anant is responsible for spearheading product development, as well as driving the go-to-market strategy for the BlueData platform. Prior to BlueData, Anant was head of Product Management and Strategy for Pivotal's Business Data Lake portfolio, which included Pivotal HD, HAWQ SQL-in-Hadoop, GemFire XD in-memory processing, and machine learning runtimes. While at Pivotal, he established Pivotal HD as the Hadoop platform for data-driven applications, streamlined field enablement, and grew the installed user base by over 400 percent in just over a year. Prior to Pivotal, Anant was head of Big Data Analytics Solutions at Merced Systems (acquired by NICE Systems), where he created a new Analytics business division that resulted in multi-million-dollar annual subscriptions. Anant holds a Master of Science in Engineering from Stanford University and a Bachelor's degree in Engineering from the Indian Institute of Technology, Varanasi.
Nanda Vijaydev
Dir, Solutions
Nanda Vijaydev is director of solution management at BlueData, where she leverages Hadoop, Spark, and Tachyon to build solutions for enterprise analytics use cases. Nanda has 10 years of experience in data management and data science. Previously, she worked on data science and big data projects in multiple industries, including healthcare and media; was a principal solutions architect at Silicon Valley Data Science; and served as director of solutions engineering at Karmasphere. Nanda has an in-depth understanding of the data analytics and data management space, particularly in the areas of data integration, ETL, warehousing, reporting, and Hadoop.