There is growing interest in using Kubernetes, the open-source container orchestration system, for modern, stateful Big Data analytics workloads. The promised land is a unified platform that can handle both stateless and stateful cloud-native Big Data applications. However, orchestrating stateful, multi-service Big Data clusters brings unique challenges. This session will delve into the technical gaps and considerations for Big Data on Kubernetes.
Containers offer significant value to businesses, including increased developer agility and the ability to move applications between on-premises servers, cloud instances, and data centers. Organizations have embarked on this journey to containerization with an emphasis on stateless workloads. Stateless applications are typically microservices or containerized applications that don't "store" data. Web services (such as front-end UIs and simple, content-centric experiences) are often great candidates, since HTTP is stateless by nature. Stateless workloads have no dependency on local container storage.
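To make the contrast concrete, here is a minimal sketch of a stateless web front end as a Kubernetes Deployment (the name and image are illustrative). Because no volume is mounted, any replica can serve any request, and Kubernetes can freely reschedule or scale the pods:

```yaml
# Hypothetical stateless front end: no volumes, so pods are interchangeable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.25   # example image; serves static content
          ports:
            - containerPort: 80
```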
Stateful applications, on the other hand, are services that require backing storage, and keeping state is critical to running the service. Hadoop and Spark are prime examples, as are, to a lesser extent, NoSQL platforms such as Cassandra and MongoDB and relational databases such as PostgreSQL and MySQL. They all require some form of persistent storage that will survive service restarts.
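In Kubernetes terms, that persistence is typically expressed as a PersistentVolumeClaim. The sketch below is illustrative (the claim name and storage class name are assumptions); the point is that the volume's lifecycle is decoupled from the pod's, so data survives container restarts and rescheduling:

```yaml
# Hypothetical claim for a data directory that outlives any single pod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: worker-data            # illustrative name
spec:
  accessModes:
    - ReadWriteOnce            # mounted read-write by a single node
  storageClassName: standard   # assumes a provisioner of this name exists
  resources:
    requests:
      storage: 100Gi
```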
Several attributes of stateful, multi-service Big Data applications need to be considered. Hadoop and Spark are not exactly monolithic applications, but they come close: each comprises multiple cooperating services with dynamic APIs. Start-up and tear-down ordering requirements, with different sets of services running on different hosts (nodes), result in tricky service interdependencies that limit scalability. There is also a great deal of configuration (i.e., state), such as host names, IP addresses, ports, and service-specific settings, that must be maintained to run fault-tolerant clusters.
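Kubernetes StatefulSets address some of these attributes directly: stable, ordinal pod names, ordered start-up and tear-down, and a persistent volume per pod. The sketch below shows the pattern for a set of HDFS DataNodes (the image name is hypothetical; port 9866 is the Hadoop 3 DataNode data-transfer port):

```yaml
# Headless Service: gives each pod a stable DNS name such as
# hdfs-datanode-0.hdfs-datanode.<namespace>.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: hdfs-datanode
spec:
  clusterIP: None
  selector:
    app: hdfs-datanode
  ports:
    - port: 9866
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: hdfs-datanode
  replicas: 3
  podManagementPolicy: OrderedReady   # pods start as -0, -1, -2, in order
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: example/hdfs-datanode:3.3   # hypothetical image
          ports:
            - containerPort: 9866
          volumeMounts:
            - name: data
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:                # one volume per pod, reattached on restart
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Even so, a StatefulSet covers only part of the problem: cross-service ordering (e.g., NameNode before DataNodes) and dynamic, service-specific configuration still have to be handled elsewhere.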
This session will highlight the key gaps and considerations based on a real-life implementation of Big Data cluster orchestration on Kubernetes. Focus areas include:
- Full cluster lifecycle management
- Big Data application support (i.e., running applications with no modification)
- Management of storage and networking resources (see the sketch after this list)
- Integration and conformance with existing enterprise services (e.g., LDAP/AD, SSO, TLS)
- Multi-tenancy, multiple clusters with different versions, auditing, monitoring, etc.
- Data locality and performance
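On the storage and networking bullet, one recurring piece is the StorageClass, where cluster administrators encode storage policy that Big Data pods then request by name. A sketch, assuming statically provisioned local volumes (the class name is illustrative):

```yaml
# Illustrative class for node-local disks, a common choice for HDFS.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: big-data-local
provisioner: kubernetes.io/no-provisioner  # static (pre-created) local volumes
volumeBindingMode: WaitForFirstConsumer    # bind only once the pod is scheduled
reclaimPolicy: Retain                      # keep the data if a claim is deleted
```

`WaitForFirstConsumer` is worth noting because it connects the storage bullet to the data-locality bullet: volume binding is deferred until the scheduler has picked a node, so the pod lands where its disk is.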
This session will detail the technical configurations and customizations required to run Hadoop distributions on Kubernetes, and the gaps relative to standard deployments of Hadoop on physical servers or virtual machines.
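As a taste of those configurations, one common customization (sketched here with illustrative names, not a vendor recipe) is injecting Hadoop site files through a ConfigMap, so a single container image can join different clusters:

```yaml
# Hypothetical ConfigMap carrying Hadoop site configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-conf
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- stable DNS name from a headless Service, as sketched above -->
        <value>hdfs://hdfs-namenode-0.hdfs-namenode:8020</value>
      </property>
    </configuration>
```

Pods would mount this ConfigMap as a volume (e.g., at /etc/hadoop/conf) or point HADOOP_CONF_DIR at the mount path, both standard Hadoop conventions.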