Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink

By: Miguel Pérez Colino - 01 Sep 2017

Cloud platforms and infrastructure are, by their nature, distributed, modular and scalable. A Cloud platform, whether it’s Infrastructure as a Service (IaaS), Platform as a Service (PaaS) or a combination of both, generates a huge amount of data that could be used to provide better situational awareness and to operate it efficiently. (After reviewing several scenarios, we estimated 100~150 Mb per node per day in a medium to low usage environment, and used that as our baseline.)

To tackle that challenge, and a variety of other scenarios aimed at providing better solutions to our users, the Strategic Design Team at Red Hat is applying Design Thinking principles and techniques.

The work started by defining the problem and analyzing the roles involved, creating personae and use cases. The scope was then narrowed to a selected persona and a use case. The use case chosen was improving the “Day 2 Operations” in an OpenStack environment for the “Oscar the OpenStack Operator” persona.

In collaboration with the User Experience and Design Team, after getting feedback from users and analyzing it, we developed mockups for the selected persona. We arrived at the idea that tools such as the ones provided by the Apache Software Foundation could be used to build a solution to that challenge and, even though we used a different approach for our own solution, we decided to prototype it. The goal of building a prototype was to quickly gain clarity about the needs and requirements (a prototype is worth a thousand meetings), validate the technical approach, and learn more about the possibilities to improve the solution. We found an excellent partner, KEEDIO, who was interested in collaborating with us on building the prototype because they were also motivated to solve this problem for our joint customers.

The proposal was intended to be forward thinking, but production oriented, so we did not use the current generation of KEEDIO’s Big Data stack, but a new set of tools including NiFi, Kafka, Flink, Cassandra and PatternFly, deployed on Red Hat OpenStack Platform 10 (OSP) and Red Hat OpenShift Container Platform 3.5 (OCP). To validate the possibility of enabling the solution on both platforms, some of the components ran as instances on OpenStack and others as containers on OpenShift. Everything was deployed in a high-availability (HA) configuration, simulating a production architecture.

In the prototype, we managed to address many of the challenges that arise when providing good situational awareness for cloud operators such as “Oscar the OpenStack Operator”. We were able to identify data in logs and manage it in a streamlined way with NiFi, making the most of its GUI and data provenance features. This helps the user find information in the logs and makes creating new rules much easier and the rules themselves more maintainable. We could process the constant stream of syslog messages (RFC 5424) produced by the different distributed components of the Infrastructure as a Service, detect a common failure pattern as it arises, and generate alerts as needed, allowing the user to perform complex issue detection in near real time. We defined an extensible architecture that manages different workflows for the different data retention use cases, enabling the use of analytics for Day 2 operations in cloud environments.
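
To make the syslog processing step more concrete, here is a minimal Python sketch of the general idea: parse an RFC 5424 line, then flag a host that emits several error-level messages within a short time window. This is not the prototype's code (the prototype used NiFi and Flink); the `ErrorBurstDetector` class, its threshold and window values, and the sample host and service names are all assumptions made for illustration. The structured-data handling is also simplified and does not cover escaped brackets inside parameter values.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# RFC 5424 header: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD [MSG]
# Simplification: structured data is matched loosely (no escaped "]" support).
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<sd>-|(?:\[.*?\])+)(?: (?P<msg>.*))?$"
)


def parse_rfc5424(line):
    """Return a dict for a syslog line, or None if it does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    pri = int(m.group("pri"))
    ts = m.group("timestamp")
    return {
        # PRI encodes facility * 8 + severity.
        "facility": pri // 8,
        "severity": pri % 8,
        # "-" is the RFC 5424 nil value; a trailing "Z" means UTC.
        "timestamp": None if ts == "-"
        else datetime.fromisoformat(ts.replace("Z", "+00:00")),
        "hostname": m.group("hostname"),
        "app": m.group("app"),
        "msg": m.group("msg") or "",
    }


class ErrorBurstDetector:
    """Hypothetical rule: alert when one host logs `threshold` or more
    messages at severity <= `max_severity` within `window`."""

    def __init__(self, threshold=3, window=timedelta(seconds=60), max_severity=3):
        self.threshold = threshold
        self.window = window
        self.max_severity = max_severity
        self._recent = {}  # hostname -> deque of recent error timestamps

    def observe(self, event):
        """Feed one parsed event; return True if the rule fires."""
        if event["timestamp"] is None or event["severity"] > self.max_severity:
            return False
        recent = self._recent.setdefault(event["hostname"], deque())
        recent.append(event["timestamp"])
        # Drop timestamps that have fallen out of the sliding window.
        while event["timestamp"] - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold


if __name__ == "__main__":
    detector = ErrorBurstDetector()
    # Made-up sample lines; PRI 11 = facility 1 (user), severity 3 (error).
    for second in ("00", "10", "20"):
        line = (f"<11>1 2017-09-01T10:00:{second}Z compute-0 nova-compute "
                "1234 - - Connection to AMQP server lost")
        if detector.observe(parse_rfc5424(line)):
            print("ALERT: repeated errors on compute-0")
```

In a streaming engine the same rule would typically be expressed as a keyed, windowed operation over the message stream rather than an in-memory dictionary; the sketch only shows the logic of the rule.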

If you want to know more, don’t miss our session at DataWorks in Sydney, scheduled for September 21st.