All Things Spark – Machine Learning, Atlas integration, ORC & Hive EDW updates

Apache Spark has become one of the most popular in-memory compute engines due to its elegant and expressive development APIs combined with enterprise readiness. At the meetup we will focus on machine learning and deep learning use cases and performance; Apache Atlas integration to enable governance and metadata; performance improvements and Parquet parity with Apache ORC (high-performance columnar storage); and finally the Apache Hive EDW connector, which enables data warehouse initiatives for advanced business analytics.

6:00 – 6:15 PM Food & drinks
6:15 – 6:20 PM Kickoff
6:20 – 6:40 PM Talk 1
6:40 – 7:00 PM Talk 2
7:00 – 7:20 PM Talk 3
7:20 – 7:40 PM Talk 4
7:40 – 8:00 PM Q&A
8:00 PM+ Networking

SparkML – PySpark performance, image integration, and Deep Learning use cases – Yanbo Liang and Mingjie Tang (20 min)
Spark Atlas integration – Yanbo Liang and Mingjie Tang (20 min)
Spark + ORC – Dongjoon Hyun (20 min)
Spark + HiveEDW connector – Eric Wohlstadter (20 min)


Robert Hryniewicz (host)
Robert is a Data Evangelist with over 11 years of experience working on a variety of technologies from AI and robotics to IoT and blockchain. He’s part of the Hortonworks community team, driving data science sandbox product strategy, thought leadership on AI, delivering crash courses and lectures on Spark, data science + deep learning, and making sure that the community has all the resources needed to build kickass next-gen products. Robert will be your host for the evening.

Arun Iyer
Arun Iyer has been involved with the design and development of various Streaming Analytics platforms at Hortonworks. He contributes to the Apache Storm project and is currently a committer and a PMC member of the project. Prior to Hortonworks he was involved in the development of various streaming and distributed systems at Informatica and at Yahoo.

Jerry Shao
Jerry Shao works as a member of technical staff at Hortonworks, focused mainly on Spark, especially Spark core, Spark on YARN and Spark Streaming. He is an Apache Spark committer and an Apache Livy (incubating) PPMC member. Prior to Hortonworks, he was a software engineer at Intel working on performance tuning and optimization of Hadoop and Spark.

Yanbo Liang
Yanbo is a staff software engineer at Hortonworks. His main interests center around implementing effective machine learning and deep learning algorithms and models. He is an Apache Spark PMC member and contributes to many open source projects such as TensorFlow, Apache MXNet and XGBoost. He delivered the implementation of some core Spark MLlib algorithms. Prior to Hortonworks, he was a software engineer at Yahoo! and France Telecom working on machine learning and distributed systems.

Mingjie Tang
Mingjie Tang is an engineer at Hortonworks. He works on SparkSQL, Spark MLlib and Spark Streaming. He has broad research interests in database management systems, similarity query processing, data indexing, big data computation, data mining and machine learning. Mingjie completed his PhD in Computer Science at Purdue University.

Dongjoon Hyun
Dongjoon Hyun is an Apache REEF PMC member and committer. Currently, he works for Hortonworks and is focusing on Apache Spark and Apache ORC.

Eric Wohlstadter
Eric is a principal engineer at Hortonworks. He is working on Hive, Tez, and Spark-Hive interoperability. His interests are in database systems and distributed query execution. Eric completed his PhD in Computer Science from the University of California at Davis.

DataWorks Summit San Jose – IBM Meetup

Join IBM for some networking and very interesting topics. You do not have to be registered for DataWorks Summit to attend the meetup.

6:00 – 6:30 – Welcome and networking
6:30 – 6:50 – Machine Learning Models in Production

Data scientists and machine learning practitioners now seem to churn out models by the dozen, and they continuously experiment to improve their accuracy. They also use a variety of ML and DL frameworks and languages, so a typical organization may end up with a heterogeneous, complicated collection of assets that require different runtimes, resources and sometimes even specialized compute to operate efficiently. Enterprises also require built-in auditing, authorization and approval processes, yet must still support a “continuous delivery” paradigm in which a data scientist can deliver insights faster. Not all models are created equal, nor are the consumers of a model, so enterprises require both metering and allocation of compute resources to meet SLAs. We will take a look at how machine learning is operationalized in IBM Data Science Experience (DSX), a Kubernetes-based offering for the private cloud that is optimized for the Hortonworks Hadoop Data Platform. DSX essentially brings typical software engineering practices to data science, organizing the dev->test->production flow for machine learning assets in much the same way as typical software deployments. We will also see what it means to deploy, monitor the accuracy of, and even roll back models & customer scorers, as well as how API-based techniques enable consuming business processes & applications to remain relatively stable amidst all the chaos.

Speaker: Piotr Mierzejewski, Program Director, Development, DSX Local, Data Science & ML, IBM Canada Lab

6:50 – 7:10 – Enabling a hardware accelerated deep learning data science experience for Apache Spark and Hadoop

Large unstructured data sets such as images, videos, speech and text are great for deep learning, but place heavy demands on computing resources. New hardware architectures available on OpenPOWER and IBM POWER systems, such as GPUs, faster interconnects (e.g. NVLink) and RDMA-capable network interfaces from Mellanox, are enabling practical speedups for deep learning. This session will show some deep learning build and deploy steps using TensorFlow and Caffe in Docker containers running in a hardware-accelerated public cloud container service.

Speaker: Indrajit Poddar, STSM, IBM Cognitive Systems, Systems

7:10 – 7:30 – Exploring Graph Use Cases with JanusGraph

Graph databases are relative newcomers in the NoSQL database landscape. What are some graph model and design considerations when choosing a graph database in your architecture? Let’s take a tour of a couple of graph use cases that we’ve collaborated on recently with our clients to help you better understand how and why a graph database can be integrated to help solve problems found with connected data.

Speaker: Jason Plurad, Software Engineer – Open Technology

7:30 – 7:50 – Achieve Better Analytics and Deeper Insights With IBM Digital Insights Platform

Managing big data is a constant challenge; now you can derive deeper insights from your data in a more timely and cost-effective fashion to achieve impactful business outcomes. This data and analytics-as-a-service approach allows organizations to meet the challenging cost of data management and digital transformation at scale.
IBM’s Digital Insights Platform allows for rapid development of analytic capabilities with its pre-assembled packages of data management, cognitive technologies and industry-specific processes. With Hortonworks as the base platform, the Digital Insights Platform brings speed and scale to your enterprise so you can start using your data to generate value quickly.

Speaker: Tony Giordano, Partner & VP, Data Platform Services, IBM

7:50 – 8:00 – Q&A

Apache Ambari 2.7 and Beyond updates (New UI & more)

Apache Ambari is used by thousands of Hadoop Operators to manage the deployment, lifecycle, and automation of DevOps for Hadoop ecosystem projects. The Ambari product and engineering team will talk about improvements being made to the UI, metrics, logging, scalability, and other core areas within Ambari as the project is being re-imagined.

As part of this meetup, the engineering team will walk you through what we’ve learned, the challenges we’ve overcome, and how the Apache Ambari community has changed the product to handle them. The future is fast approaching, and with it comes new on-premise and cloud deployment architectures. See how Apache Ambari is being updated to handle these new challenges.

Meetup Agenda:
New UI Look & Feel – Ambari has a brand new skin! See how the new interface makes it easier to deploy, manage, and monitor Hadoop and Streaming Analytics clusters (Demo).
Ambari Server Scalability Improvements – Managing 5,000-node clusters, how we got there, and how you can too.
Ambari Management Packs – Upgrading everything is fun, but how much better would it be to upgrade just what you want to? See what we’re doing to make it easier to get the latest technology deployed in your cluster, one mpack at a time.
Ambari Metrics System Anomaly Detection – Metrics are great, but seeing the right metric at the right time to help identify a root cause is better. See what we’re doing in Ambari Metrics to make it easier to spot relevant issues.
Ambari Log Search – An abundance of information creates a deficit in attention, which must be carefully focused to detect the signal from the noise. See how we’re making it easier to quickly find issues in logs, and access the right logs at the right time (Demo).

Meetup Speakers:
Yusaku Sako – Apache Ambari PMC Chair
Sid Wagle – Apache Ambari PMC Member
Aravindan Vijayan – Apache Ambari PMC Member
Swapan Sridhar – Apache Ambari PMC Member
Kat Petre – Product Owner
Paul Codding – Product Manager
Subhrajit Das – Senior UX Designer

Deep Dive into Apache Metron

As part of DataWorks Summit we’ll be hosting an Apache Metron meetup on the 18th of June at the San Jose McEnery Convention Center.

Details of the Meetup:
Explore Apache Metron through the eyes of experts. In this session, Metron committer, Michael Miklavcic, will talk in-depth about the features and capabilities of Metron: the real-time data ingestion, normalization, enrichment, triage, and management of application and security data at scale. Learn about the new enhancements available in Metron and future releases that help organizations optimize the performance and efficiency of their Security Operations Centers.

This is a great opportunity for any and all current Metron and Hortonworks Cybersecurity Platform users to get together for discussions about the platform and use cases. Participants will also take part in a stimulating forum for collaboration and the sharing of knowledge and experience within the wider Metron community.

Breaking Through The Challenges of Scalable Deep Learning for Video Analytics


– Use case Introduction & tool selection
– NLP & Entity Analytics
– Deep Learning Video Analytics
– Deployment tool selection and tips


– Steven Flores is a cognitive engineer at Comp Three Inc. in San Jose. He leverages state-of-the-art methods in AI and machine learning to deliver novel business solutions to clients. In 2012, Steven earned his Ph.D. in applied math from the University of Michigan, and in 2017, he completed a postdoc in mathematical physics at the University of Helsinki and Aalto University.
– Luke Hosking is a software engineer at Comp Three Inc. in San Jose. He works with clients to enable machine learning applications by building data management solutions which integrate data from disparate systems. Luke most recently came from the Healthcare Technology world, where he was the technical leader of a start-up that improved care outcomes by connecting previously isolated clinical applications. He is a general technologist who has been working with computing since the days of Internet Radio tuners.


When developing a machine learning system, the possibilities are limitless. However, with the recent explosion of Big Data and AI, there are more options than ever to filter through: which technologies to select, which model topologies to build, and which infrastructure to use for deployment, just to name a few. We have explored these options, along with their many roadblocks, for our faceted refinement system for video content (consisting of 100K+ videos). The three primary areas of focus are natural language processing, video frame sampling, and infrastructure deployment.

We use natural language processing on the video transcripts to extract verbally mentioned entities. Entity extraction, at a high level, can be thought of as identifying nouns in text, along with their type within a taxonomy. For instance, a system should be able to extract George Washington first as a name, and second as the first president of the United States. There are a number of general-purpose solutions available for this, but none that handle the case where the text of the video is domain specific. We have therefore had to develop custom entity extraction models for different client domains, which we will describe.
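As a rough illustration of what "an entity plus its type within a taxonomy" means, here is a minimal sketch that matches transcript text against a small lookup table. It is not the custom model described in the talk: the `GAZETTEER` table, its taxonomy labels and the `extract_entities` helper are all hypothetical names made up for this example, and a real system would use a trained model rather than exact phrase matching.

```python
import re

# Hypothetical gazetteer: phrase -> taxonomy path (general type first, then specific).
GAZETTEER = {
    "george washington": ["Person", "US President"],
    "united states": ["Place", "Country"],
}


def extract_entities(transcript, gazetteer=GAZETTEER):
    """Return (matched text, taxonomy path) pairs in order of appearance."""
    found = []
    for phrase, types in gazetteer.items():
        # Whole-phrase, case-insensitive match against the transcript.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), types))
    found.sort(key=lambda item: item[0])
    return [(text, types) for _, text, types in found]
```

A lookup table like this only finds phrases it already knows, which is one way to see why domain-specific transcripts call for custom models.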

The second set of challenges involves extracting entities through visual means: grabbing frames from the video (every 5 seconds) and then using object detection models to identify the visual entities in the videos. For example, a video of cats with no narration, or with no words such as cat or feline, should still be grouped together with cat videos. Building a collection of object detection models in conjunction with third-party services provides decent coverage, but again domain-specific images, such as real estate, require that we build custom models (using TensorFlow). Additionally, intelligent sampling of video frames is critical for performance. We therefore developed heuristics that sample enough frames not to miss critical visual elements without sampling every frame in the video, which would be computationally infeasible for a large number of videos.
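The fixed-interval baseline mentioned above (one frame every 5 seconds) can be sketched as a small index calculation. This is only an assumption-laden illustration: the function name `sample_frame_indices` and its parameters are hypothetical, and the smarter sampling heuristics the speakers developed are not described in the abstract, so they are not shown here.

```python
def sample_frame_indices(duration_s, fps, interval_s=5.0):
    """Return the indices of the frames to grab, one every `interval_s` seconds.

    Hypothetical helper: `duration_s` is the video length in seconds and
    `fps` is its frame rate; both are assumed to be known in advance.
    """
    if duration_s < 0 or fps <= 0 or interval_s <= 0:
        raise ValueError("duration must be >= 0; fps and interval must be > 0")
    total_frames = int(duration_s * fps)
    # Number of frames between two samples, never less than one frame.
    step = max(1, round(interval_s * fps))
    return list(range(0, total_frames, step))
```

For a 20-second clip at 30 frames per second this selects 4 frames out of 600, which shows why sampling matters when the collection holds 100K+ videos.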

Finally, there are a number of options today for deploying machine learning solutions and models. We evaluated a number of options such as Google Cloud Machine Learning, GPU machines in the cloud, as well as building our own dedicated GPU machine in-house. We will finish by outlining the benefits and challenges with each deployment approach and the answer as to which we ultimately used.

Simplifying Feature Engineering & Model Tuning, Ensembling & Deployment with H2O

Driverless AI is democratizing AI by automating machine learning. It fully automates some of the most challenging and productive tasks in applied data science such as feature engineering, model tuning, model ensembling and model deployment. Driverless AI turns Kaggle-winning grandmaster recipes into production-ready code, and is specifically designed to avoid common mistakes such as under- or overfitting, data leakage or improper model validation, some of the hardest challenges in data science. Avoiding these pitfalls alone can save weeks or more for each model, and is necessary to achieve high modeling accuracy.

Now, data scientists of all proficiency levels can train and deploy modeling pipelines with just a few clicks from the GUI. Advanced users can use the client API from Python. Driverless AI builds hundreds or thousands of models under the hood to select the best feature engineering recipes for your specific problem.

To speed up training, H2O uses highly optimized C++/CUDA algorithms to take full advantage of the latest compute hardware. For example, we can now run orders of magnitude faster on the latest Nvidia GPU supercomputers on Intel and IBM platforms, both in the cloud and on premises.

There are two more product innovations: statistically rigorous automatic data visualization and interactive model interpretation with reason codes and explanations in plain English. Both help data scientists and analysts to quickly validate the data and the models.

Speaker’s Bio:
Arno Candel is the Chief Technology Officer at H2O.ai. He is the main committer of H2O-3 and Driverless AI and has been designing and implementing high-performance machine-learning algorithms since 2012. Previously, he spent a decade in supercomputing at ETH and SLAC and collaborated with CERN on next-generation particle accelerators.

Arno holds a PhD and Masters summa cum laude in Physics from ETH Zurich, Switzerland. He was named “2014 Big Data All-Star” by Fortune Magazine and featured by ETH GLOBE in 2015. Follow him on Twitter: @ArnoCandel.

Apache NiFi @ DataWorks Summit San Jose 2018 Meetup

Whether you are attending DataWorks Summit, HBaseCon/PhoenixCon, or just in the vicinity of San Jose, please join us for an evening of presentations and discussion with the Apache NiFi community.

More details to come!

Big Data Science @ DataWorks Summit, 2018

The theme of this meetup is “Deep Learning and Blockchain Smart Contracts”. There will be discussions on Keras, TensorFlow, Ethereum smart contracts and related topics.

6:00 P.M. – 6:15 P.M. INTRODUCTION
6:15 P.M. – 6:50 P.M. Session 1
Speaker Bio:
6:50 P.M. – 7:00 P.M. Q/A

7:00 P.M. – 7:40 P.M. Session 2
Speaker Bio:
7:40 P.M. – 7:50 P.M. Q/A

Open Source Innovation Driving Connected and Autonomous Automotive Future

The automotive industry is currently in a period of unprecedented transformation, with connected and autonomous cars redefining industry business models and vehicles themselves. In this fast-changing environment, open source data management continues to fuel innovation. In this session, discover two technologies at the forefront of this next-generation transformation. First, discover how Apache NiFi stream processing is driving next-generation connected vehicle capabilities and use cases. Next, discover new Hadoop 3.0 capabilities driving autonomous driving innovation, including efficient management of autonomous vehicle training data and deep learning on this data for the development of next-generation driving algorithms.

Andy LoPresto and Saumitra Buragohain, Hortonworks

0. Hortonworks in Automotive – Connected and Autonomous Cars Driving Industry Transformation (20 minutes)
1. Apache NiFi Role in Driving Connected Vehicle Innovation (35 minutes)
2. Hadoop 3.0 Driving Autonomous Vehicle Deep Learning at Scale (35 minutes)