High-energy physics experiments and accelerators at CERN produce and collect more data than ever before, recently breaking the record of 12.3 PB per month. CERN provides Hadoop and Spark services and works closely with the scientific communities in their quest to analyse and understand these vast amounts of physics and infrastructure data. Consequently, the number of CERN teams using big data frameworks in their systems has grown significantly in recent years. These systems include: the Next CERN Accelerator Logging Service (NXCALS), which will perform online and offline analysis of the data acquired from each of the 20,000 devices that monitor the CERN accelerator complex; the CMS Data Reduction Facility, which aims to use Spark to reduce 1 PB of data produced by the CMS Detector to 1 TB of reusable data for physics analysis; intrusion detection systems; and the monitoring system for the CERN Data Center and the Worldwide LHC Computing Grid (WLCG), which consists of more than 170 computing centers in 42 countries.
This talk will give an overview of the current infrastructure based on Spark and other key components of the Hadoop ecosystem, the active big data analytics use cases from various CERN communities, and the challenges presented by the available data sources and their architecture.