PubMatic is a leading advertising technology company that processes 500 billion transactions (50 terabytes of data) per day through real-time and batch processing pipelines on a 900-node cluster. These pipelines power highly efficient machine learning algorithms, provide real-time feedback to the ad server for optimization, and deliver in-depth insights into customer inventory and audience.
At PubMatic, scaling with ever-growing volume has always been the biggest challenge, and we continuously optimize our technology stack for performance and cost. Another challenge is supporting the demand for a wide variety of reports and analytics from customers and internal stakeholders. Writing a custom job for each analytics requirement leads to repetitive effort and duplication of business logic across many different jobs.
To solve these problems, we built a platform for creating configuration-driven data processing pipelines with highly reusable business functions. It is also extensible, so we can adopt cutting-edge technologies from the ever-changing big data ecosystem. The platform enables our development teams to build robust batch data processing pipelines that power analytics dashboards, and it empowers novice users to generate ad-hoc reports in a single data processing job by supplying a configuration of facts and dimensions. The framework intelligently identifies and reuses existing business functions based on user inputs, and it provides an abstraction layer that keeps core business logic unaffected by technology changes. The framework is currently powered by Spark, but it can easily be configured to use other technologies.
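To make the idea concrete, here is a minimal sketch of such a configuration-driven report: a registry of reusable business functions that a user's configuration can reference by name, plus a generic step that groups by the configured dimensions and aggregates the configured facts. All names here (the registry, the `filter_valid` function, the config keys) are illustrative assumptions, not PubMatic's actual API, and plain Python dictionaries stand in for the Spark layer.

```python
# Hypothetical sketch of a configuration-driven pipeline; the registry,
# function names, and config schema are illustrative, not PubMatic's API.

BUSINESS_FUNCTIONS = {}

def business_function(name):
    """Register a reusable business function under a name configs can reference."""
    def register(fn):
        BUSINESS_FUNCTIONS[name] = fn
        return fn
    return register

@business_function("filter_valid")
def filter_valid(rows):
    # Example cleansing rule: drop records with non-positive revenue.
    return [r for r in rows if r["revenue"] > 0]

def run_report(rows, config):
    """Apply configured business functions, then group by dimensions and sum facts."""
    for step in config.get("steps", []):
        rows = BUSINESS_FUNCTIONS[step](rows)  # reuse of registered business logic
    grouped = {}
    for r in rows:
        key = tuple(r[d] for d in config["dimensions"])
        agg = grouped.setdefault(key, {f: 0 for f in config["facts"]})
        for f in config["facts"]:
            agg[f] += r[f]
    return grouped

# Example ad-hoc report: impressions and revenue by publisher.
config = {
    "steps": ["filter_valid"],
    "dimensions": ["publisher"],
    "facts": ["impressions", "revenue"],
}
rows = [
    {"publisher": "A", "impressions": 100, "revenue": 2.5},
    {"publisher": "A", "impressions": 50, "revenue": 0.0},  # filtered out
    {"publisher": "B", "impressions": 80, "revenue": 1.2},
]
report = run_report(rows, config)
```

Because the transformation steps, dimensions, and facts all come from the configuration, a new report needs only a new config rather than a new job, and the same registered business functions are reused across reports; in the real system the execution layer would be Spark rather than in-memory lists.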
The framework reduced the time to develop data processing jobs from weeks to a few days, simplified unit testing and QA automation, and gave customers and internal stakeholders simpler interfaces for generating custom reports.