Today Walmart offers many millions of products to purchase through its websites. All these products are managed in large scale product catalog which getting updated thousands of times per second. The changes include product information updates, new products, availability in stores and so many more different attributes. In quest of providing a seamless shopping experience for our customers, we developed a streaming indexing data pipeline which ensures that search index is getting updated on timely basis and always reflect latest state of product catalog in near real time. Our pipeline is a key component to ensure that our search data is always up-to-date and in sync with constantly changing product catalog and other features such as store and online availability, offers etc.
Our indexing component, which is based on Spark Streaming Receiver Approach, consumes events from multiple Kafka topics such as Product Change, Store Availability, and Offer Change and merges the transformed Product Attributes with the historical signals computed by relevance data pipeline stored in Cassandra. This data is further processed by another Streaming component, which partitions documents into Kafka topic for every shard as it can be indexed into Apache Solr for Product Search. Deployment of this pipeline is automated end to end.