Fraud prevention and anti-money laundering are tough use cases to tackle. They become even more
difficult when data originates from various sources, is incomplete or incorrect, or isn't kept fresh.
Some likely problems include:
- Simply getting datasets ready to analyze
- Resolving fuzzy duplicates. Is Robert Smith the same person as Dr. Bob Smith?
- Strict security requirements
- The need for real-time change data capture from various sources into Hadoop, the cloud, and Kafka
- The need to trace data back to its source, and to trust the algorithm's conclusions
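The fuzzy-duplicate problem above can be sketched with simple string normalization and similarity scoring. This is a minimal illustration, not a production entity-resolution pipeline; the honorific and nickname lists are assumptions for the example, and real systems would use far richer reference data and probabilistic matching.

```python
from difflib import SequenceMatcher

# Illustrative lists only -- real entity resolution uses much larger
# reference data for honorifics and nickname expansion.
HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def normalize(name: str) -> str:
    """Lowercase, drop honorifics, and expand common nicknames."""
    tokens = [t.strip(".").lower() for t in name.split()]
    tokens = [NICKNAMES.get(t, t) for t in tokens if t not in HONORIFICS]
    return " ".join(tokens)

def similarity(a: str, b: str) -> float:
    """Score two names on [0, 1] after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("Robert Smith", "Dr. Bob Smith"))  # 1.0 after normalization
```

After normalization, "Dr. Bob Smith" and "Robert Smith" collapse to the same string, so the two records can be flagged as a likely duplicate even though they differ character-by-character.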
To make this work, machine learning models need the data in the right format, but data quality
processes at Hadoop cluster scale are no picnic.
And the moment that model goes into production, its data will already be out of date, so you'll need a way to
keep the cluster in sync with transactional source systems in real time. Shouldn't be too hard, right?
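Keeping a cluster in sync via change data capture boils down to applying a stream of insert, update, and delete events to a replica. The sketch below shows the core idea against an in-memory dictionary; the event shape (`op`, `key`, `row`) is an assumption for illustration, not the format of any particular CDC tool, and a real pipeline would consume these events from Kafka and write to the cluster.

```python
# Replica state keyed by primary key. In practice this would be a
# table in Hadoop or a cloud store, not an in-memory dict.
replica = {}

def apply_event(event: dict) -> None:
    """Apply one CDC event (hypothetical {"op", "key", "row"} shape)."""
    op = event["op"]
    if op in ("insert", "update"):
        replica[event["key"]] = event["row"]
    elif op == "delete":
        replica.pop(event["key"], None)

# A short stream of events for one customer record.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Robert Smith", "risk": "low"}},
    {"op": "update", "key": 1, "row": {"name": "Robert Smith", "risk": "high"}},
    {"op": "delete", "key": 1},
]
for e in events:
    apply_event(e)

print(replica)  # {} -- the delete superseded the earlier insert and update
```

Order matters here: replaying the same events out of sequence would leave the replica wrong, which is why real CDC pipelines preserve per-key ordering (for example, by partitioning the Kafka topic on the primary key).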