Manage your data infrastructure like this:
(i) don’t drown the infra teams in domain data specifics
(ii) build robust low-latency lookup facilities to feed online services
(iii) always take stress out of the equation
At Klarna Bank we make online decisions on risk, fraud, and ID. Over a hundred data sources are processed by over a hundred analysts and over a hundred batch jobs. Three data infrastructure engineering teams operate and develop this data lake: the core team, the apps team, and the performance team. The total head count is less than a dozen.
To keep afloat, we’ve distilled the following practices: (i) lean on the immutability and recomputation properties of the Lambda/Kappa architectures, (ii) continuously deliver automated infrastructure, (iii) build tooling that empowers producers and consumers of data to be accountable and self-sufficient, and (iv) proactively improve the efficiency of data users.
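The immutability and recomputation idea behind the Lambda/Kappa architectures can be sketched in a few lines — this is a hypothetical Python toy for illustration, not our production code: events are only ever appended, and every derived view is a pure function of the log, so a view can always be rebuilt by replaying events.

```python
from collections import defaultdict

class EventLog:
    """Append-only store of immutable events (toy illustration)."""
    def __init__(self):
        self._events = []

    def append(self, event):
        # Copy on write so stored events stay immutable from the outside.
        self._events.append(dict(event))

    def replay(self):
        yield from self._events

def recompute_balances(log):
    """Derived view: account balances, rebuilt from the full log."""
    balances = defaultdict(int)
    for event in log.replay():
        balances[event["account"]] += event["amount"]
    return dict(balances)

log = EventLog()
log.append({"account": "a1", "amount": 100})
log.append({"account": "a1", "amount": -30})
log.append({"account": "a2", "amount": 50})

# A bug fix in the view logic never means patching stored data:
# just recompute the view from the immutable log.
print(recompute_balances(log))  # → {'a1': 70, 'a2': 50}
```

The payoff is operational: when a batch job is wrong, the fix is a code change plus a recomputation, not a risky in-place mutation of the lake.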
We’ll talk about some of these practices and the tools we have built during several years of running banking applications on Hortonworks Hadoop. Ecosystem components we’ll touch on include Kafka, Avro, Hive, Oozie, ELK, Ranger, and Ansible. Tools we have developed include HiveRunner, data-import tooling, and continuous delivery of data pipelines.