Tutorials

Leveraging machine learning techniques for customer data deduplication - hard-won lessons from a real-world project in the banking industry

Robert Wrembel (Poznan University of Technology, Poland)
Witold Andrzejewski (Poznan University of Technology, Poland)
Bartosz Bębel (Poznan University of Technology, Poland)
Paweł Boiński (Poznan University of Technology, Poland)

Abstract

This tutorial shares practical experience gained from a 3-year R&D project (2020-2023) for the largest Polish bank, which aimed at developing deduplication pipelines for large-scale customer records. The project involved the development of two distinct end-to-end deduplication pipelines, based on (1) statistical modeling (denoted SMP) and (2) machine learning (denoted MLP). The tutorial focuses on MLP, detailing its design, implementation, and evaluation, as well as the challenges faced, within the context of a real-world industrial setting. In addition, the first part of the tutorial provides an overview of baseline approaches to data deduplication, from traditional ones based on statistical modeling, through machine learning and neural networks, to more recent techniques leveraging pre-trained and large language models.

Presenter

Robert Wrembel (Poznan University of Technology, Poland)

Robert Wrembel (PhD, Dr. Habil.) is an associate professor in the Faculty of Computing and Telecommunications at Poznan University of Technology (Poland). In 2008 he received a post-doctoral degree in computer science (habilitation), specializing in database systems and data warehouses. He served as deputy dean of the Faculty of Computing and Management (2008-2012) and of the Faculty of Computing (2012-2016). Since January 2023 he has been the chair of the Data Processing Technologies group at Poznan University of Technology. He was a consultant at a software house (2002-2003) and a lecturer at Oracle Poland (1998-2005). Currently he is an IT consultant in a private hospital. Within the last 10 years he has carried out four R&D projects: one for a large financial institution in Poland, one for a company in the energy sector, and two for a corporation in the field of electronics. He cooperates with IBM Software Lab Kraków in Poland. At his university, he led the Erasmus Mundus Joint Doctorate Program - Information Technologies for Business Intelligence - Doctoral College (2013-2020). Robert has visited numerous research and education centers, including INRAE Clermont-Ferrand (France), Free University of Bozen-Bolzano (Italy), Università degli Studi di Milano (Italy), Universitat Politècnica de Catalunya - BarcelonaTech (Spain), Université Lyon 2 (France), Universidad de Costa Rica (Costa Rica), Klagenfurt University (Austria), Loyola University (USA), INRIA Paris-Rocquencourt (France), and Université Paris Dauphine (France). In 2012 he graduated from a 2-month innovation and entrepreneurship program at Stanford University. In 2013 he completed an internship at the BI company Targit (USA). His research interests encompass data integration, data quality, databases, data warehouses, and data lakes.