Redblock: a tool for online deduplication on large datasets

Luan Félix Pimentel, Igor Lemos Vicente, Guilherme Dal Bianco


Online data deduplication aims to identify records that refer to the same real-world entity in a continuous data stream. It must process a wide range of information with high effectiveness and without delays. This paper introduces Redblock, a tool for real-time data deduplication that combines a distributed platform for online processing with an inverted index. In the experimental evaluation, Redblock achieved good preliminary results in terms of both efficiency and effectiveness when run on a database.
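As an illustration of how an inverted index can support online deduplication, the following is a minimal single-process sketch, not Redblock's actual implementation. The class name OnlineDeduplicator, the whitespace tokenizer, the Jaccard similarity measure, and the 0.5 threshold are all assumptions made for this example; the distributed platform mentioned in the abstract is not modeled here.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple choice of blocking keys)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class OnlineDeduplicator:
    """Hypothetical helper: matches each incoming record against earlier ones."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold       # assumed similarity cut-off
        self.index = defaultdict(set)    # inverted index: token -> record ids
        self.records = {}                # record id -> token set

    def process(self, record_id, text):
        """Return ids of earlier records judged duplicates, then index the new record."""
        tokens = tokenize(text)

        # Blocking step: only records sharing at least one token are candidates.
        candidates = set()
        for token in tokens:
            candidates |= self.index.get(token, set())

        # Comparison step: keep candidates whose similarity reaches the threshold.
        matches = sorted(
            c for c in candidates
            if jaccard(tokens, self.records[c]) >= self.threshold
        )

        # Update the index so later records can be matched against this one.
        self.records[record_id] = tokens
        for token in tokens:
            self.index[token].add(record_id)
        return matches


if __name__ == "__main__":
    dedup = OnlineDeduplicator(threshold=0.5)
    stream = [
        (1, "John Smith New York"),
        (2, "john smith new york city"),
        (3, "Maria Souza Porto Alegre"),
    ]
    for record_id, text in stream:
        print(record_id, dedup.process(record_id, text))
```

Because each record is compared only with records that share at least one token, the cost per incoming record depends on the size of its candidate set rather than on the total number of records seen so far, which is the usual motivation for pairing blocking with an inverted index.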


Data Integration, Online Deduplication, Blocking.
