On memory and I/O efficient duplication detection for multiple self-clean data sources
Paper
Paper/Presentation Title | On memory and I/O efficient duplication detection for multiple self-clean data sources |
---|---|
Presentation Type | Paper |
Authors | Zhang, Ji (Author), Shu, Yanfeng (Author) and Wang, Hua (Author) |
Editors | Yoshikawa, M. |
Journal or Proceedings Title | Lecture Notes in Computer Science (Book series) |
Journal Citation | 6193, pp. 130-142 |
Number of Pages | 13 |
Year | 2010 |
Publisher | Springer |
Place of Publication | Germany |
ISSN | 1611-3349; 0302-9743 |
ISBN | 9783642145889 |
Digital Object Identifier (DOI) | https://doi.org/10.1007/978-3-642-14589-6_14 |
Web Address (URL) of Paper | https://link.springer.com/chapter/10.1007/978-3-642-14589-6_14 |
Web Address (URL) of Conference Proceedings | https://link.springer.com/book/10.1007/978-3-642-14589-6 |
Conference/Event | DASFAA 2010: 15th International Conference on Database Systems for Advanced Applications |
Event Details | DASFAA 2010: 15th International Conference on Database Systems for Advanced Applications; Event Date: 01 to 04 Apr 2010; Event Location: Tsukuba, Japan |
Abstract | In this paper, we propose efficient algorithms for duplicate detection from multiple data sources that are themselves duplicate-free. When developing these algorithms, we take full account of the various possible cases, given the workload of the data sources to be cleaned and the available memory. These algorithms are memory and I/O efficient: they reduce the number of pairwise record comparisons and minimize the total page access cost involved in the cleaning process. Experimental evaluation demonstrates that the proposed algorithms are efficient and achieve better performance than the sorted neighborhood method (SNM) and random access methods. |
Keywords | access cost; cleaning process; data source; duplicate detection; duplication detection; efficient algorithm; experimental evaluation; multiple data sources; random access |
ANZSRC Field of Research 2020 | 460599. Data management and data science not elsewhere classified; 460499. Cybersecurity and privacy not elsewhere classified; 461299. Software engineering not elsewhere classified |
Public Notes | Files associated with this item cannot be displayed due to copyright restrictions. |
Byline Affiliations | Department of Mathematics and Computing; Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia |
https://research.usq.edu.au/item/9zzw7/on-memory-and-i-o-efficient-duplication-detection-for-multiple-self-clean-data-sources
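
The abstract's central premise, that each source is already duplicate-free, implies that only records from different sources need to be compared. The following minimal sketch illustrates that idea only; it is not the paper's algorithm and does not model memory limits or page access cost. The names `cross_source_pairs`, `count_all_pairs`, `is_duplicate`, and `find_duplicates`, as well as the sample records, are hypothetical and chosen for illustration.

```python
from itertools import combinations, product


def cross_source_pairs(sources):
    """Yield candidate pairs whose two records come from different sources.

    Assumes every source is already duplicate-free, so pairs drawn from
    within a single source never need to be compared.
    """
    for src_a, src_b in combinations(sources, 2):
        yield from product(src_a, src_b)


def count_all_pairs(sources):
    """Number of comparisons if all records were pooled into one set."""
    n = sum(len(s) for s in sources)
    return n * (n - 1) // 2


def is_duplicate(a, b):
    # Hypothetical matcher: case- and whitespace-insensitive equality.
    # A real system would use a domain-specific similarity function.
    return a.strip().lower() == b.strip().lower()


def find_duplicates(sources, match=is_duplicate):
    """Return cross-source record pairs that the matcher flags as duplicates."""
    return [(a, b) for a, b in cross_source_pairs(sources) if match(a, b)]


# Illustrative data: three small sources, each free of internal duplicates.
sources = [
    ["Ann Lee", "Bob Ray"],
    ["ann lee", "Cy Dow", "Di Fox"],
    ["Bob Ray ", "Ed Guo"],
]

cross_count = sum(1 for _ in cross_source_pairs(sources))
all_count = count_all_pairs(sources)
duplicates = find_duplicates(sources)

print(f"cross-source comparisons: {cross_count}")
print(f"pooled comparisons:       {all_count}")
print(f"duplicates found:         {duplicates}")
```

With these three sources of sizes 2, 3, and 2, the sketch performs 16 cross-source comparisons instead of the 21 required when all 7 records are pooled, which is the kind of saving in pairwise comparisons that the abstract refers to.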