Differential snapshot algorithms based on Hadoop MapReduce

Bibliographic Details
Published in: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1203-1208
Main Authors: Wei Du, Xianxia Zou
Format: Conference Proceeding
Language: English
Published: IEEE 01.08.2015
Abstract Change Data Capture (CDC) from the source system is the first step in the incremental maintenance of data warehouses and business intelligence, and is a key component of the ETL (Extract, Transform and Load) technique. The CDC methods currently available are timestamps, differential snapshots, triggers, and archive logs. Differential snapshots do not rely on the implementation mechanism of the information sources and therefore demonstrate better universality and adaptability. When computing resources are limited, differential snapshots based on sort-merge and hash partitioning are sometimes erroneous and inefficient. This paper proposes a low-cost, high-efficiency differential snapshot that combines an open source database with Hadoop MapReduce. A differential snapshot based on data summaries generated by the MD5 algorithm is very effective, but its I/O cost is heavy. The paper therefore proposes an SQL statement that generates the tuple summaries while querying the database, so that only one I/O pass is needed. We implement the SQL statement on the open source database MySQL. In addition, MapReduce parallel programming is used to find the differences between database files, which improves efficiency and avoids the errors. Experiments verify the performance differences among the differential snapshot algorithms.
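The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table name `customer`, its columns, the query in `SUMMARY_SQL`, and all function names are hypothetical, and the map and reduce phases are simulated in a single process rather than run on Hadoop. The sketch assumes each tuple has a primary key in its first field, summarizes the remaining fields with MD5, and compares the summaries of an old and a new snapshot per key.

```python
import hashlib
from collections import defaultdict

# Hypothetical MySQL query (an assumption, not the paper's statement): it returns
# the key and an MD5 summary of each tuple in a single pass over the table.
SUMMARY_SQL = "SELECT id, MD5(CONCAT_WS('|', name, city)) AS digest FROM customer"


def summarize(rows):
    """Local stand-in for SUMMARY_SQL: turn (key, *fields) rows into (key, digest)."""
    return [
        (key, hashlib.md5("|".join(map(str, fields)).encode("utf-8")).hexdigest())
        for key, *fields in rows
    ]


def map_phase(tag, summary):
    """Map: emit (key, (snapshot tag, digest)) for every summarized tuple."""
    for key, digest in summary:
        yield key, (tag, digest)


def reduce_phase(key, values):
    """Reduce: compare the digests seen for one key in the old and new snapshots."""
    seen = dict(values)
    old, new = seen.get("old"), seen.get("new")
    if old is None:
        return ("insert", key)
    if new is None:
        return ("delete", key)
    if old != new:
        return ("update", key)
    return None  # unchanged tuple, nothing to capture


def differential_snapshot(old_rows, new_rows):
    """Simulate map, shuffle, and reduce in one process and return the changes."""
    groups = defaultdict(list)  # shuffle step: group mapped values by key
    for tag, rows in (("old", old_rows), ("new", new_rows)):
        for key, value in map_phase(tag, summarize(rows)):
            groups[key].append(value)
    changes = (reduce_phase(key, values) for key, values in groups.items())
    return sorted(change for change in changes if change is not None)
```

Comparing fixed-length digests instead of whole tuples keeps the records passed between the map and reduce phases small, which is the motivation the abstract gives for the MD5 summary. In a real deployment the digests would presumably come from the database query rather than from `summarize`.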
Author Xianxia Zou
Wei Du
Author_xml – sequence: 1
  surname: Wei Du
  fullname: Wei Du
  organization: Dept. of Comput. Sci., GongDong Police Coll., Guangzhou, China
– sequence: 2
  surname: Xianxia Zou
  fullname: Xianxia Zou
  organization: Dept. of Comput. Sci., Jinan Univ., Guangzhou, China
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/FSKD.2015.7382113
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781467376815
1467376817
9781467376822
1467376825
EndPage 1208
ExternalDocumentID 7382113
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i123t-9c4d4a2c49a2d5d75d862b50f454596c32f1e7925f0fbdfba0be91934b60f273
IEDL.DBID RIE
IngestDate Thu Jun 29 18:36:03 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i123t-9c4d4a2c49a2d5d75d862b50f454596c32f1e7925f0fbdfba0be91934b60f273
PageCount 6
ParticipantIDs ieee_primary_7382113
PublicationCentury 2000
PublicationDate 20150801
PublicationDateYYYYMMDD 2015-08-01
PublicationDate_xml – month: 08
  year: 2015
  text: 20150801
  day: 01
PublicationDecade 2010
PublicationTitle 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)
PublicationTitleAbbrev FSKD
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.5750471
SourceID ieee
SourceType Publisher
StartPage 1203
SubjectTerms Algorithm design and analysis
Change Data Capture (CDC)
Data mining
Data warehouses
differential snapshot algorithm
Hadoop MapReduce
MD5 algorithm
Particle separators
Partitioning algorithms
Syntactics
Title Differential snapshot algorithms based on Hadoop MapReduce
URI https://ieeexplore.ieee.org/document/7382113