Differential snapshot algorithms based on Hadoop MapReduce

Bibliographic details
Published in: 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1203-1208
Main authors: Wei Du, Xianxia Zou
Format: Conference paper
Language: English
Publication details: IEEE, 2015-08-01
Abstract Change Data Capture (CDC) from the source system is the first step in the incremental maintenance of data warehouses and business intelligence, and is a key component of the ETL (Extract, Transform and Load) technique. Currently available CDC methods include time stamps, differential snapshots, triggers, and archive logs. Differential snapshots do not rely on the implementation mechanism of the information sources and therefore demonstrate better universality and adaptability. When computing resources are lacking, differential snapshots based on sort merge and hash partition are sometimes erroneous and ineffective. This paper proposes a low-cost, high-efficiency differential snapshot that combines an open source database with Hadoop MapReduce. A differential snapshot based on data summaries generated by the MD5 algorithm is very effective, but its I/O cost is very heavy. The paper therefore proposes an SQL statement that generates the tuple summaries while querying the database, so that only one I/O pass is needed. We implement the SQL statement on the open source database MySQL. In addition, the parallel programming of MapReduce is used to find the differences between database files, which improves efficiency and avoids the errors. Experiments verify the performance differences among the differential snapshot algorithms.
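The idea summarized in the abstract (one MD5 summary per tuple, then a parallel comparison of the old and new snapshots) can be illustrated with a minimal sketch. This is not the authors' implementation: the table and column names, the `SUMMARY_SQL` string, and the helper functions `summarize`, `map_phase`, `reduce_phase`, and `diff_snapshots` are all hypothetical, and the map and reduce steps are simulated in plain Python rather than run on Hadoop.

```python
import hashlib
from collections import defaultdict

# Hypothetical MySQL query (assumed table "customer" with columns id, name, city).
# It shows how a digest could be computed inside the same SELECT that reads the
# rows, so the table is scanned only once. Note that CONCAT_WS skips NULL values.
SUMMARY_SQL = "SELECT id, MD5(CONCAT_WS('|', name, city)) AS digest FROM customer"


def summarize(rows):
    """Yield (key, digest) per tuple; a local stand-in for the in-database summary."""
    for key, *fields in rows:
        joined = "|".join(str(f) for f in fields)
        yield key, hashlib.md5(joined.encode("utf-8")).hexdigest()


def map_phase(tag, summaries):
    """Map step: emit the primary key with a tagged digest ('old' or 'new')."""
    for key, digest in summaries:
        yield key, (tag, digest)


def reduce_phase(pairs):
    """Reduce step: group by key and classify each key as insert, delete, or update."""
    grouped = defaultdict(dict)
    for key, (tag, digest) in pairs:
        grouped[key][tag] = digest
    changes = {}
    for key, digests in grouped.items():
        old, new = digests.get("old"), digests.get("new")
        if old is None:
            changes[key] = "insert"
        elif new is None:
            changes[key] = "delete"
        elif old != new:
            changes[key] = "update"
    return changes


def diff_snapshots(old_rows, new_rows):
    """Run the simulated map and reduce steps over two snapshots."""
    pairs = list(map_phase("old", summarize(old_rows)))
    pairs += list(map_phase("new", summarize(new_rows)))
    return reduce_phase(pairs)


# Made-up sample data, for illustration only.
old_snapshot = [(1, "Ann", "Oslo"), (2, "Bob", "Rome"), (3, "Cy", "Lima")]
new_snapshot = [(1, "Ann", "Oslo"), (2, "Bob", "Paris"), (4, "Di", "Kyiv")]
changes = diff_snapshots(old_snapshot, new_snapshot)
```

Comparing fixed-length digests instead of full tuples keeps the shuffled data small, and grouping by primary key is what lets the comparison be split across reducers.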
Author Xianxia Zou
Wei Du
Author_xml – sequence: 1
  surname: Wei Du
  fullname: Wei Du
  organization: Dept. of Comput. Sci., GongDong Police Coll., Guangzhou, China
– sequence: 2
  surname: Xianxia Zou
  fullname: Xianxia Zou
  organization: Dept. of Comput. Sci., Jinan Univ., Guangzhou, China
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/FSKD.2015.7382113
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781467376815
1467376817
9781467376822
1467376825
EndPage 1208
ExternalDocumentID 7382113
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
IEDL.DBID RIE
IngestDate Thu Jun 29 18:36:03 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 6
ParticipantIDs ieee_primary_7382113
PublicationCentury 2000
PublicationDate 20150801
PublicationDateYYYYMMDD 2015-08-01
PublicationDate_xml – month: 08
  year: 2015
  text: 20150801
  day: 01
PublicationDecade 2010
PublicationTitle 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)
PublicationTitleAbbrev FSKD
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1203
SubjectTerms Algorithm design and analysis
Change Data Capture (CDC)
Data mining
Data warehouses
differential snapshot algorithm
Hadoop MapReduce
MD5 algorithm
Particle separators
Partitioning algorithms
Syntactics
Title Differential snapshot algorithms based on Hadoop MapReduce
URI https://ieeexplore.ieee.org/document/7382113