Differential snapshot algorithms based on Hadoop MapReduce

Detailed bibliography
Published in:	2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1203-1208
Main authors:	Wei Du, Xianxia Zou
Format:	Conference paper
Language:	English
Publication details:	IEEE, 2015-08-01
Subjects:	Algorithm design and analysis; Change Data Capture (CDC); Data mining; Data warehouses; differential snapshot algorithm; Hadoop MapReduce; MD5 algorithm; Particle separators; Partitioning algorithms; Syntactics

Abstract	Change Data Capture (CDC) from the source system is the first step in the incremental maintenance of data warehouses and business intelligence, and is a key component of the ETL (Extract, Transform and Load) technique. The CDC methods currently available are time stamps, differential snapshots, triggers, and archive logs. Differential snapshots do not rely on the implementation mechanism of the information sources, and therefore demonstrate better universality and adaptability. When computing resources are scarce, differential snapshots based on sort merge and hash partition are sometimes erroneous and ineffective. This paper proposes a low-cost, high-efficiency differential snapshot that combines an open source database with Hadoop MapReduce. A differential snapshot based on a data summary generated by the MD5 algorithm is very effective, but its I/O cost is heavy. The paper therefore proposes an SQL statement that generates the tuple summaries while querying the database, so that only one I/O pass is needed. We implement the SQL statement on the open source database MySQL. In addition, MapReduce parallel programming is used to find the differences between database files, which improves efficiency and avoids the errors. Experiments compare the performance of the differential snapshot algorithms.
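The digest-based comparison described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`row_digest`, `map_phase`, `diff_snapshots`), the in-memory dictionaries standing in for snapshot files, and the column separator are all assumptions made for illustration. Each tuple is reduced to an MD5 summary, the summaries are grouped by primary key (the role a MapReduce shuffle would play), and each key is then classified as an insert, a delete, or an update.

```python
import hashlib
from collections import defaultdict


def row_digest(row):
    """Return the MD5 hex digest of a row's columns.

    Columns are joined with a unit-separator character (an assumption)
    so that ("ab", "c") and ("a", "bc") produce different digests.
    """
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def map_phase(snapshot, tag):
    """Emit (key, (tag, digest)) pairs, as a mapper over one snapshot would."""
    for key, row in snapshot.items():
        yield key, (tag, row_digest(row))


def diff_snapshots(old, new):
    """Classify primary keys as inserted, deleted, or updated."""
    # "Shuffle": group the tagged digests by primary key.
    groups = defaultdict(dict)
    for tag, snapshot in (("old", old), ("new", new)):
        for key, (source, digest) in map_phase(snapshot, tag):
            groups[key][source] = digest

    # "Reduce": compare the digests seen for each key.
    changes = {"insert": [], "delete": [], "update": []}
    for key, digests in groups.items():
        if "old" not in digests:
            changes["insert"].append(key)
        elif "new" not in digests:
            changes["delete"].append(key)
        elif digests["old"] != digests["new"]:
            changes["update"].append(key)

    for keys in changes.values():
        keys.sort()
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots: primary key -> remaining columns.
    old = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
    new = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}
    print(diff_snapshots(old, new))
```

Comparing fixed-length digests instead of full tuples keeps the per-key comparison cheap, which is the property the abstract attributes to the MD5-based summary; a real job would read the snapshots from files and distribute the grouping across reducers.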
Author affiliations	Wei Du (Dept. of Comput. Sci., GongDong Police Coll., Guangzhou, China); Xianxia Zou (Dept. of Comput. Sci., Jinan Univ., Guangzhou, China)
DOI	10.1109/FSKD.2015.7382113
EISBN	9781467376815 1467376817 9781467376822 1467376825
URI	https://ieeexplore.ieee.org/document/7382113