Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet

Bibliographic Details
Published in:	2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) pp. 1 - 9
Main Authors:	Nalini, M. K.; K, Dhinakaran; D, Elantamilan; Gnanavel, R.; Vinod, D.
Format:	Conference Proceeding
Language:	English
Published:	IEEE 28.01.2022
Subjects:	Costs; Couplings; data cleaning; data integration; data matching; deduplication; Duplicates; GUI; Hidden Markov models; HTML tags; Java Scripts; open source software; record linkage software; Search engines; Stemming; Threshold. Health data linkage; Training; Web pages

Abstract	This research aims to develop a new, efficient method for detecting near-duplicates in online content. Web pages crawled by a search engine are first parsed to remove HTML elements and JavaScript. Common keywords, or stop words, are then removed from the crawled pages. Keywords are extracted from the remaining text and their affixes (prefixes and suffixes) are stripped. Finally, the collected keywords are used to generate a similarity score between two documents. An extensive experimental investigation on several real datasets indicates that the suggested algorithms outperform earlier ones. Because so many documents exist on the internet, web search engines have difficulty finding relevant results for their users. Overhead costs for search engines have risen dramatically because of the abundance of near-duplicate or identical documents. The web crawling community has long been aware of duplicate and near-identical online pages. Users want search engines to return relevant results on the first page, free of duplicates and redundancies, which is an important criterion. Record or data linking is an essential enabler in the health industry, because linked data is an economic resource that may help improve research into health policy, detect adverse medication responses, and cut expenses while uncovering health system fraud. Recent years have seen significant progress in several record linking approaches, mostly due to breakthroughs in data mining and machine learning. Most of these novel techniques have not yet been incorporated into current record linking systems or are hidden in commercial software as 'black boxes.' This makes it difficult for users to learn about new record linking strategies and to compare old methods with new ones.
Innovative record linking systems must be tested and implemented at low cost using flexible tools. This work describes Febrl (Freely Extensible Biomedical Record Linkage), a freely available system released under an open-source software license. Its user-friendly graphical interface covers data cleansing and standardization, indexing (blocking), field comparisons, and record pair categorization. As a training tool, Febrl lets practitioners link data sets of up to several hundred thousand records and learn about both old and novel record linking approaches.
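The near-duplicate detection steps summarized in the abstract (strip HTML and scripts, drop stop words, reduce keywords by removing affixes, score two pages) could be sketched roughly as follows. This is an illustrative outline only, not the authors' algorithm: the abstract does not name a similarity measure, so Jaccard overlap of keyword sets is assumed here, and the stop-word list, the suffix list, and the function names `keywords` and `similarity` are all hypothetical. Only suffixes are handled; prefix stripping is omitted.

```python
import re
from html.parser import HTMLParser

# Assumption: a tiny illustrative stop-word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}
# Assumption: a naive suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "s")


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(html):
    """Return the set of stemmed, non-stop-word tokens found in an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return {strip_suffix(t) for t in tokens if t not in STOP_WORDS}


def similarity(html_a, html_b):
    """Jaccard similarity of the two keyword sets (assumed measure)."""
    a, b = keywords(html_a), keywords(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

In practice a pair of pages would presumably be flagged as near-duplicates when the score exceeds some threshold; the abstract does not state one, so any cutoff would have to be chosen experimentally.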
Author affiliations
1. Nalini, M. K. (nalini.ise@bmsce.ac.in): BMS College of Engineering, Department of ISE, Bangalore
2. K, Dhinakaran (maildhina.k@gmail.com): Dhanalakshmi College of Engineering, Department of Artificial Intelligence and Data Science, Chennai
3. D, Elantamilan (elantamilan.ds@gmail.com): Vallal P. T. Lee Chengalvaraya Naicker Arts and Science College, Department of Computer Science, Chennai-112
4. Gnanavel, R. (rgvelu22@gmail.com): R.M.K. College of Engineering and Technology, Department of Computer Science and Engineering, Chennai
5. Vinod, D. (dvinopaul@gmail.com): Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Dept of CSE, Chennai, India
DOI	10.1109/ACCAI53970.2022.9752554
EISBN	9781665495295 1665495294
URI	https://ieeexplore.ieee.org/document/9752554