Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet

Research in this area aims to create a new and efficient method for detecting near-duplicates in online content. Web pages fetched by a search engine crawler are first parsed to remove HTML elements and JavaScript. After this phase, common keywords (stop words) are removed from the crawled pages. The af...

Bibliographic Details
Published in: 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), pp. 1-9
Main Authors: Nalini, M. K.; K, Dhinakaran; D, Elantamilan; Gnanavel, R.; Vinod, D.
Format: Conference Proceeding
Language: English
Published: IEEE 28.01.2022
Abstract This research aims to create a new and efficient method for detecting near-duplicates in online content. Web pages fetched by a search engine crawler are first parsed to remove HTML elements and JavaScript. Common keywords (stop words) are then removed from the crawled pages, and affixes (prefixes and suffixes) are stripped as keywords are extracted from the text. Finally, the collected keywords are used to compute a similarity score between two documents. An extensive experimental investigation on several real datasets indicates that the suggested algorithms outperform earlier ones. Because there are so many documents on the internet, web search engines have difficulty finding relevant results for their users, and overhead costs for search engines have risen dramatically due to an overabundance of identical or nearly identical documents. The web crawling community has long been aware of duplicate and near-duplicate pages. Users want search engines to return relevant results on the first page, free of duplicates and redundancies, which makes this an important criterion.
Record or data linkage is an essential enabler in the health industry, as linked data is an economic resource that may help improve research into health policy, detect adverse medication responses, and cut costs while uncovering health system fraud. Recent years have seen significant progress in several record linkage approaches, mostly due to breakthroughs in data mining and machine learning. Most of these novel techniques have not yet been incorporated into current record linkage systems or are hidden in commercial software as 'black boxes', which makes it difficult for users to learn about new record linkage strategies and to compare existing methods with new ones. Innovative record linkage systems must be tested and implemented at low cost using flexible tools. This work describes FEBRL (Freely Extensible Biomedical Record Linkage), a freely available system released under an open-source software license. Its user-friendly graphical interface covers data cleansing and standardization, indexing (blocking), field comparison, and record pair classification. Used as a training tool, Febrl lets practitioners link data sets of up to several hundred thousand records and learn about both established and novel record linkage approaches.
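The near-duplicate pipeline outlined in the abstract (strip markup and scripts, drop stop words, reduce words by removing affixes, then score two pages by their keywords) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tiny stop-word list, the `crude_stem` suffix stripper (a stand-in for a real stemmer), and the use of Jaccard overlap as the similarity score, since the record does not say which measure the authors use.

```python
import re
from html.parser import HTMLParser

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(page):
    """Return the text of an HTML page without tags or script bodies."""
    parser = _TextExtractor()
    parser.feed(page)
    parser.close()
    return " ".join(parser.chunks)


def crude_stem(word):
    """Hypothetical suffix stripper, used here only as a stand-in for stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(page):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", strip_html(page).lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def similarity(page_a, page_b):
    """Jaccard overlap of the two keyword sets (assumed measure)."""
    ka, kb = keywords(page_a), keywords(page_b)
    if not ka and not kb:
        return 1.0  # two empty pages are treated as identical
    return len(ka & kb) / len(ka | kb)
```

In such a sketch, two pages would be flagged as near-duplicates when `similarity` exceeds a chosen threshold; the threshold value is a tuning decision that this record does not specify.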
Author D, Elantamilan
K, Dhinakaran
Nalini, M. K.
Vinod, D.
Gnanavel, R.
Author_xml – sequence: 1
  givenname: M. K.
  surname: Nalini
  fullname: Nalini, M. K.
  email: nalini.ise@bmsce.ac.in
  organization: BMS College of Engineering, Department of ISE, Bangalore
– sequence: 2
  givenname: Dhinakaran
  surname: K
  fullname: K, Dhinakaran
  email: maildhina.k@gmail.com
  organization: Dhanalakshmi College of Engineering, Department of Artificial Intelligence and Data Science, Chennai
– sequence: 3
  givenname: Elantamilan
  surname: D
  fullname: D, Elantamilan
  email: elantamilan.ds@gmail.com
  organization: Vallal P. T. Lee Chengalvaraya Naicker Arts and Science College, Department of Computer Science, Chennai-112
– sequence: 4
  givenname: R.
  surname: Gnanavel
  fullname: Gnanavel, R.
  email: rgvelu22@gmail.com
  organization: R.M.K. College of Engineering and Technology, Department of Computer Science and Engineering, Chennai
– sequence: 5
  givenname: D.
  surname: Vinod
  fullname: Vinod, D.
  email: dvinopaul@gmail.com
  organization: Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Dept of CSE, Chennai, India
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ACCAI53970.2022.9752554
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9781665495295
1665495294
EndPage 9
ExternalDocumentID 9752554
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i118t-4ab45c4f39a4b6efcd6c93ee1e86b5e1363fd697da4d8c3e2e01a3f92819ff473
IEDL.DBID RIE
IngestDate Thu Jun 29 18:36:49 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i118t-4ab45c4f39a4b6efcd6c93ee1e86b5e1363fd697da4d8c3e2e01a3f92819ff473
PageCount 9
ParticipantIDs ieee_primary_9752554
PublicationCentury 2000
PublicationDate 2022-Jan.-28
PublicationDateYYYYMMDD 2022-01-28
PublicationDate_xml – month: 01
  year: 2022
  text: 2022-Jan.-28
  day: 28
PublicationDecade 2020
PublicationTitle 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI)
PublicationTitleAbbrev ACCAI
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.7836597
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Costs
Couplings
data cleaning
data integration
data matching
deduplication
Duplicates
GUI
Hidden Markov models
HTML tags
JavaScript
open source software
record linkage software
Search engines
Stemming
Threshold
Health data linkage
Training
Web pages
Title Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet
URI https://ieeexplore.ieee.org/document/9752554