Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet
| Published in: | 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) pp. 1 - 9 |
|---|---|
| Main Authors: | Nalini, M. K.; K, Dhinakaran; D, Elantamilan; Gnanavel, R.; Vinod, D. |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | IEEE, 28.01.2022 |
| Abstract | This research aims to develop a new, efficient method for detecting near-duplicates in online content. Web pages that a search engine has crawled are first parsed to remove HTML elements and JavaScript. Common words (stop words) are then removed from the crawled pages. Next, affixes (prefixes and suffixes) are stripped from the remaining words to extract keywords from the text. Finally, the collected keywords are used to generate a similarity score between two pages. An extensive experimental investigation on several real datasets indicates that the proposed algorithms outperform earlier ones. Because there are so many documents on the internet, web search engines have difficulty finding relevant results for their users. The overabundance of identical or nearly identical documents has sharply increased overhead costs for search engines. The web crawling community has long been aware of duplicate and near-duplicate web pages. Users expect search engines to return relevant results on the first page, free of duplicates and redundancy. Record or data linkage is an essential enabler in the health sector, as linked data is an economic resource that can help improve research into health policy, detect adverse drug reactions, and reduce costs while uncovering fraud in the health system. Recent years have seen significant progress in several record linkage approaches, mostly due to advances in data mining and machine learning. Most of these novel techniques have not yet been incorporated into current record linkage systems, or are hidden inside commercial software as 'black boxes'. This makes it difficult for users to learn about new record linkage techniques and to compare existing methods with new ones. 
Flexible tools are needed so that innovative record linkage systems can be tested and implemented at low cost. This work describes Febrl (Freely Extensible Biomedical Record Linkage), a system freely available under an open-source software license. Its user-friendly graphical interface covers data cleaning and standardization, indexing (blocking), field comparison, and record pair classification. Febrl can serve as a training tool with which practitioners learn about both traditional and novel record linkage approaches, and it can be used to link data sets of up to several hundred thousand records. |
|---|---|
| Author | D, Elantamilan; K, Dhinakaran; Nalini, M. K.; Vinod, D.; Gnanavel, R. |
| Author_xml | – sequence: 1 givenname: M. K. surname: Nalini fullname: Nalini, M. K. email: nalini.ise@bmsce.ac.in organization: BMS College of Engineering, Department of ISE, Bangalore – sequence: 2 givenname: Dhinakaran surname: K fullname: K, Dhinakaran email: maildhina.k@gmail.com organization: Dhanalakshmi College of Engineering, Department of Artificial Intelligence and Data Science, Chennai – sequence: 3 givenname: Elantamilan surname: D fullname: D, Elantamilan email: elantamilan.ds@gmail.com organization: Vallal P. T. Lee Chengalvaraya Naicker Arts and Science College, Department of Computer Science, Chennai-112 – sequence: 4 givenname: R. surname: Gnanavel fullname: Gnanavel, R. email: rgvelu22@gmail.com organization: R.M.K. College of Engineering and Technology, Department of Computer Science and Engineering, Chennai – sequence: 5 givenname: D. surname: Vinod fullname: Vinod, D. email: dvinopaul@gmail.com organization: Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Dept of CSE, Chennai, India |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/ACCAI53970.2022.9752554 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 9781665495295 1665495294 |
| EndPage | 9 |
| ExternalDocumentID | 9752554 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IL CBEJK RIE RIL |
| ID | FETCH-LOGICAL-i118t-4ab45c4f39a4b6efcd6c93ee1e86b5e1363fd697da4d8c3e2e01a3f92819ff473 |
| IEDL.DBID | RIE |
| IngestDate | Thu Jun 29 18:36:49 EDT 2023 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i118t-4ab45c4f39a4b6efcd6c93ee1e86b5e1363fd697da4d8c3e2e01a3f92819ff473 |
| PageCount | 9 |
| ParticipantIDs | ieee_primary_9752554 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-Jan.-28 |
| PublicationDateYYYYMMDD | 2022-01-28 |
| PublicationDate_xml | – month: 01 year: 2022 text: 2022-Jan.-28 day: 28 |
| PublicationDecade | 2020 |
| PublicationTitle | 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) |
| PublicationTitleAbbrev | ACCAI |
| PublicationYear | 2022 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| Score | 1.7837602 |
| Snippet | Research in this area aims to create a new and efficient method for detecting near-duplicates in online content. Web pages that a search engine has scoured are... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Costs; Couplings; data cleaning; data integration; data matching; deduplication; Duplicates; GUI; Hidden Markov models; HTML tags; Java Scripts; open source software; record linkage software; Search engines; Stemming; Threshold. Health data linkage; Training; Web pages |
| Title | Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet |
| URI | https://ieeexplore.ieee.org/document/9752554 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=2022+International+Conference+on+Advances+in+Computing%2C+Communication+and+Applied+Informatics+%28ACCAI%29&rft.atitle=Implementation+of+Indexing+Techniques+to+Prevent+Data+Leakage+and+Duplication+in+Internet&rft.au=Nalini%2C+M.+K.&rft.au=K%2C+Dhinakaran&rft.au=D%2C+Elantamilan&rft.au=Gnanavel%2C+R.&rft.date=2022-01-28&rft.pub=IEEE&rft.spage=1&rft.epage=9&rft_id=info:doi/10.1109%2FACCAI53970.2022.9752554&rft.externalDocID=9752554 |
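
The near-duplicate detection steps summarized in the abstract (strip HTML and scripts, drop stop words, remove affixes, score keyword overlap) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word list, the naive suffix stripper standing in for a real stemmer, the choice of Jaccard overlap as the similarity score, and all names (`TextExtractor`, `stem`, `keywords`, `similarity`) are assumptions made for illustration, since the record does not specify them.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}
# Naive suffix list standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ing", "ed", "s")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(html):
    """Return the set of stemmed, non-stop-word tokens in a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return {stem(t) for t in tokens if t not in STOP_WORDS}


def similarity(html_a, html_b):
    """Jaccard overlap of the two keyword sets, in the range 0.0 to 1.0."""
    a, b = keywords(html_a), keywords(html_b)
    if not a and not b:
        return 1.0  # two empty pages are treated as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = "<html><script>var x = 1;</script><p>Search engines crawl the web pages</p></html>"
    page_b = "<p>The search engine crawled a web page</p>"
    print(similarity(page_a, page_b))
```

In such a sketch, two pages would be flagged as near-duplicates when the score exceeds a chosen threshold; the threshold value is a tuning decision that this record does not state.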