Implementation of Indexing Techniques to Prevent Data Leakage and Duplication in Internet

Bibliographic Details
Published in: 2022 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), pp. 1-9
Main authors: Nalini, M. K.; Dhinakaran, K.; Elantamilan, D.; Gnanavel, R.; Vinod, D.
Format: Conference paper
Language: English
Published: IEEE, 28 January 2022
Online access: Full text
Description
Abstract: Research in this area aims to create a new and efficient method for detecting near-duplicates in online content. Web pages crawled by a search engine are first parsed to remove HTML elements and JavaScript. Common words, or stop words, are then removed from the crawled pages. Keywords are extracted from the remaining text by stripping affixes (prefixes and suffixes). Finally, the collected keywords are used to compute a similarity score between two documents. An extensive experimental investigation on several real-world datasets shows that the proposed algorithms outperform existing ones.

Because the internet holds so many documents, web search engines have had difficulty finding relevant results for their users. Overhead costs for search engines have risen dramatically due to an overabundance of nearly similar or identical documents. The web-crawling community has long been aware of duplicate and near-identical pages. Users expect search engines to return relevant results on the first page, free of duplicates and redundancies; this is an important criterion.

Record or data linkage is an essential enabler in the health industry, as linked data is an economic resource that can help advance health-policy research, detect adverse drug reactions, and cut costs while uncovering health-system fraud. Recent years have seen significant progress in record linkage techniques, largely due to breakthroughs in data mining and machine learning. Most of these new techniques, however, have not yet been incorporated into current record linkage systems, or are hidden in commercial software as 'black boxes'. This makes it difficult for users to learn about new record linkage strategies and to compare old methods against new ones. Flexible, low-cost tools are needed to test and implement innovative record linkage systems. This work describes FEBRL (Freely Extensible Biomedical Record Linkage), a system released under a free, open-source software license. Its user-friendly graphical interface covers data cleansing and standardization, indexing (blocking), field comparison, and record-pair classification. Used as a training tool, Febrl lets practitioners link data sets with up to several hundred thousand records and learn about both traditional and novel record linkage approaches.
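A minimal sketch of the described near-duplicate pipeline, in Python. The stop-word set, affix lists, and the choice of Jaccard similarity over the extracted keywords are assumptions for illustration; the abstract does not specify these details.

    import re
    from html.parser import HTMLParser

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}
    PREFIXES = ("un", "re", "pre", "dis")   # assumed affix lists
    SUFFIXES = ("ing", "ed", "ly", "es", "s")

    class TextExtractor(HTMLParser):
        """Drops tags and the contents of <script>/<style> elements."""
        def __init__(self):
            super().__init__()
            self.parts, self._skip = [], False
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip = True
        def handle_endtag(self, tag):
            if tag in ("script", "style"):
                self._skip = False
        def handle_data(self, data):
            if not self._skip:
                self.parts.append(data)

    def strip_affixes(word):
        # Crude prefix/suffix stripping; length checks avoid over-stripping.
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                word = word[len(p):]
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                word = word[:-len(s)]
                break
        return word

    def keywords(html):
        parser = TextExtractor()
        parser.feed(html)
        words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
        return {strip_affixes(w) for w in words if w not in STOP_WORDS}

    def similarity(html_a, html_b):
        # Jaccard overlap of the two keyword sets, in [0, 1].
        ka, kb = keywords(html_a), keywords(html_b)
        return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

Two pages whose score exceeds a tuned threshold (say 0.9) would then be flagged as near-duplicates.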
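The indexing (blocking) step that Febrl exposes can likewise be sketched: records sharing a blocking key become candidate pairs, and all other pairs are never compared in detail. The field names and key below are hypothetical choices for illustration; Febrl itself offers several configurable indexing methods.

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record):
        # Hypothetical key: first three letters of the surname plus birth year.
        return (record["surname"][:3].lower(), record["birth_year"])

    def candidate_pairs(records):
        blocks = defaultdict(list)
        for rec in records:
            blocks[blocking_key(rec)].append(rec)
        for block in blocks.values():
            # Field comparison and record-pair classification run only
            # on the pairs generated inside each block.
            yield from combinations(block, 2)

    records = [
        {"id": 1, "surname": "Smith", "birth_year": 1980},
        {"id": 2, "surname": "Smyth", "birth_year": 1980},
        {"id": 3, "surname": "Smith", "birth_year": 1980},
    ]
    for a, b in candidate_pairs(records):
        print(a["id"], b["id"])   # prints "1 3": only these share a block

Blocking trades a little recall (record 2 is missed because its key differs) for a large reduction in the number of comparisons, which is what makes linking data sets of several hundred thousand records feasible.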
DOI: 10.1109/ACCAI53970.2022.9752554