Hierarchical Multi-class and Multi-label Text Classification for Crime Report: A Traditional Machine Learning Approach
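| Published in: | IEEE Access, p. 1 |
|---|---|
| Main authors: | Vieira, Andre R.; Santos, Glaucio De S.; Melo, Wilson S.; Rust, Luiz F. |
| Format: | Journal Article |
| Language: | English |
| Published: | IEEE, 28.11.2025 |
| Subjects: | Crime Narratives; Hierarchical Multilabel Text Classification; Public Security Data; Semantic Clustering; Text Embeddings; XGBoost |
| ISSN: | 2169-3536 |
| Online access: | Get full text |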
| Abstract | Large amounts of digital data are produced daily through society's interactions with government and private companies. Digital transformation contributes to the increasing amount of structured and unstructured data stored in digital media. Organizations build centralized data repositories to store and provide information to business areas, supporting Business Intelligence solutions. Some databases store vast amounts of unstructured data, which must be systematized and classified to meet the data owner's needs. In criminal incident report systems, each recorded incident must be classified as a specific crime, with hundreds or thousands of categories presented to the responsible officer. This work explores a clustering approach to group categories into a hierarchical tree of classes, enabling the use of Machine Learning (ML) models like XGBoost for automated classification of criminal incident report narratives. As a case study, the Civil Police of the State of Rio de Janeiro (SEPOL/RJ) has a database with over 6.5 million records, growing daily from Judicial Police Units (JPU) across the state. Each new report requires manual classification. A hierarchical tree of classes was developed to segment the problem, allowing separate XGBoost models to perform automated classification. The proposed hierarchical model with 80 classes achieved an accuracy of 0.463, outperforming the baseline flat model, which reached 0.419, along with a 25.48% reduction in training time. The weighted average F1-score obtained by the hierarchical model was 0.48188, while the baseline model reached 0.44061. The improvement was statistically validated through a Wilcoxon signed-rank test, which yielded a p-value of 0.000010. |
|---|---|
| Author | Santos, Glaucio De S.; Rust, Luiz F.; Vieira, Andre R.; Melo, Wilson S. |
| Author_xml | – sequence: 1 givenname: Andre R. orcidid: 0009-0006-3193-7375 surname: Vieira fullname: Vieira, Andre R. email: arvieira@labnet.nce.ufrj.br organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil – sequence: 2 givenname: Glaucio De S. surname: Santos fullname: Santos, Glaucio De S. organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil – sequence: 3 givenname: Wilson S. orcidid: 0000-0002-7710-7995 surname: Melo fullname: Melo, Wilson S. organization: National Institute of Metrology, Quality, and Technology, Duque de Caxias, RJ, Brasil – sequence: 4 givenname: Luiz F. orcidid: 0000-0001-6131-7771 surname: Rust fullname: Rust, Luiz F. organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil |
| CODEN | IAECCG |
| ContentType | Journal Article |
| DBID | 97E ESBDL RIA RIE |
| DOI | 10.1109/ACCESS.2025.3638984 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present; IEEE Xplore : Open Access Journals and Conferences [open access]; IEEE All-Society Periodicals Package (ASPP) 1998–Present; IEEE Electronic Library (IEL) |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Engineering |
| EISSN | 2169-3536 |
| EndPage | 1 |
| ExternalDocumentID | 11271427 |
| Genre | orig-research |
| GrantInformation_xml | – fundername: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior funderid: 10.13039/501100002322 |
| GroupedDBID | 0R~ 5VS 6IK 97E AAJGR ABVLG ACGFS ADBBV ADCSY ALMA_UNASSIGNED_HOLDINGS BCNDV BEFXN BFFAM BGNUA BKEBE BPEOZ EBS ESBDL GROUPED_DOAJ IPLJI JAVBF KQ8 M~E O9- OCL OK1 RIA RIE RNS |
| ID | FETCH-LOGICAL-i1384-2534ece2b465530417575f4dbca70a1323eee2ef8274a6e3f9bc65da85baf4773 |
| IEDL.DBID | RIE |
| IngestDate | Wed Dec 10 09:49:55 EST 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i1384-2534ece2b465530417575f4dbca70a1323eee2ef8274a6e3f9bc65da85baf4773 |
| ORCID | 0009-0006-3193-7375 0000-0002-7710-7995 0000-0001-6131-7771 |
| OpenAccessLink | https://ieeexplore.ieee.org/document/11271427 |
| PageCount | 1 |
| ParticipantIDs | ieee_primary_11271427 |
| PublicationCentury | 2000 |
| PublicationDate | 20251128 |
| PublicationDateYYYYMMDD | 2025-11-28 |
| PublicationDate_xml | – month: 11 year: 2025 text: 20251128 day: 28 |
| PublicationDecade | 2020 |
| PublicationTitle | IEEE access |
| PublicationTitleAbbrev | Access |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssj0000816957 |
| Score | 2.3646193 |
| SecondaryResourceType | online_first |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Crime Narratives; Hierarchical Multilabel Text Classification; Public Security Data; Semantic Clustering; Text Embeddings; XGBoost |
| Title | Hierarchical Multi-class and Multi-label Text Classification for Crime Report: A Traditional Machine Learning Approach |
| URI | https://ieeexplore.ieee.org/document/11271427 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| journalDatabaseRights | – providerCode: PRVAON databaseName: DOAJ Directory of Open Access Journals databaseCode: DOA dateStart: 20130101 customDbUrl: isFulltext: true eissn: 2169-3536 dateEnd: 99991231 titleUrlDefault: https://www.doaj.org/ omitProxy: false ssIdentifier: ssj0000816957 providerName: Directory of Open Access Journals – providerCode: PRVHPJ databaseName: ROAD: Directory of Open Access Scholarly Resources databaseCode: M~E dateStart: 20130101 customDbUrl: isFulltext: true eissn: 2169-3536 dateEnd: 99991231 titleUrlDefault: https://road.issn.org omitProxy: false ssIdentifier: ssj0000816957 providerName: ISSN International Centre |
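
The abstract describes segmenting a large flat label set into a hierarchical tree of classes so that separate models each handle a smaller problem. The following is a minimal sketch of that routing idea only, not the authors' implementation. Several things here are assumptions made for illustration: the two-level tree is written by hand rather than derived by clustering, the crime categories and example narratives are invented, and `TokenCountClassifier` is a toy stand-in for a real model such as XGBoost.

```python
from collections import Counter, defaultdict


class TokenCountClassifier:
    """Toy stand-in for a real model (e.g. XGBoost): scores labels by token overlap."""

    def fit(self, texts, labels):
        # Count how often each token appears in the training texts of each label.
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        tokens = text.lower().split()

        def score(label):
            # Normalize by the label's total token count so that labels
            # with more training text do not win by volume alone.
            label_counts = self.counts[label]
            return sum(label_counts[t] for t in tokens) / sum(label_counts.values())

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.counts), key=score)


class HierarchicalClassifier:
    """Two-level tree: a root model picks a group, a per-group model picks the leaf."""

    def __init__(self, leaf_to_group):
        # Hypothetical hand-written tree; the paper builds its tree by clustering.
        self.leaf_to_group = leaf_to_group

    def fit(self, texts, leaves):
        groups = [self.leaf_to_group[leaf] for leaf in leaves]
        self.root = TokenCountClassifier().fit(texts, groups)
        self.children = {}
        for group in set(groups):
            # Each child model only sees the examples that belong to its group.
            idx = [i for i, g in enumerate(groups) if g == group]
            self.children[group] = TokenCountClassifier().fit(
                [texts[i] for i in idx], [leaves[i] for i in idx]
            )
        return self

    def predict(self, text):
        group = self.root.predict(text)
        return group, self.children[group].predict(text)


# Invented categories and narratives, for illustration only.
leaf_to_group = {
    "robbery": "property",
    "burglary": "property",
    "assault": "person",
    "threat": "person",
}
train_texts = [
    "suspect took wallet at gunpoint on the street",
    "house broken into while owners were away jewelry stolen",
    "victim was punched and kicked outside a bar",
    "victim received messages threatening to harm her",
]
train_leaves = ["robbery", "burglary", "assault", "threat"]

model = HierarchicalClassifier(leaf_to_group).fit(train_texts, train_leaves)
print(model.predict("victim was punched by a stranger"))
print(model.predict("wallet stolen at gunpoint"))
```

Because each child model is trained only on its own group's examples, every individual model faces fewer classes and less data than a single flat model would, which is one plausible reading of why the abstract reports a shorter training time for the hierarchical setup. A misrouted prediction at the root cannot be recovered at the leaf level in this sketch.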