Hierarchical Multi-class and Multi-label Text Classification for Crime Report: A Traditional Machine Learning Approach

Bibliographic details
Published in: IEEE Access, p. 1
Main authors: Vieira, Andre R., Santos, Glaucio De S., Melo, Wilson S., Rust, Luiz F.
Format: Journal Article
Language: English
Published: IEEE, 28 November 2025
ISSN: 2169-3536
Abstract Large amounts of digital data are produced daily through society's interactions with government and private companies. Digital transformation contributes to the growing volume of structured and unstructured data stored in digital media. Organizations build centralized data repositories to store information and provide it to business areas, supporting Business Intelligence solutions. Some databases store vast amounts of unstructured data, which must be systematized and classified to meet the data owner's needs. In criminal incident report systems, each recorded incident must be classified as a specific crime, with hundreds or thousands of categories presented to the responsible officer. This work explores a clustering approach that groups categories into a hierarchical tree of classes, enabling the use of Machine Learning (ML) models such as XGBoost for automated classification of criminal incident report narratives. As a case study, the Civil Police of the State of Rio de Janeiro (SEPOL/RJ) has a database with over 6.5 million records, which grows daily with reports from Judicial Police Units (JPU) across the state. Each new report requires manual classification. A hierarchical tree of classes was developed to segment the problem, allowing multiple XGBoost models to be used for automated classification. The proposed hierarchical model with 80 classes achieved an accuracy of 0.463, outperforming the baseline flat model, which reached 0.419, along with a 25.48% reduction in training time. The weighted average F1-score obtained by the hierarchical model was 0.48188, while the baseline model reached 0.44061. The improvement was statistically validated through a Wilcoxon signed-rank test, which yielded a p-value of 0.000010.
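The hierarchical idea described in the abstract can be illustrated with a minimal two-level sketch. This is not the authors' implementation. It assumes a hand-written class-to-group mapping where the paper derives groups by clustering, it uses a simple nearest-centroid model as a stand-in for XGBoost, and it replaces text embeddings with toy 2-D vectors. All class names, group names, and data points below are hypothetical.

```python
from collections import defaultdict
from math import dist


class CentroidClassifier:
    """Stand-in for an XGBoost model: predicts the label of the nearest centroid."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for x, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(x)
            sums[label] = [s + v for s, v in zip(sums[label], x)]
            counts[label] += 1
        # One mean vector per label seen during training.
        self.centroids = {
            label: [s / counts[label] for s in vec] for label, vec in sums.items()
        }
        return self

    def predict_one(self, x):
        return min(self.centroids, key=lambda label: dist(self.centroids[label], x))


class TwoLevelClassifier:
    """A root model picks a group of classes; a per-group model picks the class."""

    def __init__(self, class_to_group):
        self.class_to_group = class_to_group

    def fit(self, X, y):
        groups = [self.class_to_group[label] for label in y]
        # The root model is trained on group labels over the full training set.
        self.root = CentroidClassifier().fit(X, groups)
        # Each leaf model only sees the samples that belong to its own group.
        by_group = defaultdict(lambda: ([], []))
        for x, label, group in zip(X, y, groups):
            by_group[group][0].append(x)
            by_group[group][1].append(label)
        self.leaves = {
            group: CentroidClassifier().fit(Xg, yg)
            for group, (Xg, yg) in by_group.items()
        }
        return self

    def predict_one(self, x):
        group = self.root.predict_one(x)
        return self.leaves[group].predict_one(x)


# Hypothetical hierarchy: in the paper the grouping comes from clustering.
class_to_group = {
    "theft": "property",
    "robbery": "property",
    "assault": "person",
    "threat": "person",
}

# Toy 2-D vectors standing in for embeddings of report narratives.
X = [[0, 0], [0, 1], [2, 0], [2, 1], [10, 10], [10, 11], [12, 10], [12, 11]]
y = ["theft", "theft", "robbery", "robbery", "assault", "assault", "threat", "threat"]

model = TwoLevelClassifier(class_to_group).fit(X, y)
print(model.predict_one([1.8, 0.4]))
print(model.predict_one([10.2, 10.6]))
```

Because each leaf model is trained only on the samples of its own group, every individual model faces a smaller label space than a single flat classifier over all classes. The trade-off is that a mistake at the root cannot be corrected further down the tree.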
Author Santos, Glaucio De S.
Rust, Luiz F.
Vieira, Andre R.
Melo, Wilson S.
Author_xml – sequence: 1
  givenname: Andre R.
  orcidid: 0009-0006-3193-7375
  surname: Vieira
  fullname: Vieira, Andre R.
  email: arvieira@labnet.nce.ufrj.br
  organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
– sequence: 2
  givenname: Glaucio De S.
  surname: Santos
  fullname: Santos, Glaucio De S.
  organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
– sequence: 3
  givenname: Wilson S.
  orcidid: 0000-0002-7710-7995
  surname: Melo
  fullname: Melo, Wilson S.
  organization: National Institute of Metrology, Quality, and Technology, Duque de Caxias, RJ, Brasil
– sequence: 4
  givenname: Luiz F.
  orcidid: 0000-0001-6131-7771
  surname: Rust
  fullname: Rust, Luiz F.
  organization: Programa de Pós-Graduação em Informática, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
CODEN IAECCG
ContentType Journal Article
DBID 97E
ESBDL
RIA
RIE
DOI 10.1109/ACCESS.2025.3638984
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE Xplore : Open Access Journals and Conferences [open access]
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 2169-3536
EndPage 1
ExternalDocumentID 11271427
Genre orig-research
GrantInformation_xml – fundername: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
  funderid: 10.13039/501100002322
IEDL.DBID RIE
IngestDate Wed Dec 10 09:49:55 EST 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
ORCID 0009-0006-3193-7375
0000-0002-7710-7995
0000-0001-6131-7771
OpenAccessLink https://ieeexplore.ieee.org/document/11271427
PageCount 1
ParticipantIDs ieee_primary_11271427
PublicationCentury 2000
PublicationDate 20251128
PublicationDateYYYYMMDD 2025-11-28
PublicationDate_xml – month: 11
  year: 2025
  text: 20251128
  day: 28
PublicationDecade 2020
PublicationTitle IEEE access
PublicationTitleAbbrev Access
PublicationYear 2025
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0000816957
Score 2.3646193
SecondaryResourceType online_first
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Crime Narratives
Hierarchical Multilabel Text Classification
Public Security Data
Semantic Clustering
Text Embeddings
XGBoost
Title Hierarchical Multi-class and Multi-label Text Classification for Crime Report: A Traditional Machine Learning Approach
URI https://ieeexplore.ieee.org/document/11271427
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  databaseCode: DOA
  dateStart: 20130101
  customDbUrl:
  isFulltext: true
  eissn: 2169-3536
  dateEnd: 99991231
  titleUrlDefault: https://www.doaj.org/
  omitProxy: false
  ssIdentifier: ssj0000816957
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources
  databaseCode: M~E
  dateStart: 20130101
  customDbUrl:
  isFulltext: true
  eissn: 2169-3536
  dateEnd: 99991231
  titleUrlDefault: https://road.issn.org
  omitProxy: false
  ssIdentifier: ssj0000816957
  providerName: ISSN International Centre