Topic models do not model topics: epistemological remarks and steps towards best practices

The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological...

Detailed bibliography
Published in: Journal of data mining and digital humanities, Volume 2021
Main author: Shadrova, Anna
Format: Journal Article
Language: English
Publication details: INRIA 27.10.2021
Nicolas Turenne
Subject:
ISSN: 2416-5999
Online access: Get full text
Abstract The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological concerns centering around the interface between topic modeling and linguistic concepts and the argumentative embedding of evidence obtained through topic modeling. It concludes that topic modeling in its present state of methodological integration does not meet the requirements of an independent research method. It operates from relevantly unrealistic assumptions, is non-deterministic, cannot effectively be validated against a reasonable number of competing models, does not lock into a well-defined linguistic interface, and does not scholarly model topics in the sense of themes or content. These features are intrinsic and make the interpretation of its results prone to apophenia (the human tendency to perceive random sets of elements as meaningful patterns) and confirmation bias (the human tendency to perceptually prefer patterns that are in alignment with pre-existing biases). While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceptual topics in any reliable way.
Author Shadrova, Anna
Author_xml – sequence: 1
  givenname: Anna
  surname: Shadrova
  fullname: Shadrova, Anna
  organization: Humboldt University Of Berlin
BackLink https://hal.science/hal-03261599$$DView record in HAL
CitedBy_id crossref_primary_10_1057_s41599_025_05382_x
crossref_primary_10_1177_08944393231212251
crossref_primary_10_1007_s13437_023_00307_4
crossref_primary_10_1111_oli_12458
crossref_primary_10_3366_cor_2025_0337
crossref_primary_10_1111_medu_15202
crossref_primary_10_1007_s11135_023_01802_9
crossref_primary_10_1108_IJOPM_03_2023_0239
crossref_primary_10_1515_zgl_2025_2005
crossref_primary_10_1111_1467_8551_12816
crossref_primary_10_1515_opli_2024_0036
crossref_primary_10_1177_00076503241306615
ContentType Journal Article
Copyright Attribution - NonCommercial - ShareAlike
Copyright_xml – notice: Attribution - NonCommercial - ShareAlike
DBID AAYXX
CITATION
1XC
BXJBU
IHQJB
VOOES
DOA
DOI 10.46298/jdmdh.7595
DatabaseName CrossRef
Hyper Article en Ligne (HAL)
HAL-SHS: Archive ouverte en Sciences de l'Homme et de la Société
HAL-SHS: Archive ouverte en Sciences de l'Homme et de la Société (Open Access)
Hyper Article en Ligne (HAL) (Open Access)
DOAJ Directory of Open Access Journals
DatabaseTitle CrossRef
DatabaseTitleList
CrossRef

Database_xml – sequence: 1
  dbid: DOA
  name: DOAJ Directory of Open Access Journals
  url: https://www.doaj.org/
  sourceTypes: Open Website
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 2416-5999
ExternalDocumentID oai_doaj_org_article_02b0cec15a3747cdb23903e3f05dbe34
oai:HAL:hal-03261599v3
10_46298_jdmdh_7595
GroupedDBID 5VS
AAFWJ
AAYXX
ADBBV
ADQAK
AFPKN
ALMA_UNASSIGNED_HOLDINGS
BCNDV
CITATION
FRP
GROUPED_DOAJ
KQ8
M~E
OK1
1XC
BXJBU
IHQJB
VOOES
ID FETCH-LOGICAL-c2155-d91b3c44c433776cfa61d3113bb5eb14ed149c77fd64cc60abad15959cc4923
IEDL.DBID DOA
ISSN 2416-5999
IngestDate Fri Oct 03 12:46:06 EDT 2025
Tue Oct 14 20:44:43 EDT 2025
Tue Nov 18 22:10:17 EST 2025
Sat Nov 29 04:10:29 EST 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords Topic modeling
digital humanities
information extraction for scientific inquiry
Language English
License https://creativecommons.org/licenses/by-nc-sa/4.0
Attribution - NonCommercial - ShareAlike: http://creativecommons.org/licenses/by-nc-sa
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c2155-d91b3c44c433776cfa61d3113bb5eb14ed149c77fd64cc60abad15959cc4923
ORCID 0000-0001-6455-2427
OpenAccessLink https://doaj.org/article/02b0cec15a3747cdb23903e3f05dbe34
ParticipantIDs doaj_primary_oai_doaj_org_article_02b0cec15a3747cdb23903e3f05dbe34
hal_primary_oai_HAL_hal_03261599v3
crossref_primary_10_46298_jdmdh_7595
crossref_citationtrail_10_46298_jdmdh_7595
PublicationCentury 2000
PublicationDate 2021-10-27
PublicationDateYYYYMMDD 2021-10-27
PublicationDate_xml – month: 10
  year: 2021
  text: 2021-10-27
  day: 27
PublicationDecade 2020
PublicationTitle Journal of data mining and digital humanities
PublicationYear 2021
Publisher INRIA
Nicolas Turenne
Publisher_xml – name: INRIA
– name: Nicolas Turenne
SSID ssj0001548555
Score 2.1611493
Snippet The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their...
SourceID doaj
hal
crossref
SourceType Open Website
Open Access Repository
Enrichment Source
Index Database
SubjectTerms [info.info-dl]computer science [cs]/digital libraries [cs.dl]
[shs.langue]humanities and social sciences/linguistics
[shs]humanities and social sciences
Computer Science
digital humanities
Digital Libraries
Humanities and Social Sciences
information extraction for scientific inquiry
Linguistics
topic modeling
Title Topic models do not model topics: epistemological remarks and steps towards best practices
URI https://hal.science/hal-03261599
https://doaj.org/article/02b0cec15a3747cdb23903e3f05dbe34
Volume 2021
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  customDbUrl:
  eissn: 2416-5999
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0001548555
  issn: 2416-5999
  databaseCode: DOA
  dateStart: 20140101
  isFulltext: true
  titleUrlDefault: https://www.doaj.org/
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources
  customDbUrl:
  eissn: 2416-5999
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0001548555
  issn: 2416-5999
  databaseCode: M~E
  dateStart: 20140101
  isFulltext: true
  titleUrlDefault: https://road.issn.org
  providerName: ISSN International Centre
linkProvider Directory of Open Access Journals