Topic models do not model topics: epistemological remarks and steps towards best practices
The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological...
| Published in: | Journal of data mining and digital humanities, Volume 2021 |
|---|---|
| Main author: | Shadrova, Anna |
| Format: | Journal Article |
| Language: | English |
| Publication details: | INRIA; Nicolas Turenne; 27.10.2021 |
| Subjects: | |
| ISSN: | 2416-5999 |
| Online access: | Get full text |
| Abstract | The social sciences and digital humanities have recently adopted the machine learning technique of topic modeling to address research questions in their fields. This is problematic in a number of ways, some of which have not received much attention in the debate yet. This paper adds epistemological concerns centering around the interface between topic modeling and linguistic concepts and the argumentative embedding of evidence obtained through topic modeling. It concludes that topic modeling in its present state of methodological integration does not meet the requirements of an independent research method. It operates from relevantly unrealistic assumptions, is non-deterministic, cannot effectively be validated against a reasonable number of competing models, does not lock into a well-defined linguistic interface, and does not scholarly model topics in the sense of themes or content. These features are intrinsic and make the interpretation of its results prone to apophenia (the human tendency to perceive random sets of elements as meaningful patterns) and confirmation bias (the human tendency to perceptually prefer patterns that are in alignment with pre-existing biases). While partial validation of the statistical model is possible, a conceptual validation would require an extended triangulation with other methods and human ratings, and clarification of whether statistical distinctivity of lexical co-occurrence correlates with conceptual topics in any reliable way. |
|---|---|
| Author | Shadrova, Anna |
| Author_xml | sequence: 1; givenname: Anna; surname: Shadrova; fullname: Shadrova, Anna; organization: Humboldt University of Berlin |
| BackLink | https://hal.science/hal-03261599$$DView record in HAL |
| CitedBy_id | crossref_primary_10_1057_s41599_025_05382_x crossref_primary_10_1177_08944393231212251 crossref_primary_10_1007_s13437_023_00307_4 crossref_primary_10_1111_oli_12458 crossref_primary_10_3366_cor_2025_0337 crossref_primary_10_1111_medu_15202 crossref_primary_10_1007_s11135_023_01802_9 crossref_primary_10_1108_IJOPM_03_2023_0239 crossref_primary_10_1515_zgl_2025_2005 crossref_primary_10_1111_1467_8551_12816 crossref_primary_10_1515_opli_2024_0036 crossref_primary_10_1177_00076503241306615 |
| ContentType | Journal Article |
| Copyright | Attribution - NonCommercial - ShareAlike |
| Copyright_xml | – notice: Attribution - NonCommercial - ShareAlike |
| DBID | AAYXX CITATION 1XC BXJBU IHQJB VOOES DOA |
| DOI | 10.46298/jdmdh.7595 |
| DatabaseName | CrossRef; Hyper Article en Ligne (HAL); HAL-SHS: Archive ouverte en Sciences de l'Homme et de la Société; HAL-SHS: Archive ouverte en Sciences de l'Homme et de la Société (Open Access); Hyper Article en Ligne (HAL) (Open Access); DOAJ Directory of Open Access Journals |
| DatabaseTitle | CrossRef |
| DatabaseTitleList | CrossRef |
| Database_xml | – sequence: 1 dbid: DOA name: DOAJ Directory of Open Access Journals url: https://www.doaj.org/ sourceTypes: Open Website |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISSN | 2416-5999 |
| ExternalDocumentID | oai_doaj_org_article_02b0cec15a3747cdb23903e3f05dbe34 oai:HAL:hal-03261599v3 10_46298_jdmdh_7595 |
| GroupedDBID | 5VS AAFWJ AAYXX ADBBV ADQAK AFPKN ALMA_UNASSIGNED_HOLDINGS BCNDV CITATION FRP GROUPED_DOAJ KQ8 M~E OK1 1XC BXJBU IHQJB VOOES |
| ID | FETCH-LOGICAL-c2155-d91b3c44c433776cfa61d3113bb5eb14ed149c77fd64cc60abad15959cc4923 |
| IEDL.DBID | DOA |
| ISSN | 2416-5999 |
| IngestDate | Fri Oct 03 12:46:06 EDT 2025 Tue Oct 14 20:44:43 EDT 2025 Tue Nov 18 22:10:17 EST 2025 Sat Nov 29 04:10:29 EST 2025 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Topic modeling; digital humanities; information extraction for scientific inquiry |
| Language | English |
| License | https://creativecommons.org/licenses/by-nc-sa/4.0 Attribution - NonCommercial - ShareAlike: http://creativecommons.org/licenses/by-nc-sa |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-c2155-d91b3c44c433776cfa61d3113bb5eb14ed149c77fd64cc60abad15959cc4923 |
| ORCID | 0000-0001-6455-2427 |
| OpenAccessLink | https://doaj.org/article/02b0cec15a3747cdb23903e3f05dbe34 |
| ParticipantIDs | doaj_primary_oai_doaj_org_article_02b0cec15a3747cdb23903e3f05dbe34 hal_primary_oai_HAL_hal_03261599v3 crossref_primary_10_46298_jdmdh_7595 crossref_citationtrail_10_46298_jdmdh_7595 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-10-27 |
| PublicationDateYYYYMMDD | 2021-10-27 |
| PublicationDate_xml | – month: 10 year: 2021 text: 2021-10-27 day: 27 |
| PublicationDecade | 2020 |
| PublicationTitle | Journal of data mining and digital humanities |
| PublicationYear | 2021 |
| Publisher | INRIA; Nicolas Turenne |
| Publisher_xml | – name: INRIA – name: Nicolas Turenne |
| SSID | ssj0001548555 |
| Score | 2.1611493 |
| SourceID | doaj hal crossref |
| SourceType | Open Website Open Access Repository Enrichment Source Index Database |
| SubjectTerms | [info.info-dl] computer science [cs]/digital libraries [cs.dl]; [shs.langue] humanities and social sciences/linguistics; [shs] humanities and social sciences; Computer Science; digital humanities; Digital Libraries; Humanities and Social Sciences; information extraction for scientific inquiry; Linguistics; topic modeling |
| Title | Topic models do not model topics: epistemological remarks and steps towards best practices |
| URI | https://hal.science/hal-03261599 https://doaj.org/article/02b0cec15a3747cdb23903e3f05dbe34 |
| Volume | 2021 |