Machine Learning for Occupation Coding—A Comparison Study
| Published in: | Journal of survey statistics and methodology Vol. 9; no. 5; pp. 1013 - 1034 |
|---|---|
| Main Authors: | Schierholz, Malte; Schonlau, Matthias |
| Format: | Journal Article |
| Language: | English |
| Published: | Oxford University Press, 01.11.2021 |
| Subjects: | |
| ISSN: | 2325-0984, 2325-0992 |
| Abstract | Asking people about their occupation is common practice in surveys and censuses around the world. The answers are typically recorded in textual form and subsequently assigned (coded) to categories defined in official occupational classifications. While this coding step is often done manually, substituting it with more automated workflows has been a longstanding goal, promising reduced data-processing costs and accelerated publication of key statistics. Although numerous researchers have developed different algorithms for automated occupation coding, the algorithms have rarely been compared with each other or tested on different data sets. We fill this gap by comparing some of the most promising algorithms found in the literature and testing them on five data sets from Germany. The first two algorithms we test exemplify a common practice in which answers are coded automatically according to a predefined list of job titles. Statistical learning algorithms (algorithms three to six: regularized multinomial regression, tree boosting, and algorithms developed specifically for occupation coding) can improve upon algorithms one and two, but only if a sufficient number of training observations from previous surveys is available. The best results are obtained by merging the list of job titles with coded answers from previous surveys before using this combined training data for statistical learning (algorithm seven). However, the differences between the algorithms are often small compared to the large variation found across different data sets, which we ascribe to systematic differences in the way the data were coded in the first place. Such differences complicate the application of statistical learning, which risks perpetuating questionable coding decisions from the training data to the future. |
|---|---|
| Author | Schierholz, Malte Schonlau, Matthias |
| Author_xml | 1. Schierholz, Malte (malte.schierholz@iab.de): Senior Researcher with the Institute for Employment Research (IAB), Regensburger Str. 104, 90478 Nuremberg, Germany; 2. Schonlau, Matthias: Professor at the University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada |
| CitedBy_id | crossref_primary_10_2478_jos_2021_0042 crossref_primary_10_1186_s12651_024_00376_9 crossref_primary_10_1007_s44163_023_00050_y crossref_primary_10_3390_stats8030068 crossref_primary_10_1093_jssam_smaf014 crossref_primary_10_1038_s43856_023_00397_4 crossref_primary_10_1093_jssam_smad015 crossref_primary_10_1177_0282423X241309971 crossref_primary_10_3389_fdata_2022_880554 crossref_primary_10_1093_annweh_wxad020 crossref_primary_10_1177_18747655251335761 crossref_primary_10_1177_00491241241303517 crossref_primary_10_3390_info15080496 |
| ContentType | Journal Article |
| Copyright | The Author(s) 2020. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. 2020 |
| Copyright_xml | – notice: The Author(s) 2020. Published by Oxford University Press on behalf of the American Association for Public Opinion Research. 2020 |
| DBID | TOX |
| DOI | 10.1093/jssam/smaa023 |
| DatabaseName | Oxford Journals Open Access Collection |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: TOX name: Oxford Journals Open Access Collection url: https://academic.oup.com/journals/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Statistics |
| EISSN | 2325-0992 |
| EndPage | 1034 |
| ExternalDocumentID | 10.1093/jssam/smaa023 |
| IEDL.DBID | TOX |
| ISICitedReferencesCount | 12 |
| ISSN | 2325-0984 |
| IngestDate | Wed Aug 28 03:17:18 EDT 2024 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | false |
| IsScholarly | true |
| Issue | 5 |
| Keywords | Occupation Comparison of algorithms Data processing Coding Statistical learning Machine learning |
| Language | English |
| License | This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
| LinkModel | DirectLink |
| OpenAccessLink | https://dx.doi.org/10.1093/jssam/smaa023 |
| PageCount | 22 |
| ParticipantIDs | oup_primary_10_1093_jssam_smaa023 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-11-01 |
| PublicationDateYYYYMMDD | 2021-11-01 |
| PublicationDate_xml | – month: 11 year: 2021 text: 2021-11-01 day: 01 |
| PublicationDecade | 2020 |
| PublicationTitle | Journal of survey statistics and methodology |
| PublicationYear | 2021 |
| Publisher | Oxford University Press |
| Publisher_xml | – name: Oxford University Press |
| SSID | ssj0001398790 |
| Score | 2.2987692 |
| SourceID | oup |
| SourceType | Publisher |
| StartPage | 1013 |
| Title | Machine Learning for Occupation Coding—A Comparison Study |
| Volume | 9 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| linkProvider | Oxford University Press |
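
The abstract describes two ideas that a small example may make more concrete: coding answers by exact match against a predefined list of job titles, and pooling that list with coded answers from previous surveys before any learning step. The following is only a minimal sketch of those two ideas and not the authors' implementation. The function names, the sample answers, and the codes `C1` and `C2` are hypothetical placeholders, and a simple most-frequent-code lookup stands in for the statistical learning methods named in the abstract.

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def code_by_title_list(answer, title_list):
    """Return the code of an exact (normalized) job-title match, else None."""
    normalized_titles = {normalize(t): c for t, c in title_list.items()}
    return normalized_titles.get(normalize(answer))


def build_combined_lookup(title_list, coded_answers):
    """Pool job titles with previously coded answers.

    Keeps the most frequent code per normalized text. This is a stand-in for
    training a statistical learner on the combined data.
    """
    counts = defaultdict(Counter)
    for title, code in title_list.items():
        counts[normalize(title)][code] += 1
    for answer, code in coded_answers:
        counts[normalize(answer)][code] += 1
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}


# Hypothetical data: "C1" and "C2" are placeholders, not real classification codes.
TITLE_LIST = {"nurse": "C1", "bricklayer": "C2"}
CODED_ANSWERS = [("Nurse", "C1"), ("hospital nurse", "C1"), ("Mason", "C2")]

combined = build_combined_lookup(TITLE_LIST, CODED_ANSWERS)
```

In this toy setup the answer "Mason" cannot be coded from the job-title list alone, but it can be coded once the previously coded answers are pooled in. That mirrors, in a very simplified way, why combined training data might help. It also shows the risk the abstract raises: any questionable code in `CODED_ANSWERS` would be reproduced by the lookup.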