Machine Learning for Occupation Coding—A Comparison Study

Bibliographic Details
Published in: Journal of Survey Statistics and Methodology, Vol. 9, no. 5, pp. 1013-1034
Main Authors: Schierholz, Malte; Schonlau, Matthias
Format: Journal Article
Language: English
Published: Oxford University Press, 01.11.2021
ISSN: 2325-0984, 2325-0992
Abstract: Asking people about their occupation is common practice in surveys and censuses around the world. The answers are typically recorded in textual form and subsequently assigned (coded) to categories, which have been defined in official occupational classifications. While this coding step is often done manually, substituting it with more automated workflows has been a longstanding goal, promising reduced data-processing costs and accelerated publication of key statistics. Although numerous researchers have developed different algorithms for automated occupation coding, the algorithms have rarely been compared with each other or tested on different data sets. We fill this gap by comparing some of the most promising algorithms found in the literature and testing them on five data sets from Germany. The first two algorithms we test exemplify a common practice in which answers are coded automatically according to a predefined list of job titles. Statistical learning algorithms (algorithms three to six), that is, regularized multinomial regression, tree boosting, or algorithms developed specifically for occupation coding, can improve upon algorithms one and two, but only if a sufficient number of training observations from previous surveys is available. The best results are obtained by merging the list of job titles with coded answers from previous surveys before using this combined training data for statistical learning (algorithm seven). However, the differences between the algorithms are often small compared to the large variation found across different data sets, which we ascribe to systematic differences in the way the data were coded in the first place. Such differences complicate the application of statistical learning, which risks carrying questionable coding decisions from the training data into the future.
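The two ideas named in the abstract, coding against a predefined job-title list and merging that list with previously coded answers to form training data, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the placeholder codes ("C1", "C2", "C3"), and the toy data are hypothetical, and the majority-vote learner is only a stand-in for the statistical learning methods the article actually compares (regularized multinomial regression, tree boosting, and others).

```python
from collections import Counter, defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(text.lower().split())


def code_by_title_list(answer, title_list):
    """Exact-match lookup in a predefined job-title list; None if no match."""
    return title_list.get(normalize(answer))


def build_combined_training_data(title_list, coded_answers):
    """Merge job-title entries with coded answers from previous surveys."""
    combined = [(normalize(title), code) for title, code in title_list.items()]
    combined += [(normalize(answer), code) for answer, code in coded_answers]
    return combined


def fit_majority_coder(training_data):
    """Stand-in learner (an assumption): most frequent code per normalized text."""
    counts = defaultdict(Counter)
    for text, code in training_data:
        counts[text][code] += 1
    return {text: c.most_common(1)[0][0] for text, c in counts.items()}


# Hypothetical job-title list and previously coded answers (placeholder codes).
title_list = {"nurse": "C1", "teacher": "C2"}
coded_answers = [("Bus driver", "C3"), ("bus  driver", "C3"), ("Nurse", "C1")]

# The title list alone cannot code an answer it does not contain.
list_only = code_by_title_list("Bus driver", title_list)

# A model trained on the combined data can.
model = fit_majority_coder(build_combined_training_data(title_list, coded_answers))
combined_result = model.get(normalize("Bus driver"))
```

In this toy setting the list-only lookup fails for "Bus driver", while the model trained on the combined data recovers a code for it, which mirrors the abstract's point that previous coded answers add coverage. It also shows the stated risk: whatever code earlier coders assigned is reproduced as-is.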
Author Schierholz, Malte
Schonlau, Matthias
Author_xml – sequence: 1
  givenname: Malte
  surname: Schierholz
  fullname: Schierholz, Malte
  email: malte.schierholz@iab.de
  organization: Malte Schierholz is a Senior Researcher with the Institute for Employment Research (IAB), Regensburger Str. 104, 90478 Nuremberg, Germany
– sequence: 2
  givenname: Matthias
  surname: Schonlau
  fullname: Schonlau, Matthias
  organization: Matthias Schonlau is Professor at the University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
ContentType Journal Article
Copyright The Author(s) 2020. Published by Oxford University Press on behalf of the American Association for Public Opinion Research.
DOI 10.1093/jssam/smaa023
Discipline Statistics
EISSN 2325-0992
EndPage 1034
ISICitedReferencesCount 12
ISSN 2325-0984
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Issue 5
Keywords Occupation
Comparison of algorithms
Data processing
Coding
Statistical learning
Machine learning
Language English
License This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
OpenAccessLink https://dx.doi.org/10.1093/jssam/smaa023
PageCount 22
PublicationCentury 2000
PublicationDate 2021-11-01
PublicationDateYYYYMMDD 2021-11-01
PublicationDate_xml – month: 11
  year: 2021
  text: 2021-11-01
  day: 01
PublicationDecade 2020
PublicationTitle Journal of survey statistics and methodology
PublicationYear 2021
Publisher Oxford University Press
StartPage 1013
Title Machine Learning for Occupation Coding—A Comparison Study
Volume 9