The Discriminativeness of Internal Syntactic Representations in Automatic Genre Classification

Detailed bibliography
Published in: Journal of quantitative linguistics, Volume 28, Issue 2, pp. 138-171
Main authors: Wan, Mingyu; Fang, Alex Chengyu; Huang, Chu-Ren
Format: Journal Article
Language: English
Publication details: Lisse: Routledge, 03.04.2021
Taylor & Francis Ltd
ISSN: 0929-6174, 1744-5035
Abstract: Genre characterizes a document differently from subject, which has been the focus of most document retrieval and classification applications. This work hypothesizes a close interaction between syntactic variation and genre differentiation by examining stylistic cues in functional and structural aspects beyond the word level. It engineers 14 syntactic feature sets of internal representations for genre classification with machine learning. Experimental results show that fusing structural and lexical features is significantly superior for genre classification (F Δmax. = 9.2%, sig. = 0.001), suggesting the effectiveness of incorporating syntactic cues for genre discrimination. In addition, the PCA analysis reports noun phrases (NP) as the leading principal component (66%) for genre variation and prepositional phrases (PP) as the second. In particular, noun phrases with dominant structures of prepositional complements and pronouns functioning as a subject are most effective for identifying printed texts of high formality, while prepositional phrases are useful for identifying speeches of low formality. Error analysis suggests that the phrasal features are particularly useful for classifying four groups of genre classes, i.e. unscripted speech, fiction, news reports, and academic writing, each with distinct structural characteristics; together they show an increasing degree of formality along the continuum of language complexity.
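The feature fusion and F-score comparison mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the phrase-type inventory, the function names (`phrase_type_features`, `fuse`, `macro_f1`), fusion by simple concatenation, and the reading of "F" as a macro-averaged F1 score are all hypothetical choices, since the record does not specify them.

```python
from collections import Counter

# Hypothetical phrase-type inventory; the record does not list the actual tag set.
PHRASE_TYPES = ("NP", "PP", "VP", "AJP", "AVP")


def phrase_type_features(phrase_tags, inventory=PHRASE_TYPES):
    """Return the relative frequency of each phrase type in one document."""
    counts = Counter(phrase_tags)
    total = sum(counts[t] for t in inventory) or 1  # avoid division by zero
    return [counts[t] / total for t in inventory]


def fuse(lexical_vec, syntactic_vec):
    """Assumed fusion strategy: concatenate lexical and syntactic vectors."""
    return list(lexical_vec) + list(syntactic_vec)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def f_gain(gold, pred_baseline, pred_fused):
    """Difference in macro F1 between a fused system and a baseline."""
    return macro_f1(gold, pred_fused) - macro_f1(gold, pred_baseline)
```

Under these assumptions, a per-document syntactic vector from `phrase_type_features` would be appended to a lexical vector with `fuse` before training any classifier, and `f_gain` would report how much the fused system improves on the lexical-only baseline for a given set of predictions.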
Author details:
Wan, Mingyu (Peking University), ORCID 0000-0003-0083-5895, email pku.clara@gmail.com
Fang, Alex Chengyu (City University of Hong Kong)
Huang, Chu-Ren (Peking University), ORCID 0000-0002-8526-5520
Copyright: 2019 Informa UK Limited, trading as Taylor & Francis Group
DOI: 10.1080/09296174.2019.1663655
Discipline: Languages & Literatures
Genre: Research Article
Peer reviewed: true
Scholarly: true
Page count: 34
Publication date: 2021-04-03
Subject terms: Academic writing, Classification, Computer generated language analysis, Cues, Differentiation, Discrimination, Error analysis, Fiction, Genre, Linguistic complexity, Machine learning, News media, Noun phrases, Phrases, Prepositional phrases, Retrieval, Speech, Speeches, Structural aspects, Syntactic features, Syntactic structures, Writing
URI: https://www.tandfonline.com/doi/abs/10.1080/09296174.2019.1663655
https://www.proquest.com/docview/2509921917