The structural and content aspects of abstracts versus bodies of full text journal articles are different



Bibliographic details
Published in:	BMC bioinformatics Vol. 11; no. 1; p. 492
Main authors:	Cohen, K Bretonnel, Johnson, Helen L, Verspoor, Karin, Roeder, Christophe, Hunter, Lawrence E
Format:	Journal Article
Language:	English
Published:	London BioMed Central 29 September 2010 BioMed Central Ltd Springer Nature B.V BMC
Subjects:	Abstracting and Indexing as Topic - methods; Abstracts; Algorithms; Analysis; Bioinformatics; Biomedical and Life Sciences; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Data mining; Full text databases; Genetic research; Information management; Information Storage and Retrieval - methods; Life Sciences; MEDLINE; Microarrays; Natural Language Processing; Networks analysis; Periodicals as Topic; Research Article; Terminology as Topic; Sentence Length; Sentence Complexity; Semantic Classis; Text Mining; Gene Mention
ISSN:	1471-2105

Description
Summary:	Background: An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research.
Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.
Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that found in parenthesized text, that is present in article bodies but not in article abstracts.
DOI:	10.1186/1471-2105-11-492
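
The summary compares the two genres by sentence length, use of parenthesized material, and densities per thousand words. The sketch below illustrates how such measures could be computed; it is not the authors' method. The function names, the naive sentence splitter, whitespace tokenization, and the two sample strings are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def mean_sentence_length(text):
    # Average number of whitespace-separated tokens per sentence.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def parenthesized_spans(text):
    # Matches non-nested parenthesized spans only.
    return re.findall(r"\([^()]*\)", text)


def per_thousand_words(count, text):
    # Density of `count` occurrences per 1000 whitespace-separated tokens.
    n_words = len(text.split())
    return 1000.0 * count / n_words if n_words else 0.0


# Invented sample texts, not taken from the study's corpus.
abstract = "We examined two genres. Sentences were short."
body = (
    "We examined the corpus (n = 97) in detail (see Table 1). "
    "Results varied widely across genres."
)

for name, text in (("abstract", abstract), ("body", body)):
    spans = parenthesized_spans(text)
    print(
        name,
        mean_sentence_length(text),
        len(spans),
        round(per_thousand_words(len(spans), text), 1),
    )
```

The same `per_thousand_words` helper could be applied to counts of named entities from any semantic class, given a tagger that supplies those counts.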