TextBenDS: a Generic Textual Data Benchmark for Distributed Systems


Bibliographic details
Published in: Information systems frontiers, Volume 23, Issue 1, pp. 81-100
Main authors: Truică, Ciprian-Octavian; Apostol, Elena-Simona; Darmont, Jérôme; Assent, Ira
Format: Journal Article
Language: English
Published: New York, Springer US, 1 February 2021
Springer Nature B.V
Springer Verlag
Series: Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science
ISSN: 1387-3326, 1572-9419
Abstract Extracting top-k keywords and documents using weighting schemes is a popular technique in text mining and machine learning for various analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because it is costly to update them and to keep track of every modification made to the dataset. Furthermore, calculation errors are introduced when only subsets of the dataset are analyzed: weighting schemes use the number of documents to score keywords and documents, so the weights computed on a subset are wrong. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries of various complexities and selectivities to construct the term weighting schemes that are used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. For example, MongoDB shows the best overall performance, while Spark's execution time remains almost constant regardless of the weighting scheme.
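The abstract's point that weights computed on a subset are wrong, because weighting schemes depend on the number of documents, can be illustrated with a small sketch. It assumes TF-IDF as the weighting scheme (the abstract does not name one), and the function name `top_k_keywords` and the toy corpus are hypothetical, not taken from the benchmark.

```python
import math
from collections import Counter

def top_k_keywords(docs, k):
    """Rank terms by summed TF-IDF over a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)                    # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # depends on corpus size
            scores[term] += tf * idf
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Toy corpus (hypothetical data, for illustration only).
corpus = [
    "spark runs queries on hadoop".split(),
    "mongodb runs aggregation queries".split(),
    "hadoop stores text documents".split(),
    "text mining uses weighting schemes".split(),
]

full = dict(top_k_keywords(corpus, k=100))        # weights on the whole corpus
subset = dict(top_k_keywords(corpus[:2], k=100))  # weights on a subset only
```

In this toy setup the term "runs" occurs in both documents of the subset, so its IDF, and therefore its weight, drops to zero there, while on the full corpus it keeps a positive weight. Any top-k ranking derived from the subset would differ accordingly.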
Author Truică, Ciprian-Octavian
Apostol, Elena-Simona
Assent, Ira
Darmont, Jérôme
Author_xml – sequence: 1
  givenname: Ciprian-Octavian
  orcidid: 0000-0001-7292-4462
  surname: Truică
  fullname: Truică, Ciprian-Octavian
  email: ciprian.truica@cs.pub.ro, ciprian.truica@cs.au.dk
  organization: Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Department of Computer Science, Aarhus University
– sequence: 2
  givenname: Elena-Simona
  surname: Apostol
  fullname: Apostol, Elena-Simona
  organization: Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest
– sequence: 3
  givenname: Jérôme
  surname: Darmont
  fullname: Darmont, Jérôme
  organization: Université de Lyon
– sequence: 4
  givenname: Ira
  surname: Assent
  fullname: Assent, Ira
  organization: DIGIT, Department of Computer Science, Aarhus University
BackLink https://hal.science/hal-02476758 (View record in HAL)
CitedBy_id crossref_primary_10_1007_s10796_023_10468_5
crossref_primary_10_1016_j_jestch_2024_101728
crossref_primary_10_4018_JDM_321756
crossref_primary_10_1007_s10796_020_10091_8
crossref_primary_10_1016_j_tbench_2022_100074
crossref_primary_10_3390_math11030508
crossref_primary_10_1016_j_datak_2023_102154
crossref_primary_10_1007_s11036_020_01699_w
crossref_primary_10_1109_TKDE_2024_3417232
Cites_doi 10.1145/2742854.2747283
10.1145/2934664
10.1007/978-3-540-85836-2-6
10.1145/2723372.2742797
10.1007/978-3-642-16184-1-1
10.1504/IJIIDS.2010.032437
10.1007/978-3-319-04936-6_8
10.1109/IISWC.2014.6983058
10.18653/v1/P18-2040
10.1109/HPCA.2014.6835958
10.1002/sam.11159
10.1145/2463676.2465296
10.1145/3219819.3220094
10.1145/3137597.3137600
10.1007/978-3-319-67162-8_3
10.1016/j.future.2018.02.037
10.1016/S0306-4573(00)00016-9
10.1016/S0306-4573(00)00015-7
10.1017/CBO9780511809071
10.14778/1687553.1687609
10.1145/1327452.1327492
10.1109/ICDEW.2010.5452747
10.1007/978-3-319-31409-9-3
10.1109/ICDE.2017.167
10.1007/978-3-319-10596-3-11
10.1145/2568388.2568393
10.1145/3018661.3018726
10.1145/3130348.3130376
10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
10.1109/FUZZ-IEEE.2017.8015720
10.1007/s13278-015-0258-0
10.1137/1.9781611972795.96
10.1145/2463676.2463712
10.1007/978-3-319-49586-6-33
10.1177/0165551515620551
10.1109/synasc.2016.055
10.1504/ijbidm.2016.076425
10.1145/3121050.3121062
10.1109/MSST.2010.5496972
10.1145/3130348.3130370
10.1631/FITEE.1500332
10.1007/978-3-642-36949-0-2
10.1145/2723372.2742790
10.1007/978-3-319-30671-1-30
10.1109/BigData.2015.7363793
10.1147/JRD.2013.2240732
10.1007/978-3-642-23091-2_15
10.1145/2523616.2523633
10.1007/978-3-319-10596-3-1
ContentType Journal Article
Copyright Springer Science+Business Media, LLC, part of Springer Nature 2020
Attribution
DOI 10.1007/s10796-020-09999-y
Discipline Engineering
Computer Science
EISSN 1572-9419
EndPage 100
ExternalDocumentID oai:HAL:hal-02476758v1
10_1007_s10796_020_09999_y
GrantInformation_xml – fundername: UEFISCDI
  grantid: No. PN-III-P1-1.2-PCCDI-2017-0734
ISICitedReferencesCount 12
ISSN 1387-3326
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Keywords Weighting schemes
Distributed DBMSs
Distributed frameworks
Benchmark
Top-k documents
Top-k keywords
Language English
License Attribution: http://creativecommons.org/licenses/by
ORCID 0000-0001-7292-4462
0000-0003-1491-384X
OpenAccessLink https://hal.science/hal-02476758
PageCount 20
PublicationDate 2021-02-01
PublicationPlace New York
PublicationSeriesTitle Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science
PublicationSubtitle A Journal of Research and Innovation
PublicationTitle Information systems frontiers
PublicationTitleAbbrev Inf Syst Front
PublicationYear 2021
Publisher Springer US
Springer Nature B.V
Springer Verlag
References DeerwesterSDumaisSTFurnasGWLandauerTKHarshmanRIndexing by latent semantic analysisJournal of the American Society for Information Science199041639140710.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
Truică, C. O., & Darmont, J. (2017). T2K2: The twitter top-k keywords benchmark. In European Conference on Advances in Databases and Information Systems (pp. 21–28). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_3.
GhazalAIvanovTKostamaaPCrolotteAVoongRAl-KatebMGhazalWZicariRVBigbench v2: The new and improved bigbench2017 IEEE 33rd International Conference on Data Engineering20171225123610.1109/ICDE.2017.167
Truică, C. O., Darmont, J., & Velcine, J. (2016a). A scalable document-based architecture for text analysis. In International Conference on Advanced Data Mining and Applications (pp. 481–494). Springer. https://doi.org/10.1007/978-3-319-49586-6-33.
KılıçDÖzçiftABozyigitFYildirimPYücalarFBorandagETtc-3600: A new benchmark dataset for turkish text categorizationJournal of Information Science201743217418510.1177/0165551515620551
Transaction Processing Performance Council (TPC) (2019). TPC-DS decision support benchmark 2.10.1.http://www.tpc.org Accessed March 2019.
Yin, J., Chao, D., Liu, Z., Zhang, W., Yu, X., & Wang, J. (2018). Model-based clustering of short text streams. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2634–2642). ACM Press. https://doi.org/10.1145/3219819.3220094.
Truică, C.O., Rădulescu, F., Boicea, A. (2016b). Comparing different term weighting schemas for topic modeling. In: International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE. https://doi.org/10.1109/synasc.2016.055.
BellotPDoucetAGevaSGurajadaSKampsJKazaiGKoolenMMishraAMoriceauVMotheJPremingerMSanJuanESchenkelRTannierXTheobaldMTrappettMTrotmanASandersonMScholerFWangQReport on inex 2013SIGIR Forum2013472213210.1145/2568388.2568393
TruicăCODarmontJBoiceaARădulescuFBenchmarking top-k keyword and top-k document processing with T2K2 and T2K2D2Future Generation Computer Systems201885607510.1016/j.future.2018.02.037
Armstrong, T. G., Ponnekanti, V., Borthakur, D., & Callaghan, M. (2013). Linkbench: A database benchmark based on the facebook social graph. In ACM SIGMOD International Conference on Management of Data, SIGMOD ‘13 (pp. 1185–1196). ACM. https://doi.org/10.1145/2463676.2465296.
WangLZhanJLuoCZhuYYangQHeYGaoWJiaZShiYZhangSZhengCLuGZhanKLiXQiuBBigDataBench: A big data benchmark suite from internet servicesIEEE International Symposium on High Performance Computer Architecture201448849910.1109/HPCA.2014.6835958
Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., Ingersoll, G., Macdonald, C., & Vigna, S. (2016). Toward reproducible baselines: The open-source ir reproducibility challenge. In Advances in information retrieval (pp. 408–420). Springer International Publishing. https://doi.org/10.1007/978-3-319-30671-1-30.
ZhangDZhaiCHanJMiTexCube: MicroTextCluster cube for online analysis of text cells and its applicationsStatistical Analysis and Data Mining20126324325910.1002/sam.11159
ShvachkoKKuangHRadiaSChanslerRThe hadoop distributed file systemSymposium on Mass Storage Systems and Technologies201011010.1109/MSST.2010.5496972
ShuKSlivaAWangSTangJLiuHFake news detection on social media: A data mining perspectiveACM SIGKDD Explorations Newsletter2017191223610.1145/3137597.3137600
GattikerAEGebaraFHHofsteeHPHayesJDHylickABig data text-oriented benchmark creation for HadoopIBM Journal of Research and Development2013573/410:110:610.1147/JRD.2013.2240732
Ferrarons, J., Adhana, M., Colmenares, C., Pietrowska, S., Bentayeb, F., Darmont, J. (2014). Primeball: a parallel processing framework benchmark for big data applications in the cloud. In: TPC Technology Conference on Performance Evaluation and Benchmarking, LNCS1, 839, pp. 109–124. https://doi.org/10.1007/978-3-319-04936-6_8
Crane, M., Culpepper, J. S., Lin, J., Mackenzie, J., & Trotman, A. (2017). A comparison of document-at-a-time and score-at-a-time query evaluation. In ACM International Conference on Web Search and Data Mining (pp. 201–210). ACM. https://doi.org/10.1145/3018661.3018726.
LewisDDYangYRoseTGLiFRcv1: A new benchmark collection for text categorization researchJournal of Machine Learning Research20045361397URL http://www.jmlr.org/papers/v5/lewis04a.html
RavatFTesteOTournierRZurfluhGTop−keyword: an aggregation function for textual document olapInternational Conference on Data Warehousing and Knowledge Discovery2008556410.1007/978-3-540-85836-2-6
Krasnashchok, K., Jouili, S. (2018). Improving topic quality by promoting named entities in topic modeling. In: Annual Meeting of the Association for Computational Linguistics, pp. 247–253.
Bifet, A., & Frank, E. (2010). Sentiment knowledge discovery in twitter streaming data. In Discovery Science (pp. 1–15). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-16184-1-1.
Chowdhury, B., Rabl, T., Saadatpanah, P., Du, J., & Jacobsen, H. A. (2014). A bigbench implementation in the hadoop ecosystem. In Advancing big data benchmarks (pp. 3–18). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-1.
Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2018). Fakenewsnet: A data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286.
Zhang, D., Zhai, C., Han, J. (2009). Topic cube: Topic modeling for OLAP on multidimensional text databases. In: SIAM International Conference on Data Mining, pp. 1124–1135. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9781611972795.96
Agrawal, D., Butt, A., Doshi, K., Larriba-Pey, J. L., Li, M., Reiss, F. R., Raab, F., Schiefer, B., Suzumura, T., & Xia, Y. (2016). Sparkbench – a spark performance testing suite. In Performance evaluation and benchmarking: Traditional to big data to internet of things (pp. 26–44). Springer International Publishing. https://doi.org/10.1007/978-3-319-31409-9-3.
JiaZZhanJWangLHanRMcKeeSAYangQLuoCLiJCharacterizing and subsetting big data workloads2014 IEEE International Symposium on Workload Characterization201419120110.1109/IISWC.2014.6983058
VavilapalliVKMurthyACDouglasCAgarwalSKonarMEvansRGravesTLoweJShahHSethSSahaBCurinoCO’MalleyORadiaSReedBBaldeschwielerEApache hadoop yarn: Yet another resource negotiatorAnnual Symposium on Cloud Computing20135:15:1610.1145/2523616.2523633
PirzadehPCareyMJWestmannTBigfun: A performance study of big data management system functionalityIEEE International Conference on Big Data201550751410.1109/BigData.2015.7363793
HuangSHuangJDaiJXieTHuangBThe HiBench benchmark suite: Characterization of the MapReduce-based data analysisInternational Conference on Data Engineering2010415110.1109/ICDEW.2010.5452747
Ming, Z., Luo, C., Gao, W., Han, R., Yang, Q., Wang, L., & Zhan, J. (2014). Bdgs: A scalable big data generator suite in big data benchmarking. In Advancing big data benchmarks (pp. 138–154). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-11.
DeanJGhemawatSMapreduce: Simplified data processing on large clustersCommunications of the ACM200851110711310.1145/1327452.1327492
LavrenkoVCroftWBRelevance-based language modelsSIGIR Forum201751226026710.1145/3130348.3130376
Partalas, I., Kosmopoulos, A., Baskiotis, N., Artières, T., Paliouras, G., Gaussier, É., Androutsopoulos, I., Amini, M.R., Gallinari, P. (2015). Lshtc: A benchmark for large-scale text classification. CoRR. URL http://arxiv.org/abs/1503.08581.
Sangroya, A., Serrano, D., & Bouchenak, S. (2013). Mrbs: Towards dependability benchmarking for hadoop mapreduce. In Euro-Par 2012: Parallel Processing Workshops (pp. 3–12). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36949-0-2.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
ZahariaMXinRSWendellPDasTArmbrustMDaveAMengXRosenJVenkataramanSFranklinMJGhodsiAGonzalezJShenkerSStoicaIApache spark: A unified engine for big data processingCommunications of the ACM20165911566510.1145/2934664
GuilleAFavreCEvent detection, tracking, and visualization in twitter: a mention-anomaly-based approachSocial Network Analysis and Mining2015511810.1007/s13278-015-0258-0
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., & Zaharia, M. (2015). Spark sql: Relational data processing in spark. In ACM SIGMOD International Conference on Management of Data (pp. 1383–1394). ACM Press. https://doi.org/10.1145/2723372.2742797.
Paltoglou, G., Thelwall, M. (2010). A study of information retrieval weighting schemes for sentiment analysis. In: Annual Meeting of the Association for Computational Linguistics, pp. 1386–1395. URL http://dl.acm.org/citation.cfm?id=1858681.1858822.
BringaySBéchetNBouillotFPonceletPRocheMTeisseireMTowards an on-line analysis of tweets processingInternational Conference on Database and Expert Systems Applications201115416110.1007/978-3-642-23091-2_15
Li, M., Tan, J., Wang, Y., Zhang, L., & Salapura, V. (2015). Sparkbench: A comprehensive benchmarking suite for in memory data analytic platform spark. In ACM International Conference on Computing Frontiers, CF ‘15 (pp. 53:1–53:8). ACM. https://doi.org/10.1145/2742854.2747283.
SahaBShahHSethSVijayaraghavanGMurthyACurinoCApache tez: A unifying framework for modeling and building data processing applicationsACM SIGMOD International Conference on Management of Data2015New YorkACM1357136910.1145/2723372.2742790
Spärck JonesKWalkerSRobertsonSEA probabilistic model of information retrieval: development and comparative experiments: Part 2Information Processing & Management200036680984010.1016/S0306-4573(00)00016-9
Raiber, F., & Kurland, O. (2017). Kullback-leibler divergence revisited. In ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ‘17 (pp. 117–124). ACM. https
K Spärck Jones (9999_CR41) 2000; 36
DD Lewis (9999_CR24) 2004; 5
S Deerwester (9999_CR11) 1990; 41
A Guille (9999_CR17) 2015; 5
9999_CR43
9999_CR45
A Thusoo (9999_CR42) 2009; 2
9999_CR44
9999_CR47
9999_CR46
M Zaharia (9999_CR54) 2016; 59
P Bellot (9999_CR4) 2013; 47
P Pirzadeh (9999_CR32) 2015
J O’Shea (9999_CR29) 2010; 4
V Lavrenko (9999_CR23) 2017; 51
9999_CR53
9999_CR12
K Shu (9999_CR37) 2017; 19
9999_CR55
J Dean (9999_CR10) 2008; 51
9999_CR16
AE Gattiker (9999_CR13) 2013; 57
F Ravat (9999_CR34) 2008
S Huang (9999_CR19) 2010
VK Vavilapalli (9999_CR49) 2013
L Wang (9999_CR50) 2014
K Shvachko (9999_CR39) 2010
A Ghazal (9999_CR15) 2017
9999_CR22
9999_CR25
L Wang (9999_CR51) 2016; 17
9999_CR9
9999_CR27
9999_CR8
9999_CR26
A Ghazal (9999_CR14) 2013
9999_CR28
9999_CR1
9999_CR3
K Spärck Jones (9999_CR40) 2000; 36
9999_CR2
9999_CR5
CO Truică (9999_CR48) 2018; 85
Z Jia (9999_CR20) 2014
9999_CR30
T Hofmann (9999_CR18) 2017; 51
9999_CR31
D Kılıç (9999_CR21) 2017; 43
9999_CR33
9999_CR36
D Zhang (9999_CR56) 2012; 6
9999_CR38
X Wang (9999_CR52) 2017
S Bringay (9999_CR7) 2011
B Saha (9999_CR35) 2015
M Bouakkaz (9999_CR6) 2016; 11
StartPage 81
SubjectTerms: Algorithms; Benchmarks; Big Data; Business and Management; Computation; Computer networks; Computer Science; Control; Data base management systems; Data mining; Datasets; Document and Text Processing; Ecosystems; Errors; Information systems; IT in Business; Keywords; Machine learning; Management of Computing and Information Systems; Multidimensional approach; Operations Research/Decision Theory; Performance evaluation; Queries; Retrieval; Scores; Subsets; Systems Theory; Weighting
Title TextBenDS: a Generic Textual Data Benchmark for Distributed Systems
URI https://link.springer.com/article/10.1007/s10796-020-09999-y
https://www.proquest.com/docview/2487069882
https://hal.science/hal-02476758
Volume 23