TextBenDS: a Generic Textual Data Benchmark for Distributed Systems
Extracting top-k keywords and documents using weighting schemes is a popular technique employed in text mining and machine learning for different analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, as they are costly to update and keep track of all the mod...
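To make the idea of a weighting scheme concrete, here is a minimal sketch in Python that ranks keywords with plain TF-IDF. It is an illustration only: TF-IDF is assumed as a representative scheme, and the toy corpus and the function names `tfidf_weights` and `top_k_keywords` are hypothetical, not taken from the benchmark. It also shows why weights computed on a subset of the documents differ from weights computed on the full dataset: the inverse document frequency depends on the total number of documents.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return {term: TF-IDF weight summed over all documents}.

    `docs` is a list of tokenised documents (lists of strings).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)                   # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # depends on corpus size
            weights[term] += tf * idf
    return weights


def top_k_keywords(docs, k):
    """Return the k highest-weighted (term, weight) pairs."""
    weights = tfidf_weights(docs)
    # Sort by descending weight; break ties alphabetically for determinism.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "spark runs queries on hadoop".split(),
    "mongodb runs aggregation queries".split(),
    "hadoop stores text documents".split(),
    "benchmark compares spark and mongodb".split(),
]

full = tfidf_weights(corpus)         # weights over the whole dataset
subset = tfidf_weights(corpus[:2])   # weights over the first two documents

print(top_k_keywords(corpus, 2))
print(full["runs"], subset["runs"])
```

In the two-document subset the term "runs" occurs in every document, so its inverse document frequency is log(2/2) = 0 and its weight collapses to zero, whereas over the full corpus it keeps a positive weight. This is the kind of subset-induced error the abstract describes.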
| Published in: | Information Systems Frontiers, vol. 23, no. 1, pp. 81-100 |
|---|---|
| Main authors: | Truică, Ciprian-Octavian; Apostol, Elena-Simona; Darmont, Jérôme; Assent, Ira |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: Springer US; Springer Nature B.V; Springer Verlag, 1 February 2021 |
| Series: | Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science |
| ISSN: | 1387-3326, 1572-9419 |
| Abstract | Extracting top-k keywords and documents using weighting schemes is a popular technique employed in text mining and machine learning for different analysis and retrieval tasks. The weights are usually computed in the data preprocessing step, because updating them and keeping track of every modification performed on the dataset is costly. Furthermore, calculation errors are introduced when analyzing only subsets of the dataset, i.e., wrong weights are computed because weighting schemes use the number of documents for scoring keywords and documents. Therefore, in a Big Data context, it is crucial to lower the runtime of computing weighting schemes without hindering the analysis process or the accuracy of the machine learning algorithms. To address this requirement for the task of computing top-k keywords and documents (which largely relies on weighting schemes), it is customary to design benchmarks that compare weighting schemes within various configurations of distributed frameworks and database management systems. Thus, we propose TextBenDS, a generic document-oriented benchmark for storing textual data and constructing weighting schemes. Our benchmark offers a generic data model designed with a multidimensional approach for storing text documents. We also propose using aggregation queries with various complexities and selectivities for constructing term weighting schemes, which are used to extract top-k keywords and documents. We evaluate the computing performance of the queries on several distributed environments set within the Apache Hadoop ecosystem. Our experimental results provide interesting insights. As an example, MongoDB shows the best overall performance, while Spark's execution time remains almost constant regardless of the weighting scheme. |
|---|---|
| Author | Truică, Ciprian-Octavian; Apostol, Elena-Simona; Assent, Ira; Darmont, Jérôme |
| Author_xml | 1. Truică, Ciprian-Octavian (ORCID 0000-0001-7292-4462; ciprian.truica@cs.pub.ro, ciprian.truica@cs.au.dk): Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest; Department of Computer Science, Aarhus University. 2. Apostol, Elena-Simona: Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest. 3. Darmont, Jérôme: Université de Lyon. 4. Assent, Ira: DIGIT, Department of Computer Science, Aarhus University. |
| BackLink | https://hal.science/hal-02476758 (View record in HAL) |
| ContentType | Journal Article |
| Copyright | Springer Science+Business Media, LLC, part of Springer Nature 2020. Attribution |
| DOI | 10.1007/s10796-020-09999-y |
| Discipline | Engineering; Computer Science |
| EISSN | 1572-9419 |
| EndPage | 100 |
| GrantInformation_xml | Funder: UEFISCDI; grant no. PN-III-P1-1.2-PCCDI-2017-0734 |
| ISICitedReferencesCount | 12 |
| ISSN | 1387-3326 |
| IsDoiOpenAccess | true |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 1 |
| Keywords | Weighting schemes; Distributed DBMSs; Distributed frameworks; Benchmark; Top-k documents; Top-k keywords |
| Language | English |
| License | Attribution: http://creativecommons.org/licenses/by |
| ORCID | 0000-0001-7292-4462 0000-0003-1491-384X |
| OpenAccessLink | https://hal.science/hal-02476758 |
| PageCount | 20 |
| PublicationDate | 2021-02-01 |
| PublicationPlace | New York |
| PublicationSeriesTitle | Breakthroughs on Cross-Cutting Data Management, Data Analytics and Applied Data Science |
| PublicationSubtitle | A Journal of Research and Innovation |
| PublicationTitle | Information systems frontiers |
| PublicationTitleAbbrev | Inf Syst Front |
| PublicationYear | 2021 |
| Publisher | Springer US; Springer Nature B.V; Springer Verlag |
| References | DeerwesterSDumaisSTFurnasGWLandauerTKHarshmanRIndexing by latent semantic analysisJournal of the American Society for Information Science199041639140710.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 Truică, C. O., & Darmont, J. (2017). T2K2: The twitter top-k keywords benchmark. In European Conference on Advances in Databases and Information Systems (pp. 21–28). Springer International Publishing. https://doi.org/10.1007/978-3-319-67162-8_3. GhazalAIvanovTKostamaaPCrolotteAVoongRAl-KatebMGhazalWZicariRVBigbench v2: The new and improved bigbench2017 IEEE 33rd International Conference on Data Engineering20171225123610.1109/ICDE.2017.167 Truică, C. O., Darmont, J., & Velcine, J. (2016a). A scalable document-based architecture for text analysis. In International Conference on Advanced Data Mining and Applications (pp. 481–494). Springer. https://doi.org/10.1007/978-3-319-49586-6-33. KılıçDÖzçiftABozyigitFYildirimPYücalarFBorandagETtc-3600: A new benchmark dataset for turkish text categorizationJournal of Information Science201743217418510.1177/0165551515620551 Transaction Processing Performance Council (TPC) (2019). TPC-DS decision support benchmark 2.10.1.http://www.tpc.org Accessed March 2019. Yin, J., Chao, D., Liu, Z., Zhang, W., Yu, X., & Wang, J. (2018). Model-based clustering of short text streams. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2634–2642). ACM Press. https://doi.org/10.1145/3219819.3220094. Truică, C.O., Rădulescu, F., Boicea, A. (2016b). Comparing different term weighting schemas for topic modeling. In: International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. IEEE. https://doi.org/10.1109/synasc.2016.055. 
BellotPDoucetAGevaSGurajadaSKampsJKazaiGKoolenMMishraAMoriceauVMotheJPremingerMSanJuanESchenkelRTannierXTheobaldMTrappettMTrotmanASandersonMScholerFWangQReport on inex 2013SIGIR Forum2013472213210.1145/2568388.2568393 TruicăCODarmontJBoiceaARădulescuFBenchmarking top-k keyword and top-k document processing with T2K2 and T2K2D2Future Generation Computer Systems201885607510.1016/j.future.2018.02.037 Armstrong, T. G., Ponnekanti, V., Borthakur, D., & Callaghan, M. (2013). Linkbench: A database benchmark based on the facebook social graph. In ACM SIGMOD International Conference on Management of Data, SIGMOD ‘13 (pp. 1185–1196). ACM. https://doi.org/10.1145/2463676.2465296. WangLZhanJLuoCZhuYYangQHeYGaoWJiaZShiYZhangSZhengCLuGZhanKLiXQiuBBigDataBench: A big data benchmark suite from internet servicesIEEE International Symposium on High Performance Computer Architecture201448849910.1109/HPCA.2014.6835958 Lin, J., Crane, M., Trotman, A., Callan, J., Chattopadhyaya, I., Foley, J., Ingersoll, G., Macdonald, C., & Vigna, S. (2016). Toward reproducible baselines: The open-source ir reproducibility challenge. In Advances in information retrieval (pp. 408–420). Springer International Publishing. https://doi.org/10.1007/978-3-319-30671-1-30. ZhangDZhaiCHanJMiTexCube: MicroTextCluster cube for online analysis of text cells and its applicationsStatistical Analysis and Data Mining20126324325910.1002/sam.11159 ShvachkoKKuangHRadiaSChanslerRThe hadoop distributed file systemSymposium on Mass Storage Systems and Technologies201011010.1109/MSST.2010.5496972 ShuKSlivaAWangSTangJLiuHFake news detection on social media: A data mining perspectiveACM SIGKDD Explorations Newsletter2017191223610.1145/3137597.3137600 GattikerAEGebaraFHHofsteeHPHayesJDHylickABig data text-oriented benchmark creation for HadoopIBM Journal of Research and Development2013573/410:110:610.1147/JRD.2013.2240732 Ferrarons, J., Adhana, M., Colmenares, C., Pietrowska, S., Bentayeb, F., Darmont, J. (2014). 
Primeball: a parallel processing framework benchmark for big data applications in the cloud. In: TPC Technology Conference on Performance Evaluation and Benchmarking, LNCS1, 839, pp. 109–124. https://doi.org/10.1007/978-3-319-04936-6_8 Crane, M., Culpepper, J. S., Lin, J., Mackenzie, J., & Trotman, A. (2017). A comparison of document-at-a-time and score-at-a-time query evaluation. In ACM International Conference on Web Search and Data Mining (pp. 201–210). ACM. https://doi.org/10.1145/3018661.3018726. LewisDDYangYRoseTGLiFRcv1: A new benchmark collection for text categorization researchJournal of Machine Learning Research20045361397URL http://www.jmlr.org/papers/v5/lewis04a.html RavatFTesteOTournierRZurfluhGTop−keyword: an aggregation function for textual document olapInternational Conference on Data Warehousing and Knowledge Discovery2008556410.1007/978-3-540-85836-2-6 Krasnashchok, K., Jouili, S. (2018). Improving topic quality by promoting named entities in topic modeling. In: Annual Meeting of the Association for Computational Linguistics, pp. 247–253. Bifet, A., & Frank, E. (2010). Sentiment knowledge discovery in twitter streaming data. In Discovery Science (pp. 1–15). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-16184-1-1. Chowdhury, B., Rabl, T., Saadatpanah, P., Du, J., & Jacobsen, H. A. (2014). A bigbench implementation in the hadoop ecosystem. In Advancing big data benchmarks (pp. 3–18). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-1. Shu, K., Mahudeswaran, D., Wang, S., Lee, D., Liu, H. (2018). Fakenewsnet: A data repository with news content, social context and dynamic information for studying fake news on social media. arXiv preprint arXiv:1809.01286. Zhang, D., Zhai, C., Han, J. (2009). Topic cube: Topic modeling for OLAP on multidimensional text databases. In: SIAM International Conference on Data Mining, pp. 1124–1135. Society for Industrial and Applied Mathematics. 
https://doi.org/10.1137/1.9781611972795.96 Agrawal, D., Butt, A., Doshi, K., Larriba-Pey, J. L., Li, M., Reiss, F. R., Raab, F., Schiefer, B., Suzumura, T., & Xia, Y. (2016). Sparkbench – a spark performance testing suite. In Performance evaluation and benchmarking: Traditional to big data to internet of things (pp. 26–44). Springer International Publishing. https://doi.org/10.1007/978-3-319-31409-9-3. JiaZZhanJWangLHanRMcKeeSAYangQLuoCLiJCharacterizing and subsetting big data workloads2014 IEEE International Symposium on Workload Characterization201419120110.1109/IISWC.2014.6983058 VavilapalliVKMurthyACDouglasCAgarwalSKonarMEvansRGravesTLoweJShahHSethSSahaBCurinoCO’MalleyORadiaSReedBBaldeschwielerEApache hadoop yarn: Yet another resource negotiatorAnnual Symposium on Cloud Computing20135:15:1610.1145/2523616.2523633 PirzadehPCareyMJWestmannTBigfun: A performance study of big data management system functionalityIEEE International Conference on Big Data201550751410.1109/BigData.2015.7363793 HuangSHuangJDaiJXieTHuangBThe HiBench benchmark suite: Characterization of the MapReduce-based data analysisInternational Conference on Data Engineering2010415110.1109/ICDEW.2010.5452747 Ming, Z., Luo, C., Gao, W., Han, R., Yang, Q., Wang, L., & Zhan, J. (2014). Bdgs: A scalable big data generator suite in big data benchmarking. In Advancing big data benchmarks (pp. 138–154). Springer International Publishing. https://doi.org/10.1007/978-3-319-10596-3-11. DeanJGhemawatSMapreduce: Simplified data processing on large clustersCommunications of the ACM200851110711310.1145/1327452.1327492 LavrenkoVCroftWBRelevance-based language modelsSIGIR Forum201751226026710.1145/3130348.3130376 Partalas, I., Kosmopoulos, A., Baskiotis, N., Artières, T., Paliouras, G., Gaussier, É., Androutsopoulos, I., Amini, M.R., Gallinari, P. (2015). Lshtc: A benchmark for large-scale text classification. CoRR. URL http://arxiv.org/abs/1503.08581. Sangroya, A., Serrano, D., & Bouchenak, S. (2013). 
Mrbs: Towards dependability benchmarking for hadoop mapreduce. In Euro-Par 2012: Parallel Processing Workshops (pp. 3–12). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-36949-0-2. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. ZahariaMXinRSWendellPDasTArmbrustMDaveAMengXRosenJVenkataramanSFranklinMJGhodsiAGonzalezJShenkerSStoicaIApache spark: A unified engine for big data processingCommunications of the ACM20165911566510.1145/2934664 GuilleAFavreCEvent detection, tracking, and visualization in twitter: a mention-anomaly-based approachSocial Network Analysis and Mining2015511810.1007/s13278-015-0258-0 Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., & Zaharia, M. (2015). Spark sql: Relational data processing in spark. In ACM SIGMOD International Conference on Management of Data (pp. 1383–1394). ACM Press. https://doi.org/10.1145/2723372.2742797. Paltoglou, G., Thelwall, M. (2010). A study of information retrieval weighting schemes for sentiment analysis. In: Annual Meeting of the Association for Computational Linguistics, pp. 1386–1395. URL http://dl.acm.org/citation.cfm?id=1858681.1858822. BringaySBéchetNBouillotFPonceletPRocheMTeisseireMTowards an on-line analysis of tweets processingInternational Conference on Database and Expert Systems Applications201115416110.1007/978-3-642-23091-2_15 Li, M., Tan, J., Wang, Y., Zhang, L., & Salapura, V. (2015). Sparkbench: A comprehensive benchmarking suite for in memory data analytic platform spark. In ACM International Conference on Computing Frontiers, CF ‘15 (pp. 53:1–53:8). ACM. https://doi.org/10.1145/2742854.2747283. 
SahaBShahHSethSVijayaraghavanGMurthyACurinoCApache tez: A unifying framework for modeling and building data processing applicationsACM SIGMOD International Conference on Management of Data2015New YorkACM1357136910.1145/2723372.2742790 Spärck JonesKWalkerSRobertsonSEA probabilistic model of information retrieval: development and comparative experiments: Part 2Information Processing & Management200036680984010.1016/S0306-4573(00)00016-9 Raiber, F., & Kurland, O. (2017). Kullback-leibler divergence revisited. In ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ‘17 (pp. 117–124). ACM. https K Spärck Jones (9999_CR41) 2000; 36 DD Lewis (9999_CR24) 2004; 5 S Deerwester (9999_CR11) 1990; 41 A Guille (9999_CR17) 2015; 5 9999_CR43 9999_CR45 A Thusoo (9999_CR42) 2009; 2 9999_CR44 9999_CR47 9999_CR46 M Zaharia (9999_CR54) 2016; 59 P Bellot (9999_CR4) 2013; 47 P Pirzadeh (9999_CR32) 2015 J O’Shea (9999_CR29) 2010; 4 V Lavrenko (9999_CR23) 2017; 51 9999_CR53 9999_CR12 K Shu (9999_CR37) 2017; 19 9999_CR55 J Dean (9999_CR10) 2008; 51 9999_CR16 AE Gattiker (9999_CR13) 2013; 57 F Ravat (9999_CR34) 2008 S Huang (9999_CR19) 2010 VK Vavilapalli (9999_CR49) 2013 L Wang (9999_CR50) 2014 K Shvachko (9999_CR39) 2010 A Ghazal (9999_CR15) 2017 9999_CR22 9999_CR25 L Wang (9999_CR51) 2016; 17 9999_CR9 9999_CR27 9999_CR8 9999_CR26 A Ghazal (9999_CR14) 2013 9999_CR28 9999_CR1 9999_CR3 K Spärck Jones (9999_CR40) 2000; 36 9999_CR2 9999_CR5 CO Truică (9999_CR48) 2018; 85 Z Jia (9999_CR20) 2014 9999_CR30 T Hofmann (9999_CR18) 2017; 51 9999_CR31 D Kılıç (9999_CR21) 2017; 43 9999_CR33 9999_CR36 D Zhang (9999_CR56) 2012; 6 9999_CR38 X Wang (9999_CR52) 2017 S Bringay (9999_CR7) 2011 B Saha (9999_CR35) 2015 M Bouakkaz (9999_CR6) 2016; 11 |
| SourceID | hal; proquest; crossref; springer |
| SourceType | Open Access Repository; Aggregation Database; Enrichment Source; Index Database; Publisher |
| StartPage | 81 |
| SubjectTerms | Algorithms; Benchmarks; Big Data; Business and Management; Computation; Computer networks; Computer Science; Control; Data base management systems; Data mining; Datasets; Document and Text Processing; Ecosystems; Errors; Information systems; IT in Business; Keywords; Machine learning; Management of Computing and Information Systems; Multidimensional approach; Operations Research/Decision Theory; Performance evaluation; Queries; Retrieval; Scores; Subsets; Systems Theory; Weighting |
| Title | TextBenDS: a Generic Textual Data Benchmark for Distributed Systems |
| URI | https://link.springer.com/article/10.1007/s10796-020-09999-y https://www.proquest.com/docview/2487069882 https://hal.science/hal-02476758 |
| Volume | 23 |