External memory pipelining made easy with TPIE

Bibliographic Details
Published in: 2017 IEEE International Conference on Big Data (Big Data), pp. 319-324
Main Authors: Arge, Lars; Rav, Mathias; Svendsen, Svend C.; Truelsen, Jakob
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2017
Abstract: When handling large datasets that exceed the capacity of the main memory, movement of data between main memory and external memory (disk), rather than actual (CPU) computation time, is often the bottleneck in the computation. Since data is moved between disk and main memory in large contiguous blocks, this has led to the development of a large number of I/O-efficient algorithms that minimize the number of such block movements. However, actually implementing these algorithms can be somewhat of a challenge since operating systems do not give complete control over movement of blocks and management of main memory. TPIE is one of two major libraries that have been developed to support I/O-efficient algorithm implementations. It relies heavily on the fact that most I/O-efficient algorithms are naturally composed of components that stream through one or more lists of data items, while producing one or more such output lists, or components that sort such lists. Thus TPIE provides an interface where list stream processing and sorting can be implemented in a simple and modular way without having to worry about memory management or block movement. However, if care is not taken, such streaming-based implementations can lead to practically inefficient algorithms since lists of data items are typically written to (and read from) disk between components. In this paper we present a major extension of the TPIE library that includes a pipelining framework that allows for practically efficient streaming-based implementations while minimizing I/O-overhead between streaming components. The framework pipelines streaming components to avoid I/Os between components, that is, it processes several components simultaneously while passing output from one component directly to the input of the next component in main memory. TPIE automatically determines which components to pipeline and performs the required main memory management, and the extension also includes support for parallelization of internal memory computation and progress tracking across an entire application. Thus TPIE supports efficient streaming-based implementations of I/O-efficient algorithms in a simple, modular and maintainable way. The extended library has already been used to evaluate I/O-efficient algorithms in the research literature, and is heavily used in I/O-efficient commercial terrain processing applications by the Danish startup SCALGO.
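Illustration (not from the paper): the abstract's central idea, passing each item from one streaming component directly to the next instead of writing an intermediate list between components, can be sketched in plain C++. This is a minimal sketch under stated assumptions and does not use or reproduce TPIE's actual API. The names Collector, KeepEven, Square, run_materialized and run_pipelined are hypothetical, and an in-memory std::vector stands in for the intermediate list that a real external-memory program would write to disk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical final stage: stores every item it receives.
struct Collector {
    std::vector<int>* out;
    void push(int x) { out->push_back(x); }
};

// Hypothetical streaming component: forwards only even items.
template <typename Dest>
struct KeepEven {
    Dest dest;
    void push(int x) { if (x % 2 == 0) dest.push(x); }
};

// Hypothetical streaming component: forwards the square of each item.
template <typename Dest>
struct Square {
    Dest dest;
    void push(int x) { dest.push(x * x); }
};

// Unpipelined version: the first component materializes its whole output
// list (a stand-in for a temporary file on disk) before the second
// component starts reading it.
std::vector<int> run_materialized(const std::vector<int>& input) {
    std::vector<int> squared;  // intermediate list between components
    for (int x : input) squared.push_back(x * x);
    std::vector<int> result;
    for (int x : squared) if (x % 2 == 0) result.push_back(x);
    return result;
}

// Pipelined version: each item flows through both components in one pass,
// so no intermediate list is ever stored.
std::vector<int> run_pipelined(const std::vector<int>& input) {
    std::vector<int> result;
    Square<KeepEven<Collector>> pipeline{KeepEven<Collector>{Collector{&result}}};
    for (int x : input) pipeline.push(x);
    assert(result.size() <= input.size());  // a filter never adds items
    return result;
}
```

Both functions compute the same output. The difference is that run_pipelined never holds the list of squares, which is the saving the abstract describes when that list would otherwise be written to and read back from disk. According to the abstract, TPIE itself decides which components to pipeline and manages their memory; this sketch hard-codes the composition and does neither.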
Author_xml – sequence: 1
  givenname: Lars
  surname: Arge
  fullname: Arge, Lars
  email: large@madalgo.au.dk
  organization: Dept. of Comput. Sci., Aarhus Univ. Aarhus, Aarhus, Denmark
– sequence: 2
  givenname: Mathias
  surname: Rav
  fullname: Rav, Mathias
  email: rav@madalgo.au.dk
  organization: Dept. of Comput. Sci., Aarhus Univ. Aarhus, Aarhus, Denmark
– sequence: 3
  givenname: Svend C.
  surname: Svendsen
  fullname: Svendsen, Svend C.
  email: svendcs@madalgo.au.dk
  organization: Dept. of Comput. Sci., Aarhus Univ. Aarhus, Aarhus, Denmark
– sequence: 4
  givenname: Jakob
  surname: Truelsen
  fullname: Truelsen, Jakob
  email: jakob@scalgo.com
  organization: SCALGO, Aarhus, Denmark
ContentType Conference Proceeding
DOI 10.1109/BigData.2017.8257940
EISBN 9781538627150
1538627159
EndPage 324
ExternalDocumentID 8257940
Genre orig-research
IsPeerReviewed false
IsScholarly false
Language English
PageCount 6
PublicationCentury 2000
PublicationDate 2017-Dec.
PublicationDateYYYYMMDD 2017-12-01
PublicationDate_xml – month: 12
  year: 2017
  text: 2017-Dec.
PublicationDecade 2010
PublicationTitle 2017 IEEE International Conference on Big Data (Big Data)
PublicationTitleAbbrev BigData
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 319
SubjectTerms Algorithm design and analysis
C++
Hardware
I/O-efficient algorithms
Libraries
Memory management
Operating systems
Pipeline processing
Software algorithms
software framework
Title External memory pipelining made easy with TPIE
URI https://ieeexplore.ieee.org/document/8257940