External memory pipelining made easy with TPIE

Bibliographic Details
Published in:	2017 IEEE International Conference on Big Data (Big Data) pp. 319 - 324
Main Authors:	Arge, Lars; Rav, Mathias; Svendsen, Svend C.; Truelsen, Jakob
Format:	Conference Proceeding
Language:	English
Published:	IEEE 01.12.2017
Subjects:	Algorithm design and analysis; C++; Hardware; I/O-efficient algorithms; Libraries; Memory management; Operating systems; Pipeline processing; Software algorithms; software framework

Abstract	When handling large datasets that exceed the capacity of the main memory, movement of data between main memory and external memory (disk), rather than actual (CPU) computation time, is often the bottleneck in the computation. Since data is moved between disk and main memory in large contiguous blocks, this has led to the development of a large number of I/O-efficient algorithms that minimize the number of such block movements. However, actually implementing these algorithms can be somewhat of a challenge since operating systems do not give complete control over movement of blocks and management of main memory. TPIE is one of two major libraries that have been developed to support I/O-efficient algorithm implementations. It relies heavily on the fact that most I/O-efficient algorithms are naturally composed of components that stream through one or more lists of data items, while producing one or more such output lists, or components that sort such lists. Thus TPIE provides an interface where list stream processing and sorting can be implemented in a simple and modular way without having to worry about memory management or block movement. However, if care is not taken, such streaming-based implementations can lead to practically inefficient algorithms since lists of data items are typically written to (and read from) disk between components. In this paper we present a major extension of the TPIE library that includes a pipelining framework that allows for practically efficient streaming-based implementations while minimizing I/O-overhead between streaming components. The framework pipelines streaming components to avoid I/Os between components, that is, it processes several components simultaneously while passing output from one component directly to the input of the next component in main memory. 
TPIE automatically determines which components to pipeline and performs the required main memory management, and the extension also includes support for parallelization of internal memory computation and progress tracking across an entire application. Thus TPIE supports efficient streaming-based implementations of I/O-efficient algorithms in a simple, modular and maintainable way. The extended library has already been used to evaluate I/O-efficient algorithms in the research literature, and is heavily used in I/O-efficient commercial terrain processing applications by the Danish startup SCALGO.
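The difference between materializing intermediate lists and pipelining them, as described in the abstract, can be illustrated with a minimal C++ sketch. It does not use TPIE and does not reproduce TPIE's API: the component names (KeepEven, Square, Collect) and the functions run_materialized and run_pipelined are hypothetical, and in-memory vectors stand in for lists that an external memory implementation would write to and read from disk.

```cpp
#include <vector>

// Hypothetical names throughout; this is an illustration, not TPIE's API.

// Terminal component: stores every item pushed into it.
template <typename T>
struct Collect {
    std::vector<T>& out;
    void push(const T& item) { out.push_back(item); }
};

// Middle component: forwards only even items to the next component.
template <typename Dest>
struct KeepEven {
    Dest dest;
    void push(int item) {
        if (item % 2 == 0) dest.push(item);
    }
};

// Middle component: squares each item and forwards the result.
template <typename Dest>
struct Square {
    Dest dest;
    void push(int item) { dest.push(item * item); }
};

// Unpipelined version: each component produces a complete intermediate
// list before the next one starts. In external memory, the list `evens`
// would be written to disk and read back.
std::vector<int> run_materialized(const std::vector<int>& input) {
    std::vector<int> evens;
    for (int x : input) {
        if (x % 2 == 0) evens.push_back(x);
    }
    std::vector<int> squares;
    for (int x : evens) squares.push_back(x * x);
    return squares;
}

// Pipelined version: each item flows through all components as soon as
// it is read, so no intermediate list exists between components.
std::vector<int> run_pipelined(const std::vector<int>& input) {
    std::vector<int> result;
    KeepEven<Square<Collect<int>>> pipeline{
        Square<Collect<int>>{Collect<int>{result}}};
    for (int x : input) pipeline.push(x);
    return result;
}
```

Both functions compute the same output; only the pipelined one avoids the intermediate list. A sorting component could not be chained this way, since it must see all of its input before emitting any output, which is presumably why the framework has to decide which components can be pipelined together.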
Author affiliations	1. Arge, Lars (large@madalgo.au.dk): Dept. of Comput. Sci., Aarhus Univ., Aarhus, Denmark; 2. Rav, Mathias (rav@madalgo.au.dk): Dept. of Comput. Sci., Aarhus Univ., Aarhus, Denmark; 3. Svendsen, Svend C. (svendcs@madalgo.au.dk): Dept. of Comput. Sci., Aarhus Univ., Aarhus, Denmark; 4. Truelsen, Jakob (jakob@scalgo.com): SCALGO, Aarhus, Denmark
DOI	10.1109/BigData.2017.8257940
EISBN	9781538627150 1538627159
EndPage	324
IsPeerReviewed	false
IsScholarly	false
PageCount	6
PublicationDate	2017-Dec.
PublicationTitle	2017 IEEE International Conference on Big Data (Big Data)
PublicationTitleAbbrev	BigData
PublicationYear	2017
Publisher	IEEE
StartPage	319
URI	https://ieeexplore.ieee.org/document/8257940