In-Memory Distributed Matrix Computation Processing and Optimization

Detailed bibliography
Published in: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 1047-1058
Main authors: Yongyang Yu, Mingjie Tang, Aref, Walid G., Malluhi, Qutaibah M., Abbas, Mostafa M., Ouzzani, Mourad
Format: Conference paper
Language: English
Published: IEEE, 01.04.2017
Topics:
ISSN:2375-026X
Online access: Get full text
Abstract The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.
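The abstract states that the proposed techniques estimate the sparsity of intermediate matrix-computation results in order to optimize communication costs. The following is a minimal sketch, in Scala (the language of Spark, on which the system is implemented), of one standard way such an estimate can be computed for a product C = A * B, assuming nonzero positions are independent and uniformly distributed; the formula, names, and sample shapes are illustrative assumptions, not the paper's exact estimator.

// Density-based sparsity estimate for C = A * B under an independence
// assumption (illustrative only; not necessarily the paper's estimator).
object SparsityEstimate {
  // density = nnz / (rows * cols), a value in [0, 1]
  def productDensity(innerDim: Long, densityA: Double, densityB: Double): Double = {
    val pTerm = densityA * densityB                 // P[a(i,k) * b(k,j) != 0]
    1.0 - math.pow(1.0 - pTerm, innerDim.toDouble)  // P[at least one of the innerDim terms is nonzero]
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical operand shapes and densities.
    val (rowsA, inner, colsB) = (100000L, 50000L, 20000L)
    val dC = productDensity(inner, densityA = 1e-4, densityB = 1e-4)
    val estNnzC = dC * rowsA * colsB
    println(f"estimated density of A*B: $dC%.6f, estimated nnz: $estNnzC%.0f")
  }
}

An estimate of this kind lets an optimizer decide, before executing an operation, whether the intermediate result should be represented and shuffled in a sparse or a dense format.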
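The abstract also describes an evaluation plan generator and a cost-based plan optimizer for sequences of matrix operations. As a simplified illustration of how estimated intermediate sizes can drive such a choice, the sketch below compares two evaluation orders for the chain product A * B * C using estimated nonzero counts as a proxy for communication cost; the cost model, class names, and numbers are assumptions for illustration and do not reproduce the paper's optimizer, which also applies rule-based heuristics and dependency-aware partitioning.

// Choosing between (A*B)*C and A*(B*C) from estimated sizes (illustrative).
object ChainPlanChoice {
  case class MatStats(rows: Long, cols: Long, density: Double) {
    def nnz: Double = rows.toDouble * cols * density
  }

  // Statistics of a product result, using the same independence assumption
  // as the sparsity sketch above.
  def product(a: MatStats, b: MatStats): MatStats = {
    val d = 1.0 - math.pow(1.0 - a.density * b.density, a.cols.toDouble)
    MatStats(a.rows, b.cols, d)
  }

  // Crude communication cost of one multiplication: shuffle both operands
  // plus the estimated result.
  def stepCost(a: MatStats, b: MatStats): Double =
    a.nnz + b.nnz + product(a, b).nnz

  def main(args: Array[String]): Unit = {
    val a = MatStats(100000L, 50000L, 1e-4)
    val b = MatStats(50000L, 20000L, 1e-4)
    val c = MatStats(20000L, 100L, 1e-2)
    val leftDeep  = stepCost(a, b) + stepCost(product(a, b), c)  // (A*B)*C
    val rightDeep = stepCost(b, c) + stepCost(a, product(b, c))  // A*(B*C)
    val pick = if (leftDeep <= rightDeep) "(A*B)*C" else "A*(B*C)"
    println(f"cost (A*B)*C = $leftDeep%.3e, cost A*(B*C) = $rightDeep%.3e -> pick $pick")
  }
}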
Author Malluhi, Qutaibah M.
Ouzzani, Mourad
Yongyang Yu
Mingjie Tang
Abbas, Mostafa M.
Aref, Walid G.
Author_xml – sequence: 1
  surname: Yongyang Yu
  fullname: Yongyang Yu
  email: yu163@cs.purdue.edu
– sequence: 2
  surname: Mingjie Tang
  fullname: Mingjie Tang
  email: tang49@cs.purdue.edu
– sequence: 3
  givenname: Walid G.
  surname: Aref
  fullname: Aref, Walid G.
  email: aref@cs.purdue.edu
– sequence: 4
  givenname: Qutaibah M.
  surname: Malluhi
  fullname: Malluhi, Qutaibah M.
  email: qmalluhi@qu.edu.qa
– sequence: 5
  givenname: Mostafa M.
  surname: Abbas
  fullname: Abbas, Mostafa M.
  email: mohamza@hbku.edu.qa
– sequence: 6
  givenname: Mourad
  surname: Ouzzani
  fullname: Ouzzani, Mourad
  email: mouzzani@hbku.edu.qa
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/ICDE.2017.150
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 9781509065431
1509065431
EISSN 2375-026X
EndPage 1058
ExternalDocumentID 7930046
Genre orig-research
ISICitedReferencesCount 20
IngestDate Wed Aug 27 02:16:23 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 12
ParticipantIDs ieee_primary_7930046
PublicationCentury 2000
PublicationDate 2017-April
PublicationDateYYYYMMDD 2017-04-01
PublicationDate_xml – month: 04
  year: 2017
  text: 2017-April
PublicationDecade 2010
PublicationTitle 2017 IEEE 33rd International Conference on Data Engineering (ICDE)
PublicationTitleAbbrev ICDE
PublicationYear 2017
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 1047
SubjectTerms Computational modeling
Data models
distributed computing
Distributed databases
Generators
Matrix computation
Optimization
query optimization
Sparks
Sparse matrices
Title In-Memory Distributed Matrix Computation Processing and Optimization
URI https://ieeexplore.ieee.org/document/7930046
hasFullText 1
inHoldings 1
linkToHtml http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwlV1LTwIxEJ4gmugJFYxve_BoYR997J55BCIgB0y4kT4TDi4Ewei_t91dwYMXb03TNM1Mp9NO5_sG4NGagFIZplgyYTBh2mKRyABzKbhNeEykyIHCQz4eJ7NZOqnA0w4LY4zJk89M0zfzv3y9VFsfKmu5veTfcwdwSAmJwgKttYuouIeKv97sIywpc96f74k1W4N2p-uzuXgz9Dj7X-VUcm_Sq_1vHafQ2MPy0GTncM6gYrJzqP3UZUClmdahM8jwyGfQfqGO58X1Ja2MRiPPxv-JivG5QlAJE3CzIZFp9OLOj7cSmNmA11532u7jsloCXrgrwAZbFXFqE0MYYdQKSaUzV2ZDS6mynASSeAYiY60OiAiIjrgKRSy1TWXEbGDiC6hmy8xcAkq1lDpQJoxETFjEhXWnELNMebVKlV5B3QtlvioIMealPK7_7n6A4_50NJwPB-PnGzjx8i9SX26hullvzR0cqY_N4n19n-vzG3vDogU