In-Memory Distributed Matrix Computation Processing and Optimization

Bibliographic details
Published in:	2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 1047-1058
Main authors:	Yu, Yongyang; Tang, Mingjie; Aref, Walid G.; Malluhi, Qutaibah M.; Abbas, Mostafa M.; Ouzzani, Mourad
Format:	Conference paper
Language:	English
Published:	IEEE, 1 April 2017
Subjects:	Computational modeling; Data models; distributed computing; Distributed databases; Generators; Matrix computation; Optimization; query optimization; Sparks; Sparse matrices
ISSN:	2375-026X

Abstract	The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications.
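Illustrative note: the abstract mentions estimating the sparsity of intermediate results and using cost-based analysis to order matrix operations. The sketch below is not the paper's estimator or optimizer; it is a minimal single-machine illustration that assumes nonzero entries are placed uniformly and independently, and uses the expected number of nonzero scalar multiplications as a stand-in cost. The names `MatrixStats`, `estimate_product`, and `best_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MatrixStats:
    """Shape and density summary of a matrix (no actual data is stored)."""
    rows: int
    cols: int
    density: float  # fraction of nonzero entries, in [0, 1]


def estimate_product(a: MatrixStats, b: MatrixStats) -> Tuple[MatrixStats, float]:
    """Estimate the stats of A*B and a simple cost for computing it.

    Assumption (not from the paper): nonzeros are uniform and independent.
    """
    if a.cols != b.rows:
        raise ValueError("inner dimensions do not match")
    k = a.cols
    # An output cell is nonzero if at least one of its k terms is nonzero.
    density = 1.0 - (1.0 - a.density * b.density) ** k
    # Expected number of scalar multiplications between nonzero pairs.
    cost = a.rows * k * b.cols * a.density * b.density
    return MatrixStats(a.rows, b.cols, density), cost


def best_order(a: MatrixStats, b: MatrixStats, c: MatrixStats) -> Tuple[str, float]:
    """Pick the cheaper parenthesization of A*B*C under the cost above."""
    ab, cost_ab = estimate_product(a, b)
    _, cost_ab_c = estimate_product(ab, c)
    left = cost_ab + cost_ab_c

    bc, cost_bc = estimate_product(b, c)
    _, cost_a_bc = estimate_product(a, bc)
    right = cost_bc + cost_a_bc

    return ("(A*B)*C", left) if left <= right else ("A*(B*C)", right)
```

For example, `best_order(MatrixStats(1000, 10, 1.0), MatrixStats(10, 1000, 1.0), MatrixStats(1000, 10, 1.0))` compares the two evaluation orders of a three-matrix chain. A distributed optimizer would also have to account for communication and partitioning, which this sketch ignores.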
Author_xml	– sequence: 1 surname: Yongyang Yu fullname: Yongyang Yu email: yu163@cs.purdue.edu – sequence: 2 surname: Mingjie Tang fullname: Mingjie Tang email: tang49@cs.purdue.edu – sequence: 3 givenname: Walid G. surname: Aref fullname: Aref, Walid G. email: aref@cs.purdue.edu – sequence: 4 givenname: Qutaibah M. surname: Malluhi fullname: Malluhi, Qutaibah M. email: qmalluhi@qu.edu.qa – sequence: 5 givenname: Mostafa M. surname: Abbas fullname: Abbas, Mostafa M. email: mohamza@hbku.edu.qa – sequence: 6 givenname: Mourad surname: Ouzzani fullname: Ouzzani, Mourad email: mouzzani@hbku.edu.qa
CODEN	IEEPAD
ContentType	Conference Proceeding
DOI	10.1109/ICDE.2017.150
Discipline	Computer Science
EISBN	9781509065431 1509065431
EISSN	2375-026X
EndPage	1058
ExternalDocumentID	7930046
Genre	orig-research
ISICitedReferencesCount	20
IsPeerReviewed	false
IsScholarly	true
Language	English
PageCount	12
PublicationCentury	2000
PublicationDate	2017-April
PublicationDateYYYYMMDD	2017-04-01
PublicationDate_xml	– month: 04 year: 2017 text: 2017-April
PublicationDecade	2010
PublicationTitle	2017 IEEE 33rd International Conference on Data Engineering (ICDE)
PublicationTitleAbbrev	ICDE
PublicationYear	2017
Publisher	IEEE
StartPage	1047
Title	In-Memory Distributed Matrix Computation Processing and Optimization
URI	https://ieeexplore.ieee.org/document/7930046