In-Memory Distributed Matrix Computation Processing and Optimization
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalab...
| Published in: | 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 1047-1058 |
|---|---|
| Main authors: | Yongyang Yu, Mingjie Tang, Walid G. Aref, Qutaibah M. Malluhi, Mostafa M. Abbas, Mourad Ouzzani |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 1 April 2017 |
| Subjects: | |
| ISSN: | 2375-026X |
| Online access: | Get full text |
| Abstract | The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This paper presents new efficient and scalable matrix processing and optimization techniques for in-memory distributed clusters. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics to optimize the cost of matrix computations in an in-memory distributed environment. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix processing and optimization techniques in Spark, a distributed in-memory computing platform. Experiments on both real and synthetic data demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement over state-of-the-art distributed matrix computation systems on a wide range of applications. |
|---|---|
| Author | Malluhi, Qutaibah M.; Ouzzani, Mourad; Yongyang Yu; Mingjie Tang; Abbas, Mostafa M.; Aref, Walid G. |
| Author_xml | 1. Yongyang Yu (yu163@cs.purdue.edu); 2. Mingjie Tang (tang49@cs.purdue.edu); 3. Aref, Walid G. (aref@cs.purdue.edu); 4. Malluhi, Qutaibah M. (qmalluhi@qu.edu.qa); 5. Abbas, Mostafa M. (mohamza@hbku.edu.qa); 6. Ouzzani, Mourad (mouzzani@hbku.edu.qa) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICDE.2017.150 |
| Discipline | Computer Science |
| EISBN | 9781509065431 1509065431 |
| EISSN | 2375-026X |
| EndPage | 1058 |
| ExternalDocumentID | 7930046 |
| Genre | orig-research |
| ISICitedReferencesCount | 20 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 12 |
| PublicationCentury | 2000 |
| PublicationDate | 2017-April |
| PublicationDateYYYYMMDD | 2017-04-01 |
| PublicationDate_xml | – month: 04 year: 2017 text: 2017-April |
| PublicationDecade | 2010 |
| PublicationTitle | 2017 IEEE 33rd International Conference on Data Engineering (ICDE) |
| PublicationTitleAbbrev | ICDE |
| PublicationYear | 2017 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 1047 |
| SubjectTerms | Computational modeling; Data models; distributed computing; Distributed databases; Generators; Matrix computation; Optimization; query optimization; Sparks; Sparse matrices |
| Title | In-Memory Distributed Matrix Computation Processing and Optimization |
| URI | https://ieeexplore.ieee.org/document/7930046 |
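
The abstract mentions estimating the sparsity of intermediate matrix-computation results and using cost-based analysis to choose an execution plan. The record does not give the paper's actual estimator or cost model, so the following is only a minimal sketch under stated assumptions: nonzeros are assumed to be placed independently and uniformly at random, and cost is approximated by the expected number of scalar multiplications. The names `MatrixStats`, `multiply_stats`, `multiply_cost`, and `plan_chain` are hypothetical and chosen for illustration.

```python
from typing import NamedTuple, Tuple


class MatrixStats(NamedTuple):
    """Shape and density (fraction of nonzero entries) of a matrix."""
    rows: int
    cols: int
    density: float


def multiply_stats(a: MatrixStats, b: MatrixStats) -> MatrixStats:
    """Estimate the shape and density of the product A * B.

    Assumption (illustrative only): nonzeros are independent and uniformly
    distributed, so an output entry is zero only if all `a.cols` partial
    products are zero.
    """
    if a.cols != b.rows:
        raise ValueError("inner dimensions do not match")
    density = 1.0 - (1.0 - a.density * b.density) ** a.cols
    return MatrixStats(a.rows, b.cols, density)


def multiply_cost(a: MatrixStats, b: MatrixStats) -> float:
    """Expected number of scalar multiplications for a sparse A * B."""
    if a.cols != b.rows:
        raise ValueError("inner dimensions do not match")
    return a.rows * a.cols * b.cols * a.density * b.density


def plan_chain(a: MatrixStats, b: MatrixStats, c: MatrixStats) -> Tuple[str, float]:
    """Pick the cheaper of (A*B)*C and A*(B*C) using estimated intermediate stats."""
    left_first = multiply_cost(a, b) + multiply_cost(multiply_stats(a, b), c)
    right_first = multiply_cost(b, c) + multiply_cost(a, multiply_stats(b, c))
    if left_first <= right_first:
        return "(A*B)*C", left_first
    return "A*(B*C)", right_first
```

For example, with a dense 1000×10 matrix, a dense 10×1000 matrix, and a dense 1000×1 vector, `plan_chain` prefers `A*(B*C)`, because the intermediate result `B*C` is far smaller than `A*B`. A real distributed optimizer would also account for communication and partitioning costs, which this sketch omits.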