Separating Storage and Compute with the Databricks Lakehouse Platform
As a part of The Arena Group's Data & AI Team, we are architecting a new unified data platform that can handle both Data Engineering and Data Science use cases for all the company's needs. In order to accomplish this, we are working to create a scalable and cost-effective data platform...
| Published in: | 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-2 |
|---|---|
| Main authors: | Kumar, Deeptaanshu; Li, Suxi |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 13 October 2022 |
| Subjects: | |
| Online access: | Get full text |
| Abstract | As a part of The Arena Group's Data & AI Team, we are architecting a new unified data platform that can handle both Data Engineering and Data Science use cases for all the company's needs. In order to accomplish this, we are working to create a scalable and cost-effective data platform that will allow us to store large volumes of historical data and process, transform, and query it with variable workloads. This means that our current Redshift data warehouse cannot serve as the backbone of our data platform, since it couples storage and compute, which forces us to pay for increased compute nodes just to store the growing amounts of historical data. As a result, we set out to explore data platforms that decoupled storage and compute, such as Snowflake and Databricks. We chose Databricks because it adequately serves our Data Engineering needs by keeping storage on AWS S3 and gives us flexibility with compute by using ad-hoc Spark clusters. It also offers us more capabilities for Data Science needs. In this paper, we will go over our proposed architecture and explain how we will take advantage of these Data Engineering and Data Science capabilities to address our initial use cases. |
|---|---|
| Author | Kumar, Deeptaanshu; Li, Suxi |
| Author_xml | 1. Kumar, Deeptaanshu (deeptaan@alumni.cmu.edu), Carnegie Mellon University, Electrical & Computer Engineering, Washington DC, USA; 2. Li, Suxi (suxi.li@thearenagroup.net), Economics University of Miami, Los Angeles, USA |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/DSAA54385.2022.10032386 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE/IET Electronic Library (IEL) (UW System Shared); IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 1665473304 9781665473309 |
| EndPage | 2 |
| ExternalDocumentID | 10032386 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IL CBEJK RIE RIL |
| ID | FETCH-LOGICAL-i204t-8d4154ba304dba8df7984aa43b77e7e5b183d8d77a10b5140a52808b295f21cb3 |
| IEDL.DBID | RIE |
| ISICitedReferencesCount | 1 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000967751000114&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| IngestDate | Thu Jan 18 11:14:48 EST 2024 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i204t-8d4154ba304dba8df7984aa43b77e7e5b183d8d77a10b5140a52808b295f21cb3 |
| PageCount | 2 |
| ParticipantIDs | ieee_primary_10032386 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-Oct.-13 |
| PublicationDateYYYYMMDD | 2022-10-13 |
| PublicationDate_xml | – month: 10 year: 2022 text: 2022-Oct.-13 day: 13 |
| PublicationDecade | 2020 |
| PublicationTitle | 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA) |
| PublicationTitleAbbrev | DSAA |
| PublicationYear | 2022 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| Score | 1.8456761 |
| Snippet | As a part of The Arena Group's Data & AI Team, we are architecting a new unified data platform that can handle both Data Engineering and Data Science use cases... |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Artificial intelligence; Big Data; Computer architecture; Data Analytics; Data engineering; Data science; Data warehouses; Databricks; Distributed Systems; Sparks; Transforms |
| Title | Separating Storage and Compute with the Databricks Lakehouse Platform |
| URI | https://ieeexplore.ieee.org/document/10032386 |
| WOSCitedRecordID | wos000967751000114&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| linkProvider | IEEE |