Separating Storage and Compute with the Databricks Lakehouse Platform


Bibliographic Details
Published in: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-2
Main Authors: Kumar, Deeptaanshu; Li, Suxi
Format: Conference Paper
Language: English
Published: IEEE, 13 October 2022
Abstract: As part of The Arena Group's Data & AI Team, we are architecting a new unified data platform that can handle both Data Engineering and Data Science use cases for all the company's needs. To accomplish this, we are working to create a scalable and cost-effective data platform that will allow us to store large volumes of historical data and to process, transform, and query it with variable workloads. This means that our current Redshift data warehouse cannot serve as the backbone of our data platform: it couples storage and compute, which forces us to pay for additional compute nodes just to store the growing volume of historical data. As a result, we set out to explore data platforms that decouple storage and compute, such as Snowflake and Databricks. We chose Databricks because it adequately serves our Data Engineering needs by keeping storage on AWS S3 and gives us flexibility with compute through ad-hoc Spark clusters. It also offers us more capabilities for Data Science needs. In this paper, we go over our proposed architecture and explain how we will take advantage of these Data Engineering and Data Science capabilities to address our initial use cases.
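
The cost argument in the abstract can be illustrated with a toy model. This is only a sketch and is not taken from the paper: the function names, the per-node capacity, and every price below are hypothetical placeholders, not actual Redshift, S3, or Databricks figures. It shows why a coupled design can force extra nodes purely for storage, while a decoupled design bills storage and compute separately.

```python
import math


def coupled_nodes(storage_tb, compute_nodes_needed, tb_per_node):
    """Nodes required when every node bundles storage with compute.

    The cluster must be large enough for whichever demand is greater,
    so growing historical data can add nodes even if queries do not need them.
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    return max(nodes_for_storage, compute_nodes_needed)


def coupled_monthly_cost(storage_tb, compute_nodes_needed, tb_per_node, node_price):
    """Monthly cost of a coupled warehouse (hypothetical flat price per node)."""
    return coupled_nodes(storage_tb, compute_nodes_needed, tb_per_node) * node_price


def decoupled_monthly_cost(storage_tb, cluster_hours, storage_price_per_tb, cluster_price_per_hour):
    """Monthly cost when object storage and ad-hoc clusters are billed separately."""
    return storage_tb * storage_price_per_tb + cluster_hours * cluster_price_per_hour


if __name__ == "__main__":
    # All inputs are made-up illustration values.
    print(coupled_nodes(storage_tb=10, compute_nodes_needed=2, tb_per_node=2))
    print(coupled_monthly_cost(10, 2, 2, node_price=100.0))
    print(decoupled_monthly_cost(10, cluster_hours=20,
                                 storage_price_per_tb=5.0,
                                 cluster_price_per_hour=3.0))
```

In this toy setup the coupled cluster is sized by storage rather than by query load, which is the situation the abstract describes for the authors' Redshift warehouse. Real pricing depends on node types, regions, and workload patterns that this sketch ignores.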
Author Kumar, Deeptaanshu
Li, Suxi
Author_xml – sequence: 1
  givenname: Deeptaanshu
  surname: Kumar
  fullname: Kumar, Deeptaanshu
  email: deeptaan@alumni.cmu.edu
  organization: Carnegie Mellon University,Electrical & Computer Engineering,Washington DC,USA
– sequence: 2
  givenname: Suxi
  surname: Li
  fullname: Li, Suxi
  email: suxi.li@thearenagroup.net
  organization: Economics University of Miami,Los Angeles,USA
ContentType Conference Proceeding
DOI 10.1109/DSAA54385.2022.10032386
EISBN 1665473304
9781665473309
SubjectTerms Artificial intelligence
Big Data
Computer architecture
Data Analytics
Data engineering
Data science
Data warehouses
Databricks
Distributed Systems
Sparks
Transforms
URI https://ieeexplore.ieee.org/document/10032386