Separating Storage and Compute with the Databricks Lakehouse Platform

Bibliographic details
Published in:	2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), pp. 1-2
Main authors:	Kumar, Deeptaanshu; Li, Suxi
Format:	Conference paper
Language:	English
Publication details:	IEEE, 13 October 2022
Subjects:	Artificial intelligence; Big Data; Computer architecture; Data Analytics; Data engineering; Data science; Data warehouses; Databricks; Distributed Systems; Sparks; Transforms

Abstract	As a part of The Arena Group's Data & AI Team, we are architecting a new unified data platform that can handle both Data Engineering and Data Science use cases for all the company's needs. In order to accomplish this, we are working to create a scalable and cost-effective data platform that will allow us to store large volumes of historical data and process, transform, and query it with variable workloads. This means that our current Redshift data warehouse cannot serve as the backbone of our data platform, since it couples storage and compute, which forces us to pay for increased compute nodes just to store the growing amounts of historical data. As a result, we set out to explore data platforms that decoupled storage and compute, such as Snowflake and Databricks. We chose Databricks because it adequately serves our Data Engineering needs by keeping storage on AWS S3 and gives us flexibility with compute by using ad-hoc Spark clusters. It also offers us more capabilities for Data Science needs. In this paper, we will go over our proposed architecture and explain how we will take advantage of these Data Engineering and Data Science capabilities to address our initial use cases.
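The cost argument in the abstract can be illustrated with a small back-of-the-envelope sketch. This is not taken from the paper: the function names and every price and capacity figure below are hypothetical placeholders, chosen only to show how a coupled warehouse may need extra compute nodes as historical data grows, while a decoupled setup pays for storage and cluster hours separately.

```python
import math


def coupled_monthly_cost(data_tb, compute_nodes_needed, tb_per_node, node_price):
    """Coupled warehouse: the node count must cover storage AND compute.

    All parameters are hypothetical; they are not real Redshift figures.
    """
    nodes_for_storage = math.ceil(data_tb / tb_per_node)
    # Whichever requirement is larger dictates how many nodes are paid for.
    nodes = max(nodes_for_storage, compute_nodes_needed)
    return nodes * node_price


def decoupled_monthly_cost(data_tb, cluster_hours, storage_price_per_tb, cluster_price_per_hour):
    """Decoupled platform: object storage and ad-hoc cluster hours are billed independently.

    All parameters are hypothetical; they are not real S3 or Databricks figures.
    """
    return data_tb * storage_price_per_tb + cluster_hours * cluster_price_per_hour


if __name__ == "__main__":
    # Same workload (2 nodes' worth of compute, or 100 cluster hours) at two data volumes.
    for data_tb in (10, 40):
        coupled = coupled_monthly_cost(data_tb, compute_nodes_needed=2,
                                       tb_per_node=2.0, node_price=1000.0)
        decoupled = decoupled_monthly_cost(data_tb, cluster_hours=100,
                                           storage_price_per_tb=25.0,
                                           cluster_price_per_hour=5.0)
        print(f"{data_tb} TB: coupled={coupled:.0f}, decoupled={decoupled:.0f}")
```

Under these made-up numbers the coupled cost scales with data volume even though the workload is unchanged, which is the effect the abstract describes; real figures would depend on the actual node types, storage tiers, and cluster usage.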
Author affiliations	1. Kumar, Deeptaanshu (deeptaan@alumni.cmu.edu), Carnegie Mellon University, Electrical & Computer Engineering, Washington DC, USA; 2. Li, Suxi (suxi.li@thearenagroup.net), Economics University of Miami, Los Angeles, USA
ContentType	Conference Proceeding
DOI	10.1109/DSAA54385.2022.10032386
EISBN	1665473304 9781665473309
URI	https://ieeexplore.ieee.org/document/10032386