Dremel: Interactive Analysis of Web-Scale Datasets
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer,
Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Google, Inc.
{melnik,andrey,jlong,gromer,shiva,mtolton,theov}@
ABSTRACT
Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand-node instances of the system.
1. INTRODUCTION
Large-scale analytical data processing has become widespread in web companies and across industries, not least due to low-cost storage that enabled collecting vast amounts of business-critical data. Putting this data at the fingertips of analysts and engineers has grown increasingly important; interactive response times often make a qualitative difference in data exploration, monitoring, online customer support, rapid prototyping, debugging of data pipelines, and other tasks.
Performing interactive data analysis at scale demands a high degree of parallelism. For example, reading one terabyte of compressed data in one second using today’s commodity disks would require tens of thousands of disks. Similarly, CPU-intensive queries may need to run on thousands of cores to complete within seconds. At Google, massively parallel computing is done using shared clusters of commodity machines [5]. A cluster typically hosts a multitude of distributed applications that share resources, have widely varying workloads, and run on machines with different hardware parameters. An individua