
examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java.

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
    .addVertex(summerVertex)
    .addEdge(new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));

In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes these on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.
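
To make the "DAG of tasks" idea concrete, the following is a minimal plain-Java sketch, not the Tez API. The class TaskDagSketch, its methods, and the vertex names (scanOrders, scanCustomers, join, aggregate) are hypothetical and chosen only for illustration. It assumes a query whose stages depend on one another and shows how a single graph can express those dependencies and yield a valid execution order, instead of a chain of separately launched jobs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: this is NOT the Tez API.
final class TaskDagSketch {
    // Each vertex maps to the vertices that consume its output.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // Number of upstream inputs each vertex waits for.
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    TaskDagSketch addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    TaskDagSketch addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes runnable once all its inputs are done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("cycle detected: not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // Assumed shape of a simple join-then-aggregate query.
        TaskDagSketch dag = new TaskDagSketch()
            .addEdge("scanOrders", "join")
            .addEdge("scanCustomers", "join")
            .addEdge("join", "aggregate");
        // prints [scanOrders, scanCustomers, join, aggregate]
        System.out.println(dag.executionOrder());
    }
}
```

In this sketch the two scan vertices have no inputs, so both are runnable immediately; a real engine could run them in parallel within one application rather than paying the startup cost once per job.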

Although some benefits are achieved when Hive executes its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its codebase simpler.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, with the Spark shell it provides an interactive frontend and with the Spark Streaming sub-project also offers near real-time processing of data streams. Spark is different things to different people; it’s both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.


Uploaded by : Dr Sophie Slater
