
examples/src/main/java/org/apache/tez/mapreduce/examples/WordCount.java.

DAG dag = new DAG("WordCount");
dag.addVertex(tokenizerVertex)
    .addVertex(summerVertex)
    .addEdge(new Edge(tokenizerVertex, summerVertex, edgeConf.createDefaultEdgeProperty()));

In Hadoop 1, Hive had no choice but to implement its SQL statements as a series of MapReduce jobs. When SQL is submitted to Hive, it generates the required MapReduce jobs behind the scenes and executes these on the cluster. This approach has two main drawbacks: there is a non-trivial startup penalty each time, and the constrained MapReduce model means that seemingly simple SQL statements are often translated into a lengthy series of multiple dependent MapReduce jobs. This is an example of the type of processing more naturally conceptualized as a DAG of tasks, as described earlier in this chapter.
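To make the DAG view concrete, the following is a minimal sketch in plain Java of a query plan expressed as tasks with dependencies, ordered so that each task runs only after its inputs are complete. It is an illustration only and does not use the Tez or Hive APIs: the JobDag class, its methods, and the task names (scanOrders, scanCustomers, join, aggregate, sort) are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a DAG of tasks; this is NOT the Tez API.
class JobDag {
    // For each task, the tasks that consume its output.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // For each task, the number of upstream tasks it depends on.
    private final Map<String, Integer> inputCount = new LinkedHashMap<>();

    public JobDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inputCount.putIfAbsent(name, 0);
        return this;
    }

    public JobDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        inputCount.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: repeatedly pick a task whose inputs are all done.
    public List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inputCount);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String task = ready.remove();
            order.add(task);
            for (String next : downstream.get(task)) {
                // One fewer unfinished input; the task is ready at zero.
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("cycle detected; not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // A join-then-aggregate-then-sort query, modeled as one DAG
        // instead of a chain of separately launched jobs.
        JobDag dag = new JobDag()
            .addEdge("scanOrders", "join")
            .addEdge("scanCustomers", "join")
            .addEdge("join", "aggregate")
            .addEdge("aggregate", "sort");
        System.out.println(dag.executionOrder());
    }
}
```

In this sketch the two scan tasks have no dependency on each other, so an engine that sees the whole graph is free to run them in parallel, whereas a chain of independent MapReduce jobs exposes only one job at a time to the scheduler.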

Although Hive gains some benefits simply from running its MapReduce jobs within YARN, the major benefits come in Hive 0.13, when the project is fully re-implemented using Tez. By exploiting the Tez APIs, which are focused on providing low-latency processing, Hive gains even more performance while making its codebase simpler.

Spark started as a standalone system, but was ported to also run on YARN as of its 0.8 release. Spark is particularly interesting because, although its classic processing model is batch-oriented, the Spark shell provides an interactive frontend and the Spark Streaming sub-project offers near real-time processing of data streams. Spark is different things to different people; it’s both a high-level API and an execution engine. At the time of writing, ports of Hive and Pig to Spark are in progress.