
Making Development Easier

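try:
    # Both fields must parse as numbers before they can be used
    df = float(df)
    num_doc = float(num_doc)
except ValueError:
    # Log malformed records instead of failing the whole task
    logger.warning("Invalid record %s", line)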

/usr/bin/hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input tweets.json \
    -output tweets.cnt \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
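As a rough local illustration of what the preceding job computes, the following sketch mimics an identity mapper (/bin/cat) feeding a counting reducer (/usr/bin/wc). The function names and the sample records are hypothetical and only serve the example. The sketch assumes a single reducer and counts characters rather than bytes, so its totals may differ from the output of a real streaming job.

```python
import io


def identity_mapper(lines):
    """Mimic `-mapper /bin/cat`: emit every input line unchanged."""
    for line in lines:
        yield line


def wc_reducer(lines):
    """Mimic `-reducer /usr/bin/wc`: count lines, words, and characters."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    return n_lines, n_words, n_chars


def simulate(lines):
    """Run the map, sort, and reduce phases in-process."""
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorting changes the order of the lines but not the counts.
    shuffled = sorted(identity_mapper(lines))
    return wc_reducer(shuffled)


if __name__ == "__main__":
    # Hypothetical sample standing in for tweets.json
    sample = io.StringIO('{"text": "hello hadoop"}\n{"text": "hello kite"}\n')
    print(simulate(sample))
```

With more than one reducer, each reducer would run wc over its own partition, so the output directory would hold one set of counts per reducer rather than a single total.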

The mapper source code can be found at https://github.com/learninghadoop2/book-examples/blob/master/ch9/streaming/tf-idf/python/tf-idf.py.

On Cloudera's QuickStart VM, Kite JARs can be found at /opt/cloudera/parcels/CDH/lib/kite/.

Kite Data is organized in a number of subprojects, some of which we'll describe in the following sections.

Implementations of the DatasetReader<E> interface read data from an underlying storage system and produce deserialized entities of type E. The newReader() method returns an appropriate implementation for a given dataset:

public interface DatasetReader<E> extends Iterator<E>, Iterable<E>, Closeable {
void open();