Hadoop provides a streaming API that supports any programming language able to read from standard input (stdin) and write to standard output (stdout). Hadoop Streaming uses these standard Unix streams as the interface between Hadoop and your program: input data is passed to the map function on stdin, which processes it line by line and writes results to stdout; the reduce function likewise reads from stdin (where Hadoop guarantees the input is sorted by key) and writes its results to stdout.
- Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java!
- Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write your MapReduce program in any language that can read standard input and write to standard output (a minimal mapper sketch follows this list).
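To make the stdin/stdout contract concrete, here is a minimal word-count mapper sketch in Python; the file name mapper.py matches the usage example below, but the word-count task itself is just an illustration, not anything Hadoop prescribes.

```python
#!/usr/bin/env python3
# mapper.py -- hypothetical word-count mapper for Hadoop Streaming.
# Hadoop delivers each input record as one line on stdin; every
# "key<TAB>value" line printed to stdout becomes a map output pair.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```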
Little trick (set in the ~/.bashrc of the hadoop user):
```bash
# Submit a streaming job: $1 = mapper script, $2 = reducer script,
# $3 = HDFS input path, $4 = HDFS output path.
run_mapreduce() {
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -mapper "$1" -reducer "$2" -file "$1" -file "$2" \
        -input "$3" -output "$4"
}
alias hs=run_mapreduce
```
Then you can submit a streaming job with a single command:

```bash
hs mapper.py reducer.py hdfs_data_in hdfs_data_out
```
- "hdfs_data_out" is the output data folder, it is important that this folder doesn't already exist
Hadoop - The Definitive Guide
macOS installation guide
- Hortonworks Sandbox - The Sandbox is a straightforward, pre-configured learning environment that contains the latest developments from Apache Hadoop, specifically the Hortonworks Data Platform (HDP).
- big-data-europe/docker-hadoop: Apache Hadoop docker image