
Hadoop

  • Hadoop Fundamentals
  • HDFS
  • YARN
  • MapReduce
  • Spark
  • Hive
  • Pig
  • HBase
  • Sqoop
  • MongoDB
  • Hadoop Security

Hadoop Streaming

Hadoop provides a streaming API that supports any programming language able to read from standard input (stdin) and write to standard output (stdout). The streaming API uses these standard Unix streams as the interface between Hadoop and your program: input data is passed via stdin to the map function, which processes it line by line and writes its results to stdout; the reduce function likewise reads from stdin (where Hadoop guarantees the records arrive sorted by key) and writes its output to stdout.

  • Hadoop Streaming is an API to MapReduce that lets you write your map and reduce functions in languages other than Java.
  • Because it uses Unix standard streams as the interface between Hadoop and your program, any language that can read standard input and write to standard output can be used to write a MapReduce program.
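As a minimal sketch of this line-by-line stdin/stdout contract, a hypothetical word-count mapper and reducer might look as follows. Both phases are shown in one block for brevity; in a real job each would be its own executable script (e.g. mapper.py and reducer.py) passed to the streaming jar:

```python
# Hypothetical word-count mapper/reducer pair for Hadoop Streaming.
# Each phase is a plain stdin -> stdout filter emitting
# tab-separated key/value lines.
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' pair per word on the input."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_pairs(lines):
    """Reducer: Hadoop delivers its stdin sorted by key, so equal keys
    are adjacent and groupby can sum each word's counts in one pass."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# In mapper.py the driver would be:
#     for out in map_lines(sys.stdin): print(out)
# and in reducer.py:
#     for out in reduce_pairs(sys.stdin): print(out)
```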

A handy trick: define the following in the ~/.bashrc of the hadoop user:

```sh
# run_mapreduce <mapper> <reducer> <input dir> <output dir>
# The streaming jar ships with Hadoop; the contrib path below matches
# older 1.x layouts (newer releases keep it under share/hadoop/tools/lib).
run_mapreduce() {
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
        -mapper "$1" -reducer "$2" \
        -file "$1" -file "$2" \
        -input "$3" -output "$4"
}

alias hs=run_mapreduce
```

Then you can run a streaming job with:

```sh
hs mapper.py reducer.py hdfs_data_in hdfs_data_out
```

  • "hdfs_data_out" is the output folder; it must not already exist, because Hadoop refuses to overwrite an existing output directory and fails the job if it finds one.
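Since streaming jobs are ordinary stdin-to-stdout filters, the whole dataflow can be rehearsed locally with a Unix pipeline before touching the cluster, with `sort` standing in for Hadoop's shuffle, which delivers records to the reducer grouped by key. In this sketch `tr` and `uniq -c` play the roles of a hypothetical word-count mapper and reducer so it runs without Hadoop installed; substitute `./mapper.py` and `./reducer.py` to dry-run real scripts:

```shell
# map (split into words) | shuffle (sort by key) | reduce (count per key)
echo "hello world hello" | tr ' ' '\n' | sort | uniq -c
```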

Book

  • Hadoop: The Definitive Guide

Links

  • Article
  • Getting Started
  • macOS installation guide
  • Sandbox