What has to be written in the Observation before starting the 1st Experiment?
Hadoop Overview:
Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Hadoop - HDFS Overview:
The Hadoop Distributed File System (HDFS) is based on a distributed file system design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant even though it is built from low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are spread across multiple machines and stored in a redundant fashion, protecting the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS:
- It is suitable for distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of the namenode and datanodes help users easily check the status of the cluster.
- Streaming access to file system data.
- HDFS provides file permissions and authentication.
Namenode:
The namenode is software that runs on commodity hardware containing the GNU/Linux operating system. The system hosting the namenode acts as the master server, and it does the following tasks:
- Manages the file system namespace.
- Regulates client’s access to files.
- Executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
A datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system.
- Datanodes perform read-write operations on the file systems, as per client request.
- They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
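The split of responsibilities between the namenode and the datanodes is also visible from client code. The following is a minimal sketch (not one of the lab programs; the paths and class name are invented for illustration) using the Hadoop FileSystem Java API. Namespace operations such as mkdirs and rename are served by the namenode, while the file bytes themselves are written to and read from the datanodes:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml/hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // client handle to the file system
        fs.mkdirs(new Path("/demo"));               // namespace operation, handled by the namenode
        FSDataOutputStream out = fs.create(new Path("/demo/a.txt"));
        out.writeUTF("hello hdfs");                 // block data goes to the datanodes
        out.close();
        fs.rename(new Path("/demo/a.txt"), new Path("/demo/b.txt")); // metadata-only change on the namenode
        FSDataInputStream in = fs.open(new Path("/demo/b.txt"));
        System.out.println(in.readUTF());           // blocks are read back from the datanodes
        in.close();
        fs.close();
    }
}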
MapReduce:
MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence in the name MapReduce implies, the reduce task is always performed after the map task.
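For example, to count the words in the line "to be or not to be" (the task of Experiment 03 below): the map task emits the pairs (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); the framework then groups the pairs by key, and the reduce task sums each group, emitting (be, 2), (not, 1), (or, 1), (to, 2).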
Architecture of MapReduce:
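In outline: the client submits the job to the JobTracker, which splits the input and schedules map and reduce tasks on the TaskTracker nodes of the cluster; the map outputs are partitioned, shuffled, and sorted by key before reaching the reducers, and the reducers' output is written back to HDFS.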
Experiment 01:
Steps to Do the Experiment in Hadoop Environment:
1. Go to the Terminal. (Ctrl + Alt + T)
2. Type the Command startCDH.sh
3. Create a Directory.
hadoop fs -mkdir DC (here DC is the Directory Name)
4. Create a File.
gedit <filename>.txt
5. Store the instructions within the file, save the file, and come back out to the Terminal.
Ex. 1+2+3+4+5 (this number sequence is used for our experiment)
6. Move this file to HDFS.
hadoop fs -put <filename>.txt DC/<filename>.txt
7. To check or see the file, go to the Browser --> click on Hadoop Manager --> select HDFS Namenode.
Then we can see the FileSystemManager option. Click on it --> click on Hadoop --> the DC folder. There you can see the <filename>.txt file.
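(Alternatively, the file contents can be printed straight from the Terminal with hadoop fs -cat DC/<filename>.txt.)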
8. Then click on Eclipse on the Desktop.
Then go to File --> New --> Java Project. Give the project name and then click on Next.
9. Click on the '+' icon on the project name; within the project you will find the src folder. Right click on the Project Name/Folder --> select Build Path --> select the Configure Build Path option.
10. Now we have to add some libraries to the project. In order to add them, first you have to download the hadoop-core-1.2.1.jar file.
11. After downloading, unzip/extract it. Now add the jar files by clicking on the Add External JARs option.
12. Now we have to create the Java class files. To do this, right click on the Project Name --> select the Class option. Add three class files, namely sum_mapper.java, sum_reducer.java, and sum_runner.java.
After creating these class files, simply copy in the code that was given to you in the Lab (it is listed under Experiment 02 below) and save them.
13. Now we have to export these files to a jar file. Right click on the Project Name --> select Export --> select Java --> select JAR file, then click Next and then Finish.
14. Move the exported jar file to the Hadoop folder.
15. Now go to Eclipse, then go to the 'Workspace' folder, and place the jar file there.
16. Now copy the path of the <projectname>.jar file.
17. Now go to the Terminal and type:
hadoop jar <projectname>.jar sum_runner DC/<filename>.txt out
18. To see the Output:
Go to the Browser --> Select Hadoop Manager --> Select HDFS Namenode --> Browse FileSystem Manager.
19. Now in FileSystem Manager, select User --> click on the out directory --> click on part-00000. It will display the result as follows:
1+2+3+4+5 15
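(The same result can also be printed from the Terminal with hadoop fs -cat out/part-00000, out being the output directory named in step 17.)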
20. To stop the process, Open Terminal and Type: stopCDH.sh
Perform:
1. Start HDFS and verify that it's running.
2. Create a new directory /exercise1 on HDFS.
3. Upload $PLAY_AREA/exercises/filesystem/hamlet.txt to HDFS under /exercise1 directory.
4. View the content of the /exercise1 directory.
5. Determine the size of the hamlet.txt file in KB that resides on HDFS (not local directory).
6. Print the first 25 lines to the screen from hamlet.txt on HDFS.
7. Copy hamlet.txt to hamlet_hdfsCopy.txt.
8. Copy hamlet.txt back to the local file system and name it hamlet_copy.txt.
9. Check the entire filesystem for inconsistencies/problems.
10. Delete hamlet.txt from HDFS.
11. Delete the /exercise1 directory from HDFS.
12. Take a second to look at other available shell options.
Answers:
1. Perform the following steps:
a. $ cd $HADOOP_HOME/sbin
b. $ ./start-dfs.sh (this starts the Namenode, the Secondary Namenode, and all the configured Datanodes, which in this case is just one, on localhost)
c. You can verify with the browser or via command line:
i. Open a browser and navigate to http://localhost:50070. Make sure there are no warnings
under the 'Cluster Summary' section and that there is 1 live node. Make sure there are no
'Dead Nodes' and that there are 0 under-replicated blocks. Click on the 'Live Nodes' link
and verify that there are no failed volumes and that 'Admin State' is listed as 'In Service'.
ii. Secondary Namenode can be confirmed via http://localhost:50090
iii. Execute $ hadoop dfsadmin -report on the command line; you will get a report about the
status of the cluster. Make sure there is 1 live node, 0 dead nodes, and 0 under-replicated
blocks.
2. $ hdfs dfs -mkdir /exercise1
3. Perform the following steps:
a. $ cd $PLAY_AREA/exercises/filesystem
b. $ hdfs dfs -put hamlet.txt /exercise1/
4. $ hdfs dfs -ls /exercise1/
5. Perform the following steps:
a. $ hdfs dfs -du -h /exercise1/hamlet.txt
206.3k /exercise1/hamlet.txt
6. $ hdfs dfs -cat /exercise1/hamlet.txt | head -n 25
7. $ hdfs dfs -cp /exercise1/hamlet.txt /exercise1/hamlet_hdfsCopy.txt
8. $ hdfs dfs -get /exercise1/hamlet.txt hamlet_copy.txt
9. $ hdfs fsck /
10. $ hdfs dfs -rm /exercise1/hamlet.txt
11. $ hdfs dfs -rm -r /exercise1
12. $ hdfs dfs -help
Experiment 02:
Finding the Sum of the Given Sequence.
sum_mapper.java:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class sum_mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // split the line on '+' and emit each number, keyed by the whole input line
            String s = tokenizer.nextToken("+");
            int p = Integer.parseInt(s);
            output.collect(value, new IntWritable(p));
        }
    }
}
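For the input line 1+2+3+4+5, this mapper emits five pairs that all share the full line as their key: (1+2+3+4+5, 1), (1+2+3+4+5, 2), ..., (1+2+3+4+5, 5). Because every value arrives under the same key, the reducer can simply add them all up.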
sum_reducer.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class sum_reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            // add up all the numbers emitted for this key
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
sum_runner.java:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class sum_runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(sum_runner.class);
        conf.setJobName("Sumofthedigits");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(sum_mapper.class);
        conf.setCombinerClass(sum_reducer.class);
        conf.setReducerClass(sum_reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input file on HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory (must not already exist)
        JobClient.runJob(conf);
    }
}
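The job is packaged and run exactly as in Experiment 01, e.g. hadoop jar <projectname>.jar sum_runner DC/<filename>.txt out; for the input 1+2+3+4+5 the output is 1+2+3+4+5 15.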
Experiment 03:
Counting the number of Words in the given data.
wc_mapper.java:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class wc_mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // emit (word, 1) for every whitespace-separated token
            String p = tokenizer.nextToken();
            output.collect(new Text(p), new IntWritable(1));
        }
    }
}
wc_reducer.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class wc_reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
wc_runner.java:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class wc_runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(wc_runner.class);
        conf.setJobName("WordCount");
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(wc_mapper.class);
        conf.setCombinerClass(wc_reducer.class);
        conf.setReducerClass(wc_reducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
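It is run the same way, e.g. hadoop jar <projectname>.jar wc_runner DC/<filename>.txt out. For an input line such as to be or not to be, the output lists each word with its count:
be 2
not 1
or 1
to 2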
Experiment 04:
Finding the Sum of Squares of the Given Sequence.
SqSum_mapper.java:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class SqSum_mapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // split the line on '+'; each term looks like n^2
            String t = tokenizer.nextToken("+");
            // strip the trailing "^2" if present, leaving just the base n
            int caret = t.indexOf('^');
            String s = (caret >= 0) ? t.substring(0, caret) : t;
            int p = Integer.parseInt(s.trim());
            output.collect(value, new IntWritable(p * p)); // emit the square, keyed by the whole line
        }
    }
}
SqSum_reducer.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class SqSum_reducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum = sum + values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
SqSum_runner.java:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class SqSum_runner {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(SqSum_runner.class);
        conf.setJobName("Sumofthesquaresdigits");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(SqSum_mapper.class);
        conf.setCombinerClass(SqSum_reducer.class);
        conf.setReducerClass(SqSum_reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
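The input file here is expected to contain terms of the form n^2 joined by +, e.g. 1^2+2^2+3^2. The mapper squares each base, so for that line the job outputs 1^2+2^2+3^2 14.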
Experiment 05:
Finding the Maximum Temperature in the Given Year.
TempMap.java:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class TempMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // each input record has the form year,temperature
        String record = value.toString();
        String[] parts = record.split(",");
        output.collect(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
    }
}
TempReduce.java:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class TempReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int maxValue = Integer.MIN_VALUE; // so that negative temperatures are handled correctly
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}
TempRun.java:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class TempRun {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(TempRun.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(TempMap.class);
        job.setReducerClass(TempReduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        JobClient.runJob(job);
    }
}
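The input here is expected to be one year,temperature record per line, for example:
1950,22
1950,38
1951,29
The reducer keeps the maximum reading for each year, so the output for this data would be 1950 38 and 1951 29.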