
          Submitting a Hadoop MapReduce job to a remote JobTracker

While messing around with MapReduce code, I've found it a bit tedious to generate the jar file, copy it to the machine running the JobTracker, and run the job every time the job has been altered. I should be able to run my jobs directly from my development environment, as illustrated in the figure below. This post explains how I've "solved" this problem. It may also help when integrating Hadoop with other applications. I by no means claim that this is the proper way to do it, but it does the trick for me.

[Figure: My Hadoop infrastructure]
I assume that you have a (single-node) Hadoop 1.0.3 cluster properly installed on a dedicated or virtual machine. In this example, the JobTracker and HDFS reside at IP address 192.168.102.131. Let's start out with a simple job that does nothing except start up and terminate:

package com.pcbje.hadoopjobs;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MyFirstJob {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();

        JobConf job = new JobConf(config);
        job.setJarByClass(MyFirstJob.class);
        job.setJobName("My first job");

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MyFirstJob.MyFirstMapper.class);
        job.setReducerClass(MyFirstJob.MyFirstReducer.class);

        // Declare the output types to match the mapper and reducer below.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        JobClient.runJob(job);
    }

    private static class MyFirstMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Intentionally empty: the job only starts up and terminates.
        }
    }

    private static class MyFirstReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            // Intentionally empty: the job only starts up and terminates.
        }
    }
}
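For comparison, the workflow I want to avoid looks roughly like this (just a sketch; the jar name, remote user and paths are assumptions):

# Export or build the job jar, copy it to the JobTracker machine, and run it there
scp myfirstjob.jar user@192.168.102.131:/tmp/
ssh user@192.168.102.131 "hadoop jar /tmp/myfirstjob.jar com.pcbje.hadoopjobs.MyFirstJob /input/path /output/path"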


Now, most of the examples you find online typically show a local-mode setup where all the components of Hadoop (HDFS, JobTracker, etc.) run on the same machine. A typical mapred-site.xml configuration might look like this:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

          As far as I can tell, such a configuration requires that jobs are submitted from the same node as the JobTracker. This is what I want to avoid. The first thing to do is to change the fs.default.name attribute to the IP address of my NameNode.

Configuration conf = new Configuration();
conf.set("fs.default.name", "192.168.102.131:9000");

          And in core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>192.168.102.131:9000</value>
  </property>
</configuration>

This tells the job to connect to the HDFS residing on a different machine. Running the job with this configuration will read from and write to the remote HDFS correctly, but the JobTracker at 192.168.102.131:9001 will not notice it. This means that the admin panel at 192.168.102.131:50030 won't list the job either. So the next thing to do is to tell the job configuration to submit the job to the appropriate JobTracker, like this:

          config.set("mapred.job.tracker", "192.168.102.131:9001");

You also need to change mapred-site.xml on the cluster to allow external connections. This can be done by replacing "localhost" with the JobTracker's IP address:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.102.131:9001</value>
  </property>
</configuration>

Restart Hadoop. Upon trying to run your job, you may get an exception like this:
          SEVERE: PriviledgedActionException as:[user] cause:org.apache.hadoop.security.AccessControlException:
          org.apache.hadoop.security.AccessControlException: Permission denied: user=[user], access=WRITE, inode="mapred":root:supergroup:rwxr-xr-x
          
If you do, the job client is trying to create its staging directory in a location on HDFS where your user has no write permission. This may be solved by adding the following to mapred-site.xml, which moves the staging root under /user:
<configuration>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
  </property>
</configuration>

          And then execute the following commands:
          stop-mapred.sh
          start-mapred.sh
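As an optional sanity check (not part of the original post), you can list /user on the cluster node to confirm that HDFS is up and that the staging root exists:

hadoop fs -ls /user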
          
When you now submit your job, it should be picked up by the admin page over at :50030. However, it will most probably fail, and the log will tell you something like:
          java.lang.ClassNotFoundException: com.pcbje.hadoopjobs.MyFirstJob$MyFirstMapper
          
In order to fix this, you have to ensure that all dependencies of the submitted job are available to the JobTracker. This can be achieved by exporting the project as a runnable jar and then executing something like:
          java -jar myfirstjob-jar-with-dependencies.jar /input/path /output/path
          
If your user has the appropriate permissions on the input and output directories in HDFS, the job should now run successfully. This can be verified in the console and on the administration panel.

Manually exporting runnable jars requires a lot of clicks in IDEs such as Eclipse. If you are using Maven, you can tell it to build the jar with its dependencies (see this answer for details; a sample plugin configuration is sketched at the end of this post), which makes the process a whole lot easier. Finally, to make it even easier, place a tiny bash script in the same folder as pom.xml that builds the Maven project and executes the jar:
          #!/bin/sh
          mvn assembly:assembly
          java -jar $1 $2 $3
          
          After making the script executable, you can build and submit the job with the following command:
./build-and-run-job target/myfirstjob-jar-with-dependencies.jar /input/path /output/path
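For completeness, the mvn assembly:assembly step assumes that the maven-assembly-plugin is set up to build a jar with dependencies. A minimal sketch of such a configuration in pom.xml (the main class is an assumption based on the job above) could look like this:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
        <archive>
          <manifest>
            <mainClass>com.pcbje.hadoopjobs.MyFirstJob</mainClass>
          </manifest>
        </archive>
      </configuration>
    </plugin>
  </plugins>
</build>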

Posted on 2012-10-03 15:06 by paulwong. Categories: HADOOP, Cloud Computing, HBASE
