paulwong

          Install hadoop+hbase+nutch+elasticsearch

          This document is for Anyela Chavarro.
          Only these versions of each framework are known to work together:
          Hadoop 1.2.1
          HBase 0.90.4
          Nutch 2.2.1
          Elasticsearch 0.19.4
          Linux version: Ubuntu 12.04.2 LTS

          Hadoop cluster environment:
          Name node/Job tracker
          192.168.1.100 master

          Data node/Task tracker
          192.168.1.101 slave1
          192.168.1.102 slave2
          192.168.1.103 slave3

          Install Hadoop (pseudo-distributed mode)
          1. add user hadoop
            useradd  -s /bin/bash -d /home/hadoop -m hadoop
          2. set password
            passwd hadoop
          3. log in as hadoop
            su hadoop
          4. add a data folder
            mkdir data
          5. uninstall any preinstalled OpenJDK/GCJ packages (these rpm commands apply to CentOS; on Ubuntu this step can be skipped)
            [hadoop@netfox ~] rpm -qa | grep java
            java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
            java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
            [hadoop@netfox ~] rpm -e --nodeps java-1.4.2-gcj-compat-1.4.2.0-40jpp.115
            [hadoop@netfox ~] rpm -e --nodeps java-1.6.0-openjdk-1.6.0.0-1.7.b09.el5
          6. install JDK 1.6
            apt-get update
            apt-get install python-software-properties
            add-apt-repository ppa:webupd8team/java
            apt-get update
            apt-get install oracle-java6-installer
          7. get hadoop tar file
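            for example, from the Apache archive (the URL is an assumption; any Apache mirror works):
            wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz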
          8. untar the tar file
            [hadoop@netfox hadoop]$ tar -vxf hadoop-1.2.1.tar.gz
          9. install ssh-server
            apt-get install openssh-server
          10. set up an ssh key (ssh-keygen is a built-in tool on Linux)
            [hadoop@netfox hadoop]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
          11. add the public key to the authorized keys file
            [hadoop@netfox hadoop]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
          12. set restrictive permissions on the authorized keys file
            [hadoop@netfox hadoop]$ chmod 600 ~/.ssh/authorized_keys
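            as a quick check, ssh to the local machine; it should log in without prompting for a password:
            [hadoop@netfox hadoop]$ ssh localhost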
          13. find the IP of the local machine
            [hadoop@netfox hadoop]$ ifconfig
            the IP appears in a line like:
            inet addr:192.168.1.100
          14. add the following line to /etc/hosts; it should be the first line of the file.
            [hadoop@netfox hadoop]$ vi /etc/hosts
            192.168.1.100 master
          15. add to /etc/profile
            export JAVA_HOME=/usr/lib/jvm/java-6-oracle
            export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
            export HBASE_HOME=/home/hadoop/hbase-0.90.4
            export PATH=$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
          16. source it
            [hadoop@netfox hadoop]$ source /etc/profile
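            to verify that the variables took effect:
            [hadoop@netfox hadoop]$ echo $HADOOP_HOME
            /home/hadoop/hadoop-1.2.1
            [hadoop@netfox hadoop]$ hadoop version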
          17. create folder
            hadoop@netfox:~$ mkdir /home/hadoop/data
          18. edit /home/hadoop/hadoop-1.2.1/conf/hdfs-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

            <!-- Put site-specific property overrides in this file. -->

            <configuration>

            <property>
              <name>dfs.replication</name>
              <value>1</value>
              <description>Default block replication.
              The actual number of replications can be specified when the file is created.
              The default is used if replication is not specified in create time.
              </description>
            </property>

            <property>
              <name>dfs.permissions</name>
              <value>false</value>
            </property>

            </configuration>
          19. edit /home/hadoop/hadoop-1.2.1/conf/mapred-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

            <!-- Put site-specific property overrides in this file. -->

            <configuration>

            <property>
              <name>mapred.job.tracker</name>
              <value>master:9002</value>
              <description>The host and port that the MapReduce job tracker runs
              at. If "local", then jobs are run in-process as a single map
              and reduce task.
              </description>
            </property>

            </configuration>
          20. edit /home/hadoop/hadoop-1.2.1/conf/core-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

            <!-- Put site-specific property overrides in this file. -->

            <configuration>

            <property>
              <name>hadoop.tmp.dir</name>
              <value>/home/hadoop/data</value>
              <description>A base for other temporary directories.</description>
            </property>

            <property>
              <name>fs.default.name</name>
              <value>hdfs://master:9001</value>
              <description>The name of the default file system.  A URI whose
              scheme and authority determine the FileSystem implementation.  The
              uri's scheme determines the config property (fs.SCHEME.impl) naming
              the FileSystem implementation class.  The uri's authority is used to
              determine the host, port, etc. for a filesystem.
              </description>
            </property>

            </configuration>
          21. add to /home/hadoop/hadoop-1.2.1/conf/hadoop-env.sh
            export JAVA_HOME=/usr/lib/jvm/java-6-oracle
          22. add the following line to both /home/hadoop/hadoop-1.2.1/conf/masters and /home/hadoop/hadoop-1.2.1/conf/slaves
            master
          23. format the hadoop namenode
            [hadoop@netfox ~]$ hadoop namenode -format
          24. start hadoop
            [hadoop@netfox hadoop]$ start-all.sh 
          25. check that hadoop installed correctly
            [hadoop@netfox hadoop]$ hadoop dfs -ls /
            it should list the HDFS root without error messages, for example (the exact directories will vary):
            Found 4 items
            drwxr-xr-x   - hadoop supergroup          0 2013-08-28 14:02 /chukwa
            drwxr-xr-x   - hadoop supergroup          0 2013-08-29 09:53 /hbase
            drwxr-xr-x   - hadoop supergroup          0 2013-08-27 10:36 /opt
            drwxr-xr-x   - hadoop supergroup          0 2013-09-01 15:22 /tmp
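            jps (shipped with the JDK) should also list the five Hadoop daemons in pseudo-distributed mode: NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker.
            [hadoop@netfox hadoop]$ jps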

          Install Hadoop (fully-distributed mode)
          Repeat steps 1-23 on slave1 to slave3, with the following differences:
          1. change step 11 as below:
            don't generate a key pair on each slave; instead, transfer the master's public key to each slave.
            [hadoop@netfox hadoop]$ scp ~/.ssh/id_dsa.pub hadoop@slave1:/home/hadoop
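            then, on each slave, append the copied key to the authorized keys (a sketch; it assumes the scp above landed in /home/hadoop):
            [hadoop@slave1 ~]$ mkdir -p ~/.ssh && chmod 700 ~/.ssh
            [hadoop@slave1 ~]$ cat /home/hadoop/id_dsa.pub >> ~/.ssh/authorized_keys
            [hadoop@slave1 ~]$ chmod 600 ~/.ssh/authorized_keys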
          2. change step 14 as below:
            add all cluster nodes to /etc/hosts
            [hadoop@netfox hadoop]$ vi /etc/hosts
            192.168.1.100 master
            192.168.1.101 slave1
            192.168.1.102 slave2
            192.168.1.103 slave3
          3. step 22, add to /home/hadoop/hadoop-1.2.1/conf/masters
            master
            add to /home/hadoop/hadoop-1.2.1/conf/slaves
            slave1
            slave2
            slave3
          4. step 24, start hadoop only on the master
            [hadoop@netfox hadoop]$ start-all.sh 


          Install HBase
          1. get hbase tar file
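            for example, from the Apache archive (the URL is an assumption; any mirror works):
            wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz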
          2. untar the file
            [hadoop@netfox ~]$ tar -vxf hbase-0.90.4.tar.gz
          3. change /home/hadoop/hbase-0.90.4/conf/hbase-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
            <!--
            /**
             * Copyright 2010 The Apache Software Foundation
             *
             * Licensed to the Apache Software Foundation (ASF) under one
             * or more contributor license agreements.  See the NOTICE file
             * distributed with this work for additional information
             * regarding copyright ownership.  The ASF licenses this file
             * to you under the Apache License, Version 2.0 (the
             * "License"); you may not use this file except in compliance
             * with the License.  You may obtain a copy of the License at
             *
             *     http://www.apache.org/licenses/LICENSE-2.0
             *
             * Unless required by applicable law or agreed to in writing, software
             * distributed under the License is distributed on an "AS IS" BASIS,
             * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
             * See the License for the specific language governing permissions and
             * limitations under the License.
             */
            -->
            <configuration>

            <property>
              <name>hbase.rootdir</name>
              <value>hdfs://master:9001/hbase</value>
            </property>

            <property>
              <name>hbase.cluster.distributed</name>
              <value>true</value>
            </property>

            <property>
              <name>hbase.zookeeper.quorum</name>
              <value>localhost</value>
            </property>

            </configuration>
          4. change /home/hadoop/hbase-0.90.4/conf/regionservers as below
            master
          5. add JAVA_HOME to /home/hadoop/hbase-0.90.4/conf/hbase-env.sh
            export JAVA_HOME=/usr/lib/jvm/java-6-oracle
          6. replace the hadoop jar bundled with HBase with the one from the installed Hadoop (the client jar must match the cluster version), and copy over its commons dependencies
            [hadoop@netfox ~]$ rm /home/hadoop/hbase-0.90.4/lib/hadoop-core-0.20-append-r1056497.jar
            [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/hadoop-core-1.2.1.jar /home/hadoop/hbase-0.90.4/lib
            [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-collections-3.2.1.jar /home/hadoop/hbase-0.90.4/lib
            [hadoop@netfox ~]$ cp /home/hadoop/hadoop-1.2.1/lib/commons-configuration-1.6.jar /home/hadoop/hbase-0.90.4/lib
          7. start hbase
            [hadoop@netfox ~]$ start-hbase.sh  
          8. check that hbase installed correctly
            [hadoop@netfox ~]$ hbase shell
            HBase Shell; enter 'help<RETURN>' for list of supported commands.
            Type "exit<RETURN>" to leave the HBase Shell
            Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

            hbase(main):001:0> list
            TABLE
            webpage
            1 row(s) in 0.5270 seconds
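            optionally, create and scan a throwaway table as a further smoke test:
            hbase(main):002:0> create 'test', 'cf'
            hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
            hbase(main):004:0> scan 'test'
            hbase(main):005:0> disable 'test'
            hbase(main):006:0> drop 'test'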


          Install Nutch
          1. install ant
            [root@netfox ~]# apt-get install ant
          2. switch user and folder
            [root@netfox ~]# su hadoop          
            [hadoop@netfox root]$ cd ~
          3. get nutch tar file
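            for example, from the Apache archive (the URL is an assumption; any mirror works):
            wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz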
          4. untar this file
            [hadoop@netfox webcrawer]$ tar -vxf apache-nutch-2.2.1-src.tar.gz
          5. add to /etc/profile
            export NUTCH_HOME=/home/hadoop/webcrawer/apache-nutch-2.2.1
            export PATH=$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
          6. change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/hbase-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
            <!--
            /**
             * Copyright 2009 The Apache Software Foundation
             *
             * Licensed to the Apache Software Foundation (ASF) under one
             * or more contributor license agreements.  See the NOTICE file
             * distributed with this work for additional information
             * regarding copyright ownership.  The ASF licenses this file
             * to you under the Apache License, Version 2.0 (the
             * "License"); you may not use this file except in compliance
             * with the License.  You may obtain a copy of the License at
             *
             *     http://www.apache.org/licenses/LICENSE-2.0
             *
             * Unless required by applicable law or agreed to in writing, software
             * distributed under the License is distributed on an "AS IS" BASIS,
             * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
             * See the License for the specific language governing permissions and
             * limitations under the License.
             */
            -->
            <configuration>

            <property>
              <name>hbase.rootdir</name>
              <value>hdfs://master:9001/hbase</value>
            </property>

            <property>
              <name>hbase.cluster.distributed</name>
              <value>true</value>
            </property>

            <property>
              <name>hbase.zookeeper.quorum</name>
              <value>localhost</value>
            </property>

            </configuration>
          7. change /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/nutch-site.xml as below
            <?xml version="1.0"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

            <!-- Put site-specific property overrides in this file. -->

            <configuration>

            <property>
              <name>storage.data.store.class</name>
              <value>org.apache.gora.hbase.store.HBaseStore</value>
              <description>Default class for storing data</description>
            </property>

            <property>
              <name>http.agent.name</name>
              <value>NutchCrawler</value>
            </property>

            <property>
              <name>http.robots.agents</name>
              <value>NutchCrawler,*</value>
            </property>

            </configuration>
          8. Uncomment the following in the /home/hadoop/webcrawer/apache-nutch-2.2.1/ivy/ivy.xml file
            <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />
          9. add to /home/hadoop/webcrawer/apache-nutch-2.2.1/conf/gora.properties file
            gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
          10. go to the nutch installation folder (/home/hadoop/webcrawer/apache-nutch-2.2.1) and run
            ant clean
            ant runtime
          11. Create a directory in HDFS to hold the seed URLs.
            [hadoop@netfox ~]$ hadoop dfs -mkdir urls
          12. Create a text file with the seed URLs for the crawl, and upload it to the directory created in the previous step.
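            for example, a one-line seed file (the URL here is only an illustration):
            [hadoop@netfox ~]$ echo 'http://nutch.apache.org/' > seed.txt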
            [hadoop@netfox ~]$ hadoop dfs -put seed.txt urls
          13. Issue the following commands from inside the copied deploy directory on the
            JobTracker node to inject the seed URLs into the Nutch database and to generate the
            initial fetch list (-topN <N> selects the number of top URLs; the default is Long.MAX_VALUE)
            [hadoop@netfox ~]$ nutch inject urls
            [hadoop@netfox ~]$ nutch generate  -topN 3
          14. Issue the following commands from inside the copied deploy directory on the
            JobTracker node
            [hadoop@netfox ~]$ nutch fetch -all
            [hadoop@netfox ~]$ nutch parse -all
            [hadoop@netfox ~]$ nutch updatedb
            [hadoop@netfox ~]$ nutch generate -topN 10
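            these four commands make up one crawl round; to crawl deeper, the round can be repeated, e.g. (a sketch):
            for i in 1 2 3; do
              nutch generate -topN 10
              nutch fetch -all
              nutch parse -all
              nutch updatedb
            done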



          Install ElasticSearch
          1. get the tar file
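            for example (this download URL is from the 0.19.x era and is an assumption):
            wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.19.4.tar.gz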
          2. untar the file
            [hadoop@netfox ~]$ tar -vxf elasticsearch-0.19.4.tar.gz
          3. add to /etc/profile
            export ELAST_HOME=/home/hadoop/webcrawer/elasticsearch-0.19.4

            export PATH=$ELAST_HOME/bin:$NUTCH_HOME/runtime/deploy/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$JAVA_HOME/bin:$PATH
          4. Go to the extracted ElasticSearch directory and execute the following command to
            start the ElasticSearch server in the foreground
            > bin/elasticsearch -f
          5. Go to the $NUTCH_HOME/runtime/deploy directory (or $NUTCH_HOME/runtime/local
            if you are running Nutch in local mode). Execute the following
            command to index the data crawled by Nutch into the ElasticSearch server.
            > bin/nutch elasticindex elasticsearch -all
          6. install curl 
            [hadoop@netfox ~]$ sudo apt-get install curl
          7. check that the elasticsearch installation is correct
            [hadoop@netfox ~]$ curl master:9200
          8. run a test query
            [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_search?q=hadoop'
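          9. optionally, check cluster health (a further sanity check; the _cluster/health endpoint is available in 0.19.x):
            [hadoop@netfox ~]$ curl -XGET 'http://master:9200/_cluster/health?pretty=true'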

          posted on 2013-08-31 01:17 paulwong  (categories: Distributed, Hadoop, Cloud Computing, Distributed Search)

          Feedback

          # re: Install hadoop+hbase+nutch+elasticsearch 2013-09-23 14:19 ap

          Nutch 2.2.1 supports HBase 0.90.4 and Elasticsearch 0.19.4 by default. Is there a way to make it work with Elasticsearch 0.90.x or newer? (I tried replacing the elasticsearch-0.19.4 jar under the Nutch 2.2.1 lib directory with an elasticsearch-0.90.x jar, but nutch elasticindex then throws errors.)
          Nutch 1.7 supports Elasticsearch 0.90.1 by default.

          # re: Install hadoop+hbase+nutch+elasticsearch 2013-09-24 18:27 paulwong

          @ap
          I tried versions above 0.90 and they didn't work.
          Nutch 1.7 doesn't integrate with HBase, so I didn't try it.

          # re: Install hadoop+hbase+nutch+elasticsearch 2013-09-25 15:34 ap

          @paulwong
          Indeed, I couldn't find an HBase jar under the Nutch 1.7 lib directory. It would be great if it were integrated. Thanks.

