Well-known distributed file systems include HDFS and GFS, with HDFS being the simpler of the two. This post is a very concise, easy-to-follow comic-style walkthrough of how HDFS works. It is far more approachable than the usual slide decks, and a rare, valuable piece of learning material.

1. Three parts: the client, the nameserver (which can be understood as the master controller and file index, similar to the Linux inode), and the datanode (which stores the actual data).

As far as I know, the client comes in two forms: a program written against the API that Hadoop provides can talk to HDFS, and a machine with Hadoop installed (for example a datanode) can also interact with HDFS from the command line. For example, to upload a file from a datanode: bin/hadoop fs -put example1 user/chunk/
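As a companion to the command-line client above, here is a minimal sketch of the API form: a small Java program that writes and reads a file through the HDFS FileSystem API. The NameNode address and the paths are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml
        conf.set("fs.default.name", "hdfs://master:54310");

        FileSystem fs = FileSystem.get(conf);

        // Write a file: the client asks the NameNode where to put the blocks,
        // then streams the data to the DataNodes
        Path path = new Path("/user/chunk/example1");
        FSDataOutputStream out = fs.create(path);
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close();

        // Read it back
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}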


2. How data is written (the write process)





3. How data is read (the read process)



4. Fault tolerance, part 1: failure types and how they are detected (node failures, network failures, and dirty data)




5. Fault tolerance, part 2: read/write fault tolerance



6. Fault tolerance, part 3: DataNode failure



7. Replication (backup) rules



8. Closing remarks


paulwong 2013-10-26 09:15
Hive Resources

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for data extraction, transformation and loading (ETL), and a mechanism for storing, querying and analysing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users who know SQL query the data. The same language also lets developers who are familiar with MapReduce plug in custom mappers and reducers for complex analysis work that the built-in mappers and reducers cannot handle.

Hive does not impose a dedicated data format. It works well on top of Thrift, controls field delimiters, and also lets users specify their own data format.


Differences between Hive and relational databases:

Data storage: Hive is built on Hadoop's HDFS, while relational databases sit on the local file system.
Compute model: Hive is based on Hadoop's MapReduce, while relational databases use an index-based, in-memory compute model.

Use cases: Hive is an OLAP data-warehouse system for querying massive amounts of data, with poor real-time performance; relational databases are OLTP transactional systems that serve real-time query workloads.

Scalability: built on Hadoop, Hive can easily add storage and compute capacity by adding nodes; relational databases are hard to scale horizontally and have to keep scaling up single machines.
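Since HQL is the SQL-like entry point mentioned above, here is a minimal sketch of issuing it from Java over the old HiveServer1 JDBC driver (matching the hive-0.6.0 setup that appears later in this feed); the host, port and table name are assumptions, not part of the original post.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        // Old HiveServer1 driver class; newer Hive versions use HiveServer2
        // with a different driver and URL.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

        Statement stmt = con.createStatement();
        // HQL looks like SQL, but Hive compiles it into MapReduce jobs over HDFS files.
        ResultSet res = stmt.executeQuery("SELECT customer.id, customer.name FROM customer");
        while (res.next()) {
            System.out.println(res.getInt(1) + "\t" + res.getString(2));
        }
        con.close();
    }
}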


Hive installation and usage guide:
http://blog.fens.me/hadoop-hive-intro/


The "R利剑NoSQL" article series, Hive installment:
http://cos.name/2013/07/r-nosql-hive/










paulwong 2013-09-01 12:41
Distributed Search Resources

Cloud distributed search technology:
http://www.searchtech.pro

Elasticsearch Chinese community:
http://es-bbs.medcl.net/categories/%E6%9C%80%E6%96%B0%E5%8A%A8%E6%80%81

http://wangwei3.iteye.com/blog/1818599

Welcome to the Apache Nutch Wiki:
https://wiki.apache.org/nutch/FrontPage

A round-up of elasticsearch clients:
http://www.searchtech.pro/elasticsearch-clients

Client examples:
http://es-cn.medcl.net/guide/concepts/scaling-lucene/
https://github.com/aglover/elasticsearch_article/blob/master/src/main/java/com/b50/usat/load/MusicReviewSearch.java

paulwong 2013-08-31 15:52

Install hadoop+hbase+nutch+elasticsearch

H... (post body truncated in this feed; read the full article on the original site)

paulwong 2013-08-31 01:17
Implementation for CombineFileInputFormat Hadoop 0.20.205
The underlying principle consists of the following three steps:

1. For each file under the input directory: if its length exceeds mapred.max.split.size, it is divided into multiple splits on block boundaries (one split is the input of one map). Each such split is longer than mapred.max.split.size, and, because splitting happens on block boundaries, also longer than blockSize. If the remaining tail of the file is longer than mapred.min.split.size.per.node, it becomes a split of its own; otherwise it is set aside for now.

2. What is left at this point is a collection of short fragments. The fragments within each rack are merged: whenever the accumulated length exceeds mapred.max.split.size they are combined into one split. At the end, if the remaining fragments add up to more than mapred.min.split.size.per.rack, they are merged into one split; otherwise they are set aside again.

3. Fragments from different racks are then merged: whenever the accumulated length exceeds mapred.max.split.size they become one split, and whatever fragments remain, regardless of length, are merged into one final split.
Example: mapred.max.split.size=1000
mapred.min.split.size.per.node=300
mapred.min.split.size.per.rack=100
The input directory holds five files: three files on rack1 with lengths 2050, 1499 and 10, and two files on rack2 with lengths 1010 and 80. blockSize is 500.
After step 1, five splits are produced: 1000, 1000, 1000, 499, 1000. The remaining fragments are 50 and 10 on rack1, and 10 and 80 on rack2.
Because the fragment totals on both racks do not exceed 100, step 2 changes nothing: the splits and fragments stay the same.
Step 3 merges the four fragments into a single split of length 150.

To reduce the number of maps, increase mapred.max.split.size; to get more maps, decrease it.

Its characteristics: a block is the input of at most one map; a file may consist of several blocks; a file with many blocks may have them grouped into the inputs of different maps; a single map may process several blocks, and may process several files.

Note: CombineFileInputFormat is an abstract class, so you need to write a subclass of it.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

@SuppressWarnings("deprecation")
public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @SuppressWarnings({ "unchecked", "rawtypes" })
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {

        return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter, (Class) myCombineFileRecordReader.class);
    }

    public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader linerecord;

        public myCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index) throws IOException {
            FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
            linerecord = new LineRecordReader(conf, filesplit);
        }

        @Override
        public void close() throws IOException {
            linerecord.close();

        }

        @Override
        public LongWritable createKey() {
            // TODO Auto-generated method stub
            return linerecord.createKey();
        }

        @Override
        public Text createValue() {
            // TODO Auto-generated method stub
            return linerecord.createValue();
        }

        @Override
        public long getPos() throws IOException {
            // TODO Auto-generated method stub
            return linerecord.getPos();
        }

        @Override
        public float getProgress() throws IOException {
            // TODO Auto-generated method stub
            return linerecord.getProgress();
        }

        @Override
        public boolean next(LongWritable key, Text value) throws IOException {

            // TODO Auto-generated method stub
            return linerecord.next(key, value);
        }

    }
}


Set it up at run time like this:

if (argument != null) {
    conf.set("mapred.max.split.size", argument);
} else {
    conf.set("mapred.max.split.size", "134217728"); // 128 MB
}

conf.setInputFormat(CombinedInputFormat.class);




paulwong 2013-08-29 16:08
Big Data Platform Architecture Design Resources

!!! Implementation notes for a Hadoop-based big data platform: overall architecture design
http://blog.csdn.net/jacktan/article/details/9200979

paulwong 2013-08-18 18:27

How to install Hadoop cluster (2 node cluster) and Hbase on Vmware Workstation. It also includes installing Pig and Hive in the appendix

Requires: Ubuntu10.04, hadoop0.20.2, zookeeper 3.3.2 HBase0.90.0
1. Download Ubuntu 10.04 desktop 32 bit from Ubuntu website.

2. Install Ubuntu 10.04 with username: hadoop, password: password,  disk size: 20GB, memory: 2048MB, 1 processor, 2 cores

3. Install build-essential (for GNU C, C++ compiler)    $ sudo apt-get install build-essential

4. Install sun-jave-6-jdk
    (1) Add the Canonical Partner Repository to your apt repositories
    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
     (2) Update the source list
    $ sudo apt-get update
     (3) Install sun-java-6-jdk and make sure Sun’s java is the default jvm
    $ sudo apt-get install sun-java6-jdk
     (4) Set environment variable by modifying ~/.bashrc file, put the following two lines in the end of the file
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export PATH=$PATH:$JAVA_HOME/bin 

5. Configure SSH server so that ssh to localhost doesn’t need a passphrase
    (1) Install openssh server
    $ sudo apt-get install openssh-server
     (2) Generate RSA pair key
    $ ssh-keygen -t rsa -P ""
     (3) Enable SSH access to local machine
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

6. Disable IPv6 by modifying the /etc/sysctl.conf file; put the following lines at the end of the file
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

7. Install hadoop
    (1) Download hadoop-0.20.2.tar.gz(stable release on 1/25/2011)  from Apache hadoop website   
    (2) Extract hadoop archive file to /usr/local/   
    (3) Make symbolic link   
    (4) Modify /usr/local/hadoop/conf/hadoop-env.sh   
Change from
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
     (5)Create /usr/local/hadoop-datastore folder   
$ sudo mkdir /usr/local/hadoop-datastore
$ sudo chown hadoop:hadoop /usr/local/hadoop-datastore
$ sudo chmod 750 /usr/local/hadoop-datastore
     (6)Put the following code in /usr/local/hadoop/conf/core-site.xml   
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
    (7) Put the following code in /usr/local/hadoop/conf/mapred-site.xml   
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
     (8) Put the following code in /usr/local/hadoop/conf/hdfs-site.xml   
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
     (9) Add hadoop to environment variable by modifying ~/.bashrc   
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$PATH

8. Restart Ubuntu Linux

9. Copy this virtual machine to another folder. At least we have 2 copies of Ubuntu linux

10. Modify /etc/hosts on both Linux Virtual Image machines, add in the following lines in the file. The IP address depends on each machine. We can use (ifconfig) to find out IP address.
# /etc/hosts (for master AND slave)
192.168.0.1 master
192.168.0.2 slave
    Also modify the following line, because it might cause HBase to resolve the wrong IP:
192.168.0.1 ubuntu

11. Check hadoop user access on both machines.
The hadoop user on the master (aka hadoop@master) must be able to connect a) to its own user account on the master – i.e. ssh master in this context and not necessarily ssh localhost – and b) to the hadoop user account on the slave (aka hadoop@slave)  via a password-less SSH login. On both machines, make sure each one can connect to master, slave without typing passwords.

12. Cluster configuration
    (1) Modify /usr/local/hadoop/conf/masters (only on the master machine) so that it contains:
    master
     (2) Modify /usr/local/hadoop/conf/slaves (only on the master machine) so that it contains:
    master
    slave
     (3) Change “localhost” to “master” in /usr/local/hadoop/conf/core-site.xml and /usr/local/hadoop/conf/mapred-site.xml
        only on master machine
    (4) Change dfs.replication to “1” in /usr/local/hadoop/conf/hdfs-site.xml
    only on master machine

13. Format the namenode only once and only on master machine
$ /usr/local/hadoop/bin/hadoop namenode -format

14. Later on, start the multi-node cluster by typing following code only on master. So far, please don’t start hadoop yet.
$ /usr/local/hadoop/bin/start-dfs.sh
$ /usr/local/hadoop/bin/start-mapred.sh

15. Install zookeeper only on master node
    (1) download zookeeper-3.3.2.tar.gz from Apache hadoop website   
    (2) Extract zookeeper-3.3.2.tar.gz    $ tar -xzf zookeeper-3.3.2.tar.gz
     (3) Move folder zookeeper-3.3.2 to /home/hadoop/ and create a symbolic link
    $ mv zookeeper-3.3.2 /home/hadoop/ ; ln -s /home/hadoop/zookeeper-3.3.2 /home/hadoop/zookeeper
     (4) Copy conf/zoo_sample.cfg to conf/zoo.cfg
    $ cp conf/zoo_sample.cfg conf/zoo.cfg
     (5) Modify conf/zoo.cfg    dataDir=/home/hadoop/zookeeper/snapshot

16. Install Hbase on both master and slave nodes, configure it as fully-distributed
    (1) Download hbase-0.90.0.tar.gz from Apache hadoop website   
    (2) Extract hbase-0.90.0.tar.gz    $ tar -xzf hbase-0.90.0.tar.gz
     (3) Move folder hbase-0.90.0 to /home/hadoop/ and create a symbolic link    $ mv hbase-0.90.0 /home/hadoop/ ; ln -s /home/hadoop/hbase-0.90.0 /home/hadoop/hbase
     (4) Edit /home/hadoop/hbase/conf/hbase-site.xml and put the following in between <configuration> and </configuration>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:54310/hbase</value>
  <description>The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR</description>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
  <description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed Zookeeper; true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)</description>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master</value>
  <description>Comma separated list of servers in the ZooKeeper Quorum. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which we will start/stop ZooKeeper on.</description>
</property>
     (5) modify environment variables in /home/hadoop/hbase/conf/hbase-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-6-sun/
export HBASE_IDENT_STRING=$HOSTNAME
export HBASE_MANAGES_ZK=false
     (6) Overwrite /home/hadoop/hbase/conf/regionservers on both machines so that it contains:
    master
    slave
     (7) Copy /usr/local/hadoop-0.20.2/hadoop-0.20.2-core.jar to /home/hadoop/hbase/lib/ on both machines.
      This is very important to fix version difference issue. Pay attention to its ownership and mode(755).   

17. Start zookeeper. It seems the zookeeper bundled with Hbase is not set up correctly.
$ /home/hadoop/zookeeper/bin/zkServer.sh start
    (Optional) We can test if zookeeper is running correctly by typing
$ /home/hadoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

18. Start hadoop cluster
$ /usr/local/hadoop/bin/start-dfs.sh
$ /usr/local/hadoop/bin/start-mapred.sh

19. Start Hbase
$ /home/hadoop/hbase/bin/start-hbase.sh

20. Use Hbase shell
$ /home/hadoop/hbase/bin/hbase shell
    To check if hbase is running smoothly, open your browser and go to:
    http://localhost:60010


21. Later on, stop the multi-node cluster by typing following code only on master
    (1) Stop Hbase    $ /home/hadoop/hbase/bin/stop-hbase.sh
     (2) Stop hadoop file system (HDFS)       
$ /usr/local/hadoop/bin/stop-mapred.sh
$ /usr/local/hadoop/bin/stop-dfs.sh
     (3) Stop zookeeper    
$ /home/hadoop/zookeeper/bin/zkServer.sh stop

Reference
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
http://wiki.apache.org/hadoop/Hbase/10Minutes
http://hbase.apache.org/book/quickstart.html
http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/

Author
Tzu-Cheng Chuang


Appendix- Install Pig and Hive
1. Install Pig 0.8.0 on this cluster
    (1) Download pig-0.8.0.tar.gz from Apache pig project website.  Then extract the file and move it to /home/hadoop/   
$ tar -xzf pig-0.8.0.tar.gz ; mv pig-0.8.0 /home/hadoop/
     (2) Make symbolic links under pig-0.8.0/conf/
$ ln -s /usr/local/hadoop/conf/core-site.xml /home/hadoop/pig-0.8.0/conf/core-site.xml
$ ln -s /usr/local/hadoop/conf/mapred-site.xml /home/hadoop/pig-0.8.0/conf/mapred-site.xml
$ ln -s /usr/local/hadoop/conf/hdfs-site.xml /home/hadoop/pig-0.8.0/conf/hdfs-site.xml
     (3) Start pig in map-reduce mode: $ /home/hadoop/pig-0.8.0/bin/pig
     (4) Exit pig from grunt>    quit

2. Install Hive on this cluster
    (1) Download hive-0.6.0.tar.gz from Apache hive project website, and then extract the file and move it to /home/hadoop/    $ tar -xzf hive-0.6.0.tar.gz ; mv hive-0.6.0 ~/
     (2) Modify java heap size in hive-0.6.0/bin/ext/execHiveCmd.sh  Change 4096 to 1024   
    (3) Create /tmp and /user/hive/warehouse and set them chmod g+w in HDFS before a table can be created in Hive
    $ hadoop fs -mkdir /tmp
    $ hadoop fs -mkdir /user/hive/warehouse
    $ hadoop fs -chmod g+w /tmp
    $ hadoop fs -chmod g+w /user/hive/warehouse
     (4) start Hive     $ /home/hadoop/hive-0.6.0/bin/hive

     3. (Optional)Load data by using Hive
    Create a file /home/hadoop/customer.txt with the following contents:
    1, Kevin
    2, David
    3, Brian
    4, Jane
    5, Alice
    After the hive shell is started, type in:
    > CREATE TABLE IF NOT EXISTS customer(id INT, name STRING)
    > ROW FORMAT delimited fields terminated by ','
    > STORED AS TEXTFILE;
    > LOAD DATA INPATH '/home/hadoop/customer.txt' OVERWRITE INTO TABLE customer;
    > SELECT customer.id, customer.name from customer;

http://chuangtc.info/ParallelComputing/SetUpHadoopClusterOnVmwareWorkstation.htm

paulwong 2013-08-17 22:23
HBase GUI Tools

When downloading the 0.6 WAR package, delete jasper-runtime-5.5.23.jar and jasper-compiler-5.5.23.jar from its lib directory, otherwise it will report errors:
http://sourceforge.net/projects/hbaseexplorer/?source=dlp

HBaseXplorer
https://github.com/bit-ware/HBaseXplorer/downloads

HBase Manager
http://sourceforge.net/projects/hbasemanagergui/

paulwong 2013-08-14 09:51
Kettle - a data transformation tool for Hadoop
http://www.cnblogs.com/limengqiang/archive/2013/01/16/KettleApply1.html

paulwong 2013-08-01 17:21
Using Sqoop to Transfer Data between HDFS and MySQL

Introduction
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (for example MySQL, Oracle, Postgres) into Hadoop's HDFS, and it can also export data from HDFS into a relational database.
http://sqoop.apache.org/

Environment
If an IncompatibleClassChangeError appears while debugging, it is almost always a version-compatibility problem.
To make sure the hadoop and sqoop versions are compatible, use Cloudera.
About Cloudera:

Cloudera standardises Hadoop configuration and helps enterprises install, configure and run Hadoop for large-scale data processing and analysis.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDHTarballs/3.25.2013/CDH4-Downloadable-Tarballs/CDH4-Downloadable-Tarballs.html

Download and install hadoop-0.20.2-cdh3u6 and sqoop-1.3.0-cdh3u6.
Installation
Installation is straightforward: just unpack the archives.

The only extra step is to copy the MySQL JDBC driver, mysql-connector-java-5.0.7-bin.jar, into $SQOOP_HOME/lib.
Set the environment variables in /etc/profile:

export SQOOP_HOME=/home/hadoop/sqoop-1.3.0-cdh3u6/

export PATH=$SQOOP_HOME/bin:$PATH

MySQL to HDFS example:
./sqoop import --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shop -m 1 --target-dir /user/recsys/input/shop/$today


HDFS to MySQL example:
./sqoop export --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shopassoc --fields-terminated-by ',' --export-dir /user/recsys/output/shop/$today

Explanation of the example parameters
(I have not used the other parameters, so I will not comment on them; see the command help for details.)


Parameter type | Parameter name        | Meaning
---------------+-----------------------+------------------------------------------------------------
common         | connect               | the JDBC URL
common         | username              | ---
common         | password              | ---
common         | table                 | the table name
import         | target-dir            | the output HDFS directory; defaults to /user/$loginName/
export         | fields-terminated-by  | the field delimiter in the HDFS files; defaults to "\t"
export         | export-dir            | the path of the HDFS files

paulwong 2013-05-11 21:27
A Round-up of 13 Open-Source Java Big-Data Tools

Below are the mainstream open-source tools in the big-data space that support Java:

1. HDFS
HDFS is the main distributed storage system used by Hadoop applications. An HDFS cluster contains one NameNode (the master node), which manages all of the file-system metadata, and the DataNodes (data nodes, of which there can be many), which store the actual data. HDFS is designed for massive data sets: where traditional file systems are optimised for large numbers of small files, HDFS is optimised for storing and accessing batches of very large files.

2. MapReduce
Hadoop MapReduce is a software framework for easily writing parallel applications that process huge (terabyte-scale) amounts of data in a reliable, fault-tolerant way across large clusters of up to tens of thousands of commodity-hardware nodes.

3. HBase
Apache HBase is the Hadoop database, a distributed, scalable big-data store. It provides random, real-time read/write access to large data sets and is optimised for very large tables on clusters of commodity servers: tens of billions of rows and tens of millions of columns. At its core it is an open-source implementation of the Google Bigtable paper, a distributed column-oriented store. Just as Bigtable builds on the distributed storage provided by GFS (Google File System), HBase provides Bigtable-like capabilities on top of Apache Hadoop's HDFS.

4. Cassandra
Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure as a platform for mission-critical data. Its cross-datacenter replication is best in class, giving users lower latency and more reliable disaster recovery. With log-structured updates, strong support for denormalisation and materialised views, and powerful built-in caching, Cassandra's data model offers convenient secondary (column) indexes.

5. Hive
Apache Hive is a data-warehouse system for Hadoop that facilitates data summarisation (mapping structured data files onto database tables), ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible systems. Hive provides a full SQL-style query language, HiveQL; and when expressing a piece of logic in that language becomes inefficient or cumbersome, HiveQL also lets traditional Map/Reduce programmers plug in their own custom mappers and reducers.

6. Pig
Apache Pig is a platform for analysing large data sets. It consists of a high-level language for writing data-analysis applications and the infrastructure for evaluating them. The salient property of Pig programs is that their structure is amenable to substantial parallelisation, which lets them handle very large data sets. Pig's infrastructure layer contains a compiler that produces Map-Reduce jobs; its language layer currently offers a native language, Pig Latin, designed for ease of programming and extensibility.

7. Chukwa
Apache Chukwa is an open-source data-collection system for monitoring large distributed systems. Built on top of HDFS and the Map/Reduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analysing the results, so the collected data can be put to the best possible use.

8. Ambari
Apache Ambari is a web-based tool for provisioning, managing and monitoring Apache Hadoop clusters; it supports Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a cluster-health dashboard, with features such as heatmaps and the ability to view MapReduce, Pig and Hive applications and diagnose their performance characteristics through a friendly user interface.

9. ZooKeeper
Apache ZooKeeper is a reliable coordination system for large distributed systems, offering configuration maintenance, naming, distributed synchronisation, group services and more. ZooKeeper's goal is to encapsulate these complex, error-prone but essential services and hand users a simple, easy-to-use interface on top of an efficient, stable system.

10. Sqoop
Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into Hadoop's HDFS, and export data from HDFS back into a relational database.

11. Oozie
Apache Oozie is a scalable, reliable and extensible workflow scheduling system for managing Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).

12. Mahout
Apache Mahout is a scalable machine-learning and data-mining library. Mahout currently supports four main use cases:

recommendation mining: collect user actions and use them to recommend things the user might like;
clustering: gather documents and group related ones together;
classification: learn from existing categorised documents which features similar documents share, and assign unlabelled documents to the correct category;
frequent itemset mining: group sets of items and identify which individual items often appear together.

13. HCatalog
Apache HCatalog is a table-mapping and storage-management service for data created with Hadoop. It provides: a shared schema and data-type mechanism; a table abstraction, so users do not need to care about how or where their data is stored; and interoperability for data-processing tools such as Pig, MapReduce and Hive.

paulwong 2013-05-03 09:05

An Analysis of a Pig Script Example
PIGGYBANK_PATH=$PIG_HOME/contrib/piggybank/java/piggybank.jar
INPUT=pig/input/test-pig-full.txt
OUTPUT=pig/output/test-pig-output-$(date  +%Y%m%d%H%M%S)
PIGSCRIPT=analyst_status_logs.pig

#analyst_500_404_month.pig
#analyst_500_404_day.pig
#analyst_404_percentage.pig
#analyst_500_percentage.pig
#analyst_unique_path.pig
#analyst_user_logs.pig
#analyst_status_logs.pig


pig -p PIGGYBANK_PATH=$PIGGYBANK_PATH -p INPUT=$INPUT -p OUTPUT=$OUTPUT $PIGSCRIPT


The data source to analyse, a log file:
46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0
46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET /sign-in.jspx HTTP/1.0" 200 3926 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0
69.59.28.19 - - [25/Dec/2012:23:01:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 69.59.28.19 "" 36D80DE7FE52A2D89A8F53A012307B0A 15


The Pig script:
-- register the piggybank jar, because DateExtractor is needed
register '$PIGGYBANK_PATH';

-- declare short names for the UDF
DEFINE DATE_EXTRACT_MM 
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM');

DEFINE DATE_EXTRACT_DD 
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

-- pig/input/test-pig-full.txt
-- load the data from the file named by the $INPUT variable into Pig and declare the column names; each record becomes a tuple of fields
raw_logs = load '$INPUT' USING org.apache.pig.piggybank.storage.MyRegExLoader('^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(\\S+) (\\S+) (HTTP[^"]+)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "(\\S+)" "(\\S+)" (\\S+) "(.*)" (\\S+) (\\S+)')
as (remoteAddr: chararray, 
n2: chararray, 
n3: chararray, 
time: chararray, 
method: chararray,
path:chararray,
protocol:chararray,
status: int, 
bytes_string: chararray, 
referrer: chararray, 
browser: chararray, 
n10:chararray,
remoteLogname: chararray, 
remoteAddr12: chararray, 
path2: chararray, 
sessionid: chararray, 
n15: chararray
);

-- filter the data
filter_logs = FILTER raw_logs BY not (browser matches '.*pingdom.*');
--item_logs = FOREACH raw_logs GENERATE browser;

--percent 500 logs
-- re-project the data set, keeping only two fields: status and month
reitem_percent_500_logs = FOREACH filter_logs GENERATE status,DATE_EXTRACT_MM(time) as month;
-- group the data set; the result maps each month to a bag of tuples, e.g. (a:{(aa,bb,cc),(dd,ee,ff)}, b:{(bb,cc,dd),(ff,gg,hh)})
group_month_percent_500_logs = GROUP reitem_percent_500_logs BY (month);
-- re-project the grouped data and compute per-group statistics; this combines the grouped relation with the original relation
final_month_500_logs = FOREACH group_month_percent_500_logs 
{
    -- COUNT over the original relation; because it runs inside the FOREACH, the condition month == group is applied implicitly,
    -- so this counts, for each group key, how many tuples of the original relation belong to it
    -- (note that the bag inside the group is not used directly here)
    total = COUNT(reitem_percent_500_logs);
    -- FILTER over the original relation; again the month == group condition is applied implicitly,
    -- which yields the tuples with status == 500 for the current month
    t = filter reitem_percent_500_logs by status== 500; --create a bag which contains only T values
    -- emit the group key and the percentage of 500 responses
    generate flatten(group) as col1, 100*(double)COUNT(t)/(double)total;
}
STORE final_month_500_logs into '$OUTPUT' using PigStorage(',');



paulwong 2013-04-13 15:21
Passing Command-Line Values into Pig

http://wiki.apache.org/pig/ParameterSubstitution

%pig -param input=/user/paul/sample.txt -param output=/user/paul/output/

Retrieving the value inside Pig:

records = LOAD $input;

paulwong 2013-04-10 15:32

Calculating Group Percentages in Pig

http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field

http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query

paulwong 2013-04-10 14:13

A Brief Note on Pig
What is Pig?
It is a dataflow design language: you describe how the data should move, and the corresponding engine turns that design into MapReduce jobs that run on Hadoop.
Pig vs. SQL
The two have something in common: you execute one or more statements and get some results back.
The difference is that SQL requires the data to be loaded into tables before it can run, and SQL does not care how the work is done in between: you send a SQL statement and results come out.
Pig does not need the data loaded into tables first, but you do have to design the intermediate process, step by step, all the way to the result.
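To make the "design the dataflow" point concrete, here is a minimal sketch of driving such a dataflow from Java with Pig's embedded API (PigServer; see also the "Embedding Pig In Java Programs" link in the next post). The input path and field layout are made up for illustration.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class EmbeddedPigDemo {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs without a cluster; ExecType.MAPREDUCE turns the
        // same dataflow into MapReduce jobs on Hadoop.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each statement describes one step of the dataflow; nothing runs yet.
        pig.registerQuery("raw = LOAD 'input/urls.txt' AS (url:chararray, count:int);");
        pig.registerQuery("big = FILTER raw BY count > 100;");
        pig.registerQuery("grouped = GROUP big BY url;");
        pig.registerQuery("result = FOREACH grouped GENERATE group, SUM(big.count);");

        // Only now is the plan compiled and executed, and the results pulled back.
        Iterator<Tuple> it = pig.openIterator("result");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}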

paulwong 2013-04-05 21:33
Pig Resources

http://guoyunsky.iteye.com/blog/1317084

http://guoyunsky.iteye.com/category/196632

Hadoop study notes (9): Pig
http://www.distream.org/?p=385


[Hadoop series] Installing Pig and a simple example
http://blog.csdn.net/inkfish/article/details/5205999


Hadoop and Pig for Large-Scale Web Log Analysis
http://www.devx.com/Java/Article/48063


Pig in practice
http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html


[Original] Apache Pig tutorial in Chinese (advanced)
http://www.codelast.com/?p=4249


Analysing an Apache log system with the Pig language on the Hadoop platform
http://goodluck-wgw.iteye.com/blog/1107503


!! The Pig language
http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318


Embedding Pig In Java Programs
http://wiki.apache.org/pig/EmbeddedPig


A Pig example (REGEX_EXTRACT_ALL, DBStorage; the results are stored into a database)
http://www.myexception.cn/database/1256233.html


Programming Pig
http://ofps.oreilly.com/titles/9781449302641/index.html


[Original] A summary of basic Apache Pig concepts and usage (1)
http://www.codelast.com/?p=3621


! The Pig manual
http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions

paulwong 2013-04-05 18:19
How to Add a Node to a Hadoop Cluster

Install hadoop on the new node.

Copy the relevant configuration files from the namenode to the new node.

Modify the masters and slaves files to add the new node.

Set up passwordless ssh access to the node.

Start the datanode and tasktracker on that node individually (hadoop-daemon.sh start datanode/tasktracker).

Run start-balancer.sh to rebalance the data.

Load balancing: when a node fails, or when new nodes are added, the distribution of data blocks can become uneven; the balancer redistributes the blocks across the datanodes.

paulwong 2013-03-16 23:04

Phoenix: HBase Finally Has a SQL Interface
For details see: https://github.com/forcedotcom/phoenix

It supports select, from, where, group by, having, order by and table creation; secondary indexes, joins and dynamic columns are planned for the future.
It is built directly on the native HBase API; response times are milliseconds for data at the 10M scale and seconds at the 100M scale.
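As a rough illustration only: Phoenix is used through a JDBC driver, so a query looks like ordinary java.sql code. The host, table and data below are made up, and the exact SQL coverage depends on the Phoenix version.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixQueryDemo {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the ZooKeeper quorum of the HBase
        // cluster; "master" is an assumed host name.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:master");

        // Hypothetical table and row, just to show the SQL surface.
        conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS web_stat (host VARCHAR NOT NULL PRIMARY KEY, active_visitor INTEGER)");
        conn.createStatement().executeUpdate("UPSERT INTO web_stat VALUES ('srv1', 42)");
        conn.commit();

        // select / where / order by, as listed above
        PreparedStatement ps = conn.prepareStatement(
                "SELECT host, active_visitor FROM web_stat WHERE active_visitor > 10 ORDER BY active_visitor DESC");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("host") + " -> " + rs.getInt("active_visitor"));
        }
        conn.close();
    }
}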



paulwong 2013-02-19 23:15
HBase Study Notes - Basic Features
  • Using the HBase shell commands

  • Using the HBase Java client

    Use PUT to add and to modify records.

    How a PUT executes:
    The write first goes into an in-memory MemStore. If the table has N column families there are N MemStores, and the values of a record are stored in the MemStore of their respective column family. Values in the MemStore are not flushed to disk right away; only when the MemStore is full is it flushed, and the flush does not write into an existing HFile but creates a new one. A write-ahead log is also written: since new records are not written to HFiles immediately, if the machine goes down in between, HBase can recover the data from this log when it restarts.

    Use DELETE to remove records.
    A delete does not remove the content from the HFiles; it only writes a marker, so that queries can skip those records.

    Use GET to read a single record.
    Reads put the record into a cache; again, if the table has N column families there are N caches, and the values of a record go into the cache of their respective column family. The next time a client asks for the record, HBase combines the cache and the MemStore to return the data.

    Use HBaseAdmin to create tables.

    Use SCAN together with FILTERs to read multiple records. (A short client sketch follows below.)
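Here is a minimal sketch of the client calls described above, written against the 0.90-era HBase Java API used elsewhere in this blog. The table and column family names are assumptions, and the table is assumed to exist already.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "demo_table");    // assumed, pre-created table

        // PUT: goes to the MemStore (and the write-ahead log) first
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
        table.put(put);

        // GET: single-row read, served from cache/MemStore/HFiles
        Result r = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

        // SCAN: multi-row read
        ResultScanner scanner = table.getScanner(new Scan());
        for (Result row : scanner) {
            System.out.println(Bytes.toString(row.getRow()));
        }
        scanner.close();

        // DELETE: only writes a tombstone marker
        table.delete(new Delete(Bytes.toBytes("row1")));
        table.close();
    }
}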
  • Distributed computation with HBase

    Why distributed computation at all?
    The APIs above target online applications, i.e. low-latency access, roughly OLTP. For large amounts of data they no longer fit. To analyse a whole table you would use SCAN, which pulls the entire table back to the local machine; with around 100 GB of data that takes hours. To save time you can introduce multiple threads, which requires a new scheme: split the table into N segments, let one thread process each segment, then merge the partial results and analyse them.
    At around 200 GB or more the time doubles again and multi-threading no longer suffices, so you move to multiple processes, i.e. the computation is spread across different physical machines. Now you also have to handle what happens when any of those machines goes down, and so on. Hadoop's MapReduce is exactly such a distributed-computation framework: the application developer only has to supply the scatter (map) and aggregate (reduce) logic and need not worry about the rest.

    HBase and MapReduce
    Use TableMap and TableReduce; a sketch follows below.
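A minimal sketch of wiring an HBase table into a MapReduce job, assuming the org.apache.hadoop.hbase.mapreduce flavour of the API (TableMapper plus TableMapReduceUtil); the table name is made up and the job simply counts rows through a job counter.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseRowCountJob {

    // Map step: each call sees one row of the table
    static class CountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
                throws IOException, InterruptedException {
            context.getCounter("demo", "ROWS").increment(1); // count rows via a job counter
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(HBaseRowCountJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger scanner cache for batch scans
        scan.setCacheBlocks(false);  // don't fill the region servers' block cache

        // Wire the table "demo_table" (assumed name) as job input and CountMapper as the map step
        TableMapReduceUtil.initTableMapperJob("demo_table", scan, CountMapper.class,
                NullWritable.class, NullWritable.class, job);

        job.setNumReduceTasks(0);                          // no reduce step needed for a plain count
        job.setOutputFormatClass(NullOutputFormat.class);  // the job produces no file output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}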
    HBase's deployment architecture and its components
    It sits on top of Hadoop and ZooKeeper.

    The flow of querying and saving records in HBase
    See the previous post for details.

    Using HBase as a data source, a data sink, and a shared data store
    This corresponds to the join algorithms of a database: reduce-side join and map-side join.


paulwong 2013-02-06 09:53
Monitoring HBase

Hadoop/HBase is the open-source counterpart of Google's Bigtable, GFS and MapReduce. As the internet grows, big-data processing matters more and more and Hadoop/HBase finds ever wider use. To use a Hadoop/HBase system well you need a solid monitoring setup, so you can see the system's real-time state and keep everything under control. Hadoop/HBase ships with a very complete metrics framework, with statistics for system metrics at many granularities; the framework is also nicely designed, so users can add custom metrics easily. Even more important is how the metrics are exposed: three ways are currently supported, writing to local files, reporting to a Ganglia system, and exposing them over JMX. This post covers how to report Hadoop/HBase metrics to Ganglia and view them in a browser.
    Before going further it is worth briefly introducing Ganglia. Ganglia is an open-source system-monitoring system made up of three parts, gmond, gmetad and the webfrontend, which divide the work as follows:

    gmond: a daemon that runs on every node to be monitored; it collects monitoring statistics and sends and receives statistics over a shared multicast or unicast channel.
    gmetad: a daemon that periodically polls the gmond processes, pulls their data, and stores the metrics in the RRD storage engine.
    webfrontend: installed on the machine where gmetad runs so it can read the RRD files; it provides the web front end.

    In short: gmond collects the metrics on each node, gmetad aggregates what the gmonds collected, and the webfrontend displays what gmetad aggregated. Out of the box Ganglia monitors system metrics such as cpu/memory/network, but Hadoop/HBase has built-in support for Ganglia, so a small configuration change is enough to feed Hadoop/HBase metrics into Ganglia as well.
    接下来介l如何把Hadoop/Hbase接入到GangliapȝQ这里的Hadoop/Hbase的版本号?.94.2Q早期的版本可能?x)有一些不同,h意区别。Hbase本来是Hadoop下面的子目Q因此所用的metrics framework原本是同一套Hadoop metricsQ但后面hadoop有了改进版本的metrics framework:metrics2(metrics version 2), Hadoop下面的项目都已经开始用metrics2, 而Hbase成了Apache的顶U子目Q和Hadoop成ؓ(f)q的项目后Q目前还没跟qmetrics2Q它用的q是原始的metrics.因此q里需要把Hadoop和Hbase的metrics分开介绍?br />
    Hooking Hadoop into Ganglia:

    1. The configuration file for Hadoop metrics2 is hadoop-metrics2.properties.
    2. Hadoop metrics2 introduces the notions of source and sink: a source collects data, and a sink consumes what the sources collect (writing to files, reporting to Ganglia, JMX, and so on).
    3. Configure metrics2 to report to Ganglia:
    #*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30
    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
     
    *.sink.ganglia.period=10
    *.sink.ganglia.supportsparse=true
    *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
    *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
     
    #uncomment as your needs
    namenode.sink.ganglia.servers=10.235.6.156:8649
    #datanode.sink.ganglia.servers=10.235.6.156:8649
    #jobtracker.sink.ganglia.servers=10.0.3.99:8649
    #tasktracker.sink.ganglia.servers=10.0.3.99:8649
    #maptask.sink.ganglia.servers=10.0.3.99:8649
    #reducetask.sink.ganglia.servers=10.0.3.99:8649


    A few things to note here:

    (1) Ganglia 3.1 is not compatible with 3.0, so pick GangliaSink30 or GangliaSink31 according to your Ganglia version.
    (2) period sets the reporting interval, in seconds.
    (3) namenode.sink.ganglia.servers specifies the host:port of the Ganglia gmetad that the data is reported to.
    (4) If several Hadoop processes (namenode/datanode, etc.) run on the same physical machine, just configure sink.ganglia.servers for each of those processes as needed.

    Hooking HBase into Ganglia:

    1. The configuration file for the hadoop metrics used by HBase is hadoop-metrics.properties.
    2. The core concept in hadoop metrics is the Context: TimeStampingFileContext writes to files, and GangliaContext/GangliaContext31 reports to Ganglia.
    3. Configure hadoop metrics to report to Ganglia:
    # Configuration of the "hbase" context for ganglia
    # Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
    # hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    hbase.period=10
    hbase.servers=10.235.6.156:8649

Points to note:

(1) Because Ganglia 3.1 is incompatible with 3.0, use GangliaContext for versions before 3.1 and GangliaContext31 for Ganglia 3.1.
(2) period is in seconds and sets how often data is reported to Ganglia.
(3) servers is the host:port of the Ganglia gmetad that the data is reported to.
(4) The rpc and jvm metric contexts can be configured in the same way.
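Once both sides are configured, a quick sanity check (not from the original post) is to read the XML that gmond serves on its default TCP port 8649 and look for Hadoop/HBase metric names. This is only a rough sketch; the host name is an assumption (use whichever machine runs gmond), and the port may have been changed in gmond.conf.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.Socket;

// Connects to gmond's XML dump port (8649 by default) and prints the metric lines.
// If the sinks/contexts above are working, dfs.*, jvm.*, rpc.* and hbase.* metrics
// should appear in the output.
public class GmondDump {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";   // assumed gmond host
        Socket socket = new Socket(host, 8649);
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains("<METRIC NAME=")) {
                System.out.println(line.trim());
            }
        }
        in.close();
        socket.close();
    }
}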







HBase Deployment Essentials (paulwong, 2013-02-04)
http://www.aygfsteel.com/paulwong/archive/2013/02/04/395101.html

Region servers and TaskTracker servers should not run on the same machines. Ideally, if MapReduce jobs are run at all, use two separate clusters (two different groups of servers), so that the offline MapReduce load does not interfere with online loads such as scans.

If the cluster is mainly used for MapReduce jobs, it is fine to put region servers and TaskTrackers together.


Basic cluster layout

Ten nodes or fewer, no MapReduce jobs, used mainly for low-latency access. Per-node spec: 4-6 CPU cores, 24-32 GB of RAM and a few SATA disks. Hadoop NameNode, JobTracker, HBase Master and ZooKeeper all run on the same node.


Small cluster (10-20 servers)

Put the HBase Master on a machine of its own, so a lower-spec box can be used. ZooKeeper also gets its own machine; NameNode and JobTracker share one machine.

Medium cluster (20-50 servers)

Since saving money matters less at this size, HBase Master and ZooKeeper can share machines, with three instances each of ZooKeeper and HBase Master. NameNode and JobTracker share one machine.

Large cluster (>50 servers)

Same as the medium layout, but with five instances each of ZooKeeper and HBase Master. The NameNode and Secondary NameNode need plenty of memory.

Hadoop master nodes

NameNode and Secondary NameNode server spec: (small cluster) 8-core CPU, 16 GB RAM, 1 Gb NIC and SATA disks; add another 16 GB of RAM for a medium cluster and another 32 GB for a large one.

HBase Master nodes

Server spec: 4-core CPU, 8-16 GB RAM, 1 Gb NIC and 2 SATA disks, one for the operating system and the other for the HBase Master logs.

Hadoop DataNode and HBase RegionServer nodes

DataNode and RegionServer should run on the same server, and should not share it with a TaskTracker. Server spec: 8-12 core CPU, 24-32 GB RAM, 1 Gb NIC and 12 x 1 TB SATA disks (one for the operating system, the rest for data).

ZooKeeper nodes

Server spec is similar to the HBase Master's; ZooKeeper can even share a machine with the HBase Master, but then add one extra disk dedicated to ZooKeeper.

Installing the nodes

JVM settings:
-Xmx8g: set the maximum heap to 8 GB; do not push it up toward 15 GB.
-Xms8g: set the minimum heap to 8 GB as well.
-Xmn128m: set the young generation to 128 MB; the default is too small.
-XX:+UseParNewGC: use the stop-the-world parallel collector for the young generation. It does pause the Java process, but because the young generation is small the pauses usually last only a few milliseconds, which is acceptable.
-XX:+UseConcMarkSweepGC: use CMS for the old generation. A stop-the-world collector there would pause the Java process for too long; CMS instead collects concurrently while the process keeps running.
-XX:CMSInitiatingOccupancyFraction: controls how early the CMS collector starts a cycle (the old-generation occupancy that triggers it).
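As a small side check (my addition, not from the post): after setting these options on a region server you can confirm that the intended collectors are actually in use by asking the JVM itself; with the flags above you should see ParNew and ConcurrentMarkSweep listed.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the garbage collectors of the current JVM together with their counters.
// Run it with the same JVM options as the region server to verify the GC setup.
public class GcCheck {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + "  collections=" + gc.getCollectionCount()
                    + "  totalTimeMs=" + gc.getCollectionTime());
        }
    }
}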






Several Join Techniques in Hadoop (paulwong, 2013-01-31)
http://www.aygfsteel.com/paulwong/archive/2013/01/31/395000.html

2) Compress fields and pre-process the data, filtering out the fields that are not needed.
3) The last step is filtering in the Mapper stage; this is where the Bloom filter earns its keep, and it is the part that needs detailed explanation.


Let's use a scenario everyone knows to illustrate the problem: work out last month's tariff usage for M-Zone (动感地带) customers, covering both incoming and outgoing calls.

(This is just an example I dreamed up; given a real database layout there are surely better solutions for this scenario, so please don't take it too literally.)

Both data sets involved are fairly large: last month's call records, and the list of M-Zone phone numbers.


There are two fairly direct ways to handle this:

1) Filter by M-Zone number in the Reduce stage.

Pros: the amount of data that has to be dealt with there is comparatively small, and this is the commonly used approach.

Cons: a lot of records are painstakingly gathered in the Mapper stage and shuffled across the network to the Reduce nodes, only to be thrown away by the filter at that point.


2) Filter the data by M-Zone number in the Mapper stage.

Pros: plenty of non-M-Zone records (Easyown/神州行, GoTone/全球通 and the like) are dropped early, which saves a great deal of network bandwidth.

Cons: the M-Zone number list is no small thing, and this approach means copying that big lump to every Mapper node, perhaps via the DistributedCache. (The Bloom filter is exactly what solves this problem.)


The Bloom filter exists to fix the drawback of approach 2 above.

The drawback of approach 2 is that a large amount of data has to be replicated to many nodes. A Bloom filter uses several hash functions to compress the number list into a bitmap, trading a bounded error rate for space, much like the familiar time-versus-space trade-off. For details see:

    http://blog.csdn.net/jiaomeng/article/details/1495500

The algorithm does have a flaw: it will treat some Easyown/GoTone numbers as M-Zone ones (false positives). In this scenario that is no problem at all, because the filter only prunes numbers; whatever slips through is weeded out by the exact match in the Reduce stage.

With this improvement the approach essentially sidesteps the drawbacks of method 2 entirely:

1) The full M-Zone number list no longer has to be shipped to every Mapper node.
2) Most non-M-Zone numbers are already filtered out in the Mapper stage (though not 100% of them), avoiding the bandwidth cost of shuffling them across the network. A sketch of such a map-side pre-filter follows below.
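A minimal sketch of what such a map-side pre-filter could look like with Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter. The record layout, the field position of the phone number and the assumption that a pre-built, serialized filter has been shipped through the DistributedCache are all made up for illustration.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Map-side pre-filter: a BloomFilter built from the M-Zone number list is shipped via
// the DistributedCache; call records whose caller is definitely not in the list are
// dropped before the shuffle. False positives are removed later by the reduce-side
// exact join. File layout and field positions here are assumptions.
public class CdrFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        DataInputStream in = new DataInputStream(new FileInputStream(cached[0].toString()));
        filter.readFields(in);   // the filter was serialized with BloomFilter.write()
        in.close();
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        String[] fields = record.toString().split(",");
        String caller = fields[0];                       // assumed: first field is the phone number
        if (filter.membershipTest(new Key(caller.getBytes("UTF-8")))) {
            context.write(new Text(caller), record);     // survivors go on to the reduce-side exact join
        }
    }
}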


What still needs studying: the size of the bitmap, the number of hash functions, and the number of items stored, and how to choose these three quantities so as to strike a balance between storage space and error rate. The standard sizing formulas are listed below.
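For reference, the usual Bloom filter relations between the number of stored items n, the bitmap size m (in bits), the number of hash functions k and the false-positive rate p are:

\[ p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad m = -\frac{n \ln p}{(\ln 2)^{2}}, \qquad k = \frac{m}{n}\,\ln 2 \]

As a rule of thumb, about 10 bits per item with k around 7 gives a false-positive rate of roughly 1%.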

Configuring the Secondary NameNode (paulwong, 2013-01-31)
http://www.aygfsteel.com/paulwong/archive/2013/01/31/394998.html

The Secondary NameNode requests the two files, the FsImage and the EditLog, from the NameNode over HTTP. When the NameNode receives the request it rolls over to a fresh EditLog for subsequent edits; the Secondary NameNode then merges the two files it fetched into a new FsImage and sends it back to the NameNode, which takes the new FsImage as authoritative and archives the old files.

The Secondary NameNode has one more use: if the NameNode goes down, the Secondary NameNode's IP can be changed to the one the NameNode was using, so that it can stand in for the NameNode.
The secondary namenode's configuration is easily overlooked: as long as jps looks normal nobody pays much attention, until the namenode runs into trouble and people remember there is a secondary namenode at all. Configuring it takes two steps:
1. Add the secondarynamenode machine to the cluster configuration file conf/masters.
2. Modify/add the following property in hdfs-site.xml:

    <property>
     <name>dfs.http.address</name>
     <value>{your_namenode_ip}:50070</value>
     <description>
     The address and the base port where the dfs namenode web ui will listen on.
     If the port is 0 then the server will start on a free port.
     </description>
     </property>


Once these two items are in place, start the cluster. Then go to the secondary namenode machine and check whether the fs.checkpoint.dir directory (set in core-site.xml, default ${hadoop.tmp.dir}/dfs/namesecondary) is in sync with the namenode.
If the second item is not configured, the secondary namenode's checkpoint folder stays empty forever, and its log shows the following error:


    2011-06-09 11:06:41,430 INFO org.apache.hadoop.hdfs.server.common.Storage: Recovering storage directory /tmp/hadoop-hadoop/dfs/namesecondary from failed checkpoint.
    2011-06-09 11:06:41,433 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint: 
    2011-06-09 11:06:41,434 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:211)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at java.net.Socket.connect(Socket.java:478)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1172)
    at org.apache.hadoop.hdfs.server.namenode.TransferFsImage.getFileClient(TransferFsImage.java:151)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.downloadCheckpointFiles(SecondaryNameNode.java:256)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:313)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
    at java.lang.Thread.run(Thread.java:662)


Related core-site.xml properties you may need:
    <property>
    <name>fs.checkpoint.period</name>
    <value>300</value>
    <description>The number of seconds between two periodic checkpoints.
    </description>
    </property>

    <property>
     <name>fs.checkpoint.dir</name>
     <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
     <description>Determines where on the local filesystem the DFS secondary
     name node should store the temporary images to merge.
     If this is a comma-delimited list of directories then the image is
     replicated in all of the directories for redundancy.
     </description>
    </property>


Configuring Hadoop M/R to Use the Fair Scheduler Instead of FIFO (paulwong, 2013-01-31)
http://www.aygfsteel.com/paulwong/archive/2013/01/31/394997.html
    hadoop-0.20.2-cdh3u0

    hbase-0.90.1-cdh3u0

    zookeeper-3.3.3-cdh3u0

FairScheduler support is already built in by default.

You only need to change the configuration to enable FairScheduler instead of the default JobQueueTaskScheduler.

Configure fair-scheduler.xml (in $HADOOP_HOME/conf/):

<?xml version="1.0"?>
<!-- Note: the mapred.fairscheduler.allocation.file property that points at this file
     ($HADOOP_HOME/conf/fair-scheduler.xml) belongs in mapred-site.xml, as shown below. -->
<allocations>
    <pool name="qiji-task-pool">
        <minMaps>5</minMaps>
        <minReduces>5</minReduces>
        <maxRunningJobs>5</maxRunningJobs>
        <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
        <weight>1.0</weight>
    </pool>
    <user name="ecap">
        <maxRunningJobs>6</maxRunningJobs>
    </user>
    <poolMaxJobsDefault>10</poolMaxJobsDefault>
    <userMaxJobsDefault>8</userMaxJobsDefault>
    <defaultMinSharePreemptionTimeout>600</defaultMinSharePreemptionTimeout>
    <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
</allocations>



Then add the following at the end of $HADOOP_HOME/conf/mapred-site.xml:

    <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
        <name>mapred.fairscheduler.allocation.file</name>
        <value>/opt/hadoop/conf/fair-scheduler.xml</value>
    </property>
    <property>
        <name>mapred.fairscheduler.assignmultiple</name>
        <value>true</value>
    </property>
    <property>
        <name>mapred.fairscheduler.sizebasedweight</name>
        <value>true</value>
    </property>



Then restart the cluster. Now several jobs (up to the 5 configured above) run in parallel once submitted, instead of one job occupying all the map/reduce slots while the others sit pending.

The parallel execution can be watched at http://<masterip>:50030/scheduler. A small driver that submits two jobs back to back, so that both show up there at the same time, is sketched below.
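This driver is not from the original post; it is a rough sketch that submits two trivial identity-mapper jobs without waiting for the first to finish, which is enough to see both listed as running on the scheduler page. The input path and the two output paths are taken from the command line and are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submits two independent jobs with Job.submit() (non-blocking), then polls until both finish.
public class ParallelJobsDemo {

    private static Job makeJob(Configuration conf, String name, String in, String out) throws Exception {
        Job job = new Job(conf, name);
        job.setJarByClass(ParallelJobsDemo.class);
        job.setMapperClass(Mapper.class);          // identity mapper is enough for the demo
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(in));
        FileOutputFormat.setOutputPath(job, new Path(out));
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job a = makeJob(conf, "demo-a", args[0], args[1]);
        Job b = makeJob(conf, "demo-b", args[0], args[2]);
        a.submit();   // returns immediately, the job runs asynchronously
        b.submit();   // the fair scheduler lets this one run alongside the first
        while (!a.isComplete() || !b.isComplete()) {
            Thread.sleep(5000);
        }
        System.out.println("a ok=" + a.isSuccessful() + ", b ok=" + b.isSuccessful());
    }
}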

Several Approaches to Large-Scale Duplicate Detection, and Where Bloom Filters Fit (paulwong, 2013-01-31)
http://www.aygfsteel.com/paulwong/archive/2013/01/31/394980.html

Some rather interesting problems.

1. You are given two files A and B, each holding 5 billion URLs; every URL takes 64 bytes and memory is limited to 4 GB. Find the URLs common to A and B.
Solution 1: hash the URLs into memory-sized chunk files, then intersect the corresponding chunks in memory.
Solution 2: a Bloom filter (widely used for URL filtering and de-duplication). See http://en.wikipedia.org/wiki/Bloom_filter and http://blog.csdn.net/jiaomeng/archive/2007/01/28/1496329.aspx.

2. You have 10 files of 1 GB each; every line of every file is a user query, and queries may repeat across files. Sort the queries by frequency.
Solution 1: the algorithm depends on how sparse the data is; the general method is to re-hash the lines into files so that identical queries always land in the same file, counting as you go, then merge and use a min-heap to pick out the most frequent ones.
Solution 2: similar to problem 1, but with a CBF (Counting Bloom Filter), which differs slightly from the plain Bloom filter, or the more advanced SBF (Spectral Bloom Filter); see http://blog.csdn.net/jiaomeng/archive/2007/03/19/1534238.aspx.
Solution 3: MapReduce; a few minutes on a Hadoop cluster will do it. See http://en.wikipedia.org/wiki/MapReduce.

3. There is a 1 GB file with one word per line, each word at most 16 bytes, and a 1 MB memory limit. Return the 100 most frequent words.
Solution 1: like problem 2, except no sorting is needed: take the top 100 of each chunk, then find the overall top 100 among those.


Cassandra VS. HBase (full repost) (paulwong, 2013-01-30)
http://www.aygfsteel.com/paulwong/archive/2013/01/30/394902.html

Only part of the article is excerpted below; for the full text see

    http://blog.sina.com.cn/s/blog_633f4ab20100r9nm.html

Background

"It was the best of times, it was the worst of times."

People in every era describe their own times this way. Under one IT wave after another, some find the present dull and tedious, while others push ahead and find their breakthrough. Data storage has been either a fascinating or a boring topic ever since computers existed. In the 1970s relational database theory produced legend after legend and lifted the informatization of the whole world to a new level. Since the turn of the millennium, with the rise of applications such as social networks, traditional SQL databases have become less and less suited to processing massive amounts of data, and the call for NoSQL databases has grown louder and louder over the past few years.

Among NoSQL databases the loudest buzz surrounds HBase and Cassandra. Strictly speaking the two serve somewhat different purposes and have different emphases, but as the leading open-source NoSQL databases of the moment they are constantly compared with each other.

Last October Facebook launched its new Messages system and announced that HBase would be its back-end store. That caused quite a stir, because Cassandra was developed by Facebook itself and open-sourced in 2008. Many people wondered aloud: has Facebook abandoned Cassandra? Has HBase won a decisive victory in this NoSQL contest? This article looks mainly at the technical similarities and differences between HBase and Cassandra; it does not intend to draw any conclusion, only to share some results of my own investigation.

     

The contenders

    HBase

HBase is an open-source distributed storage system. It can be seen as an open-source implementation of Google's Bigtable. Just as Bigtable runs on the Google File System, HBase is built on top of Hadoop HDFS, which is similar to GFS.

    Cassandra

Cassandra can be seen as an open-source implementation of Amazon's Dynamo. Where it differs from Dynamo is that Cassandra adopts the ColumnFamily data model of Google's Bigtable. Roughly speaking, Cassandra is a peer-to-peer, highly reliable distributed store with a rich data model.

Criteria for distributed systems

In 2000, UC Berkeley professor Eric Brewer proposed the conjecture now known as the CAP theorem: a distributed computer system cannot satisfy all three of the following properties at once:

Consistency: all nodes see the same state at the same moment.
Availability: the failure of one node does not stop the system from operating normally.
Partition tolerance: the system keeps running even when network failures split it into smaller sub-systems.

Brewer conjectured that any single system can satisfy at most two of these at a time.

In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal paper proving the CAP theorem.

     

Both HBase and Cassandra are distributed systems, but their designs emphasize different corners of CAP: HBase follows Bigtable and leans toward CA, while Cassandra follows Dynamo and leans toward AP.

[...]

Feature comparison

Since the data models of HBase and Cassandra are quite close, they are not compared again here; the comparison below focuses on data consistency and multi-replica behaviour.

    HBase

HBase guarantees write consistency. When a piece of data is required to be replicated N times, the client only gets a successful return after all N copies have actually been written to N servers; if anything fails during replication, the whole write fails and no client connected to any server can see the data. HBase provides row locks, but no multi-row locks or transactions. HBase sits on HDFS, so multi-copy replication and durability are provided by HDFS. HBase integrates naturally with MapReduce.

    Cassandra

For writes there are several modes to choose from. When data must be replicated N times, the call can return immediately, after one replica has been written successfully, after all N replicas have been written, or after a configurable quorum of replicas has been written (quorum is explained in more detail below). Replication does not fail: eventually the data is written to all nodes, but in the window before it is fully written, clients connected to different servers may read different values. Within the cluster all servers are equal and there is no single point of failure; nodes talk to each other via the Gossip protocol. Writes are ordered by timestamp and there are no row locks. Recent versions of Cassandra also integrate with MapReduce.

Compared with configuring Cassandra, configuring HBase is an arduous, complicated job full of traps. Facebook's explanation of why it chose HBase includes a line to the effect that Facebook has long followed HBase development and has a dedicated, experienced team responsible for installing and maintaining HBase. One can imagine there was a fierce internal fight at Facebook over HBase versus Cassandra, and that the more numerous HBase team won. For a big company, keeping a fairly large DBA-like team to look after HBase is no great expense; for a small company it is a burden that simply cannot be afforded.

HBase also has a serious weakness on the high-availability side: it depends on HDFS. HDFS is a clone of the Google File System, and its NameNode is a single point of failure; so far HDFS has not gained NameNode self-recovery. I do believe Facebook has internal means of recovering the NameNode; they just have not been open-sourced.

Cassandra's P2P, decentralized design, by contrast, leaves no room for a single point of failure. From a design standpoint, Cassandra is more reliable than HBase.

As for data consistency, Cassandra can in fact obtain the same consistency as HBase at the cost of response time, and with a suitable quorum setting you get a very good compromise between response time and consistency. A short note on the quorum arithmetic follows.
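The quorum arithmetic behind that claim (standard Dynamo-style reasoning, not specific to this article): with a replication factor of N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one up-to-date replica whenever

\[ R + W > N \]

so with N = 3, using QUORUM for both reads and writes (R = W = 2) gives HBase-like read-your-writes behaviour while still tolerating one slow or failed replica; R = W = 1 trades that guarantee for lower latency.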

Cassandra's strengths and weaknesses

Its strengths show up mainly as: simple configuration, with no need to coordinate multiple modules; great flexibility, since the trade-off between consistency and performance can be tuned per application; and stronger reliability, with no single point of failure.

Even so, does Cassandra have no weak spots? Of course it does, and one of them is fatal.

That weak spot is storing large files. Admittedly Cassandra was never designed to store large files, but since Amazon's S3 is in fact built on Dynamo, people keep getting tempted to make Cassandra store huge files too. HBase, unlike Cassandra, sits on HDFS, which was designed precisely for very large files, maximum throughput and the most reliable access. On this point, because Cassandra has no HDFS-like big-file system behind it, it is currently powerless for truly huge files (hundreds of terabytes or even petabytes); and even if the client splits such files by hand, that is an unwise, client-CPU-hungry exercise.

So if we want to build a Google-style search engine, HDFS is, at the very least, indispensable. The HDFS NameNode is still a single point of failure, but suitable hacks can make the NameNode considerably tougher. HBase, built on HDFS, is accordingly better suited to be the inverted-index database behind a search engine. In practice, combining Lucene with HBase works far more smoothly and efficiently than Lucandra, the project that combines Lucene with Cassandra (Lucandra requires Cassandra to use the OrderPreservingPartitioner, which can skew the key distribution, defeat load balancing and produce hot-spot machines).

     

So my conclusion: in this age of diverse requirements nothing wins it all, and I have never believed that engineering offers once-and-for-all, never-changing solutions. If all you need is to store ever-growing volumes of message data, pictures and short videos, with no data loss, as little manual maintenance as possible and the ability to grow storage quickly just by adding machines, then Cassandra clearly has the upper hand today.

But if you want to build a very large-scale search engine that produces very large inverted-index files (logical files, of course; the physical files are actually split across different nodes), then HDFS + HBase is currently your first choice.

Let me end with this seemingly always-correct conclusion: render unto God what is God's, and unto Caesar what is Caesar's. Everyone has their own turf, and even wild lilies get their spring!



A NoSQL Journey: HBase (paulwong, 2013-01-29)
http://www.aygfsteel.com/paulwong/archive/2013/01/29/394901.html
Source: http://www.jdon.com/38244

Recently, for project reasons, I looked into several NoSQL databases including Cassandra and HBase, and in the end decided to go with HBase. Here I'd like to share my own understanding of HBase.
Before talking about HBase, let me ramble a little. Anyone building Internet applications knows the drill: you cannot predict when your system will be hit, or by how many people. Today your users are few, tomorrow there are many more, and suddenly the system cannot cope and gives up, which is a sorry state for all of us, or, to put it fashionably, a "tragedy".
To put it bluntly, that happens when you haven't figured out in advance what really matters for an Internet application. Architecturally, Internet applications care most about performance and scalability, while traditional enterprise applications care more about data integrity and data security. So let's talk about scalability. I have written a few posts on this before, which you can look up; I will skip web-server and app-server scalability here because that part is relatively easy, and instead review how a gradually growing Internet application deals with scaling the database layer.
At the very beginning there are few users and little load, so a single database server does the job; everything, web server, app server and db server, is stuffed into one box. As users and load grow you separate the web server, app server and db server, which keeps things going for a while. With still more users the database gets slow and sometimes falls over, so you give it some helpers: Master-Slave replication, where one Master takes the writes and several Slaves serve the reads. With reads and writes separated the Master stops complaining and the pressure eases; this is essentially horizontal scaling of reads past the query CPU bottleneck. That works for a while, but as users keep growing the Master's write load becomes too heavy again. What then? You have to split; as the saying goes, "only splitting brings scalability". So you split databases, the so-called "vertical partitioning": unrelated data is moved into different databases deployed separately, which takes away part of the read and write pressure and gives the Master some relief. Then, as data keeps piling up, individual tables become huge and queries crawl, and you need "horizontal partitioning", for example splitting the User table into ranges of a few hundred thousand rows each so that no single table grows out of hand. A toy sketch of this kind of routing follows.
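A toy illustration (my own, not from the article) of the horizontal-partitioning idea: route each user row to a shard table by id range. The 1,000,000-rows-per-shard figure and the table naming scheme are arbitrary choices for illustration only.

// Routes a user id to a shard table name by range, e.g. user_0, user_1, ...
// Shard size and naming are illustrative only.
public class UserShardRouter {
    private static final long ROWS_PER_SHARD = 1000000L;

    public static String tableFor(long userId) {
        return "user_" + (userId / ROWS_PER_SHARD);
    }

    public static void main(String[] args) {
        System.out.println(tableFor(42L));       // user_0
        System.out.println(tableFor(2500000L));  // user_2
    }
}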
In short, a popular web site typically suffers through the journey from a single DB, to master-slave replication, to vertical partitioning and finally to horizontal partitioning. Database sharding looks simple in principle; anyone who has actually sharded a database knows how much it hurts in practice. For more on database scalability see the reference material at the end.
From all that rambling you can see how painful it is to scale database storage out horizontally. Luckily technology moves on and others in the industry have been hard at work: 2009 brought a whole wave of NoSQL, or more accurately "no relation", databases, most of which provide transparent horizontal scaling for unstructured data and greatly ease the designer's burden. Below I'll use HBase, a distributed column-oriented store, as the example.
1. So what exactly is HBase?
Before saying what HBase is, let's look at two concepts: row-oriented and column-oriented storage. Row-oriented storage should be familiar to everyone; the RDBMSs we know are of this type. Row-oriented databases mainly suit scenarios with strict transactional requirements, that is, OLTP. But per the CAP theorem, a traditional RDBMS achieves strong consistency through strict ACID transactions, which heavily discounts availability and scalability. Many current NoSQL products, HBase included, are eventually consistent systems: they give up some consistency in exchange for high availability. I mentioned column-oriented storage above; what is it exactly? HBase, Cassandra and Bigtable are all column-oriented distributed storage systems. If at this point you are still not sure what HBase is, no worries, here is the one-line summary:
HBase is a column-oriented distributed storage system. Its strengths are high-performance concurrent reads and writes, plus transparent splitting of the data, which gives the storage itself horizontal scalability.

2. The HBase data model
The data models of HBase and Cassandra are very similar; both take their ideas from Google's Bigtable, so all three look much alike. The only difference is that Cassandra has the notion of a Super column family, which I have not found in HBase. Enough chatter, let's see what HBase's data model really is.
HBase has two central concepts: the Row key and the Column Family. Starting with the column family (列族): column families are defined up front, before the system starts, and each column family can hold any number of columns distinguished by their "qualifier". An example makes this much clearer.
Suppose the system has a User table. In a traditional RDBMS the columns of User are fixed; the schema defines attributes such as name, age and sex, and attributes cannot be added dynamically. With a column store such as HBase, we define the User table with an info column family, and the data can be stored as info:name=zhangsan, info:age=30, info:sex=male and so on; if you later want yet another attribute, you simply write info:newProperty and you are done. A minimal client-side sketch of this follows.
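A minimal client-side sketch of the above with the HBase 0.90-era Java API. The table is assumed to already exist with a pre-declared info column family; the row key and values are just the ones from the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Writes one "User" row: columns under the info family are created on the fly,
// so adding info:newProperty later needs no schema change.
public class UserPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "User");
        Put put = new Put(Bytes.toBytes("user-001"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("zhangsan"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("newProperty"), Bytes.toBytes("whatever"));
        table.put(put);
        table.close();
    }
}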
If that example is still not clear enough, here is another. Anyone familiar with social networks knows the friend Feed. A feed is usually modelled as "someone did something with such-and-such a title at some time", but we also tend to reserve extra fields, because sometimes a feed needs a url, sometimes an image attribute, and so on; in other words the attributes of a feed are not fixed. A traditional relational database makes this awkward and wastes space on null cells, whereas a column store has no such problem: in HBase, a column cell with no value occupies no space at all. The two figures below illustrate the two designs:




    上图是传l的RDBMS设计的Feed表,我们可以看出feed有多列是固定的Q不能增加,q且为null的列费了空间。但是我们再看看下图Q下图ؓ(f)HbaseQCassandra,Bigtable的数据模型图Q从下图可以看出QFeed表的列可以动态的增加Qƈ且ؓ(f)I的列是不存储的Q这大大节U了I间Q关键是Feedq东襉K着pȝ的运行,各种各样的Feed?x)出玎ͼ我们事先没办法预有多少UFeedQ那么我们也没有办法确定Feed表有多少列,因此Hbase,Cassandra,Bigtable的基于列存储的数据模型就非常适合此场景。说到这里,采用Hbase的这U方式,q有一个非帔R要的好处是Feed?x)自动切分,当Feed表中的数据超q某一个阀g后,Hbase?x)自动?f)我们切分数据Q这L(fng)话,查询具有了伸羃性,而再加上Hbase的弱事务性的Ҏ(gu),对Hbase的写入操作也变得非常快?br />


Having covered column families, what about the Row key mentioned earlier? You can think of the row key as the primary key of a row in an RDBMS, but because HBase supports neither conditional queries nor ORDER BY, the row key has to be designed around your system's query patterns. Taking the Feed example again: we usually fetch someone's most recent feeds, so the Feed row key can be built from three parts, <userId><timestamp><feedId>. To query a user's recent feeds we then scan with a start row key of <userId><0><0> and an end row key of <userId><Long.MAX_VALUE><Long.MAX_VALUE>; and because HBase stores records sorted by row key, such queries are very fast. A scan sketch along these lines follows.
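A matching scan sketch: because rows are stored sorted by key, one user's feeds form a contiguous range. The "Feed" table name and the plain-string encoding of the composite key are simplifications for illustration; a real key would pack userId/timestamp/feedId as fixed-width bytes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Range scan over the composite row key <userId><timestamp><feedId>.
public class FeedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Feed");
        String userId = "u42";
        Scan scan = new Scan(Bytes.toBytes(userId),            // start row (inclusive)
                             Bytes.toBytes(userId + "~"));     // stop row (exclusive); '~' sorts after ASCII alphanumerics
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}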

3. HBase pros and cons
1. Columns can be added dynamically, and empty columns store no data, which saves storage space.

2. HBase splits data automatically, so the storage scales out horizontally by itself.

3. HBase supports highly concurrent read and write operations.

HBase's drawbacks:

1. No conditional queries; data can only be looked up by row key.

2. For now there is no failover for the Master server: when the Master goes down, the whole storage system goes down with it.



Some material on database scalability:
    http://www.jurriaanpersyn.com/archives/2009/02/12/database-sharding-at-netlog-with-mysql-and-php/

    http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/

How MapReduce Works (paulwong, 2013-01-29)
http://www.aygfsteel.com/paulwong/archive/2013/01/29/394872.html
  • The input is SPLIT into M map tasks.

  • The JobTracker hands these M tasks out to TaskTrackers for execution.

  • When a TaskTracker finishes a map task, it writes the output to local files and notifies the JobTracker.

  • On receiving that notification the JobTracker marks the task as complete; if it receives a failure message instead, it resets the task to its original state and reassigns it to another TaskTracker.

  • Once all map tasks are finished, the JobTracker has the map output lists reorganized so that identical keys are grouped together; the keys are partitioned into R reduce tasks, which are again handed out to TaskTrackers.

  • When a TaskTracker finishes a reduce task, it writes the result files to HDFS and notifies the JobTracker.


  • After all reduce tasks have finished, the JobTracker consolidates the results into the final output and notifies the client.


  • A map task may emit key/value pairs that differ from its input; the keys it emits determine how records are partitioned among the reduce tasks. The classic WordCount job below makes the whole flow concrete.
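The canonical WordCount job, as a concrete (if well-worn) illustration of the split, map, shuffle-by-key and reduce flow described above; input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each input split feeds one mapper, which emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: identical keys are grouped by the shuffle and summed here.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}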




Submitting a MapReduce Job from Eclipse on Windows to a Remote HBase (paulwong, 2013-01-29)
http://www.aygfsteel.com/paulwong/archive/2013/01/29/394851.html

1. Assume the remote Hadoop host is named ubuntu; add this entry to the hosts file: 192.168.58.130 ubuntu

2. Create a new Maven project with the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.cloudputing</groupId>
  <artifactId>bigdata</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>

  <name>bigdata</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.springframework.data</groupId>
      <artifactId>spring-data-hadoop</artifactId>
      <version>0.9.0.RELEASE</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.94.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.0.3</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-test</artifactId>
      <version>3.0.5.RELEASE</version>
    </dependency>
  </dependencies>
</project>

3. hbase-site.xml (the Apache license header of the original file is omitted):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://ubuntu:9000/hbase</value>
    </property>

    <!-- When the job is built, a folder is created to stage the required files.
         If this property is missing, the local environment is assumed to be Linux
         and Linux commands are used, which fails on Windows. -->
    <property>
        <name>mapred.job.tracker</name>
        <value>ubuntu:9001</value>
    </property>

    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>

    <!-- ZooKeeper is consulted here for the usable addresses of the cluster -->
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ubuntu</value>
    </property>
    <property skipInDoc="true">
        <name>hbase.defaults.for.version</name>
        <value>0.94.1</value>
    </property>

</configuration>

4. The test file, MapreduceTest.java:

package com.cloudputing.mapreduce;

import java.io.IOException;

import junit.framework.TestCase;

public class MapreduceTest extends TestCase {

    public void testReadJob() throws IOException, InterruptedException, ClassNotFoundException {
        MapreduceRead.read();
    }
}

5. MapreduceRead.java:

package com.cloudputing.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MapreduceRead {

    public static void read() throws IOException, InterruptedException, ClassNotFoundException {
        // (alternative from the original: build a temporary jar with an EJob helper and
        //  set it via ((JobConf) job.getConfiguration()).setJar(...) -- left out here)

        Configuration config = HBaseConfiguration.create();
        addTmpJar("file:/D:/PAUL/WORK/WORK-SPACES/TEST1/cloudputing/target/bigdata-1.0.jar", config);

        Job job = new Job(config, "ExampleRead");
        job.setJarByClass(MapreduceRead.class);     // class that contains mapper

        Scan scan = new Scan();
        scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
        scan.setCacheBlocks(false);  // don't set to true for MR jobs

        TableMapReduceUtil.initTableMapperJob(
                "wiki",                          // input HBase table name
                scan,                            // Scan instance to control CF and attribute selection
                MapreduceRead.MyMapper.class,    // mapper
                null,                            // mapper output key
                null,                            // mapper output value
                job);
        job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper

        boolean b = job.waitForCompletion(true);
        if (!b) {
            throw new IOException("error with job!");
        }
    }

    /**
     * Adds a third-party jar to the MapReduce job's classpath,
     * e.g. D:/Java/new_java_workspace/scm/lib/guava-r08.jar
     */
    public static void addTmpJar(String jarPath, Configuration conf) throws IOException {
        System.setProperty("path.separator", ":");
        FileSystem fs = FileSystem.getLocal(conf);
        String newJarPath = new Path(jarPath).makeQualified(fs).toString();
        String tmpjars = conf.get("tmpjars");
        if (tmpjars == null || tmpjars.length() == 0) {
            conf.set("tmpjars", newJarPath);
        } else {
            conf.set("tmpjars", tmpjars + ":" + newJarPath);
        }
    }

    public static class MyMapper extends TableMapper<Text, Text> {

        public void map(ImmutableBytesWritable row, Result value, Context context)
                throws InterruptedException, IOException {
            String val1 = getValue(value.getValue(Bytes.toBytes("text"), Bytes.toBytes("qual1")));
            String val2 = getValue(value.getValue(Bytes.toBytes("text"), Bytes.toBytes("qual2")));
            System.out.println(val1 + " -- " + val2);
        }

        private String getValue(byte[] value) {
            return value == null ? "null" : new String(value);
        }
    }
}


Future Enterprise IT Technology Focus Areas and the Coming Change in IT Architecture (paulwong, 2013-01-14)
http://www.aygfsteel.com/paulwong/archive/2013/01/14/394221.html

Gartner's ten strategic technologies, analysed as follows:

1. The mobile device wars

Mobile devices are diversifying; Windows is merely one of many environments IT has to support, and IT must support that diversity.

2. Mobile applications and HTML5

HTML5 is becoming ever more important, both to meet diversified needs and to serve enterprise-grade applications that take security very seriously.

3. The personal cloud

The personal cloud will shift the centre of gravity from client devices to cloud-based services delivered across devices.

4. Enterprise app stores

With enterprise app stores, IT's role changes from centralized planner to market manager, providing governance and brokerage services to users and perhaps ecosystem support to application specialists.

5. The Internet of Things

The Internet of Things is the idea of the Internet extending to physical objects, with consumer electronics and physical assets all connected to it.

6. Hybrid IT and cloud computing

Build a private cloud and the management platform to go with it, then use that platform to manage both internal and external services.

7. Strategic big data

Enterprises should treat big data as a transformational architecture, replacing relational databases based on homogeneous partitioning with a diverse mix of databases.

8. Actionable analytics

The point of big data is to give the enterprise actionable insight. Driven by mobile networks, social networks and sheer data volume, enterprises need to change the way they analyse in order to keep up with new ideas.

9. In-memory computing

In-memory computing will be offered as a cloud service to internal or external users; millions of events can be scanned within tens of milliseconds to detect correlations and patterns.

10. Integrated ecosystems

The market is shifting from loosely coupled heterogeneous systems to more integrated systems and ecosystems, with applications packaged together with hardware, software and services.

Combining implementation practice with customer needs, the following conclusions can be drawn:

1. The big data era has arrived

The growth of the Internet of Things and the explosion of unstructured and semi-structured data are driving demand for big data applications. Using big data effectively is the trend and the direction for mining the value of enterprise data assets.

2. Cloud computing remains the main theme, with the cloud paying more attention to the individual

Cloud computing is one of the core technologies reshaping today's IT, and it is the foundation on which big data and app stores are delivered. The rise of the personal cloud will push cloud services to focus more on individuals.

3. The mobile trend: enterprise app stores will change the traditional software delivery model

Windows is gradually ceasing to be the dominant client platform, and IT has to move step by step toward supporting multi-platform services. Building enterprise app stores on cloud platforms will gradually turn IT's role from centralized planner into manager of an application marketplace.

4. The Internet of Things will keep changing how we work and live

The Internet of Things will change how we live and work; it is a force for innovation. In this area, IPv6 is a technology worth studying.

The future enterprise IT architecture is sketched in the figure below (diagram not included in the feed).

Notes on the architecture:

1. Applications are decomposed and the client becomes very thin; users only see the small slice of content that concerns them, instead of facing hundreds of business menus when they open the system.

2. The enterprise back end is primarily a distributed architecture, and big-data service capability becomes the concentrated expression of an enterprise's core competitiveness.

3. Technologies for processing and analysing unstructured data will receive unprecedented attention.

Limited by my own ability this is for reference only; corrections and criticism are welcome!


    http://blog.csdn.net/sdhustyh/article/details/8484780


