Hadoop Frameworks by Application Area (2015-01-04)

1. Real Time Analytics : Apache Storm
2. In-memory Analytics : Apache Spark
3. Search Analytics : Apache Elasticsearch, Solr
4. Log Analytics : ELK Stack (Elasticsearch, Logstash, Kibana) or ESK Stack (Elasticsearch, Spark Streaming, Kibana)
5. Batch Analytics : Apache MapReduce

***** NO SQL DB *****
1. MongoDB
2. HBase
3. Cassandra

***** SOA *****
1. Oracle SOA
2. JBoss SOA
3. TIBCO SOA
4. SOAP, RESTful web services

Building the Hadoop Source Code (2014-12-16)

https://github.com/apache/hadoop/blob/trunk/BUILDING.txt

Configure Eclipse to build and develop the Hadoop (MapReduce) source code
http://blog.csdn.net/basicthinker/article/details/6174442

Building the Hadoop 2.2.0 source code
http://my.oschina.net/cloudcoder/blog/192224

Setting up a build environment for the Apache Hadoop source code
http://qq85609655.iteye.com/blog/1986991



  1. Download the code from https://codeload.github.com/apache/hadoop/zip/trunk, then unzip it; this produces a folder named hadoop-trunk.
    wget https://codeload.github.com/apache/hadoop/zip/trunk
    unzip trunk
  2. Install native libraries
    Ubuntu
    sudo apt-get -y install maven build-essential autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev

    CentOS
    yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool openssl-devel cmake
    get protobuf zip from http://f.dataguru.cn/thread-459689-1-1.html
    ./configure
    make
    make check
    make install
  3. $vi /etc/profile
    export PROTOC_HOME=/root/java/hadoop-source/protobuf-2.5.0
    export PATH=$PATH:$PROTOC_HOME/src
  4. cd to hadoop-trunk, run 
    mvn compile -Pnative
  5. cd to hadoop-maven-plugins, run
    mvn install
  6. cd to hadoop-trunk
    mvn install -DskipTests
  7. Still in the hadoop-trunk folder, build the Eclipse project files
    mvn eclipse:eclipse -DskipTests
  8. Import the maven project to Eclipse




Simplehbase (2014-07-15)

https://github.com/zhang-xzhi/simplehbase/
https://github.com/zhang-xzhi/simplehbase/wiki


## What is simplehbase?
Simplehbase is a lightweight middleware layer between Java and HBase.
Its main features:
* Data type mapping: conversion between Java types and HBase byte arrays.
* Simple operation wrappers: HBase put, get, scan and similar operations are exposed as plain Java calls.
* HBase query wrapper: wraps HBase filters so that HBase can be queried in an SQL-like way.
* Dynamic query wrapper: similar to MyBatis; dynamic query statements can be configured in XML.
* Insert/update support, built on HBase's checkAndPut.
* HBase multi-version support: interfaces for querying and mapping multi-version HBase data.
* Access to the native HBase interfaces.


### v0.9
New:

HTable can now be flushed on a timer.
The main scenario is batch writing where flushes run at a configured time interval,
which keeps batch throughput high while still giving some real-time guarantee.

Users can supply their own htablePoolService, so multiple HTables can share one thread pool.

intelligentScanSize: the scan caching size is derived from the query's limit value.


### v0.8
New batch-operation interfaces:
public <T> void putObjectList(List<PutRequest<T>> putRequestList); 
public void deleteObjectList(List<RowKey> rowKeyList, Class<?> type); 
public <T> void putObjectListMV(List<PutRequest<T>> putRequests,long timestamp) 
public <T> void putObjectListMV(List<PutRequest<T>> putRequests,Date timestamp) 
public <T> void putObjectListMV(List<PutRequest<T>> putRequestList) 
public void deleteObjectMV(RowKey rowKey, Class<?> type, long timeStamp) 
public void deleteObjectMV(RowKey rowKey, Class<?> type, Date timeStamp) 
public void deleteObjectListMV(List<RowKey> rowKeyList, Class<?> type,long timeStamp) 
public void deleteObjectListMV(List<RowKey> rowKeyList, Class<?> type,Date timeStamp) 
public void deleteObjectListMV(List<DeleteRequest> deleteRequestList,Class<?> type); 


New util (for prefix queries):
public static RowKey getEndRowKeyOfPrefix(RowKey prefixRowKey) 
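
For context, a prefix query of this kind typically derives its scan stop row by incrementing the last byte of the prefix that is not 0xFF. The sketch below is a generic illustration of that idea, not Simplehbase's actual implementation:

import java.util.Arrays;

public class PrefixEndKey {
    // Returns the exclusive end key for scanning all rows that start with "prefix",
    // or null if the prefix is all 0xFF bytes (then scan to the end of the table instead).
    public static byte[] endRowOfPrefix(byte[] prefix) {
        byte[] end = Arrays.copyOf(prefix, prefix.length);
        for (int i = end.length - 1; i >= 0; i--) {
            if (end[i] != (byte) 0xFF) {
                end[i]++;                          // bump the last incrementable byte
                return Arrays.copyOf(end, i + 1);  // drop any trailing 0xFF bytes
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] end = endRowOfPrefix("user_123".getBytes());
        System.out.println(new String(end));       // prints "user_124"
    }
}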

Performance improvement:
The implementation of get was switched back from scan to a native get.

### v0.7
New: queries can now return the main record together with the associated RowKeys.

Installing Cloudera (2014-05-23)

http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html


http://www.cnblogs.com/xuesong/p/3604080.html


http://www.linuxidc.com/Linux/2013-12/94180.htm

Uninstalling:
http://www.cnblogs.com/shudonghe/articles/3133290.html

Installation files:
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=4ZFrtT9ZQN


  1. Enable passwordless sudo
    sudo chmod +w /etc/sudoers
    sudo vi /etc/sudoers 
    ufuser ALL=(ALL) NOPASSWD: ALL
    sudo chmod -w /etc/sudoers

  2. Disable SELinux
    sudo vi /etc/selinux/config
    SELINUX=disabled
    sudo reboot

  3. add to /etc/hosts
    sudo vi /etc/hosts

    10.0.0.4 ufhdp001.cloudapp.net ufhdp001
    10.0.0.5 ufhdp002.cloudapp.net ufhdp002

  4. download bin
    wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

  5. run the bin
    chmod 755 cloudera-manager-installer.bin 
    sudo ./cloudera-manager-installer.bin 


Ten Hadoop Big-Data Startups Worth Watching in 2014 (2014-05-23)

The open-source big-data framework Apache Hadoop has become the de facto standard for big-data processing and almost a synonym for big data itself, even if that is something of an overgeneralization.

According to Gartner, the Hadoop ecosystem market is currently worth about $77 million and will grow rapidly to $813 million by 2016.

Swimming in this fast-expanding blue ocean is not easy, though. Building big-data infrastructure products is hard, and selling them is harder still, especially for infrastructure tools such as Hadoop, NoSQL databases, and processing systems. Customers need a great deal of training and education, and paying users demand extensive support and fast-moving product development. Dealing with enterprise customers is rarely a startup team's strong suit, and big-data infrastructure startups also tend to need large amounts of venture capital.

Despite these difficulties, Hadoop startups keep springing up. Beyond the ones that have already made their names, such as Cloudera, Datameer, DataStax, and MapR, CIO magazine recently picked the ten Hadoop startups most worth watching in 2014. Their products and business models are a useful reference for both big-data entrepreneurs and enterprise users:

1. Platfora
Business: a big-data analytics solution that turns raw data in Hadoop into interactive, in-memory business-intelligence services.
Profile: founded in 2011; total funding to date is in the tens of millions of dollars.
Why it matters: Platfora's goal is to simplify the complex, hard-to-use Hadoop and push it into the enterprise market. Its approach is to streamline data collection and analysis by automatically turning raw Hadoop data into interactive business services, with no ETL or data warehouse required. (Related reading: "Hadoop is just cheap ETL".)

2. Alpine Data Labs
Business: a Hadoop-based data analytics platform.
Profile: founded in 2010; $23.5 million raised to date.
Why it matters: advanced analytics and machine-learning applications usually require scripting and coding experts, which raises the technical bar for data scientists even further. In practice, enterprise executives and IT managers have neither the time nor the inclination to learn programming or to wrestle with Hadoop's complexity. Alpine Data's SaaS delivery model dramatically lowers the barrier to predictive analytics.

3. Altiscale
Business: Hadoop-as-a-Service (HaaS).
Profile: founded in 2012; $12 million raised to date.
Why it matters: big data is in the middle of a talent shortage, and delivering Hadoop as a cloud service is a shortcut to adoption; TechNavio expects HaaS to be a multi-billion-dollar market by 2016, a big pie. But competition is already fierce: Amazon EMR, Microsoft's Hadoop on Azure, and Rackspace's Hortonworks cloud service are heavyweight players, and Altiscale also competes directly with Hortonworks, Cloudera, Mortar Data, Qubole, and Xplenty.

4. Trifacta
Business: a platform that turns complex raw data into clean, structured formats ready for analysis.
Profile: founded in 2012; $16.3 million raised to date.
Why it matters: there is a huge bottleneck between big-data platforms and analytics tools: data specialists spend enormous effort and time transforming data, and business analysts often lack the technical skill to do the transformation themselves. To solve this, Trifacta developed a "predictive interaction" technology that visualizes data operations; its machine-learning algorithms observe both the user and the data's properties, predict the user's intent, and make suggestions automatically. Its competitors are Paxata, Informatica, and CirroHow.

5. Splice Machine
Business: a Hadoop-based, SQL-compatible database for big-data applications.
Profile: founded in 2012; $19 million raised to date.
Why it matters: new data technologies allow popular relational-database features such as ACID compliance, transactional consistency, and the standard SQL query language to live on cheap, scalable Hadoop. Splice Machine keeps all the advantages of NoSQL databases, such as auto-sharding, fault tolerance, and scalability, while retaining SQL.

6. DataTorrent
Business: a real-time stream-processing platform on Hadoop.
Profile: founded in 2012; closed a Series A round of several million dollars in 2013.
Why it matters: the future of big data is fast data, and fast data is exactly the problem DataTorrent is out to solve.

7. Qubole
Business: big-data Data-as-a-Service built on "truly auto-scaling Hadoop clusters".
Profile: founded in 2011; $7 million raised to date.
Why it matters: big-data talent is hard to find; for most enterprises, consuming Hadoop the way they consume SaaS business applications is the realistic choice.

8. Continuuity
Business: a Hadoop-based hosting platform for big-data applications.
Profile: founded in 2011; $12.5 million raised to date. Founder and CEO Todd Papaioannou was formerly the VP responsible for cloud architecture at Yahoo; after he left Continuuity last summer, co-founder and CTO Jonathan Gray took over as CEO.
Why it matters: Continuuity's business model is both clever and unusual: it bypasses the perpetually scarce Hadoop experts and offers an application development platform directly to Java developers. Its flagship product, Reactor, is a Hadoop-based integrated data and application framework for Java that abstracts away the underlying infrastructure and exposes it through simple Java and REST APIs, hiding much of Hadoop's complexity from users. Continuuity's newest offering, Loom, is a cluster-management solution: clusters created with Loom can be built from templates for any hardware and software stack, from a single LAMP server or a traditional application server such as JBoss up to large-scale Hadoop clusters with thousands of nodes. The clusters can also be deployed across multiple cloud providers (Rackspace, Joyent, OpenStack, and so on) and managed with common SCM tools.
9. Xplenty
Business: HaaS.
Profile: founded in 2012; raised an undisclosed amount from Magma Venture Partners.
Why it matters: although Hadoop has become the de facto industrial standard for big data, developing, deploying, and maintaining it still places very high demands on engineers' skills. Xplenty delivers Hadoop processing through a development environment that requires no coding, so enterprises can benefit from big-data technology without investing in hardware, software, or specialist staff.

10. Nuevora
Business: big-data analytics applications.
Profile: founded in 2011; $3 million in early-stage funding to date.
Why it matters: Nuevora focuses on two of the first areas where big data took off, marketing and customer engagement. The main capability of its nBAAP (Big Data Analytics and Apps) platform is customized analytics applications built on predictive algorithms. nBAAP builds on three key big-data technologies: Hadoop (big-data processing), R (predictive analytics), and Tableau (data visualization).


KMEANS PAGERANK ON HADOOP (2014-05-07)

https://github.com/keokilee/kmeans-hadoop

https://github.com/rorlig/hadoop-pagerank-java

http://wuyanzan60688.blog.163.com/blog/static/12777616320131011426159/

http://codecloud.net/hadoop-k-means-591.html


import java.io.*;
import java.net.URI;
import java.util.Iterator;
import java.util.Random;
import java.util.Vector;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.GenericOptionsParser;

public class KMeans {
    static enum Counter { CENTERS, CHANGE, ITERATIONS }

    public static class Point implements WritableComparable<Point> {
        // Longs because this will store sum of many ints
        public LongWritable x;
        public LongWritable y;
        public IntWritable num; // For summation points

        public Point() {
            this.x = new LongWritable(0);
            this.y = new LongWritable(0);
            this.num = new IntWritable(0);
        }

        public Point(int x, int y) {
            this.x = new LongWritable(x);
            this.y = new LongWritable(y);
            this.num = new IntWritable(1);
        }

        public Point(IntWritable x, IntWritable y) {
            this.x = new LongWritable(x.get());
            this.y = new LongWritable(y.get());
            this.num = new IntWritable(1);
        }

        public void add(Point that) {
            x.set(x.get() + that.x.get());
            y.set(y.get() + that.y.get());
            num.set(num.get() + that.num.get());
        }

        public void norm() {
            x.set(x.get() / num.get());
            y.set(y.get() / num.get());
            num.set(1);
        }

        public void write(DataOutput out) throws IOException {
            x.write(out);
            y.write(out);
            num.write(out);
        }

        public void readFields(DataInput in) throws IOException {
            x.readFields(in);
            y.readFields(in);
            num.readFields(in);
        }

        public long distance(Point that) {
            long dx = that.x.get() - x.get();
            long dy = that.y.get() - y.get();

            return dx * dx + dy * dy;
        }

        public String toString() {
            String ret = x.toString() + '\t' + y.toString();
            if (num.get() != 1)
                ret += '\t' + num.toString();
            return ret;
        }

        public int compareTo(Point that) {
            int ret = x.compareTo(that.x);
            if (ret == 0)
                ret = y.compareTo(that.y);
            if (ret == 0)
                ret = num.compareTo(that.num);
            return ret;
        }
    }

    public static class Map
            extends MapReduceBase
            implements Mapper<Text, Text, Point, Point>
    {
        private Vector<Point> centers;
        private IOException error;

        public void configure(JobConf conf) {
            try {
                Path paths[] = DistributedCache.getLocalCacheFiles(conf);
                if (paths.length != 1)
                    throw new IOException("Need exactly 1 centers file");

                FileSystem fs = FileSystem.getLocal(conf);
                SequenceFile.Reader in = new SequenceFile.Reader(fs, paths[0], conf);

                centers = new Vector<Point>();
                IntWritable x = new IntWritable();
                IntWritable y = new IntWritable();
                while(in.next(x, y))
                    centers.add(new Point(x, y));
                in.close();

                // Generate new points if we don't have enough.
                int k = conf.getInt("k", 0);
                Random rand = new Random();
                final int MAX = 1024*1024;
                for (int i = centers.size(); i < k; i++) {
                    x.set(rand.nextInt(MAX));
                    y.set(rand.nextInt(MAX));
                    centers.add(new Point(x, y));
                }
            } catch (IOException e) {
                error = e;
            }
        }

        public void map(Text xt, Text yt,
                OutputCollector<Point, Point> output, Reporter reporter)
            throws IOException
        {
            if (error != null)
                throw error;

            int x = Integer.valueOf(xt.toString());
            int y = Integer.valueOf(yt.toString());
            Point p = new Point(x, y);
            Point center = null;
            long distance = Long.MAX_VALUE;

            for (Point c : centers) {
                long d = c.distance(p);
                if (d <= distance) {
                    distance = d;
                    center = c;
                }
            }

            output.collect(center, p);
        }
    }

    public static class Combine
            extends MapReduceBase
            implements Reducer<Point, Point, Point, Point>
    {
        public void reduce(Point center, Iterator<Point> points,
                OutputCollector<Point, Point> output, Reporter reporter)
            throws IOException
        {
            Point sum = new Point();
            while(points.hasNext()) {
                sum.add(points.next());
            }

            output.collect(center, sum);
        }
    }

    public static class Reduce
            extends MapReduceBase
            implements Reducer<Point, Point, IntWritable, IntWritable>
    {
        public void reduce(Point center, Iterator<Point> points,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter)
            throws IOException
        {
            Point sum = new Point();
            while (points.hasNext()) {
                sum.add(points.next());
            }
            sum.norm();

            IntWritable x = new IntWritable((int) sum.x.get());
            IntWritable y = new IntWritable((int) sum.y.get());

            output.collect(x, y);

            reporter.incrCounter(Counter.CHANGE, sum.distance(center));
            reporter.incrCounter(Counter.CENTERS, 1);
        }
    }

    public static void error(String msg) {
        System.err.println(msg);
        System.exit(1);
    }

    public static void initialCenters(
            int k, JobConf conf, FileSystem fs,
            Path in, Path out)
        throws IOException
    {
        BufferedReader input = new BufferedReader(
                new InputStreamReader(fs.open(in)));
        SequenceFile.Writer output = new SequenceFile.Writer(
                fs, conf, out, IntWritable.class, IntWritable.class);
        IntWritable x = new IntWritable();
        IntWritable y = new IntWritable();
        for (int i = 0; i < k; i++) {
            String line = input.readLine();
            if (line == null)
                error("Not enough points for number of means");

            String parts[] = line.split("\t");
            if (parts.length != 2)
                throw new IOException("Found a point without two parts");

            x.set(Integer.valueOf(parts[0]));
            y.set(Integer.valueOf(parts[1]));
            output.append(x, y);
        }
        output.close();
        input.close();
    }

    public static void main(String args[]) throws IOException {
        JobConf conf = new JobConf(KMeans.class);
        GenericOptionsParser opts = new GenericOptionsParser(conf, args);
        String paths[] = opts.getRemainingArgs();

        FileSystem fs = FileSystem.get(conf);

        if (paths.length < 3)
            error("Usage:\n"
                    + "\tKMeans <file to display>\n"
                    + "\tKMeans <output> <k> <input file>"
                 );

        Path outdir  = new Path(paths[0]);
        int k = Integer.valueOf(paths[1]);
        Path firstin = new Path(paths[2]);
        
        if (k < 1 || k > 20)
            error("Strange number of means: " + paths[1]);

        if (fs.exists(outdir)) {
            if (!fs.getFileStatus(outdir).isDir())
                error("Output directory \"" + outdir.toString()
                        + "\" exists and is not a directory.");
        } else {
            fs.mkdirs(outdir);
        }

        // Input: text file, each line "x\ty"
        conf.setInputFormat(KeyValueTextInputFormat.class);
        for (int i = 2; i < paths.length; i++)
            FileInputFormat.addInputPath(conf, new Path(paths[i]));

        conf.setInt("k", k);

        // Map: (x,y) -> (centroid, point)
        conf.setMapperClass(Map.class);
        conf.setMapOutputKeyClass(Point.class);
        conf.setMapOutputValueClass(Point.class);

        // Combine: (centroid, points) -> (centroid, weighted point)
        conf.setCombinerClass(Combine.class);

        // Reduce: (centroid, weighted points) -> (x, y) new centroid
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(IntWritable.class);

        // Output
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // Chose initial centers
        Path centers = new Path(outdir, "initial.seq");
        initialCenters(k, conf, fs, firstin, centers);

        // Iterate
        long change  = Long.MAX_VALUE;
        URI cache[] = new URI[1];
        for (int iter = 1; iter <= 1000 && change > 100 * k; iter++) {
            Path jobdir = new Path(outdir, Integer.toString(iter));
            FileOutputFormat.setOutputPath(conf, jobdir);

            conf.setJobName("k-Means " + iter);
            conf.setJarByClass(KMeans.class);

            cache[0] = centers.toUri();
            DistributedCache.setCacheFiles( cache, conf );

            RunningJob result = JobClient.runJob(conf);
            System.out.println("Iteration: " + iter);

            change   = result.getCounters().getCounter(Counter.CHANGE);
            centers  = new Path(jobdir, "part-00000");
        }
    }
}

Packt celebrates International Day Against DRM, May 6th 2014 (2014-05-06)

According to the definition of DRM on Wikipedia, Digital Rights Management (DRM) is a class of technologies that are used by hardware manufacturers, publishers, copyright holders, and individuals with the intent to control the use of digital content and devices after sale.

 

However, Packt Publishing firmly believes that you should be able to read and interact with your content when you want, where you want, and how you want; to that end they have been advocates of DRM-free content since their very first eBook was published back in 2004.

 

To show their continuing support for Day Against DRM, Packt Publishing is offering all its DRM-free content at $10 for 24 hours only on May 6th: that's all 2000+ eBooks and videos. Check it out at: http://bit.ly/1q6bpha.

A book: Web Crawling and Data Mining with Apache Nutch (2014-02-03)

Recently I have been reading the book "Web Crawling and Data Mining with Apache Nutch" (http://www.packtpub.com/web-crawling-and-data-mining-with-apache-nutch/book). It is really a great book, and it has helped me in my project.

In my project I need to crawl web content and analyze the data. From the book I learned how to use and integrate the Nutch and Solr frameworks to implement this.

If you have a similar case, I recommend reading this book.


[Repost] A classic comic strip explaining how HDFS works (2013-10-26)

Among distributed file systems, the best known are HDFS and GFS, of which HDFS is the simpler. The original post explains how HDFS works through a very concise, easy-to-follow comic strip; it is far more accessible than a typical slide deck and a rare find as learning material.

1. Three parts: the client, the nameserver (i.e. the namenode, which acts as the controller and file index, similar to an inode in Linux), and the datanode (which stores the actual data).

在这里,client的Ş式我所了解的有两种Q通过hadoop提供的api所~写的程序可以和hdfsq行交互Q另外一U就是安装了hadoop的datanode其也可以通过命o行与hdfspȝq行交互Q如在datanode上上传则使用如下命o行:bin/hadoop fs -put example1 user/chunk/


2. How data is written

3. How data is read

4. Fault tolerance, part 1: failure types and how they are detected (node failures, network failures, and corrupted data)

5. Fault tolerance, part 2: fault tolerance for reads and writes

6. Fault tolerance, part 3: datanode failure

7. Replication rules

8. Closing remarks

(Each section above is illustrated by comic panels in the original post; the images are not reproduced in this copy.)
Install Hadoop in the AWS cloud (2013-09-08)
  • get the Whirr tar file
    wget http://www.eu.apache.org/dist/whirr/stable/whirr-0.8.2.tar.gz
  • untar the Whirr tar file
    tar -vxf whirr-0.8.2.tar.gz
  • create credentials file
    mkdir ~/.whirr
    cp conf/credentials.sample ~/.whirr/credentials
  • add the following content to credentials file
    # Set cloud provider connection details
    PROVIDER=aws-ec2
    IDENTITY=<AWS Access Key ID>
    CREDENTIAL=<AWS Secret Access Key>
  • generate an RSA key pair
    ssh-keygen -t rsa -P ''
  • create a hadoop.properties file and add the following content
    whirr.cluster-name=whirrhadoopcluster
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
    whirr.hadoop.version=1.0.2
    whirr.aws-ec2-spot-price=0.08
  • launch hadoop
    bin/whirr launch-cluster --config hadoop.properties
  • launch proxy
    cd ~/.whirr/whirrhadoopcluster/
    ./hadoop-proxy.sh
  • add a rule to iptables
    0.0.0.0/0 50030
    0.0.0.0/0 50070
  • check the web ui in the browser
    http://<aws-public-dns>:50030
  • add to /etc/profile
    export HADOOP_CONF_DIR=~/.whirr/whirrhadoopcluster/
  • check if the hadoop works
    hadoop fs -ls /
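
The same check can be done from Java once HADOOP_CONF_DIR (or the classpath) points at the Whirr-generated configuration. A minimal sketch, equivalent to hadoop fs -ls / (the class name is arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClusterCheck {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the configuration directory on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Equivalent of "hadoop fs -ls /": list the root of the cluster's file system.
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println((status.isDir() ? "d " : "- ")
                        + status.getPath() + "  replication=" + status.getReplication());
            }
            fs.close();
        }
    }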



















    Install hadoop+hbase+nutch+elasticsearch (2013-08-31)

    (Only a truncated excerpt of this post appears in the feed; the full post is at http://www.aygfsteel.com/paulwong/archive/2013/08/31/403513.html)

    Implementation for CombineFileInputFormat, Hadoop 0.20.205 (2013-08-29)
    The combining logic works in three steps:

    1. For each file in the input directory: if its length exceeds mapred.max.split.size, the file is divided into several splits on block boundaries (one split is the input of one map). Each such split is longer than mapred.max.split.size and, because it is built from whole blocks, also longer than blockSize. If the remaining tail of the file is longer than mapred.min.split.size.per.node, it becomes a split of its own; otherwise it is set aside for now.

    2. What remains are short fragments. The fragments on each rack are merged: whenever the accumulated length exceeds mapred.max.split.size, a split is emitted. If the final leftover on a rack is larger than mapred.min.split.size.per.rack, it is merged into one split; otherwise it is kept aside.

    3. Fragments from different racks are then merged: whenever the accumulated length exceeds mapred.max.split.size, a split is emitted, and whatever fragments remain, regardless of length, are merged into one last split.

    Example: mapred.max.split.size=1000, mapred.min.split.size.per.node=300, mapred.min.split.size.per.rack=100.
    The input directory holds five files: three on rack1 with lengths 2050, 1499, 10, and two on rack2 with lengths 1010, 80; blockSize is 500.
    Step 1 produces five splits: 1000, 1000, 1000, 499, 1000. The leftover fragments are 50 and 10 on rack1 and 10 and 80 on rack2.
    Because the fragments on each rack add up to less than 100, step 2 changes neither the splits nor the fragments.
    Step 3 merges the four fragments into one split of length 150.

    To reduce the number of maps, increase mapred.max.split.size; to get more maps, decrease it.

    Its characteristics: a block is the input of at most one map; a file may have many blocks and may therefore be split across different maps; and one map may process several blocks, possibly from several files.

    Note: CombineFileInputFormat is an abstract class, so you have to write a subclass, as below.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    @SuppressWarnings("deprecation")
    public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

        @SuppressWarnings({ "unchecked", "rawtypes" })
        @Override
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {

            return new CombineFileRecordReader(conf, (CombineFileSplit) split, reporter, (Class) myCombineFileRecordReader.class);
        }

        public static class myCombineFileRecordReader implements RecordReader<LongWritable, Text> {
            private final LineRecordReader linerecord;

            public myCombineFileRecordReader(CombineFileSplit split, Configuration conf, Reporter reporter, Integer index) throws IOException {
                FileSplit filesplit = new FileSplit(split.getPath(index), split.getOffset(index), split.getLength(index), split.getLocations());
                linerecord = new LineRecordReader(conf, filesplit);
            }

            @Override
            public void close() throws IOException {
                linerecord.close();

            }

            @Override
            public LongWritable createKey() {
                // TODO Auto-generated method stub
                return linerecord.createKey();
            }

            @Override
            public Text createValue() {
                // TODO Auto-generated method stub
                return linerecord.createValue();
            }

            @Override
            public long getPos() throws IOException {
                // TODO Auto-generated method stub
                return linerecord.getPos();
            }

            @Override
            public float getProgress() throws IOException {
                // TODO Auto-generated method stub
                return linerecord.getProgress();
            }

            @Override
            public boolean next(LongWritable key, Text value) throws IOException {

                // TODO Auto-generated method stub
                return linerecord.next(key, value);
            }

        }
    }


    Set it up like this in the job driver at run time:

    if (argument != null) {
        conf.set("mapred.max.split.size", argument);
    } else {
        conf.set("mapred.max.split.size", "134217728"); // 128 MB
    }

    conf.setInputFormat(CombinedInputFormat.class);
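
    As an aside, the newer mapreduce API in Hadoop 2.x ships a ready-made CombineTextInputFormat, so for plain-text input the same combining behaviour can be had without writing a subclass. A rough sketch under that assumption (mapper/reducer setup is omitted; input and output paths come from the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineTextDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-text-example");
            job.setJarByClass(CombineTextDriver.class);
            // Mapper/Reducer classes would be configured here in a real job.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Same knob as mapred.max.split.size above: cap each combined split at 128 MB.
            CombineTextInputFormat.setMaxInputSplitSize(job, 134217728L);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }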




Big Data Platform Architecture Design Material (2013-08-18)

Implementing a Hadoop-based big data platform: overall architecture design
http://blog.csdn.net/jacktan/article/details/9200979


How to install a Hadoop cluster (2-node) and HBase on VMware Workstation. It also includes installing Pig and Hive in the appendix (2013-08-17)

    Requires: Ubuntu 10.04, Hadoop 0.20.2, ZooKeeper 3.3.2, HBase 0.90.0
    1. Download Ubuntu 10.04 desktop 32 bit from Ubuntu website.

    2. Install Ubuntu 10.04 with username: hadoop, password: password,  disk size: 20GB, memory: 2048MB, 1 processor, 2 cores

    3. Install build-essential (for GNU C, C++ compiler)    $ sudo apt-get install build-essential

    4. Install sun-jave-6-jdk
        (1) Add the Canonical Partner Repository to your apt repositories
        $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
         (2) Update the source list
        $ sudo apt-get update
        (3) Install sun-java-6-jdk and make sure Sun's Java is the default JVM
        $ sudo apt-get install sun-java6-jdk
         (4) Set environment variable by modifying ~/.bashrc file, put the following two lines in the end of the file
        export JAVA_HOME=/usr/lib/jvm/java-6-sun
        export PATH=$PATH:$JAVA_HOME/bin 

    5. Configure SSH server so that ssh to localhost doesn’t need a passphrase
        (1) Install openssh server
        $ sudo apt-get install openssh-server
         (2) Generate RSA pair key
        $ ssh-keygen -t rsa -P ""
         (3) Enable SSH access to local machine
        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

    6. Disable IPv6 by modifying the /etc/sysctl.conf file; put the following lines at the end of the file
    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

    7. Install hadoop
        (1) Download hadoop-0.20.2.tar.gz(stable release on 1/25/2011)  from Apache hadoop website   
        (2) Extract hadoop archive file to /usr/local/   
        (3) Make symbolic link   
        (4) Modify /usr/local/hadoop/conf/hadoop-env.sh   
    Change
        # The java implementation to use. Required.
        # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
    to
        # The java implementation to use. Required.
        export JAVA_HOME=/usr/lib/jvm/java-6-sun
         (5)Create /usr/local/hadoop-datastore folder   
    $ sudo mkdir /usr/local/hadoop-datastore
    $ sudo chown hadoop:hadoop /usr/local/hadoop-datastore
    $ sudo chmod 750 /usr/local/hadoop-datastore
        (6) Put the following code in /usr/local/hadoop/conf/core-site.xml
        <property>
          <name>hadoop.tmp.dir</name>
          <value>/usr/local/hadoop/tmp/dir/hadoop-${user.name}</value>
          <description>A base for other temporary directories.</description>
        </property>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://master:54310</value>
          <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
        </property>
        (7) Put the following code in /usr/local/hadoop/conf/mapred-site.xml
        <property>
          <name>mapred.job.tracker</name>
          <value>master:54311</value>
          <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
        </property>
        (8) Put the following code in /usr/local/hadoop/conf/hdfs-site.xml
        <property>
          <name>dfs.replication</name>
          <value>1</value>
          <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
        </property>
        (9) Add Hadoop to the environment variables by modifying ~/.bashrc
        export HADOOP_HOME=/usr/local/hadoop
        export PATH=$HADOOP_HOME/bin:$PATH

    8. Restart Ubuntu Linux

    9. Copy this virtual machine to another folder. At least we have 2 copies of Ubuntu linux

    10. Modify /etc/hosts on both Linux virtual machines and add the following lines. The IP addresses depend on each machine; use ifconfig to find them out.
    # /etc/hosts (for master AND slave)
    192.168.0.1 master
    192.168.0.2 slave
    Also modify the existing line below (the default hostname entry), because it might cause HBase to resolve the wrong IP:
    192.168.0.1 ubuntu

    11. Check hadoop user access on both machines.
    The hadoop user on the master (aka hadoop@master) must be able to connect a) to its own user account on the master (i.e. ssh master in this context, not necessarily ssh localhost) and b) to the hadoop user account on the slave (aka hadoop@slave) via a password-less SSH login. On both machines, make sure each one can connect to both master and slave without typing passwords.

    12. Cluster configuration
        (1) Modify /usr/local/hadoop/conf/masters (only on the master machine) so that it contains:
            master
        (2) Modify /usr/local/hadoop/conf/slaves (only on the master machine) so that it contains:
            master
            slave
        (3) Change "localhost" to "master" in /usr/local/hadoop/conf/core-site.xml and /usr/local/hadoop/conf/mapred-site.xml
            only on master machine
        (4) Change dfs.replication to “1” in /usr/local/conf/hadoop/conf/hdfs-site.xml
        only on master machine   

    13. Format the namenode only once and only on master machine
    $ /usr/local/hadoop/bin/hadoop namenode -format

    14. Later on, start the multi-node cluster by typing the following, only on the master. For now, don't start Hadoop yet.
    $ /usr/local/hadoop/bin/start-dfs.sh
    $ /usr/local/hadoop/bin/start-mapred.sh

    15. Install ZooKeeper, only on the master node
        (1) Download zookeeper-3.3.2.tar.gz from the Apache Hadoop website
        (2) Extract zookeeper-3.3.2.tar.gz
        $ tar -xzf zookeeper-3.3.2.tar.gz
        (3) Move the zookeeper-3.3.2 folder to /home/hadoop/ and create a symbolic link
        $ mv zookeeper-3.3.2 /home/hadoop/ ; ln -s /home/hadoop/zookeeper-3.3.2 /home/hadoop/zookeeper
        (4) Copy conf/zoo_sample.cfg to conf/zoo.cfg
        $ cp conf/zoo_sample.cfg conf/zoo.cfg
        (5) Modify conf/zoo.cfg
        dataDir=/home/hadoop/zookeeper/snapshot

    16. Install HBase on both master and slave nodes, configured as fully-distributed
        (1) Download hbase-0.90.0.tar.gz from the Apache Hadoop website
        (2) Extract hbase-0.90.0.tar.gz
        $ tar -xzf hbase-0.90.0.tar.gz
        (3) Move the hbase-0.90.0 folder to /home/hadoop/ and create a symbolic link
        $ mv hbase-0.90.0 /home/hadoop/ ; ln -s /home/hadoop/hbase-0.90.0 /home/hadoop/hbase
        (4) Edit /home/hadoop/hbase/conf/hbase-site.xml and put the following between <configuration> and </configuration>
        <property>
          <name>hbase.rootdir</name>
          <value>hdfs://master:54310/hbase</value>
          <description>The directory shared by region servers. Should be fully-qualified to include the filesystem to use. E.g: hdfs://NAMENODE_SERVER:PORT/HBASE_ROOTDIR</description>
        </property>
        <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
          <description>The mode the cluster will be in. Possible values are false: standalone and pseudo-distributed setups with managed ZooKeeper; true: fully-distributed with unmanaged ZooKeeper Quorum (see hbase-env.sh)</description>
        </property>
        <property>
          <name>hbase.zookeeper.quorum</name>
          <value>master</value>
          <description>Comma separated list of servers in the ZooKeeper Quorum. If HBASE_MANAGES_ZK is set in hbase-env.sh this is the list of servers which we will start/stop ZooKeeper on.</description>
        </property>
         (5) modify environment variables in /home/hadoop/hbase/conf/hbase-env.sh
        export JAVA_HOME=/usr/lib/jvm/java-6-sun/
    export HBASE_IDENT_STRING=$HOSTNAME
    export HBASE_MANAGES_ZK=false
        (6) Overwrite /home/hadoop/hbase/conf/regionservers on both machines with:
            master
            slave
        (7) Copy /usr/local/hadoop-0.20.2/hadoop-0.20.2-core.jar to /home/hadoop/hbase/lib/ on both machines.
            This is very important to fix the version-mismatch issue. Pay attention to its ownership and mode (755).

    17. Start ZooKeeper. (The ZooKeeper bundled with HBase does not seem to be set up correctly, so use the standalone one.)
    $ /home/hadoop/zookeeper/bin/zkServer.sh start
    (Optional) Test whether ZooKeeper is running correctly:
    $ /home/hadoop/zookeeper/bin/zkCli.sh -server 127.0.0.1:2181

    18. Start the Hadoop cluster
    $ /usr/local/hadoop/bin/start-dfs.sh
    $ /usr/local/hadoop/bin/start-mapred.sh

    19. Start Hbase
    $ /home/hadoop/hbase/bin/start-hbase.sh

    20. Use the HBase shell
    $ /home/hadoop/hbase/bin/hbase shell
        Check whether HBase is running smoothly by opening the following in a browser:
        http://localhost:60010


    21. Later on, stop the multi-node cluster by typing the following, only on the master
        (1) Stop HBase
    $ /home/hadoop/hbase/bin/stop-hbase.sh
        (2) Stop the Hadoop file system (HDFS)
    $ /usr/local/hadoop/bin/stop-mapred.sh
    $ /usr/local/hadoop/bin/stop-dfs.sh
        (3) Stop ZooKeeper
    $ /home/hadoop/zookeeper/bin/zkServer.sh stop

    Reference
    http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
    http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
    http://wiki.apache.org/hadoop/Hbase/10Minutes
    http://hbase.apache.org/book/quickstart.html
    http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/

    Author
    Tzu-Cheng Chuang


    Appendix- Install Pig and Hive
    1. Install Pig 0.8.0 on this cluster
        (1) Download pig-0.8.0.tar.gz from Apache pig project website.  Then extract the file and move it to /home/hadoop/   
    $ tar -xzf pig-0.8.0.tar.gz ; mv pig-0.8.0 /home/hadoop/
        (2) Make symbolic links under pig-0.8.0/conf/
    $ ln -s /usr/local/hadoop/conf/core-site.xml /home/hadoop/pig-0.8.0/conf/core-site.xml
    $ ln -s /usr/local/hadoop/conf/mapred-site.xml /home/hadoop/pig-0.8.0/conf/mapred-site.xml
    $ ln -s /usr/local/hadoop/conf/hdfs-site.xml /home/hadoop/pig-0.8.0/conf/hdfs-site.xml
        (3) Start Pig in map-reduce mode:
    $ /home/hadoop/pig-0.8.0/bin/pig
        (4) Exit Pig from the grunt> prompt: quit

    2. Install Hive on this cluster
        (1) Download hive-0.6.0.tar.gz from the Apache Hive project website, then extract the file and move it to /home/hadoop/
    $ tar -xzf hive-0.6.0.tar.gz ; mv hive-0.6.0 ~/
        (2) Modify the Java heap size in hive-0.6.0/bin/ext/execHiveCmd.sh: change 4096 to 1024
        (3) Create /tmp and /user/hive/warehouse in HDFS and chmod them g+w before a table can be created in Hive
    $ hadoop fs -mkdir /tmp
    $ hadoop fs -mkdir /user/hive/warehouse
    $ hadoop fs -chmod g+w /tmp
    $ hadoop fs -chmod g+w /user/hive/warehouse
        (4) Start Hive
    $ /home/hadoop/hive-0.6.0/bin/hive

    3. (Optional) Load data using Hive
        Create a file /home/hadoop/customer.txt:
            1, Kevin
            2, David
            3, Brian
            4, Jane
            5, Alice
        After the Hive shell has started, type:
            > CREATE TABLE IF NOT EXISTS customer(id INT, name STRING)
            > ROW FORMAT delimited fields terminated by ','
            > STORED AS TEXTFILE;
            > LOAD DATA INPATH '/home/hadoop/customer.txt' OVERWRITE INTO TABLE customer;
            > SELECT customer.id, customer.name from customer;

    http://chuangtc.info/ParallelComputing/SetUpHadoopClusterOnVmwareWorkstation.htm

    paulwong 2013-08-17 22:23 发表评论
    ]]>
    Kettle - a data transformation tool for HADOOP
    http://www.aygfsteel.com/paulwong/archive/2013/08/01/402269.html
    paulwong, Thu, 01 Aug 2013 09:21:00 GMT
    http://www.cnblogs.com/limengqiang/archive/2013/01/16/KettleApply1.html

    paulwong 2013-08-01 17:21 发表评论
    ]]>
    Using Sqoop to move data between HDFS and MySQL
    http://www.aygfsteel.com/paulwong/archive/2013/05/11/399153.html
    paulwong, Sat, 11 May 2013 13:27:00 GMT

    Overview
    Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (e.g. MySQL, Oracle, Postgres) into HDFS, and it can also export data from HDFS into a relational database.
    http://sqoop.apache.org/

    Environment
    If an IncompatibleClassChangeError shows up while debugging, it is usually a version compatibility problem.

    To guarantee that the Hadoop and Sqoop versions are compatible, use the Cloudera distribution.

    About Cloudera:

    Cloudera standardizes the Hadoop configuration and helps enterprises install, configure and run Hadoop for large-scale data processing and analysis.

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDHTarballs/3.25.2013/CDH4-Downloadable-Tarballs/CDH4-Downloadable-Tarballs.html

    Download and install hadoop-0.20.2-cdh3u6 and sqoop-1.3.0-cdh3u6.

    Installation
    Installation is simple: just unpack the tarballs.

    The only extra step is to copy the MySQL JDBC driver, mysql-connector-java-5.0.7-bin.jar, into $SQOOP_HOME/lib.

    Configure the environment variables in /etc/profile:

    export SQOOP_HOME=/home/hadoop/sqoop-1.3.0-cdh3u6/

    export PATH=$SQOOP_HOME/bin:$PATH

    MySQL to HDFS - example
    ./sqoop import --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shop -m 1 --target-dir /user/recsys/input/shop/$today


    HDFS to MySQL - example
    ./sqoop export --connect jdbc:mysql://10.8.210.166:3306/recsys --username root --password root --table shopassoc --fields-terminated-by ',' --export-dir /user/recsys/output/shop/$today

    Explanation of the parameters used in the examples
    (I have not used the other parameters, so I will not comment on them; see the command-line help for details.)

    Scope     Parameter              Explanation
    common    connect                JDBC URL
    common    username               database user name
    common    password               database password
    common    table                  table name
    import    target-dir             target HDFS directory; defaults to /user/$loginName/
    export    fields-terminated-by   field delimiter in the HDFS files; defaults to "\t"
    export    export-dir             path of the HDFS files to export
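    The import/export commands above are driven by a shell variable ($today). If the same transfers need to be triggered from a Java program, one simple approach is just to shell out to the sqoop binary. The sketch below is an illustration, not part of the original post; the sqoop path is assumed and the connection string is the one from the examples above.

    // Illustrative sketch: invoking the sqoop CLI from Java with a date-based target directory.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class SqoopImportRunner {
        public static void main(String[] args) throws Exception {
            String today = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
            ProcessBuilder pb = new ProcessBuilder(
                    "/home/hadoop/sqoop-1.3.0-cdh3u6/bin/sqoop", "import",   // assumed install path
                    "--connect", "jdbc:mysql://10.8.210.166:3306/recsys",
                    "--username", "root", "--password", "root",
                    "--table", "shop", "-m", "1",
                    "--target-dir", "/user/recsys/input/shop/" + today);
            pb.redirectErrorStream(true);          // merge stderr into stdout
            Process p = pb.start();
            BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);          // echo sqoop's progress output
            }
            System.out.println("sqoop exited with code " + p.waitFor());
        }
    }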

    paulwong 2013-05-11 21:27 发表评论
    ]]>
    Installing Ambari, a Hadoop cluster monitoring tool
    http://www.aygfsteel.com/paulwong/archive/2013/05/03/398731.html
    paulwong, Fri, 03 May 2013 05:55:00 GMT

      Apache Ambari is an open-source project for monitoring, managing and handling the life cycle of Hadoop. It is also the project Hortonworks chose for managing and building its data platform. Ambari provides services for Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and ZooKeeper. I recently planned to install Ambari, searched the web for a long time and could not find a reasonably systematic walkthrough, so I followed the official site. Below is the installation procedure I recommend; I hope it helps.

      I. Preparation

      1. System: my system is CentOS 6.2 (x86_64). This cluster uses two nodes. Management node: 192.168.10.121; client node: 192.168.10.122.

      2. The systems should preferably have Internet access, which makes the later steps easier; otherwise you need to configure a local yum repository, which is more troublesome.

      3. Configure passwordless SSH login from ambari-server (the management node) to the client node.

      4. Synchronize the clocks across the cluster.

      5. SELinux and iptables are both disabled.

      6. Ambari version: 1.2.0

      II. Installation steps

      A. Set up the cluster environment

    ############  passwordless SSH login  #################
    [root@ccloud121 ~]# ssh-keygen -t dsa
    [root@ccloud121 ~]# cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys
    [root@ccloud121 ~]# scp /root/.ssh/id_dsa.pub 192.168.10.122:/root/
    [root@ccloud121 ~]# ssh 192.168.10.122
    [root@ccloud122 ~]# cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys

    #############  NTP time synchronization  #################

    [root@ccloud121 ~]# ntpdate time.windows.com
    [root@ccloud121 ~]# ssh ccloud122 ntpdate time.windows.com

    ###########  disable SELinux & iptables   ###########

    [root@ccloud121 ~]# setenforce 0
    [root@ccloud121 ~]# ssh ccloud122 setenforce 0
    [root@ccloud121 ~]# chkconfig iptables off
    [root@ccloud121 ~]# service iptables stop
    [root@ccloud121 ~]# ssh ccloud122 chkconfig iptables off
    [root@ccloud121 ~]# ssh ccloud122 service iptables stop

      B. Install ambari-server on the management node

        1. Download the repo file

    [root@ccloud121 ~]# wget http://public-repo-1.hortonworks.com/AMBARI-1.x/repos/centos6/ambari.repo

    [root@ccloud121 ~]# cp ambari.repo /etc/yum.repos.d

        With that, the yum repository for ambari-server is ready.

        2. Install the EPEL repository

    [root@ccloud121 ~]# yum install epel-release
    # list the repositories; HDP and EPEL should both be present
    [root@ccloud121 ~]# yum repolist

        3. Install the Ambari bits with yum; this also installs PostgreSQL

    [root@ccloud121 ~]# yum install ambari-server

         This step takes a while, because it downloads about 39 MB from the Internet.

        4. Run ambari-server setup to install ambari-server. It automatically installs and configures PostgreSQL and asks for a user name and password; if you answer "n", it uses the default values ambari-server/bigdata. It then downloads and installs the JDK. Once setup finishes, ambari-server can be started.

      III. Starting the cluster

        1. ambari-server is started and stopped directly with ambari-server start and ambari-server stop.

        2. Once it has started successfully, point your browser at http://192.168.10.121:8080

        (The login screen appears here in the original post; the screenshot is omitted.)

    The login name and password are both admin.

    After logging in you reach the management console.
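    Besides the web console, the same port also serves Ambari's REST API. The following is a minimal sketch, not from the original post, that lists the clusters known to this Ambari server using plain HttpURLConnection and the default admin/admin credentials; /api/v1/clusters is the standard Ambari endpoint.

    // Minimal sketch: query the Ambari REST API with HTTP basic auth.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import javax.xml.bind.DatatypeConverter;

    public class AmbariClustersQuery {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://192.168.10.121:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = DatatypeConverter.printBase64Binary("admin:admin".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);   // basic auth header
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // raw JSON describing the registered clusters
            }
            in.close();
        }
    }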




    paulwong 2013-05-03 13:55 发表评论
    ]]>
    A roundup of 13 open-source Java big data tools
    http://www.aygfsteel.com/paulwong/archive/2013/05/03/398700.html
    paulwong, Fri, 03 May 2013 01:05:00 GMT

    Below are the main open-source tools in the big data field that support Java:

    1. HDFS
    HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (master node), which manages all the file system metadata, and DataNodes (data nodes, usually many of them), which store the actual data. HDFS is designed for massive data sets: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for batch access to and storage of very large files.

    2. MapReduce
    Hadoop MapReduce is a software framework for easily writing parallel applications that process massive amounts of data (terabytes) reliably and fault-tolerantly across large clusters of tens of thousands of nodes built from commodity hardware.

    3. HBase
    Apache HBase is the Hadoop database, a distributed, scalable big data store. It provides random, real-time read/write access to large data sets and is tuned for very large tables on clusters of commodity servers: tens of billions of rows by tens of millions of columns. At its core it is an open-source implementation of Google's Bigtable paper: distributed, column-oriented storage. Just as Bigtable builds on the distributed storage provided by GFS (the Google File System), HBase provides Bigtable-like capabilities on top of Hadoop HDFS.

    4. Cassandra
    Apache Cassandra is a high-performance, linearly scalable, highly available database for building mission-critical data platforms on commodity hardware or cloud infrastructure. Its replication across data centers is best-in-class, giving users lower latency and more reliable disaster backup. With strong support for log-structured updates, denormalization and materialized views, plus powerful built-in caching, Cassandra's data model offers convenient secondary (column) indexes.

    5. Hive
    Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad-hoc queries, and the analysis of large data sets stored in Hadoop-compatible systems. Hive provides a complete SQL-like query language, HiveQL, and when it is inefficient or cumbersome to express a piece of logic in HiveQL it also lets traditional Map/Reduce programmers plug in their own custom mappers and reducers.

    6. Pig
    Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs and the infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which lets them handle very large data sets. Pig's infrastructure layer contains a compiler that produces Map-Reduce jobs; its language layer currently consists of a native language, Pig Latin, designed for ease of programming and extensibility.

    7. Chukwa
    Apache Chukwa is an open-source data collection system for monitoring large distributed systems. Built on HDFS and the Map/Reduce framework, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing the results, so that the collected data can be put to the best possible use.

    8. Ambari
    Apache Ambari is a web-based tool for provisioning, managing and monitoring Apache Hadoop clusters, supporting Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a cluster-status dashboard, for example heatmaps, and the ability to view MapReduce, Pig and Hive applications and diagnose their performance characteristics through a user-friendly interface.

    9. ZooKeeper
    Apache ZooKeeper is a reliable coordination system for large distributed systems, providing configuration maintenance, naming, distributed synchronization, group services and more. ZooKeeper's goal is to encapsulate these complex, error-prone key services and hand users simple, easy-to-use interfaces backed by an efficient, stable system.

    10. Sqoop
    Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into HDFS, and export data from HDFS into a relational database.

    11. Oozie
    Apache Oozie is a scalable, reliable and extensible workflow scheduler system for managing Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions; Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie integrates with the rest of the Hadoop stack and supports several types of Hadoop jobs out of the box (Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system jobs such as Java programs and shell scripts.

    12. Mahout
    Apache Mahout is a scalable machine learning and data mining library. Mahout currently supports four main use cases:
    - Recommendation mining: collect user actions and use them to recommend things the user might like.
    - Clustering: gather documents and group related documents together.
    - Classification: learn from existing categorized documents what documents of a category look like, and assign unlabeled documents to the correct category.
    - Frequent itemset mining: group items together and identify which individual items usually appear together.

    13. HCatalog
    Apache HCatalog is a table and storage management service for data created with Hadoop. It provides:
    - A shared schema and data type mechanism.
    - A table abstraction, so that users need not care how or where the data is stored.
    - Interoperability across data processing tools such as Pig, MapReduce and Hive.

    paulwong 2013-05-03 09:05 发表评论
    ]]>
    HADOOP servers
    http://www.aygfsteel.com/paulwong/archive/2013/05/01/398605.html
    paulwong, Tue, 30 Apr 2013 16:02:00 GMT

    CentOS cluster servers with public IPs.
    Server addresses:
    master: mypetsbj.xicp.net:13283
    slave1: mypetsbj.xicp.net:13282
    slave2: mypetsbj.xicp.net:13286

    http://mypetsbj.xicp.net:13296
    http://mypetsbj.xicp.net:13304
    http://mypetsbj.xicp.net:14113
    http://mypetsbj.xicp.net:11103

    Servers are powered on from 08:00 to 23:59.

    opt/hadoop

    Username/password: hadoop/wzp

    paulwong 2013-05-01 00:02 发表评论
    ]]>
    An example PIG script, analyzed
    http://www.aygfsteel.com/paulwong/archive/2013/04/13/397791.html
    paulwong, Sat, 13 Apr 2013 07:21:00 GMT
    PIGGYBANK_PATH=$PIG_HOME/contrib/piggybank/java/piggybank.jar
    INPUT=pig/input/test-pig-full.txt
    OUTPUT=pig/output/test-pig-output-$(date  +%Y%m%d%H%M%S)
    PIGSCRIPT=analyst_status_logs.pig

    # Other scripts that can be plugged in as $PIGSCRIPT:
    #analyst_500_404_month.pig
    #analyst_500_404_day.pig
    #analyst_404_percentage.pig
    #analyst_500_percentage.pig
    #analyst_unique_path.pig
    #analyst_user_logs.pig
    #analyst_status_logs.pig


    pig -p PIGGYBANK_PATH=$PIGGYBANK_PATH -p INPUT=$INPUT -p OUTPUT=$OUTPUT $PIGSCRIPT


    The data source to be analyzed: a LOG file
    46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0
    46.20.45.18 - - [25/Dec/2012:23:00:25 +0100] "GET /sign-in.jspx HTTP/1.0" 200 3926 "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 46.20.45.18 "" 11011AEC9542DB0983093A100E8733F8 0
    69.59.28.19 - - [25/Dec/2012:23:01:25 +0100] "GET / HTTP/1.0" 302 - "-" "Pingdom.com_bot_version_1.4_(http://www.pingdom.com/)" "-" "-" 69.59.28.19 "" 36D80DE7FE52A2D89A8F53A012307B0A 15


    The PIG script:

    --register the piggybank JAR, because DateExtractor is used below
    register '$PIGGYBANK_PATH';

    --declare short aliases for the date-extraction UDFs
    DEFINE DATE_EXTRACT_MM 
    org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM');

    DEFINE DATE_EXTRACT_DD 
    org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');

    -- pig/input/test-pig-full.txt
    --load the data from the file the variable points to into PIG and name the columns; each record becomes a tuple (a,b,c)
    raw_logs = load '$INPUT' USING org.apache.pig.piggybank.storage.MyRegExLoader('^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(\\S+) (\\S+) (HTTP[^"]+)" (\\S+) (\\S+) "([^"]*)" "([^"]*)" "(\\S+)" "(\\S+)" (\\S+) "(.*)" (\\S+) (\\S+)')
    as (remoteAddr: chararray, 
    n2: chararray, 
    n3: chararray, 
    time: chararray, 
    method: chararray,
    path:chararray,
    protocol:chararray,
    status: int, 
    bytes_string: chararray, 
    referrer: chararray, 
    browser: chararray, 
    n10:chararray,
    remoteLogname: chararray, 
    remoteAddr12: chararray, 
    path2: chararray, 
    sessionid: chararray, 
    n15: chararray
    );

    --filter out the monitoring-bot requests
    filter_logs = FILTER raw_logs BY not (browser matches '.*pingdom.*');
    --item_logs = FOREACH raw_logs GENERATE browser;

    --percent 500 logs
    --project the data down to the two fields we need: status and month
    reitem_percent_500_logs = FOREACH filter_logs GENERATE status,DATE_EXTRACT_MM(time) as month;
    --group the data set; the result is a structure like MAP(a{(aa,bb,cc),(dd,ee,ff)},b{(bb,cc,dd),(ff,gg,hh)})
    group_month_percent_500_logs = GROUP reitem_percent_500_logs BY (month);
    --iterate over the grouped data set and compute per-group statistics, combining the grouped bag with filters over it
    final_month_500_logs = FOREACH group_month_percent_500_logs 
    {
        --COUNT over the projected relation; because it runs inside the FOREACH, the condition month == group is applied
        --implicitly, so this counts the rows belonging to the current group (the grouped bag itself is never referenced)
        total = COUNT(reitem_percent_500_logs);
        --filter the projected relation again; the same implicit month == group condition applies, leaving only the
        --rows of the current month whose status is 500
        t = filter reitem_percent_500_logs by status== 500; --create a bag which contains only T values
        --emit the group key and the percentage of 500 responses in that month
        generate flatten(group) as col1, 100*(double)COUNT(t)/(double)total;
    }
    STORE final_month_500_logs into '$OUTPUT' using PigStorage(',');



    paulwong 2013-04-13 15:21 发表评论
    ]]>
    Passing command-line values into PIG
    http://www.aygfsteel.com/paulwong/archive/2013/04/10/397645.html
    paulwong, Wed, 10 Apr 2013 07:32:00 GMT

    http://wiki.apache.org/pig/ParameterSubstitution

    %pig -param input=/user/paul/sample.txt -param output=/user/paul/output/

    Retrieving the values inside PIG:

    records = LOAD $input;

    paulwong 2013-04-10 15:32 发表评论
    ]]>
    Computing per-group percentages in PIG
    http://www.aygfsteel.com/paulwong/archive/2013/04/10/397642.html
    paulwong, Wed, 10 Apr 2013 06:13:00 GMT

    http://stackoverflow.com/questions/15318785/pig-calculating-percentage-of-total-for-a-field
    http://stackoverflow.com/questions/13476642/calculating-percentage-in-a-pig-query

    paulwong 2013-04-10 14:13 发表评论
    ]]>
    A brief discussion of PIG
    http://www.aygfsteel.com/paulwong/archive/2013/04/05/397411.html
    paulwong, Fri, 05 Apr 2013 13:33:00 GMT

    What is PIG?
    It is a dataflow language: you describe how the data should flow, and the corresponding engine turns that description into MAPREDUCE JOBs that run on HADOOP.

    PIG vs SQL
    They have something in common: you execute one or more statements and results come out.
    The difference is that SQL requires the data to be loaded into tables before it can run, and SQL does not care how the work is done in between - you send a SQL statement in and the result comes out.
    PIG does not require the data to be loaded into tables first, but you have to design the intermediate steps of the flow, all the way to the result.

    paulwong 2013-04-05 21:33 发表评论
    ]]>
    PIG resources
    http://www.aygfsteel.com/paulwong/archive/2013/04/05/397406.html
    paulwong, Fri, 05 Apr 2013 10:19:00 GMT

    http://guoyunsky.iteye.com/blog/1317084
    http://guoyunsky.iteye.com/category/196632

    Hadoop study notes (9): Pig
    http://www.distream.org/?p=385

    Pig installation and a simple example (hadoop series)
    http://blog.csdn.net/inkfish/article/details/5205999

    Hadoop and Pig for Large-Scale Web Log Analysis
    http://www.devx.com/Java/Article/48063

    Pig in practice
    http://www.cnblogs.com/xuqiang/archive/2011/06/06/2073601.html

    An Apache Pig tutorial in Chinese (advanced)
    http://www.codelast.com/?p=4249

    Analyzing an Apache log system with the Pig language on the Hadoop platform
    http://goodluck-wgw.iteye.com/blog/1107503

    !!The Pig language
    http://hi.baidu.com/cpuramdisk/item/a2980b78caacfa3d71442318

    Embedding Pig In Java Programs (see the sketch after this list)
    http://wiki.apache.org/pig/EmbeddedPig

    A pig example (REGEX_EXTRACT_ALL, DBStorage, storing the result into a database)
    http://www.myexception.cn/database/1256233.html

    Programming Pig
    http://ofps.oreilly.com/titles/9781449302641/index.html

    Some basic Apache Pig concepts and usage notes (1)
    http://www.codelast.com/?p=3621

    !The PIG manual
    http://pig.apache.org/docs/r0.11.1/func.html#built-in-functions
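    Following the "Embedding Pig In Java Programs" link above, here is a minimal, illustrative sketch of running a Pig query from Java with PigServer. The input/output paths are placeholders and the trivial load/filter/store script is my own, not one of the analyst_*.pig scripts from the earlier post.

    // Minimal sketch of embedded Pig: build a PigServer, register a query, store the result.
    // Requires the pig jar (and a Hadoop configuration on the classpath for MAPREDUCE mode).
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class EmbeddedPigExample {
        public static void main(String[] args) throws Exception {
            // Use ExecType.LOCAL to test without a cluster.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Each registerQuery call adds one Pig Latin statement to the logical plan.
            pig.registerQuery("raw = LOAD '/user/paul/sample.txt' USING PigStorage(',') AS (id:int, name:chararray);");
            pig.registerQuery("filtered = FILTER raw BY id > 0;");
            // store() triggers execution and writes the alias to the given HDFS path.
            pig.store("filtered", "/user/paul/output/embedded-pig");
        }
    }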

    paulwong 2013-04-05 18:19 发表评论
    ]]>
    How to add a node to a hadoop cluster
    http://www.aygfsteel.com/paulwong/archive/2013/03/16/396544.html
    paulwong, Sat, 16 Mar 2013 15:04:00 GMT

    Install hadoop on the new node.

    Copy the relevant configuration files from the namenode to the new node.

    Edit the masters and slaves files, adding the new node.

    Set up passwordless ssh access to the new node.

    Start the datanode and tasktracker on that node individually (hadoop-daemon.sh start datanode/tasktracker).

    Run start-balancer.sh to rebalance the data.

    Rebalancing: when a node fails or a new node is added, the distribution of data blocks can become uneven; the balancer redistributes the blocks across the datanodes.

    paulwong 2013-03-16 23:04 发表评论
    ]]>
    HBASE study notes - basic features
    http://www.aygfsteel.com/paulwong/archive/2013/02/06/395168.html
    paulwong, Wed, 06 Feb 2013 01:53:00 GMT
  • Using the HBASE shell commands

  • Using the HBASE JAVA CLIENT

    Adding and updating records is done with PUT.

    How a PUT is executed:
    The value is first added to an in-memory MEMSTORE. If the table has N COLUMN FAMILIES there are N MEMSTOREs, and the values of a record go to the MEMSTORE of their respective COLUMN FAMILY. The MEMSTORE contents are not flushed to disk immediately, only when the MEMSTORE is full; a flush never rewrites an existing HFILE, it always creates a new one. A WRITE AHEAD LOG is also written, because new records are not persisted to HFILEs right away: if the server goes down in between, HBASE replays this log on restart to recover the data.

    Deleting records is done with DELETE.
    A delete does not remove the content from the HFILEs; it only writes a marker, and the marked records are skipped at query time.

    Reading a single record is done with GET.
    A read also stores the record in a CACHE; again, if the table has N COLUMN FAMILIES there are N CACHEs, and the values of a record are cached per COLUMN FAMILY. The next time a client asks for the record, the answer is assembled from the CACHE and the MEMSTORE.

    Creating tables is done with HADMIN (HBaseAdmin).

    Querying multiple records is done with SCAN and FILTER.
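    A minimal sketch of the client calls mentioned above (HBaseAdmin, Put, Get, Scan with a Filter, Delete), using the 0.9x-era HBase API. The table, column family and row keys are made-up names for illustration only.

    // Illustrative sketch of the basic HBase client operations (HBase 0.9x API).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

            // Create the table with HBaseAdmin ("HADMIN" above).
            HBaseAdmin admin = new HBaseAdmin(conf);
            if (!admin.tableExists("blog")) {
                HTableDescriptor desc = new HTableDescriptor("blog");
                desc.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(desc);
            }

            HTable table = new HTable(conf, "blog");

            // PUT: add or update a record (goes to the MemStore + write-ahead log first).
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("title"), Bytes.toBytes("hello hbase"));
            table.put(put);

            // GET: read a single row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("title"))));

            // SCAN + FILTER: read the rows whose cf:title equals "hello hbase".
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("title"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("hello hbase")));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                System.out.println(Bytes.toString(r.getRow()));
            }
            scanner.close();

            // DELETE: only writes a tombstone marker, as described above.
            table.delete(new Delete(Bytes.toBytes("row1")));

            table.close();
            admin.close();
        }
    }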
  • Distributed computation with HBASE

    Why distributed computation is needed
    The APIs above target ONLINE applications, i.e. low-latency access, comparable to OLTP. For large volumes of data they no longer fit. To analyse a whole table you would use SCAN, which pulls the entire table back to the client; with 100 GB of data that takes hours. To save time you can go multi-threaded, which requires a new scheme: split the table into N segments, let one thread process each segment, merge the partial results and then analyse them.

    At 200 GB or more the time doubles again and multi-threading is no longer enough, so the work moves to multiple processes, i.e. the computation is spread over different physical machines. Now you also have to handle what happens when any of those machines goes down. HADOOP's MAPREDUCE is exactly such a distributed computation framework: the application developer only writes the scatter (map) and aggregate (reduce) logic and does not need to worry about the rest.

    MAPREDUCE over HBASE
    Use TABLEMAP (TableMapper) and TABLEREDUCE (TableReducer); a skeleton is sketched below.

    HBASE deployment architecture and components
    It sits on top of HADOOP and ZOOKEEPER.

    How HBASE looks up and saves records
    See the previous post.

    Using HBASE as the data source, the data sink, or a shared data source
    This corresponds to the JOIN algorithms of a database: REDUCE SIDE JOIN and MAP SIDE JOIN.
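    A minimal, hedged skeleton of an HBase MapReduce job with TableMapper/TableReducer, counting rows per value of a hypothetical cf:status column; the table and column names are placeholders, not from the original post.

    // Illustrative skeleton of a TableMapper/TableReducer job (HBase 0.9x mapreduce API).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class StatusCountJob {

        static class StatusMapper extends TableMapper<Text, LongWritable> {
            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws java.io.IOException, InterruptedException {
                byte[] status = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("status"));
                if (status != null) {
                    context.write(new Text(Bytes.toString(status)), new LongWritable(1));
                }
            }
        }

        static class StatusReducer extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws java.io.IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                Put put = new Put(Bytes.toBytes(key.toString()));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(String.valueOf(sum)));
                context.write(null, put);   // the row key is taken from the Put itself
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "status-count");
            job.setJarByClass(StatusCountJob.class);
            Scan scan = new Scan();
            scan.setCaching(500);            // scan the regions in larger batches
            scan.setCacheBlocks(false);      // do not pollute the block cache from MapReduce
            TableMapReduceUtil.initTableMapperJob("logs", scan, StatusMapper.class,
                    Text.class, LongWritable.class, job);
            TableMapReduceUtil.initTableReducerJob("results", StatusReducer.class, job);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }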


    paulwong 2013-02-06 09:53 发表评论
    ]]>
    Monitoring HBASE
    http://www.aygfsteel.com/paulwong/archive/2013/02/04/395107.html
    paulwong, Mon, 04 Feb 2013 07:08:00 GMT

    Hadoop/HBase is the open-source counterpart of Google's Bigtable, GFS and MapReduce. As the Internet grows, processing big data becomes ever more important and Hadoop/HBase is used more and more widely. To use a Hadoop/HBase system well you need a complete monitoring setup, so that you know the real-time state of the system and keep everything under control. Hadoop/HBase ships with a very complete metrics framework covering system metrics at many granularities; the framework is also well designed, so users can add custom metrics easily. Even more important is how the metrics are exposed: three ways are currently supported - writing to local files, reporting to a Ganglia system, and exposing them through JMX. This post describes how to report Hadoop/HBase metrics to Ganglia and view them in a browser.
    Before going further it is worth introducing Ganglia briefly. Ganglia is an open-source system-monitoring system made of three parts - gmond, gmetad and the webfrontend - which divide the work as follows:

    gmond: a daemon that runs on every node to be monitored; it collects monitoring statistics and sends and receives them on a multicast or unicast channel.
    gmetad: a daemon that periodically polls the gmonds, pulls their data and stores the metrics in the RRD storage engine.
    webfrontend: installed on the machine that runs gmetad; it reads the RRD files and renders the front-end views.

    In short, gmond collects the metrics on each node, gmetad aggregates what the gmonds collected, and the webfrontend displays what gmetad aggregated. Out of the box Ganglia monitors general system metrics such as cpu/memory/net, but Hadoop/HBase has built-in support for Ganglia, so a small configuration change is enough to feed their metrics into Ganglia as well.

    Next, how to hook Hadoop/HBase into Ganglia. The Hadoop/HBase version here is 0.94.2; earlier versions may differ slightly. HBase used to be a Hadoop sub-project and therefore used the same Hadoop metrics framework, but Hadoop has since moved to an improved framework, metrics2 (metrics version 2), which the Hadoop projects now use. HBase, having become a top-level Apache project parallel to Hadoop, has not yet caught up with metrics2 and still uses the original metrics, so the Hadoop and HBase configurations are described separately.

    Hooking Hadoop into Ganglia:

    1. The configuration file for Hadoop metrics2 is hadoop-metrics2.properties.
    2. hadoop metrics2 uses the concepts of source and sink: a source collects data, and a sink consumes what the sources collected (writing it to files, reporting it to Ganglia, exposing it over JMX, and so on).
    3. Configuring hadoop metrics2 to support Ganglia:
    #*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30
    *.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
     
    *.sink.ganglia.period=10
    *.sink.ganglia.supportsparse=true
    *.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both
    *.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40
     
    #uncomment the ones you need
    namenode.sink.ganglia.servers=10.235.6.156:8649
    #datanode.sink.ganglia.servers=10.235.6.156:8649
    #jobtracker.sink.ganglia.servers=10.0.3.99:8649
    #tasktracker.sink.ganglia.servers=10.0.3.99:8649
    #maptask.sink.ganglia.servers=10.0.3.99:8649
    #reducetask.sink.ganglia.servers=10.0.3.99:8649


    A few points to note here:

    (1) Ganglia 3.1 is not compatible with 3.0, so choose GangliaSink30 or GangliaSink31 according to your Ganglia version.
    (2) period sets the reporting interval, in seconds.
    (3) namenode.sink.ganglia.servers specifies the host:port of the Ganglia gmetad to which the data is reported.
    (4) If several hadoop processes (namenode/datanode, etc.) run on the same physical machine, just configure sink.ganglia.servers for each of the processes you need.

    Hooking HBase into Ganglia:

    1. The hadoop metrics configuration file used by HBase is hadoop-metrics.properties.
    2. The core concept in hadoop metrics is the Context: TimeStampingFileContext writes to files, and GangliaContext/GangliaContext31 reports to Ganglia.
    3. Configuring hadoop metrics to support Ganglia:
    # Configuration of the "hbase" context for ganglia
    # Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
    # hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext
    hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    hbase.period=10
    hbase.servers=10.235.6.156:8649

    Again, a few points to note:

    (1) Because Ganglia 3.1 is not compatible with 3.0, use GangliaContext for versions before 3.1 and GangliaContext31 for Ganglia 3.1.
    (2) period is in seconds and sets how often data is reported to Ganglia.
    (3) servers specifies the host:port of the Ganglia gmetad to which the data is reported.
    (4) The rpc and jvm related metrics can be configured in the same way.
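    The post mentions JMX as the third way to expose metrics. As an aside not covered in the original text, the sketch below lists the MBeans of a Hadoop/HBase process over a JMX remote connection using only the standard javax.management API; the host and port are placeholders, and JMX remote access must have been enabled for that process (e.g. via com.sun.management.jmxremote.port in its JVM options).

    // Illustrative sketch: list Hadoop/HBase MBeans over JMX.
    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxMetricsDump {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://192.168.10.121:10102/jmxrmi");   // placeholder host:port
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                // List every registered MBean; the Hadoop/HBase metrics show up under
                // their own domains, depending on the version.
                Set<ObjectName> names = mbsc.queryNames(null, null);
                for (ObjectName name : names) {
                    System.out.println(name);
                }
            } finally {
                connector.close();
            }
        }
    }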







    paulwong 2013-02-04 15:08 发表评论
    ]]>
    HBASE deployment essentials
    http://www.aygfsteel.com/paulwong/archive/2013/02/04/395101.html
    paulwong, Mon, 04 Feb 2013 04:10:00 GMT

    REGION SERVERs and TASK TRACKER servers should not sit on the same machines. If MAPREDUCE JOBs are being run, it is best to split them into two CLUSTERs, i.e. two separate groups of servers, so that the offline MAPREDUCE load does not affect online loads such as SCANNER traffic.

    If the cluster mainly runs MAPREDUCE JOBs, putting the REGION SERVERs and TASK TRACKER servers together is fine.


    Prototype cluster

    10 nodes or fewer, no MAPREDUCE JOBs, mainly for low-latency access. Per-node configuration: 4-6 core CPU, 24-32 GB RAM, 4 SATA disks. The Hadoop NameNode, JobTracker, HBase Master and ZooKeeper all run on the same node.


    Small cluster (10-20 servers)

    Put the HBase Master on a separate machine, so that a lower-spec machine can be used. ZooKeeper also goes on a separate machine; the NameNode and JobTracker share one machine.

    Medium cluster (20-50 servers)

    Since cost no longer needs to be squeezed, the HBase Master and ZooKeeper can share machines; run three instances of ZooKeeper and HBase Master. The NameNode and JobTracker share one machine.

    Large cluster (>50 servers)

    Similar to the medium cluster, but run five instances of ZooKeeper and HBase Master. The NameNode and Secondary NameNode need enough memory.

    HADOOP MASTER nodes

    NameNode and Secondary NameNode server requirements: (small cluster) 8-core CPU, 16 GB RAM, 1 Gb NIC and SATA disks; for a medium cluster add another 16 GB of RAM, for a large one add another 32 GB.

    HBASE MASTER nodes

    Server requirements: 4-core CPU, 8-16 GB RAM, 1 Gb NIC, 2 SATA disks - one for the operating system, the other for the HBASE MASTER logs.

    HADOOP DATA NODE and HBASE REGION SERVER nodes

    The DATA NODE and REGION SERVER should run on the same server, and should not share it with a TASK TRACKER. Server requirements: 8-12 core CPU, 24-32 GB RAM, 1 Gb NIC, 12 x 1 TB SATA disks - one for the operating system and another for the logs.

    ZOOKEEPER nodes

    The server configuration is similar to the HBASE MASTER's. ZooKeeper can also share a machine with the HBASE MASTER, but then add an extra disk dedicated to ZooKeeper.

    Installing the nodes

    JVM configuration:
    -Xmx8g - set the maximum HEAP size to 8 GB; do not set it as high as 15 GB.
    -Xms8g - set the minimum HEAP size to 8 GB.
    -Xmn128m - set the young generation to 128 MB; the default is too small.
    -XX:+UseParNewGC - set the garbage collector for the young generation. This collector stops the JAVA process while it collects, but because the young generation is small the pause is usually only a few milliseconds, which is acceptable.
    -XX:+UseConcMarkSweepGC - set the garbage collector for the old generation. Using the young-generation collector there would be inappropriate, because it would stop the JAVA process for too long; this one does not stop the process but collects concurrently while the JAVA process keeps running.
    -XX:CMSInitiatingOccupancyFraction - set how early/often the CMS collector runs.






    paulwong 2013-02-04 12:10 发表评论
    ]]>
    HBASE study notes
    http://www.aygfsteel.com/paulwong/archive/2013/02/01/395020.html
    paulwong, Fri, 01 Feb 2013 05:55:00 GMT

    GET and PUT are ONLINE operations; MAPREDUCE is an OFFLINE operation.


    The HDFS write flow
    When the client is asked to save a file, it splits the file into BLOCKs of 64 MB each, forming a list of the blocks that make up the file, and tells the NAME NODE "I want to save this". The NAME NODE answers with a list saying which BLOCK should be written to which DATA NODE. The client sends the first BLOCK to the first node, DATA NODE A, asking it to save the block and at the same time to tell DATA NODE D and DATA NODE B to each keep a copy. DATA NODE D saves its copy and in turn tells DATA NODE B to save one; once DATA NODE B has finished, the client is notified that the block has been saved. The client then reads from the NAME NODE where the next BLOCK should go, and the same steps repeat until all BLOCKs have been saved.

    The HDFS read flow
    The client asks the NAME NODE to read a file. The NAME NODE returns, for every BLOCK that makes up the file, the DATA NODE IPs and the BLOCK ID. The client then sends requests to the DATA NODEs in parallel, asking for the BLOCK with a given BLOCK ID; each DATA NODE sends the requested BLOCK back to the client. Once the client has gathered all the BLOCKs it assembles them into the complete file, and the flow ends.
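    These flows are what happens under the hood of the HDFS client API. As an illustration (not part of the original notes), here is a minimal sketch that writes and then reads a file through org.apache.hadoop.fs.FileSystem; the path is a placeholder.

    // Minimal sketch: write a file to HDFS and read it back with the FileSystem API.
    // The NameNode address is taken from core-site.xml (fs.default.name) on the classpath.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/hdfs-demo.txt");   // placeholder path

            // Write: the client splits the stream into blocks and the pipeline of
            // DataNodes replicates them, as described in the write flow above.
            FSDataOutputStream out = fs.create(path, true);
            out.write("hello hdfs\n".getBytes("UTF-8"));
            out.close();

            // Read: the client fetches the block locations from the NameNode and
            // streams the blocks from the DataNodes.
            FSDataInputStream in = fs.open(path);
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.close();

            fs.close();
        }
    }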

    The MAPREDUCE flow
    Input data -- not multi-threaded but multi-process selection of the data, i.e. the input is split into chunks and each process handles one chunk -- grouping -- multi-process aggregation of the data -- output.


    The HBASE table structure
    HBASE splits a large table into smaller tables; each small table is called a REGION, and a server that holds REGIONs is called a REGIONSERVER. One REGIONSERVER can hold several REGIONs. A REGIONSERVER and a DATA NODE usually live on the same server to reduce NETWORK IO.
    The -ROOT- table lives on the MASTER SERVER and records how many REGIONSERVERs there are. Every REGION SERVER has a .META. table recording which REGIONs of which tables that REGION SERVER holds. To find out how many REGIONs a table has in total, you would have to query the .META. table on all the REGION SERVERs and aggregate the answers.
    When a client wants to look up ROW009, it first asks ZOOKEEPER where the -ROOT- table is, then asks the -ROOT- table which .META. knows about this row, then asks that .META. table which REGION holds it, then asks that REGION for ROW009, and that REGION returns the information.


    HBASE MAPREDUCE
    There is one MAP task per REGION, and the map method inside a task is executed once per record returned by the scan - as many times as there are matching records.
    The REDUCE task is responsible for writing data to the REGIONs; which REGION each write goes to is determined by which REGION the KEY belongs to, so a REDUCE task may end up talking to every REGION SERVER.


    Using a JOIN in an HBASE MAPREDUCE JOB
    REDUCE-SIDE JOIN
    Use the existing SHUFFLE grouping mechanism and do the JOIN in the REDUCE phase. Because all the MAP-phase data is shuffled, this can have performance problems.
    MAP-SIDE JOIN
    Read the smaller of the two tables into a shared file, then loop over the other table's records in the MAP method and look up the needed data from the shared file. This cuts down the SHUFFLE and SORT time and removes the need for a REDUCE task altogether. A sketch of such a mapper follows.
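    A minimal, illustrative sketch of the map-side join idea (not from the original notes): the small table is loaded into a HashMap in setup() from an HDFS file, and each record of the large table is joined against it in map(). The paths, field positions and delimiter are placeholders.

    // Illustrative map-side join: small lookup table cached in memory, big table streamed through map().
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private final Map<String, String> smallTable = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Load the small table (key,value per line) from a shared HDFS file once per map task.
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path smallTablePath = new Path("/user/demo/small-table.csv");   // placeholder path
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(smallTablePath)));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length == 2) {
                    smallTable.put(parts[0], parts[1]);
                }
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each record of the big table: join on the first field; no shuffle or reduce needed.
            String[] fields = value.toString().split(",");
            String joined = smallTable.get(fields[0]);
            if (joined != null) {
                context.write(new Text(value.toString() + "," + joined), NullWritable.get());
            }
        }
    }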


    paulwong 2013-02-01 13:55 发表评论
    ]]>
    Several Join approaches in Hadoop
    http://www.aygfsteel.com/paulwong/archive/2013/01/31/395000.html
    paulwong, Thu, 31 Jan 2013 10:24:00 GMT

    2) Compress the fields: pre-process the data and strip out the fields that are not needed.
    3) The last step is filtering in the Mapper phase; this is where the Bloom Filter earns its keep, and it is the part that needs a detailed explanation.


    Let me use a scenario everyone is familiar with to illustrate the problem: find out how last month's M-Zone (动感地带) subscribers used their tariff, including incoming and outgoing calls.

    (This is just an example I made up; given a real DB storage layout there are certainly better solutions for this scenario, so please don't take it too literally.)

    Both data sets involved are fairly large: last month's call records, and the list of M-Zone phone numbers.


    There are two fairly direct ways to handle it:

    1) Filter by M-Zone number in the Reduce phase.

    Advantage: relatively little data has to be processed there; this is also the more common approach.

    Disadvantage: a lot of data is painstakingly aggregated in the Mapper phase and then shuffled over the network to the Reduce nodes, only to be filtered out at that stage.



    2) Filter the data by M-Zone number in the Mapper phase.

    Advantage: this filters out a lot of data that is not M-Zone, for example Shenzhouxing and GoTone numbers, and the filtered-out data saves a lot of network bandwidth.

    Disadvantage: the list of M-Zone numbers is not small; handling it this way means copying that big chunk to every Mapper node, possibly via the Distributed Cache. (The Bloom Filter is what solves this problem.)


    The Bloom Filter is what fixes the disadvantage of approach 2 above.

    The disadvantage of approach 2 is that a large amount of data has to be copied to many nodes. The Bloom Filter uses several hash functions to compress the number list into a single bitmap, trading a certain error rate for space - similar to the time-versus-space trade-offs we talk about all the time. For details see:

    http://blog.csdn.net/jiaomeng/article/details/1495500

    The algorithm does have a flaw: it will treat many Shenzhouxing and GoTone numbers as M-Zone numbers. In this scenario that is not a problem at all, because the filter only pre-screens the numbers; whatever slips through is removed in the Reduce phase during the exact match.

    With this refinement, the scheme avoids the disadvantages of approach 2 almost completely:

    1) The huge list of M-Zone numbers is not sent to all the Mapper nodes.
    2) Many non-M-Zone numbers are filtered out already in the Mapper phase (though not 100% of them), avoiding network bandwidth cost and latency.


    What still needs studying: the size of the bitmap, the number of hash functions, and the amount of data stored - how to choose these three variables so as to strike a balance between storage space and error rate.
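    Hadoop ships a Bloom filter implementation in org.apache.hadoop.util.bloom which exposes exactly those three knobs (vector size, number of hash functions, hash type). The sketch below is illustrative only, not the author's code: it assumes the filter was built from the M-Zone number list beforehand and serialized to an HDFS file with its Writable write() method, and shows how a mapper could use it to pre-filter call records. All paths and field positions are placeholders.

    // Illustrative sketch: pre-filtering call records in the Mapper with Hadoop's built-in BloomFilter.
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;

    public class BloomFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final BloomFilter filter = new BloomFilter();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Deserialize the pre-built filter once per map task (placeholder path).
            FileSystem fs = FileSystem.get(context.getConfiguration());
            FSDataInputStream in = fs.open(new Path("/user/demo/mzone.bloom"));
            filter.readFields(in);
            in.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assume the calling number is the first comma-separated field of a call record.
            String[] fields = value.toString().split(",");
            String phoneNumber = fields[0];
            // membershipTest may give false positives (some non-M-Zone numbers slip through),
            // but never false negatives; the exact match in the Reduce phase removes the rest.
            if (filter.membershipTest(new Key(phoneNumber.getBytes("UTF-8")))) {
                context.write(new Text(phoneNumber), value);
            }
        }
    }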

    paulwong 2013-01-31 18:24 发表评论
    ]]>