<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"><channel><title>wangxinsh55</title><link>http://www.aygfsteel.com/wangxinsh55/category/54532.html</link><language>zh-cn</language><lastBuildDate>Mon, 02 Mar 2015 13:45:56 GMT</lastBuildDate><pubDate>Mon, 02 Mar 2015 13:45:56 GMT</pubDate><ttl>60</ttl><item><title>Integrating Storm with Kafka: the programming model</title><link>http://www.aygfsteel.com/wangxinsh55/archive/2015/03/01/423114.html</link><dc:creator>SIMONE</dc:creator><author>SIMONE</author><pubDate>Sun, 01 Mar 2015 07:47:00 GMT</pubDate><guid>http://www.aygfsteel.com/wangxinsh55/archive/2015/03/01/423114.html</guid><wfw:comment>http://www.aygfsteel.com/wangxinsh55/comments/423114.html</wfw:comment><comments>http://www.aygfsteel.com/wangxinsh55/archive/2015/03/01/423114.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.aygfsteel.com/wangxinsh55/comments/commentRss/423114.html</wfw:commentRss><trackback:ping>http://www.aygfsteel.com/wangxinsh55/services/trackbacks/423114.html</trackback:ping><description><![CDATA[Read the full article

SIMONE 2015-03-01 15:47 Post a comment
]]>
</description></item><item><title>Hadoop job tuning parameters and how they work</title><link>http://www.aygfsteel.com/wangxinsh55/archive/2014/11/19/420297.html</link><dc:creator>SIMONE</dc:creator><author>SIMONE</author><pubDate>Wed, 19 Nov 2014 05:42:00 GMT</pubDate><guid>http://www.aygfsteel.com/wangxinsh55/archive/2014/11/19/420297.html</guid><wfw:comment>http://www.aygfsteel.com/wangxinsh55/comments/420297.html</wfw:comment><comments>http://www.aygfsteel.com/wangxinsh55/archive/2014/11/19/420297.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.aygfsteel.com/wangxinsh55/comments/commentRss/420297.html</wfw:commentRss><trackback:ping>http://www.aygfsteel.com/wangxinsh55/services/trackbacks/420297.html</trackback:ping><description><![CDATA[<div>http://www.linuxidc.com/Linux/2012-01/51615.htm</div><br /><div><h2><span>1 </span>Map side tuning parameters</h2>
<h3><span>1.1 </span>How a MapTask works internally</h3>
<p><img alt="" src="http://www.linuxidc.com/upload/2012_01/120116103468161.gif" /><br /></p>
<p style="text-indent: 21pt; line-height: normal;" align="left">When a map task runs and starts producing intermediate data, its partial results are not written straight to disk. The process in between is fairly involved: the results are cached in an in-memory buffer, where some pre-sorting is done to optimize overall map performance. As shown in the figure above, every map task owns one such buffer (MapOutputBuffer, the "buffer in memory" in the figure) into which it writes its partial output. The buffer is 100 MB by default, and its size can be changed with a parameter set at job submission time: <strong><span style="color: red; font-family: 宋体">io.sort.mb</span></strong>. When a map produces a very large amount of data, raising io.sort.mb lowers the number of spills during the whole computation, and with it the number of disk operations; if the map tasks are disk-bound, this can improve map performance considerably. The in-memory structure the map uses for sort and spill looks like this:</p>
<p style="line-height: normal;" align="left"><img alt="" src="http://www.linuxidc.com/upload/2012_01/120116103468162.gif" height="404" width="575" /><br /></p>
<p style="text-indent: 21pt; line-height: normal;" align="left">While it runs, the map keeps appending results to this buffer, but the buffer cannot necessarily hold the whole map output. Once the output exceeds a threshold (say 100 MB), the map must write the buffered data out to disk; in MapReduce this is called a spill. The map does not wait until the buffer is completely full before spilling, because then the computation would block waiting for buffer space to be freed. Instead, it starts spilling once the buffer is filled to a certain degree (say 80%). That threshold is controlled by a job configuration parameter, <strong><span style="color: red; font-family: 宋体">io.sort.spill.percent</span></strong> (default 0.80, i.e. 80%). It likewise influences how often the map task spills and therefore how often it touches disk, but except in special cases it rarely needs manual tuning; adjusting io.sort.mb is more convenient for users.</p>
<p style="text-indent: 21pt; line-height: normal;" align="left">When the computation part of a map task completes, it will have produced one or more spill files if it had any output; these files are the map's result. Before the map can exit normally it must merge those spill files into a single one, so every map ends with a merge pass. One parameter shapes this pass: <strong><span style="color: red; font-family: 宋体">io.sort.factor</span></strong> (default 10), the maximum number of parallel streams that may write into the merged file at once. If the map produced a very large amount of data and more than 10 spill files while io.sort.factor is left at its default of 10, the merge cannot combine all spill files in one round; it will run several rounds of at most 10 streams each. So when the map's intermediate results are very large, raising io.sort.factor reduces the number of merge rounds, and in turn the map's disk traffic, which may speed up the job.</p>
<p style="text-indent: 21pt; line-height: normal;" align="left">When the job specifies a combiner, the map side, as is well known, folds the map output together using the combiner function. The combiner may run either before or after the merge, and the timing is controlled by a parameter: <strong><span style="color: red; font-family: 宋体">min.num.spill.for.combine</span></strong> (default 3). If the job defines a combiner and there are at least 3 spill files, the combiner runs before the merge produces its result file. For jobs with many spills to merge and much combinable data, this reduces the amount of data written to disk, again cutting disk read/write traffic and possibly improving the job.</p>
<p style="text-indent: 21pt; line-height: normal;" align="left"><span style="color: black; font-family: 
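宋体"></span></p>
<p>The spill, merge, and combiner knobs described above are ordinary job configuration properties. As an illustrative sketch only (the values below are examples, not recommendations), they could be set in a per-job configuration fragment like this:</p>

```xml
<!-- Example map-side spill/merge settings; values are illustrative. -->
<property><name>io.sort.mb</name><value>200</value></property>              <!-- sort buffer size in MB (default 100) -->
<property><name>io.sort.spill.percent</name><value>0.80</value></property>  <!-- buffer fill ratio that triggers a spill -->
<property><name>io.sort.factor</name><value>20</value></property>           <!-- max streams merged at once (default 10) -->
<property><name>min.num.spill.for.combine</name><value>3</value></property> <!-- min spill count for a pre-merge combiner run -->
```

<p style="text-indent: 21pt; line-height: normal;" align="left"><span style="color: black; font-family: 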
减少中间l果">
宋体">Spilling thresholds are not the only way to reduce the disk traffic of intermediate results; there is also compression. Both the spill files and the final merged result file of a map can be compressed. The benefit is simply that less data is written to and read from disk. This helps most for jobs whose intermediate results are very large and whose map bottleneck is disk speed. The parameter that controls map output compression is <strong><span style="color: red; font-family: 宋体">mapred.compress.map.output</span></strong> (true/false). When set to true, the map compresses its data before writing intermediate results, and readers decompress after reading. The consequence: the intermediate data written to disk shrinks, at the cost of some CPU spent compressing and decompressing. So this approach suits jobs whose intermediate results are large and whose bottleneck is disk I/O rather than CPU — put bluntly, it trades CPU for I/O. In practice, CPU is not the bottleneck for most jobs unless the computation logic is unusually complex, so compressing intermediate results is usually a win. Below is a comparison of the local disk traffic for the map intermediate results of a wordcount job, with and without compression:</span></p>
<p style="line-height: normal;" align="left"><strong><span style="color: black; font-family: 宋体">Map intermediate results, uncompressed:</span></strong></p>
<p style="line-height: normal;" align="left"><span style="color: black; font-family: 宋体"><img alt="" src="http://www.linuxidc.com/upload/2012_01/120116103468163.gif" /><br /></span></p>
<p style="line-height: normal;" align="left"><strong><span style="color: black; font-family: 宋体">Map intermediate results, compressed:</span></strong></p>
<p style="line-height: normal;" align="left"><strong><span style="color: black; font-family: 宋体"><img alt="" src="http://www.linuxidc.com/upload/2012_01/120116103468164.gif" /><br /></span></strong></p>
<p style="text-indent: 21pt; line-height: normal;" align="left"><span style="color: black; font-family: 宋体">As you can see, for the same job over the same data, compression shrinks the map's intermediate output roughly tenfold. If the map's bottleneck is disk, the job-level performance gain can be substantial.</span></p>
<p style="text-indent: 21pt; line-height: normal;" align="left"><span style="color: black; font-family: 宋体">When map output compression is enabled, the user can also choose which compression format to use. The formats <a target="_blank" title="Hadoop">Hadoop</a> currently supports include GzipCodec, LzoCodec, BZip2Codec, and LzmaCodec. Generally, for a reasonable balance of CPU cost and compression ratio, LzoCodec is a good fit, though it depends on the specifics of the job. To choose the codec for intermediate results yourself, set the configuration parameter <strong><span style="color: red">mapred.map.output.compression.codec</span></strong>=org.apache.hadoop.io.compress.DefaultCodec, or whichever codec you prefer.</span></p></div><br /><div>
<h3><span>1.2 </span>Map-side tuning parameters</h3>
<table border="1" cellpadding="4" cellspacing="0"><tbody>
<tr><td><strong>Option</strong></td><td><strong>Type</strong></td><td><strong>Default</strong></td><td><strong>Description</strong></td></tr>
<tr><td>io.sort.mb</td><td>int</td><td>100</td><td>Size (in MB) of the buffer that caches map intermediate results</td></tr>
<tr><td>io.sort.record.percent</td><td>float</td><td>0.05</td><td>Fraction of io.sort.mb reserved for map output record boundaries; the rest of the buffer holds the data itself</td></tr>
<tr><td>io.sort.spill.percent</td><td>float</td><td>0.80</td><td>Buffer fill threshold at which the map starts a spill</td></tr>
<tr><td>io.sort.factor</td><td>int</td><td>10</td><td>Upper limit on the number of streams operated on simultaneously during a merge</td></tr>
<tr><td>min.num.spill.for.combine</td><td>int</td><td>3</td><td>Minimum number of spills for the combiner function to run before the merge</td></tr>
<tr><td>mapred.compress.map.output</td><td>boolean</td><td>false</td><td>Whether map intermediate results are compressed</td></tr>
<tr><td>mapred.map.output.compression.codec</td><td>class name</td><td>org.apache.hadoop.io.compress.DefaultCodec</td><td>Compression format for map intermediate results</td></tr>
</tbody></table>
<p> </p>
<h2><span>2 </span>Reduce side tuning parameters</h2>
<h3><span>2.1 </span>How a ReduceTask works internally</h3>
<p><img alt="" src="http://www.linuxidc.com/upload/2012_01/120116103478481.gif" /><br /></p>
<p style="text-indent: 21pt">A reduce runs in three phases: copy → sort → reduce. Because every map of the job partitions its output into n partitions according to the number of reduces, any map's intermediate output may contain part of the data each reduce needs to process. So, to optimize reduce execution time, Hadoop has every reduce start trying to download its corresponding partition from completed maps as soon as the job's first map finishes. This process is what is commonly called the shuffle, i.e. the copy phase.</p>
<p>       During shuffle, a reduce task is really downloading its own share of the data from the various maps that have already completed. Since there are usually many maps, for a single reduce, fetching can also proceed in parallel from multiple maps.<span 
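style="font-family: 宋体"></span></p>
<p>As a hedged sketch of the copy-phase settings covered in this section, a job configuration fragment might look like this (values are illustrative only):</p>

```xml
<!-- Example reduce-side copy (shuffle fetch) settings; values are illustrative. -->
<property><name>mapred.reduce.parallel.copies</name><value>10</value></property> <!-- parallel fetch threads per reduce (default 5) -->
<property><name>mapred.reduce.copy.backoff</name><value>300</value></property>   <!-- max seconds a fetch thread waits before giving up -->
```

<p><span 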
下蝲Q">
style="font-family: 宋体"></span> The degree of copy parallelism is adjustable via the parameter <strong><span style="color: red">mapred.reduce.parallel.copies</span></strong> (default 5). By default, then, only 5 threads per reduce fetch map data in parallel: even if 100 or more of the job's maps finish within some window, the reduce still downloads from at most 5 maps at a time. For jobs with many maps that complete quickly, raising this value helps reduces fetch their share of the data faster.</p>
<p>       While downloading data from some map, a reduce fetch thread can fail: the machine holding that map's intermediate results may have errored, the result file may be lost, the network may have a transient outage, and so on. The fetch thread therefore does not wait indefinitely; when a download still fails after a certain time, the thread abandons it and later retries from another location (which may cause the map in question to be re-run). That maximum download wait is adjustable via the parameter <strong><span style="color: red">mapred.reduce.copy.backoff</span></strong> (default 300 seconds). If the cluster's network is itself the bottleneck, raising it prevents fetch threads from being misjudged as failed; in a reasonably healthy network there is no need to adjust it, and a properly built cluster network rarely calls for it.</p>
<p>       The map results downloaded to the reduce's local disk also need to be merged, so the io.sort.factor option described above influences the reduce's merge behavior as well. When you observe very high iowait during the reduce's shuffle phase, raising this parameter may increase merge throughput and improve reduce efficiency.</p>
<p>       During shuffle, the reduce does not write downloaded map data to disk immediately; it first caches the data in memory and only flushes to disk once memory use reaches a certain amount. Unlike on the map side, this memory is not sized by io.sort.mb but by a different parameter: <strong><span style="color: red">mapred.job.shuffle.input.buffer.percent</span></strong> (default 0.7). It is a percentage, meaning that shuffle data in reduce memory may occupy at most 0.7 × maxHeap of the reduce task. In other words, a fixed proportion of the reduce task's maximum heap (usually set through mapred.child.java.opts, e.g. -Xmx1024m) is used to cache data; by default, a reduce uses up to 70% of its heap size for this cache. If the reduce heap has been enlarged for business reasons, the cache grows accordingly — which is exactly why this knob is a percentage rather than a fixed value.</p>
<p style="text-indent: 21.2pt">Assuming mapred.job.shuffle.input.buffer.percent is 0.7 and the reduce task's max heapsize is 1 GB, roughly 700 MB of memory is used to cache downloaded data.<span 
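style="font-family: 宋体"></span></p>
<p>The shuffle memory settings above, collected into one hedged configuration sketch (both the percentages and the -Xmx figure are illustrative):</p>

```xml
<!-- Example reduce-side shuffle memory settings; values are illustrative. -->
<property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>            <!-- task max heap -->
<property><name>mapred.job.shuffle.input.buffer.percent</name><value>0.7</value></property> <!-- heap fraction used as shuffle cache -->
<property><name>mapred.job.shuffle.merge.percent</name><value>0.66</value></property>       <!-- cache fill ratio that starts flushing -->
```

<p style="text-indent: 21.2pt"><span 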
端一P">
style="font-family: 宋体"></span> As on the map side, this cache is not flushed to disk only once completely full; when usage crosses a limit (usually a percentage), flushing to disk begins. That threshold is also settable through a job parameter: <strong><span style="color: red">mapred.job.shuffle.merge.percent</span></strong> (default 0.66). If downloads are fast and easily fill the in-memory cache, adjusting this parameter may help reduce performance.</p>
<p style="text-indent: 21.2pt">Once the reduce has finished downloading its partition's data from every map, the real reduce computation phase begins (the sort phase in between is usually very short, a few seconds, because the download phase already sorts and merges as it goes). When the reduce task truly enters the computation phase of the reduce function, one more parameter can tune its behavior: <strong><span style="color: red">mapred.job.reduce.input.buffer.percent</span></strong> (default 0.0). The reduce computation certainly consumes memory itself, and reading the data the reduce needs likewise requires memory as a buffer; this parameter controls what percentage of memory may be used as the buffer for the sorted data the reduce reads. By default it is 0, i.e. the reduce processes all its input from disk. If the parameter is greater than 0, a certain amount of data is cached in memory and delivered to the reduce; when the reduce's computation logic consumes little memory, part of the heap can be devoted to caching data — after all, that reduce memory would otherwise sit idle.</p>
<h3><span>2.2 </span>Reduce side <span 
相关参数调优</span>">
style="font-family: 宋体">tuning parameters</span></h3>
<table border="1" cellpadding="4" cellspacing="0"><tbody>
<tr><td><strong>Option</strong></td><td><strong>Type</strong></td><td><strong>Default</strong></td><td><strong>Description</strong></td></tr>
<tr><td>mapred.reduce.parallel.copies</td><td>int</td><td>5</td><td>Maximum number of threads each reduce uses to download map results in parallel</td></tr>
<tr><td>mapred.reduce.copy.backoff</td><td>int</td><td>300</td><td>Maximum wait time for a reduce download thread (in sec)</td></tr>
<tr><td>io.sort.factor</td><td>int</td><td>10</td><td>Same as above</td></tr>
<tr><td>mapred.job.shuffle.input.buffer.percent</td><td>float</td><td>0.7</td><td>Fraction of the reduce task heap used to cache shuffle data</td></tr>
<tr><td>mapred.job.shuffle.merge.percent</td><td>float</td><td>0.66</td><td>Fill fraction of the in-memory cache at which merging to disk begins</td></tr>
<tr><td>mapred.job.reduce.input.buffer.percent</td><td>float</td><td>0.0</td><td>Fraction of memory used to cache data during the reduce computation phase, after the sort completes</td></tr>
</tbody></table>
<a target="_blank"><img src="http://www.linuxidc.com/linuxfile/logo.gif" alt="linux" height="17" width="15" /></a></div><img src ="http://www.aygfsteel.com/wangxinsh55/aggbug/420297.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.aygfsteel.com/wangxinsh55/" target="_blank">SIMONE</a> 2014-11-19 13:42 <a href="http://www.aygfsteel.com/wangxinsh55/archive/2014/11/19/420297.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Having each file processed by exactly one map in a MapReduce job</title><link>http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417971.html</link><dc:creator>SIMONE</dc:creator><author>SIMONE</author><pubDate>Tue, 16 Sep 2014 01:28:00 GMT</pubDate><guid>http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417971.html</guid><wfw:comment>http://www.aygfsteel.com/wangxinsh55/comments/417971.html</wfw:comment><comments>http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417971.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.aygfsteel.com/wangxinsh55/comments/commentRss/417971.html</wfw:commentRss><trackback:ping>http://www.aygfsteel.com/wangxinsh55/services/trackbacks/417971.html</trackback:ping><description><![CDATA[<div><div><p><div>http://www.rigongyizu.com/mapreduce-job-one-map-process-one-file/</div><br /></p><p>We had a batch of data to process with <a 
title="View all posts about hadoop" target="_blank">hadoop</a> MapReduce jobs where, for business reasons, each file had to be handled by exactly one map; if two or more maps processed the same file, problems could arise. At first we thought of controlling the number of maps by setting the dfs.blocksize or mapreduce.input.file<a title="View all posts about inputformat" target="_blank">inputformat</a>.split.minsize/maxsize parameters, then realized it need not be so complicated: in a custom InputFormat, simply keep the files from being split.</p>
<div id="highlighter_665528"><pre><code>
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CustemDocInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        DocRecordReader reader = null;
        try {
            reader = new DocRecordReader(); // the custom reader
        } catch (IOException e) {
            e.printStackTrace();
        }
        return reader;
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: one file, one split, one map
    }
}
</code></pre></div>
<p>This way, the job launches exactly as many maps as there are input files.</p>
</div></div><img src ="http://www.aygfsteel.com/wangxinsh55/aggbug/417971.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.aygfsteel.com/wangxinsh55/" target="_blank">SIMONE</a> 2014-09-16 09:28 <a href="http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417971.html#Feedback" 
target="_blank" style="text-decoration:none;">Post a comment</a></div>]]>

Using MultipleInputs/MultiInputFormat in Hadoop to read files of different formats in one MapReduce job
http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417969.html
SIMONE, Tue, 16 Sep 2014 01:27:00 GMT
Source: http://www.rigongyizu.com/use-multiinputformat-read-different-files-in-one-job/

Hadoop provides MultipleOutputFormat for writing result data to different directories, and FileInputFormat can read data from several directories at once; but by default a job can only call job.setInputFormatClass once, that is, use a single InputFormat handling a single data format. To read differently formatted files from different directories within one job, you have to implement a MultiInputFormat of your own (as it turns out, Hadoop already ships MultipleInputs for exactly this).

For example, suppose a MapReduce job needs to read two data formats at once: ordinary text files, read line by line with LineRecordReader, and pseudo-XML files, read with a custom AJoinRecordReader.

A simple MultiInputFormat implementation:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 
public class MultiInputFormat extends TextInputFormat {
 
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        RecordReader reader = null;
        try {
            String inputfile = ((FileSplit) split).getPath().toString();
            String xmlpath = context.getConfiguration().get("xml_prefix");
            String textpath = context.getConfiguration().get("text_prefix");
 
            if (-1 != inputfile.indexOf(xmlpath)) {
                reader = new AJoinRecordReader();
            } else if (-1 != inputfile.indexOf(textpath)) {
                reader = new LineRecordReader();
            } else {
                reader = new LineRecordReader();
            }
        } catch (IOException e) {
            // do something ...
        }
 
        return reader;
    }
}

The principle is simple: in createRecordReader, obtain the name of the file about to be processed via ((FileSplit) split).getPath().toString(), then match it against a distinguishing feature and pick the corresponding RecordReader. The xml_prefix and text_prefix values can be passed into the Configuration with -D options when the program is launched.

For example, one run printed the following values:

inputfile=hdfs://test042092.sqa.cm4:9000/test/input_xml/common-part-00068
xmlpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_xml
textpath_prefix=hdfs://test042092.sqa.cm4:9000/test/input_txt

Here the match is just a simple test on the file path; more elaborate criteria, such as file name patterns or file extensions, would work as well.
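Since the dispatch is nothing but string matching, it can be sanity-checked without a cluster. The sketch below replays the prefix test from createRecordReader as a plain helper; the class and method names and the sample paths are invented for illustration, and the returned strings merely name the reader the real code would construct.

```java
// Stand-alone replica of the path-prefix dispatch in MultiInputFormat.createRecordReader.
// Names and sample paths are illustrative only.
public class Main {

    static String chooseReader(String inputfile, String xmlPrefix, String textPrefix) {
        if (-1 != inputfile.indexOf(xmlPrefix)) {
            return "AJoinRecordReader";   // pseudo-XML files
        } else if (-1 != inputfile.indexOf(textPrefix)) {
            return "LineRecordReader";    // plain text files
        } else {
            return "LineRecordReader";    // default fallback
        }
    }

    public static void main(String[] args) {
        String xml = "hdfs://ns/test/input_xml";
        String txt = "hdfs://ns/test/input_txt";
        System.out.println(chooseReader("hdfs://ns/test/input_xml/common-part-00068", xml, txt));
        System.out.println(chooseReader("hdfs://ns/test/input_txt/part-00000", xml, txt));
    }
}
```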

Likewise, the map class can branch on the same file-name features and handle each format differently:

// textpath and xmlpath are assumed to be read from the Configuration, e.g. in setup()
@Override
public void map(LongWritable offset, Text inValue, Context context)
        throws IOException, InterruptedException {
 
    String inputfile = ((FileSplit) context.getInputSplit()).getPath()
            .toString();
 
    if (-1 != inputfile.indexOf(textpath)) {
        ......
    } else if (-1 != inputfile.indexOf(xmlpath)) {
        ......
    } else {
        ......
    }
}

That approach is rather clumsy; Hadoop in fact already provides MultipleInputs to bind an InputFormat and a corresponding map class to each input path:

MultipleInputs.addInputPath(conf, new Path("/foo"), TextInputFormat.class,
   MapClass.class);
MultipleInputs.addInputPath(conf, new Path("/bar"),
   KeyValueTextInputFormat.class, MapClass2.class);


SIMONE 2014-09-16 09:27 Post a comment
]]>
Optimizing a Hadoop program – implementing CombineFileInputFormat based on actual file sizes
http://www.aygfsteel.com/wangxinsh55/archive/2014/09/16/417968.html
SIMONE, Tue, 16 Sep 2014 01:25:00 GMT
Source: http://www.rigongyizu.com/hadoop-job-optimize-combinefileinputformat/

One day I took over a colleague's program for copying data from one Hadoop cluster to another; it runs as a job on the source Hadoop cluster. The job has only a map phase: it reads the data under an HDFS directory and writes it to the other cluster.

Clearly the program had not been written with large data volumes in mind: with many input files or a large amount of data, it spawns a great many maps. One data source we actually needed to copy was several terabytes, and the job started tens of thousands of maps, instantly occupying the whole queue's resources. Some parameters can be tuned to influence the number of maps (that is, the degree of parallelism), but they cannot control the map count precisely, and every new data source requires re-tuning.

The first improved version added a reduce phase, hoping to control the parallelism by setting the number of reducers. That does pin the parallelism exactly, but it introduces a shuffle. In practice the input turned out to be skewed (and the partition key could not be changed for business reasons), so the network links of some machines were saturated, which affected other applications on the cluster. Even limiting the shuffle with mapred.reduce.parallel.copies only treats the symptom. The extra shuffle simply wastes a great deal of network bandwidth and I/O.

The ideal, of course, is a map-only job whose parallelism can be controlled precisely.

So a second optimized version was born. This job is map-only and uses CombineFileInputFormat, which packs several files into a single InputSplit for one map to process, avoiding the flood of maps that masses of small files would otherwise trigger. The mapred.max.split.size parameter controls the parallelism roughly. That seemed to settle the matter, but data skew showed up again: this coarse way of cutting splits leaves some maps with little data and others with a lot. A handful of straggler maps more than doubled the job's actual running time.
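The skew is easy to reproduce off-cluster. The sketch below is not Hadoop's actual split logic, just a simplified size-threshold packing (file sizes are invented), showing how a pure byte-size cut can yield very uneven splits:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation of size-threshold split packing (NOT Hadoop's real
// algorithm): files are appended to the current split until adding another
// would exceed maxSplitSize.
public class Main {

    static List<Long> packBySize(long[] fileLens, long maxSplitSize) {
        List<Long> splitSizes = new ArrayList<Long>();
        long current = 0;
        for (long len : fileLens) {
            if (current > 0 && current + len > maxSplitSize) {
                splitSizes.add(current); // close the current split
                current = 0;
            }
            current += len;
        }
        if (current > 0) {
            splitSizes.add(current);
        }
        return splitSizes;
    }

    public static void main(String[] args) {
        // two big files and four small ones, threshold 100
        long[] files = {90, 90, 10, 10, 10, 10};
        System.out.println(packBySize(files, 100)); // prints [90, 100, 30]
    }
}
```

The largest split here carries more than three times the data of the smallest, which is exactly the straggler pattern described above.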

Evidently the only clean fix is to give every map the same amount of data to process.

Hence a third version: subclass CombineFileInputFormat and implement the getSplits method ourselves. Because the input data is in SequenceFile format, a SequenceFileRecordReaderWrapper class is needed as well.

The implementation:

CustomCombineSequenceFileInputFormat.java

import java.io.IOException;
 
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 
/**
 * Input format that is a <code>CombineFileInputFormat</code>-equivalent for
 * <code>SequenceFileInputFormat</code>.
 *
 * @see CombineFileInputFormat
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class CustomCombineSequenceFileInputFormat<K, V> extends MultiFileInputFormat<K, V> {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        return new CombineFileRecordReader((CombineFileSplit) split, context,
                SequenceFileRecordReaderWrapper.class);
    }
 
    /**
     * A record reader that may be passed to <code>CombineFileRecordReader</code> so that it can be
     * used in a <code>CombineFileInputFormat</code>-equivalent for
     * <code>SequenceFileInputFormat</code>.
     *
     * @see CombineFileRecordReader
     * @see CombineFileInputFormat
     * @see SequenceFileInputFormat
     */
    private static class SequenceFileRecordReaderWrapper<K, V>
            extends CombineFileRecordReaderWrapper<K, V> {
        // this constructor signature is required by CombineFileRecordReader
        public SequenceFileRecordReaderWrapper(CombineFileSplit split, TaskAttemptContext context,
                Integer idx) throws IOException, InterruptedException {
            super(new SequenceFileInputFormat<K, V>(), split, context, idx);
        }
    }
}

MultiFileInputFormat.java

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
 
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
 
/**
 * Multiple files can be combined into one InputSplit so that the number of InputSplits can be capped.
 */
public abstract class MultiFileInputFormat<K, V> extends CombineFileInputFormat<K, V> {
 
    private static final Log LOG = LogFactory.getLog(MultiFileInputFormat.class);
    public static final String CONFNAME_INPUT_SPLIT_MAX_NUM = "multifileinputformat.max_split_num";
    public static final Integer DEFAULT_MAX_SPLIT_NUM = 50;
 
    public static void setMaxInputSplitNum(Job job, Integer maxSplitNum) {
        job.getConfiguration().setInt(CONFNAME_INPUT_SPLIT_MAX_NUM, maxSplitNum);
    }
 
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        // get all the files in input path
        List<FileStatus> stats = listStatus(job);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        if (stats.size() == 0) {
            return splits;
        }
        // compute the average split length
        long totalLen = 0;
        for (FileStatus stat : stats) {
            totalLen += stat.getLen();
        }
        int maxSplitNum = job.getConfiguration().getInt(CONFNAME_INPUT_SPLIT_MAX_NUM, DEFAULT_MAX_SPLIT_NUM);
        int expectSplitNum = maxSplitNum < stats.size() ? maxSplitNum : stats.size();
        long averageLen = totalLen / expectSplitNum;
        LOG.info("Prepare InputSplit : averageLen(" + averageLen + ") totalLen(" + totalLen
                + ") expectSplitNum(" + expectSplitNum + ") ");
        // build the InputSplit list
        List<Path> pathLst = new ArrayList<Path>();
        List<Long> offsetLst = new ArrayList<Long>();
        List<Long> lengthLst = new ArrayList<Long>();
        long currentLen = 0;
        for (int i = 0; i < stats.size(); i++) {
            FileStatus stat = stats.get(i);
            pathLst.add(stat.getPath());
            offsetLst.add(0L);
            lengthLst.add(stat.getLen());
            currentLen += stat.getLen();
            if (splits.size() < expectSplitNum - 1   && currentLen > averageLen) {
                Path[] pathArray = new Path[pathLst.size()];
                CombineFileSplit thissplit = new CombineFileSplit(pathLst.toArray(pathArray),
                    getLongArray(offsetLst), getLongArray(lengthLst), new String[0]);
                LOG.info("combineFileSplit(" + splits.size() + ") fileNum(" + pathLst.size()
                        + ") length(" + currentLen + ")");
                splits.add(thissplit);
                // reset accumulators and start the next split
                pathLst.clear();
                offsetLst.clear();
                lengthLst.clear();
                currentLen = 0;
            }
        }
        if (pathLst.size() > 0) {
            Path[] pathArray = new Path[pathLst.size()];
            CombineFileSplit thissplit =
                    new CombineFileSplit(pathLst.toArray(pathArray), getLongArray(offsetLst),
                            getLongArray(lengthLst), new String[0]);
            LOG.info("combineFileSplit(" + splits.size() + ") fileNum(" + pathLst.size()
                    + ") length(" + currentLen + ")");
            splits.add(thissplit);
        }
        return splits;
    }
 
    private long[] getLongArray(List<Long> lst) {
        long[] rst = new long[lst.size()];
        for (int i = 0; i < lst.size(); i++) {
            rst[i] = lst.get(i);
        }
        return rst;
    }
}

With the multifileinputformat.max_split_num parameter, the number of maps can be controlled quite precisely, and each map turns out to process an almost equal amount of data. Problem finally solved.
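The grouping arithmetic in getSplits can also be exercised in isolation. The sketch below replays the same greedy loop over plain file lengths (the method name and sample values are invented for the test); note that because the cut condition is a strict `currentLen > averageLen`, groups overshoot the average, so the actual number of splits can come out below the requested maximum:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone replica of the grouping loop in MultiFileInputFormat.getSplits:
// compute the average target length, then pack files greedily, cutting a new
// split once the accumulated length exceeds the average (except for the last
// split, which takes whatever remains).
public class Main {

    static List<Long> splitLengths(long[] fileLens, int maxSplitNum) {
        List<Long> splits = new ArrayList<Long>();
        long totalLen = 0;
        for (long len : fileLens) {
            totalLen += len;
        }
        int expectSplitNum = Math.min(maxSplitNum, fileLens.length);
        long averageLen = totalLen / expectSplitNum;
        long currentLen = 0;
        int filesInGroup = 0;
        for (long len : fileLens) {
            currentLen += len;
            filesInGroup++;
            if (splits.size() < expectSplitNum - 1 && currentLen > averageLen) {
                splits.add(currentLen);
                currentLen = 0;
                filesInGroup = 0;
            }
        }
        if (filesInGroup > 0) {
            splits.add(currentLen);
        }
        return splits;
    }

    public static void main(String[] args) {
        // six files of 10 bytes, at most 3 splits -> two splits of 30 bytes each
        System.out.println(splitLengths(new long[]{10, 10, 10, 10, 10, 10}, 3)); // prints [30, 30]
    }
}
```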



SIMONE 2014-09-16 09:25 Post a comment
]]>