<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><link>http://www.aygfsteel.com/ricdong/</link><language>zh-cn</language><lastBuildDate>Sun, 13 Jul 2025 08:53:49 GMT</lastBuildDate><pubDate>Sun, 13 Jul 2025 08:53:49 GMT</pubDate><ttl>60</ttl>
<item><title>MapReduce Data Distribution Skew</title><link>http://www.aygfsteel.com/ricdong/articles/366991.html</link><dc:creator>Ric Dong</dc:creator><author>Ric Dong</author><pubDate>Thu, 22 Dec 2011 02:17:00 GMT</pubDate><guid>http://www.aygfsteel.com/ricdong/articles/366991.html</guid><wfw:comment>http://www.aygfsteel.com/ricdong/comments/366991.html</wfw:comment><comments>http://www.aygfsteel.com/ricdong/articles/366991.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.aygfsteel.com/ricdong/comments/commentRss/366991.html</wfw:commentRss><trackback:ping>http://www.aygfsteel.com/ricdong/services/trackbacks/366991.html</trackback:ping><description><![CDATA[<span id="wmqeeuq" class="Apple-style-span" style="font-family: Helvetica, Tahoma, Arial, sans-serif; line-height: 25px; background-color: #ffffff; "><p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; ">Data distribution skew means that data is concentrated excessively at one end of the data space, producing an uneven distribution that is “top-heavy” or leans like the “Tower of Pisa”. Skew causes a “bottleneck” in computing efficiency and “overgeneralization” in data analysis results.</span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; "><br /></span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; "><strong style="font-weight: bold; ">The efficiency “bottleneck”</strong></span></p>
<p
class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; ">Suppose a large mall has ten shops: A, B1, B2, ..., B9. Shop A stocks 990,000 items, and each of the nine shops B1 to B9 stocks 10,000. We want to count the total number of items in the mall. At the start we choose a HashMap as the storage structure, with the shop as the key and its items as the value. The computation first counts the items of each shop and then adds the per-shop results together. Because A has 990,000 items and we accumulate by adding 1 at a time (assume each addition takes 1 second), we must add 1 a total of 990,000 times to get A's count (990,000 seconds in total), whereas each of B1 to B9 needs only 10,000 additions (10,000 seconds each). To obtain the mall total we must wait until every shop has finished counting before we can sum the results, so the bottleneck clearly lies in counting shop A's items.</span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; ">This situation is common in distributed computing, for example in a Hadoop job. Map/reduce processes data as key-value pairs, so if one key carries too much data, the move/shuffle/sort time for that key is far higher than for the other keys, and that key becomes the efficiency “bottleneck”. The usual remedy is a custom partitioner that regroups the values so that each group holds roughly the same amount of data, which removes the time bottleneck.</span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; "><br /></span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px;
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; "><strong style="font-weight: bold; ">“Overgeneralization” in analysis results</strong></span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; ">Take the same “mall” example, and additionally assume that shops A and B9 sell low-end goods while B1, B2, ..., B8 sell high-end goods with lower sales volume. Suppose we want to judge how popular each shop is with buyers from its sales figures. Shop A stocks a large number of items and prices them for small margins and quick turnover, so if we look only at sales volume we will conclude that A is the most popular shop in the mall, which is a “one-sided” result.</span></p>
<p class="MsoNormal" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><span style="font-size: small; ">In fact, in this situation we should first analyze the nature of the sellers and of the buyers, and use relative quantities as the evaluation metric. For example, shop A sells low-end goods and sells 10,000 items a day, and 10,000/990,000 is about 1%, while shop B9 also sells low-end goods and sells 5,000 items a day, and 5,000/10,000 = 50%. Among low-end buyers, then, the low-end shop B9 should be the most popular.</span></p></span><img src ="http://www.aygfsteel.com/ricdong/aggbug/366991.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.aygfsteel.com/ricdong/" target="_blank">Ric Dong</a> 2011-12-22 10:17 <a href="http://www.aygfsteel.com/ricdong/articles/366991.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item>
<item><title>Some Thoughts on an Algorithm for Parsing XML with MapReduce</title><link>http://www.aygfsteel.com/ricdong/articles/366960.html</link><dc:creator>Ric Dong</dc:creator><author>Ric Dong</author><pubDate>Wed, 21 Dec 2011 13:15:00 GMT</pubDate><guid>http://www.aygfsteel.com/ricdong/articles/366960.html</guid><wfw:comment>http://www.aygfsteel.com/ricdong/comments/366960.html</wfw:comment><comments>http://www.aygfsteel.com/ricdong/articles/366960.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.aygfsteel.com/ricdong/comments/commentRss/366960.html</wfw:commentRss><trackback:ping>http://www.aygfsteel.com/ricdong/services/trackbacks/366960.html</trackback:ping><description><![CDATA[<div><div>I did not expect parsing XML in Hadoop to be this awkward: the new-API mapreduce actually dropped the XML input format and reader. The streaming module of the old release (hadoop-0.19.*) provides such an API, but since I am on hadoop-0.20.2 3U1, I need to port the few classes that handle XML.</div><div> </div>
<div>Porting brings its own problems: dependencies scattered everywhere and all sorts of API incompatibilities. No matter, I can read the source and write my own. Reading the reader code closely, I found that mapreduce uses mark and reset on a BufferedInputStream to locate the XML tags, which are the tags we set when submitting the job, such as &lt;log&gt; and &lt;/log&gt;. In Java, mark and reset let a stream re-read from an earlier position: on finding &lt;log&gt; the reader marks the position, goes on to find &lt;/log&gt;, then calls reset to return to the mark and read out the data between &lt;log&gt; and &lt;/log&gt;. During matching, however, I found that mapred uses BufferedInputStream's read() method, which returns the next available byte. The whole process is therefore: read one byte, compare one byte. I have not used this algorithm in mapreduce, but I have tested it: reading from the buffer (BufferedInputStream) one byte at a time performs very poorly. The read() call took ?31 nanoseconds on average, and processing a 170M XML document (with many tags) took ?00+ seconds. (The streaming module even has a faster* method, and it is still painfully slow.)</div><div> </div>
<div>Zhou Min pointed me to the reader that Pig uses for XML, but I have not read the Pig code closely yet, and I do not know whether Hadoop's JIRA has a new feature that addresses the current XML problem. If it does, please let me know.</div><div> </div>
<div>My current idea still centers on byte comparison, because string matching is even less efficient. The algorithm is modeled on String.indexOf(""): find &lt;log&gt; and remember its position, then find &lt;/log&gt;, which counts as a complete match, and copy the content in between into a new byte array with System.arraycopy. So far I have implemented half of it: after finding &lt;log&gt; and &lt;/log&gt;, replace both tags. On the 170M document this takes ?.2 seconds (?.3 seconds at best).</div><div> </div>
<div>Algorithm and open issues:</div>
<div>Start with a BufferedInputStream (default size ?k) and create a 4k byte array in the program, so that each read from the BufferedInputStream fetches 4k, which is quite efficient. Then search that array for the bytes of &lt;log&gt;; this step is remarkably fast. There is one small problem: when processing 4k at a time, a &lt;log&gt;&lt;/log&gt; pair may well straddle two reads, with the opening tag at the tail of one read and the closing tag at the head of the next. My idea is a semi-circular byte array: if &lt;log&gt; is found at the very end of the 4k array, discard the unmatched bytes before it, move the &lt;log&gt; tag to the front of the array, and then read another 4k-5 bytes from the BufferedInputStream into the same array (5 is the byte length of &lt;log&gt;). As for the 4k size itself, first sample the XML data to determine the length of the content between &lt;log&gt; and &lt;/log&gt;, and then set the size of the buffer buf accordingly.</div></div><img src ="http://www.aygfsteel.com/ricdong/aggbug/366960.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.aygfsteel.com/ricdong/" target="_blank">Ric Dong</a> 2011-12-21 21:15 <a href="http://www.aygfsteel.com/ricdong/articles/366960.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>