Distributed computing frameworks fall into two broad groups according to the kind of data set they handle: data-flow and streaming. Data-flow frameworks take blocks of data as their input; representatives include MR (MapReduce) and Spark, and I call this group "big data". Streaming frameworks process the data that arrives within a unit of time and put more weight on real-time behavior; they include Storm, JStorm and Samza, and I call this group "fast data".
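The difference between the two styles can be sketched in a few lines of Python. This is only an illustration, not how MR, Spark or Storm are implemented; the function names `process_block` and `process_stream` and the counting task are assumptions chosen for the example.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def process_block(block: List[str]) -> Counter:
    """Data-flow style: the whole block is available before work starts."""
    return Counter(block)


def process_stream(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update the result as each event arrives."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        # A snapshot is available after every event, not only at the end.
        yield Counter(counts)


if __name__ == "__main__":
    words = ["storm", "samza", "storm"]
    print(process_block(words))
    for snapshot in process_stream(words):
        print(snapshot)
```

Both functions reach the same final counts; the streaming version simply makes an intermediate result available after each event, which is where the emphasis on real-time behavior comes from.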
In this article I mainly discuss the streaming frameworks.
The first is Storm, a real-time computation system. It assumes the data source is dynamic and processes data the way water flows through a pipe.
Its characteristics are low latency, high performance, distributed operation, scalability and fault tolerance.
The architecture is shown in the figure below.
For Storm's concepts in detail, see http://blog.csdn.net/hljlzc2007/article/details/12976211; I will not go into them here.
Storm is currently the most stable open-source stream processing framework, but in my opinion it has two problems.
1. Although Storm lets you write spout and bolt code in several languages, its core is implemented in Clojure. This is a real inconvenience for people who work with big data and open source, because most of them work mainly in mainstream languages such as Java and C++. As a result Storm becomes hard to control: it is difficult to understand its internals in depth or to modify its details.
2. Storm can run on YARN (Hadoop 2.0) and share a Hadoop cluster's resources with other open-source frameworks, but its performance there is poor, and this still needs to be improved on the Storm side.
Either way, Storm is still the king of open-source stream processing frameworks.
The second framework I want to mention is JStorm. It was built by Alibaba and can be seen as another implementation of Storm, written in Java.
Features:
1. The client API is essentially the same as Storm's, so bolt and spout code does not need to be changed when migrating from Storm
2. JStorm is more stable and faster than Storm
3. It provides some new features
If you are interested, give it a try. Project address: https://github.com/alibaba/jstorm
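The third is Samza
Samza is a technology open-sourced by LinkedIn. It is an open-source distributed stream processing system, very similar to Storm. The difference is that it runs on top of Hadoop and uses Kafka, the distributed messaging system that LinkedIn developed itself.
It is a small and beautiful project from LinkedIn. What makes it beautiful?
1. It has only a few thousand lines of code, yet its functionality is comparable to Storm's, although it still has many shortcomings at present
2. It is tightly integrated with Kafka, which makes processing data more convenient
3. It runs on YARN
A project I worked on before used Kafka + Storm + ElasticSearch. In the future Storm could be replaced entirely with Samza, which would also make it possible to use the Hadoop cluster's resources for storage and offline analysis. Running both real-time processing and offline analysis on Hadoop is what makes Samza a great project in my view: it slows the growth of a project's complexity and makes maintenance easier. As I said, small and beautiful things tend to be more popular.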
Architecture:
Samza consists of three main layers:
1. Streaming layer --> Kafka
2. Execution layer --> YARN
3. Processing layer --> Samza API
Samza's streaming layer and execution layer are both pluggable: developers can substitute other frameworks and are not limited to the two technologies above.
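One way to picture pluggable layers is as a pair of interfaces that a job depends on, with concrete implementations swapped in behind them. The sketch below is an assumption made for illustration: `StreamLayer`, `ExecutionLayer`, `InMemoryStream`, `LocalExecution` and `run_job` are hypothetical names and are not part of the Samza API.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class StreamLayer(ABC):
    """Hypothetical stand-in for the streaming layer (Kafka in Samza)."""

    @abstractmethod
    def read(self) -> Iterable[str]:
        ...


class ExecutionLayer(ABC):
    """Hypothetical stand-in for the execution layer (YARN in Samza)."""

    @abstractmethod
    def run(self, task: Callable[[str], str], messages: Iterable[str]) -> List[str]:
        ...


class InMemoryStream(StreamLayer):
    """Replacement streaming layer that serves messages from a list."""

    def __init__(self, messages: Iterable[str]) -> None:
        self._messages = list(messages)

    def read(self) -> Iterable[str]:
        return iter(self._messages)


class LocalExecution(ExecutionLayer):
    """Replacement execution layer that runs the task in this process."""

    def run(self, task: Callable[[str], str], messages: Iterable[str]) -> List[str]:
        return [task(message) for message in messages]


def run_job(stream: StreamLayer, executor: ExecutionLayer,
            task: Callable[[str], str]) -> List[str]:
    # The processing code only sees the two interfaces, so either layer
    # can be replaced without touching the task itself.
    return executor.run(task, stream.read())


if __name__ == "__main__":
    print(run_job(InMemoryStream(["a", "b"]), LocalExecution(), str.upper))
```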
Samza provides a YARN ApplicationMaster and a YARN job runner out of the box. In the figure below, different colors represent different hosts.
The Samza client tells the YARN Resource Manager (RM) that it wants to start a Samza job. The YARN RM tells a YARN Node Manager (NM) to allocate space for the YARN ApplicationMaster. Once the NM has allocated the space, YARN containers run the Samza Task Runner.
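Samza state management
Managing state is hard in stream processing. Because the data is flowing and carries no state of its own, the application has to rely on historical data to record its context. Samza provides an embedded key-value database for this. It is based on LevelDB, keeps its data outside the JVM heap, and is used to store the historical data. The benefits of this approach are:
1. Lower JVM overhead
2. Local storage, which greatly improves throughput
3. Fewer concurrent operations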
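The idea of a task that keeps its history in a local key-value store can be sketched as follows. This is a simplified illustration under stated assumptions: `KeyValueStore` and `CountingTask` are hypothetical names, and a plain dictionary stands in for LevelDB, so none of the off-heap or persistence behavior is modeled.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Assumed stand-in for a local store; a dict replaces LevelDB here."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class CountingTask:
    """Keeps a running count per key, reading and writing only the store."""

    def __init__(self, store: KeyValueStore) -> None:
        self.store = store

    def process(self, key: str) -> int:
        # The message itself carries no history; the store supplies it.
        current = self.store.get(key) or 0
        self.store.put(key, current + 1)
        return current + 1


if __name__ == "__main__":
    task = CountingTask(KeyValueStore())
    for key in ["a", "a", "b"]:
        print(key, task.process(key))
```

Because each task owns its store and is the only one that touches it, no locking is needed in this sketch, which corresponds to the third benefit in the list.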
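Samza processing flow
The figure below is an example from the official Samza documentation: it groups by Member ID and counts page views. The input messages come from Machine1 and Machine2, and the output goes to Machine3. One way to understand it: the messages are spread across different messaging systems (Kafka); Samza reads the topics from the different Kafka instances, processes them, and sends the results to Machine3. I will not break it down further here; see the official documentation for details.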
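The grouping step in this example can be approximated in a short sketch: every page view is routed to a task chosen from its Member ID, so all views for one member are counted in the same place. The function names, the `(member_id, page)` event shape, and the use of `zlib.crc32` as the hash are assumptions for illustration, not how Samza or Kafka assign partitions.

```python
import zlib
from collections import Counter
from typing import Iterable, List, Tuple


def partition_for(member_id: str, num_tasks: int) -> int:
    """Pick a task for a member ID with a stable hash (crc32 is an assumption)."""
    return zlib.crc32(member_id.encode("utf-8")) % num_tasks


def count_page_views(events: Iterable[Tuple[str, str]],
                     num_tasks: int) -> List[Counter]:
    """events are (member_id, page) pairs; returns one Counter per task."""
    tasks: List[Counter] = [Counter() for _ in range(num_tasks)]
    for member_id, _page in events:
        # The same member always lands on the same task, so each task
        # can count its members without coordinating with the others.
        tasks[partition_for(member_id, num_tasks)][member_id] += 1
    return tasks


if __name__ == "__main__":
    views = [("m1", "/home"), ("m2", "/jobs"), ("m1", "/profile")]
    for index, counts in enumerate(count_page_views(views, 2)):
        print(index, dict(counts))
```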
Project address: https://github.com/apache/incubator-samza
Official documentation: http://samza.incubator.apache.org/
All of this leaves plenty of room for imagination. Will Storm keep its lead? Can Samza take its place? Either way, as a developer, with only a few thousand lines of code to read, I can't wait to go through it.
1. TOPOLOGY_WORKERS (set with setNumWorkers) specifies how many processes you want allocated around the cluster to execute the topology. Each component in the topology executes as one or more threads. The number of threads allocated to a given component is configured through the setBolt and setSpout methods. Those threads exist within worker processes. Each worker process contains some number of threads for some number of components. For instance, you may have 300 threads specified across all your components and 50 worker processes specified in your config. Each worker process will then execute 6 threads, each of which could belong to a different component. You tune the performance of Storm topologies by tweaking the parallelism for each component and the number of worker processes those threads should run within.
2. TOPOLOGY_DEBUG (set with setDebug), when set to true, tells Storm to log every message emitted by a component. This is useful in local mode when testing topologies, but you probably want to keep this turned off when running topologies on the cluster.
There are many other configurations you can set for the topology. The various configurations are detailed in the Javadoc for Config.
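The arithmetic behind the 300 threads and 50 worker processes example can be checked with a small sketch. It assumes threads are spread as evenly as possible across workers; Storm's scheduler makes the actual placement, and the function name and the per-component numbers below are hypothetical.

```python
from typing import Dict, List


def threads_per_worker(total_threads: int, num_workers: int) -> List[int]:
    """Spread threads over workers as evenly as possible (an assumption)."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    base, extra = divmod(total_threads, num_workers)
    # The first `extra` workers take one additional thread each.
    return [base + 1 if i < extra else base for i in range(num_workers)]


if __name__ == "__main__":
    # Hypothetical parallelism per component, adding up to 300 threads.
    parallelism: Dict[str, int] = {"spout": 100, "split": 120, "count": 80}
    total = sum(parallelism.values())
    print(threads_per_worker(total, 50))
```

With 300 threads and 50 workers every entry in the result is 6, matching the figure quoted in the text; changing either number shows how the per-worker load shifts.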
There are a variety of configurations you can set per topology. A list of all the configurations you can set can be found here. The ones prefixed with "TOPOLOGY" can be overridden on a topology-specific basis (the other ones are cluster configurations and cannot be overridden). Here are some common ones that are set for a topology: