<?xml version="1.0" encoding="utf-8" standalone="yes"?>
SSTable详解 (SSTable Explained)
DLevin | Thu, 24 Sep 2015 17:35:00 GMT | http://www.aygfsteel.com/DLevin/archive/2015/09/25/427481.html

<h2>Preface</h2>When I read Google's Bigtable paper a few years ago, I did not really grasp the ideas it expressed; I swallowed it whole and never noticed the concept of an SSTable. Later, after I started following HBase's design and source code, the ideas behind Bigtable slowly became clearer, but with too much going on I never set aside time to reread the paper. On my current project I have been learning and advocating HBase, while a colleague who is keen on Cassandra has been focusing on its design; the two of us occasionally trade opinions and insights on technology and design, and one day he remarked that Cassandra and HBase both store data in the SSTable format. I instinctively asked: what exactly is an SSTable? He didn't answer — perhaps it isn't something a few sentences can explain, or perhaps he had never asked himself the question. The question kept nagging at me, so now that I have some time to study the designs of HBase and Cassandra in depth, I decided to settle it first.

<h2>The Definition of SSTable</h2>To explain what this term really means, the best approach is to go back to its source, so let's reopen the Bigtable paper. The paper first describes the SSTable as follows (end of page 3 and start of page 4):
SSTable

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

½Ž€å•的非直译:
SSTable是Bigtable内部用于数据的文件格式,它的格式为文件本íw«å°±æ˜¯ä¸€ä¸ªæŽ’序的、不可变的、持久的Key/Value对MapåQŒå…¶ä¸­Keyå’Œvalue都可以是ä»ÀL„çš„byte字符丌Ӏ‚ä‹É用Key来查找ValueåQŒæˆ–通过¾l™å®šKey范围遍历所有的Key/Value寏V€‚每个SSTable包含一¾pÕdˆ—çš„BlockåQˆä¸€èˆ¬Block大小ä¸?4KBåQŒä½†æ˜¯å®ƒæ˜¯å¯é…ç½®çš„)åQŒåœ¨SSTable的末ž®¾æ˜¯Block索引åQŒç”¨äºŽå®šä½BlockåQŒè¿™äº›çƒ¦å¼•在SSTable打开时被加蝲到内存中åQŒåœ¨æŸ¥æ‰¾æ—‰™¦–先从内存中的索引二分查找扑ֈ°BlockåQŒç„¶åŽä¸€‹Æ¡ç£ç›˜å¯»é“即可读取到相应的Block。还有一¿Uæ–¹æ¡ˆæ˜¯ž®†è¿™ä¸ªSSTable加蝲到内存中åQŒä»Žè€Œåœ¨æŸ¥æ‰¾å’Œæ‰«æä¸­ä¸éœ€è¦è¯»å–磁盘ã€?/span>
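The lookup path the paper describes (in-memory block index, binary search, one block read) can be sketched as a toy in Python. This is illustrative only — the class name, block size, and in-memory "blocks" stand in for real 64KB byte blocks on disk:

```python
import bisect

class SSTable:
    """Toy immutable sorted map: data split into fixed-size blocks,
    with an in-memory index of each block's first key."""

    def __init__(self, items, block_size=4):
        # items must already be sorted by key; real SSTables use ~64KB byte blocks
        self.blocks = [items[i:i + block_size]
                       for i in range(0, len(items), block_size)]
        # block index: first key of each block (loaded when the SSTable is opened)
        self.index = [blk[0][0] for blk in self.blocks]

    def get(self, key):
        # binary-search the in-memory index, then scan a single "disk" block
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

items = sorted({"a": 1, "m": 2, "z": 3}.items())
sst = SSTable(items, block_size=2)
print(sst.get("m"))  # -> 2
print(sst.get("q"))  # -> None
```

The point of the structure is that only the small index lives in memory; each `get` touches exactly one block, which is what makes the single-disk-seek guarantee possible.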

This looks a lot like the first version of the HFile format; here is a diagram to get a feel for it:

In practice, HBase ran into the following problems with this version of HFile (see <a >here</a>):
1. Memory usage while parsing is fairly high.<br />2. The Bloom Filter and Block index can grow very large and hurt startup performance. Concretely, the Bloom Filter can grow to 100MB per HFile and the Block index to 300MB; with 20 HRegions on one HRegionServer, they can grow to 2GB and 6GB respectively. An HRegion must load all of its Block indexes into memory when it opens, which hurts startup time; and on the first request the entire Bloom Filter must be loaded into memory before the lookup can start, so an oversized Bloom Filter hurts first-request latency.<br />HFile version 2 optimized away several of these problems; the details will be covered in a later HFile deep-dive.<br />

<h2>SSTables as Storage</h2>

Continuing down the Bigtable paper, section 5.3 Tablet Serving says (page 5):
Tablet Serving

Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

Incoming read and write operations can continue while tablets are split and merged.

A brief summary (not a translation) of the first and third paragraphs:
When new data is written, the operation is first committed to a log as a redo record; the most recent data lives in an in-memory sorted buffer called the memtable, while older data lives in a sequence of SSTables. To recover a tablet, the tablet server reads its metadata from the METADATA table; this metadata lists all the SSTables that make up the tablet (recording metadata about them such as each SSTable's location, StartKey, and EndKey) together with a set of redo points into the commit logs. The tablet server reads the SSTable indices into memory and reconstructs the memtable by replaying all updates committed after the redo points.<br />On a read, after well-formedness and authorization checks, the server reads the SSTables and the memtable together (in HBase this also includes data in the BlockCache) and merges their results; since both the SSTables and the memtable are lexicographically sorted, the merge can be done very efficiently.<br />
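Because the memtable and every SSTable share the same sort order, the merged view can be built as a streaming k-way merge. A hypothetical sketch (names are illustrative), where sources are passed newest first so that newer entries shadow older ones for the same key:

```python
import heapq

def merged_view(*sources):
    """k-way merge of sorted (key, value) sequences.

    Sources are passed newest first; for duplicate keys the newest
    source's value shadows the older ones (as a memtable entry
    shadows the same key in older SSTables)."""
    merged = heapq.merge(*[
        ((key, rank, val) for key, val in src)   # rank breaks ties: lower = newer
        for rank, src in enumerate(sources)
    ])
    last_key = object()
    for key, rank, val in merged:
        if key != last_key:        # first (newest-ranked) entry for this key wins
            yield key, val
            last_key = key

memtable = [("b", "new-b"), ("d", "new-d")]
sstable1 = [("a", 1), ("b", "old-b")]
sstable2 = [("c", 3), ("d", "old-d")]
print(list(merged_view(memtable, sstable1, sstable2)))
# -> [('a', 1), ('b', 'new-b'), ('c', 3), ('d', 'new-d')]
```

Nothing is materialized up front: `heapq.merge` streams from each sorted source, which is why the merge stays efficient no matter how large the inputs are.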

<h2>SSTables During Compaction</h2>

Section 5.4 Compactions of the Bigtable paper puts it this way:
Compaction

As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

When the memtable grows to a threshold, it is frozen and a new memtable is created to take writes, while the old memtable is converted into an SSTable and written to GFS; this process is called a minor compaction. A minor compaction shrinks memory usage and also shrinks the log, because data that has been persisted can be dropped from the commit log. Reads and writes continue to be served while a minor compaction runs.<br />Every minor compaction produces a new SSTable file. As the number of SSTable files grows, read performance suffers, because every read has to consult all SSTable files and merge the results; the file count therefore needs an upper bound, and a merging compaction must periodically run in the background. A merging compaction reads the contents of a few SSTables and the memtable and writes them out merged into one new SSTable; once it finishes, the source SSTables and memtable can be discarded.<br />A merging compaction that merges all SSTables into exactly one SSTable is called a major compaction. A major compaction actually removes entries marked as deleted, whereas the other two kinds of compaction retain that information (in marker form). Bigtable periodically cycles through all of its tablets and applies major compactions to them; this reclaims the space used by deleted data and ensures that deleted data disappears from the system in a timely fashion.

<h2>SSTable Locality Groups and In-Memory Loading</h2>

In Bigtable, locality is defined by locality groups: multiple column families can be combined into one locality group, and within a single Tablet, the column families belonging to the same locality group are stored in a separate SSTable. HBase simplifies this model: each column family in each HRegion is stored in its own HFile, and HFile has no notion of locality groups — in effect, each column family is its own locality group.

Bigtable also lets you specify, per locality group, whether all data in that locality group should be loaded into memory; in HBase this is set when defining a column family. The loading is lazy, and it is intended mainly for small, frequently used column families: once loaded, reads no longer have to touch disk, improving read performance.

<h2>SSTable Compression</h2>

Bigtable's compression is configured at the locality-group level:
Compression

Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on modern machines.

Bigtable compresses each Block of an SSTable independently. Although compressing block by block loses some space compared with compressing the whole file, it lets us read, decompress, and examine one block at a time, instead of reading and decompressing an entire "big" SSTable on every access.
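The trade-off described above can be sketched directly: compress each block separately, keep an index of compressed offsets, and decompress only the block you need. Illustrative Python, with `zlib` standing in for Bigtable's two-pass scheme (all names here are assumptions, not Bigtable's API):

```python
import zlib

def build(blocks):
    """Compress each block independently; return (payload, offsets index)."""
    payload, index = b"", []
    for block in blocks:
        comp = zlib.compress(block)
        index.append((len(payload), len(comp)))  # (offset, compressed length)
        payload += comp
    return payload, index

def read_block(payload, index, i):
    """Decompress a single block without touching the rest of the file."""
    off, length = index[i]
    return zlib.decompress(payload[off:off + length])

blocks = [b"aaaa" * 100, b"bbbb" * 100, b"cccc" * 100]
payload, index = build(blocks)
print(read_block(payload, index, 1))  # only block 1 is decompressed
```

Compressing a whole file would give a better ratio (a wider window finds more repetition), but then reading any small portion would mean decompressing everything before it — exactly the cost per-block compression avoids.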

<h2>SSTable Read Caching</h2>

To improve read performance, Bigtable uses a two-level caching scheme:
Caching for read performance

To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

The two cache levels are:
1. High level: the Scan Cache, which caches the key/value pairs read from SSTables. It speeds up workloads that tend to re-read the same data (the locality-of-reference principle).<br />2. Low level: the BlockCache, which caches SSTable blocks. It speeds up workloads that tend to read data close to what they recently read.<br />
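Both levels behave like bounded LRU maps, differing only in what they key on (row keys for the Scan Cache, block ids for the BlockCache). A minimal LRU sketch — the class, capacity, and keys are all illustrative, not the Bigtable implementation:

```python
from collections import OrderedDict

class LRU(OrderedDict):
    """Tiny LRU cache: most-recently-used entries live at the end."""

    def __init__(self, capacity):
        super().__init__()
        self.capacity = capacity

    def put(self, key, value):
        self[key] = value
        self.move_to_end(key)
        if len(self) > self.capacity:
            self.popitem(last=False)   # evict the least-recently-used entry

    def get(self, key):
        if key in self:
            self.move_to_end(key)      # a hit refreshes recency
            return self[key]
        return None

# high level would be key -> value; low level would be block id -> block bytes
scan_cache = LRU(capacity=2)
scan_cache.put("row1", "v1")
scan_cache.put("row2", "v2")
scan_cache.get("row1")                 # touch row1, so row2 is evicted next
scan_cache.put("row3", "v3")
print(list(scan_cache))  # -> ['row1', 'row3']
```

The two levels complement each other: a Scan Cache hit skips the SSTable entirely, while a BlockCache hit still pays the merge and search cost but skips the GFS read.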

<h2>Bloom Filter</h2>

As mentioned above, Bigtable uses merged reads: each read must consult the relevant data in every SSTable and merge the results, and reading all SSTables on every request naturally costs performance. Bigtable therefore introduces Bloom filters, which can very quickly establish that a given RowKey is not in a particular SSTable (note: the converse does not hold — a positive answer only means "maybe").
Bloom Filter

As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
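The property the paper relies on — a negative answer is definitive, a positive answer only probable — can be seen in a minimal Bloom filter sketch. The bit-array size and hash count below are illustrative parameters, not Bigtable's:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0   # bits packed into one big int

    def _positions(self, key):
        # derive k bit positions from k independent salted hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent (skip the disk seek); True means "maybe"
        return all((self.bits >> p) & 1 for p in self._positions(key))

bf = BloomFilter()
for row in ["row1", "row2", "row3"]:
    bf.add(row)
print(bf.might_contain("row2"))    # -> True (no false negatives, ever)
print(bf.might_contain("absent"))  # False with overwhelming probability
```

A read consults the filter before opening an SSTable: most lookups for rows that an SSTable does not contain are answered from this small in-memory structure, which is exactly why "most lookups for non-existent rows or columns do not need to touch disk".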

<h2>Why SSTables Are Immutable</h2>The definition above already notes that an SSTable is an immutable ordered map, and this immutability simplifies the system considerably:
Exploiting Immutability

Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA table contains the set of roots.

Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.

The advantages of immutability include:
1. No synchronization is needed when reading SSTables. Read/write synchronization only has to be handled in the memtable, and to reduce contention there, Bigtable makes each memtable row copy-on-write so that reads and writes can proceed in parallel.
2. Permanently removing deleted data becomes a matter of garbage-collecting obsolete SSTables. Each tablet's SSTables are registered in the METADATA table, and the master removes obsolete SSTables with a mark-and-sweep garbage collection over that set.
3. Tablet splits become fast: instead of creating a new set of SSTables for each child tablet, the child tablets simply share the parent tablet's SSTables.
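The mark-and-sweep step described above can be sketched simply: mark every SSTable registered for some tablet in the METADATA table (the root set), then sweep anything on disk that is unregistered. Illustrative Python — the function and file names are made up:

```python
def obsolete_sstables(metadata, all_files):
    """Mark: collect every SSTable registered for some tablet in METADATA
    (the root set).  Sweep: anything on disk but not registered is an
    obsolete SSTable, e.g. an input left behind by a finished compaction."""
    live = {f for sstables in metadata.values() for f in sstables}
    return sorted(set(all_files) - live)

metadata = {                      # tablet -> its registered SSTables
    "tablet-1": ["sst-001", "sst-003"],
    "tablet-2": ["sst-004"],
}
on_disk = ["sst-001", "sst-002", "sst-003", "sst-004"]
print(obsolete_sstables(metadata, on_disk))  # -> ['sst-002']
```

Because SSTables are never modified in place, "deleting" data safely reduces to dropping whole files once no tablet references them — there is no partially-rewritten file to worry about.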

深入HBase架构解析(二) (An In-Depth Look at the HBase Architecture, Part 2)
DLevin | Sat, 22 Aug 2015 11:40:00 GMT | http://www.aygfsteel.com/DLevin/archive/2015/08/22/426950.html

<h2>Foreword</h2>This is the continuation of <a href="http://www.aygfsteel.com/DLevin/archive/2015/08/22/426877.html">An In-Depth Look at the HBase Architecture, Part 1</a>. Without further ado, let's continue.<br /><h2>How HBase Reads</h2>From the previous article we know that when HBase writes, identical Cells (same RowKey/ColumnFamily/Column) are not guaranteed to sit together; even deleting a Cell merely writes a new Cell carrying a Delete marker, rather than necessarily removing the old Cell. This raises the question of how reads are implemented. To answer it, first consider where copies of the same Cell may live: a newly written Cell lives in the MemStore; Cells already flushed to HDFS live in one or more StoreFiles (HFiles); and a recently read Cell may live in the BlockCache. Since the same Cell may exist in these three places, a read only needs to scan all three and merge the results (a Merge Read). The scan order in HBase is: BlockCache, MemStore, then StoreFiles (HFiles). The StoreFile scan first uses the Bloom Filter to skip HFiles that cannot possibly match, then uses the Block Index to locate the Cell quickly, loads the block into the BlockCache, and reads from there. Since an HStore may hold many StoreFiles (HFiles), a read may need to scan several of them, and too many HFiles again becomes a performance problem.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig16.png" height="278" width="769" /><br /><h2>Compaction</h2>Every MemStore flush creates a new HFile, and too many HFiles hurt read performance — so how is this solved? HBase uses a Compaction mechanism, somewhat like Java's GC: at first Java keeps allocating memory without freeing it, for speed, but there is no free lunch, and eventually garbage must be collected, often with a Stop-The-World pause that can itself cause serious trouble (see, for example, <a href="http://www.aygfsteel.com/DLevin/archive/2015/08/01/426418.html">this article</a> of mine); design is always a trade-off, and nothing is perfect. As with Java's GC, HBase has two kinds of Compaction: Minor Compaction and Major Compaction.<br /><ol><li>A Minor Compaction selects some small, adjacent StoreFiles and merges them into one larger StoreFile; Deleted or Expired Cells are not processed in this step. The result of a Minor Compaction is fewer, larger StoreFiles. (Is this usage right? Bigtable describes minor compaction as follows: <span style="font-size: 10.000000pt; font-family: 'Times'">As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This </span><span style="font-size: 10.000000pt; font-family: 'Times'; font-style: italic">minor compaction </span><span style="font-size: 10.000000pt; font-family: 'Times'">process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.</span> That is, Bigtable calls the flush of memtable data into one HFile/SSTable a Minor Compaction.)</li><li>A Major Compaction merges all StoreFiles into a single StoreFile; during it, Cells marked Deleted are removed, Expired Cells are dropped, and Cells beyond the maximum number of versions are dropped. The result of a Major Compaction is that an HStore has exactly one StoreFile. A Major Compaction can be triggered manually or automatically, but because it causes a great deal of IO and hence performance problems, it is usually scheduled for idle periods such as weekends or the small hours.<br /></li></ol>More concretely, the following two pictures show a Minor Compaction and a Major Compaction respectively.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig18.png" height="329" width="723" /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig19.png" height="339" width="653" /><br /><h2>HRegion Split</h2>Initially a Table has a single HRegion. As writes accumulate, an HRegion that reaches a certain size must be split into two; this size is set by hbase.hregion.max.filesize, 10GB by default. On split, the two new HRegions are created on the same HRegionServer, each holding half of the parent's data; when the split completes, the parent HRegion goes offline and the two new child HRegions register with the HMaster and come online. For load-balancing reasons, the HMaster may later assign one or both of them to other HRegionServers. For details on splitting, see <a >《Apache HBase Region Splitting and Merging》</a>.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig21.png" height="361" width="675" /><br /><h2>HRegion Load Balancing</h2>After an HRegion split, the two new HRegions initially live on the same HRegionServer as their parent. For load balancing, the HMaster may reassign one or even both to other HRegionServers, in which case those HRegionServers serve data that resides on other nodes — until the next Major Compaction moves the data from the remote nodes back to the local node.<br /><br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig22.png" height="358" width="714" /><br /><h2>HRegionServer Recovery</h2>When an HRegionServer dies, it stops sending Heartbeats to ZooKeeper and is detected; ZooKeeper notifies the HMaster, which determines which HRegionServer went down, reassigns its HRegions to other HRegionServers, and splits the dead server's WAL among the corresponding HRegionServers (writing the split-out WAL files into each destination HRegionServer's WAL directory, on the appropriate DataNodes), so that those HRegionServers can replay their share of the WAL to rebuild the MemStores.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig25.png" height="368" width="708" /><br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig26.png" height="378" width="724" /><br /><h2>A Brief Architectural Summary of HBase</h2>In NoSQL there is the well-known CAP theorem: Consistency, Availability, and Partition tolerance cannot all be had. Practically every NoSQL system on the market chooses Partition tolerance, to scale data horizontally and handle the data volumes (and the resulting performance problems) that relational databases cannot, leaving only C and A to choose between. HBase chose Consistency, and then uses multiple HMasters, failure monitoring of HRegionServers, the introduction of ZooKeeper as coordinator, and other means to address Availability; yet when a network Split-Brain (Network Partition) occurs, it still cannot fully solve the Availability problem. From this angle, Cassandra chose A: it can still accept writes during a network partition, and uses other techniques to address Consistency, such as consistency checks and repair triggered on reads. These are limitations of the designs.<br /><br />Strengths of the implementation:<br /><ol><li>HBase uses a strong consistency model: once a write returns, all reads are guaranteed to see the same data.</li><li>Automatic scaling through dynamic HRegion Split and Merge, and high availability through HDFS's replicated data storage.</li><li>Running HRegionServers and DataNodes on the same servers gives data locality, improving read/write performance and reducing network pressure.</li><li>Built-in automatic recovery from HRegionServer crashes, using the WAL to replay data not yet persisted to HDFS.</li><li>Seamless integration with Hadoop/MapReduce.<br /></li></ol>Weaknesses of the implementation:<br /><ol><li>WAL replay can be slow.</li><li>Disaster recovery is complex, and can also be slow.</li><li>Major Compaction can cause an IO storm.</li><li>...<br /></li></ol><h2>References:</h2>
https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx<br /> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable<br /> http://hbase.apache.org/book.html <br /> http://www.searchtb.com/2011/01/understanding-hbase.html <br /> http://research.google.com/archive/bigtable-osdi06.pdf

深入HBase架构解析(一) (An In-Depth Look at the HBase Architecture, Part 1)
DLevin | Sat, 22 Aug 2015 09:44:00 GMT | http://www.aygfsteel.com/DLevin/archive/2015/08/22/426877.html

<h2>Preface</h2> My company uses the MapR distribution of the Hadoop ecosystem, which is how I found this article on MapR's site: <a >An In-Depth Look at the HBase Architecture</a>. I originally meant to translate it in full, but a faithful translation would require endless wordsmithing, so most of this article is in my own words, enriched with my reading of other sources and of the source code itself — half translation, half original.<br /> <h2>The Components of the HBase Architecture</h2> HBase builds its cluster on a Master/Slave architecture. It belongs to the Hadoop ecosystem and consists of the following node types: HMaster nodes, HRegionServer nodes, and a ZooKeeper cluster; underneath, it stores its data in HDFS and therefore involves HDFS's NameNode, DataNodes, and so on. The overall structure is:<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArch1.jpg" height="389" width="603" /><br /> The <strong>HMaster node</strong> is responsible for:<br /> <ol> <li>Managing HRegionServers and balancing load across them.</li> <li>Managing and assigning HRegions — for example, assigning new HRegions on HRegion split, and migrating HRegions to other HRegionServers when an HRegionServer exits.</li> <li>DDL operations (Data Definition Language: creating/dropping/altering namespaces and tables, adding/removing/altering column families, etc.).</li> <li>Managing namespace and table metadata (actually stored on HDFS).</li> <li>Access control (ACLs).</li> </ol> The <strong>HRegionServer node</strong> is responsible for:<br /> <ol> <li>Hosting and managing its local HRegions.</li> <li>Reading and writing HDFS, managing the data in its Tables.</li> <li>Serving Clients directly (a Client reads and writes through the HRegionServer, after fetching metadata from the HMaster and locating the HRegion/HRegionServer for its RowKey).</li> </ol> The <strong>ZooKeeper cluster is the coordination system</strong>, used to:<br /> <ol> <li>Hold the metadata and state information of the whole HBase cluster.</li> <li>Implement failover between the active and standby HMaster nodes.</li> </ol> The HBase Client talks to the HMaster and HRegionServers via RPC; one HRegionServer can host on the order of 1000 HRegions; Table data is stored in HDFS underneath, and each HRegion tries to stay close to the DataNodes holding its data, achieving data locality. Locality cannot always be maintained — for example, after an HRegion moves (say, due to a Split), it must wait for the next Compaction to become local again.<br /> <br /> In the spirit of half-translation, here is the architecture diagram from "An In-Depth Look At The HBase Architecture":<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig1.png" height="343" width="632" /><br /> This diagram shows fairly clearly that both the HMaster and the NameNode support multiple hot standbys, coordinated through ZooKeeper. ZooKeeper is nothing mystical: it is usually a cluster of three machines, internally using the PAXOS algorithm, tolerating the failure of one of the three servers; some deployments use five machines, tolerating two simultaneous failures — always fewer than half — though performance degrades as machines are added. RegionServers and DataNodes are generally placed on the same servers for data locality.<br /> <h2>HRegion</h2> HBase splits a table horizontally into multiple HRegions by RowKey. From the HMaster's point of view, each HRegion records its StartKey and EndKey (the first HRegion's StartKey is empty, and the last HRegion's EndKey is empty); since RowKeys are sorted, a Client can quickly locate, via the HMaster, which HRegion each RowKey belongs to. HRegions are assigned by the HMaster to HRegionServers, which are then responsible for starting and managing them, communicating with Clients, and performing data reads (via HDFS). Each HRegionServer can manage around 1000 HRegions at once (where does this number come from? I did not find a limit in the code — is it empirical? Does exceeding 1000 cause performance problems? <strong>To answer this question</strong>: the number 1000 appears to come from the Bigtable paper, section 5 Implementation: "Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server)").<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig2.png" height="337" width="724" /><br /> <h2>HMaster</h2> The HMaster is not a single point of failure: multiple HMasters can be started, and ZooKeeper's Master Election mechanism guarantees that only one is Active at a time, the others remaining hot standbys. Usually two HMasters are started; the non-Active one periodically contacts the Active HMaster to fetch its latest state, keeping itself up to date — which is why starting many HMasters only adds load on the Active one. As introduced earlier, the HMaster mainly handles HRegion assignment and management, and DDL (Data Definition Language — creating, deleting, and altering Tables); that is, it has two areas of responsibility:<br /> <ol> <li>Coordinating HRegionServers <ol> <li>Assigning HRegions at startup, and reassigning them for load balancing or repair.</li> <li>Monitoring the state of all HRegionServers in the cluster (via Heartbeats and by watching state in ZooKeeper).<br /> </li> </ol> </li> <li>Admin duties <ol> <li>Creating, deleting, and altering Table definitions.<br /> </li> </ol> </li> </ol> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig3.png" /><br /> <h2> ZooKeeper: the Coordinator</h2> ZooKeeper provides coordination services for the HBase cluster. It tracks the state of the HMaster and HRegionServers (available/alive, etc.) and notifies the HMaster when they fail, enabling failover between HMasters, or the repair of the HRegion set of a failed HRegionServer (reassigning them to other HRegionServers). The ZooKeeper cluster itself uses a consensus protocol (PAXOS) to keep every node's state consistent.<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig4.png" height="318" width="703" /><br /> <h2>How The Components Work Together</h2>
ZooKeeper coordinates the shared information of all nodes in the cluster. When the HMaster and HRegionServers connect to ZooKeeper they create Ephemeral nodes and keep them alive through a Heartbeat mechanism; if an Ephemeral node expires, the HMaster is notified and handles it accordingly.<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig5.png" height="329" width="722" /><br /> In addition, the HMaster watches Ephemeral nodes in ZooKeeper (by default /hbase/rs/*) to detect HRegionServers joining and failing. The first HMaster to connect to ZooKeeper creates an Ephemeral node (by default /hbase/master) to mark itself as the Active HMaster; HMasters that join later watch that node, and if the current Active HMaster dies, the node disappears, the other HMasters are notified, and one of them turns itself into the Active HMaster — before doing so, it creates its own Ephemeral node under /hbase/back-masters/.<br /> <h3> HBase's First Read and Write</h3> Before HBase 0.96, HBase had two special Tables, -ROOT- and .META. (as in the <a >BigTable</a> design). The location of the -ROOT- Table was stored in ZooKeeper; it held the RegionInfo of the .META. Table and could only ever be a single HRegion, while the .META. Table held the RegionInfo of user Tables and could be split into multiple HRegions. So on the first access to a user Table, the client first read from ZooKeeper which HRegionServer held -ROOT-; then, from that HRegionServer, using the requested TableName and RowKey, read which HRegionServer held .META.; and finally read the .META. content from that HRegionServer to obtain the location of the HRegion this request needed, then contacted that HRegionServer for the data — three requests just to locate the user Table, with the fourth request fetching the actual data. Of course, for performance, the client cached the -ROOT- location and the contents of -ROOT-/.META. As shown below:<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/image0030.jpg" height="228" width="399" /><br /> But even with client caching, needing three requests in the initial phase just to learn where a user Table really lives performs poorly — and is it really necessary to support that many HRegions? Perhaps for a company like Google, but hardly for an ordinary cluster. The Bigtable paper says each METADATA row stores about 1KB of data, and a medium-sized Tablet (HRegion) is about 128MB; the three-level location schema can address 2^34 Tablets (HRegions). Even without the -ROOT- Table, it can still address 2^17 (131072) HRegions; at 128MB each, that is 16TB, which may not seem large — but the maximum HRegion size is now usually set much bigger, say 2GB, which brings the addressable total to 4PB, plenty for an ordinary cluster. So in HBase 0.96 the -ROOT- Table was removed, leaving only one special catalog table, the Meta Table (hbase:meta), which stores the locations of all user HRegions in the cluster; the ZooKeeper node (/hbase/meta-region-server) directly stores the location of this Meta Table, and the Meta Table, like the old -ROOT- Table, cannot be split. The flow for a client's first access to a user Table thus becomes:<br /> <ol> <li>Fetch the location of hbase:meta (the HRegionServer hosting it) from ZooKeeper (/hbase/meta-region-server), and cache it.</li> <li>Query that HRegionServer for the HRegionServer hosting the RowKey of the user-Table request, and cache it.</li> <li>Read the Row from the HRegionServer found.</li> </ol> Note that the client caches these locations, but step two only caches the location of the HRegion for the current RowKey, so if the next RowKey to query falls in a different HRegion, the client must query the hbase:meta HRegion again. Over time, however, the client accumulates more and more cached locations and no longer needs to consult hbase:meta — unless some HRegion is moved because of a crash or a Split, in which case it re-queries and refreshes its cache.<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig6.png" height="356" width="590" /><br /> <h3> The hbase:meta Table</h3>
hbase:meta表存储了所有用户HRegion的位¾|®ä¿¡æ¯ï¼Œå®ƒçš„RowKey是:tableName,regionStartKey,regionId,replicaId½{‰ï¼Œå®ƒåªæœ‰info列族åQŒè¿™ä¸ªåˆ—族包含三个列åQŒä»–们分别是åQšinfo:regioninfo列是RegionInfoçš„proto格式åQšregionId,tableName,startKey,endKey,offline,split,replicaIdåQ›info:server格式åQšHRegionServer对应的server:portåQ›info:serverstartcode格式是HRegionServer的启动时间戳ã€?br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig7.png" height="362" width="736" /><br /> <h2>HRegionServer详解</h2> HRegionServer一般和DataNode在同一台机器上˜qè¡ŒåQŒå®žçŽ°æ•°æ®çš„æœ¬åœ°æ€§ã€‚HRegionServer包含多个HRegionåQŒç”±WAL(HLog)、BlockCache、MemStore、HFile¾l„成ã€?br /> <ol> <li><strong>WAL即Write Ahead Log</strong>åQŒåœ¨æ—©æœŸç‰ˆæœ¬ä¸­ç§°ä¸ºHLogåQŒå®ƒæ˜¯HDFS上的一个文ä»Óž¼Œå¦‚其名字所表示的,所有写操作都会先保证将数据写入˜q™ä¸ªLogæ–‡äšg后,才会真正更新MemStoreåQŒæœ€åŽå†™å…¥HFile中。采用这¿Uæ¨¡å¼ï¼Œå¯ä»¥ä¿è¯HRegionServer宕机后,我们依然可以从该Logæ–‡äšg中读取数据,Replay所有的操作åQŒè€Œä¸è‡³äºŽæ•°æ®ä¸¢å¤±ã€‚这个Logæ–‡äšg会定期Roll出新的文件而删除旧的文ä»?那些已持久化到HFile中的Log可以删除)。WALæ–‡äšg存储åœ?hbase/WALs/${HRegionServer_Name}的目录中(åœ?.94之前åQŒå­˜å‚¨åœ¨/hbase/.logs/目录ä¸?åQŒä¸€èˆ¬ä¸€ä¸ªHRegionServer只有一个WAL实例åQŒä¹Ÿž®±æ˜¯è¯´ä¸€ä¸ªHRegionServer的所有WAL写都是串行的(ž®±åƒlog4j的日志写也是串行çš?åQŒè¿™å½“然会引èµäh€§èƒ½é—®é¢˜åQŒå› è€Œåœ¨HBase 1.0之后åQŒé€šè¿‡<a >HBASE-5699</a>实现了多个WALòq¶è¡Œå†?MultiWAL)åQŒè¯¥å®žçŽ°é‡‡ç”¨HDFS的多个管道写åQŒä»¥å•个HRegion为单位。关于WAL可以参考Wikipediaçš?a >Write-Ahead Logging</a>。顺便吐槽一句,英文版的¾l´åŸºç™„¡§‘竟然能毫无压力的正常讉K—®äº†ï¼Œ˜q™æ˜¯æŸä¸ªGFW的疏忽还是以后的常态?</li> <li><strong>BlockCache是一个读¾~“å­˜</strong>åQŒå³“引用局部æ€?#8221;原理åQˆä¹Ÿåº”用于CPUåQ?a >分空间局部性和旉™—´å±€éƒ¨æ€?/a>åQŒç©ºé—´å±€éƒ¨æ€§æ˜¯æŒ‡CPU在某一时刻需要某个数据,那么有很大的概率在一下时åˆÕd®ƒéœ€è¦çš„æ•°æ®åœ¨å…¶é™„è¿‘åQ›æ—¶é—´å±€éƒ¨æ€§æ˜¯æŒ‡æŸä¸ªæ•°æ®åœ¨è¢«è®¿é—®è¿‡ä¸€‹Æ¡åŽåQŒå®ƒæœ‰å¾ˆå¤§çš„æ¦‚率在不久的ž®†æ¥ä¼šè¢«å†æ¬¡çš„访问)åQŒå°†æ•°æ®é¢„读取到内存中,以提升读的性能。HBase中提供两¿UBlockCache的实玎ͼšé»˜è®¤on-heap 
LruBlockCache, and BucketCache (usually off-heap). BucketCache's raw performance is generally worse than LruBlockCache's, but under GC pressure LruBlockCache's latency becomes unstable, whereas BucketCache manages its cache memory itself, needs no GC, and therefore has steadier latency; this is why BucketCache is sometimes the better choice. The article <a >BlockCache101</a> compares the on-heap and off-heap BlockCaches in detail.</li> <li><strong>An HRegion is the representation, on one HRegionServer, of one Region of a Table</strong>. A Table can have one or more Regions, which may sit on the same HRegionServer or be spread across different ones, and an HRegionServer can host multiple HRegions belonging to different Tables. An HRegion consists of multiple Stores (HStores); each HStore corresponds to one Column Family of the Table within this HRegion, i.e. each Column Family is a self-contained storage unit. It is therefore best to put Columns with similar IO characteristics into the same Column Family for efficient reads (the data-locality principle again, which improves the cache hit rate). The HStore is the core of HBase storage: it implements reading and writing HDFS, and consists of one MemStore and zero or more StoreFiles.<br /> <ol> <li><strong>The MemStore is a write cache</strong> (an In Memory Sorted Buffer). Every write, once its WAL entry is complete, goes into the MemStore, and the MemStore flushes the data to the underlying HDFS file (HFile) according to certain rules. Normally each Column Family of each HRegion has its own MemStore.</li> <li><strong>The HFile (StoreFile) stores HBase's data (Cells/KeyValues)</strong>. Data in an HFile is sorted by RowKey, Column Family, and Column; Cells with identical values for all three are ordered by timestamp, descending.</li> </ol> </li> </ol> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig8.png" /><br /> Although the figure above shows the latest HRegionServer architecture (not entirely accurately), I have always preferred the figure below, even though it depicts the pre-0.94 architecture.<br /> <img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/image0060.jpg" height="347" width="553" /><br /> <h3>Write Path in an HRegionServer</h3> When a client issues a Put request, it first looks up, from the hbase:meta table, the HRegionServer where the Put data must ultimately go. The client then sends the Put to that HRegionServer, which first writes the Put operation into its WAL log file (flushed to disk).<br /><img alt=""
src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig9.png" height="363" width="716" /><br /> After the WAL write completes, the HRegionServer uses the Put's TableName and RowKey to find the target HRegion, uses the Column Family to find the target HStore, and writes the Put into that HStore's MemStore. At this point the write has succeeded and the client is notified.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig10.png" height="298" width="664" /><br /><h3>MemStore Flush<br /></h3>The MemStore is an In Memory Sorted Buffer; every HStore has one, i.e. there is one instance per Column Family per HRegion. Its entries are ordered by RowKey, Column Family, and Column, with Timestamp descending, as shown below:<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig11.png" height="351" width="719" /><br />Every Put/Delete request is first written into the MemStore; when the MemStore fills up it is flushed into a new StoreFile (implemented as an HFile), so one HStore (Column Family) can have zero or more StoreFiles (HFiles). Three conditions can trigger a MemStore Flush. <strong>Note that the smallest unit of a Flush is the HRegion, not an individual MemStore.</strong> This is said to be one reason for limiting the number of Column Families, presumably because flushing too many Column Families together causes performance problems; the exact reason remains to be verified.<br /><ol><li>When the total size of all MemStores in one HRegion exceeds hbase.hregion.memstore.flush.size (default 128MB), all MemStores of that HRegion are flushed to HDFS.</li><li>When the global MemStore size exceeds hbase.regionserver.global.memstore.upperLimit (default 40% of the heap), the MemStores of all HRegions on the HRegionServer are flushed to HDFS, in descending order of MemStore size (whether "size" here means the sum of all MemStores of an HRegion or its single largest MemStore remains to be verified), until overall MemStore usage drops below hbase.regionserver.global.memstore.lowerLimit (default 38% of the heap).</li><li>When the total WAL size on the HRegionServer exceeds hbase.regionserver.hlog.blocksize * hbase.regionserver.max.logs, the MemStores of all HRegions on the server are flushed to HDFS in time order, oldest MemStore first, until the number of WALs falls back under the limit. <a >This article</a> says the default product of these two settings is 2GB; checking the code
, hbase.regionserver.max.logs defaults to 32, and hbase.regionserver.hlog.blocksize is the HDFS default blocksize, 64MB (32 × 64MB = 2GB). In any case, a Flush triggered by exceeding this limit is not a good thing and may cause long delays, hence the advice that article gives: “<strong>Hint</strong>: keep hbase.regionserver.hlog.blocksize * hbase.regionserver.maxlogs just a bit above hbase.regionserver.global.memstore.lowerLimit * HBASE_HEAPSIZE.” Also note that the description given <a >here</a> is wrong (even though it is the official documentation).<br /></li></ol>During a MemStore Flush, some meta data is appended at the tail of the file, including the largest WAL sequence number at Flush time, which records how recent the data written into this StoreFile is, so that recovery knows where to start. When an HRegion starts up, these sequence numbers are read and the largest one is taken as the starting sequence for the next update.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig12.png" height="248" width="622" /><br /><h2>HFile Format</h2>HBase data is stored sequentially in HFiles as KeyValues (Cells). HFiles are produced by the MemStore Flush, and since the Cells in a MemStore follow the same sort order, the Flush is a sequential write; as we know, sequential disk writes perform very well because the disk head does not have to keep seeking.<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig13.png" height="351" width="698" /><br />HFile follows the design of BigTable's SSTable and Hadoop's <a
>TFile</a>. From the beginning of HBase to now, HFile has gone through three versions: V2 was introduced in 0.92 and V3 in 0.98. Let us first look at the V1 format:<br /><img src="http://www.aygfsteel.com/images/blogjava_net/dlevin/image0080.jpg" alt="" height="160" border="0" width="554" /><br />A V1 HFile consists of multiple Data Blocks, Meta Blocks, a FileInfo, a Data Index, a Meta Index, and a Trailer. The Data Block is HBase's smallest storage unit, and the BlockCache mentioned earlier caches at Data Block granularity. A Data Block consists of a magic number followed by a series of KeyValues (Cells); the magic number marks the block as a Data Block so its format can be detected quickly and corruption can be caught. The Data Block size can be set when creating a Column Family (HColumnDescriptor.setBlockSize()), with a default of 64KB; large blocks favor sequential Scans while small blocks favor random lookups, so a trade-off is needed. Meta Blocks are optional. The FileInfo is a fixed-length block recording file metadata such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, and MAX_SEQ_ID_KEY. The Data Index and Meta Index record, for each Data Block and Meta Block, its starting offset, uncompressed size, Key (the starting RowKey?), and so on. The Trailer records the starting positions of the FileInfo, Data Index, and Meta Index blocks, plus the number of entries in the Data Index and Meta Index. The FileInfo and Trailer are fixed-length.<br /><br />Each KeyValue pair inside an HFile is a plain byte array, but this byte array contains many fields in a fixed layout. Let us look at the concrete structure:<br /><img src="http://www.aygfsteel.com/images/blogjava_net/dlevin/image0090.jpg" alt="" height="93" border="0" width="553" /><br />It starts with two fixed-length numbers giving the length of the Key and the length of the Value. Then comes the Key, which starts with a fixed-length number giving the length of the RowKey, followed by the
RowKey itself, then a fixed-length number giving the length of the Family, then the Family, then the Qualifier, and finally two fixed-length values for the Time Stamp and the Key Type (Put/Delete). The Value part has no such elaborate structure; it is just raw binary data. <strong>Across HFile versions the KeyValue (Cell) format has barely changed; only V3 appended an optional array of Tags at the end.</strong><br /> <br />In practice HFile V1 turned out to use a lot of memory, and its Bloom Filter and Block Index could grow very large, lengthening startup time. The Bloom Filter of a single HFile could grow to 100MB, which hurts queries because the Bloom Filter must be loaded and consulted on every query, and a 100MB Bloom Filter introduces significant latency. Likewise, the Block Indexes on one HRegionServer could grow to 6GB in total, and since an HRegionServer must load all of them at startup, startup time increased. To solve these problems, version 0.92 introduced HFile V2:<br /><img src="http://www.aygfsteel.com/images/blogjava_net/dlevin/hfilev2.png" alt="" height="418" border="0" width="566" /><br />In this version the Block Index and Bloom Filter are interleaved among the Data Blocks, a design that also reduces memory usage during writes. In addition, to speed up startup, this version introduced lazy loading: an HFile is only parsed when it is actually used.<br /><br />HFile V3 differs little from V2: it adds Tag-array support at the KeyValue (Cell) level and two Tag-related fields to the FileInfo structure. For the detailed evolution of the HFile format, see <a >here</a>.<br /><br />Looking at the V2 format more closely, it is a multi-level B+-tree-like index; with this design a lookup does not need to read the whole file:<br /><img alt="" src="http://www.aygfsteel.com/images/blogjava_net/dlevin/HBaseArchitecture-Blog-Fig14.png" height="349" width="688" /><br />Cells within a Data Block are in ascending order; each block has its own Leaf-Index, the last Key of each Block goes into the Intermediate-Index, and the Root-Index points to the Intermediate-Index. At the end of the HFile there is also a Bloom Filter for quickly ruling out Rows that are absent from a Data Block, and TimeRange information to help time-based queries. When an HFile is opened, all this index information is loaded and kept in memory to speed up subsequent reads.<br /><br />That is all for this installment; to be continued...<br /><br /> <h2>References</h2>
https://www.mapr.com/blog/in-depth-look-hbase-architecture#.VdNSN6Yp3qx<br /> http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable<br /> http://hbase.apache.org/book.html <br /> http://www.searchtb.com/2011/01/understanding-hbase.html <br /> http://research.google.com/archive/bigtable-osdi06.pdf<br><br><div align=right><a style="text-decoration:none;" href="http://www.aygfsteel.com/DLevin/" target="_blank">DLevin</a> 2015-08-22 17:44 <a href="http://www.aygfsteel.com/DLevin/archive/2015/08/22/426877.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss> </body>