上善若水
          In general the OO style is to use a lot of little objects with a lot of little methods that give us a lot of plug points for overriding and variation. To do is to be -Nietzsche, To bei is to do -Kant, Do be do be do -Sinatra
          posts - 146,comments - 147,trackbacks - 0

          前記

          幾年前在讀Google的BigTable論文的時候,當時并沒有理解論文里面表達的思想,因而囫圇吞棗,并沒有注意到SSTable的概念。再后來開始關注HBase的設計和源碼后,開始對BigTable傳遞的思想慢慢的清晰起來,但是因為事情太多,沒有安排出時間重讀BigTable的論文。在項目里,我因為自己在學HBase,開始主推HBase,而另一個同事則因為對Cassandra比較感冒,因而他主要關注Cassandra的設計,不過我們兩個人偶爾都會討論一下技術、設計的各種觀點和心得,然后他偶然的說了一句:Cassandra和HBase都采用SSTable格式存儲,然后我本能的問了一句:什么是SSTable?他并沒有回答,可能也不是那么幾句能說清楚的,或者他自己也沒有嘗試的去問過自己這個問題。然而這個問題本身卻一直困擾著我,因而趁著現在有一些時間深入學習HBase和Cassandra相關設計的時候先把這個問題弄清楚了。

          SSTable的定義

          要解釋這個術語的真正含義,最好的方法就是從它的出處找答案,所以重新翻開BigTable的論文。在這篇論文中,最初對SSTable是這么描述的(第三頁末和第四頁初):
          SSTable

          The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

          簡單的非直譯:
          SSTable是Bigtable內部用于數據的文件格式,它的格式為文件本身就是一個排序的、不可變的、持久的Key/Value對Map,其中Key和value都可以是任意的byte字符串。使用Key來查找Value,或通過給定Key范圍遍歷所有的Key/Value對。每個SSTable包含一系列的Block(一般Block大小為64KB,但是它是可配置的),在SSTable的末尾是Block索引,用于定位Block,這些索引在SSTable打開時被加載到內存中,在查找時首先從內存中的索引二分查找找到Block,然后一次磁盤尋道即可讀取到相應的Block。還有一種方案是將這個SSTable加載到內存中,從而在查找和掃描中不需要讀取磁盤。

          這個貌似就是HFile第一個版本的格式么,貼張圖感受一下:

          在HBase使用過程中,對這個版本的HFile遇到以下一些問題(參考這里):
          1. 解析時內存使用量比較高。
          2. Bloom Filter和Block索引會變的很大,而影響啟動性能。具體的,Bloom Filter可以增長到100MB每個HFile,而Block索引可以增長到300MB,如果一個HRegionServer中有20個HRegion,則他們分別能增長到2GB和6GB的大小。HRegion需要在打開時,需要加載所有的Block索引到內存中,因而影響啟動性能;而在第一次Request時,需要將整個Bloom Filter加載到內存中,再開始查找,因而Bloom Filter太大會影響第一次請求的延遲。
          而HFile在版本2中對這些問題做了一些優化,具體會在HFile解析時詳細說明。

          SSTable作為存儲使用

          繼續BigTable的論文往下走,在5.3 Tablet Serving小節中這樣寫道(第6頁):
          Tablet Serving

          Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of a redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

          When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

          When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

          Incoming read and write operations can continue while tablets are split and merged.

          第一段和第三段簡單描述,非翻譯:
          在新數據寫入時,這個操作首先提交到日志中作為redo紀錄,最近的數據存儲在內存的排序緩存memtable中;舊的數據存儲在一系列的SSTable 中。在recover中,tablet server從METADATA表中讀取metadata,metadata包含了組成Tablet的所有SSTable(紀錄了這些SSTable的元 數據信息,如SSTable的位置、StartKey、EndKey等)以及一系列日志中的redo點。Tablet Server讀取SSTable的索引到內存,并replay這些redo點之后的更新來重構memtable。
          在讀時,完成格式、授權等檢查后,讀會同時讀取SSTable、memtable(HBase中還包含了BlockCache中的數據)并合并他們的結果,由于SSTable和memtable都是字典序排列,因而合并操作可以很高效完成。

          SSTable在Compaction過程中的使用

          在BigTable論文5.4 Compaction小節中是這樣說的:
          Compaction

          As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

          Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

          A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.

          隨著memtable大小增加到一個閥值,這個memtable會被凍住而創建一個新的memtable以供使用,而舊的memtable會轉換成一個SSTable而寫道GFS中,這個過程叫做minor compaction。這個minor compaction可以減少內存使用量,并可以減少日志大小,因為持久化后的數據可以從日志中刪除。在minor compaction過程中,可以繼續處理讀寫請求。
          每次minor compaction會生成新的SSTable文件,如果SSTable文件數量增加,則會影響讀的性能,因而每次讀都需要讀取所有SSTable文件,然后合并結果,因而對SSTable文件個數需要有上限,并且時不時的需要在后臺做merging compaction,這個merging compaction讀取一些SSTable文件和memtable的內容,并將他們合并寫入一個新的SSTable中。當這個過程完成后,這些源SSTable和memtable就可以被刪除了。
          如果一個merging compaction是合并所有SSTable到一個SSTable,則這個過程稱做major compaction。一次major compaction會將mark成刪除的信息、數據刪除,而其他兩次compaction則會保留這些信息、數據(mark的形式)。Bigtable會時不時的掃描所有的Tablet,并對它們做major compaction。這個major compaction可以將需要刪除的數據真正的刪除從而節省空間,并保持系統一致性。

          SSTable的locality和In Memory

          在Bigtable中,它的本地性是由Locality group來定義的,即多個column family可以組合到一個locality group中,在同一個Tablet中,使用單獨的SSTable存儲這些在同一個locality group的column family。HBase把這個模型簡化了,即每個column family在每個HRegion都使用單獨的HFile存儲,HFile沒有locality group的概念,或者一個column family就是一個locality group。

          在Bigtable中,還可以支持在locality group級別設置是否將所有這個locality group的數據加載到內存中,在HBase中通過column family定義時設置。這個內存加載采用延時加載,主要應用于一些小的column family,并且經常被用到的,從而提升讀的性能,因而這樣就不需要再從磁盤中讀取了。

          SSTable壓縮

          Bigtable的壓縮是基于locality group級別:
          Compression

          Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy’s scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on modern machines.

          Bigtable的壓縮以SSTable中的一個Block為單位,雖然每個Block為壓縮單位損失一些空間,但是采用這種方式,我們可以以Block為單位讀取、解壓、分析,而不是每次以一個“大”的SSTable為單位讀取、解壓、分析。

          SSTable的讀緩存

          為了提升讀的性能,Bigtable采用兩層緩存機制:
          Caching for read performance

          To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).

          兩層緩存分別是:
          1. High Level,緩存從SSTable讀取的Key/Value對。提升那些傾向重復的讀取相同的數據的操作(引用局部性原理)。
          2. Low Level,BlockCache,緩存SSTable中的Block。提升那些傾向于讀取相近數據的操作。

          Bloom Filter

          前文有提到Bigtable采用合并讀,即需要讀取每個SSTable中的相關數據,并合并成一個結果返回,然而每次讀都需要讀取所有SSTable,自然會耗費性能,因而引入了Bloom Filter,它可以很快速的找到一個RowKey不在某個SSTable中的事實(注:反過來則不成立)。
          Bloom Filter

          As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom fil- ters [7] should be created for SSTables in a particu- lar locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a spec- ified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks re- quired for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.

          SSTable設計成Immutable的好處

          在SSTable定義中就有提到SSTable是一個Immutable的order map,這個Immutable的設計可以讓系統簡單很多:
          Exploiting Immutability

          Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

          Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet’s SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA table contains the set of roots.

          Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.

          關于Immutable的優點有以下幾點:
          1. 在讀SSTable是不需要同步。讀寫同步只需要在memtable中處理,為了減少memtable的讀寫競爭,Bigtable將memtable的row設計成copy-on-write,從而讀寫可以同時進行。
          2. 永久的移除數據轉變為SSTable的Garbage Collect。每個Tablet中的SSTable在METADATA表中有注冊,master使用mark-and-sweep算法將SSTable在GC過程中移除。
          3. 可以讓Tablet Split過程變的高效,我們不需要為每個子Tablet創建新的SSTable,而是可以共享父Tablet的SSTable。
          posted on 2015-09-25 01:35 DLevin 閱讀(15886) 評論(0)  編輯  收藏 所屬分類: HBase

          只有注冊用戶登錄后才能發表評論。


          網站導航:
           
          主站蜘蛛池模板: 平舆县| 色达县| 旺苍县| 南涧| 府谷县| 孟村| 河间市| 拉萨市| 新余市| 容城县| 丹阳市| 永胜县| 金湖县| 大埔县| 翁牛特旗| 龙门县| 成武县| 东兰县| 北川| 岳普湖县| 时尚| 南开区| 永年县| 沙雅县| 略阳县| 正安县| 新余市| 南丰县| 张北县| 丰原市| 德保县| 灵宝市| 时尚| 马山县| 鄂伦春自治旗| 鄂托克旗| 云安县| 广南县| 伊宁市| 报价| 七台河市|