Preface: Years ago, when I first read Google's BigTable paper, I did not really grasp the ideas it presented; I skimmed through it and never noticed the concept of an SSTable. Later, after I began following HBase's design and source code, the ideas behind BigTable gradually became clearer to me, but with so much going on I never set aside time to reread the paper. On a project, since I was learning HBase I pushed for HBase, while a colleague who was keen on Cassandra focused on its design; the two of us would occasionally trade opinions and insights on technology and design. One day he remarked in passing that both Cassandra and HBase store data in the SSTable format, and I instinctively asked: what is an SSTable? He didn't answer; perhaps it isn't something a few sentences can explain, or perhaps he had never asked himself the question. The question kept nagging at me, though, so now that I have some time to study the designs of HBase and Cassandra in depth, I decided to get it straight first.
The definition of SSTable: The best way to explain what this term really means is to go back to its source, so let's reopen the BigTable paper. The paper first describes the SSTable as follows (end of page 3, start of page 4):
SSTable
The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
A loose, non-literal rendering: the SSTable is the file format Bigtable uses internally to store data. The file itself is a sorted, immutable, persistent map of key/value pairs, where both keys and values can be arbitrary byte strings. You can look up a value by key, or iterate over all key/value pairs within a given key range. Each SSTable consists of a sequence of blocks (typically 64KB each, though this is configurable); at the end of the SSTable is a block index used to locate blocks, and this index is loaded into memory when the SSTable is opened. A lookup first does a binary search in the in-memory index to find the right block, and then a single disk seek reads that block. Alternatively, the whole SSTable can be mapped into memory, so lookups and scans never touch disk at all. This is essentially the format of HFile version 1; here is a diagram to get a feel for it:
In practice, HBase ran into the following problems with this version of HFile (see here):
1. Memory usage while parsing is relatively high.
2. The Bloom Filter and the block index can grow very large, which hurts startup performance. Concretely, the Bloom Filter can grow to 100MB per HFile and the block index to 300MB; with 20 HRegions on one HRegionServer, they can grow to 2GB and 6GB respectively. An HRegion must load all of its block indexes into memory when it opens, which hurts startup time; and on the first request the entire Bloom Filter must be loaded into memory before the lookup can begin, so an oversized Bloom Filter increases first-request latency.
HFile version 2 addresses these problems with several optimizations, which will be covered in detail when we dissect HFile.
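To make the lookup path from the definition concrete, here is a minimal sketch in Python (illustrative only; Bigtable's real implementation is in C++ and HFile has its own on-disk layout). The class name and in-memory lists are made up; the lists stand in for the on-disk 64KB blocks.

```python
import bisect

class SSTableSketch:
    """Toy model of the SSTable format: a sorted, immutable map split
    into fixed-size blocks, with a block index kept in memory."""

    def __init__(self, sorted_items, block_size=2):
        # `blocks` stands in for the on-disk (typically 64KB) blocks.
        self.blocks = [sorted_items[i:i + block_size]
                       for i in range(0, len(sorted_items), block_size)]
        # index[i] = first key of block i; Bigtable loads this index
        # into memory when the SSTable is opened.
        self.index = [block[0][0] for block in self.blocks]

    def get(self, key):
        # Binary search the in-memory index to pick the single block
        # that could hold `key`, then scan that one block: this is the
        # "single disk seek" lookup from the paper.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:
            if k == key:
                return v
        return None

table = SSTableSketch([(b"a", b"1"), (b"b", b"2"), (b"c", b"3"),
                       (b"d", b"4"), (b"e", b"5")])
value = table.get(b"d")   # one index search, then one block scan
```

The key property is that only the small index lives in memory; each lookup touches exactly one block.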
Using SSTable as storage: Continuing through the BigTable paper, section 5.3 Tablet Serving reads (page 6):
Tablet Serving
Updates are committed to a commit log that stores redo records. Of these updates, the recently committed ones are stored in memory in a sorted buffer called a memtable; the older updates are stored in a sequence of SSTables. To recover a tablet, a tablet server reads its metadata from the METADATA table. This metadata contains the list of SSTables that comprise a tablet and a set of redo points, which are pointers into any commit logs that may contain data for the tablet. The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server, the server checks that it is well-formed, and that the sender is authorized to perform the mutation. Authorization is performed by reading the list of permitted writers from a Chubby file (which is almost always a hit in the Chubby client cache). A valid mutation is written to the commit log. Group commit is used to improve the throughput of lots of small mutations [13, 16]. After the write has been committed, its contents are inserted into the memtable.

When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently.

Incoming read and write operations can continue while tablets are split and merged.
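The merged view in the third paragraph is just a k-way merge of sorted sources in which the newest source wins on duplicate keys. A minimal sketch (hypothetical Python; plain sorted lists stand in for the memtable and SSTables):

```python
import heapq

def _tag(source, age):
    # Attach the source's age to each entry so the k-way merge orders
    # duplicate keys newest-first.
    for key, value in source:
        yield key, age, value

def merged_view(memtable, *sstables):
    """Merge-read sketch: the memtable and every SSTable are sorted by
    key. Sources are listed newest-first; on duplicate keys the entry
    from the newest source shadows the older ones."""
    tagged = [_tag(src, age)
              for age, src in enumerate([memtable, *sstables])]
    last_key = object()
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:            # skip shadowed older entries
            yield key, value
            last_key = key

memtable = [("b", "B-new")]
sstable1 = [("a", "A"), ("b", "B-old")]
sstable2 = [("c", "C")]
view = list(merged_view(memtable, sstable1, sstable2))
# view == [("a", "A"), ("b", "B-new"), ("c", "C")]
```

Because every source is already sorted, the merge is a streaming pass: this is why the paper can say the merged view "can be formed efficiently".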
A brief description of the first and third paragraphs, not a translation: when new data is written, the operation is first committed to the log as a redo record; the most recent data lives in the memtable, a sorted in-memory buffer, while older data lives in a sequence of SSTables. During recovery, the tablet server reads its metadata from the METADATA table; this metadata lists all the SSTables that make up the tablet (recording per-SSTable metadata such as location, StartKey, EndKey, and so on) along with a set of redo points into the logs. The tablet server reads the SSTable indices into memory and replays the updates after those redo points to reconstruct the memtable.

On a read, after the well-formedness and authorization checks, the read consults the SSTables and the memtable together (in HBase this also includes the data in the BlockCache) and merges their results; since both the SSTables and the memtable are sorted lexicographically, the merge can be done efficiently.

How SSTables are used during compaction: Section 5.4 Compaction of the BigTable paper says:
Compaction
As write operations execute, the size of the memtable increases. When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This minor compaction process has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery if this server dies. Incoming read and write operations can continue while compactions occur.

Every minor compaction creates a new SSTable. If this behavior continued unchecked, read operations might need to merge updates from an arbitrary number of SSTables. Instead, we bound the number of such files by periodically executing a merging compaction in the background. A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable. The input SSTables and memtable can be discarded as soon as the compaction has finished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction. SSTables produced by non-major compactions can contain special deletion entries that suppress deleted data in older SSTables that are still live. A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data. Bigtable cycles through all of its tablets and regularly applies major compactions to them. These major compactions allow Bigtable to reclaim resources used by deleted data, and also allow it to ensure that deleted data disappears from the system in a timely fashion, which is important for services that store sensitive data.
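The distinction the paper draws between merging and major compactions comes down to how deletion entries are treated. A hedged sketch (an assumed Python model; `TOMBSTONE` is an invented stand-in for Bigtable's special deletion entries):

```python
import heapq

TOMBSTONE = object()   # invented stand-in for a deletion entry

def _tag(source, age):
    # Tag entries with the source's age so duplicates merge newest-first.
    for key, value in source:
        yield key, age, value

def compact(sstables, major=False):
    """Compaction sketch over sorted SSTables, listed newest-first.

    A merging (non-major) compaction must keep deletion entries,
    because older SSTables outside this compaction may still hold the
    deleted key. A major compaction covers all SSTables, so tombstones
    and the data they suppress can both be dropped."""
    tagged = [_tag(src, age) for age, src in enumerate(sstables)]
    out, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                   # shadowed by a newer entry
        last_key = key
        if value is TOMBSTONE and major:
            continue                   # deleted data disappears for good
        out.append((key, value))
    return out

newer = [("a", TOMBSTONE), ("b", "2")]        # "a" was deleted recently
older = [("a", "1"), ("c", "3")]
merging = compact([newer, older])              # keeps the tombstone on "a"
majored = compact([newer, older], major=True)  # drops "a" entirely
```

Note how the merging output still carries the tombstone: if some even older SSTable not part of this compaction contained `"a"`, dropping the tombstone early would resurrect deleted data.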
When a memtable grows past a threshold, it is frozen and a new memtable is created to take its place, while the old one is converted into an SSTable and written to GFS; this process is called a minor compaction. A minor compaction reduces memory usage and shrinks the amount of log that must be replayed on recovery, since persisted data can be dropped from the commit log. Reads and writes can continue while a minor compaction is in progress.
Each minor compaction produces a new SSTable file. If the number of SSTable files grows without bound, read performance suffers, because every read must consult all the SSTable files and merge their results; so there has to be an upper bound on the number of SSTables, and a merging compaction must run in the background from time to time. A merging compaction reads the contents of a few SSTable files and the memtable and writes them out into a single new SSTable; once it finishes, the source SSTables and memtable can be deleted.

A merging compaction that merges all SSTables into exactly one SSTable is called a major compaction. A major compaction physically removes the entries and data that are marked as deleted, whereas the other two kinds of compaction keep that information and data around (in marker form). Bigtable periodically cycles through all its tablets and applies major compactions to them; this truly deletes the data that needs deleting, reclaiming space and ensuring that deleted data disappears from the system in a timely fashion.

SSTable locality and in-memory SSTables: In Bigtable, locality is defined by locality groups: multiple column families can be combined into one locality group, and within a tablet the column families of the same locality group are stored in their own SSTable. HBase simplifies this model: each column family in each HRegion is stored in its own HFile, and HFile has no notion of a locality group; equivalently, each column family is its own locality group. Bigtable can also specify, at the locality-group level, that all data in the group be loaded into memory; in HBase this is set when the column family is defined. The loading is lazy, and the feature is mainly used for small, frequently accessed column families to speed up reads, since their data no longer has to be fetched from disk.

SSTable compression: Bigtable's compression is configured at the locality-group level:
Compression
Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block (whose size is controllable via a locality group specific tuning parameter). Although we lose some space by compressing each block separately, we benefit in that small portions of an SSTable can be read without decompressing the entire file. Many clients use a two-pass custom compression scheme. The first pass uses Bentley and McIlroy's scheme [6], which compresses long common strings across a large window. The second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data. Both compression passes are very fast—they encode at 100–200 MB/s, and decode at 400–1000 MB/s on modern machines.
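Block-level compression is easy to sketch. The two-pass Bentley-McIlroy scheme the paper mentions is not publicly available, so plain zlib stands in here; the point is only that each block is compressed independently, so any one block can be read back without decompressing the whole file:

```python
import zlib

def compress_blocks(data, block_size=16):
    """Split `data` into blocks and compress each block on its own
    (zlib stands in for Bigtable's custom two-pass scheme)."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_block(compressed_blocks, i):
    # Only block i is decompressed; the rest of the "file" is untouched.
    return zlib.decompress(compressed_blocks[i])

data = b"row1:aaaa row2:bbbb row3:cccc row4:dddd"
blocks = compress_blocks(data)
second = read_block(blocks, 1)   # decompresses just the second block
```

The trade-off the paper describes is visible here: per-block compression cannot exploit redundancy across block boundaries, but it makes partial reads cheap.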
Bigtable compresses at the granularity of a single block within an SSTable. Compressing each block separately loses some space, but in exchange we can read, decompress, and examine one block at a time instead of reading and decompressing an entire "large" SSTable every time.

SSTable read caching: To improve read performance, Bigtable uses a two-level caching mechanism:
Caching for read performance
To improve read performance, tablet servers use two levels of caching. The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS. The Scan Cache is most useful for applications that tend to read the same data repeatedly. The Block Cache is useful for applications that tend to read data that is close to the data they recently read (e.g., sequential reads, or random reads of different columns in the same locality group within a hot row).
The two cache levels are:
1. High level: the Scan Cache, which caches key/value pairs read from the SSTable. It speeds up workloads that tend to read the same data repeatedly (temporal locality of reference).
2. Low level: the Block Cache, which caches SSTable blocks. It speeds up workloads that tend to read data close to what they recently read (spatial locality).
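A sketch of how the two levels compose on the read path (hypothetical Python; `load_block` simulates the expensive GFS read, and the tiny LRU is a deliberate stand-in for the real cache implementation):

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU used for both cache levels (illustrative only)."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)   # mark as recently used
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

scan_cache = LRUCache(128)    # higher level: key/value pairs
block_cache = LRUCache(32)    # lower level: raw SSTable blocks

def read(key, load_block, block_of):
    """Check the Scan Cache first; on a miss, go through the Block
    Cache before falling back to the simulated GFS read."""
    value = scan_cache.get(key)
    if value is not None:
        return value                       # Scan Cache hit
    block_id = block_of(key)
    block = block_cache.get(block_id)
    if block is None:
        block = load_block(block_id)       # the expensive GFS read
        block_cache.put(block_id, block)
    value = block[key]
    scan_cache.put(key, value)
    return value
```

Reading the same key twice hits the Scan Cache; reading a different key from the same block hits the Block Cache, so GFS is touched only once per block.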
Bloom Filter: As mentioned earlier, Bigtable uses a merged read: it has to read the relevant data in every SSTable and merge the results into one answer, so every read touches all the SSTables, which costs performance. Bigtable therefore introduces Bloom filters, which can very quickly establish that a RowKey is not in a given SSTable (note: the converse does not hold).
Bloom Filter
As described in Section 5.3, a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory, we may end up doing many disk accesses. We reduce the number of accesses by allowing clients to specify that Bloom filters [7] should be created for SSTables in a particular locality group. A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair. For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations. Our use of Bloom filters also implies that most lookups for non-existent rows or columns do not need to touch disk.
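A minimal Bloom filter, to make the asymmetry concrete. The parameters `m` and `k` are arbitrary here; real systems size them from the expected key count and the target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.

    A negative answer is definitive ("not in this SSTable"); a
    positive answer only means "might be present". That asymmetry is
    exactly what lets a read skip the disk seek for most SSTables."""

    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(2, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add(b"row1/cf:col")
bf.might_contain(b"row1/cf:col")   # True: added keys are always found
bf.might_contain(b"rowX/cf:col")   # almost certainly False
```

Added keys are always reported as possibly present (no false negatives), which is why a "no" answer can safely skip an SSTable entirely.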
The benefits of making SSTables immutable: The very definition of SSTable says it is an immutable ordered map, and this immutability makes the whole system much simpler:
Exploiting Immutability
Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable. For example, we do not need any synchronization of accesses to the file system when reading from SSTables. As a result, concurrency control over rows can be implemented very efficiently. The only mutable data structure that is accessed by both reads and writes is the memtable. To reduce contention during reads of the memtable, we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

Since SSTables are immutable, the problem of permanently removing deleted data is transformed to garbage collecting obsolete SSTables. Each tablet's SSTables are registered in the METADATA table. The master removes obsolete SSTables as a mark-and-sweep garbage collection [25] over the set of SSTables, where the METADATA table contains the set of roots.

Finally, the immutability of SSTables enables us to split tablets quickly. Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.
The advantages of immutability are:
1. No synchronization is needed when reading SSTables. Read/write synchronization only has to be handled in the memtable; to reduce read/write contention there, Bigtable makes each memtable row copy-on-write, so reads and writes can proceed in parallel.
2. Permanently removing deleted data becomes a matter of garbage-collecting obsolete SSTables. Every tablet's SSTables are registered in the METADATA table, and the master removes obsolete ones using a mark-and-sweep garbage collection.
3. Tablet splits become very efficient: instead of creating a new set of SSTables for each child tablet, the children can simply share the parent tablet's SSTables.
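Point 1, copy-on-write memtable rows, can be sketched like this (an assumed toy model, not HBase's or Bigtable's actual code). A writer never mutates a row in place; it copies the row, updates the copy, and swaps the reference, so a reader that grabbed the old row keeps a consistent snapshot without taking any lock on the read path:

```python
import threading

class CowMemtable:
    """Copy-on-write rows: writes build a fresh row dict and swap it
    in; the lock below only serializes writers, never readers."""

    def __init__(self):
        self.rows = {}                 # row_key -> row dict, never mutated
        self._write_lock = threading.Lock()

    def read(self, row_key):
        # Lock-free read: returns whatever snapshot is current; the
        # returned row object is never modified afterwards.
        return self.rows.get(row_key, {})

    def write(self, row_key, column, value):
        with self._write_lock:                          # writers serialized
            new_row = dict(self.rows.get(row_key, {}))  # the copy
            new_row[column] = value
            self.rows[row_key] = new_row                # atomic swap

mt = CowMemtable()
mt.write("r1", "cf:a", 1)
snapshot = mt.read("r1")       # reader grabs the current row
mt.write("r1", "cf:a", 2)      # writer swaps in a new copy
# snapshot["cf:a"] is still 1; mt.read("r1")["cf:a"] is now 2
```

The cost is an extra allocation per write, paid in exchange for reads that never block.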
posted on 2015-09-25 01:35 by DLevin, filed under: HBase