Two-phase commit(http://en.wikipedia.org/wiki/Two-phase_commit_protocol)是分布式事務最基礎的協議,Three-phase commit(http://en.wikipedia.org/wiki/Three-phase_commit_protocol)主要解決Two-phase commit中協調者宕機問題。
Two-phase commit的算法實現 (from <<Distributed System: Principles and Paradigms>>):
協調者(Coordinator):
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
wait for any incoming vote;
if timeout {
write GLOBAL_ABORT to local log;
multicast GLOBAL_ABORT to all participants;
exit;
}
record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
write GLOBAL_COMMIT to local log;
multicast GLOBAL_COMMIT to all participants;
} else {
write GLOBAL_ABORT to local log;
multicast GLOBAL_ABORT to all participants;
}
參與者(Participants)
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
write VOTE_ABORT to local log;
exit;
}
if participant votes COMMIT {
write VOTE_COMMIT to local log;
send VOTE_COMMIT to coordinator;
wait for DECISION from coordinator;
if timeout {
multicast DECISION_REQUEST to other participants;
wait until DECISION is received; /* remain blocked*/
write DECISION to local log;
}
if DECISION == GLOBAL_COMMIT
write GLOBAL_COMMIT to local log;
else if DECISION == GLOBAL_ABORT
write GLOBAL_ABORT to local log;
} else {
write VOTE_ABORT to local log;
send VOTE_ABORT to coordinator;
}
另外,每個參與者維護一個線程專門處理其它參與者的DECISION_REQUEST請求,處理線程流程如下:
while true {
wait until any incoming DECISION_REQUEST is received;
read most recently recorded STATE from the local log;
if STATE == GLOBAL_COMMIT
send GLOBAL_COMMIT to requesting participant;
else if STATE == INIT or STATE == GLOBAL_ABORT;
send GLOBAL_ABORT to requesting participant;
else
skip; /* participant remains blocked */
}
從上述的協調者與參與者的流程可以看出,如果所有參與者VOTE_COMMIT后協調者宕機,這個時候每個參與者都無法單獨決定全局事務的最終結果(GLOBAL_COMMIT還是GLOBAL_ABORT),也無法從其它參與者獲取,整個事務一直阻塞到協調者恢復;如果協調者出現類似磁盤壞這種永久性錯誤,該事務將成為被永久遺棄的孤兒。問題的解決有如下思路:
1. 協調者持久化數據定期備份。為了防止協調者出現永久性錯誤,這是一種代價最小的解決方法,不容易引入bug,但是事務被阻塞的時間可能特別長,比較適合銀行這種正確性高于一切的系統。
2. Three-phase Commit。這是理論上的一種方法,實現起來復雜且效率低。思路如下:假設參與者機器不可能出現超過一半同時宕機的情況,如果協調者宕機,我們需要從活著的超過一半的參與者中得出事務的全局結果。由于不可能知道已經宕機的參與者的狀態,所以引入一個新的參與者狀態PRECOMMIT,參與者成功執行一個事務需要經過INIT, READY, PRECOMMIT,最后到COMMIT狀態;如果至少有一個參與者處于PRECOMMIT或者COMMIT,事務成功;如果至少一個參與者處于INIT或者ABORT,事務失敗;如果所有的參與者都處于READY(至少一半參與者活著),事務失敗,即使原先宕機的參與者恢復后處于PRECOMMIT狀態,也會因為有其它參與者處于ABORT狀態而回滾。PRECOMMIT狀態的引入給了宕機的參與者回滾機會,所以Three-phase commit在超過一半的參與者活著的時候是不阻塞的。不過,Three-phase Commit只能算是是理論上的探索,效率低并且沒有解決網絡分區問題。
3. Paxos解決協調者單點問題。Jim Gray和Lamport合作了一篇論文講這個方法,很適合互聯網公司的超大規模集群,Google的Megastore事務就是這樣實現的,不過問題在于Paxos和Two-phase Commit都不簡單,需要有比較靠譜(代碼質量高)的小團隊設計和編碼才行。后續的blog將詳細闡述該方法。
總之,分布式事務只能是系統開發者的烏托邦式理想,Two-phase commit的介入將導致涉及多臺機器的事務之間完全串行,沒有代價的分布式事務是不存在的。
前面我的一篇文章http://hi.baidu.com/knuthocean/blog/item/12bb9f3dea0e400abba1673c.html引用了對Google App Engine工程師關于Bigtable/Megastore replication的文章。當時留下了很多疑問,比如:為什么Google Bigtable 是按照column family級別而不是按行執行replication的?今天重新思考了Bigtable replication問題,有如下體會:
1. Bigtable/GFS的設計屬于分層設計,和文件系統/數據庫分層設計原理一致,通過系統隔離解決工程上的問題。這種分層設計帶來了兩個問題,一個是性能問題,另外一個就是Replication問題。由于存儲節點和服務節點可能不在一臺機器,理論上總是存在性能問題,這就要求我們在加載/遷移Bigtable子表(Bigtable tablet)的時候考慮本地化因素;另外,GFS有自己的replication機制保證存儲的可靠性,Bigtable通過分離服務節點和存儲節點獲得了很大的靈活性,且Bigtable的宕機恢復時間可以做到很短。對于很多對實時性要求不是特別高的應用Bigtable由于服務節點同時只有一個,既節約資源又避免了單點問題。然后,Bigtable tablet服務過于靈活導致replication做起來極其困難。比如,tablet的分裂和合并機制導致多個tablet(一個只寫,其它只讀)服務同一段范圍的數據變得幾乎不可能。
2. Google replication分為兩種機制,基于客戶端和基于Tablet Server。分述如下:
2-1). 基于客戶端的replication。這種機制比較簡單,實現如下:客戶端讀/寫操作均為異步操作,每個寫操作都嘗試寫兩個Bigtable集群,任何一個寫成功就返回用戶,客戶端維護一個retry list,不斷重試失敗的寫操作。讀操作發到兩個集群,任何一個集群讀取成功均可。然后,這樣做有兩個問題:
a. 客戶端不可靠,可能因為各種問題,包括程序問題退出,retry list丟失導致兩個集群的數據不一致;
b. 多個客戶端并發操作時無法保證順序性。集群A收到的寫操作可能是"DEL item; PUT item";集群B的可能是"PUT item; DEL item"。
2-2). 基于Tablet Server的replication。這種機制實現較為復雜,目的是為了保證讀服務,寫操作的延時仍然可能比較長。兩個集群,一個為主集群,提供讀/寫服務;一個為slave集群,提供只讀服務,兩個集群維持最終一致性。對于一般的讀操作,盡量讀取主集群,如果主集群不可以訪問則讀取slave集群;對于寫操作,首先將寫操作提交到主集群的Tablet Server,主集群的Tablet Server維護slave集群的元數據信息,并維護一個后臺線程不斷地將積攢的用戶表格寫操作提交到slave集群進行日志回放(group commit)。對于一般的tablet遷移,操作邏輯和Bigtable論文中的完全一致;主集群如果發生了機器宕機,則除了回放commit log外,還需要完成宕機的Tablet Server遺留的后臺備份任務。之所以要按照column family級別而不是按行復制,是為了提高壓縮率從而提高備份效率。如果主集群寫操作日志的壓縮率大于備份數據的壓縮率,則可能出現備份不及時,待備份數據越來越多的問題。
假設集群A為主集群,集群B是集群A的備份,集群切換時先停止集群A的寫服務,將集群A余下的備份任務備份到集群B后切換到集群B;如果集群A不可訪問的時間不可預知,可以選擇直接切換到集群B,這樣會帶來一致性問題。且由于Bigtable是按列復制的,最后寫入的一些行的事務性無法保證。不過由于寫操作數據還是保存在集群A的,所以用戶可以知道丟了哪些數據,很多應用可以通過重新執行A集群遺留的寫操作進行災難恢復。Google的App Engine也提供了這種查詢及重做丟失的寫操作的工具。
想法不成熟,有問題聯系:knuthocean@163.com
負載平衡策略
Dynamo的負載平衡取決于如何給每臺機器分配虛擬節點號。由于集群環境的異構性,每臺物理機器包含多個虛擬節點。一般有如下兩種分配節點號的方法:
1. 隨機分配。每臺物理節點加入時根據其配置情況隨機分配S個Token(節點號)。這種方法的負載平衡效果還是不錯的,因為自然界的數據大致是比較隨機的,雖然可能出現某段范圍的數據特別多的情況(如baidu, sina等域名下的網頁特別多),但是只要切分足夠細,即S足夠大,負載還是比較均衡的。這個方法的問題是可控性較差,新節點加入/離開系統時,集群中的原有節點都需要掃描所有的數據從而找出屬于新節點的數據,Merkle Tree也需要全部更新;另外,增量歸檔/備份變得幾乎不可能。
2. 數據范圍等分+隨機分配。為了解決方法1的問題,首先將數據的Hash空間等分為Q = N * S份 (N=機器個數,S=每臺機器的虛擬節點數),然后每臺機器隨機選擇S個分割點作為Token。和方法1一樣,這種方法的負載也比較均衡,且每臺機器都可以對屬于每個范圍的數據維護一個邏輯上的Merkle Tree,新節點加入/離開時只需掃描部分數據進行同步,并更新這部分數據對應的邏輯Merkle Tree,增量歸檔也變得簡單。該方法的一個問題是對機器規模需要做出比較合適的預估,隨著業務量的增長,可能需要重新對數據進行劃分。
不管采用哪種方法,Dynamo的負載平衡效果還是值得擔心的。
客戶端緩存及前后臺任務資源分配
客戶端緩存機器信息可以減少一次在DHT中定位目標機器的網絡交互。由于客戶端數量不可控,這里緩存采用客戶端pull的方式更新,Dynamo中每隔10s或者讀/寫操作發現緩存信息不一致時客戶端更新一次緩存信息。
Dynamo中同步操作、寫操作重試等后臺任務較多,為了不影響正常的讀寫服務,需要對后臺任務能夠使用的資源做出限制。Dynamo中維護一個資源授權系統。該系統將整個機器的資源切分成多個片,監控60s內的磁盤讀寫響應時間,事務超時時間及鎖沖突情況,根據監控信息算出機器負載從而動態調整分配給后臺任務的資源片個數。
Dynamo的優點
1. 設計簡單,組合利用P2P的各種成熟技術,模塊劃分好,代碼復用程度高。
2. 分布式邏輯與單機存儲引擎邏輯基本隔離。很多公司有自己的單機存儲引擎,可以借鑒Dynamo的思想加入分布式功能。
3. NWR策略可以根據應用自由調整,這個思想已經被Google借鑒到其下一代存儲基礎設施中。
4. 設計上天然沒有單點,且基本沒有對系統時鐘一致性的依賴。而在Google的單Master設計中,Master是單點,需要引入復雜的分布式鎖機制來解決,且Lease機制需要對機器間時鐘同步做出假設。
Dynamo的缺陷
1. 負載平衡相比單Master設計較不可控;負載平衡策略一般需要預估機器規模,不能無縫地適應業務動態增長。
2. 系統的擴展性較差。由于增加機器需要給機器分配DHT算法所需的編號,操作復雜度較高,且每臺機器存儲了整個集群的機器信息及數據文件的Merkle Tree信息,機器最大規模只能到幾千臺。
3. 數據一致性問題。多個客戶端的寫操作有順序問題,而在GFS中可以通過只允許Append操作得到一個比較好的一致性模型。
4. 數據存儲不是有序,無法執行Mapreduce;Mapreduce是目前允許機器故障,具有強擴展性的最好的并行計算模型,且有開源的Hadoop可以直接使用,Dynamo由于數據存儲依賴Hash無法直接執行Mapreduce任務。
異常處理
Dynamo中把異常分為兩種類型,臨時性的異常和永久性異常。服務器程序運行時一般通過類似supervise的監控daemon啟動,出現core dump等異常情況時自動重啟。這種異常是臨時性的,其它異常如硬盤報修或機器報廢等由于其持續時間太長,稱之為永久性的。回顧Dynamo的設計,一份數據被寫到N, N+1, ... N+K-1這K臺機器上,如果機器N+i (0 <= i <= K-1)宕機,原本寫入該機器的數據轉移到機器N+K,機器N+K定時ping機器N+i,如果在指定的時間T內N+i重新提供服務,機器N+K將啟動傳輸任務將暫存的數據發送給機器N+i;如果超過了時間T機器N+i還是處于宕機狀態,這種異常被認為是永久性的,這時需要借助Merkle Tree機制進行數據同步。這里的問題在于時間T的選擇,所以Dynamo的開發人員后來干脆把所有程序檢測出來的異常認為是臨時性的,并提供給管理員一個utility工具,用來顯示指定一臺機器永久性下線。由于數據被存儲了K份,一臺機器下線將導致后續的K臺機器出現數據不一致的情況。這是因為原本屬于機器N的數據由于機器下線可能被臨時寫入機器N+1, ... N+K。如果機器N出現永久性異常,后續的K臺機器都需要服務它的部分數據,這時它們都需要選擇冗余機器中較為空閑的一臺進行同步。Merkle Tree同步的原理很簡單,每個非葉子節點對應多個文件,為其所有子節點值組合以后的Hash值,葉子節點對應單個數據文件,為文件內容的Hash值。這樣,任何一個數據文件不匹配都將導致從該文件對應的葉子節點到根節點的所有節點值不同。每臺機器維護K棵Merkle Tree,機器同步時首先傳輸Merkle Tree信息,并且只需要同步從根到葉子的所有節點值均不相同的文件。
讀/寫流程
客戶端的讀/寫請求首先傳輸到緩存的一臺機器,根據預先配置的K、W和R值,對于寫請求,根據DHT算法計算出數據所屬的節點后直接寫入后續的K個節點,等到W個節點返回成功時返回客戶端,如果寫請求失敗將加入retry_list不斷重試。如果某臺機器發生了臨時性異常,將數據寫入后續的備用機器并在備用機器中記錄臨時異常的機器信息。對于讀請求,根據DHT算法計算出數據所屬節點后根據負載策略選擇R個節點,從中讀取R份數據,如果數據一致,直接返回客戶端;如果數據不一致,采用vector clock的方法解決沖突。Dynamo系統默認的策略是選擇最新的數據,當然用戶也可以自定義沖突處理方法。每個寫入系統的<key, value>對都記錄一個vector lock信息,vector lock就是一系列<機器節點號, 版本號/時間戳>對,記錄每臺機器對該數據的最新更新版本信息。如下圖:
讀取時進行沖突解決,如果一臺機器讀到的數據的vector lock記錄的所有版本信息都小于另一臺機器,直接返回vector lock較大的數據;如果二者是平行版本,根據時間戳選擇最新的數據或者通過用戶自定義策略解決沖突。讀請求除了返回數據<key, value>值以外還返回vector lock信息,后續的寫操作需要帶上該信息。
問題1:垃圾數據如何回收?
Dynamo的垃圾回收機制主要依賴每個節點上的存儲引擎,如Berkely db存儲引擎,merge-dump存儲引擎等。其它操作,如Merkle Tree同步產生的垃圾文件回收可以和底層存儲引擎配合完成。
問題2:Dynamo有沒有可能丟數據?
關鍵在于K, W, R的設置。假設一個讀敏感應用設置K=3, W=3, R=1,待處理的數據原本屬于節點A, B, C,節點B出現臨時性故障的過程中由節點D代替。在節點B出現故障到節點B同步完成節點D暫存的修改這段時間內,如果讀請求落入節點B或者D都將出現丟數據的問題。這里需要適當處理下,對于B節點下線的情況,由于其它機器要么緩存了B節點已下線信息,要么讀取時將發現B節點處于下線狀態,這是只需要將請求轉發其它節點即可;對于B節點上線情況,可以等到B節點完全同步以后才開始提供讀服務。對于設置W<K的應用,Dynamo讀取時需要解決沖突,可能丟數據。總之,Dynamo中可以保證讀取的機器都是有效的(處于正常服務狀態),但W != K時不保證所有的有效機器均同步了所有更新操作。
問題3:Dynamo的寫入數據有沒有順序問題?
假設要寫入兩條數據"add item"和"delete item",如果寫入的順序不同,將導致完全不同的結果。如果設置W=K,對于同一個客戶端,由于寫入所有的機器以后才返回,可以保證順序;而多個客戶端的寫操作可能被不同的節點處理,不能保證順序性。如果設置W < K,Dynamo不保證順序性。
問題4:沖突解決后是否需要將結果值更新存儲節點?
讀操作解決沖突后不需要將結果值更新存儲節點。產生沖突的情況一般有機器下線或者多個客戶端導致的順序問題。機器下線時retry_list中的操作將丟失,某些節點不能獲取所有的更新操作。對于機器暫時性或者永久性的異常,Dynamo中內部都有同步機制進行處理,但是對于retry_list中的操作丟失或者多個客戶端引發的順序問題,Dynamo內部根本無法分辨數據是否正確。唯一的沖突解決機器在讀操作,Dynamo可以設計成讀操作將沖突解決結果值更新存儲節點,但是這樣會使讀操作變得復雜和不高效。所以,比較好的做法是每個寫操作都帶上讀操作返回的多個版本數據,寫操作將沖突處理的結果更新存儲節點。
Best Paper:
Tolerating File-System Mistakes with EnvyFS
Lakshmi N. Bairavasundaram, NetApp., Inc.; Swaminathan Sundararaman, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison
NSDI ‘09
Best Paper:
TrInc: Small Trusted Hardware for Large Distributed Systems
Dave Levin, University of Maryland; John R. Douceur, Jacob R. Lorch, and Thomas Moscibroda, Microsoft Research
Best Paper:
Sora: High Performance Software Radio Using General Purpose Multi-core Processors
Kun Tan and Jiansong Zhang, Microsoft Research Asia; Ji Fang, Beijing Jiaotong University; He Liu, Yusheng Ye, and Shen Wang, Tsinghua University; Yongguang Zhang, Haitao Wu, and Wei Wang, Microsoft Research Asia; Geoffrey M. Voelker, University of California, San Diego
FAST ‘09
Best Paper:
CA-NFS: A Congestion-Aware Network File System
Alexandros Batsakis, NetApp and
OSDI ‘08
Jay Lepreau Best Paper:
Difference Engine: Harnessing Memory Redundancy in Virtual Machines
Diwaker Gupta, University of California, San Diego; Sangmin Lee, University of Texas at Austin; Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, and Amin Vahdat, University of California, San Diego
Jay Lepreau Best Paper:
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Yuan Yu, Michael Isard, Dennis Fetterly, and Mihai Budiu, Microsoft Research Silicon Valley; Úlfar Erlingsson, Reykjavík University, Iceland, and Microsoft Research Silicon Valley; Pradeep Kumar Gunda and Jon Currey, Microsoft Research Silicon Valley
Jay Lepreau Best Paper:
KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs
Cristian Cadar, Daniel Dunbar, and
LISA ‘08
Best Paper:
ENAVis: Enterprise Network Activities Visualization
Qi Liao, Andrew Blaich, Aaron Striegel, and Douglas Thain, University of Notre Dame
Best Student Paper:
Automatic Software Fault Diagnosis by Exploiting Application Signatures
Xiaoning Ding, The Ohio State University; Hai Huang, Yaoping Ruan, and Anees Shaikh, IBM T.J. Watson Research Center; Xiaodong Zhang, The Ohio State University
USENIX Security ‘08
Best Paper:
Highly Predictive Blacklisting
Jian Zhang and Phillip Porras, SRI International; Johannes Ullrich, SANS Institute
Best Student Paper:
Lest We Remember: Cold Boot Attacks on Encryption Keys
J. Alex Halderman, Princeton University; Seth D. Schoen, Electronic Frontier Foundation; Nadia Heninger and William Clarkson, Princeton University; William Paul, Wind River Systems; Joseph A. Calandrino and Ariel J. Feldman, Princeton University; Jacob Appelbaum; Edward W. Felten, Princeton University
USENIX ‘08
Best Paper:
Decoupling Dynamic Program Analysis from Execution in Virtual Environments
Jim Chow, Tal Garfinkel, and Peter M. Chen, VMware
Best Student Paper:
Vx32: Lightweight User-level Sandboxing on the x86
Bryan Ford and Russ Cox, Massachusetts Institute of Technology
NSDI ‘08
Best Paper:
Remus: High Availability via Asynchronous Virtual Machine Replication
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, and Norm Hutchinson, University of British Columbia; Andrew Warfield, University of British Columbia and Citrix Systems, Inc.
Best Paper:
Consensus Routing: The Internet as a Distributed System
John P. John, Ethan Katz-Bassett, Arvind Krishnamurthy, and Thomas Anderson, University of Washington; Arun Venkataramani, University of Massachusetts Amherst
LEET ‘08
Best Paper:
Designing and Implementing Malicious Hardware (PDF) or read in HTML
Samuel T. King, Joseph Tucek, Anthony Cozzie, Chris Grier, Weihang Jiang, and Yuanyuan Zhou, University of Illinois at Urbana-Champaign
FAST ‘08
Best Paper:
Portably Solving File TOCTTOU Races with Hardness Amplification
Dan Tsafrir, IBM T.J. Watson Research Center; Tomer Hertz, Microsoft Research; David Wagner, University of California, Berkeley; Dilma Da Silva, IBM T.J. Watson Research Center
LISA ‘07
Best Paper:
Application Buffer-Cache Management for Performance: Running the World’s Largest MRTG
David Plonka, Archit Gupta, and Dale Carder, University of Wisconsin Madison
Best Paper:
PoDIM: A Language for High-Level Configuration Management
Thomas Delaet and Wouter Joosen, Katholieke Universiteit Leuven, Belgium
16th USENIX Security Symposium
Best Paper:
Towards Automatic Discovery of Deviations in Binary Implementations with Applications to Error Detection and Fingerprint Generation
David Brumley, Juan Caballero, Zhenkai Liang, James Newsome, and Dawn Song, Carnegie Mellon University
Best Student Paper:
Keep Your Enemies Close: Distance Bounding Against Smartcard Relay Attacks
Saar Drimer and Steven J. Murdoch, Computer Laboratory, University of Cambridge
USENIX ‘07
Best Paper:
Hyperion: High Volume Stream Archival for Retrospective Querying
Peter Desnoyers and Prashant Shenoy, University of Massachusetts Amherst
Best Paper:
SafeStore: A Durable and Practical Storage System
Ramakrishna Kotla, Lorenzo Alvisi, and Mike Dahlin, The University of Texas at Austin
NSDI ‘07
Best Paper:
Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code
Charles Killian, James W. Anderson, Ranjit Jhala, and Amin Vahdat, University of California, San Diego
Best Student Paper:
Do Incentives Build Robustness in BitTorrent?
Michael Piatek, Tomas Isdal, Thomas Anderson, and Arvind Krishnamurthy, University of Washington; Arun Venkataramani, University of Massachusetts Amherst
FAST ‘07
Best Paper:
Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
Bianca Schroeder and Garth A. Gibson, Carnegie Mellon University
Best Paper:
TFS: A Transparent File System for Contributory Storage
James Cipar, Mark D. Corner, and Emery D. Berger, University of Massachusetts Amherst
LISA ‘06
Best Paper:
A Platform for RFID Security and Privacy Administration
Melanie R. Rieback, Vrije Universiteit Amsterdam; Georgi N. Gaydadjiev, Delft University of Technology; Bruno Crispo, Rutger F.H. Hofman, and Andrew S. Tanenbaum, Vrije Universiteit Amsterdam
Honorable Mention:
A Forensic Analysis of a Distributed Two-Stage Web-Based Spam Attack
Daniel V. Klein, LoneWolf Systems
OSDI ‘06
Best Paper:
Rethink the Sync
Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn, University of Michigan
Best Paper:
Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.
15th USENIX Security Symposium
Best Paper:
Evaluating SFI for a CISC Architecture
Stephen McCamant, Massachusetts Institute of Technology; Greg Morrisett, Harvard University
Best Student Paper:
Keyboards and Covert Channels
Gaurav Shah, Andres Molina, and Matt Blaze, University of Pennsylvania
2006 USENIX Annual Technical Conference
Best Paper:
Optimizing Network Virtualization in Xen
Aravind Menon, EPFL; Alan L. Cox, Rice University; Willy Zwaenepoel, EPFL
Best Paper:
Replay Debugging for Distributed Applications
Dennis Geels, Gautam Altekar, Scott Shenker, and Ion Stoica, University of California, Berkeley
NSDI ‘06
Best Paper:
Experience with an Object Reputation System for Peer-to-Peer Filesharing
Kevin Walsh and Emin Gün Sirer, Cornell University
FAST ‘05
Best Paper:
Ursa Minor: Versatile Cluster-based Storage
Michael Abd-El-Malek, William V. Courtright II, Chuck Cranor, Gregory R. Ganger, James Hendricks, Andrew J. Klosterman, Michael Mesnier, Manish Prasad, Brandon Salmon, Raja R. Sambasivan, Shafeeq Sinnamohideen, John D. Strunk, Eno Thereska, Matthew Wachs, and Jay J. Wylie, Carnegie Mellon University
Best Paper:
On Multidimensional Data and Modern Disks
Steven W. Schlosser, Intel Research Pittsburgh; Jiri Schindler, EMC Corporation; Stratos Papadomanolakis, Minglong Shao, Anastassia Ailamaki, Christos Faloutsos, and Gregory R. Ganger, Carnegie Mellon University
LISA ‘05
Best Paper:
Toward a Cost Model for System Administration
Alva L. Couch, Ning Wu, and Hengky Susanto, Tufts University
Best Student Paper:
Toward an Automated Vulnerability Comparison of Open Source IMAP Servers
Chaos Golubitsky, Carnegie Mellon University
Best Student Paper:
Reducing Downtime Due to System Maintenance and Upgrades
Shaya Potter and Jason Nieh, Columbia University
IMC 2005
Best Student Paper:
Measurement-based Characterization of a Collection of On-line Games
Chris Chambers and Wu-chang Feng, Portland State University; Sambit Sahu and Debanjan Saha, IBM Research
Security ‘05
Best Paper:
Mapping Internet Sensors with Probe Response Attacks
John Bethencourt, Jason Franklin, and Mary Vernon University of Wisconsin, Madison
Best Student Paper:
Security Analysis of a Cryptographically-Enabled RFID Device
Steve Bono, Matthew Green, and Adam Stubblefield, Johns Hopkins University; Ari Juels, RSA Laboratories; Avi Rubin, Johns Hopkins University; Michael Szydlo, RSA Laboratories
MobiSys ‘05
Best Paper:
Reincarnating PCs with Portable SoulPads
Ramón Cáceres, Casey Carter, Chandra Narayanaswami, and Mandayam Raghunath, IBM T.J. Watson Research Center
NSDI ‘05
Best Paper:
Detecting BGP Configuration Faults with Static Analysis
Nick Feamster and Hari Balakrishnan, MIT Computer Science and Artificial Intelligence Laboratory
Best Student Paper:
Botz-4-Sale: Surviving Organized DDoS Attacks That Mimic Flash Crowds
Srikanth Kandula and Dina Katabi, Massachusetts Institute of Technology; Matthias Jacob, Princeton University; Arthur Berger, Massachusetts Institute of Technology/Akamai
2005 USENIX Annual Technical Conference
General Track
Best Paper:
Debugging Operating Systems with Time-Traveling Virtual Machines
Samuel T. King, George W. Dunlap, and Peter M. Chen, University of Michigan
Best Student Paper:
Itanium—A System Implementor’s Tale
Charles Gray, University of New South Wales; Matthew Chapman and Peter Chubb, University of New South Wales and National ICT Australia; David Mosberger-Tang, Hewlett-Packard Labs; Gernot Heiser, University of New South Wales and National ICT Australia
At Google, we've learned through experience to treat everything with healthy skepticism. We expect that servers, racks, shared GFS cells, and even entire datacenters will occasionally go down, sometimes with little or no warning. This has led us to try as hard as possible to design our products to run on multiple servers, multiple cells, and even multiple datacenters simultaneously, so that they keep running even if any one (or more) redundant underlying parts go down. We call this multihoming. It's a term that usually applies narrowly, to networking alone, but we use it much more broadly in our internal language.
Multihoming is straightforward for read-only products like web search, but it's more difficult for products that allow users to read and write data in real time, like GMail, Google Calendar, and App Engine. I've personally spent a while thinking about how multihoming applies to the App Engine datastore. I even gave a talk about it at this year's Google I/O.
While I've got you captive, I'll describe how multihoming currently works in App Engine, and how we're going to improve it with a release next week. I'll wrap things up with more detail about App Engine's maintenance schedule.
When we launched App Engine, the datastore served each application's data out of one datacenter at a time. Data was replicated to other datacenters in the background, using
For example, if the datastore was serving data for some apps from datacenter A, and we needed to switch to serving their data from datacenter B, we simply flipped the datastore to read only mode, waited for Bigtable replication to flush any remaining writes from A to B, then flipped the switch back and started serving in read/write mode from B. This generally works well, but it depends on the Bigtable cells in both A and B to be healthy. Of course, we wouldn't want to move to B if it was unhealthy, but we definitely would if B was healthy but A wasn't. Google continuously monitors the overall health of App Engine's underlying services, like GFS and Bigtable, in all of our datacenters. However, unexpected problems can crop up from time to time. When that happens, having backup options available is crucial. You may remember the unplanned outage we had a few months ago. We published a detailed postmortem; in a nutshell, the shared GFS cell we use went down hard, which took us down as well, and it took a while to get the GFS cell back up. The GFS cell is just one example of the extent to which we use shared infrastructure at Google. It's one of our greatest strengths, in my opinion, but it has its drawbacks. One of the most noticeable drawback is loss of isolation. When a piece of shared infrastructure has problems or goes down, it affects everything that uses it. In the example above, if the Bigtable cell in A is unhealthy, we're in trouble. Bigtable replication is fast, but it runs in the background, so it's usually at least a little behind, which is why we wait for that final flush before switching to B. If A is unhealthy, some of its data may be unavailable for extended periods of time. We can't get to it, so we can't flush it, we can't switch to B, and we're stuck in A until its Bigtable cell recovers enough to let us finish the flush. In extreme cases like this, we might not know how soon the data in A will become available. Rather than waiting indefinitely for A to recover, we'd like to have the option to cut our losses and serve out of B instead of A, even if it means a small, bounded amount of disruption to application data. Following our example, that extreme recovery scenario would go something like this: Naturally, when A comes back online, we can recover that unreplicated data, but if we've already started serving from B, we can't automatically copy it over from A, since there may have been conflicting writes in B to the same entities. If your app had unreplicated writes, we can at least provide you with a full dump of those writes from A, so that your data isn't lost forever. We can also provide you with tools to relatively easily apply those unreplicated writes to your current datastore serving out of B. Unfortunately, Bigtable replication on its own isn't quite enough for us to implement the extreme recovery scenario above. We use Bigtable single-row transactions, which let us do read/modify/write operations on multiple columns in a row, to make our datastore writes transactional and consistent. Unfortunately, Bigtable replication operates at the column value level, not the row level. This means that after a Bigtable transaction in A that updates two columns, one of the new column values could be replicated to B but not the other. If this happened, and we switched to B without flushing the other column value, the datastore would be internally inconsistent and difficult to recover to a consistent state without the data in A. In our July 2nd outage, it was partly this expectation of internal inconsistency that prevented us from switching to datacenter B when A became unhealthy. Thankfully, there's a solution to our consistency problem: Megastore replication. Megastore is an internal library on top of Bigtable that supports declarative schemas, multi-row transactions, secondary indices, and recently, consistent replication across datacenters. The App Engine datastore uses Megastore liberally. We don't need all of its features - declarative schemas, for example - but we've been following the consistent replication feature closely during its development. Megastore replication is similar to Bigtable replication in that it replicates data across multiple datacenters, but it replicates at the level of entire entity group transactions, not individual Bigtable column values. Furthermore, transactions on a given entity group are always replicated in order. This means that if Bigtable in datacenter A becomes unhealthy, and we must take the extreme option to switch to B before all of the data in A has flushed, B will be consistent and usable. Some writes may be stuck in A and unavailable in B, but B will always be a consistent recent snapshot of the data in A. Some scattered entity groups may be stale, ie they may not reflect the most recent updates, but we'd at least be able to start serving from B immediately, as opposed waiting for A to recover. Megastore replication was originally intended to replicate across multiple datacenters synchronously and atomically, using Paxos. Unfortunately, as I described in my Google I/O talk, the latency of Paxos across datacenters is simply too high for a low-level, developer facing storage system like the App Engine datastore. Due to that, we've been working with the Megastore team on an alternative: asynchronous, background replication similar to Bigtable's. This system maintains the write latency our developers expect, since it doesn't replicate synchronously (with Paxos or otherwise), but it's still consistent and fast enough that we can switch datacenters at a moment's notice with a minimum of unreplicated data. We've had a fully functional version of asynchronous Megastore replication for a while. We've been testing it heavily, working out the kinks, and stressing it to make sure it's robust as possible. We've also been using it in our internal version of App Engine for a couple months. I'm excited to announce that we'll be migrating the public App Engine datastore to use it in a couple weeks, on September 22nd. This migration does require some datastore downtime. First, we'll switch the datastore to read only mode for a short period, probably around 20-30 minutes, while we do our normal data replication flush, and roll forward any transactions that have been committed but not fully applied. Then, since Megastore replication uses a new transaction log format, we need to take the entire datastore down while we drop and recreate our transaction log columns in Bigtable. We expect this to only take a few minutes. After that, we'll be back up and running on Megastore replication! As described, Megastore replication will make App Engine much more resilient to hiccoughs and outages in individual datacenters and significantly reduce the likelihood of extended outages. It also opens the door to two new options which will give developers more control over how their data is read and written. First, we're exploring allowing reads from the non-primary datastore if the primary datastore is taking too long to respond, which could decrease the likelihood of timeouts on read operations. Second, we're exploring full Paxos for write operations on an opt-in basis, guaranteeing data is always synchronously replicated across datacenters, which would increase availability at the cost of additional write latency. Both of these features are speculative right now, but we're looking forward to allowing developers to make the decisions that fit their applications best! Finally, a word about our maintenance schedule. App Engine's scheduled maintenance periods usually correspond to shifts in primary application serving between datacenters. Our maintenance periods usually last for about an hour, during which application serving is continuous, but access to the Datastore and memcache may be read-only or completely unavailable. We've recently developed better visibility into when we expect to shift datacenters. This information isn't perfect, but we've heard from many developers that they'd like more advance notice from App Engine about when these maintenance periods will occur. Therefore, we're happy to announce below the preliminary maintenance schedule for the rest of 2009. We don't expect this information to change, but if it does, we'll notify you (via the App Engine Downtime Notify Google Group) as soon as possible. The App Engine team members are personally dedicated to keeping your applications serving without interruption, and we realize that weekday maintenance periods aren't ideal for many. However, we've selected the day of the week and time of day for maintenance to balance disruption to App Engine developers with availability of the full engineering teams of the services App Engine relies upon, like GFS and Bigtable. In the coming months, we expect features like Megastore replication to help reduce the length of our maintenance periods. 閱讀全文
Planning for trouble
We give up on flushing the most recent writes in A that haven't replicated to B, and switch to serving the data that is in B. Thankfully, there isn't much data in A that hasn't replicated to B, because replication is usually quite fast. It depends on the nature of the failure, but the window of unreplicated data usually only includes a small fraction of apps, and is often as small as a few thousand recent puts, deletes, and transaction commits, across all affected apps.
Megastore replication saves the day!
To Paxos or not to Paxos
Onward and upward
Planning for scheduled maintenance
類別:默認分類 查看評論
文章來源:http://hi.baidu.com/knuthocean/blog/item/12bb9f3dea0e400abba1673c.html