jinfeng_wang

http://weizijun.cn/2015/04/30/Raft%E5%8D%8F%E8%AE%AE%E5%AE%9E%E6%88%98%E4%B9%8BRedis%20Sentinel%E7%9A%84%E9%80%89%E4%B8%BELeader%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90/

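Raft is a protocol for solving the consistency problem in distributed systems. For a long time, Paxos was practically a synonym for distributed consensus. But Paxos is hard to understand and even harder to implement; even Chubby, the distributed lock system built by Google's engineers, ran into many pitfalls. Raft was designed from the start to be easy to implement and comfortable to understand for a broad audience. It also has to let people form an intuition for it, so that system builders can make the extensions that are inevitable in practice.

This article follows the concrete flow and source code by which a Redis Sentinel cluster picks a Leader, and uses it to describe the Leader election algorithm of the Raft protocol. For an introduction to Redis Sentinel, see my other article, "redis sentinel design and implementation".

When a Sentinel in the cluster finds that the master is objectively down, the failover flow begins. Its first step is to elect a Leader among the Sentinels, and the Leader then carries out the failover.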

Raft election process

A few concepts are needed before describing the Raft election.

Node states

Raft describes three node states: Leader, Follower, and Candidate. When the system is running normally there are only Leader and Follower nodes: one Leader, with every other node a Follower. Candidate is an intermediate state that appears while the system is unstable. When a Follower sees that the Leader's heartbeat is abnormal, it becomes a Candidate and campaigns to be the new Leader by sending vote requests to the other nodes. If a majority of nodes vote for it, it replaces the old Leader and becomes the new Leader, and the old Leader is demoted to Follower.

term

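In a distributed system, keeping the clocks of all nodes synchronized is a hard problem, yet some notion of time is indispensable for recognizing stale information. To solve this, Raft introduces the concept of a term. Raft divides time into consecutive terms, which can be thought of as a kind of "logical time".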

RPC

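Raft uses two kinds of RPC during the election phase: RequestVote and AppendEntries.

RequestVote is sent by a Candidate to ask the other nodes for their vote.
AppendEntries is sent by a node that has won a majority of the votes and become Leader, to confirm its leadership to the other nodes.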

Election flow

Raft uses a heartbeat mechanism to trigger Leader election. After the system starts, all nodes are initialized as Followers with term 0. A node that receives RequestVote or AppendEntries stays a Follower. If it receives no AppendEntries message for a while, until its election timeout expires, then no Leader has shown up within that node's timeout, so the Follower turns into a Candidate and starts campaigning for Leader. Once it becomes a Candidate, the node immediately does the following:

1. Increments its term.
2. Starts a new timer.
3. Votes for itself.
4. Sends RequestVote to all other nodes and waits for their replies.

If it receives an AppendEntries from another node during this process, a Leader already exists, so it turns back into a Follower and the election ends.

If it receives votes from a majority of nodes before the timer expires, it becomes Leader and sends AppendEntries to all other nodes to announce that it is now the Leader.

Each node can cast only one vote per term, first come, first served. A Candidate, as mentioned above, has already voted for itself; a Follower votes for the first node whose RequestVote it receives. Every Follower has a timer. If the timer expires without a heartbeat RPC from the Leader, the Follower turns into a Candidate and starts requesting votes, following the campaign steps above.

If several Candidates start voting and none of them gets a majority (a split vote), each one waits for its timer to expire, becomes a Candidate again, and repeats the campaign steps above.

Raft's timers use randomized timeouts, and this is the key to electing a Leader. Each node's timeout is chosen at random between 1x and 2x the configured time. Because of this randomness, Followers generally turn into Candidates at different moments. Within a term, the node that becomes a Candidate first requests votes first and therefore tends to win the majority. The chance of several nodes becoming Candidates at the same moment is small. Even if several Candidates start voting simultaneously and several nodes tie for the highest vote count within a term, the only consequence is that this term produces no Leader. Since the timeouts are random, the node that enters the next term first has the better chance of becoming Leader. Ties in several consecutive terms are very unlikely in theory and can be treated as practically impossible. A Leader is usually elected within 1-2 terms.
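As a rough sketch of the randomized timeout, assuming a configured base of base_ms milliseconds: the function name random_election_timeout is made up, and rand() with a modulo is used only for brevity (it is slightly biased and not what a production system would necessarily choose).

```c
#include <assert.h>
#include <stdlib.h>

/* Pick an election timeout in [base_ms, 2*base_ms),
 * i.e. between 1x and 2x the configured time. */
long random_election_timeout(long base_ms) {
    assert(base_ms > 0);
    return base_ms + rand() % base_ms;
}
```

With base_ms = 150, every node draws a value from 150 to 299 ms, so two nodes rarely time out in the same millisecond.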

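Sentinel's election process

When a Sentinel cluster is running normally, every node has the same epoch. When a failover is needed, a Leader is elected in the cluster to perform it. Sentinel implements Leader election among Sentinels with the Raft protocol, although it does not follow the steps described in the paper exactly. Once the failover completes, all Sentinels are equal again. The Leader is a role that exists only for the failover.

Election flow

1. After a Sentinel determines that the master is objectively down, it first checks whether it has already voted. If it has already voted for another Sentinel, it will not become Leader for 2x the failover timeout. It is effectively a Follower.
2. If the Sentinel has not voted yet, it becomes a Candidate.
3. As in Raft, on becoming a Candidate the Sentinel does several things:
- 1) Sets the failover state to start.
- 2) Increments the current epoch by 1, which amounts to entering a new term. A Sentinel's epoch is the term of the Raft protocol.
- 3) Sets its timeout to the current time plus a random amount, a random number of milliseconds within 1s.
- 4) Sends the is-master-down-by-addr command to the other nodes to request votes. The command carries its epoch.
- 5) Votes for itself. In Sentinel, voting means setting leader and leader_epoch in its own master struct to the Sentinel being voted for and that Sentinel's epoch.
4. The other Sentinels receive the Candidate's is-master-down-by-addr command. If a Sentinel's current epoch equals the epoch sent by the Candidate, it has already set leader and leader_epoch in its master struct to another Candidate, which means it has already voted for that Candidate. Otherwise it adopts the Candidate's epoch and records the Candidate as its leader. Having voted for another Sentinel, it can only be a Follower within the current epoch.
5. The Candidate keeps counting its votes until it finds that the votes supporting it as Leader exceed half of the Sentinels and also reach its configured quorum (for quorum, see "redis sentinel design and implementation"). Compared with Raft, Sentinel adds the quorum, so whether a Sentinel can be elected Leader also depends on its configured quorum.
6. If the Candidate does not get more than half of the votes and at least its configured quorum within one election timeout, its campaign has failed.
7. If no Candidate gets enough votes within an epoch, then after waiting more than 2x the failover timeout, a Candidate increments the epoch and voting starts again.
8. If a Candidate gets more than half of the votes and at least its configured quorum, it becomes the Leader.
9. Unlike Raft, the Leader does not send the other Sentinels a message saying that it became Leader. Once the Leader has promoted a slave to master, the other Sentinels detect that the new master is working normally and clear the objectively-down flag, so they do not need to enter the failover flow.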
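Steps 5 to 8 come down to a single check, sketched here after the test at the end of sentinelGetLeader in the source quoted later in this article. The function name sentinel_wins_election is made up; voters counts every known Sentinel including the Candidate itself.

```c
#include <assert.h>

/* Returns 1 if `votes` is enough to win among `voters` Sentinels
 * given the configured `quorum`, else 0. */
int sentinel_wins_election(unsigned int votes, unsigned int voters,
                           unsigned int quorum) {
    assert(votes <= voters);
    unsigned int majority = voters / 2 + 1;   /* absolute majority: 50% + 1 */
    return votes >= majority && votes >= quorum;
}
```

With 5 Sentinels and quorum 2, three votes win. With quorum 4, three votes are a majority but still not enough, which is how the quorum setting can block an election that plain Raft would accept.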

Notes on Sentinel's timeouts

Sentinel's timeout mechanism involves several time values.

failover_start_time: the time the next election starts. By default it is the current time plus a random number of milliseconds within 1s.
failover_state_change_time: the time the failover state last changed.
failover_timeout: the failover timeout. The default is 3 minutes.
election_timeout: the election timeout, the smaller of the default election timeout and failover_timeout. The default is 10s.
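The relationship between election_timeout and failover_timeout can be sketched as follows. The constant and function names are made up for the illustration; the values 10s and 3 minutes are the defaults stated above.

```c
#include <assert.h>

#define ELECTION_TIMEOUT_MS 10000L            /* 10 s default, per the text */
#define DEFAULT_FAILOVER_TIMEOUT_MS 180000L   /* 3 min default, per the text */

/* The effective election timeout is the smaller of the two values. */
long effective_election_timeout(long failover_timeout_ms) {
    assert(failover_timeout_ms > 0);
    return failover_timeout_ms < ELECTION_TIMEOUT_MS
         ? failover_timeout_ms : ELECTION_TIMEOUT_MS;
}

/* Returns 1 if a campaign that started at start_ms has timed out at now_ms. */
int election_timed_out(long now_ms, long start_ms, long failover_timeout_ms) {
    return now_ms - start_ms > effective_election_timeout(failover_timeout_ms);
}
```

With the default failover_timeout the election window is 10s. Only a failover_timeout configured below 10s shortens it.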

When a Follower becomes a Candidate, it sets failover_start_time to the current time plus a random number of milliseconds within 1s, and sets failover_state_change_time to the current time.

If the Candidate's current time minus failover_start_time exceeds election_timeout, the Candidate has not collected enough votes and the election for this epoch has timed out, so it turns back into a Follower. It gets its next chance to become a Candidate only once mstime() - failover_start_time < failover_timeout*2 no longer holds.

If a Follower records some Candidate as its Leader, its failover_start_time is set to the current time plus a random number of milliseconds within 1s. It is then subject to the same wait: it cannot become a Candidate while mstime() - failover_start_time < failover_timeout*2.

Sentinels do not all decide that a node is objectively down at the same moment; there is usually an order, so the Sentinel that starts first has a better chance of winning more votes. In addition, the random offset of up to 1s added to failover_start_time spreads the nodes' timeouts apart. In my many trials, Leader election among Sentinels almost always finished within a single epoch.

Sentinel election flow: source code walkthrough

Almost all of the code for Sentinel's election lives in sentinel.c. The sections below walk through the election flow with the source code.

Periodic task

void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {

    ...

    // Check whether the master has entered the SDOWN state
    sentinelCheckSubjectivelyDown(ri);

    /* Masters and slaves */
    if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
        /* Nothing so far. */
    }

    if (ri->flags & SRI_MASTER) {

        // Check whether the master has entered the ODOWN state
        sentinelCheckObjectivelyDown(ri);

        // Check whether a failover needs to start
        if (sentinelStartFailoverIfNeeded(ri))
            // Send SENTINEL is-master-down-by-addr to the other Sentinels
            // to refresh their state about the master
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);

        // Run the failover
        sentinelFailoverStateMachine(ri);

        // This call to sentinelAskMasterStateToOtherSentinels only collects
        // the other Sentinels' opinion on whether the master is alive,
        // used the next time we check whether the master is in ODOWN
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}

Sentinel runs sentinelHandleRedisInstance every 100ms. The function checks whether the master has entered SDOWN, then whether it has entered ODOWN, then whether a failover needs to start. If a failover starts, it solicits votes from the other nodes. Next comes the failover state machine, which performs different operations depending on failover_state. In normal operation failover_state is SENTINEL_FAILOVER_STATE_NONE.

Requesting votes from other Sentinels, or their judgment of whether the master is alive

#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    // Iterate over all Sentinels monitoring the same master
    // and send them the SENTINEL is-master-down-by-addr command
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        // Time since this Sentinel last replied to SENTINEL is-master-down-by-addr
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not received the info within SENTINEL_ASK_PERIOD ms.
         * 4) Conditions 1 and 2 hold and condition 3 does not, but the
         *    flags argument carries SENTINEL_ASK_FORCED. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->flags & SRI_DISCONNECTED) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        // Send the SENTINEL is-master-down-by-addr command
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->cc,
                    sentinelReceiveIsMasterDownReply, NULL,
                    "SENTINEL is-master-down-by-addr %s %s %llu %s",
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    // If this Sentinel has detected that the master is in ODOWN
                    // and is about to start a failover, send its own run ID
                    // so that the peer votes for it (if the peer has not yet
                    // voted in this epoch)
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    server.runid : "*");
        if (retval == REDIS_OK) ri->pending_commands++;
    }
    dictReleaseIterator(di);
}

For every node, Sentinel checks whether it is in SDOWN; for a master, it also checks for ODOWN. sentinelAskMasterStateToOtherSentinels sends the sentinel is-master-down-by-addr command when the master enters SDOWN or ODOWN. In SDOWN, the command collects the other Sentinels' view of whether the master is alive; in ODOWN, it requests votes from the other nodes. In SDOWN the flags are SENTINEL_NO_FLAGS; in ODOWN they are SENTINEL_ASK_FORCED.

Checking whether to start a failover

/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *    (Presumably to keep failovers from running too often.)
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;

    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];

            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            redisLog(REDIS_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }

    sentinelStartFailover(master);
    return 1;
}

sentinelStartFailoverIfNeeded first checks that the master is in ODOWN. Because the periodic task reaches this function on every run, it also has to check whether a failover is already in progress (SRI_FAILOVER_IN_PROGRESS). Then it checks whether the delay since the last attempt has elapsed. A failover can start only when all of these conditions hold. As for the delay: failover_start_time defaults to 0 and is modified in two places, once when a failover starts and once when a vote request from another Sentinel is received. The value assigned is mstime()+rand()%SENTINEL_MAX_DESYNC, which is the random factor Raft talks about. SENTINEL_MAX_DESYNC is 1000, so failover_start_time is the current time plus a random value within 1s. This ensures that after a timeout, different Sentinels apply for leadership again at random times. So when a failover starts, setting failover_start_time plays the role of Raft's "start a new timer". failover_start_time is also set when voting, so a node that votes first and only then passes the ODOWN and SRI_FAILOVER_IN_PROGRESS checks will fail the delay check. It behaves like a Follower in Raft and does not compete for Leader.
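The two time-related pieces of this check can be isolated in a small sketch. MAX_DESYNC_MS mirrors SENTINEL_MAX_DESYNC from the text; the function names are made up, and times are plain millisecond counters rather than values from mstime().

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_DESYNC_MS 1000   /* mirrors SENTINEL_MAX_DESYNC from the text */

/* New failover_start_time: now plus a random offset in [0, MAX_DESYNC_MS). */
long long next_failover_start_time(long long now_ms) {
    return now_ms + rand() % MAX_DESYNC_MS;
}

/* Returns 1 if enough time has passed since failover_start_ms
 * for a new failover attempt to begin, else 0. */
int failover_delay_elapsed(long long now_ms, long long failover_start_ms,
                           long long failover_timeout_ms) {
    assert(failover_timeout_ms > 0);
    return now_ms - failover_start_ms >= failover_timeout_ms * 2;
}
```

Since failover_start_time starts at 0, the very first attempt passes the delay check immediately. A Sentinel that has just voted has a fresh failover_start_time and is held back for twice the failover timeout.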

Becoming a Candidate and campaigning for Leader

/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    redisAssert(master->flags & SRI_MASTER);

    // Update the failover state
    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;

    // Update the master's flags
    master->flags |= SRI_FAILOVER_IN_PROGRESS;

    // Update the epoch
    master->failover_epoch = ++sentinel.current_epoch;

    sentinelEvent(REDIS_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);

    sentinelEvent(REDIS_WARNING,"+try-failover",master,"%@");

    // Record the failover start time and the state change time
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}

If a Sentinel passes the three checks and enters sentinelStartFailover, it has effectively become a Candidate. It does the following:

1. Sets failover_state to SENTINEL_FAILOVER_STATE_WAIT_START.
2. Marks the master as failing over by setting SRI_FAILOVER_IN_PROGRESS.
3. Increments sentinel.current_epoch and assigns it to the master's failover_epoch.
4. Sets failover_start_time to mstime()+rand()%SENTINEL_MAX_DESYNC.
5. Sets failover_state_change_time to mstime().

sentinelStartFailover completes the first two steps of becoming a Candidate. Control then returns to the periodic task sentinelHandleRedisInstance. Because sentinelStartFailoverIfNeeded returned 1, the if branch runs sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED); and starts soliciting votes from the other Sentinels. After that it enters sentinelFailoverStateMachine.

Follower voting

Let's look at the voting source code first.

/* Vote for the sentinel with 'req_runid' or return the old vote if already
 * voted for the specified 'req_epoch' or one greater.
 *
 * If a vote is not available returns NULL, otherwise return the Sentinel
 * runid and populate the leader_epoch with the epoch of the vote. */
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(REDIS_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(REDIS_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,server.runid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}

As described earlier, a Candidate increments its current epoch when it starts campaigning, so its epoch is 1 greater than a Follower's. When a Follower receives the first Candidate's vote request, its own epoch is smaller, so it sets its current epoch to that Candidate's epoch and records the Candidate as its Leader. When other Candidates later ask the same Follower for a vote, their epoch equals the epoch of the Leader the Follower already chose, so the Follower does not change its Leader. Within one epoch, then, a Follower can cast only one vote, for the first Candidate whose request it receives. The final if (strcasecmp(master->leader,server.runid)) sets failover_start_time, so the Follower cannot become a Candidate in the current epoch.
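To see the one-vote-per-epoch behaviour in isolation, here is a stripped-down model of the two conditions in sentinelVoteLeader. It is only a sketch: mini_sentinel_t and mini_vote_leader are made-up names, fixed-size buffers replace sds strings, and the config flush, events and failover_start_time update are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RUNID_LEN 40

/* Hypothetical, reduced view of the voting state. */
typedef struct {
    uint64_t current_epoch;       /* stands in for sentinel.current_epoch */
    uint64_t leader_epoch;        /* epoch of the last vote cast */
    char leader[RUNID_LEN + 1];   /* run ID voted for; "" means no vote yet */
} mini_sentinel_t;

/* Votes for req_runid if no vote exists for req_epoch yet, and returns
 * the run ID this Sentinel currently supports (possibly an earlier vote). */
const char *mini_vote_leader(mini_sentinel_t *s, uint64_t req_epoch,
                             const char *req_runid) {
    assert(s != NULL && strlen(req_runid) <= RUNID_LEN);
    if (req_epoch > s->current_epoch)
        s->current_epoch = req_epoch;            /* catch up to the newer epoch */
    if (s->leader_epoch < req_epoch && s->current_epoch <= req_epoch) {
        strcpy(s->leader, req_runid);            /* cast the vote */
        s->leader_epoch = s->current_epoch;
    }
    return s->leader;
}
```

Starting from epoch 5, a request from "candidate-a" at epoch 6 is granted. A request from "candidate-b" at the same epoch gets "candidate-a" back as the answer, and only a request at epoch 7 can move the vote.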

The Sentinel task state machine

void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    redisAssert(ri->flags & SRI_MASTER);

    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
            // Count the votes and check whether we became leader
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
            // Pick the best slave from the slave list
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            // Turn the selected slave into a master
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            // Wait for the promotion to take effect; if it times out,
            // select a new master again
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            // Send SLAVEOF to the slaves so that they sync with the new master
            sentinelFailoverReconfNextSlave(ri);
            break;
    }
}

Sentinel handles the failover flow as a state machine: each state handles a different task, and when a task completes the state advances to the next one. sentinelFailoverStateMachine decides which step to run based on failover_state. The Leader is elected in sentinelFailoverWaitStart; the other states are the Leader's failover steps.

Checking whether this Sentinel became Leader

void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
    char *leader;
    int isleader;

    /* Check if we are the leader for the failover epoch. */
    leader = sentinelGetLeader(ri, ri->failover_epoch);
    // Is this Sentinel the leader?
    isleader = leader && strcasecmp(leader,server.runid) == 0;
    sdsfree(leader);

    /* If I'm not the leader, and it is not a forced failover via
     * SENTINEL FAILOVER, then I can't continue with the failover. */
    if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
        int election_timeout = SENTINEL_ELECTION_TIMEOUT;

        /* The election timeout is the MIN between SENTINEL_ELECTION_TIMEOUT
         * and the configured failover timeout. */
        if (election_timeout > ri->failover_timeout)
            election_timeout = ri->failover_timeout;

        /* Abort the failover if I'm not the leader after some time. */
        if (mstime() - ri->failover_start_time > election_timeout) {
            sentinelEvent(REDIS_WARNING,"-failover-abort-not-elected",ri,"%@");
            // Cancel the failover
            sentinelAbortFailover(ri);
        }
        return;
    }

    // This Sentinel is the leader: start the failover...

    sentinelEvent(REDIS_WARNING,"+elected-leader",ri,"%@");

    // Move to the slave selection state
    ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
    ri->failover_state_change_time = mstime();

    sentinelEvent(REDIS_WARNING,"+failover-state-select-slave",ri,"%@");
}

As described earlier, sentinelStartFailover sets failover_state to SENTINEL_FAILOVER_STATE_WAIT_START, so the state machine enters sentinelFailoverWaitStart.

sentinelFailoverWaitStart first checks whether a leader has been elected. If the Leader is this Sentinel, or this is a forced failover, failover_state is set to SENTINEL_FAILOVER_STATE_SELECT_SLAVE. A forced failover is triggered with Sentinel's SENTINEL FAILOVER <master-name> command and is not discussed here.

If this Sentinel is elected Leader, it moves to the next state and starts the failover. If it has not been elected within election_timeout, the Candidate has lost in this epoch. It must then wait out the failover delay (twice failover_timeout, as checked in sentinelStartFailoverIfNeeded) before campaigning again, or, if another Leader was elected in this epoch, it goes back to being a Follower.

Counting votes

/* Scan all the Sentinels attached to this master to check if there
 * is a leader for the specified epoch.
 *
 * To be a leader for a given epoch, we should have the majority of
 * the Sentinels we know that reported the same instance as
 * leader for the same epoch. */
char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
    dict *counters;
    dictIterator *di;
    dictEntry *de;
    unsigned int voters = 0, voters_quorum;
    char *myvote;
    char *winner = NULL;
    uint64_t leader_epoch;
    uint64_t max_votes = 0;

    redisAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));

    // Vote counters
    counters = dictCreate(&leaderVotesDictType,NULL);

    /* Count other sentinels votes */
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        // Add one vote for the leader chosen by this Sentinel
        if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
            sentinelLeaderIncr(counters,ri->leader);

        // Count the voters
        voters++;
    }
    dictReleaseIterator(di);

    /* Check what's the winner. For the winner to win, it needs two conditions:
     *
     * 1) Absolute majority between voters (50% + 1).
     * 2) And anyway at least master->quorum votes. */
    di = dictGetIterator(counters);
    while((de = dictNext(di)) != NULL) {

        // Read the vote count
        uint64_t votes = dictGetUnsignedIntegerVal(de);

        // Keep the entry with the most votes
        if (votes > max_votes) {
            max_votes = votes;
            winner = dictGetKey(de);
        }
    }
    dictReleaseIterator(di);

    /* Count this Sentinel vote:
     * if this Sentinel did not voted yet, either vote for the most
     * common voted sentinel, or for itself if no vote exists at all. */
    if (winner)
        myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
    else
        myvote = sentinelVoteLeader(master,epoch,server.runid,&leader_epoch);

    // A vote exists and its epoch matches the given epoch
    if (myvote && leader_epoch == epoch) {

        // Add one vote for that leader (this Sentinel's own vote)
        uint64_t votes = sentinelLeaderIncr(counters,myvote);

        // If the new count exceeds the maximum, switch the winner
        if (votes > max_votes) {
            max_votes = votes;
            winner = myvote;
        }
    }
    voters++; /* Anyway, count me as one of the voters. */

    // If the winner's votes fall short of the majority, or of the quorum
    // configured for the master, this election is void
    voters_quorum = voters/2+1;
    if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
        winner = NULL;

    // Return the leader Sentinel, or NULL
    winner = winner ? sdsnew(winner) : NULL;
    sdsfree(myvote);
    dictRelease(counters);
    return winner;
}

sentinelGetLeader tallies the votes of all the other Sentinels. If one Sentinel has more than half of the votes and at least the master's quorum, the Leader has been elected.

The first time a Candidate enters sentinelGetLeader, it has not yet requested votes from the other Sentinels, so winner is NULL and it votes for itself. This is the step from the Raft description above, "3. Votes for itself", and with it all four pre-campaign steps are complete. On later calls, sentinelGetLeader can count the other Sentinels' votes. When it finds a Sentinel whose votes exceed half and reach the quorum, it returns that Sentinel. sentinelFailoverWaitStart then checks whether that Sentinel is itself. If so, it has become the Leader and starts the failover; if not, it waits for the election to time out and becomes a Follower.

How the Leader tells the other Sentinels that it became Leader

Sentinel has no dedicated logic for a Leader to send a "campaign won" message to the other Sentinels. After a Sentinel becomes Leader, it quietly gets to work. During the failover, when the Leader sees from the selected slave's INFO output that it has confirmed its master role, the Leader sets config_epoch to the latest epoch.

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    ...
    /* Handle slave -> master role switch. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
        /* If this is a promoted slave we can change state to the
         * failover state machine. */
        if ((ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
            (ri->master->failover_state ==
                SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
        {
            /* Now that we are sure the slave was reconfigured as a master
             * set the master configuration epoch to the epoch we won the
             * election to perform this failover. This will force the other
             * Sentinels to update their config (assuming there is not
             * a newer one already available). */
            ri->master->config_epoch = ri->master->failover_epoch;
            ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
            ri->master->failover_state_change_time = mstime();
            sentinelFlushConfig();
            sentinelEvent(REDIS_WARNING,"+promoted-slave",ri,"%@");
            sentinelEvent(REDIS_WARNING,"+failover-state-reconf-slaves",
                ri->master,"%@");
            sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
                "start",ri->master->addr,ri->addr);
            sentinelForceHelloUpdateForMaster(ri->master);
        }
        ...
    }
    ...
}

config_epoch is sent to the other Sentinels over the hello channel. When another Sentinel sees a newer config_epoch, it updates its master address and config_epoch. This effectively tells the other Sentinels that the Leader was elected.

/* Process an hello message received via Pub/Sub in master or slave instance,
 * or sent directly to this sentinel via the (fake) PUBLISH command of Sentinel.
 *
 * If the master name specified in the message is not known, the message is
 * discarded. */
void sentinelProcessHelloMessage(char *hello, int hello_len) {
    ...
        /* Update master info if received configuration is newer. */
        if (master->config_epoch < master_config_epoch) {
            master->config_epoch = master_config_epoch;
            if (master_port != master->addr->port ||
                strcmp(master->addr->ip, token[5]))
            {
                sentinelAddr *old_addr;

                sentinelEvent(REDIS_WARNING,"+config-update-from",si,"%@");
                sentinelEvent(REDIS_WARNING,"+switch-master",
                    master,"%s %s %d %s %d",
                    master->name,
                    master->addr->ip, master->addr->port,
                    token[5], master_port);

                old_addr = dupSentinelAddr(master->addr);
                sentinelResetMasterAndChangeAddress(master, token[5], master_port);
                sentinelCallClientReconfScript(master,
                    SENTINEL_OBSERVER,"start",
                    old_addr,master->addr);
                releaseSentinelAddr(old_addr);
            }
        }
    ...
}
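The "newer configuration wins" rule in this function can be tried out with a small stand-alone model. This is only a sketch under simplifying assumptions: master_view_t and apply_hello are made-up names, the hello message is assumed to be already parsed, and events and client scripts are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced view of what a Sentinel knows about a master. */
typedef struct {
    uint64_t config_epoch;
    char ip[46];
    int port;
} master_view_t;

/* Applies a parsed hello message. Returns 1 if the master address
 * was switched, 0 otherwise. */
int apply_hello(master_view_t *m, uint64_t hello_epoch,
                const char *ip, int port) {
    assert(m != NULL && strlen(ip) < sizeof(m->ip));
    if (m->config_epoch >= hello_epoch) return 0;   /* not newer: ignore */
    m->config_epoch = hello_epoch;
    if (port == m->port && strcmp(ip, m->ip) == 0)
        return 0;                                   /* newer epoch, same address */
    strcpy(m->ip, ip);
    m->port = port;
    return 1;
}
```

A hello carrying the same or an older config_epoch changes nothing, which is why only the Sentinel that won the election, and bumped config_epoch, can move everyone to the new master.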

References:

Redis 2.8.19 source code

http://redis.io/topics/sentinel

"In Search of an Understandable Consensus Algorithm", Diego Ongaro and John Ousterhout, Stanford University

"Redis Design and Implementation", Huang Jianhong, China Machine Press

posted on 2016-12-14 18:33 jinfeng_wang, category: 2016-REDIS
