A Method for Extending Ganglia Metrics for System Testing
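Ganglia, Nagios and similar tools are excellent, widely used monitoring and alerting tools. They are aimed mainly at operations, but they can also be used for performance and stability testing.
Tests of distributed systems tend to run for a long time, and several systems are often installed on the same machine. Host-level metrics then mix the load of all of them, which makes the monitoring data less accurate for any single system under test.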
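Below is a method for process-level monitoring that extends Ganglia so the cluster can be monitored at a finer granularity. The monitoring script itself is also added to alerting, so that an abnormal exit of the script does not go unnoticed (the Nagios extension is described in a separate post).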
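GEngin.py: the main engine. It reads the entries in the configuration file under conf, polls each metric in turn, and broadcasts the values by calling gmetric.
conf: directory holding the metrix configuration file, which defines the metrics and their parameters.
flag: directory holding exactly one flag file. The file name is the task name; metrics are separated by task name, which makes it easy to aggregate and compare results.
log: directory holding the GEngin log and the log of each metric collection script.
pid: holds the PID of GEngin, used by the alerting script.
script: the actual metric collection scripts.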
cat conf/metrix.cfg:
YARN|ResourceManager|cpu|ResourceManager_cpu.py|ResourceManager_cpu.txt|int16|Percent|
YARN|ResourceManager|mem|ResourceManager_mem.py|ResourceManager_mem.txt|int16|Percent|
YARN|ResourceManager|lsof|ResourceManager_lsof.py|ResourceManager_lsof.txt|int16|Number|
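Each configuration line is a set of pipe-separated fields ending with a trailing `|`. The sketch below shows one way such a line could be parsed into named fields. The field names (`group`, `name`, `item`, `script`, `log_file`, `value_type`, `units`) are my own labels, inferred from how GEngin.py uses the values (the sixth field is passed to `gmetric -t`, the seventh to `gmetric -u`); they are not defined by the original post. The example is written for Python 3.

```python
from collections import namedtuple

# Hypothetical field names; the order follows the sample lines in conf/metrix.cfg.
Metric = namedtuple("Metric", "group name item script log_file value_type units")


def parse_metric_line(line):
    """Return a Metric for one config line, or None for blank or short lines."""
    fields = [f.strip() for f in line.strip().split("|")]
    # A well-formed line has 7 fields plus an empty part after the trailing '|'.
    if len(fields) < 7 or not fields[0]:
        return None
    return Metric(*fields[:7])


sample = "YARN|ResourceManager|cpu|ResourceManager_cpu.py|ResourceManager_cpu.txt|int16|Percent|"
metric = parse_metric_line(sample)
print(metric.script, metric.value_type, metric.units)
```

Returning `None` for blank lines lets a caller skip them instead of failing on tuple unpacking, which is what a stray empty line at the end of the file would otherwise cause.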
ls flag/:
yarntestD001.flag
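The task name is simply the flag file name without its `.flag` suffix. As a rough illustration, the lookup could be isolated in a small helper like the one below. The function name `current_task` is hypothetical, and unlike GEngin.py (which counts every file in the directory) this sketch only counts files ending in `.flag`. It is written for Python 3.

```python
import os
import tempfile


def current_task(flag_dir="flag"):
    """Return the task name if exactly one *.flag file exists, else None."""
    flags = [f for f in os.listdir(flag_dir) if f.endswith(".flag")]
    if len(flags) != 1:
        return None
    return flags[0][:-len(".flag")]


# Usage with a throwaway directory standing in for flag/
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "yarntestD001.flag"), "w").close()
print(current_task(demo_dir))  # prints yarntestD001
```

Treating "zero flags" and "more than one flag" the same way mirrors the engine's behaviour of idling until exactly one task is active.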
ll log/:
-rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagercputdw-10-16-19-91.txt
-rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagerlsoftdw-10-16-19-91.txt
-rw-r--r-- 1 yarn users 168 Mar 19 20:02 yarntestD001_YARNResourceManagermemtdw-10-16-19-91.txt
ll script/:
-rw-r--r-- 1 yarn users 882 Feb 28 17:20 ResourceManager_cpu.py
-rw-r--r-- 1 yarn users 1093 Feb 28 17:45 ResourceManager_lsof.py
-rw-r--r-- 1 yarn users 882 Feb 28 17:18 ResourceManager_mem.py
cat script/SAMPLE.py:
#!/usr/bin/env python
# coding=gbk
import sys
import time

def CheckInput():
    "Check that the log file name was passed as the first argument."
    if len(sys.argv) < 2:
        print("Usage: " + sys.argv[0] + " FileNamePrefix")
        sys.exit(1)

if __name__ == '__main__':
    CheckInput()
    ## the collected value is also appended to a file in the log directory
    LogFile = open("log/" + sys.argv[1], 'a')
    ## placeholder value; a real script collects the metric here
    res = "29"
    ## interface to GEngin: the output line must have the form value:Value
    print("value:" + res)
    ntime = str(time.strftime("%Y-%m-%d %X", time.localtime()))
    LogFile.write(ntime + " " + res + "\n")
    LogFile.close()
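The only contract between a collection script and the engine is the `value:Value` line on standard output. The sketch below shows, under my own assumptions, how the reading side of that contract could be written as a small pure function: it scans the output lines and falls back to a default when no value line is present. The name `extract_value` and the default of `"0"` are illustrative choices, written for Python 3.

```python
def extract_value(output_lines, default="0"):
    """Return the value from the last 'value:<v>' line, or default if none exists."""
    value = default
    for line in output_lines:
        if line.startswith("value:"):
            candidate = line.split(":", 1)[1].strip()
            if candidate:
                value = candidate
    return value


print(extract_value(["starting\n", "value:29\n", "done\n"]))  # prints 29
```

Because lines that do not start with `value:` are ignored rather than resetting the result, a script can print extra diagnostics after the value line without the metric being reported as 0.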
cat GEngin.py :
#!/usr/bin/env python
# coding=gbk
import sys
import os
import time
from time import sleep

def CheckInput():
    "Check the config file, kill any previous instance, and record our PID."
    print("Usage : python ./" + sys.argv[0])
    if not os.path.exists("conf/metrix.cfg"):
        print("Error : config file conf/metrix.cfg does not exist !")
        sys.exit(1)
    ## kill the previous process so the engine can be restarted
    if os.path.exists("pid/pid.txt"):
        pfile = open("pid/pid.txt", 'r')
        for p in pfile:
            pid = p.strip()
            if pid:
                os.system("kill -9 " + pid)
        pfile.close()
        os.remove("pid/pid.txt")
    pfile = open("pid/pid.txt", 'w')
    pfile.write(str(os.getpid()))
    pfile.close()

if __name__ == '__main__':
    CheckInput()
    LogFile = open("log/" + os.path.basename(sys.argv[0]) + ".log", 'a')
    # file prefix of the logs: the task name taken from the flag file
    filePre = "noTask"
    for fi in os.listdir("flag"):
        if fi.endswith(".flag"):
            filePre = fi.split('.')[0].strip()
    # host name for gmetric (only hosts whose name starts with "tdw" are used)
    host = ""
    f = os.popen("hostname")
    for res in f:
        if res.startswith("tdw"):
            host = res.strip()
    f.close()
    LogFile.write("******** Start task " + filePre + " monitoring *******\n")
    # main loop: idle unless exactly one flag file is present
    while True:
        if len(os.listdir("flag")) != 1:
            sleep(10)
            LogFile.write("Finish previous task " + filePre + " .... No task, main loop .....\n")
            LogFile.flush()
            continue
        if not os.path.exists("flag/" + filePre + ".flag"):
            # the flag file changed, so a new task has started
            LogFile.write("Finish previous task " + filePre + ".....\n")
            for fi in os.listdir("flag"):
                if fi.endswith(".flag"):
                    filePre = fi.split('.')[0].strip()
            LogFile.write("***** Start New Task " + filePre + " monitoring *******\n")
        # process the configured metrics one by one
        insFile = open("conf/metrix.cfg", 'r')
        for line in insFile:
            if not line.strip():
                continue  # skip blank lines
            # mUnit holds the gmetric value type (-t), mWeiht the display units (-u)
            mGroup, mName, mItem, mShell, mFile, mUnit, mWeiht, nouse = line.split('|')
            outPutFile = filePre + "_" + mGroup + mName + mItem + host + ".txt"
            if mShell.endswith(".py"):
                runCmd = "python script/" + mShell + " " + outPutFile
            elif mShell.endswith(".sh"):
                runCmd = "script/" + mShell + " " + outPutFile
            else:
                continue
            # default to 0 when the script prints no value line
            value = "0"
            f = os.popen(runCmd)
            for res in f:
                if res.startswith("value:"):
                    value = res.split(':')[1].strip()
            f.close()
            cmd = "gmetric -n " + mGroup + "_" + mName + "_" + mItem + " -v " + value + " -t " + mUnit + " -u " + mWeiht + " -S " + host + ":" + host
            print(cmd)
            os.popen(cmd).close()
            ntime = str(time.strftime("%Y-%m-%d %X", time.localtime()))
            LogFile.write(ntime + " " + cmd + "\n")
        insFile.close()
        LogFile.flush()
        if len(os.listdir("flag")) == 1 and os.path.exists("flag/" + filePre + ".flag"):
            sleep(8)
    LogFile.close()
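The engine builds the gmetric call by string concatenation and runs it through a shell. As an alternative sketch (not the author's method), the same call can be expressed as an argument list and run with `subprocess`, which avoids shell quoting problems if a value or unit ever contains spaces. The function names `build_gmetric_args` and `send_metric` are hypothetical; the options `-n`, `-v`, `-t`, `-u` and `-S` are the ones already used by GEngin.py. The example is written for Python 3.

```python
import subprocess


def build_gmetric_args(group, name, item, value, value_type, units, host):
    """Build the gmetric argument list for one metric, mirroring GEngin.py."""
    return [
        "gmetric",
        "-n", "_".join([group, name, item]),  # metric name, e.g. YARN_ResourceManager_cpu
        "-v", str(value),                     # metric value
        "-t", value_type,                     # value type, e.g. int16
        "-u", units,                          # display units, e.g. Percent
        "-S", host + ":" + host,              # spoofed host, same form as in GEngin.py
    ]


def send_metric(args):
    """Run gmetric and return its exit status (requires gmetric on the PATH)."""
    return subprocess.call(args)


args = build_gmetric_args("YARN", "ResourceManager", "cpu", 29,
                          "int16", "Percent", "tdw-10-16-19-91")
print(" ".join(args))
```

The example only prints the command; calling `send_metric(args)` on a host with Ganglia installed would actually publish the value.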
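Monitoring metrics as displayed in Ganglia:
Add the running GEngin.py script to monitoring as well, so that an abnormal exit of the process does not go unnoticed.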
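The original post leaves the Nagios side to a separate article, so the following is only a minimal sketch of what a liveness check for GEngin.py might look like. It assumes the check runs on a POSIX system from the same working directory as the engine, reads the PID that GEngin.py writes to pid/pid.txt, and uses the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The names `check_engine` and `main` are hypothetical. The example is written for Python 3.

```python
import errno
import os

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def check_engine(pid_file="pid/pid.txt"):
    """Return (exit_code, message) describing whether the recorded PID is alive."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return CRITICAL, "CRITICAL: cannot read a PID from %s" % pid_file
    if pid <= 0:
        return CRITICAL, "CRITICAL: invalid PID %d in %s" % (pid, pid_file)
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the process exists
    except OSError as exc:
        # EPERM means the process exists but belongs to another user
        if exc.errno != errno.EPERM:
            return CRITICAL, "CRITICAL: GEngin.py (pid %d) is not running" % pid
    return OK, "OK: GEngin.py is running (pid %d)" % pid


def main():
    """Print the status line and return the exit code for the caller."""
    code, message = check_engine()
    print(message)
    return code
```

A plugin script would end with `sys.exit(main())` so that Nagios receives the exit code. Note that this only confirms some process with that PID exists; it does not verify that the process is actually GEngin.py.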
Posted on 2014-03-27 16:53 by 順其自然EVO. Category: Testing Study Column