I. Preface
I woke up in the morning, opened WeChat, and found a message from a colleague: writes to the Kafka cluster had been failing frequently since the early hours of the previous day. I quickly opened my laptop, went through the cluster's machine monitoring and logs, and saw that the load on one node had dropped off a cliff at roughly the same time the colleague reported, so I logged onto the server and got to work.
II. Troubleshooting
1. Check the machine monitoring to get a rough idea of which node is abnormal.
[machine monitoring screenshot]
2. The machine monitoring pointed to one abnormal node. Logging onto that server and checking the Kafka log, there were error entries, and the log stopped right at that point in time:
[2017-06-01 16:59:59,851] ERROR Processor got uncaught exception. (kafka.network.Processor)
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:174)
at sun.nio.ch.IOUtil.read(IOUtil.java:195)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:108)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:160)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:141)
at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
at kafka.network.Processor.run(SocketServer.scala:413)
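This "Direct buffer memory" OOM means the JVM ran out of the off-heap memory that NIO direct buffers are carved from; Kafka's network Processor threads allocate such buffers for every socket read. Before restarting, it is worth checking what direct-memory limit the broker was actually started with. A minimal sketch, not part of the original troubleshooting, assuming a single broker process per host and a JDK with jinfo on the PATH:

# Find the broker PID (kafka.Kafka is the broker's main class)
KAFKA_PID=$(ps -ef | grep kafka.Kafka | grep -v grep | awk '{print $2}')
# Print the direct-memory cap; 0 means it defaults to roughly the max heap size
jinfo -flag MaxDirectMemorySize "$KAFKA_PID"
# Print the JVM flags the broker was launched with (look for DisableExplicitGC)
jinfo -flags "$KAFKA_PID"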
3. Check the Kafka process and the listening port. Both look normal, so the broker is effectively hung: the process is alive but unresponsive.
ps -ef | grep kafka          ## check the Kafka process
netstat -ntlp | grep 9092    ## 9092 is Kafka's listening port
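An open port only proves the socket is still listening; it says nothing about whether the broker can actually serve requests. Since the original symptom was failing writes, a quick produce test makes the "hung" diagnosis concrete. A sketch, assuming the stock scripts under /usr/local/kafka/bin and a plaintext listener on 9092 (test-topic is a placeholder topic name):

cd /usr/local/kafka/bin
# A healthy broker accepts this immediately; a hung one reports producer timeouts
echo "ping" | ./kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic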
4. Since the broker is hung, the only option is a restart.
ps -ef | grep kafka | grep -v grep | awk '{print $2}' | xargs kill -9
cd /usr/local/kafka/bin; nohup ./kafka-server-start.sh ../config/server.properties &
5. After the restart, keep watching this node's Kafka log: after a long round of index rebuilding, the same error as above starts scrolling like mad again. Some googling finally turned up the fix.
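The long index rebuild on startup is a direct consequence of the kill -9: the broker was not shut down cleanly, so it has to recover its log segments before serving traffic. When the process still responds to signals, a gentler restart is worth trying first; a sketch under the same /usr/local/kafka layout:

cd /usr/local/kafka/bin
# kafka-server-stop.sh sends SIGTERM so the broker can close its logs cleanly
./kafka-server-stop.sh
sleep 30
# Fall back to kill -9 only if the process is still around after the grace period
ps -ef | grep kafka.Kafka | grep -v grep | awk '{print $2}' | xargs -r kill -9
nohup ./kafka-server-start.sh ../config/server.properties &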
III. Solution
In /usr/local/kafka/bin/kafka-run-class.sh, remove -XX:+DisableExplicitGC and add -XX:MaxDirectMemorySize=512m, then restart Kafka once more; problem solved.
Why this works: -XX:+DisableExplicitGC turns System.gc() into a no-op, but NIO relies on exactly that call (from java.nio.Bits.reserveMemory) to reclaim unreferenced direct buffers when the direct-memory pool is exhausted, so with the flag in place the socket-read allocations in the stack trace above eventually fail with "Direct buffer memory". Removing the flag lets that cleanup run again, and -XX:MaxDirectMemorySize=512m puts an explicit cap on the direct-buffer pool.
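For reference, these flags live in the KAFKA_JVM_PERFORMANCE_OPTS default inside kafka-run-class.sh. The exact default line differs between Kafka versions, so treat the snippet below as a sketch of the edit rather than a verbatim diff:

# Before (typical default in kafka-run-class.sh; details vary by version):
# KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true"

# After: drop DisableExplicitGC and cap the direct-buffer pool at 512 MB
KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:MaxDirectMemorySize=512m -Djava.awt.headless=true"

In most versions the script only applies this default when the KAFKA_JVM_PERFORMANCE_OPTS environment variable is empty, so the same change can also be made by exporting the variable before running kafka-server-start.sh instead of editing the script.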