paulwong

Analyzing Apache logs with Pig

Analyzing log files and extracting meaningful information from them is a common use case for Hadoop. We don't have to write MapReduce programs for these analyses; tools like Pig and Hive can do the job instead. This post is only a starting point for the analysis, using Pig on Apache logs. Pig has some built-in libraries that help us load Apache log files and clean up the string values found in raw logs. These functions are in piggybank.jar, usually found under the pig/contrib/piggybank/java/ directory. As the first step we need to register this jar file with our Pig session; only then can we use its functions in our Pig Latin.

1. Register PiggyBank jar

REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

Once the jar file is registered, we need to define the functions we will use in our Pig Latin. Any basic Apache log analysis needs a loader that splits each log line into columns. We can define an Apache log loader as follows:

2. Define a log loader

DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

(PiggyBank has other log loaders as well, for example CombinedLogLoader.)

In Apache log files the default date format is 'dd/MMM/yyyy:HH:mm:ss Z'. Such a timestamp is not much use for day-level analysis, where we need the date without the time. For that we use DateExtractor().

3. Define Date Extractor

DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
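To make the conversion concrete, here is a minimal Python sketch (not part of the Pig script) of what extracting the day from an Apache timestamp amounts to. The function name `extract_day` and the sample timestamp are assumptions for illustration, and the sketch keeps the day in the log's own UTC offset; PiggyBank's DateExtractor may handle time zones differently.

```python
from datetime import datetime

# Apache's default timestamp layout as a strptime pattern
# (the Java pattern in the text is dd/MMM/yyyy:HH:mm:ss Z).
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def extract_day(apache_timestamp: str) -> str:
    """Return only the calendar day (yyyy-MM-dd) of an Apache log timestamp."""
    parsed = datetime.strptime(apache_timestamp, APACHE_TIME_FORMAT)
    # Format using the timestamp's own offset; no time zone conversion is done.
    return parsed.strftime("%Y-%m-%d")


print(extract_day("01/Jan/2011:23:15:42 -0500"))  # 2011-01-01
```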

With the required functions defined, we first need to load the log file into Pig.

4. Load the Apache log file into Pig

--load the log file from HDFS into Pig using CommonLogLoader
--(CommonLogLoader splits the request into method, uri and proto; a common-format log has no referer or browser columns)
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, method, uri, proto, serverstatus, bytes);
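For readers who want to see what a log loader does to one line, the following Python sketch splits a single common-format log line into named columns. It is an illustration only: the regular expression is a simplified one written for this example, not PiggyBank's, and the sample line and the column names after `dt` are assumptions.

```python
import re
from typing import Optional

# Simplified pattern for Apache's common log format:
# host ident user [timestamp] "method uri protocol" status bytes
CLF_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)$'
)

# Column names are illustrative; the first four follow the LOAD statement above.
FIELDS = ("ip_address", "rfc", "userId", "dt",
          "method", "uri", "proto", "status", "bytes")


def parse_line(line: str) -> Optional[dict]:
    """Return the columns of one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample = '1.2.3.4 - alice [01/Jan/2011:10:00:00 -0500] "GET /index.html HTTP/1.1" 200 3190'
record = parse_line(sample)
print(record["ip_address"], record["userId"], record["dt"])
```

Lines that do not match the pattern return None here, so a caller can decide whether to skip or report them.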

Now we are ready to dive into the actual log analysis. There are many kinds of information you might want to extract from a log; here we look at a few common requirements.

Note: you must first register the jar, define the functions to be used and load the log files into Pig before trying any of the Pig Latin below.

Requirement 1: Find unique hits per day

Pig Latin

--extract the day alone and group the records by day
grpd = GROUP logs BY DayExtractor(dt);

--loop through each group to count the distinct userIds
cntd = FOREACH grpd {
    tempId = logs.userId;
    uniqueUserId = DISTINCT tempId;
    GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
}

--sort the processed records by the number of unique userIds in descending order
srtd = ORDER cntd BY cnt DESC;

--store the final result in an HDFS directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
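The same group, distinct, count and sort steps can be checked on a handful of records with a small Python sketch. This is only a local illustration of the logic, assuming hypothetical sample records of (userId, timestamp); it is not how Pig executes the script.

```python
from collections import defaultdict
from datetime import datetime


def unique_users_per_day(records):
    """records: iterable of (user_id, apache_timestamp) pairs."""
    users_by_day = defaultdict(set)
    for user_id, dt in records:
        # Reduce the timestamp to its day, like DayExtractor(dt) in the script.
        day = datetime.strptime(dt, "%d/%b/%Y:%H:%M:%S %z").strftime("%Y-%m-%d")
        # A set keeps each user once per day, like DISTINCT on the userIds.
        users_by_day[day].add(user_id)
    counts = [(day, len(users)) for day, users in users_by_day.items()]
    # Highest count first, like ORDER ... BY cnt DESC.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


sample = [
    ("alice", "01/Jan/2011:10:00:00 -0500"),
    ("bob", "01/Jan/2011:11:00:00 -0500"),
    ("alice", "01/Jan/2011:12:00:00 -0500"),
    ("alice", "02/Jan/2011:09:00:00 -0500"),
]
print(unique_users_per_day(sample))  # [('2011-01-01', 2), ('2011-01-02', 1)]
```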

Requirement 2: Find unique hits to websites (IPs) per day

Pig Latin

--extract the day alone and group the records by day and IP address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);

--loop through each group to count the distinct userIds
cntd = FOREACH grpd {
    tempId = logs.userId;
    uniqueUserId = DISTINCT tempId;
    GENERATE FLATTEN(group) AS (day, ip_address), COUNT(uniqueUserId) AS cnt;
}

--sort the processed records by the number of unique userIds in descending order
srtd = ORDER cntd BY cnt DESC;

--store the final result in an HDFS directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
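Grouping on two columns only changes the key. A hedged Python sketch, assuming hypothetical records that already carry the extracted day, shows the compound (day, IP address) key and an output row with day, IP address and count as separate values.

```python
from collections import defaultdict


def unique_users_per_day_and_ip(records):
    """records: iterable of (day, ip_address, user_id) tuples."""
    users_by_key = defaultdict(set)
    for day, ip_address, user_id in records:
        # The group key is the pair (day, ip_address).
        users_by_key[(day, ip_address)].add(user_id)
    rows = [(day, ip, len(users)) for (day, ip), users in users_by_key.items()]
    # Highest count first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


sample = [
    ("2011-01-01", "1.2.3.4", "alice"),
    ("2011-01-01", "1.2.3.4", "bob"),
    ("2011-01-01", "5.6.7.8", "alice"),
    ("2011-01-01", "1.2.3.4", "alice"),
]
print(unique_users_per_day_and_ip(sample))
# [('2011-01-01', '1.2.3.4', 2), ('2011-01-01', '5.6.7.8', 1)]
```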

Note: when you use Pig Latin in the grunt shell, keep a few points in mind:

1. When we enter a Pig statement in grunt and press Enter, only syntax and semantic checks are done; no execution is triggered.

2. The Pig statements are executed only when a STORE (or DUMP) command is submitted, i.e. the MapReduce jobs are triggered only at that point.

3. You don't have to load the log files again for every analysis; once the LOAD statement has been entered, the same relation can be used for all related operations in that session. Once you leave the grunt shell the definitions are lost, and you have to perform the register, define and load steps all over again.
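As a loose analogy for points 1 and 2, Python generators behave in a similar lazy way: describing a pipeline reads nothing, and work happens only when the result is consumed. The names below (`load`, `processed`, the sample lines) are hypothetical, and the analogy is imperfect, since a Python generator can be consumed only once while a Pig relation can be reused within the session.

```python
processed = []  # records every line that is actually read


def load(lines):
    """Lazily yield lines, noting each one at the moment it is really read."""
    for line in lines:
        processed.append(line)
        yield line


# Describing the pipeline: nothing is read yet.
logs = load(["a 200", "b 404", "c 200"])
ok_only = (line for line in logs if line.endswith("200"))
read_before_store = list(processed)

# Consuming the pipeline plays the role of STORE and triggers the work.
result = list(ok_only)

print(read_before_store)  # []
print(result)             # ['a 200', 'c 200']
```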

Posted on 2013-04-08 02:06 by paulwong. Category: PIG
