泰仔在線


Nutch URL Filter Configuration Rules

Posted on 2010-04-30 10:12 by 泰仔在線 | Category: Cloud Computing

There are plenty of Nutch source-code walkthroughs online, but the crawling part is still not easy to grasp. Today I finally figured out how to configure it. I'm posting my crawl-urlfilter.txt file here so we can discuss it together, and as a note to myself.

           

          # Licensed to the Apache Software Foundation (ASF) under one or more
          # contributor license agreements.  See the NOTICE file distributed with
          # this work for additional information regarding copyright ownership.
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.


          # The url filter file used by the crawl command.

          # Better for intranet crawling.
          # Be sure to change MY.DOMAIN.NAME to your domain name.

          # Each non-comment, non-blank line contains a regular expression
          # prefixed by '+' or '-'.  The first matching pattern in the file
          # determines whether a URL is included or ignored.  If no pattern
          # matches, the URL is ignored.

          # skip file:, ftp:, & mailto: urls
          -^(file|ftp|mailto):

          # skip image and other suffixes we can't yet parse
          -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

          # skip URLs containing certain characters as probable queries, etc.

# This rule is essential for crawling dynamic sites. Without it, URLs with a
# query string, such as a.jsp?a=001, cannot be fetched at all.
          +[?*!@=]

          # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
          -.*(/[^/]+)/[^/]+\1/[^/]+\1/
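The loop-breaking rule above uses a backreference (`\1`) to detect a path segment that repeats three or more times. As a sanity check, it can be exercised directly with Java's regex engine (a quick sketch; the URLs are made-up examples):

```java
import java.util.regex.Pattern;

public class LoopRuleDemo {
    // Same pattern as the config line: a slash-delimited segment repeated 3+ times.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    public static void main(String[] args) {
        // A crawler trap: the segment "/a" recurs three times.
        System.out.println(LOOP.matcher("http://example.com/a/b/a/c/a/d/").find()); // true
        // A normal deep path has no repeated segment and does not trigger the rule.
        System.out.println(LOOP.matcher("http://example.com/a/b/c/d/").find());     // false
    }
}
```

This is what keeps the crawler from getting stuck in infinite URL loops generated by broken relative links.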

          # accept hosts in MY.DOMAIN.NAME
          ###########################7shop24########################################
          #+^http://([a-z0-9]*\.)*7shop24.com/
          #+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$



          ###############################http://www.redbaby.com.cn/##############################

           

# The crawl rules are ordered; they are not written arbitrarily. For example, to
# crawl product pages you must first admit the home page; products live under
# category pages, so the category pages must be included too; only then do you
# add the regex for the product pages themselves. That is how the whole chain
# gets covered. For example:
          +^http://www.redbaby.com.cn/$
+^http://www.redbaby.com.cn/([a-zA-Z]*\.)*index\.html$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/index\.html+$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w\d\.htm$
          # skip everything else
          -.
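As the header comment says, evaluation is first-match-wins: the first rule whose regex matches decides whether the URL is kept, and the trailing `-.` discards everything else. A minimal sketch of that evaluation loop (this is an illustration, not Nutch's actual `RegexURLFilter`; the rule list is a shortened subset of the config above):

```java
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilter {
    record Rule(boolean accept, Pattern pattern) {
        static Rule of(String line) {
            // A rule line is '+' or '-' followed by a Java regex.
            return new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1)));
        }
    }

    // Shortened subset of the rules above, in file order.
    static final List<Rule> RULES = List.of(
            Rule.of("-^(file|ftp|mailto):"),
            Rule.of("-\\.(gif|jpg|png|css|zip)$"),
            Rule.of("+[?*!@=]"),
            Rule.of("+^http://www.redbaby.com.cn/$"),
            Rule.of("-.")); // skip everything else

    // The first rule whose pattern matches decides; no match means ignored.
    static boolean accepts(String url) {
        for (Rule r : RULES) {
            if (r.pattern().matcher(url).find()) {
                return r.accept();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.redbaby.com.cn/"));        // true
        System.out.println(accepts("http://a.com/a.jsp?a=001"));          // true: query rule
        System.out.println(accepts("ftp://mirror.example.com/file.txt")); // false
        System.out.println(accepts("http://a.com/logo.gif"));             // false
    }
}
```

This also shows why rule order matters: `+[?*!@=]` only rescues dynamic URLs that were not already rejected by an earlier `-` rule.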

           

           

Java regex escapes you may need when writing URL rules:

?    escape as     \?

_ (underscore)   can be matched with   \w   (note: \w actually matches any word character [a-zA-Z0-9_], not only the underscore)

. (dot)    escape as    \.
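A quick check of the escapes in the table, showing why they matter when matching URLs:

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    public static void main(String[] args) {
        // '?' must be escaped: unescaped it is the "optional" quantifier.
        System.out.println(Pattern.compile("a\\.jsp\\?id=\\d+")
                .matcher("a.jsp?id=001").matches());                        // true

        // An unescaped '.' matches ANY character, so "indexXhtml" slips through:
        System.out.println(Pattern.compile("index.html")
                .matcher("indexXhtml").matches());                          // true
        // Escaped, only the literal dot matches:
        System.out.println(Pattern.compile("index\\.html")
                .matcher("indexXhtml").matches());                          // false

        // \w matches letters and digits too, not just '_':
        System.out.println(Pattern.compile("\\w").matcher("a").matches());  // true
    }
}
```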


Reposted from: "nutch 最新使用日志" (a Nutch usage log)