泰仔在線


          Nutch URL Filter Configuration Rules

          Posted on 2010-04-30 10:12 by 泰仔在線 · Category: Cloud Computing

          There is plenty of Nutch source-code analysis online, but the crawling part is still not easy to understand. Today I finally figured out how to configure it. I'm posting my crawl-urlfilter.txt below so we can discuss it, and as a memo for myself.

           

          # Licensed to the Apache Software Foundation (ASF) under one or more
          # contributor license agreements.  See the NOTICE file distributed with
          # this work for additional information regarding copyright ownership.
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.


          # The url filter file used by the crawl command.

          # Better for intranet crawling.
          # Be sure to change MY.DOMAIN.NAME to your domain name.

          # Each non-comment, non-blank line contains a regular expression
          # prefixed by '+' or '-'.  The first matching pattern in the file
          # determines whether a URL is included or ignored.  If no pattern
          # matches, the URL is ignored.

          # skip file:, ftp:, & mailto: urls
          -^(file|ftp|mailto):

          # skip image and other suffixes we can't yet parse
          -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

          # skip URLs containing certain characters as probable queries, etc.

          # Very important for crawling dynamic sites; it must be set this way.
          # Otherwise pages with a question mark, such as a.jsp?a=001, cannot be crawled.
          +[?*!@=]

          # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
          -.*(/[^/]+)/[^/]+\1/[^/]+\1/

          # accept hosts in MY.DOMAIN.NAME
          ###########################7shop24########################################
          #+^http://([a-z0-9]*\.)*7shop24.com/
          #+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$



          ###############################http://www.redbaby.com.cn/##############################

           

          # The crawl rules are ordered; they are not written arbitrarily. For example, to crawl
          # product pages you must first include the home page; products sit on category pages,
          # so the category pages must be included as well; only then do you add the regex for
          # the product pages themselves. That is how the whole task gets done. For example:
          +^http://www.redbaby.com.cn/$
          +^http://www.redbaby.com.cn/([a-zA-Z]*\.)*index\.html$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/index\.html+$
          +^http://www.redbaby.com.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
          +^http://www.redbaby.com.cn/Product/Product_List\.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
          +^http://www.redbaby.com.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w\d\.htm$
          # skip everything else
          -.
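The first-match-wins semantics described in the comments above can be sketched in plain Java. This is an illustrative re-implementation, not Nutch's actual RegexURLFilter; the class and method names here are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SimpleUrlFilter {
    // Each rule is a sign ('+' accept / '-' reject) plus a compiled regex.
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Parse lines like "+^http://..." or "-\\.(gif|jpg)$"; skip comments and blanks.
    void addLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) return;
        rules.add(new Rule(line.charAt(0) == '+', line.substring(1)));
    }

    // The first matching rule decides; if no pattern matches, the URL is ignored.
    boolean accepts(String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) return r.accept;
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleUrlFilter f = new SimpleUrlFilter();
        f.addLine("-^(file|ftp|mailto):");
        f.addLine("-\\.(gif|jpg|png)$");
        f.addLine("+[?*!@=]");                          // keep query URLs
        f.addLine("+^http://www\\.redbaby\\.com\\.cn/");
        f.addLine("-.");                                // skip everything else
        System.out.println(f.accepts("http://www.redbaby.com.cn/Product/x.aspx?id=1")); // true
        System.out.println(f.accepts("http://www.redbaby.com.cn/logo.gif"));            // false
        System.out.println(f.accepts("http://other.com/index.html"));                   // false
    }
}
```

Note how rule order matters: because `+[?*!@=]` comes before the catch-all `-.`, query-string URLs survive, which is exactly the point of the dynamic-site tweak above.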

           

           

          Java regex escapes you may need when writing URL-matching rules:

          ?               corresponds to   \?

          _ (underscore)  is matched by    \w

          . (dot)         corresponds to   \.
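These escapes can be checked directly with `java.util.regex.Pattern`; a minimal sketch (the sample URLs are made up):

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    public static void main(String[] args) {
        // An unescaped '?' is a quantifier, so a literal '?' must be written \?
        System.out.println(Pattern.compile("\\.jsp\\?a=").matcher("/a.jsp?a=001").find()); // true

        // \w matches word characters, which include the underscore.
        System.out.println(Pattern.compile("Product\\wList").matcher("Product_List.aspx").find()); // true

        // An unescaped '.' matches any character; '\.' matches only a literal dot.
        System.out.println(Pattern.compile("index.html").matcher("indexXhtml").find());   // true
        System.out.println(Pattern.compile("index\\.html").matcher("indexXhtml").find()); // false
    }
}
```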


          Reposted from: "Nutch latest usage notes"