泰仔在線


Nutch URL Filter Configuration Rules

Posted on 2010-04-30 10:12 by 泰仔在線 | Category: Cloud Computing

There are plenty of Nutch source-code walkthroughs online, but the crawling part is still not easy to grasp. Today I finally figured out how to configure it. I'm posting my crawl-urlfilter.txt file here so we can discuss it together, and as a note to myself.

           

          # Licensed to the Apache Software Foundation (ASF) under one or more
          # contributor license agreements.  See the NOTICE file distributed with
          # this work for additional information regarding copyright ownership.
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.


          # The url filter file used by the crawl command.

          # Better for intranet crawling.
          # Be sure to change MY.DOMAIN.NAME to your domain name.

          # Each non-comment, non-blank line contains a regular expression
          # prefixed by '+' or '-'.  The first matching pattern in the file
          # determines whether a URL is included or ignored.  If no pattern
          # matches, the URL is ignored.

          # skip file:, ftp:, & mailto: urls
          -^(file|ftp|mailto):

          # skip image and other suffixes we can't yet parse
          -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

          # skip URLs containing certain characters as probable queries, etc.

# This rule is essential for crawling dynamic sites. Without it, URLs with a
# query string, such as a.jsp?a=001, cannot be fetched at all.
          +[?*!@=]

          # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
          -.*(/[^/]+)/[^/]+\1/[^/]+\1/
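The loop-breaking rule above uses a backreference (`\1`) to detect a path segment that repeats three or more times. As a sanity check, it can be exercised directly with Java's regex engine (a quick sketch; the URLs are made-up examples):

```java
import java.util.regex.Pattern;

public class LoopRuleDemo {
    // Same pattern as the config line: a slash-delimited segment repeated 3+ times.
    static final Pattern LOOP = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    public static void main(String[] args) {
        // A crawler trap: the segment "/a" recurs three times.
        System.out.println(LOOP.matcher("http://example.com/a/b/a/c/a/d/").find()); // true
        // A normal deep path has no repeated segment and does not trigger the rule.
        System.out.println(LOOP.matcher("http://example.com/a/b/c/d/").find());     // false
    }
}
```

This is what keeps the crawler from getting stuck in infinite URL loops generated by broken relative links.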

          # accept hosts in MY.DOMAIN.NAME
          ###########################7shop24########################################
          #+^http://([a-z0-9]*\.)*7shop24.com/
          #+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$



          ###############################http://www.redbaby.com.cn/##############################

           

# The crawl rules are ordered; they are not written arbitrarily. For example, to
# crawl product pages you must first admit the home page; products live under
# category pages, so the category pages must be included too; only then do you
# add the regex for the product pages themselves. That is how the whole chain
# gets covered. For example:
          +^http://www.redbaby.com.cn/$
+^http://www.redbaby.com.cn/([a-zA-Z]*\.)*index\.html$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/$
          +^http://www.redbaby.com.cn/([a-zA-Z]*)/index\.html+$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
          +^http://www.redbaby.com.cn/Product/Product_List.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
          +^http://www.redbaby.com.cn/Product/ProductInfo\w\d\w\d\.htm$
          # skip everything else
          -.
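As the header comment says, evaluation is first-match-wins: the first rule whose regex matches decides whether the URL is kept, and the trailing `-.` discards everything else. A minimal sketch of that evaluation loop (this is an illustration, not Nutch's actual `RegexURLFilter`; the rule list is a shortened subset of the config above):

```java
import java.util.List;
import java.util.regex.Pattern;

public class FirstMatchFilter {
    record Rule(boolean accept, Pattern pattern) {
        static Rule of(String line) {
            // A rule line is '+' or '-' followed by a Java regex.
            return new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1)));
        }
    }

    // Shortened subset of the rules above, in file order.
    static final List<Rule> RULES = List.of(
            Rule.of("-^(file|ftp|mailto):"),
            Rule.of("-\\.(gif|jpg|png|css|zip)$"),
            Rule.of("+[?*!@=]"),
            Rule.of("+^http://www.redbaby.com.cn/$"),
            Rule.of("-.")); // skip everything else

    // The first rule whose pattern matches decides; no match means ignored.
    static boolean accepts(String url) {
        for (Rule r : RULES) {
            if (r.pattern().matcher(url).find()) {
                return r.accept();
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://www.redbaby.com.cn/"));        // true
        System.out.println(accepts("http://a.com/a.jsp?a=001"));          // true: query rule
        System.out.println(accepts("ftp://mirror.example.com/file.txt")); // false
        System.out.println(accepts("http://a.com/logo.gif"));             // false
    }
}
```

This also shows why rule order matters: `+[?*!@=]` only rescues dynamic URLs that were not already rejected by an earlier `-` rule.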

           

           

Java regex escapes you may need when writing URL rules:

?    escape as     \?

_ (underscore)   can be matched with   \w   (note: \w actually matches any word character [a-zA-Z0-9_], not only the underscore)

. (dot)    escape as    \.
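A quick check of the escapes in the table, showing why they matter when matching URLs:

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    public static void main(String[] args) {
        // '?' must be escaped: unescaped it is the "optional" quantifier.
        System.out.println(Pattern.compile("a\\.jsp\\?id=\\d+")
                .matcher("a.jsp?id=001").matches());                        // true

        // An unescaped '.' matches ANY character, so "indexXhtml" slips through:
        System.out.println(Pattern.compile("index.html")
                .matcher("indexXhtml").matches());                          // true
        // Escaped, only the literal dot matches:
        System.out.println(Pattern.compile("index\\.html")
                .matcher("indexXhtml").matches());                          // false

        // \w matches letters and digits too, not just '_':
        System.out.println(Pattern.compile("\\w").matcher("a").matches());  // true
    }
}
```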


Reposted from: "nutch 最新使用日志" (a Nutch usage log)