泰仔在線

          Java learning, mood diary, colorful moments

          Nutch URL Filter Configuration Rules

          Posted on 2010-04-30 10:12 by 泰仔在線. Category: Cloud computing

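          There are plenty of source-code walkthroughs of Nutch online, but the crawling part is still hard to understand. Today I finally worked out how to set it up. I am posting my crawl-urlfilter.txt file here so we can discuss it, and as a note to myself.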

           

          # Licensed to the Apache Software Foundation (ASF) under one or more
          # contributor license agreements.  See the NOTICE file distributed with
          # this work for additional information regarding copyright ownership.
          # The ASF licenses this file to You under the Apache License, Version 2.0
          # (the "License"); you may not use this file except in compliance with
          # the License.  You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.


          # The url filter file used by the crawl command.

          # Better for intranet crawling.
          # Be sure to change MY.DOMAIN.NAME to your domain name.

          # Each non-comment, non-blank line contains a regular expression
          # prefixed by '+' or '-'.  The first matching pattern in the file
          # determines whether a URL is included or ignored.  If no pattern
          # matches, the URL is ignored.

          # skip file:, ftp:, & mailto: urls
          -^(file|ftp|mailto):

          # skip image and other suffixes we can't yet parse
          -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

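          # Accept URLs containing these characters. The stock file skips them as
          # probable queries with -[?*!@=]. Changing the sign is essential for
          # crawling dynamic sites; otherwise pages whose URLs contain a question
          # mark, such as a.jsp?a=001, cannot be crawled. Because the first matching
          # pattern wins, any URL containing one of these characters is accepted
          # here and never reaches the rules below.
          +[?*!@=]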

          # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
          -.*(/[^/]+)/[^/]+\1/[^/]+\1/

          # accept hosts in MY.DOMAIN.NAME
          ###########################7shop24########################################
          #+^http://([a-z0-9]*\.)*7shop24.com/
          #+^http://www.7shop24.com/indexdtl06.asp\?classid=([0-9]*)&productid=([0-9]*)+$



          ###############################http://www.redbaby.com.cn/##############################

           

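          # The rules must follow the path the crawler takes; they cannot be written
          # arbitrarily. To crawl product pages, first let the home page through.
          # Products are linked from category pages, so the category pages must be
          # included as well. Only then add the regexes for the product pages
          # themselves. If a page along the way is filtered out, the links on it
          # are never discovered. For example: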
          +^http://www\.redbaby\.com\.cn/$
          +^http://www\.redbaby\.com\.cn/([a-zA-Z]*\.)*index\.html$
          +^http://www\.redbaby\.com\.cn/([a-zA-Z]*)/$
          +^http://www\.redbaby\.com\.cn/([a-zA-Z]*)/index\.html+$
          +^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d+$
          +^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BrandID=\d&BranchID=\d+$
          +^http://www\.redbaby\.com\.cn/Product/ProductInfo\w\d\w([0-9]*\.)*html$
          +^http://www\.redbaby\.com\.cn/Product/Product_List\.aspx\?Site=\d&BranchID=\d&DepartmentID=\d&SortID=\d+$
          +^http://www\.redbaby\.com\.cn/Product/ProductInfo\w\d\w\d\.htm$
          # skip everything else
          -.
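
          The first-match-wins behaviour described in the file's comments can be sketched in plain Java. This is only an illustration, not Nutch's own filter class: the name UrlFilterSketch and its methods are made up for this example, the rule list is a shortened subset of the file above, and it assumes patterns are searched anywhere in the URL (which is why the rules carry explicit ^ and $ anchors).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative first-match-wins URL filter; a hypothetical class, not part of Nutch. */
final class UrlFilterSketch {

    /** One rule: include is true for a '+' line, false for a '-' line. */
    private static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(boolean include, Pattern pattern) {
            this.include = include;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines written in the crawl-urlfilter.txt format. */
    static UrlFilterSketch fromLines(List<String> lines) {
        UrlFilterSketch filter = new UrlFilterSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Invalid first character: " + line);
            }
            filter.rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
        return filter;
    }

    /** Returns the sign of the first rule whose pattern is found in the URL; false if none match. */
    boolean accept(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = fromLines(List.of(
                "# shortened subset of the rules above",
                "-^(file|ftp|mailto):",
                "+[?*!@=]",
                "+^http://www\\.redbaby\\.com\\.cn/$",
                "+^http://www\\.redbaby\\.com\\.cn/([a-zA-Z]*)/$",
                "-."));

        String[] urls = {
                "http://www.redbaby.com.cn/",            // home page rule
                "http://www.redbaby.com.cn/toys/",       // category rule (hypothetical path)
                "http://www.redbaby.com.cn/about.html",  // falls through to "-."
                "http://other.example.com/a.jsp?a=001",  // accepted early by +[?*!@=]
                "mailto:someone@example.com"             // rejected before +[?*!@=] is tried
        };
        for (String url : urls) {
            System.out.println((filter.accept(url) ? "+ " : "- ") + url);
        }
    }
}
```

          The fourth URL shows the side effect of +[?*!@=]: a URL on an unrelated host is accepted because it contains ? and =, and the host rules further down are never consulted. If that is not wanted, one option is to drop that line and rely on the explicit \? rules for the query URLs you need.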

           

           

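          Java regex escapes that come up when matching URLs:

          ?  (question mark)  is written as  \?

          _  (underscore)  can be matched with  \w  (which also matches letters and digits; a literal _ needs no escaping)

          .  (dot)  is written as  \.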
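
          To see why these escapes matter, here is a small sketch using java.util.regex directly. The class name RegexEscapeSketch, the helper found, and the example.com URL are made up for illustration. Note that inside a Java string literal each backslash is doubled, while in crawl-urlfilter.txt it is written once.

```java
import java.util.regex.Pattern;

/** Illustrative checks of regex escaping for URLs; a hypothetical helper class. */
final class RegexEscapeSketch {

    /** True if the regex is found anywhere in the text. */
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String url = "http://www.example.com/Product_List.aspx?Site=1";

        // Unescaped '?' makes the preceding 'x' optional, so the literal '?' in the URL is never matched.
        System.out.println(found("Product_List.aspx?Site=1$", url));      // false
        // Escaped '\?' matches the literal question mark.
        System.out.println(found("Product_List\\.aspx\\?Site=1$", url));  // true

        // Unescaped '.' matches any character, not only a dot.
        System.out.println(found("^index.html$", "indexXhtml"));          // true
        System.out.println(found("^index\\.html$", "indexXhtml"));        // false

        // '\w' matches an underscore, but also any letter or digit.
        System.out.println(found("^Product\\wList$", "Product_List"));    // true
        System.out.println(found("^Product\\wList$", "ProductXList"));    // true
        // A literal underscore is not a metacharacter and needs no escape.
        System.out.println(found("^Product_List$", "ProductXList"));      // false
    }
}
```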


          Reposted from: nutch 最新使用日志 (Nutch latest usage log)