Dedian  
          -- Focused on search engine development
          Still working on the web crawler part; the URL collection strategies are still being thought through. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

          1. Duplicate URL elimination:
              a. Host name aliases --> DNS resolver
              b. Omitted port numbers
              c. Alternative paths on the same host
              d. Replication across different hosts
              e. Nonsense links or session IDs embedded in URLs?
          2. Reachability of URLs
          3. Distributed storage of the URL inventory and the related synchronization problem
          4. Fetch strategies for the URL frontier or fetcher to get active links for parsing
          5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
          7. URL-seen test: has the page already been parsed, and should it be re-parsed? This should be done before a URL enters the URL frontier...
          8. Extensibility issues for the modules: Fetcher, Extractor/Filters, Collector...
          9. Checkpointing for interrupted crawls: how to resume a crawl job, and how to split crawl jobs and distribute them to different machines
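
To make items 1b, 1e and 7 a bit more concrete, here is a minimal sketch in Python of how a canonical URL form could feed a URL-seen test in front of the frontier. Everything in it is an assumption for illustration, not the design I have settled on: the names `canonicalize`, `Frontier`, `SESSION_PARAMS` and `DEFAULT_PORTS` are made up, the list of session-ID parameters is a guess, and sorting query parameters is a choice that may not suit every site. Host name aliases (1a), alternative paths (1c) and mirrored hosts (1d) are not handled here, since they need DNS lookups or content fingerprints.

```python
import re
from collections import deque
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query keys treated as session IDs (an assumption).
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a normalized form of url, used only for duplicate detection."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Item 1b: drop the port when it is the default for the scheme.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Item 1e: strip ";jsessionid=..." from the path and session keys from the query.
    path = re.sub(r";jsessionid=[^/;]*", "", parts.path, flags=re.IGNORECASE) or "/"
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    ))
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


class Frontier:
    """FIFO queue of URLs to fetch, guarded by an in-memory URL-seen test."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        """Enqueue url unless its canonical form was already seen."""
        key = canonicalize(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        self._queue.append(key)
        return True

    def next(self):
        """Return the next URL to fetch, or None when the frontier is empty."""
        return self._queue.popleft() if self._queue else None


frontier = Frontier()
frontier.add("HTTP://Example.com:80/a/b;jsessionid=ABC?x=1&sid=9#top")
frontier.add("http://example.com/a/b?x=1")  # rejected as a duplicate
```

The seen-set here lives in one process's memory, so it says nothing yet about the distributed storage, re-parsing schedule or checkpointing questions above; those would need a persistent, shared store in place of the Python `set`.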

          It seems that I need a couple of days to refine my system architecture design...
          posted on 2006-06-09 08:57 Dedian

          Copyright © Dedian