Dedian

-- Focused on search engine development
日歷
統(tǒng)計(jì)
導(dǎo)航常用鏈接留言簿(8)隨筆分類(45)
隨筆檔案(82)
文章檔案(2)Java Spaces搜索積分與排名
最新評論
閱讀排行榜評論排行榜 |
Still working on the web crawler part; the URL collection strategies are under consideration. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will be used to handle synchronized I/O operations against the URL collection/inventory. I'm stuck on a few issues:

1. Duplicate URL elimination:
   a. Host name aliases --> DNS resolver
   b. Omitted port numbers
   c. Alternative paths on the same host
   d. Replication across different hosts
   e. Nonsense links or session IDs embedded in URLs
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problems
4. Fetch strategies for the URL frontier, i.e., how the fetcher obtains active links for parsing
5. Scheduling of fetches and updates to the URL collection: multi-threaded or single-threaded on each machine, and when to decide to re-parse a page
6. URL-seen test: has this page already been parsed, and should it be re-parsed? This test should happen before a URL enters the frontier...
7. Extensibility of the modules: Fetcher, Extractor/Filters, Collector...
8. Checkpointing for interrupted crawls: how to resume a crawler job, and how to split crawler jobs and distribute them across machines

It seems I need a couple of days to refine my system architecture design...
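Cases b, c, and e of duplicate elimination can be handled by canonicalizing every URL before the URL-seen test. The sketch below is one hypothetical approach (the class name and the list of session-ID parameter names are my own assumptions, not part of the design above); host aliases (case a) and cross-host replication (case d) still need a DNS resolver and content fingerprinting, which this does not cover.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: duplicate-URL elimination by canonicalization.
public class UrlCanonicalizer {

    // Returns a canonical form so trivially different spellings of the
    // same resource compare equal.
    public static String canonicalize(String url) throws URISyntaxException {
        URI uri = new URI(url).normalize(); // collapses "." and ".." path segments

        String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme (omitted-port case).
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();

        // Strip session-ID style query parameters (a crude filter for case e;
        // the parameter names below are assumed, not exhaustive).
        String query = uri.getQuery();
        if (query != null) {
            StringBuilder kept = new StringBuilder();
            for (String param : query.split("&")) {
                String key = param.split("=", 2)[0].toLowerCase(Locale.ROOT);
                if (key.equals("jsessionid") || key.equals("phpsessid") || key.equals("sid")) {
                    continue; // session IDs make identical pages look distinct
                }
                if (kept.length() > 0) kept.append('&');
                kept.append(param);
            }
            query = kept.length() == 0 ? null : kept.toString();
        }

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) out.append(':').append(port);
        out.append(path);
        if (query != null) out.append('?').append(query);
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(canonicalize("HTTP://Example.COM:80/a/../b/index.html?sid=abc123"));
        // http://example.com/b/index.html
    }
}
```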
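For the synchronized I/O between fetchers and the frontier (issues 4-6), a minimal single-machine sketch is a monitor-guarded FIFO queue with a "seen" set checked before enqueue, so the URL-seen test happens exactly once per URL. The class and method names here are illustrative assumptions; a distributed inventory (issue 3) would replace the in-memory set with shared storage.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical single-machine URL frontier: a FIFO queue of URLs to fetch,
// guarded by a "seen" set so the URL-seen test runs before enqueue.
public class UrlFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Adds a URL only if it has never been seen; returns true if enqueued.
    public synchronized boolean offer(String canonicalUrl) {
        if (!seen.add(canonicalUrl)) {
            return false; // URL-seen test failed: already queued or fetched
        }
        queue.add(canonicalUrl);
        notifyAll(); // wake fetcher threads blocked in take()
        return true;
    }

    // Blocks until a URL is available, then hands it to a fetcher thread.
    public synchronized String take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        return queue.poll();
    }

    public synchronized int pending() {
        return queue.size();
    }
}
```

Multiple fetcher threads can share one `UrlFrontier` instance: each loops on `take()`, downloads the page, and calls `offer()` for every extracted link (after canonicalization), with duplicates rejected automatically.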
Copyright © Dedian | Powered by: 博客园 | Template by: 沪江博客