Dedian

-- Focused on search engine development
日歷
統(tǒng)計(jì)
導(dǎo)航常用鏈接留言簿(8)隨筆分類(45)
隨筆檔案(82)
文章檔案(2)Java Spaces搜索積分與排名
最新評論
閱讀排行榜評論排行榜 |
Still working on the web crawler part; the URL collection strategies are under consideration. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will be used to handle synchronized I/O operations against the URL collection/inventory. I'm stuck on a few issues:

1. Duplicate URL elimination:
   a. Host name aliases --> DNS resolver
   b. Omitted port numbers
   c. Alternative paths on the same host
   d. Replication across different hosts
   e. Nonsense links or session IDs embedded in URLs
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problems
4. Fetch strategies for the URL frontier, i.e., how the fetcher obtains active links for parsing
5. Scheduling of fetches and updates to the URL collection: multi-threaded or single-threaded on each machine, and when to decide to re-parse a page
6. URL-seen test: has this page already been parsed, and should it be re-parsed? This test should happen before a URL enters the frontier...
7. Extensibility of the modules: Fetcher, Extractor/Filters, Collector...
8. Checkpointing for interrupted crawls: how to resume a crawler job, and how to split crawler jobs and distribute them across machines

It seems I need a couple of days to refine my system architecture design...
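Cases b, c, and e of duplicate elimination can be handled by canonicalizing every URL before the URL-seen test. The sketch below is one hypothetical approach (the class name and the list of session-ID parameter names are my own assumptions, not part of the design above); host aliases (case a) and cross-host replication (case d) still need a DNS resolver and content fingerprinting, which this does not cover.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper: duplicate-URL elimination by canonicalization.
public class UrlCanonicalizer {

    // Returns a canonical form so trivially different spellings of the
    // same resource compare equal.
    public static String canonicalize(String url) throws URISyntaxException {
        URI uri = new URI(url).normalize(); // collapses "." and ".." path segments

        String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port when it is the default for the scheme (omitted-port case).
        int port = uri.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();

        // Strip session-ID style query parameters (a crude filter for case e;
        // the parameter names below are assumed, not exhaustive).
        String query = uri.getQuery();
        if (query != null) {
            StringBuilder kept = new StringBuilder();
            for (String param : query.split("&")) {
                String key = param.split("=", 2)[0].toLowerCase(Locale.ROOT);
                if (key.equals("jsessionid") || key.equals("phpsessid") || key.equals("sid")) {
                    continue; // session IDs make identical pages look distinct
                }
                if (kept.length() > 0) kept.append('&');
                kept.append(param);
            }
            query = kept.length() == 0 ? null : kept.toString();
        }

        StringBuilder out = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) out.append(':').append(port);
        out.append(path);
        if (query != null) out.append('?').append(query);
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(canonicalize("HTTP://Example.COM:80/a/../b/index.html?sid=abc123"));
        // http://example.com/b/index.html
    }
}
```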
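For the synchronized I/O between fetchers and the frontier (issues 4-6), a minimal single-machine sketch is a monitor-guarded FIFO queue with a "seen" set checked before enqueue, so the URL-seen test happens exactly once per URL. The class and method names here are illustrative assumptions; a distributed inventory (issue 3) would replace the in-memory set with shared storage.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical single-machine URL frontier: a FIFO queue of URLs to fetch,
// guarded by a "seen" set so the URL-seen test runs before enqueue.
public class UrlFrontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Adds a URL only if it has never been seen; returns true if enqueued.
    public synchronized boolean offer(String canonicalUrl) {
        if (!seen.add(canonicalUrl)) {
            return false; // URL-seen test failed: already queued or fetched
        }
        queue.add(canonicalUrl);
        notifyAll(); // wake fetcher threads blocked in take()
        return true;
    }

    // Blocks until a URL is available, then hands it to a fetcher thread.
    public synchronized String take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        return queue.poll();
    }

    public synchronized int pending() {
        return queue.size();
    }
}
```

Multiple fetcher threads can share one `UrlFrontier` instance: each loops on `take()`, downloads the page, and calls `offer()` for every extracted link (after canonicalization), with duplicates rejected automatically.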
Copyright © Dedian | Powered by: 博客园 | Template by: 沪江博客