Dedian
-- Focused on search engine development
+ Webcrawler
    -- study open source code
        purpose: analyze code structure and basic components
        focus on: Nutch (http://lucene.apache.org/nutch/)
                  & HTMLParser (http://htmlparser.sourceforge.net/)
                  & GData (http://code.google.com/apis/gdata/overview.html)
    -- understand the PageRank idea
        related articles:
        http://en.wikipedia.org/wiki/PageRank
        http://www.thesitewizard.com/archive/google.shtml
        paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin
        http://www.rankforsales.com/n-aa/095-seo-may-31-03.html (about Chris Ridings & SEO)
        http://en.wikipedia.org/wiki/Web_crawler (basic idea of a crawler)
    -- get familiar with the RSS & Atom protocols
    -- sample coding:
        Interface: Scheduler for fetching web links
        Interface: Web page Parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> Parser classes based on a SAX parser
        Interface: Retractor/Fetcher --> to get links from a page
        Interface: Collector --> check whether a URL is a duplicate and save it in the URL database with a certain data structure
        Interface: InformationProcessor --> PageRank should be one important factor --> (still thinking it over)
        Interface: Policies (Filter) --> will serve the Collector and InformationProcessor --> (still thinking it over)
+ Indexer/Searcher (almost done, based on Lucene)
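
As a rough illustration of the Scheduler and Collector interfaces from the sample coding notes, here is a minimal in-memory sketch. Only the two interface names and their roles come from the notes. The method signatures, the FifoScheduler and InMemoryCollector classes, the HashSet standing in for the URL database, and the normalization rules are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical interfaces: the names follow the notes, the methods are assumptions.
interface Scheduler {
    void schedule(String url); // queue a URL for fetching
    String next();             // next URL to fetch, or null when nothing is queued
}

interface Collector {
    boolean collect(String url); // true if the URL was new and has been stored
}

// FIFO (breadth-first) scheduler kept entirely in memory.
class FifoScheduler implements Scheduler {
    private final Queue<String> queue = new ArrayDeque<>();

    public void schedule(String url) { queue.add(url); }

    public String next() { return queue.poll(); }
}

// Collector that rejects duplicates; a HashSet stands in for the URL database.
class InMemoryCollector implements Collector {
    private final Set<String> seen = new HashSet<>();
    private final Scheduler scheduler;

    InMemoryCollector(Scheduler scheduler) { this.scheduler = scheduler; }

    public boolean collect(String url) {
        String key = normalize(url);
        if (!seen.add(key)) {
            return false; // already known, do not schedule again
        }
        scheduler.schedule(key);
        return true;
    }

    // Very rough normalization: trim, drop the fragment, drop one trailing slash.
    static String normalize(String url) {
        String u = url.trim();
        int hash = u.indexOf('#');
        if (hash >= 0) {
            u = u.substring(0, hash);
        }
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }
}
```

A real crawler would likely need a persistent store and more careful URL canonicalization; the point here is only the split between "is this URL new?" (Collector) and "what do we fetch next?" (Scheduler).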
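
To make the PageRank idea from the notes concrete, the following is a small power-iteration sketch. It uses the normalized variant of the formula (ranks sum to 1) and spreads the rank of pages without outgoing links evenly over all pages. The PageRankSketch class, its signature, and the adjacency-array input are assumptions for illustration, not how Nutch or Google computes rank.

```java
import java.util.Arrays;

// Minimal PageRank by power iteration (illustrative sketch only).
class PageRankSketch {
    // links[i] lists the indices of the pages that page i links to.
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // total rank held by pages with no out-links

            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                } else {
                    // Each page splits its rank equally among its out-links.
                    double share = rank[i] / links[i].length;
                    for (int j : links[i]) {
                        next[j] += share;
                    }
                }
            }

            for (int i = 0; i < n; i++) {
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

For example, with pages 0 and 1 both linking to page 2, and page 2 linking back to page 0, calling `PageRankSketch.pageRank(new int[][] {{2}, {2}, {0}}, 0.85, 50)` ranks page 2 highest and page 1 (which has no incoming links) lowest.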
Copyright © Dedian | Powered by: 博客園 | Template by: 滬江博客