Hadoop Study Notes (1)
persister · 2010-03-12

Today I downloaded Hadoop, worked through the tutorial in its documentation, and then implemented a word count example by following the article "Distributed data processing with Hadoop, Part 1: Getting started".

Below are some notes from the theory reading:

The storage is provided by HDFS, and analysis by MapReduce.

MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

MapReduce tries to colocate the data with the compute node, so data access is fast since it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. On the other hand, if splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time. For most jobs, a good split size tends to be the size of an HDFS block, 64 MB by default.

Reduce tasks don't have the advantage of data locality: the input to a single reduce task is normally the output from all mappers.

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function.
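Since the post is built around the word count example and ends on the combiner, here is a minimal word count sketch against the old org.apache.hadoop.mapred API of that era (my own illustration, not the tutorial's exact code). Note how the reducer doubles as the combiner, the standard trick for word count:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      // Mapper: one map task per input split; emits (word, 1) for every token.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer tok = new StringTokenizer(value.toString());
          while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            out.collect(word, ONE);
          }
        }
      }

      // Reducer: sums the counts. Registered as the combiner too, so partial
      // sums are computed on the map side before crossing the network.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          out.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // combiner = reducer here
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }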
Why is a block in HDFS so large? HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate. A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need a block size of around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives. This argument shouldn't be taken too far, however: map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

In other words: with a large block, the seek time is a small share of the total and the cost is dominated by the transfer time, whereas with a small block the seek time becomes comparable to the transfer time, so the total is roughly twice the transfer time, which is a bad deal. Concretely, for 100 MB of data and a 100 MB block size, the transfer takes 1 s (at 100 MB/s). With a 1 MB block size the transfers still take 1 s in total, but the seeks add 10 ms × 100 = 1 s, so the whole read takes 2 s. Is bigger always better, then? No: if blocks are too large, a file may not be stored in a distributed fashion, the MapReduce model can no longer be exploited well, and processing may actually get slower.


The VInt (Variable-Length Integer) in Lucene's Storage Format
persister · 2010-02-02

A variable-length format for positive integers is defined where the high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits are appended as increasingly more significant bits in the resulting integer value. Thus values from zero to 127 may be stored in a single byte, values from 128 to 16,383 may be stored in two bytes, and so on.

Definition of the variable-length integer format: the high-order bit of each byte indicates whether more bytes remain to be read; the low-order seven bits are the payload, which is appended into the resulting value.

For example, 00000001 has a high bit of 0, so the number fits in this single byte; the payload is the remaining seven bits, 0000001, giving the value 1. In 10000010 00000001, the first byte's high bit is 1, meaning another byte follows, and the second byte's high bit is 0, meaning it is the last one, so the value occupies two bytes. Note how the value is assembled: the seven payload bits of the last byte become the most significant bits, working back to the first byte, whose seven payload bits are the least significant. This sequence therefore represents 0000001 0000010, which is the integer 130.

VInt Encoding Example

Value    First byte   Second byte   Third byte
0        00000000
1        00000001
2        00000010
...
127      01111111
128      10000000     00000001
129      10000001     00000001
130      10000010     00000001
...
16,383   11111111     01111111
16,384   10000000     10000000      00000001
16,385   10000001     10000000      00000001
...
The Lucene source code writes and reads this format as follows. OutputStream is responsible for writing:

  /** Writes an int in a variable-length format.  Writes between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see InputStream#readVInt()
   */
  public final void writeVInt(int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      writeByte((byte)((i & 0x7F) | 0x80));
      i >>>= 7;
    }
    writeByte((byte)i);
  }

InputStream is responsible for reading:
  /** Reads an int stored in variable-length format.  Reads between one and
   * five bytes.  Smaller values take fewer bytes.  Negative numbers are not
   * supported.
   * @see OutputStream#writeVInt(int)
   */
  public final int readVInt() throws IOException {
    byte b = readByte();
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = readByte();
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

>>> is the unsigned right shift operator.
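As a quick sanity check of the scheme, here is a small standalone sketch (my own, not Lucene code) that re-implements the same encoding over a byte array and prints the bytes for 130:

    import java.io.ByteArrayOutputStream;

    public class VIntDemo {
      // Encode with the same scheme as writeVInt above.
      static byte[] encode(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {
          out.write((i & 0x7F) | 0x80);  // low 7 bits, continuation bit set
          i >>>= 7;
        }
        out.write(i);                    // last byte, high bit clear
        return out.toByteArray();
      }

      // Decode, mirroring readVInt above.
      static int decode(byte[] bytes) {
        int i = bytes[0] & 0x7F;
        for (int k = 1, shift = 7; (bytes[k - 1] & 0x80) != 0; k++, shift += 7) {
          i |= (bytes[k] & 0x7F) << shift;
        }
        return i;
      }

      public static void main(String[] args) {
        byte[] b = encode(130);
        for (byte x : b) {
          // Prints 10000010 then 1 (00000001 without leading zeros),
          // matching the 130 row of the table above.
          System.out.println(Integer.toBinaryString(x & 0xFF));
        }
        System.out.println(decode(b));  // 130
      }
    }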



½W¬ä¸€‹Æ¡å°è¯•Nutchhttp://www.aygfsteel.com/persister/archive/2009/07/23/288039.htmlpersisterpersisterThu, 23 Jul 2009 07:43:00 GMThttp://www.aygfsteel.com/persister/archive/2009/07/23/288039.htmlhttp://www.aygfsteel.com/persister/comments/288039.htmlhttp://www.aygfsteel.com/persister/archive/2009/07/23/288039.html#Feedback0http://www.aygfsteel.com/persister/comments/commentRss/288039.htmlhttp://www.aygfsteel.com/persister/services/trackbacks/288039.html环境åQšNutch0.9+Fedora5+tomcat6+JDK6

One: Tomcat and the JDK are already installed.

Two: nutch-0.9.tar.gz
Unpack the downloaded tar.gz into /opt and rename it:
# gunzip -c nutch-0.9.tar.gz | tar xf -
# mv nutch-0.9 /usr/local/nutch

Test that the environment is set up correctly: run /usr/local/nutch/bin/nutch and check whether it prints its command-line usage; if it does, everything is fine.

The crawl:
# cd /opt/nutch
# mkdir urls
# vi urls/nutch.txt              (enter www.aicent.net)
# vi conf/crawl-urlfilter.txt    (add the following: a regular expression that filters which site URLs are crawled)
/**** accept hosts in MY.DOMAIN.NAME ****/
+^http://([a-z0-9]*\.)*aicent.net/
# vi conf/nutch-site.xml         (give your own spider a name), set as follows:
<configuration>
<property>
    <name>http.agent.name</name>
    <value>test/unique</value>
</property>
</configuration>

Start the crawl:
# bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log

Then wait a while; the time depends on the size of the site and the crawl depth you set.


Three: apache-tomcat

If every search returns 0 results, you need to adjust one parameter, because the search path of the Nutch webapp inside Tomcat is wrong:
# vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value>   <!-- the path holding the crawled pages -->
<description>My path to nutch's searcher dir.</description>
</property>

# /opt/tomcat/bin/startup.sh

OK, done.


Problem roundup:

Running: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >& ./logs/nutch_log.log

1.Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Searching online, some said this is a JDK version problem and that JDK 1.6 cannot be used, so I installed 1.5. But the same problem remained; strange.
So I kept googling and found the following possibility:

Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)

Explanation: this is usually a configuration problem in crawl-urlfilter.txt. For example, the filter rule should be
+^http://www.ihooyo.com , but if it is configured as http://www.ihooyo.com (without the +^), the error above is raised.

But my configuration had no such problem at all. Then I noticed that besides nutch_log.log, another log file is generated automatically under the logs directory: hadoop.log.
It contained an error:


2009-07-22 22:20:55,501 INFO  crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO  crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO  crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO  crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO  crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO  crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO  crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO  plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN  mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
        at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
        at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
        ... 8 more

So the host configuration was wrong. The fix:
Add the following to your /etc/hosts file:
127.0.0.1    jackliu

Ran the crawl again, and this time it succeeded!

 

2: Opened http://127.0.0.1:8080/nutch-0.9 and searched for "nutch"; the page reported an error:

HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value  language + "/include/header.html" is quoted with " which must be escaped when used within the value
 org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
 org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
 org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
 org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
 org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
 org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
 org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
 org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
 org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
 org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
 org.apache.jasper.compiler.Parser.parse(Parser.java:137)
 org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
 org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
 org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
 org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
 org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
 org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
 javax.servlet.http.HttpServlet.service(HttpServlet.java:717)

note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.

Analysis: looking at search.jsp under the Nutch web application's root directory, this is a quote-matching problem.

<jsp:include page="<%= language + "/include/header.html"%>"/>  //line 152 of search.jsp

The first quote is paired with the next quote that appears, not with the last quote on the line, hence the error.

Solution:

Change that line to: <jsp:include page="<%= language+urlsuffix %>"/>

Here we define a string urlsuffix, placed right after the definition of the language string:

  String language =   // line 116 of search.jsp
    ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
    .getLocale().getLanguage();
  String urlsuffix = "/include/header.html";

After the change, restart Tomcat to make sure it took effect, and search again; the error is gone.


3. No search results: my nutch_log.log differed from what was described online, and the crawl directory contained only two folders, segments and crawldb. I crawled again, and this time everything was fine; strangely, I don't know why the first crawl failed.

4. cached.jsp, explain.jsp, and others have the same error as in item 2 above; after fixing them the same way, everything is OK.

5. It took a whole morning plus half an afternoon today, but the Nutch installation and configuration is finally done. More studying tomorrow.

PhraseQuery, SpanQuery, and PhrasePrefixQuery
persister · 2009-07-14

PhraseQuery uses positional information for proximity-aware queries. With TermQuery, querying for, say, "我们" (we) and "祖国" (homeland) returns every record containing both words. But sometimes we want the two words to be at most one or two characters apart, rather than miles away from each other; that is what PhraseQuery is for. Consider the following document:
    doc.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
Then:
    String[] phrase = new String[] {"quick", "fox"};
    assertFalse("exact phrase not found", matched(phrase, 0));
    assertTrue("close enough", matched(phrase, 1));
Multiple terms:
    assertFalse("not close enough", matched(new String[] {"quick", "jumped", "lazy"}, 3));
    assertTrue("just enough", matched(new String[] {"quick", "jumped", "lazy"}, 4));
    assertFalse("almost but not quite", matched(new String[] {"lazy", "jumped", "quick"}, 7));
    assertTrue("bingo", matched(new String[] {"lazy", "jumped", "quick"}, 8));

The integer is the slop, set as follows; it is the number of positional moves allowed to line the terms up, in order, from one to the next:
    query.setSlop(slop);

™åºåºå¾ˆé‡è¦ï¼š
    String[] phrase = new String[] {"fox", "quick"};
assertFalse("hop flop", matched(phrase, 2));
assertTrue("hop hop slop", matched(phrase, 3));
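The matched() helper used by these asserts is not shown in the original post; a plausible sketch, assuming a single-document index held in the variable directory and the Hits-based API used throughout this post, is:

    // Hypothetical helper for the asserts above: builds a PhraseQuery over
    // "field" with the given slop and reports whether anything matched.
    private boolean matched(String[] phrase, int slop) throws Exception {
      PhraseQuery query = new PhraseQuery();
      query.setSlop(slop);
      for (int i = 0; i < phrase.length; i++) {
        query.add(new Term("field", phrase[i]));
      }
      IndexSearcher searcher = new IndexSearcher(directory);
      Hits hits = searcher.search(query);
      boolean matched = hits.length() > 0;
      searcher.close();
      return matched;
    }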

The principle, illustrated by a figure in the original post (image omitted): for the query terms quick and fox, fox only needs to move one position to match "quick brown fox", while for fox followed by quick, fox has to move three positions. The bigger the move, the smaller the score of that record, and the less likely it is to be returned.
SpanQuery uses positional information for even more interesting queries:

SpanQuery type    Description
SpanTermQuery     Used in conjunction with the other span query types. On its own, it is
                  functionally equivalent to TermQuery.
SpanFirstQuery    Matches spans that occur within the first part of a field.
SpanNearQuery     Matches spans that occur near one another.
SpanNotQuery      Matches spans that don't overlap one another.
SpanOrQuery       Aggregates matches of span queries.

SpanFirstQuery: to query for spans that occur within the first n positions of a field, use SpanFirstQuery.



quick = new SpanTermQuery(new Term("f", "quick"));
brown = new SpanTermQuery(new Term("f", "brown"));
red = new SpanTermQuery(new Term("f", "red"));
fox = new SpanTermQuery(new Term("f", "fox"));
lazy = new SpanTermQuery(new Term("f", "lazy"));
sleepy = new SpanTermQuery(new Term("f", "sleepy"));
dog = new SpanTermQuery(new Term("f", "dog"));
cat = new SpanTermQuery(new Term("f", "cat"));

SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
assertNoMatches(sfq);
sfq = new SpanFirstQuery(brown, 3);
assertOnlyBrownFox(sfq);

SpanNearQuery: spans near one another.

First, a note about PhraseQuery. It is not one of the span query classes, but it can do span-style matching. The terms of a matching document are usually adjacent; to tolerate intervening terms in the original document, or to match reversed terms, PhraseQuery provides the slop factor. This slop is the maximum "distance" allowed between two terms, but not distance in the conventional sense: it is the number of positional moves needed to arrange the terms, in order, into the given phrase. So PhraseQuery always computes the span relative to the order in which the terms appear in the document. For the document "quick brown fox", the phrase "quick fox" has a slop of 1 (fox moves back by one), whereas "fox quick" needs quick to move 3 positions, so its slop is 3.

Next, consider SpanTermQuery, a subclass of SpanQuery. It supports span matching, and not necessarily in the order the terms appear in the document: a separate flag states whether the query terms must match in order or may also match in the reverse order. Also, the matching span is not measured as a number of positional moves: it runs from the start position of the first span to the end position of the last span.

Using SpanTermQuery objects as the SpanQuery objects inside a SpanNearQuery gives an effect very similar to PhraseQuery. The third argument of SpanNearQuery's constructor is the inOrder flag: it controls whether the spans must occur in the given order or may also match in the reverse order.

Take the document "the quick brown fox jumps over the lazy dog":

      public void testSpanNearQuery() throws Exception {

           SpanQuery[] quick_brown_dog = new SpanQuery[]{quick, brown, dog};

           SpanNearQuery snq = new SpanNearQuery(quick_brown_dog, 0, true);  // document order, slop 0
           assertNoMatches(snq);     // no match

           snq = new SpanNearQuery(quick_brown_dog, 4, true);   // document order, slop 4
           assertNoMatches(snq);     // still no match

           snq = new SpanNearQuery(quick_brown_dog, 5, true);   // document order, slop 5
           assertOnlyBrownFox(snq);  // matches

           snq = new SpanNearQuery(new SpanQuery[]{lazy, fox}, 3, false);  // reverse order allowed, slop 3
           assertOnlyBrownFox(snq);  // matches

           // The same with PhraseQuery: because PhraseQuery follows document
           // order, lazy followed by fox needs a slop of 5
           PhraseQuery pq = new PhraseQuery();

           pq.add(new Term("f", "lazy"));

           pq.add(new Term("f", "fox"));

           pq.setSlop(4);

           assertNoMatches(pq);      // slop 4 cannot match

           pq.setSlop(5);

           assertOnlyBrownFox(pq);   // slop 5 matches

      }


3. PhrasePrefixQuery is mainly used for synonym queries:
    IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
    Document doc1 = new Document();
    doc1.add(Field.Text("field", "the quick brown fox jumped over the lazy dog"));
    writer.addDocument(doc1);
    Document doc2 = new Document();
    doc2.add(Field.Text("field","the fast fox hopped over the hound"));
    writer.addDocument(doc2);

    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term[] {new Term("field", "quick"), new Term("field", "fast")});
    query.add(new Term("field", "fox"));

    Hits hits = searcher.search(query);
    assertEquals("fast fox match", 1, hits.length());
    query.setSlop(1);
    hits = searcher.search(query);
    assertEquals("both match", 2, hits.length());



Some Thoughts on the User's Query Keywords in a Search Engine
persister · 2009-07-11
2. Synonym queries: both SynonymAnalyzer and PhrasePrefixQuery can solve this problem.

Analyzer
persister · 2009-07-07

Analyzer             Steps taken
WhitespaceAnalyzer   Splits tokens at whitespace
SimpleAnalyzer       Divides text at nonletter characters and lowercases
StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words
StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail addresses,
                     acronyms, Chinese-Japanese-Korean characters, alphanumerics, and more;
                     lowercases; and removes stop words
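To make the table concrete, here is a small sketch of my own (assuming the old Lucene TokenStream API used elsewhere on this blog) that runs one sentence through each analyzer and prints the resulting tokens:

    import java.io.StringReader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
      static void show(Analyzer analyzer, String text) throws Exception {
        System.out.print(analyzer.getClass().getName() + ": ");
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        for (Token t = stream.next(); t != null; t = stream.next()) {
          System.out.print("[" + t.termText() + "] ");
        }
        System.out.println();
      }

      public static void main(String[] args) throws Exception {
        String text = "The quick brown FOX jumped over the lazy dog";
        show(new WhitespaceAnalyzer(), text);  // keeps case, splits at spaces
        show(new SimpleAnalyzer(), text);      // lowercases, splits at non-letters
        show(new StopAnalyzer(), text);        // like SimpleAnalyzer, and drops "the", "over", ...
        show(new StandardAnalyzer(), text);    // grammar-based, lowercases, drops stop words
      }
    }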


Porter Stemming Algorithm
persister · 2009-07-06

Stemming can be described as reducing words to their roots (the original post links to an overview). In Latin-alphabet languages such as English, a word has many inflected forms, e.g. with -ed, -ing, or -ly attached. If tokenization can recover the stem of each inflected word, it helps search results a great deal. There are many stemming algorithms; the three mainstream ones are the Porter, Lovins, and Lancaster (Paice/Husk) stemming algorithms, plus various refined or alternative ones. The PorterStemmer invoked inside PorterStemFilter is an implementation of the Porter stemming algorithm.
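A minimal sketch of how PorterStemFilter is wired into an analysis chain (my own illustration; the PorterStemmer class itself is not public API in old Lucene versions, so you go through the filter):

    import java.io.Reader;
    import org.apache.lucene.analysis.*;

    // Lowercases, tokenizes at non-letters, then stems each token with the
    // Porter algorithm, so "jumped" and "jumping" both index as "jump".
    public class PorterAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new PorterStemFilter(new LowerCaseTokenizer(reader));
      }
    }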

How Lucene's Inverted Index Works
persister · 2009-06-10
Inverted index

Lucene is a high-performance Java full-text retrieval toolkit that uses an inverted file index structure. The structure, and the algorithm that produces it, are as follows:
0) Suppose there are two documents, 1 and 2.
Document 1: Tom lives in Guangzhou, I live in Guangzhou too.
Document 2: He once lived in Shanghai.

1) Since Lucene indexes and queries by keyword, we first have to obtain the keywords of the two documents, which usually involves the following steps:
a. What we have is document content, i.e. a string; we first find all the words in it, i.e. tokenize it. English is relatively easy because words are separated by spaces; Chinese words run together and need special word segmentation.
b. Words such as "in", "once", and "too" carry no real meaning, and Chinese characters like "的" and "是" usually have no concrete meaning either; words that represent no concept can be filtered out.
c. Users usually expect a search for "He" to also find documents containing "he" or "HE", so all words are normalized to one case.
d. Users usually expect a search for "live" to also find documents containing "lives" or "lived", so "lives" and "lived" are reduced to "live".
e. Punctuation usually expresses no concept and can also be filtered out.
In Lucene, all of the above is done by the Analyzer class.

After this processing:
All keywords of document 1: [tom] [live] [guangzhou] [i] [live] [guangzhou]
All keywords of document 2: [he] [live] [shanghai]

2) With the keywords in hand we can build the inverted index. The mapping above is "document ID" -> "all keywords in the document". An inverted index turns this relation around into "keyword" -> "IDs of all documents containing that keyword". Documents 1 and 2, inverted, become:

Keyword     Document IDs
guangzhou   1
he          2
i           1
live        1,2
shanghai    2
tom         1

Usually it is not enough to know which documents a keyword appears in; we also need its number of occurrences and its positions within each document. There are two common kinds of position: a) character position, i.e. at which character of the document the word occurs (advantage: fast positioning when highlighting keywords); b) keyword position, i.e. which keyword of the document it is (advantages: a smaller index and fast phrase queries). Lucene records the latter.

With "frequency" and "position" added, the index structure becomes:

Keyword     Doc ID [frequency]   Positions
guangzhou   1[2]                 3,6
he          2[1]                 1
i           1[1]                 4
live        1[2],2[1]            2,5,2
shanghai    2[1]                 3
tom         1[1]                 1

Take the "live" row as an example: live appears twice in document 1 and once in document 2, and its position list is "2,5,2". What does that mean? It has to be read together with the document IDs and frequencies: live occurs twice in document 1, so "2,5" are its two positions in document 1; it occurs once in document 2, and the remaining "2" means live is the 2nd keyword of document 2.

This is the very core of Lucene's index structure. Note that the keywords are sorted in character order (Lucene does not use a B-tree), so Lucene can locate a keyword quickly with binary search.
In the implementation, Lucene stores the three columns above as separate files: the term dictionary file (Term Dictionary), the frequencies file, and the positions file. The dictionary file not only holds every keyword, it also keeps pointers into the frequencies file and the positions file, through which a keyword's frequency and position information can be reached.

Lucene uses the notion of a field to express where a piece of information is located (e.g. in the title, in the body, in the URL). When the index is built, the field information is also recorded in the dictionary file; every keyword carries field information, since every keyword necessarily belongs to one or more fields.

To reduce the size of the index files, Lucene also compresses the index. First, the keywords in the dictionary file are prefix-compressed into <prefix length, suffix>: for example, if the current term is "阿拉伯语" and the previous one is "阿拉伯", then "阿拉伯语" is stored as <3, 语>. Second, numbers (of which there are many) are delta-compressed: only the difference from the previous value is stored, which shortens the number and therefore the bytes needed to hold it. For example, if the current document ID is 16389 (3 bytes uncompressed) and the previous document ID is 16382, only 7 is stored (a single byte).

A query against this index shows why indexing is worth building. Suppose we search for the word "live": Lucene binary-searches the dictionary, finds the term, reads out all the document IDs via the pointer into the frequencies file, and returns the result. The dictionary is usually very small, so the whole process takes milliseconds. By contrast, a plain sequential matching algorithm with no index, string-matching through the content of every document, is painfully slow; with a large number of documents the time is often unbearable.
My own note: see also http://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95
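To make the structure above concrete, here is a toy sketch of my own (not Lucene code) that builds the keyword -> (document, frequency, positions) mapping for the two example documents:

    import java.util.*;

    public class ToyInvertedIndex {
      // term -> docId -> positions; the frequency is the size of the position list
      private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

      public void addDocument(int docId, String[] tokens) {
        for (int pos = 0; pos < tokens.length; pos++) {
          index.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
               .computeIfAbsent(docId, d -> new ArrayList<>())
               .add(pos + 1);  // 1-based keyword positions, as in the tables above
        }
      }

      public Map<Integer, List<Integer>> postings(String term) {
        return index.getOrDefault(term, Collections.emptyMap());
      }

      public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        // Tokens after analysis: lowercased, stop words removed, "lives"/"lived" -> "live"
        idx.addDocument(1, new String[]{"tom", "live", "guangzhou", "i", "live", "guangzhou"});
        idx.addDocument(2, new String[]{"he", "live", "shanghai"});
        System.out.println(idx.postings("live"));  // {1=[2, 5], 2=[2]}
      }
    }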


Binary search

Binary search finds a particular element in a sorted array. First the middle element of the array is compared: if it equals the target, a pointer to it is returned, signalling success. If not, the function keeps searching in whichever half can contain the target, and so on. If the remaining array length is 0, the element is not there and the function stops.
The function:
int *binarySearch(int val, int array[], int n)
{
    int m = n / 2;
    if (n <= 0) return NULL;                /* empty range: not found */
    if (val == array[m]) return array + m;  /* found the middle element */
    if (val < array[m])                     /* recurse into the lower half */
        return binarySearch(val, array, m);
    else                                    /* recurse into the upper half */
        return binarySearch(val, array + m + 1, n - m - 1);
}


For an array of n elements, binary search performs at most 1 + log2(n) comparisons. With a million elements that is roughly 20 comparisons, i.e. at most 20 recursive invocations of binarySearch().


Lucene Study Notes: Indexing
persister · 2009-06-09

1. Adding documents to an index:
    protected String[] keywords = {"1", "2"};
    protected String[] unindexed = {"Netherlands", "Italy"};
    protected String[] unstored = {"Amsterdam has lots of bridges", "Venice has lots of canals"};
    protected String[] text = {"Amsterdam", "Venice"};

    Directory dir = FSDirectory.getDirectory(indexDir, true);
    IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
    writer.setUseCompoundFile(true);
    for (int i = 0; i < keywords.length; i++) {
        Document doc = new Document();
        doc.add(Field.Keyword("id", keywords[i]));
        doc.add(Field.UnIndexed("country", unindexed[i]));
        doc.add(Field.UnStored("contents", unstored[i]));
        doc.add(Field.Text("city", text[i]));
        writer.addDocument(doc);
    }
    writer.optimize();
    writer.close();
2. Removing documents from an index:
    IndexReader reader = IndexReader.open(dir);
    reader.delete(1);
The call above deletes only one document at a time. The following deletes every document that matches a term:
    IndexReader reader = IndexReader.open(dir);
    reader.delete(new Term("city", "Amsterdam"));
    reader.close();

3.Index dates
 Document doc = new Document();
 doc.add(Field.Keyword("indexDate", new Date()));

4. Tuning indexing performance

IndexWriter field   System property                  Default value      Description
mergeFactor         org.apache.lucene.mergeFactor    10                 Controls segment merge frequency and size
maxMergeDocs        org.apache.lucene.maxMergeDocs   Integer.MAX_VALUE  Limits the number of documents per segment
minMergeDocs        org.apache.lucene.minMergeDocs   10                 Controls the amount of RAM used when indexing

mergeFactor controls how many documents are buffered in memory before being written to disk, and also how often index segments are merged. With the default of 10, once 10 documents have accumulated they must be written to disk, and whenever the segment count reaches a power of 10 the segments are merged into one (maxMergeDocs caps how many documents a single segment may hold). A larger mergeFactor makes better use of RAM and raises indexing throughput, but it also means merging happens less often, which can leave a very large number of segments (since nothing gets merged); searching then has to open more segment files, which lowers search performance. minMergeDocs is another IndexWriter instance variable that affects indexing performance: its value controls how many documents have to be buffered before they are merged into a segment, so minMergeDocs, like mergeFactor, also governs how many documents are buffered in RAM.
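As a sketch of how these knobs are set (assuming the Lucene 1.4-era API, where they are public fields on IndexWriter):

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class TuningDemo {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/tmp/tuned-index", true);
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
        writer.mergeFactor = 25;       // buffer more docs and merge less often
        writer.minMergeDocs = 100;     // hold more documents in RAM before flushing
        writer.maxMergeDocs = 100000;  // cap the documents per merged segment
        // ... add documents here ...
        writer.optimize();             // merge segments down for faster searching
        writer.close();
      }
    }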

5. RAMDirectory helps exploit RAM; you can also use clustering or multithreading to make full use of hardware and software resources and raise indexing throughput.
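A common pattern, sketched here under the same Lucene 1.4-era API (addIndexes merges whole indexes), is to build a batch in a RAMDirectory with no disk I/O and then flush it into the on-disk index in one pass:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamBufferDemo {
      public static void main(String[] args) throws Exception {
        Directory fsDir = FSDirectory.getDirectory("/tmp/ram-buffered-index", true);
        IndexWriter fsWriter = new IndexWriter(fsDir, new SimpleAnalyzer(), true);

        // Index a batch entirely in memory.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new SimpleAnalyzer(), true);
        // ... ramWriter.addDocument(doc) for each document in the batch ...
        ramWriter.close();

        // Merge the in-memory index into the on-disk one in a single pass.
        fsWriter.addIndexes(new Directory[] { ramDir });
        fsWriter.close();
      }
    }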

6. Sometimes you may want to bound the size of each field, e.g. index only the first 1,000 terms; maxFieldLength controls this.

7. IndexWriter's optimize() method merges segments, reducing the number of segments and therefore the time spent reading the index while searching.

8. Mind how things work in a multithreaded environment: an index-modifying IndexReader operation can't be executed while an index-modifying IndexWriter operation is in progress. To prevent misuse, Lucene locks the index while certain APIs are in use.

Lucene's Query Classes
persister · 2009-06-08

Lucene's basic query code:
 Searcher searcher = new IndexSearcher(dbpath);
 Query query = QueryParser.parse(searchkey, searchfield,
     new StandardAnalyzer());
 Hits hits = searcher.search(query);
Below are the various Query subclasses; each of them has a corresponding syntax in QueryParser.

1. TermQuery is the most commonly used; it queries the index for a single Term (the smallest unit of the index, holding one field name and a value). The Term corresponds directly to the key and field passed to QueryParser.parse.

 IndexSearcher searcher = new IndexSearcher(directory);
 Term t = new Term("isbn", "1930110995");
 Query query = new TermQuery(t);
 Hits hits = searcher.search(query);

2. RangeQuery is for interval queries; its third parameter says whether the interval is open or closed. QueryParser builds queries over the terms between begin and end:

 Term begin, end;
 Searcher searcher = new IndexSearcher(dbpath);
 begin = new Term("pubmonth","199801");
 end = new Term("pubmonth","199810");
 RangeQuery query = new RangeQuery(begin, end, true);

RangeQuery is in essence an ordered comparison, so the following query is also legal, though its meaning differs quite a bit from the one above. In short, an interval is defined and everything that falls inside it is found; the comparison follows an ordering rule, e.g. strings are compared starting from the first character, independently of string length:
    begin = new Term("pubmonth","19");
    end = new Term("pubmonth","20");
    RangeQuery query = new RangeQuery(begin, end, true);


3. PrefixQuery. With TermQuery, a field (created with Field.Keyword) must match the value exactly to be found, which limits the flexibility of queries. PrefixQuery only has to match a leading portion of the value: if the field is name and the records contain jackliu, jackwu, and jackli, then "jack" finds all of them. QueryParser creates a PrefixQuery for a term when it ends with an asterisk (*) in query expressions.

 IndexSearcher searcher = new IndexSearcher(directory);
 Term term = new Term("category", "/technology/computers/programming");
 PrefixQuery query = new PrefixQuery(term);
 Hits hits = searcher.search(query);

4. BooleanQuery. All the queries above run against a single field; how do you query several fields at once? BooleanQuery solves the problem of combining queries. Sub-queries are added via add(Query query, boolean required, boolean prohibited), and by nesting BooleanQuerys you can compose very complex queries.
 IndexSearcher searcher = new IndexSearcher(directory);
 TermQuery searchingBooks =
 new TermQuery(new Term("subject","search"));

 RangeQuery currentBooks =
 new RangeQuery(new Term("pubmonth","200401"),
  new Term("pubmonth","200412"),true);
  
 BooleanQuery currentSearchingBooks = new BooleanQuery();
 currentSearchingBooks.add(searchingBooks, true, false);
 currentSearchingBooks.add(currentBooks, true, false);
 Hits hits = searcher.search(currentSearchingBooks);

BooleanQuery's add method takes two boolean parameters:
true & false: the clause being added must be satisfied;
false & true: the clause being added must not be satisfied;
false & false: the clause being added is optional;
true & true: an invalid combination.

QueryParser handily constructs BooleanQuerys when multiple terms are specified.
Grouping is done with parentheses, and the prohibited and required flags are
set when the –, +, AND, OR, and NOT operators are specified.

5. PhraseQuery performs a more precise kind of search: it constrains the positions of two or more keywords within the indexed text, e.g. finding documents that contain A and B with exactly one word between them. Terms surrounded by double quotes in QueryParser-parsed expressions are translated into a PhraseQuery. The slop factor defaults to zero, but you can adjust it by adding a tilde (~) followed by an integer: for example, the expression "quick fox"~3.

6. WildcardQuery. WildcardQuery offers finer control and more flexibility than PrefixQuery, and it is the easiest to understand and use.
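A quick sketch of my own (? matches exactly one character, * matches zero or more):

    IndexSearcher searcher = new IndexSearcher(directory);
    // "?ild*" matches "wild", "mild", "wilder", "mildly", ...
    Query query = new WildcardQuery(new Term("contents", "?ild*"));
    Hits hits = searcher.search(query);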

7. FuzzyQuery. This query is special: it also finds records that merely look similar to the keyword. QueryParser supports FuzzyQuery by suffixing a term with a tilde (~), for example wuzza~.

 public void testFuzzy() throws Exception {
  indexSingleFieldDocs(new Field[] {
  Field.Text("contents", "fuzzy"),
  Field.Text("contents", "wuzzy")
  });
  IndexSearcher searcher = new IndexSearcher(directory);
  Query query = new FuzzyQuery(new Term("contents", "wuzza"));
  Hits hits = searcher.search(query);
  assertEquals("both close enough", 2, hits.length());
  assertTrue("wuzzy closer than fuzzy",
  hits.score(0) != hits.score(1));
  assertEquals("wuzza bear","wuzzy", hits.doc(0).get("contents"));
 }



Lucene Study Notes
persister · 2009-03-06
在全文检索中åQŒå¯ä»¥å’Œæ•°æ®åº“进行一个简单的å¯Òޝ”
全文‹‚€ç´¢æ²¡æœ‰è¡¨çš„æ¦‚念,也就没有固定的fieldsåQŒä½†æ˜¯æœ‰è®°å½•åQŒæ¯ä¸€ä¸ªè®°å½•就是一个Document对象
每一个document都可以有自己不同的fieldsåQŒå¦‚下:

    Document doc = new Document();

    doc.add(Field.Keyword("filename", file.getAbsolutePath()));

    // Use only one of the following two lines: the first indexes without
    // storing, the second indexes and stores.
    //doc.add(Field.Text("content", new FileReader(file)));
    doc.add(Field.Text("content", this.chgFileToString(file)));

    indexWriter.addDocument(doc);

Querying needs three important parameters.
First, the index path, i.e. which index to search in (analogous to a database's path):

Searcher searcher = new IndexSearcher(dbpath);

Then, which field to search and for what keyword: the field yields the content stored under it, and your keyword is then looked up inside that content. This resembles an SQL statement, only without the notion of a table:
Query query
    = QueryParser.parse(searchkey, searchfield, new StandardAnalyzer());

Then run the search; the result is a collection of documents:
   Hits hits = searcher.search(query);

Process the resulting collection:

   if (hits != null)
   {
       list = new ArrayList();
       int temp_hitslength = hits.length();
       Document doc = null;
       for (int i = 0; i < temp_hitslength; i++) {
           doc = hits.doc(i);
           //list.add(doc.get("filename"));
           list.add(doc.get("content"));
       }
   }

Appendix: the commonly used Field methods.

Method                                       Tokenized  Indexed  Stored  Use
Field.Text(String name, String value)        Yes        Yes      Yes     Tokenized, indexed, and stored; e.g. title and body fields
Field.Text(String name, Reader value)        Yes        Yes      No      Tokenized and indexed, not stored; e.g. META content that is
                                                                         never returned for display but must be searchable
Field.Keyword(String name, String value)     No         Yes      Yes     Indexed as one unsplit token and stored; e.g. date fields
Field.UnIndexed(String name, String value)   No         No       Yes     Not indexed, only stored; e.g. file paths
Field.UnStored(String name, String value)    Yes        Yes      No      Full-text indexed only, not stored
"Tokenized" means the text is split into terms for indexing; as the table shows, everything that is tokenized is also indexed. "Indexed" means the field can be queried through search terms. "Stored" means the content itself is kept. Field.Keyword does not tokenize but does index, so such a field can be queried, whereas a Field.UnIndexed field cannot be queried at all. And because Field.Keyword is not tokenized, a query built with new Term(searchfield, searchkey) only finds the record when the given searchkey is exactly equal to the value parameter; Field.Text and Field.UnStored behave differently.

Lucene中国 (Lucene China) is a very good website with a detailed analysis of Lucene's internals; worth consulting.


