I use Nutch to crawl our intranet, but as you know, cache.jsp has many problems: images in the cached page show up broken (X), because I filter out gif|jpg and similar files during the crawl. So I use ORO to replace the image paths in the HTML content with my own custom picture.

Code:

    import org.apache.oro.text.regex.MalformedPatternException;
    import org.apache.oro.text.regex.MatchResult;
    import org.apache.oro.text.regex.Pattern;
    import org.apache.oro.text.regex.PatternCompiler;
    import org.apache.oro.text.regex.PatternMatcher;
    import org.apache.oro.text.regex.Perl5Compiler;
    import org.apache.oro.text.regex.Perl5Matcher;

    // Match src="../dir/file" and background="../dir/file" attributes.
    String sRegexpSrc = "src\\s*=\\s*\"([\\.]*)/([a-z]*)/([^\"]+)\"";
    String sRegxpBackground = "background\\s*=\\s*\"([.]*)/([a-z]*)/([^\"]+)\"";
    String sAdd = "";
    String sNewContent = "";
    PatternCompiler compiler = new Perl5Compiler();
    Pattern pattern = null, pattern1 = null;
    try {
        pattern = compiler.compile(sRegexpSrc, Perl5Compiler.CASE_INSENSITIVE_MASK);
        // pattern1 is compiled here but not applied in the rest of this snippet.
        pattern1 = compiler.compile(sRegxpBackground, Perl5Compiler.CASE_INSENSITIVE_MASK);
    } catch (MalformedPatternException e) {
        e.printStackTrace();
    }
    PatternMatcher matcher = new Perl5Matcher();

    // content holds the HTML of the cached page.
    if (pattern != null && matcher.contains(content, pattern)) {
        MatchResult result = matcher.getMatch();
        // System.out.println(result.toString());
        // Rebuild the matched path from its three groups: dots, directory, file.
        sAdd = result.group(1) + "/" + result.group(2) + "/" + result.group(3);
        // System.out.println("sAdd= " + sAdd);
        // replace() substitutes literally; replaceAll() would read sAdd as a regex
        // and drop the backslashes in "\\img\\liuxuan". The path below is the
        // forward-slash form from the debug line that follows.
        sNewContent = content.replace(sAdd, "/img/liuxuan.png");
        // System.out.println("FinalString=" + sTest.replaceAll(sAdd, "/img/liuxuan.png"));
        // System.out.print("sTest= " + result.group(1) + "/" + result.group(2));
    } else {
        // System.out.print("Can't find the String ");
    }
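
The snippet in the post only looks at the first src match. Below is a minimal sketch of the same idea that rewrites every matching src and background attribute in one pass. It swaps ORO for the JDK's java.util.regex, which is my substitution and not what the post uses. The class name CachedImageRewriter and the method rewrite are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or ORO.
class CachedImageRewriter {
    // Same shape as the post's patterns: attribute name, optional dots,
    // "/dir/", then everything up to the closing quote.
    private static final Pattern IMAGE_ATTR = Pattern.compile(
            "(src|background)(\\s*=\\s*\")[.]*/[a-z]*/[^\"]+\"",
            Pattern.CASE_INSENSITIVE);

    // Replaces the path of every matching attribute with the placeholder.
    public static String rewrite(String content, String placeholder) {
        Matcher m = IMAGE_ATTR.matcher(content);
        // $1 and $2 keep the attribute name and the opening =" as written;
        // quoteReplacement keeps any \ or $ in the placeholder literal.
        return m.replaceAll("$1$2" + Matcher.quoteReplacement(placeholder) + "\"");
    }

    public static void main(String[] args) {
        String html = "<body background=\"../images/bg.gif\">"
                + "<img src=\"/pics/logo.jpg\"></body>";
        System.out.println(rewrite(html, "/img/liuxuan.png"));
    }
}
```

Like the original pattern, this only matches paths that start with dots or a slash, so absolute URLs such as http://... are left alone. It also matches any src attribute, including script tags, so a real cache.jsp would probably need a narrower pattern.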