I use Nutch to crawl our intranet. But as you know, cache.jsp has many problems (X). Because I filter out gif|jpg and similar files during the crawl, I use ORO to replace the image references in the cached HTML content with my own custom picture.

Code:

    import org.apache.oro.text.regex.*;

    // Match src="<dots>/<dir>/<rest>" and background="<dots>/<dir>/<rest>"
    String sRegexpSrc = "src\\s*=\\s*\"([\\.]*)/([a-z]*)/([^\"]+)\"";
    String sRegxpBackground = "background\\s*=\\s*\"([.]*)/([a-z]*)/([^\"]+)\"";
    String sAdd = "";
    String sNewContent = "";
    PatternCompiler compiler = new Perl5Compiler();
    Pattern pattern = null, pattern1 = null;
    try {
        pattern = compiler.compile(sRegexpSrc, Perl5Compiler.CASE_INSENSITIVE_MASK);
        pattern1 = compiler.compile(sRegxpBackground, Perl5Compiler.CASE_INSENSITIVE_MASK);
    } catch (MalformedPatternException e) {
        e.printStackTrace();
    }
    PatternMatcher matcher = new Perl5Matcher();

    // content is the cached page HTML (a String)
    if (pattern != null && matcher.contains(content, pattern)) {
        MatchResult result = matcher.getMatch();
        // Rebuild the matched path from its three captured parts
        sAdd = result.group(1) + "/" + result.group(2) + "/" + result.group(3);
        // System.out.println("sAdd= " + sAdd);
        // String.replace substitutes literally; replaceAll would treat sAdd as a
        // regex and swallow the backslashes in the replacement string
        sNewContent = content.replace(sAdd, "/img/liuxuan");
    } else {
        // System.out.print("Can't find the String");
    }
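The listing above only rewrites the path found by the first match. As a rough sketch of the same idea applied to every src and background attribute in one pass, here is a version that swaps ORO for the standard java.util.regex package. The class name CachedImageRewriter, the method rewrite, the list of image extensions, and the placeholder path /img/liuxuan.png are all my assumptions for illustration, not part of Nutch or ORO.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Nutch or ORO.
class CachedImageRewriter {
    // Matches src="..." or background="..." whose quoted value ends in an
    // image extension. Group 1 is the attribute name, group 2 the "=" with
    // any surrounding whitespace, so both can be written back unchanged.
    private static final Pattern IMAGE_ATTR = Pattern.compile(
            "\\b(src|background)(\\s*=\\s*)\"[^\"]+\\.(?:gif|jpe?g|png)\"",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matching attribute value with the placeholder path.
    static String rewrite(String content, String placeholder) {
        Matcher m = IMAGE_ATTR.matcher(content);
        // quoteReplacement keeps any '$' or '\' in the placeholder literal.
        String replacement = "$1$2\"" + Matcher.quoteReplacement(placeholder) + "\"";
        return m.replaceAll(replacement);
    }

    public static void main(String[] args) {
        String html = "<body background=\"../images/bg.gif\">"
                + "<img SRC = \"./pics/logo.jpg\"></body>";
        System.out.println(rewrite(html, "/img/liuxuan.png"));
    }
}
```

Attributes that do not point at an image (a script src, for example) are left alone, since the pattern requires one of the listed extensions right before the closing quote.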