
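2.4 Using Regex Capturing Groups

In the previous section we saw how to use a regular expression to search a document and retrieve all of the URLs in it. The find, start, and end methods of the Matcher class can be used to retrieve each matching URL string. Sometimes it is necessary to process a matched substring further, or to look for additional subpatterns within it. For example, you might want to skip URLs that belong to certain domains. A brute-force way to do this is to use a second Pattern and Matcher object, as in the following code: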

// assume urlMatcher instance as in the previous example
while (urlMatcher.find()) {
    int startIndex = urlMatcher.start();
    int endIndex = urlMatcher.end();
    String currentMatch = data.substring(startIndex, endIndex);
    // the brute force approach, using a new pattern!
    Pattern restricted = Pattern.compile(".*(abc|cbs|nbc)\\.com.*");
    Matcher restrictMatcher = restricted.matcher(currentMatch);
    if (!restrictMatcher.matches()) {
        System.out.println(currentMatch);
    }
}

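Matching the domain name inside an already captured URL in this way is not very efficient. Because the find method has already done the hard work of extracting the URL, we should not have to write another regex just to get at part of the result, and in fact we do not need to. Regular expressions allow a pattern to be broken into subsequences. By enclosing in parentheses the parts of the pattern that we will need later, we can read the values of those parts individually and ignore the rest. Let us rewrite the URL pattern so that the domain name is separated from the rest of the URL: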

String urlPattern =
    "(http|https|ftp)://([a-zA-Z0-9-\\.]+)[/\\w\\.\\-\\+\\?%=&;:,#]*";

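When a pattern contains parenthesized groups, the value matched by each group can be retrieved separately. Groups are numbered starting from 1, in the order of their opening parentheses from left to right. In the pattern above, the first group is the protocol (such as http) and the second group is the domain name. To access a group within the matched string, use the group method of Matcher. The following code sample retrieves the domain name from each URL and prints it: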

String data = getStringData(); // load the document
String urlString =
    "(http|https|ftp)://([a-zA-Z0-9-\\.]+)[/\\w\\.\\-\\+\\?%=&;:,#]*";
Pattern urlPattern = Pattern.compile(urlString);
Matcher urlMatcher = urlPattern.matcher(data);
// print out the domain from each URL
while (urlMatcher.find()) {
    String domain = urlMatcher.group(2); // 2nd group is the domain
    System.out.println(domain);
}
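To tie this back to the original goal of skipping certain domains, here is a minimal, self-contained sketch that filters URLs by inspecting group(2) instead of compiling a second pattern. The class name UrlGroupDemo, the method allowedUrls, the RESTRICTED list, and the sample text are all assumptions made for illustration. Note that it checks only the domain, so it is somewhat stricter than the brute-force pattern, which matches the restricted names anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlGroupDemo {
    // Same URL pattern as in the text: group 1 is the protocol, group 2 the domain.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(http|https|ftp)://([a-zA-Z0-9-\\.]+)[/\\w\\.\\-\\+\\?%=&;:,#]*");

    // Hypothetical block list, mirroring the abc/cbs/nbc example.
    private static final List<String> RESTRICTED =
        Arrays.asList("abc.com", "cbs.com", "nbc.com");

    /** Returns every URL found in data whose domain is not restricted. */
    public static List<String> allowedUrls(String data) {
        List<String> result = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(data);
        while (m.find()) {
            String domain = m.group(2).toLowerCase();
            if (!isRestricted(domain)) {
                result.add(m.group()); // group() with no argument is the whole match
            }
        }
        return result;
    }

    // A domain is restricted if it equals a listed name or is a subdomain of one.
    private static boolean isRestricted(String domain) {
        for (String r : RESTRICTED) {
            if (domain.equals(r) || domain.endsWith("." + r)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String data =
            "See http://www.abc.com/news and https://example.org/a?b=1 for details.";
        for (String url : allowedUrls(data)) {
            System.out.println(url); // prints https://example.org/a?b=1
        }
    }
}
```

Because the domain is already captured during the single pass made by find, no second Pattern or Matcher is needed.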

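Each matched group is saved so that it can be referred to later. Referring to a previously matched group from within the same pattern is called a backreference. To refer back to the third group, for example, include \3 in the pattern. A backreference matches only an exact repetition of the text that the earlier group matched. To illustrate, consider a common error in text files: a frequently used word such as "the" or "of" is accidentally repeated within a sentence.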

" The the water molecules are made of of hydrogen and oxygen."

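Let us write a pattern that finds these problems in a document. The pattern captures the first word, followed by some whitespace, followed by a repetition that matches the first word again: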

String wordPattern = "\\s(of|or|the|to)\\s+\\1[\\s\\.,;]";

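The pattern matches the following, in order: a whitespace character, one word from the special word list, more whitespace, the same word repeated (using the \1 backreference), and finally a whitespace or punctuation character. The match should be case-insensitive so that it catches "The the" and similar variants. As the following code fragment shows, the pattern is compiled as case-insensitive and finds the repeated words in a string: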

String data = getStringData();
String patternStr = "\\s(of|or|the|to)\\s+\\1[\\s\\.,;]";
Pattern wordPattern =
    Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher wordMatcher = wordPattern.matcher(data);
while (wordMatcher.find()) {
    // start(1) is where the captured word begins; start() would point
    // at the whitespace character that precedes it
    int start = wordMatcher.start(1);
    String word = wordMatcher.group(1);
    // print the index location of the repeated word
    System.out.println("Repeated " + word + " starting at " + start);
}
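As a rough, runnable sketch of the fragment above, the following class applies the same pattern to the sample sentence. The class name RepeatedWordDemo and the method findRepeats are assumptions for illustration only. It records start(1), the index where the captured word begins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedWordDemo {
    // Same pattern as in the text; \1 refers back to the captured word.
    private static final Pattern WORD_PATTERN = Pattern.compile(
        "\\s(of|or|the|to)\\s+\\1[\\s\\.,;]", Pattern.CASE_INSENSITIVE);

    /** Returns "word@index" for each repeated word found in data. */
    public static List<String> findRepeats(String data) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD_PATTERN.matcher(data);
        while (m.find()) {
            // group(1) is the first occurrence; start(1) is its index in data
            found.add(m.group(1) + "@" + m.start(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // The leading space matters: the pattern requires whitespace before the word.
        String data = " The the water molecules are made of of hydrogen and oxygen.";
        System.out.println(findRepeats(data)); // prints [The@1, of@34]
    }
}
```

With CASE_INSENSITIVE set, the backreference is also compared without regard to case, which is why "The the" is reported. A repeated word at the very start of the input, with no whitespace before it, would not be found by this pattern.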

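There is a simple and powerful way to match text in a file that lets you process the file with multiple regular expressions; it is covered in the section "Parsing with the Scanner Class" later in this chapter. For a solution to more complex text searching that uses a built-in index, see the section "Searching with Lucene" in Chapter 3.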

Posted on 2008-12-12 11:38 by ♂游泳的魚. Category: Wicked Cool Java (Chinese edition): Code Bits, Open-Source Libraries, and Project Ideas
