Change Dir

先知cd——熱愛生活是一切藝術(shù)的開始

導(dǎo)航

<

2010年9月

>

日

一

二

三

四

五

六

29

30

31

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

1

2

3

4

5

6

7

8

9

公告

寫下來的都是資源，分享給互聯(lián)網(wǎng)~~均屬原創(chuàng)隨筆。
轉(zhuǎn)載引用請注明作者changedi。
喜歡應(yīng)用研究，熱愛編程，歡迎交流。

隨筆分類(125)

隨筆檔案(123)

統(tǒng)計

隨筆 - 222
文章 - 0
評論 - 182
引用 - 0

留言簿(18)

積分與排名

積分 - 421936
排名 - 132

閱讀排行榜

評論排行榜

關(guān)于String里indexOf()的一些思考

這些天偶然翻起了算法書，看到KMP算法的時候，突然發(fā)現(xiàn)多年來似乎一直沒有完全搞明白，于是惡補了一下。寫了個程序，“基本”明白了fail函數(shù)和KMP的那些事~~~~
具體程序見下

1

package test;
2

/**
3

* @author Jia Yu
4

* @date 2010-9-28
5

*/
6

public class StringMatch {
7

8

private int []f;
9

/**
10

* KMP fail function to calculate f[]
11

* f[j] = k < j where k is the maximum of pat[0..k] == pat[j-k..j]
12

* = -1 otherwise
13

* @param pat
14

*/
15

public void fail(String pat){
16

int lenP = pat.length();
17

f = new int[lenP];
18

f[0] = -1;
19

for(int j=1;j<lenP;j++){
20

int i = f[j-1];
21

while((pat.charAt(j)==pat.charAt(i+1))&&(i>=0)) i = f[i];
22

if(pat.charAt(j)!=pat.charAt(i+1)) f[j] = i + 1;
23

else f[j] = -1;
24

}
25

}
26

27

/**
28

* Implementation of KMP algorithm.
29

* @param s string which is source string
30

* @param pat string pattern
31

* @return
32

*/
33

public int kmp_find(String s,String pat){
34

int lenS,lenP;
35

lenS = s.length();
36

lenP = pat.length();
37

int i,j;
38

i=j=0;
39

while(i<lenS&&j<lenP){
40

if(s.charAt(i)==pat.charAt(j)){
41

i++;
42

j++;
43

}
44

else{
45

if(j==0) i++;
46

else j=f[j-1] +1;
47

}
48

}
49

if(j<lenP||lenP==0) return -1;
50

else return i-lenP;
51

}
52

53

/**
54

* @param args
55

*/
56

public static void main(String[] args) {
57

// TODO Auto-generated method stub
58

StringMatch ss = new StringMatch();
59

String s = "abcdedabc";
60

String pat = "dab";
61

ss.fail(pat);
62

System.out.println(ss.kmp_find(s, pat));
63

}
64

65

}
66

完事后忍不住想和String的indexOf比比性能。一直以為jdk 的src里String的indexOf是用Naïve string search algorithm實現(xiàn)的。沒有任何技巧，當(dāng)然也有好多人去拿這個說事，但是逛過一些論壇，記得人們只是自己去實現(xiàn)KMP，然后說KMP有多好，但是很少有拿出數(shù)據(jù)來比對的。所以今天，在投簡歷閑暇，還是抽出些時間，看了看String match相關(guān)的一些東西。寫了一些代碼，發(fā)現(xiàn)了一些東東~~~~
不多扯閑話了。首先進行了一個KMP和String indexOf的比較，看看結(jié)果吧。
preprocess using 7549ns
=====================================================
Cycles :      30000
String find pat in pos -1
Used Time is 87134ns
KMP find pat in pos -1
Used Time is 1071829ns
=====================================================
Cycles :      90000
String find pat in pos -1
Used Time is 150480ns
KMP find pat in pos -1
Used Time is 2277475ns
=====================================================
Cycles :     270000
String find pat in pos -1
Used Time is 380815ns
KMP find pat in pos -1
Used Time is 848257ns
=====================================================
Cycles :     810000
String find pat in pos 119457
Used Time is 509997ns
KMP find pat in pos 119457
Used Time is 1141992ns
=====================================================
Cycles :    2430000
String find pat in pos 459895
Used Time is 1845130ns
KMP find pat in pos 459895
Used Time is 4180643ns
測試代碼如下：

1

/**
2

*
3

*/
4

package test;
5

6

import java.util.Random;
7

8

/**
9

* @author Jia Yu
10

* @date 2010-9-28
11

*/
12

public class StringMatchTest2 {
13

14

private static String src;
15

private static String pat;
16

private static long cycles = 10000L;
17

private static Random rand = new Random(47);
18

19

public static void generateSource() {
20

StringBuilder sb = new StringBuilder();
21

int iter = (int) cycles;
22

for (int i = 0; i < iter; i++) {
23

sb.append((char) ('a' + (rand.nextInt(6))));
24

}
25

src = sb.toString();
26

}
27

28

public static void generatePattern() {
29

StringBuilder sb = new StringBuilder();
30

for (int i = 0; i < 7; i++) {
31

sb.append((char) ('a' + (rand.nextInt(6))));
32

}
33

pat = sb.toString();
34

}
35

36

/**
37

* @param args
38

*/
39

public static void main(String[] args) {
40

// TODO Auto-generated method stub
41

generatePattern();
42

StringMatch sm = new StringMatch();
43

long start, pre, dur;
44

start = System.nanoTime();
45

sm.fail(pat);
46

pre = System.nanoTime() - start;
47

System.out.println("preprocess using " + pre + "ns");
48

for (int i = 0; i < 5; i++) {
49

generateSource();
50

cycles *= 3;
51

System.out
52

.println("=====================================================");
53

System.out.printf("%s : %10d\n", "Cycles\t", cycles);
54

start = System.nanoTime();
55

System.out.println("String find pat in pos " + src.indexOf(pat));
56

dur = System.nanoTime() - start;
57

System.out.println("Used Time is " + dur + "ns");
58

start = System.nanoTime();
59

System.out.println("KMP find pat in pos " + sm.kmp_find(src, pat));
60

dur = System.nanoTime() - start;
61

System.out.println("Used Time is " + dur + "ns");
62

}
63

}
64

65

}
66

總是說KMP算法的時間復(fù)雜度是O(n)的，但是看看效果，又是什么樣子呢？看到這里我覺得自己哪里好像做的不對，所以厚顏的把代碼拿出來，請大家?guī)兔纯吹降啄睦锍隽藛栴}。我的思路就是用同一個pat串去查詢規(guī)模逐漸變大的src串，就是為了能節(jié)省kmp的預(yù)處理時間，畢竟KMP要計算一個fail函數(shù)，拿出額外的O(m)空間做這個事情的，其中m是pat的長度。但是從結(jié)果上看，不管是查不到（返回-1）還是在中間位置附近查到。我實現(xiàn)的kmp完全沒有任何優(yōu)勢可言。
對了，也得把java的String的indexOf方法貼上吧~~~

1

static int indexOf(char[] source, int sourceOffset, int sourceCount,
2

char[] target, int targetOffset, int targetCount,
3

int fromIndex) {
4

if (fromIndex >= sourceCount) {
5

return (targetCount == 0 ? sourceCount : -1);
6

}
7

if (fromIndex < 0) {
8

fromIndex = 0;
9

}
10

if (targetCount == 0) {
11

return fromIndex;
12

}
13

14

char first = target[targetOffset];
15

int max = sourceOffset + (sourceCount - targetCount);
16

17

for (int i = sourceOffset + fromIndex; i <= max; i++) {
18

/* Look for first character. */
19

if (source[i] != first) {
20

while (++i <= max && source[i] != first);
21

}
22

23

/* Found first character, now look at the rest of v2 */
24

if (i <= max) {
25

int j = i + 1;
26

int end = j + targetCount - 1;
27

for (int k = targetOffset + 1; j < end && source[j] ==
28

target[k]; j++, k++);
29

30

if (j == end) {
31

/* Found whole string. */
32

return i - sourceOffset;
33

}
34

}
35

}
36

return -1;
37

}

感覺笨笨的java的方法，卻達到了高的性能。那跟別的比比呢？
之前有用過一個Rope結(jié)構(gòu)的，模仿了String，但是內(nèi)部不是字符數(shù)組實現(xiàn)，而是用了R*樹（如果我沒記錯的話）。Rope也有類似的indexOf實現(xiàn)，那么來比比吧，我為了增加算法的多樣性，還拿stringsearch這個包來調(diào)用BoyerMooreHorspool算法實驗一下。
具體的結(jié)果如下：
=====================================================
Cycles :      10000
String : 15991ns in pos 5000
Rope : 125542ns in pos 5000
KMP : 448211ns in pos 5000
BMH : 743977ns in pos 5000
=====================================================
Cycles :      30000
String : 27135ns in pos 15000
Rope : 122409ns in pos 15000
KMP : 1554487ns in pos 15000
BMH : 175424ns in pos 15000
=====================================================
Cycles :      90000
String : 77267ns in pos 45000
Rope : 284037ns in pos 45000
KMP : 266580ns in pos 45000
BMH : 906169ns in pos 45000
=====================================================
Cycles :     270000
String : 235725ns in pos 135000
Rope : 112890ns in pos 135000
KMP : 796252ns in pos 135000
BMH : 971998ns in pos 135000
=====================================================
Cycles :     810000
String : 698246ns in pos 405000
Rope : 328063ns in pos 405000
KMP : 2652067ns in pos 405000
BMH : 12013728ns in pos 405000
=====================================================
Cycles :    2430000
String : 2082535ns in pos 1215000
Rope : 884518ns in pos 1215000
KMP : 7261113ns in pos 1215000
BMH : 4895766ns in pos 1215000
這個實驗中，我每次去產(chǎn)生一個隨機的字符串src，然后模式串pat只選用了src串的中間位置開始的200個字符。這樣可以保證每次都是可以查找到的，因此從某種角度上講沒有測試最壞情況，因為返回-1這種情況實在是太easy 了。
實驗代碼如下：

1

package test;
2

3

import org.ahmadsoft.ropes.Rope;
4

import com.eaio.stringsearch.*;
5

import java.util.*;
6

7

/**
8

* @author Jia Yu
9

* @date 2010-9-28
10

*/
11

public class StringMatchTest {
12

13

static String src;
14

static String pat;
15

16

public static void generateString() {
17

StringBuilder sb = new StringBuilder();
18

int iter = (int) Tester.cycles;
19

Random rand = new Random(47);
20

for (int i = 0; i < iter; i++) {
21

sb.append((char)('a' + (rand.nextInt(26))));
22

}
23

src = sb.toString();
24

pat = sb.substring(iter / 2, iter / 2 + 200);
25

}
26

27

static void test() {
28

System.out
29

.println("=====================================================");
30

System.out.printf("%s : %10d\n", "Cycles\t", Tester.cycles);
31

generateString();
32

BaseLine baseLine = new BaseLine(src, pat);
33

RopeTest ropeTest = new RopeTest(src, pat);
34

KMPTest kmpTest = new KMPTest(src, pat);
35

BMHTest bmhTest = new BMHTest(src, pat);
36

37

baseLine.timedTest();
38

ropeTest.timedTest();
39

kmpTest.timedTest();
40

bmhTest.timedTest();
41

}
42

43

/**
44

* @param args
45

*/
46

public static void main(String[] args) {
47

// TODO Auto-generated method stub
48

int iter = 6;
49

for (int i = 0; i < iter; i++) {
50

test();
51

Tester.cycles *= 3;
52

}
53

}
54

55

}
56

57

abstract class Tester {
58

public static long cycles = 10000L;
59

protected String id = "error";
60

protected String src, pat;
61

62

public Tester(String s, String p) {
63

this.src = s;
64

this.pat = p;
65

}
66

67

public abstract int match();
68

69

public void timedTest() {
70

long start = System.nanoTime();
71

int pos = this.match();
72

long duriation = System.nanoTime() - start;
73

System.out.println(id + "\t : \t" + duriation + "ns in pos " + pos);
74

}
75

}
76

77

class BaseLine extends Tester {
78

public BaseLine(String s, String p) {
79

super(s, p);
80

id = "String";
81

}
82

83

@Override
84

public int match() {
85

// TODO Auto-generated method stub
86

return src.indexOf(pat);
87

}
88

}
89

90

class RopeTest extends Tester {
91

92

private Rope rope;
93

94

public RopeTest(String s, String p) {
95

super(s, p);
96

// TODO Auto-generated constructor stub
97

id = "Rope";
98

rope = Rope.BUILDER.build(s.toCharArray());
99

}
100

101

@Override
102

public int match() {
103

// TODO Auto-generated method stub
104

return rope.indexOf(pat);
105

}
106

107

}
108

109

class KMPTest extends Tester {
110

111

StringMatch sm = new StringMatch();
112

113

public KMPTest(String s, String p) {
114

super(s, p);
115

// TODO Auto-generated constructor stub
116

id = "KMP";
117

sm.fail(pat);
118

}
119

120

@Override
121

public int match() {
122

// TODO Auto-generated method stub
123

return sm.kmp_find(src, pat);
124

}
125

126

}
127

128

class BMHTest extends Tester{
129

130

private StringSearch ss;
131

public BMHTest(String s, String p) {
132

super(s, p);
133

// TODO Auto-generated constructor stub
134

id = "BMH";
135

ss = new BoyerMooreHorspool();
136

}
137

138

@Override
139

public int match() {
140

// TODO Auto-generated method stub
141

return ss.searchChars(src.toCharArray(), pat.toCharArray());
142

}
143

144

}

各種悲劇，在搞了一段時間后，我已經(jīng)被這些東西的時間差異搞暈了。看了看這篇文章http://en.wikipedia.org/wiki/String_searching_algorithm，里面講的比較清楚。羅里吧嗦的寫了這么多，沒有研究清楚String search，反倒是更迷糊了。希望有人能給我講講，針對我的實驗情況。總歸有些收獲的，KMP現(xiàn)在印在腦子里了，雖然對其性能沒啥信心。最后還要把用到的Stringsearch的下載地址附上，畢竟是別人的結(jié)果。
Stringsearch：http://johannburkard.de/software/stringsearch/。
另外關(guān)于BMH算法，可以看http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm。
P.S. 有個補充的東西，在rope實現(xiàn)的過程中，如果令cycles變?yōu)?0000，有興趣的可以試試結(jié)果。難不成我發(fā)現(xiàn)了些什么？如果令cycles不是3，而是以2的倍數(shù)指數(shù)的改變呢？

posted on 2010-09-28 16:30 changedi 閱讀(5783) 評論(9) 編輯收藏所屬分類: 算法

理論上說簡單的匹配算法是O(m*n)的時間復(fù)雜度，而KMP可以達到O(m+n)，在我的測試?yán)铮琄MP的實現(xiàn)也很難超過簡單的匹配算法，這個跟測試數(shù)據(jù)有一定關(guān)系，只有在m和n足夠大，也就是匹配串和被匹配字符串足夠長的時候，KMP算法才能體現(xiàn)出來一些優(yōu)勢來。

想超過簡單匹配，可以嘗試下BM、Shift-And、Shift-Or之類的匹配算法。回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2010-09-30 16:41 changedi

@denniis

哦，謝謝Denniis大牛的指教。我一直覺得在尋找200長度的模式串已經(jīng)算是長的了。
有時間的話我會看看關(guān)于String search相關(guān)的paper的。回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考[未登錄] 2010-12-31 00:11 tony

程序好像有點問題：
21 while((pat.charAt(j)==pat.charAt(i+1))&&(i>=0)) i = f[i];
應(yīng)該是不等于吧

測試了一下
s = "abxabababc"
pat = "ababc"
輸出是： -1

結(jié)果應(yīng)該是：5 回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2011-01-01 20:01 changedi

@tony
沒錯，改過了，謝謝提示回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2012-12-29 14:56 dizh

while((pat.charAt(j)==pat.charAt(i+1))&&(i>=0)) i = f[i];
22 if(pat.charAt(j)!=pat.charAt(i+1)) f[j] = i + 1;
你寫錯了吧，應(yīng)該是while里面是！=，if是== 回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2012-12-29 14:59 dizh

還是說K的取值覺定了？
f[j] = k < j where k is the maximum of pat[0..k] == pat[j-k..j]
你這里說要去最大K，你那么做是對的，可是書上說要取最小K，當(dāng)然得按照我和tony說的。回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2014-04-15 16:18 seki

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab
aaaaaaaaaab

第一行的a你弄1w個，第2行的a你弄1k個，再試試... 回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考 2015-04-24 18:15 ss

@seki
KMP主要對字符重復(fù)性較高的字符串有著高性能，其他的情況，性能應(yīng)該不會有很大差距。回復(fù) 更多評論

# re: 關(guān)于String里indexOf()的一些思考[未登錄] 2016-02-26 16:43 z

是因為charAt函數(shù)太耗時了可以先toCharArray再訪問數(shù)組回復(fù) 更多評論

新用戶注冊刷新評論列表


只有注冊用戶登錄后才能發(fā)表評論。




網(wǎng)站導(dǎo)航: 博客園 IT新聞 Chat2DB C++博客博問管理
相關(guān)文章: 深入理解動態(tài)規(guī)劃的一系列問題（15）深入理解動態(tài)規(guī)劃的一系列問題（14）深入理解動態(tài)規(guī)劃的一系列問題（13）深入理解動態(tài)規(guī)劃的一系列問題（12）深入理解動態(tài)規(guī)劃的一系列問題（11）深入理解動態(tài)規(guī)劃的一系列問題（10）深入理解動態(tài)規(guī)劃的一系列問題（9）深入理解動態(tài)規(guī)劃的一系列問題（8）深入理解動態(tài)規(guī)劃的一系列問題（7）深入理解動態(tài)規(guī)劃的一系列問題（6）

Change Dir

導(dǎo)航

公告

隨筆分類(125)

隨筆檔案(123)

統(tǒng)計

留言簿(18)

積分與排名

“牛”們的博客

各個公司技術(shù)

我的鏈接

淘寶技術(shù)

閱讀排行榜

評論排行榜

關(guān)于String里indexOf()的一些思考

評論

# re: 關(guān)于String里indexOf()的一些思考 2010-09-30 11:36 denniis

# re: 關(guān)于String里indexOf()的一些思考 2010-09-30 16:41 changedi

# re: 關(guān)于String里indexOf()的一些思考[未登錄] 2010-12-31 00:11 tony

# re: 關(guān)于String里indexOf()的一些思考 2011-01-01 20:01 changedi

# re: 關(guān)于String里indexOf()的一些思考 2012-12-29 14:56 dizh

# re: 關(guān)于String里indexOf()的一些思考 2012-12-29 14:59 dizh

# re: 關(guān)于String里indexOf()的一些思考 2014-04-15 16:18 seki

# re: 關(guān)于String里indexOf()的一些思考 2015-04-24 18:15 ss

# re: 關(guān)于String里indexOf()的一些思考[未登錄] 2016-02-26 16:43 z