《程序員》第9期智慧擂臺題目——高頻詞匯提取
面對浩瀚的信息海洋,找到想要的資源有時真的是不容易。在大量文字中搜索高頻詞匯是信息搜索和數(shù)據(jù)壓縮的共通課題。
這次智慧擂臺請大家在一個比較龐大的英文文本中找出M個數(shù)量最多的短語(由N個單詞組成)。統(tǒng)一處理相同的文本文件,該文本只包含英文單詞、空格和回行符,比較誰的程序效率最高。
程序輸入:M,N,文本文件路徑(M不超過20,N不超過8)
程序輸出:高頻短語及其數(shù)量清單
擂臺規(guī)則:提交符合以上要求的可執(zhí)行程序,語言招式不限,點到為止;
我們將在統(tǒng)一的環(huán)境下對每位選手的作品進行公平的測試,
比較出綜合用時最少的程序。
源程序
import java.util.*;
class tt
{
public String phrase;
public int count;
}
public class searchphrase
{
private static LinkedHashMap phrase = new LinkedHashMap();
static tt[] max_phrase;
private static Vector SeparateString(String s)
{
Vector v = new Vector();
String temp = "";
for (int i = 0; i < s.length(); i++)
{
if (s.charAt(i) != ' ')
{
temp += s.charAt(i);
}
else
{
if (temp != "")
v.add(temp);
temp = "";
}
}
if (temp != "")
v.add(temp);
return v;
}
private static void swap(int pos, int count, String phrase)
{
int i;
if (max_phrase[pos - 1].count < count)
{
for (i = pos - 1; i > 0; i--)
{
if (max_phrase[i - 1].count > max_phrase[i].count)
break;
}
max_phrase[pos].count = max_phrase[i].count;
max_phrase[pos].phrase = max_phrase[i].phrase;
max_phrase[i].count = count;
max_phrase[i].phrase = phrase;
}
}
private static void adjust_max(int count, String phrase)
{
int i, j;
if (count <= max_phrase[max_phrase.length - 1].count)
return;
for (i = max_phrase.length - 1; i >= 0; i--)
{
if (max_phrase[i].phrase.equals(phrase))
{
max_phrase[i].count = count;
if (i > 0)
{
swap(i, count, phrase);
}
return;
}
}
max_phrase[max_phrase.length - 1].count = count;
max_phrase[max_phrase.length - 1].phrase = phrase;
if (i > 0)
{
swap(max_phrase.length - 1, count, phrase);
}
}
private static void js(Vector v, int n)
{
String s;
for (int i = 0; i < v.size() - n + 1; i++)
{
s = "";
for (int j = i; j < i + n; j++)
{
s += v.get(j) + " ";
}
int count = 1;
if (phrase.containsKey(s.hashCode()))
{
count = Integer.parseInt(phrase.get(s.hashCode()).toString());
count++;
}
phrase.put(s.hashCode(), count);
adjust_max(count, s);
}
}
public static void main(String[] args)
{
try
{
long t;
int m, n;
String path;
m = Integer.parseInt(args[0]);
n = Integer.parseInt(args[1]);
path = args[2];
max_phrase = new tt[m];
for (int i = 0; i < m; i++)
{
max_phrase[i] = new tt();
max_phrase[i].count = 0;
max_phrase[i].phrase = "";
}
t = (new java.util.Date()).getTime();
java.io.FileReader fr = new java.io.FileReader(path);
java.io.BufferedReader br = new BufferedReader(fr);
String s;
Vector v = null;
while ((s = br.readLine()) != null)
{
v = SeparateString(s);
js(v, n);
}
for (int i = 0; i < m; i++)
{
System.out.println(max_phrase[i].phrase);
System.out.println(max_phrase[i].count);
System.out.println();
}
t = (new java.util.Date()).getTime() - t;
System.out.print(t);
System.out.println(" ms");
}
catch (Exception e)
{
System.out.println(e.getMessage());
}
}
}
測試結(jié)果1:m = 20 n = 8
under games played won drawn lost goals for
71
tabulated under games played won drawn lost goals
70
games played won drawn lost goals for against
70
May Xinhua Following are the results from the
69
played won drawn lost goals for against and
59
won drawn lost goals for against and points
59
Jan Xinhua Following are the results from the
48
Chinas economic efficiency indicators of the sector of
39
The industrial statistics include all stateowned enterprises and
39
industrial statistics include all stateowned enterprises and the
39
statistics include all stateowned enterprises and the nonstateowned
39
include all stateowned enterprises and the nonstateowned ones
39
all stateowned enterprises and the nonstateowned ones with
39
stateowned enterprises and the nonstateowned ones with annual
39
enterprises and the nonstateowned ones with annual sales
39
and the nonstateowned ones with annual sales income
39
Xinhua Chinas economic efficiency indicators of the sector
39
the nonstateowned ones with annual sales income over
39
nonstateowned ones with annual sales income over million
39
up percent over the same period last year
35
13594 ms
測試結(jié)果2 m = 10 n = 5
Xinhua Following are the results
295
May Xinhua Following are the
209
Following are the results from
183
are the results from the
176
April Xinhua Following are the
141
Jan Xinhua Following are the
122
billion yuan billion US dollars
120
won drawn lost goals for
88
played won drawn lost goals
88
Dec Xinhua Following are the
87
12437 ms
以上源程序是采用的是最簡單的方法,誰有更好,效率更高的方法,請跟貼!!
《Android開發(fā)完全講義(第2版)》(本書版權(quán)已輸出到臺灣)
http://product.dangdang.com/product.aspx?product_id=22741502
《Android高薪之路:Android程序員面試寶典 》http://book.360buy.com/10970314.html
新浪微博:http://t.sina.com.cn/androidguy 昵稱:李寧_Lining
posted on 2008-05-10 09:37 銀河使者 閱讀(1673) 評論(1) 編輯 收藏 所屬分類: java 、algorithm 、 原創(chuàng)