隨筆 - 312, 文章 - 14, 評論 - 1393, 引用 - 0
          數據加載中……

          《程序員》第9期智慧擂臺題目——高頻詞匯提取

          本文為原創,如需轉載,請注明作者和出處,謝謝!

              面對浩瀚的信息海洋,找到想要的資源有時真的是不容易。在大量文字中搜索高頻詞匯是信息搜索和數據壓縮的共通課題。
          這次智慧擂臺請大家在一個比較龐大的英文文本中找出M個數量最多的短語(由N個單詞組成)。統一處理相同的文本文件,該文本只包含英文單詞、空格和回行符,比較誰的程序效率最高。

            程序輸入:M,N,文本文件路徑(M不超過20,N不超過8)
          程序輸出:高頻短語及其數量清單

          擂臺規則:提交符合以上要求的可執行程序,語言招式不限,點到為止;
          我們將在統一的環境下對每位選手的作品進行公平的測試,
          比較出綜合用時最少的程序。

          源程序


          import java.io.*;
          import java.util.*;

          class tt
          {
              
          public String phrase;
              
          public int count;
          }

          public class searchphrase
          {

              
          private static LinkedHashMap phrase = new LinkedHashMap();
              
          static tt[] max_phrase;

              
          private static Vector SeparateString(String s)
              {
                  Vector v 
          = new Vector();
                  String temp 
          = "";
                  
          for (int i = 0; i < s.length(); i++)
                  {
                      
          if (s.charAt(i) != ' ')
                      {
                          temp 
          += s.charAt(i);
                      }
                      
          else
                      {
                          
          if (temp != "")
                              v.add(temp);
                          temp 
          = "";
                      }
                  }
                  
          if (temp != "")
                      v.add(temp);
                  
          return v;
              }

              
          private static void swap(int pos, int count, String phrase)
              {
                  
          int i;
                  
          if (max_phrase[pos - 1].count < count)
                  {
                      
          for (i = pos - 1; i > 0; i--)
                      {
                          
          if (max_phrase[i - 1].count > max_phrase[i].count)
                              
          break;
                      }
                      max_phrase[pos].count 
          = max_phrase[i].count;
                      max_phrase[pos].phrase 
          = max_phrase[i].phrase;
                      max_phrase[i].count 
          = count;
                      max_phrase[i].phrase 
          = phrase;
                  }

              }

              
          private static void adjust_max(int count, String phrase)
              {
                  
          int i, j;
                  
          if (count <= max_phrase[max_phrase.length - 1].count)
                      
          return;
                  
          for (i = max_phrase.length - 1; i >= 0; i--)
                  {
                      
          if (max_phrase[i].phrase.equals(phrase))
                      {
                          max_phrase[i].count 
          = count;
                          
          if (i > 0)
                          {
                              swap(i, count, phrase);
                          }
                          
          return;
                      }
                  }
                  max_phrase[max_phrase.length 
          - 1].count = count;
                  max_phrase[max_phrase.length 
          - 1].phrase = phrase;
                  
          if (i > 0)
                  {
                      swap(max_phrase.length 
          - 1, count, phrase);
                  }
              }

              
          private static void js(Vector v, int n)
              {
                  String s;
                  
          for (int i = 0; i < v.size() - n + 1; i++)
                  {
                      s 
          = "";
                      
          for (int j = i; j < i + n; j++)
                      {
                          s 
          += v.get(j) + " ";
                      }
                      
          int count = 1;
                      
          if (phrase.containsKey(s.hashCode()))
                      {
                          count 
          = Integer.parseInt(phrase.get(s.hashCode()).toString());
                          count
          ++;
                      }
                      phrase.put(s.hashCode(), count);
                      adjust_max(count, s);
                  }
              }

              
          public static void main(String[] args)
              {
                  
          try
                  {
                      
          long t;
                      
          int m, n;
                      String path;
                      m 
          = Integer.parseInt(args[0]);
                      n 
          = Integer.parseInt(args[1]);
                      path 
          = args[2];
                      max_phrase 
          = new tt[m];
                      
          for (int i = 0; i < m; i++)
                      {
                          max_phrase[i] 
          = new tt();
                          max_phrase[i].count 
          = 0;
                          max_phrase[i].phrase 
          = "";
                      }
                      t 
          = (new java.util.Date()).getTime();
                      java.io.FileReader fr 
          = new java.io.FileReader(path);
                      java.io.BufferedReader br 
          = new BufferedReader(fr);
                      String s;

                      Vector v 
          = null;

                      
          while ((s = br.readLine()) != null)
                      {
                          v 
          = SeparateString(s);
                          js(v, n);
                      }
                      
          for (int i = 0; i < m; i++)
                      {
                          System.out.println(max_phrase[i].phrase);
                          System.out.println(max_phrase[i].count);
                          System.out.println();
                      }
                      t 
          = (new java.util.Date()).getTime() - t;
                      System.out.print(t);
                      System.out.println(
          " ms");
                  }
                  
          catch (Exception e)
                  {
                      System.out.println(e.getMessage());
                  }

              }

          }

          測試結果1:m = 20 n = 8 

          under games played won drawn lost goals for
          71

          tabulated under games played won drawn lost goals
          70

          games played won drawn lost goals for against
          70

          May Xinhua Following are the results from the
          69

          played won drawn lost goals for against and
          59

          won drawn lost goals for against and points
          59

          Jan Xinhua Following are the results from the
          48

          Chinas economic efficiency indicators of the sector of
          39

          The industrial statistics include all stateowned enterprises and
          39

          industrial statistics include all stateowned enterprises and the
          39

          statistics include all stateowned enterprises and the nonstateowned
          39

          include all stateowned enterprises and the nonstateowned ones
          39

          all stateowned enterprises and the nonstateowned ones with
          39

          stateowned enterprises and the nonstateowned ones with annual
          39

          enterprises and the nonstateowned ones with annual sales
          39

          and the nonstateowned ones with annual sales income
          39

          Xinhua Chinas economic efficiency indicators of the sector
          39

          the nonstateowned ones with annual sales income over
          39

          nonstateowned ones with annual sales income over million
          39

          up percent over the same period last year
          35

          13594 ms


          測試結果2  m = 10 n = 5

          Xinhua Following are the results
          295

          May Xinhua Following are the
          209

          Following are the results from
          183

          are the results from the
          176

          April Xinhua Following are the
          141

          Jan Xinhua Following are the
          122

          billion yuan billion US dollars
          120

          won drawn lost goals for
          88

          played won drawn lost goals
          88

          Dec Xinhua Following are the
          87

          12437 ms

           
          以上源程序是采用的是最簡單的方法,誰有更好,效率更高的方法,請跟貼!!





          Android開發完全講義(第2版)(本書版權已輸出到臺灣)

          http://product.dangdang.com/product.aspx?product_id=22741502



          Android高薪之路:Android程序員面試寶典 http://book.360buy.com/10970314.html


          新浪微博:http://t.sina.com.cn/androidguy   昵稱:李寧_Lining

          posted on 2008-05-10 09:37 銀河使者 閱讀(1673) 評論(1)  編輯  收藏 所屬分類: javaalgorithm 原創

          評論

          # re: 《程序員》第9期智慧擂臺題目——高頻詞匯提取  回復  更多評論   

          恩,程序不錯,應該是最簡單的了
          2008-09-01 11:51 | andyluohao
          主站蜘蛛池模板: 共和县| 沅江市| 广南县| 普陀区| 邳州市| 邢台县| 独山县| 阳江市| 芷江| 岢岚县| 平潭县| 和政县| 莎车县| 益阳市| 崇信县| 永登县| 兰州市| 阿坝| 大同县| 华安县| 巴林右旗| 隆尧县| 惠水县| 吐鲁番市| 宜良县| 开封市| 靖江市| 铜川市| 汉川市| 莲花县| 乳源| 弋阳县| 略阳县| 婺源县| 武定县| 保德县| 余姚市| 奉新县| 宜阳县| 延津县| 崇左市|