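The overall ICTCLAS segmentation pipeline has five steps: 1) preliminary segmentation; 2) part-of-speech (POS) tagging; 3) person-name and place-name recognition; 4) re-segmentation; 5) POS re-tagging. The first step, preliminary segmentation, is itself divided into three sub-steps: 1) atomic segmentation; 2) finding every possible way of combining the atoms into words; 3) rough segmentation of Chinese words using N-shortest paths.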

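Of all of this, reading the dictionary files is the most basic function. ICTCLAS keeps its dictionaries in the Data directory. The commonly used ones are coreDict.dct (the core dictionary), BigramDict.dct (word-to-word associations), nr.dct (person names), ns.dct (place names), and tr.dct (transliterated person names). They all share exactly the same file format and are all parsed with the CDictionary class. For a deeper look at the dictionary structure, see sinboy's article 《ICTCLAS分詞系統研究（二）--詞典結構》 (A Study of the ICTCLAS Segmentation System (2): Dictionary Structure), which describes it in detail. Here I only give the SharpICTCLAS implementation.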

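First come the definitions of the basic elements. In SharpICTCLAS I adjusted some of the original names to make them more meaningful and better suited to C# conventions. The code is as follows: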

WordDictionaryElement.cs listing

using System;
using System.Collections.Generic;
using System.Text;

namespace SharpICTCLAS
{
   //==================================================
   // Original predefined in DynamicArray.h file
   //==================================================
   public class ArrayChainItem
   {
      public int col, row;//row and column
      public double value;//The value of the array
      public int nPOS;
      public int nWordLen;
      public string sWord;
      //The possible POS of the word related to the segmentation graph
      public ArrayChainItem next;
   }

   public class WordResult
   {
      //The word
      public string sWord;

      //the POS of the word
      public int nPOS;

      //The -log(frequency/MAX)
      public double dValue;
   }

   //--------------------------------------------------
   // data structure for word item
   //--------------------------------------------------
   public class WordItem
   {
      public int nWordLen;

      //The word
      public string sWord;

      //the process or information handle of the word
      public int nPOS;

      //The number of times the word appears
      public int nFrequency;
   }

   //--------------------------------------------------
   //data structure for dictionary index table item
   //--------------------------------------------------
   public class IndexTableItem
   {
      //The count number of words which initial letter is sInit
      public int nCount;

      //The head of word items
      public WordItem[] WordItems;
   }

   //--------------------------------------------------
   //data structure for word item chain
   //--------------------------------------------------
   public class WordChain
   {
      public WordItem data;
      public WordChain next;
   }

   //--------------------------------------------------
   //data structure for dictionary modify table item
   //--------------------------------------------------
   public class ModifyTableItem
   {
      //The count number of words which initial letter is sInit
      public int nCount;

      //The number of deleted items in the index table
      public int nDelete;

      //The head of word items
      public WordChain pWordItemHead;
   }
}

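ModifyTableItem is used to build the ModifyTable. During actual segmentation, however, the dictionary is usually in a read-only state, so the ModifyTable, which exists for modifying the dictionary, plays little role in practice. I therefore omit the ModifyTable code for now in what follows.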

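With the basic elements defined, it is time to define the dictionary class. In the original C++ code every class name begins with a capital "C", and the dictionary class is called CDictionary. In SharpICTCLAS I dropped the leading "C" and, to avoid a name clash with the system Dictionary class, named it WordDictionary. This class is mainly responsible for reading, writing, and searching the dictionary. Let's see how the dictionary is read: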

Reading the dictionary:

using System;
using System.IO;
using System.Text;

public class WordDictionary
{
   public bool bReleased = true;

   public IndexTableItem[] indexTable;
   public ModifyTableItem[] modifyTable;

   public bool Load(string sFilename)
   {
      return Load(sFilename, false);
   }

   public bool Load(string sFilename, bool bReset)
   {
      int frequency, wordLength, pos;   //frequency, word length, part of speech
      bool isSuccess = true;
      FileStream fileStream = null;
      BinaryReader binReader = null;

      try
      {
         fileStream = new FileStream(sFilename, FileMode.Open, FileAccess.Read);
         if (fileStream == null)
            return false;

         binReader = new BinaryReader(fileStream, Encoding.GetEncoding("gb2312"));

         indexTable = new IndexTableItem[Predefine.CC_NUM];

         bReleased = false;
         for (int i = 0; i < Predefine.CC_NUM; i++)
         {
            //Read how many words start with this Chinese character
            indexTable[i] = new IndexTableItem();
            indexTable[i].nCount = binReader.ReadInt32();

            if (indexTable[i].nCount <= 0)
               continue;

            indexTable[i].WordItems = new WordItem[indexTable[i].nCount];

            for (int j = 0; j < indexTable[i].nCount; j++)
            {
               indexTable[i].WordItems[j] = new WordItem();

               frequency = binReader.ReadInt32();   //read the frequency
               wordLength = binReader.ReadInt32();  //read the word length (in bytes)
               pos = binReader.ReadInt32();         //read the part of speech

               if (wordLength > 0)
                  indexTable[i].WordItems[j].sWord = Utility.ByteArray2String(binReader.ReadBytes(wordLength));
               else
                  indexTable[i].WordItems[j].sWord = "";

               //Reset the frequency
               if (bReset)
                  indexTable[i].WordItems[j].nFrequency = 0;
               else
                  indexTable[i].WordItems[j].nFrequency = frequency;

               indexTable[i].WordItems[j].nWordLen = wordLength;
               indexTable[i].WordItems[j].nPOS = pos;
            }
         }
      }
      catch (Exception e)
      {
         Console.WriteLine(e.Message);
         isSuccess = false;
      }
      finally
      {
         if (binReader != null)
            binReader.Close();

         if (fileStream != null)
            fileStream.Close();
      }
      return isSuccess;
   }
   //......
}
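
The record layout that Load walks through can be tried out without a real .dct file. The following is only a sketch under stated assumptions: DictLayoutDemo and Entry are hypothetical names that are not part of SharpICTCLAS, the caller passes the number of units explicitly instead of using Predefine.CC_NUM, and the POS values and word bytes in any test data are placeholders. It writes units in the same order Load reads them (count, then frequency, word length, POS, and the word bytes for each entry) and reads them back.

```csharp
using System;
using System.IO;

// Hypothetical demo type; it is not part of SharpICTCLAS.
public static class DictLayoutDemo
{
    // One entry as Load sees it. Tail holds the word bytes without the first character.
    public sealed class Entry
    {
        public int Frequency;
        public int Pos;
        public byte[] Tail;

        public Entry(int frequency, int pos, byte[] tail)
        {
            Frequency = frequency;
            Pos = pos;
            Tail = tail;
        }
    }

    // Writes each unit as: entry count, then (frequency, length, POS, bytes) per entry.
    public static byte[] Write(Entry[][] units)
    {
        var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream))
        {
            foreach (Entry[] unit in units)
            {
                writer.Write(unit.Length);
                foreach (Entry e in unit)
                {
                    writer.Write(e.Frequency);
                    writer.Write(e.Tail.Length);
                    writer.Write(e.Pos);
                    writer.Write(e.Tail);
                }
            }
        }
        // ToArray still works after the stream has been closed by the writer.
        return stream.ToArray();
    }

    // Reads unitCount units back, following the same field order as Load.
    public static Entry[][] Read(byte[] data, int unitCount)
    {
        var units = new Entry[unitCount][];
        using (var reader = new BinaryReader(new MemoryStream(data)))
        {
            for (int i = 0; i < unitCount; i++)
            {
                int count = reader.ReadInt32();
                units[i] = new Entry[Math.Max(count, 0)];
                for (int j = 0; j < count; j++)
                {
                    int frequency = reader.ReadInt32();
                    int wordLength = reader.ReadInt32();
                    int pos = reader.ReadInt32();
                    byte[] tail = wordLength > 0
                        ? reader.ReadBytes(wordLength)
                        : new byte[0];
                    units[i][j] = new Entry(frequency, pos, tail);
                }
            }
        }
        return units;
    }
}
```

Keeping the word as raw bytes sidesteps the gb2312 decoding that Load performs through Utility.ByteArray2String, so the sketch stays independent of which code pages the runtime has registered.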

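The listing below is excerpted from the dictionary units with CCID 2, 3, 4, and 5. There are 6768 CCIDs, one per Chinese character (as indexes into indexTable they run from 0 to 6767), and every word that begins with a given character is recorded in that character's unit. The words stored in the dictionary omit their first character (I have added it back in parentheses); the first character is simply the character the unit corresponds to. For each word the dictionary stores its length, frequency, part of speech, and the word itself. The stored length counts the bytes that follow the first character, which is why a single-character word has length 0 and each further Chinese character adds 2.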

Note in particular that within a unit the words are sorted by CCID. This is crucial for the analysis that follows.

Excerpt from the ICTCLAS dictionary

Character: 埃, ID: 2

WordLen  Freq  POS  Word
    0   128    h   (埃)
    0     0    j   (埃)
    2     4    n   (埃)鎊
    2    28    ns (埃)鎊
    4     4    n   (埃)菲爾
    2   511    ns (埃)及
    4     4    ns (埃)克森
    6     2    ns (埃)拉特灣
    4     4    nr (埃)里溫
    6     2    nz (埃)默魯市
    2    27    n   (埃)塞
    8    64    ns (埃)塞俄比亞
   22     2    ns (埃)塞俄比亞聯邦民主共和國
    4     3    ns (埃)塞薩
    4     4    ns (埃)舍德
    6     2    nr (埃)斯特角
    4     2    ns (埃)松省
    4     3    nr (埃)特納
    6     2    nz (埃)因霍溫
====================================
Character: 挨, ID: 3

WordLen  Freq  POS  Word
    0    56    h   (挨)
    2     1    j   (挨)次
    2    19    n   (挨)打
    2     3    ns (挨)凍
    2     1    n   (挨)斗
    2     9    ns (挨)餓
    2     4    ns (挨)個
    4     2    ns (挨)個兒
    6    17    nr (挨)家挨戶
    2     1    nz (挨)近
    2     0    n   (挨)罵
    6     1    ns (挨)門挨戶
    2     1    ns (挨)批
    2     0    ns (挨)整
    2    12    ns (挨)著
    2     0    nr (挨)揍
====================================
Character: 哎, ID: 4

WordLen  Freq  POS  Word
    0    10    h   (哎)
    2     3    j   (哎)呀
    2     2    n   (哎)喲
====================================
Character: 唉, ID: 5

WordLen  Freq  POS  Word
    0     9    h   (唉)
    6     4    j   (唉)聲嘆氣
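
The IDs in this excerpt are consistent with a zero-based index over the Chinese-character rows of GB2312. The sketch below rests on two assumptions that the text above does not spell out: that 埃, 挨, 哎, and 唉 are encoded in GB2312 as B0A3, B0A4, B0A5, and B0A6, and that the ID is computed as (lead byte - 0xB0) * 94 + (trail byte - 0xA1). CcIdDemo and its members are hypothetical names, not SharpICTCLAS API.

```csharp
using System;

// Hypothetical helper; it is not part of SharpICTCLAS.
public static class CcIdDemo
{
    public const int CellsPerRow = 94;   // GB2312 has 94 cells per row
    public const int FirstLead = 0xB0;   // lead byte of the first Chinese-character row
    public const int LastLead = 0xF7;    // lead byte of the last Chinese-character row
    public const int FirstTrail = 0xA1;  // trail byte of the first cell in a row
    public const int LastTrail = 0xFE;   // trail byte of the last cell in a row

    // 72 rows of 94 cells each: 6768 slots, matching the 6768 units of the dictionary.
    public const int CcNum = (LastLead - FirstLead + 1) * CellsPerRow;

    // Maps a GB2312 byte pair to a zero-based ID (assumed formula, see the lead-in).
    public static int CcId(byte lead, byte trail)
    {
        if (lead < FirstLead || lead > LastLead)
            throw new ArgumentOutOfRangeException(nameof(lead), "not a Chinese-character lead byte");
        if (trail < FirstTrail || trail > LastTrail)
            throw new ArgumentOutOfRangeException(nameof(trail), "not a valid trail byte");

        return (lead - FirstLead) * CellsPerRow + (trail - FirstTrail);
    }
}
```

Under these assumptions CcIdDemo.CcId(0xB0, 0xA3) evaluates to 2, the ID listed for 埃, and the largest possible ID is 6767.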

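Note also that a word may have more than one part of speech, so the same word may appear several times in the dictionary with different POS tags. To locate an entry uniquely, you must specify both the word and its part of speech.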

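Another heavily used feature of the WordDictionary class is word lookup, implemented by the FindInOriginalTable method. In the original ICTCLAS code this method has a fairly complicated structure because it serves several lookup needs at once, and the code is correspondingly complex. In SharpICTCLAS I overloaded the method, writing a separate FindInOriginalTable for each lookup purpose, which simplifies both the interface and the code. One of the overloads is shown below; it checks whether a word with a given part of speech exists.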

One overload of the FindInOriginalTable method

private bool FindInOriginalTable(int nInnerCode, string sWord, int nPOS)
{
   WordItem[] pItems = indexTable[nInnerCode].WordItems;

   int nStart = 0, nEnd = indexTable[nInnerCode].nCount - 1;
   int nMid = (nStart + nEnd) / 2, nCmpValue;

   //Binary search
   while (nStart <= nEnd)
   {
      nCmpValue = Utility.CCStringCompare(pItems[nMid].sWord, sWord);

      if (nCmpValue == 0 && (pItems[nMid].nPOS == nPOS || nPOS == -1))
         return true;//find it
      else if (nCmpValue < 0 || (nCmpValue == 0 && pItems[nMid].nPOS < nPOS && nPOS != -1))
         nStart = nMid + 1;
      else if (nCmpValue > 0 || (nCmpValue == 0 && pItems[nMid].nPOS > nPOS && nPOS != -1))
         nEnd = nMid - 1;

      nMid = (nStart + nEnd) / 2;
   }
   return false;
}
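
The same search can be tried in isolation. The sketch below is not the SharpICTCLAS code: LookupDemo and Item are hypothetical names, string.CompareOrdinal stands in for Utility.CCStringCompare (whose exact ordering is not shown here), and the input array is assumed to be sorted by word and then by POS. As in the method above, passing -1 as the POS matches any entry for the word.

```csharp
using System;

// Hypothetical demo type; it is not part of SharpICTCLAS.
public static class LookupDemo
{
    public sealed class Item
    {
        public string Word;
        public int Pos;

        public Item(string word, int pos)
        {
            Word = word;
            Pos = pos;
        }
    }

    // Binary search over items sorted by (Word, Pos).
    // Returns the index of a matching entry, or -1 if there is none.
    // pos == -1 means "any part of speech".
    public static int Find(Item[] items, string word, int pos)
    {
        int start = 0, end = items.Length - 1;

        while (start <= end)
        {
            int mid = start + (end - start) / 2;

            // Compare by word first; CompareOrdinal is only a stand-in
            // for the dictionary's own string comparison.
            int cmp = string.CompareOrdinal(items[mid].Word, word);

            // Break ties on POS unless the caller accepts any POS.
            if (cmp == 0 && pos != -1)
                cmp = items[mid].Pos.CompareTo(pos);

            if (cmp == 0)
                return mid;       // found it
            if (cmp < 0)
                start = mid + 1;  // target lies in the upper half
            else
                end = mid - 1;    // target lies in the lower half
        }
        return -1;
    }
}
```

Folding the POS into the comparison result keeps the loop to a single three-way branch, and returning the index instead of a bool lets a caller read the frequency of the entry it found.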

The other features are not covered here.

Summary

1. The WordDictionary class implements reading, writing, modifying, and searching the dictionary.

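2. The dictionary records the words that begin with each of the 6768 Chinese characters, together with their part of speech and frequency; it is worth understanding the exact structure.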

Source: http://www.cnblogs.com/zhenyulu/category/85598.html

Posted on 2007-12-28 19:21 by 刀劍笑. Category: SharpICTCLAS
