emu in blogjava

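Author: emu (黃希彤). Chinese character encoding problems have been around ever since the Connector/J release (3.1.?) that came with mysql 4.1. http://www.csip.cn/new/st/db/2004/0804/428.htm describes one fix. But I am now using mysql 5.0 beta with Connector/J (mysql-connector-java-3.2.0-alpha), and that method no longer applies, so I'll take the opportunity to analyze the Connector/J source a little.
Download location for mysql-connector-java-3.2.0-alpha: http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-3.2.0-alpha.zip/from/pick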

Connector/J 3.2 no longer looks the way http://www.csip.cn/new/st/db/2004/0804/428.htm describes it. The original "com.mysql.jdbc.Connecter.java" no longer exists, and "this.doUnicode = true;" has become setDoUnicode(true) in com.mysql.jdbc.Connection.java. Both calls to it in the Connection class are inside the checkServerEncoding method (lines 2687 and 2716), and checkServerEncoding is only called from the initializePropsFromServer method:

            //
            // We only do this for servers older than 4.1.0, because
            // 4.1.0 and newer actually send the server charset
            // during the handshake, and that's handled at the
            // top of this method...
            //
            if (!clientCharsetIsConfigured) {
                checkServerEncoding();
            }
The comment says this method only needs to be called for servers older than 4.1.0; with mysql 5.0, execution never enters it at all.

Finding nothing wrong in initialize, I stepped straight into ResultSet.getString instead. After some effort I finally located the place where things go wrong: com.mysql.jdbc.SingleByteCharsetConverter

/**
 * Convert the byte buffer from startPos to a length of length
 * to a string using this instance's character encoding.
 *
 * @param buffer the bytes to convert
 * @param startPos the index to start at
 * @param length the number of bytes to convert
 * @return the String representation of the given bytes
 */
public final String toString(byte[] buffer, int startPos, int length) {
    char[] charArray = new char[length];
    int readpoint = startPos;

    for (int i = 0; i < length; i++) {
        charArray[i] = this.byteToChars[buffer[readpoint] - Byte.MIN_VALUE];
        readpoint++;
    }

    return new String(charArray);
}

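On entry to this method everything is still fine: buffer holds the correct double-byte data fetched from the database (each Chinese character occupies two bytes).
Right at the top, the method allocates a char array, which is essentially the raw form of a String. Look at how many chars it allocates: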
char[] charArray = new char[length];
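Hmm: the number of chars equals the length of the byte array, which means every Chinese character will be turned into two chars.
The loop that follows converts the bytes in the array to chars one at a time. Again it does nothing to handle multi-byte data; it simply turns one Chinese character into two chars. The string is finally built from this char array, so how could it not be wrong? Let's rework the toString method: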

    public final String toString(byte[] buffer, int startPos, int length) {
        return new String(buffer,startPos,length);
    }

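This is probably the simplest way to solve the problem (it decodes with the JVM's default charset, so it only works when that matches the encoding of the data). But we can still dig into the cause and see whether there is a better fix.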
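To see the effect in isolation, here is a small standalone sketch. It is not Connector/J code: it assumes the JVM provides the GBK charset, and perByte is a hypothetical stand-in for the driver's byteToChars lookup that uses a plain ISO8859-1 style mapping (byte value to the same code point).

```java
import java.nio.charset.Charset;

public class SingleByteDemo {
    // Hypothetical stand-in for a single-byte converter: exactly one char per byte.
    static String perByte(byte[] buffer) {
        char[] chars = new char[buffer.length];
        for (int i = 0; i < buffer.length; i++) {
            // Treat each byte as an unsigned value and use it as the code point.
            chars[i] = (char) (buffer[i] & 0xFF);
        }
        return new String(chars);
    }

    // Decodes the whole buffer at once with a charset that understands multi-byte sequences.
    static String decoded(byte[] buffer, Charset charset) {
        return new String(buffer, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JVM
        String original = "\u6c49\u5b57";     // two Chinese characters
        byte[] bytes = original.getBytes(gbk);

        System.out.println("bytes:          " + bytes.length);
        System.out.println("per-byte chars: " + perByte(bytes).length());
        System.out.println("decoded equals original: " + decoded(bytes, gbk).equals(original));
    }
}
```

The per-byte version yields as many chars as there are bytes, which is the symptom described above; decoding the buffer as a whole restores the original characters.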

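This toString method was actually written to convert so-called single-byte charsets, that is, single-byte characters. It is used instead of a plain new String in order to make the conversion faster. So why is it now being called to convert multi-byte characters? Tracing back outward, the problem is in extractStringFromNativeColumn in com.mysql.jdbc.ResultSet.java: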

/**
 * @param columnIndex
 * @param stringVal
 * @param mysqlType
 * @return
 * @throws SQLException
 */
private String extractStringFromNativeColumn(int columnIndex, int mysqlType) throws SQLException {
  if (this.thisRow[columnIndex - 1] instanceof String) {
      return (String) this.thisRow[columnIndex - 1];
  }

  String stringVal = null;

  if ((this.connection != null) && this.connection.getUseUnicode()) {
      try {
          String encoding = this.fields[columnIndex - 1].getCharacterSet();

          if (encoding == null) {
              stringVal = new String((byte[]) this.thisRow[columnIndex -
                      1]);
          } else {
              SingleByteCharsetConverter converter = this.connection.getCharsetConverter(encoding);

              if (converter != null) {
                  stringVal = converter.toString((byte[]) this.thisRow[columnIndex -
                          1]);
              } else {
                  stringVal = new String((byte[]) this.thisRow[columnIndex -
                          1], encoding);
              }
          }
      } catch (java.io.UnsupportedEncodingException E) {
          throw new SQLException(Messages.getString(
                  "ResultSet.Unsupported_character_encoding____138") //$NON-NLS-1$
               + this.connection.getEncoding() + "'.", "0S100");
      }
  } else {
      stringVal = StringUtils.toAsciiString((byte[]) this.thisRow[columnIndex -
              1]);
  }

  // Cache this conversion if the type is a MySQL string type
  if ((mysqlType == MysqlDefs.FIELD_TYPE_STRING) ||
          (mysqlType == MysqlDefs.FIELD_TYPE_VAR_STRING)) {
      this.thisRow[columnIndex - 1] = stringVal;
  }

  return stringVal;
}

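This method obtains the encoding from fields. And fields is populated in the MysqlIO class, which parses the character set code out of the data returned by the database; what comes back here is the database's default character set. So if you specified gbk as the character set when creating the database or table (CREATE DATABASE dbname DEFAULT CHARSET=GBK;), congratulations: the data you get back needs no further re-encoding.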

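But I did not do that when the tables were created (and I can hardly be blamed: bugzilla's checksetup.pl created the database by itself!), so fields now returns not the GBK we hoped for but mysql's default setting, ISO8859-1. ResultSet therefore decodes our GBK-encoded data as ISO8859-1, which is why calling getBytes("ISO8859-1") on the result of getString and then constructing a new String brings the Chinese characters back.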
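The getBytes("ISO8859-1") workaround can be reproduced without a database. The sketch below is only an illustration under assumptions: the class and method names are made up for this example, and it assumes the stored bytes are GBK and that the JVM provides the GBK charset. It works because ISO8859-1 maps every byte value to exactly one char and back, so the wrong decoding loses nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Imitates what happens when GBK bytes are decoded as ISO8859-1.
    static String misdecode(byte[] gbkBytes) {
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    // The workaround: recover the raw bytes, then decode them with the real charset.
    static String repair(String garbled, Charset actual) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JVM
        String original = "\u6c49\u5b57";     // two Chinese characters

        String garbled = misdecode(original.getBytes(gbk));
        String repaired = repair(garbled, gbk);

        System.out.println("garbled length:  " + garbled.length());
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

The same round trip would not be safe with a lossy intermediate charset; it relies on ISO8859-1 covering all 256 byte values.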

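In fact, once we have specified an encoding for jdbc, jdbc ought to understand that we no longer intend to use the database's default encoding, so ResultSet should ignore it; otherwise what is the point of the encoding we set? Yet mysql chose to ignore our choice and use the database's encoding. The fix is simple: get rid of all of mysql's too-clever code for deciding the encoding: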

/**
 * @param columnIndex
 * @param stringVal
 * @param mysqlType
 * @return
 * @throws SQLException
 */
private String extractStringFromNativeColumn(int columnIndex, int mysqlType) throws SQLException {
  if (this.thisRow[columnIndex - 1] instanceof String) {
      return (String) this.thisRow[columnIndex - 1];
  }

  String stringVal = null;

  if ((this.connection != null) && this.connection.getUseUnicode()) {
      try {
//          String encoding = this.fields[columnIndex - 1].getCharacterSet();
          String encoding = null;
          if (encoding == null) {
              stringVal = new String((byte[]) this.thisRow[columnIndex -
                      1]);
          } else {
              SingleByteCharsetConverter converter = this.connection.getCharsetConverter(encoding);

              if (converter != null) {
                  stringVal = converter.toString((byte[]) this.thisRow[columnIndex -
                          1]);
              } else {
                  stringVal = new String((byte[]) this.thisRow[columnIndex -
                          1], encoding);
              }
          }
      } catch (java.io.UnsupportedEncodingException E) {
          throw new SQLException(Messages.getString(
                  "ResultSet.Unsupported_character_encoding____138") //$NON-NLS-1$
               + this.connection.getEncoding() + "'.", "0S100");
      }
  } else {
      stringVal = StringUtils.toAsciiString((byte[]) this.thisRow[columnIndex -
              1]);
  }

  // Cache this conversion if the type is a MySQL string type
  if ((mysqlType == MysqlDefs.FIELD_TYPE_STRING) ||
          (mysqlType == MysqlDefs.FIELD_TYPE_VAR_STRING)) {
      this.thisRow[columnIndex - 1] = stringVal;
  }

  return stringVal;
}

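Now the whole world is quiet. Whatever encoding the original table uses, the data is handled with the default, bypassing the trouble-prone fast path for ISO8859-1. The toString method above can be changed back as well, though it makes no difference either way, because it no longer gets a chance to run.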

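But my doubts have not been entirely dispelled. The table is defined with ISO8859-1 encoding, so why does the returned data turn out to be GBK-encoded? Moreover, this encoding does not change with the setting in my jdbc url, so what does mysql use to decide the encoding of the data it returns?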

The above only examined the encoding problem in ResultSet.getString. Submitting data has a similar encoding problem, but its cause is more complicated. I found that this gives the correct result:

pstmt.setBytes(1,"我們都是祖國的花朵".getBytes());

while this, surprisingly, is wrong:

pstmt.setString(1,"我們都是祖國的花朵");

After some effort I set a breakpoint inside MysqlIO's send(Buffer packet, int packetLen) method:

                if (!this.useNewIo) {
                    this.mysqlOutput.write(packetToSend.getByteBuffer(), 0,
                        packetLen);
                    this.mysqlOutput.flush();
                } else {...

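The string is still correctly encoded in packetToSend.getByteBuffer(), but by the time it lands in the database it has all become "???????". In other words, the database re-encoded this group of bytes on receipt, and re-encoded them wrongly. Comparing the byte arrays sent by the two approaches, the differences are small: essentially only bytes 0, 4 and 16 change. It looks as if bytes 15 and 16 hold an int representing the data type, and my guess is that this marker is what makes the mysql server reprocess the data it receives. But the source does not comment this logic adequately (reading the jdk's own source is more comfortable), and it left me baffled, so I'll leave it there.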
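As a side note on where question marks like these typically come from: converting text into a charset that has no representation for its characters replaces each of them with "?". The sketch below shows this in plain Java with ISO8859-1 as the target. It is only an illustration of a lossy conversion, with a made-up class name; whether the mysql server does exactly this with the packet is an assumption that the analysis above does not confirm.

```java
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {
    // Encodes text to ISO8859-1 and decodes it again; characters that
    // ISO8859-1 cannot represent are replaced during encoding.
    static String reencodeAsLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u6211\u4eec"; // two Chinese characters
        System.out.println(reencodeAsLatin1(chinese));
        System.out.println(reencodeAsLatin1("abc")); // plain ASCII survives unchanged
    }
}
```

Unlike the ISO8859-1 decoding on the read path, this direction is not reversible: once the characters have been replaced, the original bytes are gone.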

posted on 2005-06-03 09:08 emu

