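First, download two packages: xpdf-3.00pl3-win32.zip and xpdf-chinese-simplified.tar.gz.
Configuration:
1. Unzip xpdf-3.00pl3-win32.zip and rename the extracted folder to xpdf.
2. Edit the xpdfrc file.
(1) Append the following to the end of the file: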
#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 C:/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN C:/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN C:/xpdf/chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK C:/xpdf/chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 C:/xpdf/chinese-simplified/CMap
toUnicodeDir C:/xpdf/chinese-simplified/CMap
#displayCIDFontTT Adobe-GB1 /usr/./gkai00mp.ttf
#----- end Chinese Simplified support package
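Every path in the block above has to point at the folder where xpdf-chinese-simplified.tar.gz was unpacked, and a single mistyped entry is easy to miss. One way to keep them consistent is to generate the lines from one base folder. The following is only a sketch: the class name XpdfrcChineseSimplified and the method lines are made up for illustration, and it assumes the whole package sits under one folder such as C:/xpdf/chinese-simplified.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of xpdf): builds the xpdfrc lines for the package. */
final class XpdfrcChineseSimplified {

    /** baseDir is the folder holding the unpacked package (an assumption of this sketch). */
    static List<String> lines(String baseDir) {
        List<String> out = new ArrayList<>();
        out.add("cidToUnicode Adobe-GB1 " + baseDir + "/Adobe-GB1.cidToUnicode");
        out.add("unicodeMap ISO-2022-CN " + baseDir + "/ISO-2022-CN.unicodeMap");
        out.add("unicodeMap EUC-CN " + baseDir + "/EUC-CN.unicodeMap");
        out.add("unicodeMap GBK " + baseDir + "/GBK.unicodeMap");
        out.add("cMapDir Adobe-GB1 " + baseDir + "/CMap");
        out.add("toUnicodeDir " + baseDir + "/CMap");
        return out;
    }

    public static void main(String[] args) {
        // Default matches the folder layout used in this post.
        String baseDir = args.length > 0 ? args[0] : "C:/xpdf/chinese-simplified";
        lines(baseDir).forEach(System.out::println);
    }
}
```

Running it prints six lines that can be pasted between the begin and end markers in xpdfrc.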


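(2) In addition, the stock configuration file does not contain a "textPageBreaks" setting. To keep the page break character out of the extracted text, add the following under the "text output control" section of xpdfrc: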
# If set to "yes", text extraction will insert page
# breaks (form feed characters) between pages. This
# defaults to "yes".
textPageBreaks no
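Setting textPageBreaks to no means that no page break character is inserted between two pages of the PDF document.
The reason is that this character sometimes causes trouble when the text is parsed as XML with SAX: the form feed (U+000C) is not a legal character in XML 1.0.
Reading a PDF file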
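To see the problem in isolation, here is a small sketch using the SAX parser that ships with the JDK. The class FormFeedSaxDemo and its methods are hypothetical names for this illustration. It wraps a piece of text in a <doc> element and reports whether the parser accepts it, and it also shows stripping form feeds in code as a fallback for text that was extracted without textPageBreaks no.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class, not part of xpdf or of the original post. */
final class FormFeedSaxDemo {

    /** Returns true if the JDK SAX parser accepts the text wrapped in a <doc> element. */
    static boolean parses(String text) {
        String xml = "<doc>" + text + "</doc>";
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Raised for malformed input, including illegal characters.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    /** Removes form feed characters from already extracted text. */
    static String stripFormFeeds(String text) {
        return text.replace("\f", "");
    }

    public static void main(String[] args) {
        String extracted = "page one\fpage two";
        System.out.println(parses(extracted));
        System.out.println(parses(stripFormFeeds(extracted)));
    }
}
```

The first call should be rejected and the second accepted. Turning the page breaks off in xpdfrc avoids the extra cleanup step altogether.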
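import java.io.BufferedInputStream;
// Path to the pdftotext executable from the xpdf package.
String PATH_TO_XPDF = "C:\\xpdf\\pdftotext.exe";
// filePath is the PDF to read; the trailing "-" makes pdftotext write the text to stdout.
String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filePath, "-" };
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream iss = new BufferedInputStream(p.getInputStream());
// ReadFileUtil and comm are the author's own helpers (not shown here).
str = new ReadFileUtil(comm).readPDF(iss);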
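The snippet above relies on a ReadFileUtil class that is not shown. As a rough stand-in, the sketch below reads the pdftotext output into a String and decodes it as UTF-8 to match the "-enc UTF-8" option. PdfTextReader, readAll, and extract are hypothetical names chosen for this example, not the author's helper, and ProcessBuilder is used here in place of Runtime.exec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the ReadFileUtil helper, which is not shown in the post. */
final class PdfTextReader {

    /** Reads the whole stream and decodes it as UTF-8, matching "-enc UTF-8". */
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs pdftotext on pdfPath and returns the extracted text. */
    static String extract(String pdftotextExe, String pdfPath)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(pdftotextExe, "-enc", "UTF-8", "-q", pdfPath, "-")
                // Send stderr to the parent's stderr so the child cannot block on a full pipe.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        try (InputStream in = p.getInputStream()) {
            String text = readAll(in);
            p.waitFor(); // wait for pdftotext to exit after its output is consumed
            return text;
        }
    }
}
```

A call such as PdfTextReader.extract("C:\\xpdf\\pdftotext.exe", filePath) would then replace the last three lines of the snippet above. Reading the output fully before waitFor matters, because a process whose output pipe fills up will otherwise stall.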

