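First, download two packages: xpdf-3.00pl3-win32.zip and xpdf-chinese-simplified.tar.gz.
Configuration:
1. Unzip xpdf-3.00pl3-win32.zip and rename the extracted folder to xpdf. The paths used below assume it ends up at C:\xpdf, with the contents of xpdf-chinese-simplified.tar.gz extracted to C:\xpdf\chinese-simplified.
2. Edit the xpdfrc file.
(1) Append the following to the end of the file: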
#----- begin Chinese Simplified support package (2004-jul-27)
cidToUnicode Adobe-GB1 C:/xpdf/chinese-simplified/Adobe-GB1.cidToUnicode
unicodeMap ISO-2022-CN C:/xpdf/chinese-simplified/ISO-2022-CN.unicodeMap
unicodeMap EUC-CN C:/xpdf/chinese-simplified/EUC-CN.unicodeMap
unicodeMap GBK C:/xpdf/chinese-simplified/GBK.unicodeMap
cMapDir Adobe-GB1 C:/xpdf/chinese-simplified/CMap
toUnicodeDir C:/xpdf/chinese-simplified/CMap
#displayCIDFontTT Adobe-GB1 /usr/./gkai00mp.ttf
#----- end Chinese Simplified support package

(2) In addition, the stock configuration file does not contain a "textPageBreaks" setting. To avoid the page-break character, add the following lines under the "text output control" section of xpdfrc:
# If set to "yes", text extraction will insert page
# breaks (form feed characters) between pages. This
# defaults to "yes".
textPageBreaks no
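Setting textPageBreaks to no means that no page-break character is inserted between two pages of the PDF document.
This matters because the page break is a form feed character (U+000C), which is not a legal character in XML 1.0, so it sometimes causes trouble when the extracted text is parsed as XML with SAX.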
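If the configuration file cannot be changed, the same problem can also be handled on the Java side. The following is only a minimal sketch under that assumption: the class name XmlTextCleaner and both method names are made up for illustration and are not part of xpdf or of the code in this note. It removes every character that the XML 1.0 Char production forbids, which includes the form feed.

```java
final class XmlTextCleaner {
    private XmlTextCleaner() {}

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the text with every character that XML 1.0 forbids removed.
    // Hypothetical helper: an alternative to setting textPageBreaks to no.
    static String stripIllegalXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }
}
```

For example, stripIllegalXmlChars("page1\fpage2") returns "page1page2", while tabs and newlines are kept.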
Reading a PDF file
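import java.io.BufferedInputStream;
// Path to the pdftotext executable inside the renamed xpdf folder.
String PATH_TO_XPDF = "C:\\xpdf\\pdftotext.exe";
// filePath is the PDF to read. "-enc UTF-8" sets the output encoding, "-q" suppresses messages, and the final "-" writes the text to stdout.
String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filePath, "-" };
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream iss = new BufferedInputStream(p.getInputStream());
// ReadFileUtil is a helper class from the project (not shown here); comm is its constructor argument.
str = new ReadFileUtil(comm).readPDF(iss);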
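Since ReadFileUtil is not shown, here is a self-contained sketch of the same idea. It is an assumption about what such a helper might do, not the original implementation: the class PdfTextExtractor and its methods are hypothetical names. It builds the same pdftotext command line, reads the process output completely, and decodes it as UTF-8 to match the "-enc UTF-8" option.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PdfTextExtractor {
    private PdfTextExtractor() {}

    // Builds the pdftotext command line: UTF-8 output, quiet mode, text written to stdout ("-").
    static String[] buildCommand(String pdftotextPath, String pdfPath) {
        return new String[] { pdftotextPath, "-enc", "UTF-8", "-q", pdfPath, "-" };
    }

    // Reads the whole stream, then decodes it as UTF-8 in one step so that
    // multi-byte characters are never split across buffer boundaries.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Runs pdftotext on one file and returns the extracted text.
    static String extractText(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Send stderr to the console so an unread error pipe cannot fill up and block the child.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();
        String text;
        try (InputStream in = p.getInputStream()) {
            text = readAll(in);
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("pdftotext exited with code " + exit);
        }
        return text;
    }
}
```

A call would look like PdfTextExtractor.extractText("C:\\xpdf\\pdftotext.exe", filePath). Checking the exit code is a design choice of this sketch: with "-q" the tool prints no error messages, so a failed conversion would otherwise just look like an empty string.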
