The Data Import Handler Framework

Solr includes a very popular contrib module for importing data known as the DataImportHandler (DIH for short). It's a data processing pipeline built specifically for Solr. Here's a summary of notable capabilities:

•    Imports data from databases through JDBC (Java Database Connectivity)
    ° Supports importing only changed records, assuming a last-updated date
•    Imports data from a URL (HTTP GET)
•    Imports data from files (that is, it crawls files)
•    Imports e-mail from an IMAP server, including attachments
•    Supports combining data from different sources
•    Extracts text and metadata from rich document formats
•    Applies XSLT transformations and XPath extraction on XML data
•    Includes a diagnostic/development tool

The DIH is not considered a core part of Solr, even though it comes with the Solr download, so you must add its Java JAR files to your Solr setup to use it. If you don't, you'll eventually see a ClassNotFoundException error. The DIH's JAR files are located in Solr's dist directory: apache-solr-dataimporthandler-3.4.0.jar and apache-solr-dataimporthandler-extras-3.4.0.jar. The easiest way to add JAR files to a Solr configuration is to copy them to the <solr-home>/lib directory; you may need to create it. Another method is to reference them from solrconfig.xml via <lib/> tags; see Solr's example configuration for examples of that.

You will most likely need some additional JAR files as well. If you'll be communicating with a database, you'll need a JDBC driver for it. If you'll be extracting text from various document formats, you'll need to add the JARs in /contrib/extraction/lib. Finally, if you'll be indexing e-mail, you'll need to add the JARs in /contrib/dataimporthandler/lib.
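
As a rough sketch of the <lib/> approach, the entries below would load the DIH JARs and the optional extras straight from the Solr distribution. The dir values are assumptions (they are resolved relative to the core's instance directory and depend on where Solr is unpacked), and the regex patterns are only illustrative, so adjust both to your layout.

```xml
<!-- solrconfig.xml: illustrative paths, relative to the core's instance directory -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<!-- Only needed when extracting text from rich document formats -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
<!-- Only needed when indexing e-mail -->
<lib dir="../../contrib/dataimporthandler/lib" regex=".*\.jar" />
```

A JDBC driver JAR can be picked up the same way, or simply copied into <solr-home>/lib.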

The DIH needs to be registered with Solr in solrconfig.xml like so:

<requestHandler name="/dih_artists_jdbc"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">mb-dih-artists-jdbc.xml</str>
    </lst>
</requestHandler>

The referenced file, mb-dih-artists-jdbc.xml, is located in <solr-home>/conf and specifies the details of a data import process. We'll get to that file in a bit.

DIHQuickStart

http://wiki.apache.org/solr/DIHQuickStart

Index a DB table directly into Solr

Step 1: Edit your solrconfig.xml to add the request handler

<requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

Step 2: Create a data-config.xml file as follows and save it to the conf directory
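
A minimal data-config.xml for this step might look like the sketch below. The MySQL driver class, JDBC URL, credentials, and the table name mytable are assumptions for illustration only; the selected columns are chosen to match the 'id', 'name', and 'desc' fields that the schema is expected to have.

```xml
<dataConfig>
  <!-- Hypothetical connection details: replace driver, url, user, and password -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="db_username"
              password="db_password"/>
  <document>
    <!-- One Solr document is created per row returned by this query.
         desc is a reserved word in MySQL, hence the backticks. -->
    <entity name="item"
            query="select id, name, `desc` from mytable"/>
  </document>
</dataConfig>
```

Because the column names equal the Solr field names here, no field elements are needed.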

Step 3: Ensure that your Solr schema (schema.xml) has the fields 'id', 'name', and 'desc'. Change the appropriate details in data-config.xml.

Step 4: Drop your JDBC driver JAR file into the <solr-home>/lib directory.

Step 5: Run the command

http://solr-host:port/solr/dataimport?command=full-import

Keep in mind that every time a full-import is executed, the existing index is cleared first. If you do not want that to happen, add clean=false. For example:

http://solr-host:port/solr/dataimport?command=full-import&clean=false

Index the fields in different names

Step 1: Change the data-config as follows:
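
One way the changed data-config could look is sketched below: each field element maps a result-set column to a differently named Solr field. The connection details and the table name mytable are again assumptions; the target names are the 'solr_id', 'solr_name', and 'solr_desc' fields used in this walkthrough.

```xml
<dataConfig>
  <!-- Hypothetical connection details -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="db_username"
              password="db_password"/>
  <document>
    <entity name="item" query="select id, name, `desc` from mytable">
      <!-- column = name in the SQL result set, name = Solr field name -->
      <field column="id" name="solr_id"/>
      <field column="name" name="solr_name"/>
      <field column="desc" name="solr_desc"/>
    </entity>
  </document>
</dataConfig>
```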

Step 2: This time the fields will be written to the Solr fields 'solr_id', 'solr_name', and 'solr_desc'. You must have these fields in schema.xml.

Step 3: Run the command http://solr-host:port/solr/dataimport?command=full-import

Index data from multiple tables into Solr

Step 1: Change the data-config as follows:
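
A sketch of a two-table configuration follows. The inner entity runs once for every row of the outer entity and uses the outer row's id through the ${outer.id} variable. The table name another_table, its details column, and the connection details are assumptions for illustration; solr_details is the field required in the schema for this example.

```xml
<dataConfig>
  <!-- Hypothetical connection details -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname"
              user="db_username"
              password="db_password"/>
  <document>
    <entity name="outer" query="select id, name, `desc` from mytable">
      <field column="id" name="solr_id"/>
      <field column="name" name="solr_name"/>
      <field column="desc" name="solr_desc"/>
      <!-- Child entity: looked up per outer row -->
      <entity name="inner"
              query="select details from another_table where id='${outer.id}'">
        <field column="details" name="solr_details"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```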

Step 2: The schema.xml should have the solr_details field

Step 3: Run the full-import command

Configuring the data source

Add the dataSource tag directly under dataConfig, so that it becomes a child element of dataConfig.

driver (required): the JDBC driver class name
url (required): the JDBC connection URL
user: the user name
password: the password
batchSize: the batch size used in the JDBC connection
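
Putting these attributes together, a dataSource element might look like the following sketch. Every value shown (driver class, URL, credentials, and the batch size of 500) is a placeholder assumption, not a recommendation.

```xml
<!-- Placed directly under <dataConfig>; all values are illustrative -->
<dataSource driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            user="db_username"
            password="db_password"
            batchSize="500"/>
```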

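The data source can also be configured in solrconfig.xml.
The type attribute specifies the implementation type. It is optional; the default implementation is JdbcDataSource.
The name attribute is the name of the data source. When there are multiple data sources, use name to tell them apart.
All other attributes are arbitrary and depend on the DataSource implementation you use.
You can also implement your own DataSource.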

Multiple data sources

Usage:

<entity name="one" dataSource="ds-1" ...>
    ...
</entity>

<entity name="two" dataSource="ds-2" ...>
    ...
</entity>
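
For the dataSource attribute on an entity to resolve, each data source needs a matching name. A minimal sketch, assuming two MySQL databases on the hypothetical hosts db1-host and db2-host and the made-up tables table_one and table_two:

```xml
<dataConfig>
  <!-- Two named data sources; hosts and credentials are placeholders -->
  <dataSource name="ds-1" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db1-host/dbname"
              user="db_username" password="db_password"/>
  <dataSource name="ds-2" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://db2-host/dbname"
              user="db_username" password="db_password"/>
  <document>
    <!-- Each entity picks its data source by name -->
    <entity name="one" dataSource="ds-1"
            query="select id, name from table_one"/>
    <entity name="two" dataSource="ds-2"
            query="select id, name from table_two"/>
  </document>
</dataConfig>
```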

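Configuring data-config.xml

A Solr document can be thought of as a schema whose field values may come from multiple tables.

data-config.xml starts by defining a document element inside the dataConfig root. A document element represents one kind of document. It contains one or more root entities. A root entity can contain child entities, which in turn can contain other entities. An entity is a table or view in a relational database. Each entity can contain multiple fields, and each field corresponds to a column in the result set returned by the entity's query. By default, the field name is the same as the column name. If a column name differs from the Solr field name, the name attribute must be given. Other required attributes are configured in Solr's schema.xml.

To fetch the desired data from the database, the design supports standard SQL, so users can write any SQL statement they want. The root entity is the central table, and its columns can be used to join the other tables to it.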

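Structure of dataconfig

The structure of dataconfig is not rigid. The attributes of the entity and field elements are arbitrary and depend mainly on the processor and transformer in use.

The default attributes of an entity are:

name (required): a unique name that identifies the entity
processor: required only when the data source is not an RDBMS. The default is SqlEntityProcessor
transformer: the transformers to apply to this entity. See the transformer section for details.
pk: the primary key of the entity. It is optional, but required when using delta-import. It has no necessary relation to the uniqueKey defined in schema.xml, although the two can be the same.
rootEntity: by default, the entities directly under the document element are root entities. If rootEntity is set to false on an entity, the entity directly beneath it is treated as a root entity instead. For each row returned from the database for a root entity, Solr creates one document.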

The attributes of SqlEntityProcessor are:

query (required): the SQL statement
deltaQuery: used only in delta-import
parentDeltaQuery: used only in delta-import
deletedPkQuery: used only in delta-import
deltaImportQuery: (used only in delta-import). If present, it is used instead of query during the import phase of a delta-import. The ${dataimporter.delta.} namespace can be used in this query.
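
To illustrate how these attributes fit together, here is a sketch for a single entity. It assumes a table named item with an id primary key and a last_modified timestamp column. The deltaQuery finds the keys of changed rows, and the deltaImportQuery then loads each changed row using the key exposed under the ${dataimporter.delta.} namespace. Note that the name after delta. generally has to match the column name returned by deltaQuery, including its case.

```xml
<entity name="item" pk="id"
        query="select * from item"
        deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"
        deltaImportQuery="select * from item where id='${dataimporter.delta.id}'"/>
```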

`Commands`

Open the data import page at http://192.168.0.248:9080/solr/admin/dataimport.jsp. Its buttons invoke the different import commands.

full-import: a full import can be run by visiting the URL http://192.168.0.248:9080/solr/dataimport?command=full-import
- The operation starts in a new thread, and the status attribute in the response shows busy.
- How long the operation takes depends on the size of the data set.
- When the operation finishes, it records its start time in conf/dataimport.properties.
- This stored timestamp is used when a delta-import is executed.
- Solr queries are not blocked during a full-import.
- It also accepts the following parameters:
  - clean (default 'true'): whether to delete the existing index before indexing starts.
  - commit (default 'true'): whether to commit after the operation.
  - optimize (default 'true'): whether to optimize after the operation.
  - debug (default false): runs in debug mode, which is used by the interactive development mode.
delta-import: for incremental imports and change detection, use http://192.168.0.248:9080/solr/dataimport?command=delta-import. It supports the same clean, commit, optimize, and debug parameters.
status: to check the status of a command, visit http://192.168.0.248:9080/solr/dataimport. It reports detailed statistics on documents created and deleted, queries run, rows fetched, and so on.
reload-config: if data-config.xml has changed and you want to reload the configuration without restarting Solr, run http://192.168.0.248:9080/solr/dataimport?command=reload-config
abort: you can abort a running operation by visiting http://192.168.0.248:9080/solr/dataimport?command=abort
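
If you would rather not pass these parameters on every URL, one option is to set them as request handler defaults. This is a sketch under the assumption that the DIH picks up handler defaults like any other request parameter; verify the behavior on your Solr version before relying on it. Parameters given in the URL should still override these values.

```xml
<requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
        <!-- Assumed defaults: keep the existing index and skip the optimize step -->
        <str name="clean">false</str>
        <str name="optimize">false</str>
    </lst>
</requestHandler>
```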

Full Import example

The data-config.xml is as follows:

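<dataConfig>
    <document>
        <entity name="item" query="select * from item">
            <entity name="feature" query="select description from feature where item_id='${item.ID}'">
                <field column="description" name="features"/>
            </entity>
            <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
                    <field column="description" name="cat"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>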

Here, the root entity is a table named "item" whose primary key is id. We read the data with the statement "select * from item". Each item can have multiple features. Look at the query of the feature entity below:

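<entity name="feature" query="select description from feature where item_id='${item.ID}'">
    <field column="description" name="features"/>
</entity>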

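The foreign key item_id in the feature table is joined with the primary key of item to fetch the matching rows for each item. Similarly, we join item and category, which have a many-to-many relationship. Note how the join table and standard SQL are used for this: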

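<entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
    <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
        <field column="description" name="cat"/>
    </entity>
</entity>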

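A shorter data-config

In the example above there are several mappings from columns to Solr fields. If a column name is the same as the Solr field name, you can skip configuring field elements in the entity altogether. You still need to add the field entries if you use transformers.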

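<dataConfig>
    <document>
        <entity name="item" query="select * from item">
            <entity name="feature" query="select description as features from feature where item_id='${item.ID}'"/>
            <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"/>
            </entity>
        </entity>
    </document>
</dataConfig>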

Visit http://localhost:8983/solr/dataimport?command=full-import to run a full-import.

Using the delta-import command

You can run a delta-import by visiting the URL http://localhost:8983/solr/dataimport?command=delta-import. The operation starts in a new thread, and the status attribute in the response shows busy. How long the operation takes depends on the size of your data set. At any time, you can check the status by visiting http://localhost:8983/solr/dataimport.

When a delta-import is executed, it reads the start time stored in conf/dataimport.properties. It uses that timestamp to run the delta queries and, once it has finished, updates the timestamp in conf/dataimport.properties.

Delta-Import example

We will use the same database as in the full-import example. Note that the database has been updated so that every table contains an additional timestamp column named last_modified. You may need to download the database again, since it was updated recently. We use this timestamp column to determine which rows have changed since the last indexing run.

Look at the following data-config.xml:

<dataConfig>
    <document>
        <entity name="item" pk="ID" query="select * from item"
                deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select description as features from feature where item_id='${item.ID}'">
            </entity>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" pk="ID"
                        query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

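Notice the deltaQuery attribute of the item entity: it contains an SQL statement that finds the recently updated rows. The variable ${dataimporter.last_index_time} is passed in by the DataImportHandler. It is the timestamp of the last time a full-import or delta-import ran. You can use this variable anywhere in the SQL in data-config.xml, and it is substituted during processing.

The deltaQuery in the example above only detects changes in item, not in the other tables. You can cover changes in all the tables in a single SQL statement, as follows: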

deltaQuery="select id from item where id in
                (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
            or id in
                (select item_id as id from item_category where category_id in
                    (select id as category_id from category where last_modified > '${dataimporter.last_index_time}')
                or last_modified > '${dataimporter.last_index_time}')
            or last_modified > '${dataimporter.last_index_time}'"

Writing a huge deltaQuery like the one above is not an enjoyable task, so there is another way to achieve the same goal:

<dataConfig>
    <document>
        <entity name="item" pk="ID" query="select * from item"
                deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
                    deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
                    parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
                <entity name="category" pk="ID"
                        query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'"
                        deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
                        parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
            </entity>
        </entity>
    </document>
</dataConfig>

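deltaQuery returns the primary keys of the entity's rows that have changed since the last index time.
parentDeltaQuery takes the rows of the current table that deltaQuery found to be changed and passes them up to the parent table. This is needed because when a row in a child table changes, the Solr document of its parent has to be regenerated.

A few points are worth noting:

For every row returned by an entity's query, the query of each child entity is executed once.
For every row returned by deltaQuery, parentDeltaQuery is executed.
Whenever a row in the root entity or a child entity changes, the Solr document containing that row is regenerated.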

Posted on 2012-05-30 14:33 by CONAN. Category: Solr
