基于Solr的HBase多條件查詢測(cè)試
背景:
某電信項(xiàng)目中采用HBase來存儲(chǔ)用戶終端明細(xì)數(shù)據(jù),供前臺(tái)頁面即時(shí)查詢。HBase無可置疑擁有其優(yōu)勢(shì),但其本身只對(duì)rowkey支持毫秒級(jí)的快速檢索,對(duì)于多字段的組合查詢卻無能為力。針對(duì)HBase的多條件查詢也有多種方案,但是這些方案要么太復(fù)雜,要么效率太低,本文只對(duì)基于Solr的HBase多條件查詢方案進(jìn)行測(cè)試和驗(yàn)證。
原理:
基于Solr的HBase多條件查詢?cè)砗芎?jiǎn)單,將HBase表中涉及條件過濾的字段和rowkey在Solr中建立索引,通過Solr的多條件查詢快速獲得符合過濾條件的rowkey值,拿到這些rowkey之后在HBASE中通過指定rowkey進(jìn)行查詢。
測(cè)試環(huán)境:
solr 4.0.0版本,使用其自帶的jetty服務(wù)端容器,單節(jié)點(diǎn);
hbase-0.94.2-cdh4.2.1,10臺(tái)Lunux服務(wù)器組成的HBase集群。
HBase中2512萬條數(shù)據(jù)172個(gè)字段;
Solr索引HBase中的100萬條數(shù)據(jù);
測(cè)試結(jié)果:
1、100萬條數(shù)據(jù)在Solr中對(duì)8個(gè)字段建立索引。在Solr中最多8個(gè)過濾條件獲取51316條數(shù)據(jù)的rowkey值,基本在57-80毫秒。根據(jù)Solr返回的rowkey值在HBase表中獲取所有51316條數(shù)據(jù)12個(gè)字段值,耗時(shí)基本在15秒;
2、數(shù)據(jù)量同上,過濾條件同上,采用Solr分頁查詢,每次獲取20條數(shù)據(jù),Solr獲得20個(gè)rowkey值耗時(shí)4-10毫秒,拿到Solr傳入的rowkey值在HBase中獲取對(duì)應(yīng)20條12個(gè)字段的數(shù)據(jù),耗時(shí)6毫秒。
以下列出測(cè)試環(huán)境的搭建、以及相關(guān)代碼實(shí)現(xiàn)過程。
一、Solr環(huán)境的搭建
因?yàn)槌踔灾皇菧y(cè)試Solr的使用,Solr的運(yùn)行環(huán)境也只是用了其自帶的jetty,而非大多人用的Tomcat;沒有搭建Solr集群,只是一個(gè)單一的Solr服務(wù)端,也沒有任何參數(shù)調(diào)優(yōu)。
1)在Apache網(wǎng)站上下載Solr 4:http://lucene.apache.org/solr/downloads.html,我們這里下載的是“apache-solr-4.0.0.tgz”;
2)在當(dāng)前目錄解壓Solr壓縮包:
-xvzf apache-solr-..tgz
3)修改Solr的配置文件schema.xml,添加我們需要索引的多個(gè)字段(配置文件位于“/opt/apache-solr-4.0.0/example/solr/collection1/conf/”)
<field name="rowkey" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="time" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="tebid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="tetid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="puid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="mgcvid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="mtcvid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="smaid" type="string" indexed="true" stored="true" required="false" multiValued="false" /> <field name="mtlkid" type="string" indexed="true" stored="true" required="false" multiValued="false" />
另外關(guān)鍵的一點(diǎn)是修改原有的uniqueKey,本文設(shè)置HBase表的rowkey字段為Solr索引的uniqueKey:
<uniqueKey>rowkey</uniqueKey>
type 參數(shù)代表索引數(shù)據(jù)類型,我這里將type全部設(shè)置為string是為了避免異常類型的數(shù)據(jù)導(dǎo)致索引建立失敗,正常情況下應(yīng)該根據(jù)實(shí)際字段類型設(shè)置,比如整型字段設(shè)置為int,更加有利于索引的建立和檢索;
indexed 參數(shù)代表此字段是否建立索引,根據(jù)實(shí)際情況設(shè)置,建議不參與條件過濾的字段一律設(shè)置為false;
stored 參數(shù)代表是否存儲(chǔ)此字段的值,建議根據(jù)實(shí)際需求只將需要獲取值的字段設(shè)置為true,以免浪費(fèi)存儲(chǔ),比如我們的場(chǎng)景只需要獲取rowkey,那么只需把rowkey字段設(shè)置為true即可,其他字段全部設(shè)置flase;
required 參數(shù)代表此字段是否必需,如果數(shù)據(jù)源某個(gè)字段可能存在空值,那么此屬性必需設(shè)置為false,不然Solr會(huì)拋出異常;
multiValued 參數(shù)代表此字段是否允許有多個(gè)值,通常都設(shè)置為false,根據(jù)實(shí)際需求可設(shè)置為true。
4)我們使用Solr自帶的example來作為運(yùn)行環(huán)境,定位到example目錄,啟動(dòng)服務(wù)監(jiān)聽:
cd /opt/apache-solr-4.0.0/example java -jar ./start.jar
如果啟動(dòng)成功,可以通過瀏覽器打開此頁面:http://192.168.1.10:8983/solr/
二、讀取HBase源表的數(shù)據(jù),在Solr中建立索引
一種方案是通過HBase的普通API獲取數(shù)據(jù)建立索引,此方案的缺點(diǎn)是效率較低每秒只能處理100多條數(shù)據(jù)(或許可以通過多線程提高效率):
package com.ultrapower.hbase.solrhbase;import java.io.IOException;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.hbase.HBaseConfiguration;import org.apache.hadoop.hbase.KeyValue;import org.apache.hadoop.hbase.client.HTable;import org.apache.hadoop.hbase.client.Result;import org.apache.hadoop.hbase.client.ResultScanner;import org.apache.hadoop.hbase.client.Scan;import org.apache.hadoop.hbase.util.Bytes;import org.apache.solr.client.solrj.SolrServerException;import org.apache.solr.client.solrj.impl.HttpSolrServer;import org.apache.solr.common.SolrInputDocument;public class SolrIndexer { /** * @param args * @throws IOException * @throws SolrServerException */ public static void main(String[] args) throws IOException, SolrServerException { final Configuration conf; HttpSolrServer solrServer = new HttpSolrServer( "http://192.168.1.10:8983/solr"); // 因?yàn)榉?wù)端是用的Solr自帶的jetty容器,默認(rèn)端口號(hào)是8983 conf = HBaseConfiguration.create(); HTable table = new HTable(conf, "hb_app_xxxxxx"); // 這里指定HBase表名稱 Scan scan = new Scan(); scan.addFamily(Bytes.toBytes("d")); // 這里指定HBase表的列族 scan.setCaching(500); scan.setCacheBlocks(false); ResultScanner ss = table.getScanner(scan); System.out.println("start ..."); int i = 0; try { for (Result r : ss) { SolrInputDocument solrDoc = new SolrInputDocument(); solrDoc.addField("rowkey", new String(r.getRow())); for (KeyValue kv : r.raw()) { String fieldName = new String(kv.getQualifier()); String fieldValue = new String(kv.getValue()); if (fieldName.equalsIgnoreCase("time") || fieldName.equalsIgnoreCase("tebid") || fieldName.equalsIgnoreCase("tetid") || fieldName.equalsIgnoreCase("puid") || fieldName.equalsIgnoreCase("mgcvid") || fieldName.equalsIgnoreCase("mtcvid") || fieldName.equalsIgnoreCase("smaid") || fieldName.equalsIgnoreCase("mtlkid")) { solrDoc.addField(fieldName, fieldValue); } } solrServer.add(solrDoc); solrServer.commit(true, true, true); i = i + 1; System.out.println("已經(jīng)成功處理 " + i + " 條數(shù)據(jù)"); } ss.close(); table.close(); System.out.println("done !"); } catch (IOException e) { } finally { ss.close(); table.close(); System.out.println("erro !"); } } }
另外一種方案是用到HBase的Mapreduce框架,分布式并行執(zhí)行效率特別高,處理1000萬條數(shù)據(jù)僅需5分鐘,但是這種高并發(fā)需要對(duì)Solr服務(wù)器進(jìn)行配置調(diào)優(yōu),不然會(huì)拋出服務(wù)器無法響應(yīng)的異常:
Error: org.apache.solr.common.SolrException: Server at http://192.168.1.10:8983/solr returned non ok status:503, message:Service Unavailable
MapReduce入口程序:
package com.ultrapower.hbase.solrhbase;import java.io.IOException;import java.net.URISyntaxException;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.hbase.HBaseConfiguration;import org.apache.hadoop.hbase.client.Scan;import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;import org.apache.hadoop.hbase.util.Bytes;import org.apache.hadoop.mapreduce.Job;import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;public class SolrHBaseIndexer { private static void usage() { System.err.println("輸入?yún)?shù): <配置文件路徑> <起始行> <結(jié)束行>"); System.exit(1); } private static Configuration conf; public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException { if (args.length == 0 || args.length > 3) { usage(); } createHBaseConfiguration(args[0]); ConfigProperties tutorialProperties = new ConfigProperties(args[0]); String tbName = tutorialProperties.getHBTbName(); String tbFamily = tutorialProperties.getHBFamily(); Job job = new Job(conf, "SolrHBaseIndexer"); job.setJarByClass(SolrHBaseIndexer.class); Scan scan = new Scan(); if (args.length == 3) { scan.setStartRow(Bytes.toBytes(args[1])); scan.setStopRow(Bytes.toBytes(args[2])); } scan.addFamily(Bytes.toBytes(tbFamily)); scan.setCaching(500); // 設(shè)置緩存數(shù)據(jù)量來提高效率 scan.setCacheBlocks(false); // 創(chuàng)建Map任務(wù) TableMapReduceUtil.initTableMapperJob(tbName, scan, SolrHBaseIndexerMapper.class, null, null, job); // 不需要輸出 job.setOutputFormatClass(NullOutputFormat.class); // job.setNumReduceTasks(0); System.exit(job.waitForCompletion(true) ? 0 : 1); } /** * 從配置文件讀取并設(shè)置HBase配置信息 * * @param propsLocation * @return */ private static void createHBaseConfiguration(String propsLocation) { ConfigProperties tutorialProperties = new ConfigProperties( propsLocation); conf = HBaseConfiguration.create(); conf.set("hbase.zookeeper.quorum", tutorialProperties.getZKQuorum()); conf.set("hbase.zookeeper.property.clientPort", tutorialProperties.getZKPort()); conf.set("hbase.master", tutorialProperties.getHBMaster()); conf.set("hbase.rootdir", tutorialProperties.getHBrootDir()); conf.set("solr.server", tutorialProperties.getSolrServer()); } }
對(duì)應(yīng)的Mapper:
package com.ultrapower.hbase.solrhbase;import java.io.IOException;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.hbase.KeyValue;import org.apache.hadoop.hbase.client.Result;import org.apache.hadoop.hbase.io.ImmutableBytesWritable;import org.apache.hadoop.hbase.mapreduce.TableMapper;import org.apache.hadoop.io.Text;import org.apache.solr.client.solrj.SolrServerException;import org.apache.solr.client.solrj.impl.HttpSolrServer;import org.apache.solr.common.SolrInputDocument;public class SolrHBaseIndexerMapper extends TableMapper<Text, Text> { public void map(ImmutableBytesWritable key, Result hbaseResult, Context context) throws InterruptedException, IOException { Configuration conf = context.getConfiguration(); HttpSolrServer solrServer = new HttpSolrServer(conf.get("solr.server")); solrServer.setDefaultMaxConnectionsPerHost(100); solrServer.setMaxTotalConnections(1000); solrServer.setSoTimeout(20000); solrServer.setConnectionTimeout(20000); SolrInputDocument solrDoc = new SolrInputDocument(); try { solrDoc.addField("rowkey", new String(hbaseResult.getRow())); for (KeyValue rowQualifierAndValue : hbaseResult.list()) { String fieldName = new String( rowQualifierAndValue.getQualifier()); String fieldValue = new String(rowQualifierAndValue.getValue()); if (fieldName.equalsIgnoreCase("time") || fieldName.equalsIgnoreCase("tebid") || fieldName.equalsIgnoreCase("tetid") || fieldName.equalsIgnoreCase("puid") || fieldName.equalsIgnoreCase("mgcvid") || fieldName.equalsIgnoreCase("mtcvid") || fieldName.equalsIgnoreCase("smaid") || fieldName.equalsIgnoreCase("mtlkid")) { solrDoc.addField(fieldName, fieldValue); } } solrServer.add(solrDoc); solrServer.commit(true, true, true); } catch (SolrServerException e) { System.err.println("更新Solr索引異常:" + new String(hbaseResult.getRow())); } } }
讀取參數(shù)配置文件的輔助類:
package com.ultrapower.hbase.solrhbase;import java.io.File;import java.io.FileReader;import java.io.IOException;import java.util.Properties;public class ConfigProperties { private static Properties props; private String HBASE_ZOOKEEPER_QUORUM; private String HBASE_ZOOKEEPER_PROPERTY_CLIENT_PORT; private String HBASE_MASTER; private String HBASE_ROOTDIR; private String DFS_NAME_DIR; private String DFS_DATA_DIR; private String FS_DEFAULT_NAME; private String SOLR_SERVER; // Solr服務(wù)器地址 private String HBASE_TABLE_NAME; // 需要建立Solr索引的HBase表名稱 private String HBASE_TABLE_FAMILY; // HBase表的列族 public ConfigProperties(String propLocation) { props = new Properties(); try { File file = new File(propLocation); System.out.println("從以下位置加載配置文件: " + file.getAbsolutePath()); FileReader is = new FileReader(file); props.load(is); HBASE_ZOOKEEPER_QUORUM = props.getProperty("HBASE_ZOOKEEPER_QUORUM"); HBASE_ZOOKEEPER_PROPERTY_CLIENT_PORT = props.getProperty("HBASE_ZOOKEEPER_PROPERTY_CLIENT_PORT"); HBASE_MASTER = props.getProperty("HBASE_MASTER"); HBASE_ROOTDIR = props.getProperty("HBASE_ROOTDIR"); DFS_NAME_DIR = props.getProperty("DFS_NAME_DIR"); DFS_DATA_DIR = props.getProperty("DFS_DATA_DIR"); FS_DEFAULT_NAME = props.getProperty("FS_DEFAULT_NAME"); SOLR_SERVER = props.getProperty("SOLR_SERVER"); HBASE_TABLE_NAME = props.getProperty("HBASE_TABLE_NAME"); HBASE_TABLE_FAMILY = props.getProperty("HBASE_TABLE_FAMILY"); } catch (IOException e) { throw new RuntimeException("加載配置文件出錯(cuò)"); } catch (NullPointerException e) { throw new RuntimeException("文件不存在"); } } public String getZKQuorum() { return HBASE_ZOOKEEPER_QUORUM; } public String getZKPort() { return HBASE_ZOOKEEPER_PROPERTY_CLIENT_PORT; } public String getHBMaster() { return HBASE_MASTER; } public String getHBrootDir() { return HBASE_ROOTDIR; } public String getDFSnameDir() { return DFS_NAME_DIR; } public String getDFSdataDir() { return DFS_DATA_DIR; } public String getFSdefaultName() { return FS_DEFAULT_NAME; } public String getSolrServer() { return SOLR_SERVER; } public String getHBTbName() { return HBASE_TABLE_NAME; } public String getHBFamily() { return HBASE_TABLE_FAMILY; } }
參數(shù)配置文件“config.properties”:
HBASE_ZOOKEEPER_QUORUM=slave-1,slave-2,slave-3,slave-4,slave-5HBASE_ZOOKEEPER_PROPERTY_CLIENT_PORT=2181HBASE_MASTER=master-1:60000HBASE_ROOTDIR=hdfs:///hbaseDFS_NAME_DIR=/opt/data/dfs/name DFS_DATA_DIR=/opt/data/d0/dfs2/data FS_DEFAULT_NAME=hdfs://192.168.1.10:9000SOLR_SERVER=http://192.168.1.10:8983/solrHBASE_TABLE_NAME=hb_app_m_user_te HBASE_TABLE_FAMILY=d
三、結(jié)合Solr進(jìn)行HBase數(shù)據(jù)的多條件查詢:
可以通過web頁面操作Solr索引,
查詢:
http://192.168.1.10:8983/solr/select?(time:201307 AND tetid:1 AND mgcvid:101 AND smaid:101 AND puid:102)
刪除所有索引:
http://192.168.1.10:8983/solr/update/?stream.body=<delete><query>*:*</query></delete>&stream.contentType=text/xml;charset=utf-8&commit=true
通過java客戶端結(jié)合Solr查詢HBase數(shù)據(jù):
package com.ultrapower.hbase.solrhbase;import java.io.IOException;import java.nio.ByteBuffer;import java.util.ArrayList;import java.util.List;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.hbase.HBaseConfiguration;import org.apache.hadoop.hbase.client.Get;import org.apache.hadoop.hbase.client.HTable;import org.apache.hadoop.hbase.client.Result;import org.apache.hadoop.hbase.util.Bytes;import org.apache.solr.client.solrj.SolrQuery;import org.apache.solr.client.solrj.SolrServer;import org.apache.solr.client.solrj.SolrServerException;import org.apache.solr.client.solrj.impl.HttpSolrServer;import org.apache.solr.client.solrj.response.QueryResponse;import org.apache.solr.common.SolrDocument;import org.apache.solr.common.SolrDocumentList;public class QueryData { /** * @param args * @throws SolrServerException * @throws IOException */ public static void main(String[] args) throws SolrServerException, IOException { final Configuration conf; conf = HBaseConfiguration.create(); HTable table = new HTable(conf, "hb_app_m_user_te"); Get get = null; List<Get> list = new ArrayList<Get>(); String url = "http://192.168.1.10:8983/solr"; SolrServer server = new HttpSolrServer(url); SolrQuery query = new SolrQuery("time:201307 AND tetid:1 AND mgcvid:101 AND smaid:101 AND puid:102"); query.setStart(0); //數(shù)據(jù)起始行,分頁用 query.setRows(10); //返回記錄數(shù),分頁用 QueryResponse response = server.query(query); SolrDocumentList docs = response.getResults(); System.out.println("文檔個(gè)數(shù):" + docs.getNumFound()); //數(shù)據(jù)總條數(shù)也可輕易獲取 System.out.println("查詢時(shí)間:" + response.getQTime()); for (SolrDocument doc : docs) { get = new Get(Bytes.toBytes((String) doc.getFieldValue("rowkey"))); list.add(get); } Result[] res = table.get(list); byte[] bt1 = null; byte[] bt2 = null; byte[] bt3 = null; byte[] bt4 = null; String str1 = null; String str2 = null; String str3 = null; String str4 = null; for (Result rs : res) { bt1 = rs.getValue("d".getBytes(), "3mpon".getBytes()); bt2 = rs.getValue("d".getBytes(), "3mponid".getBytes()); bt3 = rs.getValue("d".getBytes(), "amarpu".getBytes()); bt4 = rs.getValue("d".getBytes(), "amarpuid".getBytes()); if (bt1 != null && bt1.length>0) {str1 = new String(bt1);} else {str1 = "無數(shù)據(jù)";} //對(duì)空值進(jìn)行new String的話會(huì)拋出異常 if (bt2 != null && bt2.length>0) {str2 = new String(bt2);} else {str2 = "無數(shù)據(jù)";} if (bt3 != null && bt3.length>0) {str3 = new String(bt3);} else {str3 = "無數(shù)據(jù)";} if (bt4 != null && bt4.length>0) {str4 = new String(bt4);} else {str4 = "無數(shù)據(jù)";} System.out.print(new String(rs.getRow()) + " "); System.out.print(str1 + "|"); System.out.print(str2 + "|"); System.out.print(str3 + "|"); System.out.println(str4 + "|"); } table.close(); } }
小結(jié):
通過測(cè)試發(fā)現(xiàn),結(jié)合Solr索引可以很好的實(shí)現(xiàn)HBase的多條件查詢,同時(shí)還能解決其兩個(gè)難點(diǎn):分頁查詢、數(shù)據(jù)總量統(tǒng)計(jì)。
實(shí)際場(chǎng)景中大多都是分頁查詢,分頁查詢返回的數(shù)據(jù)量很少,采用此種方案完全可以達(dá)到前端頁面毫秒級(jí)的實(shí)時(shí)響應(yīng);若有大批量的數(shù)據(jù)交互,比如涉及到數(shù)據(jù)導(dǎo)出,實(shí)際上效率也是很高,十萬數(shù)據(jù)僅耗時(shí)10秒。
另外,如果真的將Solr納入使用,Solr以及HBase端都可以不斷進(jìn)行優(yōu)化,比如可以搭建Solr集群,甚至可以采用SolrCloud基于hadoop的分布式索引服務(wù)。
總之,HBase不能多條件過濾查詢的先天性缺陷,在Solr的配合之下可以得到較好的彌補(bǔ),難怪諸如新蛋科技、國美電商、蘇寧電商等互聯(lián)網(wǎng)公司以及眾多游戲公司,都使用Solr來支持快速查詢。
----end
posted on 2014-12-04 19:02 paulwong 閱讀(1002) 評(píng)論(0) 編輯 收藏 所屬分類: HBASE 、SOLR/LUCENCE