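Cardinality estimation was designed to answer one question in a big-data setting: how do you compute the cardinality of a very large set at a low space cost? Put differently, with cardinality estimation a single machine can compute UV (unique visitors) at the hundred-million scale with an error within 4%. The approach is essentially probabilistic estimation; for the underlying principles and methods, see the blog and the original papers.
For the sake of experiment I wrote simple implementations of four algorithms: brute force (bruteforce, bf), Bloom filter (bbf), LogLog (llc) and HyperLogLog (hllc), to check whether cardinality estimation is a workable way to compute a deduplicated metric. (llc was wildly off, probably because I did not tune the number of buckets properly, so I am not posting its results.)
Preprocessing: generate random uids in 1..N, with N draws in total (uniform distribution); the JVM was started with -Xmx1024m.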
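To make the setup concrete, here is a minimal sketch of the brute-force baseline under the preprocessing described above: draw uids uniformly from 1..N and count the distinct ones exactly with a hash set. The class name `BruteForceUv`, the seed and the small N are my own assumptions for illustration; this is not the code from the repository.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical baseline for illustration; not the author's implementation.
class BruteForceUv {
    // Draws `draws` uids uniformly from 1..n and returns the exact distinct count.
    static int countDistinct(int n, int draws, long seed) {
        Random rng = new Random(seed);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < draws; i++) {
            seen.add(rng.nextInt(n) + 1); // uid in [1, n]
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 100_000; // assumed size, far smaller than the runs in the text
        System.out.println("exact uv = " + countDistinct(n, n, 42L));
    }
}
```

The set holds every distinct uid, so memory grows linearly with the cardinality; that is exactly the cost the estimators try to avoid.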
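Experimental results:
A note on how the expected value is computed. The mathematical model behind this experiment is a random sequence of length k drawn uniformly from 1..N (written n below); we want the expected number of distinct elements. In my experiment k = n, which is an extreme case. The design is purely for computational convenience: a larger k makes the run extremely slow, and at 50 million UV it would not finish at all. Increasing k should in theory improve accuracy; in one run I tried, with 1 million UV and 5 million PV, hllc returned 991234, an error of <1%. In theory k corresponds to PV, and in the recurrence the expectation equals n as k tends to infinity.
The recurrence can be derived by combinatorial analysis. I will not go through the derivation (and I may well have got it wrong, my math is really not what it used to be); the closed-form expression is in the MATLAB code.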
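% Expected number of distinct values for k = n uniform draws from 1..n
syms e n;
e = n-(1/n)*((1-2*n+n*n)*((n-1)/n)^(n-2)+(1-n)*n+n*(n-1));
% Evaluate at n = 1,000,000 to 10 significant digits
vpa(subs(e,n,1000000),10)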
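As a cross-check, the same expectation can be reached by a shorter route: a given uid is missed by all k draws with probability (1 - 1/n)^k, so by linearity of expectation the expected number of distinct uids is n * (1 - (1 - 1/n)^k). As far as I can tell, the MATLAB expression above simplifies to exactly this with k = n, since (1 - 2n + n^2) = (n - 1)^2 and the last two terms cancel. The Java sketch below evaluates the general form; the class and method names are my own, and it assumes n >= 2.

```java
// Hypothetical helper for illustration; assumes n >= 2.
class ExpectedDistinct {
    // E[distinct] = n * (1 - (1 - 1/n)^k), computed via log1p/expm1
    // to avoid precision loss when 1/n is tiny.
    static double expectedDistinct(long n, long k) {
        return -n * Math.expm1(k * Math.log1p(-1.0 / n));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        System.out.println("k = n : " + expectedDistinct(n, n));
        System.out.println("k = 5n: " + expectedDistinct(n, 5 * n));
    }
}
```

For k = n the value is close to n * (1 - 1/e), about 0.632 n, and it approaches n as k grows, which matches the limit stated above.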
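One more thing: personally I think a distributed Bloom filter is a very good solution, because it is fairly balanced in both space and time and its accuracy is high. Cardinality estimation essentially has O(1) space complexity, and with reasonably efficient code it can also be very fast, but its drawbacks are somewhat lower accuracy and that it is not easy to compute in a distributed way (it is naturally suited to a single process; balancing the llc buckets is also better done in one process, and going distributed is complete overkill).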
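For readers who have not seen the Bloom filter approach, here is a minimal single-process sketch of Bloom-filter-based UV counting: a uid is counted as new unless all of its probed bits are already set. Everything here (the class name `BloomUvCounter`, the double-hashing scheme, the parameters) is an assumption for illustration; it is not the bbf implementation from the repository, and it does not cover the distributed variant.

```java
import java.util.BitSet;

// Hypothetical sketch for illustration; not the author's bbf implementation.
class BloomUvCounter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;
    private long distinct = 0;

    BloomUvCounter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 64-bit mixer (the SplitMix64 finalizer) used to derive two hash values.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Adds a uid; it counts as new unless every probed bit was already set.
    void add(long uid) {
        long h = mix(uid + 0x9E3779B97F4A7C15L);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32) | 1; // odd step for double hashing
        boolean seen = true;
        for (int i = 0; i < numHashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        if (!seen) {
            distinct++;
        }
    }

    // False positives can only make this undercount, never overcount.
    long estimate() {
        return distinct;
    }
}
```

Unlike the estimators, the bit array has to be sized for the expected cardinality, which is the space/accuracy trade-off mentioned above.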
ref blog: http://blog.codinglabs.org/articles/cardinality-estimate-exper.html#ref4
The Java code for the algorithm implementations is on GitHub: https://github.com/changedi/card-estimate
Let's start with the topic we computer people know best: linear algebra.
Today's post is the first in the series: vectors (Vector). One more remark: every look at a specific class or package here is meant as an example only. For whichever classes or interfaces you actually need, you should go through the API doc yourself.
Vector is an ordinary vector. The math package org.apache.commons.math.linear has two kinds of vector, RealVector and FieldVector, both of which are interfaces. The former is a vector of real numbers, the latter a vector over a field. Taking real vectors as the example, RealVector has AbstractRealVector and ArrayRealVector.
[Figure: class diagram of RealVector, AbstractRealVector and ArrayRealVector]
This structure may vary somewhat. In the doc, ArrayRealVector extends AbstractRealVector, whereas in the 2.0 code ArrayRealVector still implements RealVector directly and the package contains no AbstractRealVector at all. So the code differs from what the doc describes, even though the doc marks both AbstractRealVector and ArrayRealVector as "since 2.0". A small inconsistency, which really just comes down to the doc and the code having been updated at different times.
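The vector is a foundational concept of linear algebra. RealVector, as the basic interface, already defines essentially all the vector operations: element-wise addition, subtraction, multiplication and division, the outer product, the inner product, norms and so on. The implementation is array-based: internally, the concrete ArrayRealVector stores its elements in a double[] data field.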
One family of operations deserves a closer look: map***. As the original API doc puts it: "The various mapXxx and mapXxxToSelf methods operate on vectors element-wise, i.e. they perform the same operation (adding a scalar, applying a function ...) on each element in turn. The mapXxx versions create a new vector to hold the result and do not change the instance. The mapXxxToSelf versions use the instance itself to store the results, so the instance is changed by these methods. In both cases, the result vector is returned by the methods".
What does that mean in practice? The many map*** and map***ToSelf operations each apply one fixed operation to every element of the vector. The map*** versions return a new instance, while the map***ToSelf versions return the vector itself. This is visible in the source of mapAdd() and mapAddToSelf(): the difference is obvious, one returns a new ArrayRealVector and the other returns this.
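The contrast can be illustrated with a tiny stand-in class. To be clear, `TinyVector` below is hypothetical and written only to show the two return conventions; it is not the Commons Math source, which should be read in the library itself.

```java
import java.util.Arrays;

// Hypothetical stand-in for illustration only; not the Commons Math source.
class TinyVector {
    private final double[] data;

    TinyVector(double[] d) {
        this.data = d.clone();
    }

    // Returns a NEW vector holding the result; this instance is unchanged.
    TinyVector mapAdd(double d) {
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i] + d;
        }
        return new TinyVector(out);
    }

    // Stores the result in this instance and returns this.
    TinyVector mapAddToSelf(double d) {
        for (int i = 0; i < data.length; i++) {
            data[i] += d;
        }
        return this;
    }

    double[] toArray() {
        return data.clone();
    }

    public static void main(String[] args) {
        TinyVector v = new TinyVector(new double[] {1, 2, 3});
        TinyVector w = v.mapAdd(1);
        System.out.println(Arrays.toString(v.toArray()) + " " + Arrays.toString(w.toArray()));
        System.out.println(v.mapAddToSelf(1) == v);
    }
}
```

Returning the vector in both cases lets calls be chained; the only thing to watch is whether the receiver has been modified along the way.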
See the code for the details.
The corresponding output:
v1 is {1; 2; 3}
size is 3
v1 + v2 = {5; 7; 9}
v1 + v2 = {5; 7; 9}
v1 - v2 = {-3; -3; -3}
v1 * v2 = {4; 10; 18}
v1 / v2 = {0.25; 0.4; 0.5}
v1[1] = 2.0
v1 append v2 is {1; 2; 3; 4; 5; 6}
distance between v1 and v2 is 5.196152422706632
L1 distance between v1 and v2 is 9.0
norm of v1 is 3.7416573867739413
dot product of v1 and v2 is 32.0
outer product of v1 and v2 is Array2DRowRealMatrix{{4.0,5.0,6.0},{8.0,10.0,12.0},{12.0,15.0,18.0}}
orthogonal projection of v1 and v2 is {1.66; 2.08; 2.49}
Map the Math.abs(double) function to v1 is {1; 2; 3}
Map the 1/x function to v1 itself is {1; 0.5; 0.33}
sub vector of v1 is {1; 0.5}
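The numeric lines in this output can be checked by hand. The sketch below recomputes a few of them for v1 = (1, 2, 3) and v2 = (4, 5, 6) in plain Java, without the library; the class `VectorChecks` and its method names are my own and are not part of Commons Math.

```java
// Hypothetical helper for checking the printed values; not part of Commons Math.
class VectorChecks {
    // Inner (dot) product: sum of element-wise products.
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    // Euclidean (L2) norm.
    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // Euclidean distance between a and b.
    static double distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(s);
    }

    // L1 distance: sum of absolute differences.
    static double l1Distance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += Math.abs(a[i] - b[i]);
        }
        return s;
    }

    // Orthogonal projection of a onto b: (a.b / b.b) * b.
    static double[] projection(double[] a, double[] b) {
        double scale = dot(a, b) / dot(b, b);
        double[] out = new double[b.length];
        for (int i = 0; i < b.length; i++) {
            out[i] = scale * b[i];
        }
        return out;
    }
}
```

With these definitions the projection of v1 onto v2 is (32/77) * v2, which rounds to the {1.66; 2.08; 2.49} shown above.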
The vector is a basic mathematical structure and will come up a lot from here on. The Commons Math library gives us a convenient vector representation, which makes writing this kind of program in Java a comfortable experience.
Of course, all of this should be driven by the documentation; writing code with the docs open is a must. So don't think of it as a chore: put the API doc on your desktop and start coding.
Related material:
Vector definition: http://zh.wikipedia.org/zh/%E7%9F%A2%E9%87%8F
Commons Math package: http://commons.apache.org/math/index.html