Skynet

---------- ---------- My new blog: liukaiyi.cublog.cn ---------- ----------


Databases such as MySQL, Oracle, Berkeley DB and SQLite3 are already very good.
But after a first pass through some data-mining material, I found that relational databases fall far short for storing and querying data that has been through ETL.

For example: I want to sort raw log data by one field. Sounds simple, right?
Some will say: load the data into a database (load into table ...), then select ... order by.
Others will say: linux sort -n ...

Fine! Now let's run this simple operation on 1 TB of data -- and we are stuck!
    As for mining: in less than half a year of studying it, I have already run into TB-scale data 3-4 times.

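Solution:
For this problem, what I want is one big linked list (too big to fit in memory).
Each struct in the list holds:
   >> the file that the sort key belongs to
   >> the start and end position, within that file, of the whole record that the sort key belongs to
   >> its place in the sort order (as a linked list: each node stores only the list position of its neighbour in sort order; in the example below, that is the next larger key)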
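
A minimal sketch of such a node as a fixed-size record, using Python's struct module. The 76-byte path field, the 8-byte integers, the -1 "no next node" marker and the names NODE, pack_node and unpack_node are all my assumptions for illustration; they are chosen so that one node is exactly 100 bytes, the slot size used in the example further down.

```python
import struct

# Hypothetical layout: 76-byte path, then start, end and next as 8-byte ints.
# "<" selects standard sizes with no padding, so 76 + 8 + 8 + 8 = 100 bytes.
NODE = struct.Struct("<76sqqq")
NO_NEXT = -1  # assumed marker for "no next node"


def pack_node(path, start, end, next_pos=NO_NEXT):
    """Serialize one list node into a 100-byte record."""
    raw = path.encode("utf-8")
    if len(raw) > 76:
        raise ValueError("path too long for the fixed-size field")
    return NODE.pack(raw, start, end, next_pos)


def unpack_node(buf):
    """Inverse of pack_node: returns (path, start, end, next_pos)."""
    raw, start, end, next_pos = NODE.unpack(buf)
    return raw.rstrip(b"\0").decode("utf-8"), start, end, next_pos


buf = pack_node("/tmp/file1", 23, 55, 200)
print(len(buf), unpack_node(buf))
```

Because every node has the same size, node number n always starts at byte n * 100 of the index file, so a pointer can simply be a byte offset.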

For example:
1. Contents of file 1 =>

Notation:
full record : start - end position of the record in the file (obtained by the program, of course; I write it out here for convenience)
..c.. : 0 - 22
..a.. : 23 - 55
..b.. : 56 - 76
..d.. : 77 - 130
..f.. : 131 - 220
..e.. : 221 - 243

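2. Each node of the data structure is preallocated 100 bytes
3. What gets stored in the file: # I won't cover sorting a linked list here; it is the most basic data-structure skill: update the pointer of the node whose key is just smaller than the new one
      I'll just give the result
{ /tmp/file1, 0-22 , 300 }   # c : at list position 0
{ /tmp/file1, 23-55 , 200 }       # a : 100
{ /tmp/file1, 56-76 , 0 }     # b : 200
{ /tmp/file1, 77-130 , 500 } # d : 300
{ /tmp/file1, 131-220 , } # f : 400 (largest key, so no next node)
{ /tmp/file1, 221-243 , 400 } # e : 500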
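
The table above can be reproduced with a small sketch of sorted insertion into a linked list whose pointers are list positions. This is an in-memory illustration only, not the on-disk version: build_index, NODE_SIZE and the dictionaries are hypothetical names, and the keys are kept in a dict here, whereas a real index would have to read them back from the data file. The linear walk costs O(n) per insert, so it only shows the pointer updates, not something usable at TB scale.

```python
NODE_SIZE = 100  # bytes reserved per node, as in step 2


def build_index(records):
    """records: list of (key, start, end) in file order.

    Returns (head, nodes), where nodes maps a list position to
    [start, end, next] and next is None for the largest key.
    """
    nodes = {}
    keys = {}
    head = None
    for i, (key, start, end) in enumerate(records):
        pos = i * NODE_SIZE
        keys[pos] = key
        # Walk from the head to find the last node whose key is <= the new key.
        prev, cur = None, head
        while cur is not None and keys[cur] <= key:
            prev, cur = cur, nodes[cur][2]
        nodes[pos] = [start, end, cur]
        if prev is None:
            head = pos            # new smallest key becomes the head
        else:
            nodes[prev][2] = pos  # update the pointer of the smaller neighbour
    return head, nodes


records = [("c", 0, 22), ("a", 23, 55), ("b", 56, 76),
           ("d", 77, 130), ("f", 131, 220), ("e", 221, 243)]
head, nodes = build_index(records)
for pos in sorted(nodes):
    print(pos, nodes[pos])
```

With the six records above, head comes out as 100 (the node for a) and the next pointers match the table.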

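4. Output in order, from smallest to largest
     Suppose the smallest key is stored up front as list position 100 (a)
     Look up its node and open /tmp/file1
       then seek the file cursor to 23-55 and read ..a..
   Follow the pointer in that node, 200, then seek to 56-76 and read ..b..
   and so on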
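
The traversal in step 4 can be sketched as follows. Everything here is an assumption made for the demo: the compact "<qqq" node (start, end, next) stored at the front of each 100-byte slot, the -1 end marker, the temporary file names, and the use of Python's sorted() to link the nodes, which stands in for whatever index-building method is actually used. Short stand-in records replace the real log lines, so the byte offsets differ from the example above.

```python
import os
import struct
import tempfile

NODE = struct.Struct("<qqq")  # start, end, next (hypothetical compact layout)
NODE_SIZE = 100               # each node occupies a 100-byte slot
NO_NEXT = -1                  # assumed marker for the end of the list


def read_sorted(data_path, index_path, head):
    """Follow the linked list from head, yielding records in sort order."""
    with open(data_path, "rb") as data, open(index_path, "rb") as index:
        pos = head
        while pos != NO_NEXT:
            index.seek(pos)
            start, end, pos = NODE.unpack(index.read(NODE.size))
            data.seek(start)
            yield data.read(end - start + 1)  # the end offset is inclusive


tmp = tempfile.mkdtemp()
data_path = os.path.join(tmp, "file1")
index_path = os.path.join(tmp, "file1.idx")

# Write the data file and remember each record's (start, end) span.
records = [b"..c..", b"..a..", b"..b..", b"..d..", b"..f..", b"..e.."]
spans = []
with open(data_path, "wb") as f:
    for rec in records:
        start = f.tell()
        f.write(rec)
        spans.append((start, f.tell() - 1))

# Link every node to its successor in sort order.
order = sorted(range(len(records)), key=lambda i: records[i])
with open(index_path, "wb") as f:
    f.write(b"\0" * NODE_SIZE * len(records))
    for rank, i in enumerate(order):
        nxt = order[rank + 1] * NODE_SIZE if rank + 1 < len(order) else NO_NEXT
        f.seek(i * NODE_SIZE)
        f.write(NODE.pack(spans[i][0], spans[i][1], nxt))
head = order[0] * NODE_SIZE

result = list(read_sorted(data_path, index_path, head))
print(head, result)
```

Only one node and one record are held in memory at a time, which is the point of keeping the list on disk. Each step costs two seeks, so on spinning disks the access pattern matters a great deal.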

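Of course, for the data structure above you could also use a doubly linked list, a B-tree, a red-black tree, a Fibonacci heap ... (data structures finally feel useful; passing the software certification exam was not a waste!)

With that explained, here are some technical details (py) you may need. Corrections are welcome!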

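1. Structured writes and in-place updates to a binary file

# Overwrite the content at byte offset 190
import os
from struct import pack

fd = os.open("pack1.txt", os.O_RDWR | os.O_CREAT)

# 'ii11s' = two 4-byte ints + an 11-byte string = 19 bytes per record
# (in Python 3 the 's' field takes bytes, hence b'google')
ss = pack('ii11s', 3, 4, b'google')
os.lseek(fd, len(ss) * 10, os.SEEK_SET)  # record 10 starts at 19 * 10 = 190
os.write(fd, ss)
os.fsync(fd)

os.close(fd)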

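2. Seek to a given position and do a structured read

from struct import calcsize, unpack
file_object = open('pack1.txt', 'rb')

# si is the record number, ss the record size in bytes (19 for 'ii11s')
def ts(si, ss=calcsize('ii11s')):
    file_object.seek(si * ss)
    chunk = file_object.read(ss)
    a, b, c = unpack('ii11s', chunk)
    # strip the NUL padding that pack() added to fill the 11-byte field
    print(a, b, c.rstrip(b'\0').decode())

ts(10)
# Output: 3 4 google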

3. Use from other languages
Struct definition: in Python, use the struct package; data serialized to a file this way can be read by other languages as well
Reference: http://www.pythonid.com/bbs/archiver/?tid-285.html

pack1.py
from struct import pack

# i is an int (4 bytes); 11s reserves 11 bytes for a string
# so one record is 19 bytes
ss = pack('ii11s', 1, 2, b'hello world')

f = open("pack1.txt", "wb")
f.write(ss)
f.close()

The code above writes data in the layout of a C struct: two ints and a string.
pack1.c
#include <stdio.h>
#include <string.h>

struct AA
{
    int a;
    int b;
    char    c[64];
};

int main()
{
    struct AA   aa;
    FILE    *fp;
    int     size, readsize;

    memset(&aa, 0, sizeof(struct AA));

    fp = fopen("pack1.txt", "rb");
    if (NULL == fp) {
        printf("open file error!\n");
        return 0;
    }

    readsize = sizeof(struct AA);
    printf("readsize: %d\n", readsize);

    size = fread(&aa, 1, readsize, fp);
    printf("read: %d\n", size);
    printf("a=%d, b=%d, c=%s\n", aa.a, aa.b, aa.c);

    fclose(fp);

    return 0;
}

Output:
C:\Documents and Settings\lky\桌面\dataStructure>a
readsize: 72
read: 57
a=1, b=2, c=hello word
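
Note that the C program asks fread for sizeof(struct AA) bytes (72 in the output above), while 'ii11s' describes a 19-byte record; the short read still prints correctly because the struct was zeroed with memset first. As a hedged sketch, a format string that mirrors struct AA field for field would be 'ii64s'. The sizes below assume a platform where a C int is 4 bytes, and the names PY_FORMAT and C_FORMAT are mine.

```python
import struct

# Native mode (the default) uses the platform's C sizes and alignment.
PY_FORMAT = "ii11s"  # format used in pack1.py
C_FORMAT = "ii64s"   # mirrors struct AA: int a; int b; char c[64];

print(struct.calcsize(PY_FORMAT))  # 19 when int is 4 bytes
print(struct.calcsize(C_FORMAT))   # 72 when int is 4 bytes

# pack() pads the 's' field with NUL bytes up to its declared width,
# so the C side always sees a terminated string if the text is shorter.
record = struct.pack(C_FORMAT, 1, 2, b"hello world")
a, b, c = struct.unpack(C_FORMAT, record)
text = c.split(b"\0", 1)[0].decode("ascii")
print(len(record), a, b, text)
```

Writing full-size records this way lets the C reader check that fread returned exactly sizeof(struct AA), and it keeps record n at byte n * 72 for seeking.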

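One last ramble:
Once you can work with data structures directly, you can tailor storage to your own logic, which is very convenient. You are no longer bound by the limits of relational databases, key-value databases or mapreduce.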

References:
http://docs.python.org/library/struct.html#module-struct # official documentation for the struct package
http://blog.csdn.net/JGood/archive/2009/06/22/4290158.aspx # notes left by someone who used struct before me
http://www.tutorialspoint.com/python/os_lseek.htm # a small demo
Python天天美味(17) - open讀寫文件 (reading and writing files with open)

Compiled by www.aygfsteel.com/Good-Game

posted on 2009-11-04 15:16 by 劉凱毅, filed under: python, data mining
