Chan Chen Coding...


Setting Up a Spark Environment: Windows Version

Reposted from: Setting Up a Spark Environment (Windows Version)

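I have been reading up on Scala lately and came across Spark, so yesterday afternoon at work I set up a Spark environment. I ran into a problem during installation and could not find a solution online, so I tracked down the cause myself. These are my notes.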

1. The Spark download is available from the official site: http://spark.incubator.apache.org/downloads.html . I installed 0.9, the latest version at the time of writing.

2. After downloading, extract the archive directly to a directory of your choice, for example d:/programs

3. Install Scala and set the SCALA_HOME environment variable to its path. For installing Scala itself, see the official Scala site.

4. According to the official Spark installation guide, running the following command in the extracted directory is all that is needed:

sbt/sbt package

That command, however, is meant for Linux and OS X. Running it on Windows fails with:

not a valid command

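The cause is that the sbt script bundled with Spark cannot run on Windows. Download a Windows build of sbt from http://www.scala-sbt.org/ and copy its files into the sbt directory under the Spark directory. Then run the command again, and the installation will succeed.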
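Steps 3 and 4 depend on two facts about the machine: whether SCALA_HOME points to a real directory, and whether the operating system is Windows, in which case the bundled sbt script will not run. A minimal preflight sketch in Scala follows, using only the standard library. The object and method names are hypothetical and are not part of Spark or sbt.

```scala
import java.io.File
import java.util.Locale

// Hypothetical helper for illustration; not part of Spark or sbt.
object SparkSetupCheck {
  // True when the given os.name value denotes a Windows system.
  def isWindows(osName: String): Boolean =
    osName.toLowerCase(Locale.ROOT).startsWith("windows")

  // Describes the state of SCALA_HOME for a given environment map.
  def scalaHomeStatus(env: Map[String, String]): String =
    env.get("SCALA_HOME") match {
      case None =>
        "SCALA_HOME is not set"
      case Some(path) if new File(path).isDirectory =>
        s"SCALA_HOME points to $path"
      case Some(path) =>
        s"SCALA_HOME is set to $path, which is not a directory"
    }

  def main(args: Array[String]): Unit = {
    println(scalaHomeStatus(sys.env))
    val os = System.getProperty("os.name")
    if (isWindows(os))
      println(s"$os detected: the sbt script bundled with Spark will not run here; use a Windows build of sbt")
    else
      println(s"$os detected: sbt/sbt package should work as documented")
  }
}
```

Passing the environment in as a map, rather than reading it inside the function, keeps the check easy to try with made-up values.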

Trying out spark-shell

scala> val textFile = sc.textFile("README.md")
14/02/14 16:38:12 INFO MemoryStore: ensureFreeSpace(35480) called with curMem=177376, maxMem=308713881
14/02/14 16:38:12 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 34.6 KB, free 294.2 MB)

textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[16] at textFile at <console>:12

scala> textFile.count
14/02/14 16:38:14 INFO FileInputFormat: Total input paths to process : 1
14/02/14 16:38:14 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:14 INFO DAGScheduler: Got job 7 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:14 INFO DAGScheduler: Final stage: Stage 7 (count at <console>:15)
14/02/14 16:38:14 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:14 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:14 INFO DAGScheduler: Submitting Stage 7 (MappedRDD[16] at textFile at <console>:12), which has no missing parents
14/02/14 16:38:14 INFO DAGScheduler: Submitting 1 missing tasks from Stage 7 (MappedRDD[16] at textFile at <console>:12)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Adding task set 7.0 with 1 tasks
14/02/14 16:38:14 INFO TaskSetManager: Starting task 7.0:0 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:14 INFO TaskSetManager: Serialized task 7.0:0 as 1560 bytes in 1 ms
14/02/14 16:38:14 INFO Executor: Running task ID 5
14/02/14 16:38:14 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:14 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:14 INFO Executor: Serialized size of result for 5 is 563
14/02/14 16:38:14 INFO Executor: Sending result for 5 directly to driver
14/02/14 16:38:14 INFO Executor: Finished task ID 5
14/02/14 16:38:14 INFO TaskSetManager: Finished TID 5 in 6 ms on localhost (progress: 0/1)
14/02/14 16:38:14 INFO DAGScheduler: Completed ResultTask(7, 0)
14/02/14 16:38:14 INFO TaskSchedulerImpl: Remove TaskSet 7.0 from pool
14/02/14 16:38:14 INFO DAGScheduler: Stage 7 (count at <console>:15) finished in 0.009 s
14/02/14 16:38:14 INFO SparkContext: Job finished: count at <console>:15, took 0.012329265 s
res10: Long = 119

scala> textFile.first
14/02/14 16:38:24 INFO SparkContext: Starting job: first at <console>:15
14/02/14 16:38:24 INFO DAGScheduler: Got job 8 (first at <console>:15) with 1 output partitions (allowLocal=true)
14/02/14 16:38:24 INFO DAGScheduler: Final stage: Stage 8 (first at <console>:15)
14/02/14 16:38:24 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:24 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:24 INFO DAGScheduler: Computing the requested partition locally
14/02/14 16:38:24 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:24 INFO SparkContext: Job finished: first at <console>:15, took 0.002671379 s
res11: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = FilteredRDD[17] at filter at <console>:14

scala> textFile.filter(line=> line.contains("spark")).count
14/02/14 16:38:37 INFO SparkContext: Starting job: count at <console>:15
14/02/14 16:38:37 INFO DAGScheduler: Got job 9 (count at <console>:15) with 1 output partitions (allowLocal=false)
14/02/14 16:38:37 INFO DAGScheduler: Final stage: Stage 9 (count at <console>:15)
14/02/14 16:38:37 INFO DAGScheduler: Parents of final stage: List()
14/02/14 16:38:37 INFO DAGScheduler: Missing parents: List()
14/02/14 16:38:37 INFO DAGScheduler: Submitting Stage 9 (FilteredRDD[18] at filter at <console>:15), which has no missing parents
14/02/14 16:38:37 INFO DAGScheduler: Submitting 1 missing tasks from Stage 9 (FilteredRDD[18] at filter at <console>:15)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Adding task set 9.0 with 1 tasks
14/02/14 16:38:37 INFO TaskSetManager: Starting task 9.0:0 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/02/14 16:38:37 INFO TaskSetManager: Serialized task 9.0:0 as 1642 bytes in 0 ms
14/02/14 16:38:37 INFO Executor: Running task ID 6
14/02/14 16:38:37 INFO BlockManager: Found block broadcast_5 locally
14/02/14 16:38:37 INFO HadoopRDD: Input split: file:/D:/program/spark-0.9.0-incubating/README.md:0+4491
14/02/14 16:38:37 INFO Executor: Serialized size of result for 6 is 563
14/02/14 16:38:37 INFO Executor: Sending result for 6 directly to driver
14/02/14 16:38:37 INFO Executor: Finished task ID 6
14/02/14 16:38:37 INFO TaskSetManager: Finished TID 6 in 10 ms on localhost (progress: 0/1)
14/02/14 16:38:37 INFO DAGScheduler: Completed ResultTask(9, 0)
14/02/14 16:38:37 INFO TaskSchedulerImpl: Remove TaskSet 9.0 from pool
14/02/14 16:38:37 INFO DAGScheduler: Stage 9 (count at <console>:15) finished in 0.010 s
14/02/14 16:38:37 INFO SparkContext: Job finished: count at <console>:15, took 0.020335125 s
res12: Long = 7
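
One detail in the session is easy to miss: linesWithSpark is built with the capitalized "Spark", while the count that follows filters on lowercase "spark". String.contains is case-sensitive, so the two filters select different lines. The sketch below mirrors count, first, and filter on an ordinary Scala Seq instead of an RDD, so it runs without Spark. The object name and the sample lines are invented for illustration and are not the contents of README.md.

```scala
// Illustration only: a plain Scala Seq stands in for the RDD,
// and the sample lines are invented, not the real README.md.
object FilterSketch {
  val lines: Seq[String] = Seq(
    "# Apache Spark",
    "",
    "Run ./bin/spark-shell to start the shell",
    "Spark is built with sbt"
  )

  // contains is case-sensitive, so "Spark" and "spark" match different lines.
  def countContaining(word: String): Int =
    lines.count(line => line.contains(word))

  def main(args: Array[String]): Unit = {
    println(lines.size)                // analogous to textFile.count
    println(lines.head)                // analogous to textFile.first
    println(countContaining("Spark"))  // capitalized form
    println(countContaining("spark"))  // lowercase form
  }
}
```

To match both spellings in one pass, one option is to lowercase each line before testing it, for example line.toLowerCase.contains("spark").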

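The official Spark site also offers four introductory screencasts, but YouTube is blocked in mainland China and they cannot be watched there, so I uploaded the four videos to Tudou for anyone who wants to see them.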

Spark Screencast 1 – Setting Up a Spark Environment

Spark Screencast 2 – Spark Documentation Overview

Spark Screencast 3 – Transformations and Caching

Spark Screencast 4 – A Standalone Job in Scala

-----------------------------------------------------
Silence, the way to avoid many problems;
Smile, the way to solve many problems;

Posted on 2014-02-14 16:21 by Chan Chen. Categories: Scala / Java
