DISC (Data-Intensive Super Computing)
Data-Intensive System (DIS)
System Challenges:
Data distributed over many disks
Compute using many processors
Connected by gigabit Ethernet (or equivalent)
System Requirements:
Lots of disks
Lots of processors
Located in close proximity
System Comparison:
(i) Data
Conventional Supercomputers:
Data stored in a separate repository; no support for collection or management
Data brought into the system for computation: time consuming, limits interactivity
DISC:
System collects and maintains the data as a shared, active data set
Computation colocated with storage: faster access
(ii) Programming Models
Conventional Supercomputers:
Programs described at a very low level, specifying detailed control of processing and communications
Rely on a small number of software packages written by specialists
Limits the classes of problems and solution methods
DISC:
Application programs written in terms of high-level operations on data
Runtime system controls scheduling, load balancing, …
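The contrast above can be sketched in miniature. Below is a single-process word count in the MapReduce style: the programmer supplies only the map and reduce steps, while everything the slide assigns to the runtime (partitioning, scheduling, load balancing) is reduced here to a toy shuffle. Function and variable names are illustrative, not Google's actual API.

```python
from collections import defaultdict

# map: emit (word, 1) for every word in every document
def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1

# shuffle: group values by key, as the runtime would between phases
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# reduce: sum the counts for each word
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["data intensive computing", "intensive data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"data": 2, "intensive": 2, "computing": 1}
```

On a real cluster, only the two user-written functions change from problem to problem; the runtime reruns them wherever the data happens to live.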
(iii) Interaction
Conventional Supercomputers:
Main machine: batch access; priority is to conserve machine resources
User submits a job with specific resource requirements; it runs in batch mode when resources become available
Offline visualization: results are moved to a separate facility for interactive use
DISC:
Interactive access; priority is to conserve human resources
User actions range from simple queries to complex computations
System supports many simultaneous users
Requires a flexible programming and runtime environment
(iv) Reliability
Conventional Supercomputers:
“Brittle” systems: the main recovery mechanism is to recompute from the most recent checkpoint
Must bring the system down for diagnosis, repair, or upgrades
DISC:
Flexible error detection and recovery: the runtime system detects and diagnoses errors
Selective use of redundancy and dynamic recomputation
Components can be replaced or upgraded while the system is running
Requires a flexible programming model and runtime environment
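A hedged sketch of what "dynamic recomputation" can mean in practice, assuming a runtime that re-executes only the failed task on another worker instead of rolling the whole job back to a checkpoint. The worker and task structure here is invented for illustration, not taken from any real system.

```python
# Re-run a failed task on a different worker; the rest of the job's
# results are untouched, so there is no global checkpoint restart.
def run_with_recovery(tasks, workers, max_attempts=3):
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(max_attempts):
            worker = workers[(task_id + attempt) % len(workers)]
            try:
                results[task_id] = worker(task)
                break
            except RuntimeError:
                continue  # reschedule just this task elsewhere
        else:
            raise RuntimeError(f"task {task_id} failed on all workers")
    return results

def flaky_worker(task):
    raise RuntimeError("simulated node failure")

def good_worker(task):
    return task * 2

# Task 0 first lands on the failing node, is rescheduled, and succeeds.
out = run_with_recovery({0: 10, 1: 20}, [flaky_worker, good_worker])
# out == {0: 20, 1: 40}
```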
Comparing with Grid Computing:
Grid: Distribute Computing and Data
(i) Computation: Distribute problem across many machines
Generally limited to problems that partition easily into independent subproblems
(ii) Data: Support shared access to large-scale data set
DISC: Centralize Computing and Data
(i) Enables more demanding computational tasks
(ii) Reduces time required to get data to machines
(iii) Enables more flexible resource management
A Commercial DISC
Netezza Performance Server (NPS)
Designed for “data warehouse” applications
Heavy-duty analysis of databases
Data distributed over up to 500 Snippet Processing Units
Each unit combines disk storage, a dedicated processor, and an FPGA controller
User “programs” expressed in SQL
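To illustrate the declarative style: the user states what to compute, and the engine decides how to spread the work across its processing units. The snippet below uses Python's sqlite3 purely as a stand-in for the NPS SQL front end; the table and data are invented.

```python
import sqlite3

# A "program" in this model is just a query; the engine owns the
# execution plan. sqlite3 here only stands in for the warehouse engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
# rows == [("east", 150.0), ("west", 250.0)]
```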
Constructing DISC
Hardware: Rent from Amazon
Elastic Compute Cloud (EC2)
Generic Linux cycles for $0.10 / hour ($877 / yr)
Simple Storage Service (S3)
Network-accessible storage for $0.15 / GB / month ($1800/TB/yr)
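The quoted yearly figures follow from simple arithmetic on the per-unit prices (these are the slide's 2008 numbers, not current AWS pricing):

```python
# EC2: $0.10 per instance-hour; S3: $0.15 per GB-month.
HOURS_PER_YEAR = 24 * 365  # 8760; the slide's $877 uses 8766 h (365.25 days)

ec2_per_year = 0.10 * HOURS_PER_YEAR   # one instance for one year: $876
s3_per_tb_year = 0.15 * 1000 * 12      # 1 TB (1000 GB) for 12 months: $1800
```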
Software: utilize open source
Hadoop Project
Open source project providing file system and MapReduce
Supported and used by Yahoo
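Hadoop's streaming interface is one concrete face of the "file system plus MapReduce" combination: any executable can act as a map task, reading input records one per line and emitting tab-separated key/value pairs for the framework to sort and hand to the reducer. A minimal mapper sketch (input data is invented; a real job would read standard input):

```python
# Streaming-style mapper: one input record per line in, one
# tab-separated "key<TAB>value" pair per word out.
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

pairs = list(mapper(["data intensive computing", "intensive data"]))
# pairs[0] == "data\t1"
```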
Implementing System Software
Programming Support
Abstractions for computation & data representation
E.g., Google: MapReduce & BigTable
Usage models
Runtime Support
Allocating processing and storage
Scheduling multiple users
Implementing programming model
Error Handling
Detecting errors
Dynamic recovery
Identifying failed components
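One common way to identify failed components is heartbeat monitoring, the mechanism Hadoop-style runtimes use between workers and the master: a component that has not reported within a timeout is presumed dead and its work is rescheduled. The sketch below assumes a simple timeout rule; node names and timestamps are invented.

```python
# Mark a component failed if its last heartbeat is older than `timeout`.
def detect_failures(last_heartbeat, now, timeout):
    return sorted(c for c, t in last_heartbeat.items() if now - t > timeout)

beats = {"node-1": 100.0, "node-2": 94.0, "node-3": 99.5}
failed = detect_failures(beats, now=101.0, timeout=5.0)
# failed == ["node-2"]  (silent for 7 s, past the 5 s timeout)
```

Detection is only the first half; the "dynamic recovery" step would then reassign node-2's tasks to the surviving nodes.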
posted on 2008-04-04 23:43 by sun, filed under: DISC