

Data Centric Computing

  • Jim Gray

  • Microsoft Research

  • Research.Microsoft.com/~Gray/talks

  • FAST 2002

  • Monterey, CA, 14 Oct 1999


Put Everything in Future (Disk) Controllers (it’s not “if”, it’s “when?”)

  • Jim Gray

  • Microsoft Research

  • http://Research.Microsoft.com/~Gray/talks

  • FAST 2002

  • Monterey, CA, 14 Oct 1999

  • Acknowledgements: Dave Patterson explained this to me long ago; Leonard Chung, Kim Keeton, Erik Riedel, Catharine Van Ingen



First Disk 1956

  • IBM 305 RAMAC

  • 4 MB

  • 50x24” disks

  • 1200 rpm

  • 100 ms access

  • 35k$/y rent

  • Included computer & accounting software (tubes not transistors)



10 years later



Disk Evolution

  • Capacity: 100x in 10 years; 1 TB 3.5” drive in 2005; 20 GB as a 1” micro-drive

  • System on a chip

  • High-speed SAN

  • Disk replacing tape

  • Disk is a supercomputer!



Disks are becoming computers

  • Smart drives

  • Camera with micro-drive

  • Replay / Tivo / Ultimate TV

  • Phone with micro-drive

  • MP3 players

  • Tablet

  • Xbox

  • Many more…



Data Gravity: Processing Moves to Transducers (smart displays, microphones, printers, NICs, disks)

  • Storage

  • Network

  • Display



It’s Already True of Printers: Peripheral = CyberBrick

  • You buy a printer

  • You get:

    • several network interfaces
    • a PostScript engine
      • CPU,
      • memory,
      • software,
      • a spooler (soon)
    • and… a print engine.


The (absurd?) consequences of Moore’s Law

  • 256-way NUMA?

  • Huge main memories: now: 500 MB - 64 GB; then: 10 GB - 1 TB

  • Huge disks: now: 20 - 200 GB 3.5” disks; then: 0.1 - 1 TB disks

  • Petabyte storage farms

    • (that you can’t back up or restore).
  • Disks >> tapes

    • “Small” disks: one platter, one inch, 10 GB
  • SAN convergence: 1 GBps point-to-point is easy



The Absurd Design?

  • Further segregate processing from storage

  • Poor locality

  • Much useless data movement

  • Amdahl’s laws: bus: 10 B/ips; IO: 1 b/ips



What’s a Balanced System? (40+ disk arms / cpu)



Amdahl’s Balance Laws Revised

  • Laws right, just need “interpretation” (imagination?)

  • Balanced System Law: A system needs 8 MIPS per MBps of IO, but the instruction rate must be measured on the workload.

    • Sequential workloads have low CPI (clocks per instruction),
    • random workloads tend to have higher CPI.
  • Alpha (the MB/MIPS ratio) is rising from 1 to 6. This trend will likely continue.

  • One random IO per 50k instructions.

  • Sequential IOs are larger: one sequential IO per 200k instructions (worked example below)
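
A rough back-of-envelope sketch of these balance rules. The 1 bips processor and the ~120 random IOs per second per disk arm are my assumptions, not figures from the slide:

    # Back-of-envelope check of the balance laws above (inputs are illustrative).
    mips = 1_000                           # a 1 bips processor (assumed)
    random_iops = mips * 1e6 / 50_000      # one random IO per 50k instructions -> 20,000 IO/s
    seq_iops    = mips * 1e6 / 200_000     # one sequential IO per 200k instructions -> 5,000 IO/s
    io_mbps     = mips / 8                 # balance law: 8 MIPS per MBps of IO -> 125 MBps

    arms = random_iops / 120               # ~167 arms if the workload were all random (assumed 120 IO/s/arm)
    print(f"{random_iops:,.0f} random IO/s, {seq_iops:,.0f} sequential IO/s, "
          f"{io_mbps:.0f} MBps of IO, ~{arms:.0f} arms for a purely random workload")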



Observations re TPC C, H systems

  • More than ½ the hardware cost is in disks

  • Most of the mips are in the disk controllers

  • 20 mips/arm is enough for TPC-C

  • 50 mips/arm is enough for TPC-H

  • Need 128MB to 256MB/arm

  • Ref:

    • Gray & Shenoy: “Rules of Thumb…”
    • Keeton, Riedel, Uysal PhD theses.
  • ? The end of computers ?



TPC systems

  • Normalize for CPI (clocks per instruction)

    • TPC-C has about 7 ins/byte of IO
    • TPC-H has 3 ins/byte of IO
  • TPC-H needs ½ as many disks, sequential vs random

  • Both use 9 GB 10 krpm disks (need arms, not bytes); see the sketch below
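
An illustrative conversion of the instructions-per-byte figures above into IO bandwidth and drive counts. The per-drive delivery rates (≈1 MBps random, ≈15 MBps sequential for a 9 GB 10 krpm drive) are my assumptions, not the slide’s:

    # Convert instructions-per-byte into the IO bandwidth a 1 bips CPU needs.
    def cpu_mbps(mips, ins_per_byte):
        """MBps of IO that keeps a CPU of the given MIPS busy."""
        return mips / ins_per_byte

    tpcc_mbps = cpu_mbps(1_000, 7)      # TPC-C: ~7 ins/byte -> ~143 MBps per bips
    tpch_mbps = cpu_mbps(1_000, 3)      # TPC-H: ~3 ins/byte -> ~333 MBps per bips

    # Assumed per-drive delivery: ~1 MBps random (TPC-C), ~15 MBps sequential (TPC-H).
    print(f"TPC-C: {tpcc_mbps:.0f} MBps -> ~{tpcc_mbps / 1:.0f} drives per bips")
    print(f"TPC-H: {tpch_mbps:.0f} MBps -> ~{tpch_mbps / 15:.0f} drives per bips")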



TPC systems: What’s alpha (=MB/MIPS)?

  • Hard to say:

    • Intel 32 bit addressing (= 4GB limit). Known CPI.
    • IBM, HP, Sun have 64 GB limit. Unknown CPI.
    • Look at both, guess CPI for IBM, HP, Sun
  • Alpha is between 1 and 6



When each disk has 1 bips, there is no need for a ‘CPU’



Implications

  • Offload device handling to NIC/HBA

  • higher level protocols: I2O, NASD, VIA, IP, TCP…

  • SMP and cluster parallelism are important.



Interim Step: Shared Logic

  • Brick with 8-12 disk drives

  • 200 mips/arm (or more)

  • 2 x Gbps Ethernet

  • General-purpose OS (except NetApp)

  • 10k$/TB to 50k$/TB

  • Shared

    • Sheet metal
    • Power
    • Support/Config
    • Security
    • Network ports


Next step in the Evolution

  • Disks become supercomputers

    • Controller will have 1 bips, 1 GB RAM, 1 GBps net
    • And a disk arm.
  • Disks will run full-blown app/web/db/os stack

  • Distributed computing

  • Processors migrate to transducers.



Gordon Bell’s Seven Price Tiers

  • 10$: wrist watch computers

  • 100$: pocket/ palm computers

  • 1,000$: portable computers

  • 10,000$: personal computers (desktop)

  • 100,000$: departmental computers (closet)

  • 1,000,000$: site computers (glass house)

  • 10,000,000$: regional computers (glass castle)



Bell’s Evolution of Computer Classes



NAS vs SAN

  • Network Attached Storage

    • File servers
    • Database servers
    • Application servers
    • (it’s a slippery slope: as Novell showed)
  • Storage Area Network

    • A lower life form
    • Block server: get block / put block
    • Wrong abstraction level (too low level)
    • Security is VERY hard to understand.
      • (who can read that disk block?)


How Do They Talk to Each Other?

  • Each node has an OS

  • Each node has local resources: A federation.

  • Each node does not completely trust the others.

  • Nodes use RPC to talk to each other

    • WebServices/SOAP? CORBA? COM+? RMI?
    • One or all of the above.
  • Huge leverage in high-level interfaces.

  • Same old distributed system story.



Basic Argument for x-Disks

  • Future disk controller is a supercomputer.

    • 1 bips processor
    • 256 MB dram
    • 1 TB disk plus one arm
  • Connects to SAN via high-level protocols

    • RPC, HTTP, SOAP, COM+, Kerberos, Directory Services,….
    • Commands are RPCs
    • management, security,….
    • Services file/web/db/… requests
    • Managed by general-purpose OS with good dev environment
  • Move apps to disk to save data movement

    • needs a programming environment in the “controller” (a minimal sketch follows)
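
A minimal sketch, using only the Python standard library, of what a disk answering high-level RPC commands rather than block reads might look like. The put_file/get_file names and the port are hypothetical, and a real x-disk would add security, directory service, and management commands:

    # Hypothetical sketch: a "disk controller" exposing a high-level RPC interface
    # (named objects) instead of raw sectors.  Not from the talk; names are invented.
    from xmlrpc.server import SimpleXMLRPCServer

    STORE = {}   # stand-in for the drive's on-disk object store

    def put_file(name: str, data: str) -> bool:
        """High-level 'put': the disk stores a named object."""
        STORE[name] = data
        return True

    def get_file(name: str) -> str:
        """High-level 'get': the disk returns a named object."""
        return STORE[name]

    server = SimpleXMLRPCServer(("0.0.0.0", 8080), allow_none=True)
    server.register_function(put_file)
    server.register_function(get_file)
    server.serve_forever()   # the controller now services file requests over the network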


The Slippery Slope

  • If you add function to server

  • Then you add more function to server

  • Function gravitates to data.



Why Not a Sector Server? (let’s get physical!)

  • Good idea, that’s what we have today.

  • But

    • cache added for performance
    • Sector remap added for fault tolerance
    • error reporting and diagnostics added
    • SCSI commands (reserve, …) are growing
    • Sharing is problematic (space mgmt, security,…)
  • Slipping down the slope to a 1-D block server



Why Not a 1-D Block Server? Put A LITTLE on the Disk Server

  • Tried and true design

    • HSC - VAX cluster
    • EMC
    • IBM Sysplex (3980?)
  • But look inside

    • Has a cache
    • Has space management
    • Has error reporting & management
    • Has RAID 0, 1, 2, 3, 4, 5, 10, 50,…
    • Has locking
    • Has remote replication
    • Has an OS
    • Security is problematic
    • Low-level interface moves too many bytes


Why Not a 2-D Block Server? Put A LITTLE on the Disk Server

  • Tried and true design

    • Cedar -> NFS
    • file server, cache, space,..
    • Open-file interface means many fewer messages
  • Grows to have

    • Directories + Naming
    • Authentication + access control
    • RAID 0, 1, 2, 3, 4, 5, 10, 50,…
    • Locking
    • Backup/restore/admin
    • Cooperative caching with client


Why Not a File Server? Put a Little on the 2-D Block Server

  • Tried and true design

    • NetWare, Windows, Linux, NetApp, Cobalt, SNAP,... WebDav
  • Yes, but look at NetWare

    • File interface grew
    • Became an app server
      • Mail, DB, Web,….
    • NetWare had a primitive OS
      • Hard to program, so it optimized the wrong thing


Why Not Everything? Allow Everything on Disk Server (thin clients)

  • Tried and true design

    • Mainframes, Minis, ...
    • Web servers,…
    • Encapsulates data
    • Minimizes data moves
    • Scaleable
  • It is where everyone ends up.

  • All the arguments against are short-term.



The Slippery Slope

  • If you add function to server

  • Then you add more function to server

  • Function gravitates to data.



Disk = Node

  • has magnetic storage (1TB?)

  • has processor & DRAM

  • has SAN attachment

  • has execution environment



Hardware

  • Homogeneous machines lead to quick response through reallocation

  • HP desktop machines, 320 MB RAM, 3U high, 4 x 100 GB IDE drives

  • $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB

  • 3 weeks from ordering to operational



Disk as Tape

  • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive

  • Using removable hard drives to replace tape’s function has been successful

  • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.

  • Portable, durable, fast, dense; media cost ≈ raw tape. Longevity unknown, but suspected good.



Disk As Tape: What format?

  • Today I send NTFS/SQL disks.

  • But that is not a good format for Linux.

  • Solution: Ship NFS/CIFS/ODBC servers (not disks)

  • Plug “disk” into LAN.

    • DHCP then file or DB server via standard interface.
    • Web Service in long term


Some Questions

  • Will the disk folks deliver?

  • What is the product?

  • How do I manage 1,000 nodes (disks)?

  • How do I program 1,000 nodes (disks)?

  • How does RAID work?

  • How do I backup a PB?

  • How do I restore a PB?



Will the disk folks deliver? Maybe! Hard Drive Unit Shipments



Most Disks are Personal

  • 85% of disks are desktop/mobile (not SCSI)

  • Personal media is AT LEAST 50% of the problem.

  • How to manage your shoebox of:

    • Documents
    • Voicemail
    • Photos
    • Music
    • Videos


What is the Product? (see next section on media management)

  • Concept: Plug it in and it works!

  • Music/Video/Photo appliance (home)

  • Game appliance

  • “PC”

  • File server appliance

  • Data archive/interchange appliance

  • Web appliance

  • Email appliance

  • Application appliance

  • Router appliance



Auto Manage Storage

  • 1980 rule of thumb:

    • A DataAdmin per 10GB, SysAdmin per mips
  • 2000 rule of thumb

    • A DataAdmin per 5TB
    • SysAdmin per 100 clones (varies with app).
  • Problem:

    • 5TB is 50k$ today, 5k$ in a few years.
    • Admin cost >> storage cost !!!! (arithmetic sketched below)
  • Challenge:

    • Automate ALL storage admin tasks
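
A rough illustration of that gap. The loaded admin salary is my assumption; the storage prices are the slide’s:

    # "Admin cost >> storage cost": assumed ~$100k/yr loaded cost per data admin.
    admin_per_tb_per_year = 100_000 / 5     # one admin per 5 TB -> ~$20k/TB/yr
    storage_per_tb_today  = 50_000 / 5      # slide: 5 TB is 50 k$ today -> 10 k$/TB
    storage_per_tb_soon   = 5_000 / 5       # slide: 5 k$ in a few years -> 1 k$/TB

    print(f"admin ~${admin_per_tb_per_year:,.0f}/TB/yr vs storage "
          f"${storage_per_tb_today:,.0f}/TB now, ${storage_per_tb_soon:,.0f}/TB soon")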


How do I manage 1,000 nodes?

  • You can’t manage 1,000 x (for any x).

  • They manage themselves.

    • You manage exceptional exceptions.
  • Auto Manage

    • Plug & Play hardware
    • Auto-load balance & placement storage & processing
    • Simple parallel programming model
    • Fault masking
  • Some positive signs:

    • Few admins at Google (10k nodes, 2 PB), Yahoo! (? nodes, 0.3 PB), Hotmail (10k nodes, 0.3 PB)


How do I program 1,000 nodes?

  • You can’t program 1,000 x (for any x).

  • They program themselves.

    • You write embarrassingly parallel programs
    • Examples: SQL, Web, Google, Inktomi, HotMail,….
    • PVM and MPI prove it must be automatic (unless you have a PhD)!
  • Auto Parallelism is ESSENTIAL



Plug & Play Software

  • RPC is standardizing: (SOAP/HTTP, COM+, RMI/IIOP)

    • Gives huge TOOL LEVERAGE
    • Solves the hard problems :
      • naming,
      • security,
      • directory service,
      • operations,...
  • Commoditized programming environments

    • FreeBSD, Linux, Solaris,… + tools
    • NetWare + tools
    • WinCE, WinNT,…+ tools
    • JavaOS + tools
  • Apps gravitate to data.

  • A general-purpose OS on a dedicated controller can run apps.



It’s Hard to Archive a Petabyte (it takes a LONG time to restore it)

  • At 1 GBps it takes 12 days! (arithmetic below)

  • Store it in two (or more) places online (on disk?). A geo-plex

  • Scrub it continuously (look for errors)

  • On failure,

    • use other copy until failure repaired,
    • refresh lost copy from safe copy.
  • Can organize the two copies differently (e.g.: one by time, one by space)
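
The unit conversion behind that restore-time figure:

    # Restore-time arithmetic: 1 PB copied at 1 GB per second.
    petabyte_gb = 1_000_000
    rate_gb_per_s = 1
    days = petabyte_gb / rate_gb_per_s / 86_400
    print(f"~{days:.1f} days")   # ~11.6 days, i.e. roughly the slide's 12 days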



Disk vs Tape

  • Disk

    • 160 GB
    • 25 MBps
    • 5 ms seek time
    • 3 ms rotate latency
    • 2$/GB for the drive, 1$/GB for controllers/cabinet
    • 4 TB/rack


I’m a disk bigot

  • I hate tape, tape hates me.

      • Unreliable hardware
      • Unreliable software
      • Poor human factors
      • Terrible latency, bandwidth
  • Disk

    • Much easier to use
    • Much faster
    • Cheaper!
    • But needs new concepts


Disk as Tape Challenges

  • Offline disk (safe from virus)

  • Trivialize Backup/Restore software

    • Things never change
    • Just object versions
  • Snapshot for continuous change (databases)

  • RAID in a SAN

    • (cross-disk journaling)
    • Massive replication (a la Farsite)


Summary

  • Disks will become supercomputers

  • Compete in Linux appliance space

  • Build best NAS software (compete with NetApp, ..)

  • Auto-manage huge storage farms (FarSite, SQL AutoAdmin++, …)

  • Build the world’s best disk-based backup system, including geoplex (compete with Veritas, …)

  • Push faster on 64-bit



Storage capacity beating Moore’s law

  • 2 k$/TB today (raw disk)

  • 1k$/TB by end of 2002



Trends: Magnetic Storage Densities

  • Amazing progress

  • Ratios have changed

  • Capacity grows 60%/y

  • Access speed grows 10x more slowly



Trends: Density Limits

  • The end is near!

  • Products: 23 Gbpsi; Lab: 50 Gbpsi; “limit”: 60 Gbpsi

  • But limit keeps rising & there are alternatives



CyberBricks

  • Disks are becoming supercomputers.

  • Each disk will be a file server then SOAP server

  • Multi-disk bricks are transitional

  • Long-term brick will have OS per disk.

  • Systems will be built from bricks.

  • There will also be

    • Network Bricks
    • Display Bricks
    • Camera Bricks
    • ….


Data Centric Computing

  • Jim Gray

  • Microsoft Research

  • Research.Microsoft.com/~Gray/talks

  • FAST 2002

  • Monterey, CA, 14 Oct 1999



Communications Excitement!!



Information Excitement!

  • But comm just carries information

  • Real value added is

    • information capture & rendering: speech, vision, graphics, animation, …
    • information storage & retrieval,
    • information analysis


Information At Your Fingertips

  • All information will be in an online database (somewhere)

  • You might record everything you

    • read: 10MB/day, 400 GB/lifetime (5 disks today)
    • hear: 400MB/day, 16 TB/lifetime (2 disks/year today)
    • see: 1 MB/s, 40 GB/day, 1.6 PB/lifetime (150 disks/year, maybe someday); arithmetic below
  • Data storage, organization, and analysis are the challenge.

  • text, speech, sound, vision, graphics, spatial, time…

  • Information at Your Fingertips

    • Make it easy to capture
    • Make it easy to store & organize & analyze
    • Make it easy to present & access
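
The lifetime totals above roughly follow from the daily rates; the ~100-year recording span is my inference from the round numbers, not something the slide states:

    # Lifetime-storage arithmetic for the read/hear/see bullets.
    days = 100 * 365                       # ~100-year span (assumed)

    read_gb = 0.010 * days                 # 10 MB/day  -> ~365 GB  (slide: ~400 GB)
    hear_tb = 0.400 * days / 1_000         # 400 MB/day -> ~14.6 TB (slide: ~16 TB)
    see_pb  = 40.0  * days / 1_000_000     # 40 GB/day  -> ~1.46 PB (slide: ~1.6 PB)
    print(f"read ~{read_gb:.0f} GB, hear ~{hear_tb:.1f} TB, see ~{see_pb:.2f} PB per lifetime")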


How much information is there?

  • Soon everything can be recorded and indexed

  • Most bytes will never be seen by humans.

  • Data summarization, trend detection, and anomaly detection are key technologies

  • See Mike Lesk, “How Much Information Is There?”: http://www.lesk.com/mlesk/ksg97/ksg.html

  • See Lyman & Varian, “How Much Information?”: http://www.sims.berkeley.edu/research/projects/how-much-info/



Why Put Everything in Cyberspace?



Disk Storage Cheaper than Paper

  • File Cabinet:

    • cabinet (4 drawer): 250$
    • paper (24,000 sheets): 250$
    • space (2x3 ft @ 10$/ft2): 180$
    • total: 700$, about 3 ¢/sheet
  • Disk (160 GB): 300$

    • ASCII: 100 m pages, 0.0001 ¢/sheet (10,000x cheaper)
    • Image: 1 m photos, 0.03 ¢/sheet (100x cheaper)
  • Store everything on disk (arithmetic below)
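
The per-sheet arithmetic, using the slide’s figures:

    # Cost per sheet: a paper file cabinet vs a 160 GB disk.
    cabinet_cents_per_sheet = 700 * 100 / 24_000        # ~2.9 ¢/sheet ("3 ¢/sheet")

    disk_dollars = 300
    ascii_cents  = disk_dollars * 100 / 100_000_000     # 100 m ASCII pages -> ~0.0003 ¢/sheet
    image_cents  = disk_dollars * 100 / 1_000_000       # 1 m photo pages   -> ~0.03 ¢/sheet

    print(f"paper ~{cabinet_cents_per_sheet:.1f} ¢/sheet, ASCII ~{ascii_cents:.4f} ¢/sheet "
          f"(~{cabinet_cents_per_sheet/ascii_cents:,.0f}x cheaper), image ~{image_cents:.2f} ¢/sheet "
          f"(~{cabinet_cents_per_sheet/image_cents:.0f}x cheaper)")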



Gordon Bell’s MainBrain™: Digitize Everything. A BIG shoebox?

  • Scans: 20 k “pages” (TIFF @ 300 dpi) 1 GB

  • Music: 2 k “tracks” 7 GB

  • Photos: 13 k images 2 GB

  • Video: 10 hrs 3 GB

  • Docs: 3 k (ppt, word,..) 2 GB

  • Mail: 50 k messages 1 GB

    • Total: ~16 GB (sum checked below)
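
A trivial check that the items add up to that total:

    # The shoebox items above sum to the stated ~16 GB.
    gb = {"scans": 1, "music": 7, "photos": 2, "video": 3, "docs": 2, "mail": 1}
    print(sum(gb.values()), "GB")   # 16 GB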


Gary Starkweather

  • Scan EVERYTHING

  • 400 dpi TIFF

  • 70k “pages” ~ 14GB

  • OCR all scans (98% recognition accuracy)

  • All indexed (5 second access to anything)

  • All on his laptop.



Q: What happens when the personal terabyte arrives?

  • Q: What happens when the personal terabyte arrives?

  • A: Things will run SLOWLY…. unless we add good software



Summary

  • Disks will morph to appliances

  • Main barriers to this happening

    • Lack of Cool Apps
    • Cost of Information management


The “Absurd” Disk

  • 2.5 hr scan time (poor sequential access)

  • 1 access per second per 5 GB (VERY cold data); arithmetic below

  • It’s a tape!
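
Where those numbers come from, assuming a near-future 1 TB drive with roughly 110 MBps sequential rate and ~200 random accesses per second (the drive parameters are my assumptions):

    # The "absurd" 1 TB drive: long scan time, very cold data.
    capacity_gb = 1_000
    seq_mbps    = 110          # assumed sustained sequential rate
    iops        = 200          # assumed random accesses per second per arm

    scan_hours    = capacity_gb * 1_000 / seq_mbps / 3_600   # ~2.5 hours to read it all
    gb_per_access = capacity_gb / iops                        # ~5 GB per access per second
    print(f"~{scan_hours:.1f} h full scan, 1 access/sec per ~{gb_per_access:.0f} GB")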



Crazy Disk Ideas

  • Disk Farm on a card: surface mount disks

  • Disk (magnetic store) on a chip: (micro machines in Silicon)

  • Full Apps (e.g. SAP, Exchange/Notes,..) in the disk controller (a processor with 128 MB dram)



The Disk Farm On a Card

  • The 500GB disc card

  • An array of discs

  • Can be used as:

    • 100 discs
    • 1 striped disc
    • 50 fault-tolerant discs
    • …etc
  • LOTS of accesses/second

  • bandwidth



Trends: promises NEMS (Nano Electro Mechanical Systems) (http://www.nanochip.com/) also Cornell, IBM, CMU,…

  • 250 Gbpsi by using a tunneling electron microscope

  • Disk replacement

    • Capacity: 180 GB now, 1.4 TB in 2 years
    • Transfer rate: 100 MB/sec R&W
    • Latency: 0.5msec
    • Power: 23 W active, 0.05 W standby
    • 10k$/TB now, 2k$/TB in 2004


Trends: Gilder’s Law: 3x bandwidth/year for 25 more years

  • Today:

    • 40 Gbps per channel (λ)
    • 12 channels per fiber (WDM): ~500 Gbps per fiber
    • 32 fibers/bundle ≈ 16 Tbps/bundle (arithmetic below)
  • In lab 3 Tbps/fiber (400 x WDM)

  • In theory 25 Tbps per fiber

  • 1 Tbps = USA 1996 WAN bisection bandwidth

  • Aggregate bandwidth doubles every 8 months!
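
Multiplying those per-wavelength figures out:

    # Per-fiber and per-bundle bandwidth from the per-wavelength figure.
    gbps_per_lambda   = 40
    lambdas_per_fiber = 12
    fibers_per_bundle = 32

    per_fiber_gbps  = gbps_per_lambda * lambdas_per_fiber          # 480 Gbps (~500)
    per_bundle_tbps = per_fiber_gbps * fibers_per_bundle / 1_000   # ~15.4 Tbps (~16)
    print(f"{per_fiber_gbps} Gbps per fiber, ~{per_bundle_tbps:.1f} Tbps per bundle")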



Technology Drivers: What if Networking Was as Cheap As Disk IO?

  • TCP/IP

    • Unix/NT: 100% CPU @ 40 MBps


SAN: Standard Interconnect

  • LAN faster than memory bus?

  • 1 GBps links in lab.

  • 100$ port cost soon

  • Port is computer



Building a Petabyte Store

  • EMC: ~500 k$/TB = 500 M$/PB; plus FC switches, etc. ≈ 800 M$/PB

  • TPC-C SANs (Dell 18 GB/…): 62 M$/PB

  • Dell local SCSI, 3ware: 20 M$/PB

  • Do it yourself: 5M$/PB



The Cost of Storage (heading for 1K$/TB soon)



Cheap Storage or Balanced System

  • Low-cost storage (2 x 1.5 k$ servers): 6 K$/TB; 2 x (1 K$ system + 8 x 80 GB disks + 100 Mb Ethernet)

  • Balanced server (7k$/.5 TB)

    • 2x800Mhz (2k$)
    • 256 MB (400$)
    • 8 x 80 GB drives (2K$)
    • Gbps Ethernet + switch (1k$)
    • 11 k$/TB, 22 k$/RAIDed TB (sum below)
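
Summing the balanced-server bullet; the component prices are the slide’s, and I take the heading’s ~0.5 TB as the usable capacity:

    # Balanced server: cost per TB from the listed components.
    cpu, ram, disks, net = 2_000, 400, 2_000, 1_000
    server_cost = cpu + ram + disks + net          # ~5.4 k$
    usable_tb   = 0.5                              # per the slide heading

    per_tb        = server_cost / usable_tb        # ~11 k$/TB
    per_tb_raided = 2 * per_tb                     # ~22 k$/TB mirrored
    print(f"~{per_tb/1_000:.0f} k$/TB, ~{per_tb_raided/1_000:.0f} k$/RAIDed TB")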


320 GB, 2k$ (now)

  • 4 x 80 GB IDE (2 hot-pluggable)

    • (1,000$)
  • SCSI-IDE bridge

    • 200$
  • Box

    • 500 MHz CPU
    • 256 MB RAM
    • fan, power, Ethernet
    • 700$
  • Or 8 disks/box 640 GB for ~3K$ ( or 300 GB RAID)





Hot Swap Drives for Archive or Data Interchange

  • 25 MBps write (so can write N x 160 GB in 3 hours)

  • 160 GB overnight

  • = ~N x 4 MB/second

  • @ 19.95$/night (arithmetic below)
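
Rough arithmetic for those bullets. The 12-hour overnight window is my assumption, and at a full 25 MBps a drive actually fills in under two hours (the slide’s ~3 hours presumably allows for overhead):

    # Sneakernet arithmetic: filling and shipping 160 GB drives.
    drive_gb, write_mbps = 160, 25

    fill_hours     = drive_gb * 1_000 / write_mbps / 3_600     # ~1.8 h per drive (N drives fill in parallel)
    overnight_mbps = drive_gb * 1_000 / (12 * 3_600)           # ~3.7 MB/s effective per shipped drive
    print(f"~{fill_hours:.1f} h to write 160 GB, ~{overnight_mbps:.1f} MB/s per drive shipped overnight")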



Data delivery costs 1$/GB today

  • Rent for “big” customers: 300$/megabit per second per month

  • Improved 3x in last 6 years (!).

  • That translates to 1$/GB at each end (arithmetic below).

  • You can mail a 160 GB disk for 20$.

    • That’s 16x cheaper
    • If overnight it’s 3 MBps.
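
Converting the rent into $/GB and comparing with shipping a disk:

    # WAN rent vs shipping a disk, in $/GB.
    rent_per_mbps_month = 300
    seconds_per_month   = 30 * 86_400

    gb_per_month = 1e6 * seconds_per_month / 8 / 1e9          # 1 Mbps for a month ≈ 324 GB
    net_per_gb   = rent_per_mbps_month / gb_per_month         # ≈ 0.93 $/GB at each end
    ship_per_gb  = 20 / 160                                    # mail a 160 GB disk for 20$ ≈ 0.13 $/GB
    print(f"network ~{net_per_gb:.2f} $/GB per end vs shipping ~{ship_per_gb:.2f} $/GB")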


Data on Disk Can Move to RAM in 8 years



Storage Latency: How Far Away is the Data?



More Kaps and Kaps/$ but….

  • Disk accesses got much less expensive: better disks, cheaper disks!

  • But: disk arms are expensive; they are the scarce resource

  • A full scan now takes ~1 hour vs ~5 minutes in 1990 (arithmetic below)
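
The scan-time trend, with representative drive figures assumed for each era:

    # Scan time = capacity / sequential bandwidth (illustrative drive figures).
    def scan_minutes(capacity_gb, mbps):
        return capacity_gb * 1_000 / mbps / 60

    print(f"1990: ~{scan_minutes(1, 4):.0f} min for a 1 GB drive at 4 MBps")        # ~4 min
    print(f"2002: ~{scan_minutes(160, 40):.0f} min for a 160 GB drive at 40 MBps")  # ~67 min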



Backup: 3 scenarios

  • Disaster Recovery: Preservation through Replication

  • Hardware Faults: different solutions for different situations

    • Clusters,
    • load balancing,
    • replication,
    • tolerate machine/disk outages
    • (Avoided RAID and expensive, low volume solutions)
  • Programmer Error: versioned duplicates (no deletes)



Online Data

  • Can build 1PB of NAS disk for 5M$ today

  • Can SCAN (read or write) entire PB in 3 hours.

      • Operate it as a data pump: continuous sequential scan
  • Can deliver 1PB for 1M$ over Internet

    • Access charge is 300$/Mbps bulk rate
  • Need to Geoplex data (store it in two places).

  • Need to filter/process data near the source,

    • To minimize network costs.



