MapReduce/Hadoop





Apache top level project, open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.


  • A flexible and highly-available architecture for large scale computation and data processing on a network of commodity hardware.

  • Open-source implementation of Google's MapReduce

  • Based on the MapReduce programming model

  • Based on a simple data model that works for any kind of data



MapReduce computing paradigm (e.g., Hadoop) vs. Traditional database systems










Need to parallelize computation across thousands of nodes


  • Commodity hardware

    • Large number of low-end cheap machines working in parallel to solve a computing problem
    • In contrast to parallel DBs: a small number of high-end, expensive machines
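The divide-and-conquer idea behind running on many cheap machines can be sketched in miniature. The following is a hypothetical pure-Python simulation of the map, shuffle, and reduce phases for a word count (not Hadoop's actual API; splits stand in for the input each node would process):

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) pair for every word in an input split."""
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# In a real cluster, each split is processed by a different low-end node in parallel.
splits = ["big data big", "data big deal"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
```

Because mappers work independently on their own splits, adding more commodity nodes scales the map phase almost linearly.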




Large: A HDFS instance may consist of thousands of server machines, each storing part of the file system’s data


  • Replication: Each data block is replicated many times (default is 3)

  • Fault Tolerance: Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS

    • Namenode is consistently checking Datanodes: The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.
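The heartbeat mechanism above can be sketched as follows. This is an illustrative simulation, not HDFS code; the class name and the 10-minute dead-node threshold are assumptions based on HDFS defaults:

```python
# Hypothetical sketch of heartbeat-based failure detection on the NameNode.
HEARTBEAT_INTERVAL = 3      # DataNodes report every 3 seconds
DEAD_AFTER = 10 * 60        # assume a node is dead after ~10 minutes of silence

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}    # datanode -> time of last heartbeat

    def receive_heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def dead_nodes(self, now):
        """Any DataNode silent for longer than the threshold is declared dead."""
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t > DEAD_AFTER]

monitor = NameNodeMonitor()
monitor.receive_heartbeat("dn1", now=0)
monitor.receive_heartbeat("dn2", now=0)
monitor.receive_heartbeat("dn2", now=590)   # dn2 keeps reporting; dn1 goes silent
dead = monitor.dead_nodes(now=601)
```

Once a DataNode is declared dead, the NameNode schedules re-replication of the blocks it held.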


Hadoop is designed as a master-slave shared-nothing architecture







  • NameNode

    • Manages File System Namespace
      • Maps a file name to a set of blocks
      • Maps a block to the DataNodes where it resides
      • FsImage + EditLog
    • Cluster Configuration Management
    • Replication Engine for Blocks
  • DataNode

    • A Block Server: Stores data in the local file system (e.g. ext3); Stores metadata of a block; Serves data and metadata to Clients
    • Block Report
      • Periodically sends a report of all existing blocks to the NameNode
    • Facilitates Pipelining of Data
      • Forwards data to other specified DataNodes
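The NameNode's two core mappings from the bullets above can be modeled as plain dictionaries. This is a toy illustration of the namespace, not HDFS internals; the file path, block IDs, and node names are made up:

```python
# Toy model of the NameNode's metadata:
# file name -> ordered list of block IDs, block ID -> DataNodes holding replicas.
file_to_blocks = {
    "/logs/2017/part-0": ["blk_1", "blk_2"],
}
block_to_datanodes = {
    "blk_1": ["dn1", "dn3", "dn7"],   # default replication factor of 3
    "blk_2": ["dn2", "dn3", "dn9"],
}

def locate(path):
    """Resolve a file to the DataNodes serving each of its blocks."""
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

locations = locate("/logs/2017/part-0")
```

Clients ask the NameNode only for this mapping; the actual block bytes are then read directly from the DataNodes.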




Shell commands: the most common is fs, used as: hadoop fs [genericOptions] [commandOptions]


  • hadoop fs -ls: display detailed info for the file(s) specified by path

  • hadoop fs -mkdir: create a folder

  • hadoop fs -cat: write file contents to stdout

  • hadoop fs -copyFromLocal: copy a local file into HDFS








  • Block Placement: How to place data blocks?

    • One replica on local node, second/third on same remote rack, additional replica randomly placed
    • Clients read from nearest replicas
  • Replication Engine

    • NameNode detects DataNode failures
      • Chooses new DataNodes for new replicas
      • Balances disk usage
      • Balances communication traffic to DataNodes
  • Rebalancer: % of disk full on DataNodes should be similar

    • Run when new datanodes are added
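The default placement policy from the bullets above can be sketched as a small function. This is a hypothetical simplification covering only the three default replicas (rack names and the `place_replicas` helper are invented for illustration):

```python
import random

def place_replicas(local_node, racks, rng=random):
    """Sketch of HDFS default placement: first replica on the writer's node,
    second and third on two different nodes of one other (remote) rack."""
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    return [local_node] + rng.sample(racks[remote_rack], 2)

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
replicas = place_replicas("dn1", racks)
```

Placing two replicas off-rack survives the loss of an entire rack, while keeping one replica local makes the client's write fast.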


  • HDFS treats faults as the norm, not the exception

    • Namenode failure
    • Datanode failure
    • Data error
  • Heartbeats

    • DataNodes send heartbeats to the NameNode
      • Once every 3 seconds
    • NameNode uses heartbeats to detect DataNode failure
  • Namenode failure:

    • FsImage, Editlog -> SecondaryNameNode
    • Transaction Log + standby NN
  • Data error

    • md5/sha1 validation
    • client check/report -> namenode replication









  • Job Tracker is the master node (runs with the namenode)

    • Receives the user’s job
    • Decides on how many tasks will run (number of mappers)
    • Decides on where to run each mapper (concept of locality)



  • Task Tracker is the slave node (runs on each datanode)

    • Receives the task from Job Tracker
    • Runs the task until completion (either map or reduce task)
    • Always in communication with the Job Tracker reporting progress
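The "concept of locality" mentioned above — running each mapper where its input data already lives — can be sketched as a scheduling preference. The function and node names here are hypothetical:

```python
# Hypothetical sketch of locality-aware scheduling: prefer a TaskTracker on a
# node that already stores the input split's data.
def assign_task(split_locations, free_tasktrackers):
    for tracker in free_tasktrackers:
        if tracker in split_locations:      # data-local: no network transfer
            return tracker
    return free_tasktrackers[0]             # fall back to any free node

split_locations = {"dn3", "dn5"}            # DataNodes holding the split's replicas
local = assign_task(split_locations, ["dn1", "dn5", "dn8"])
remote = assign_task(split_locations, ["dn1", "dn2"])
```

Moving computation to the data, rather than data to the computation, is what makes MapReduce cheap at scale.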




Create a launching program for your application


  • The launching program configures:

    • The Mapper and Reducer to use
    • The output key and value types (input types are inferred from the InputFormat)
    • The locations for your input and output
  • The launching program then submits the job and typically waits for it to complete
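Besides the Java launching program described above, Hadoop Streaming lets the mapper and reducer be plain scripts reading stdin. A minimal word-count pair might look like the sketch below (simulated in one process here; in a real job the framework runs them separately and sorts the mapper output by key in between):

```python
from itertools import groupby

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming reducer: input arrives sorted by key, so counts can be
    summed per contiguous group of identical words."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# sorted() stands in for the framework's shuffle/sort step.
mapped = sorted(mapper(["hadoop is big", "big data"]))
result = list(reducer(mapped))
```

With real Streaming, the same logic would be submitted via the hadoop jar command with -mapper and -reducer pointing at the scripts.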










  • Pig

    • High-level language for data analysis
  • HBase

    • Table storage for semi-structured data
  • Hive

    • SQL-like Query language and Metastore
  • Mahout

    • Machine learning
  • Zookeeper

    • Coordinating distributed applications





  • Hive: data warehousing application in Hadoop

    • Query language is HQL, variant of SQL
    • Tables stored on HDFS as flat files
    • Developed by Facebook, now open source
  • Pig: large-scale data processing system

    • Scripts are written in Pig Latin, a dataflow language
    • Developed by Yahoo!, now open source
    • Roughly 1/3 of all Yahoo! internal jobs
  • Common idea:

    • Provide higher-level language to facilitate large-data processing
    • Higher-level language “compiles down” to Hadoop jobs



  • Hadoop is great for large-data processing!

    • But writing Java programs for everything is verbose and slow
    • Not everyone wants to (or can) write Java code
  • Solution: develop higher-level data processing languages

    • Hive: HQL is like SQL
    • Pig: Pig Latin is a bit like Perl


Developed at Facebook


  • Used for majority of Facebook jobs

  • “Relational database” built on Hadoop

    • Maintains list of table schemas
    • SQL-like query language (HiveQL)
    • Can call Hadoop Streaming scripts from HiveQL
    • Supports table partitioning, clustering, complex data types, some optimizations
  • Used by people with strong SQL skills but limited programming ability.




  • Tables

    • Typed columns (int, float, string, boolean)
    • Also list and map types (for JSON-like data)
  • Partitions

    • For example, range-partition tables by date
  • Buckets

    • Hash partitions within ranges (useful for sampling, join optimization)
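The bucketing idea above can be sketched as hashing the bucketing column modulo the bucket count. This is an illustrative simplification (Hive's actual hash function varies by column type), with made-up user IDs:

```python
# Sketch of bucket assignment: hash(column) % num_buckets.
NUM_BUCKETS = 4

def bucket_for(user_id):
    # For integer columns the hash is effectively the value itself.
    return user_id % NUM_BUCKETS

rows = [3, 4, 7, 8, 11]
buckets = {uid: bucket_for(uid) for uid in rows}
```

Because every row with the same key lands in the same bucket file, a join on the bucketing column can match bucket against bucket, and sampling one bucket gives a cheap 1/NUM_BUCKETS sample.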



  • Warehouse directory in HDFS

    • E.g., /user/hive/warehouse
  • Tables stored in subdirectories of warehouse

    • Partitions form subdirectories of tables
  • Actual data stored in flat files

    • Control char-delimited text, or SequenceFiles
    • With custom SerDe, can use arbitrary format


Partitioning breaks table into separate files for each (dt, country) pair


  • Ex: /hive/page_view/dt=2008-06-08/country=USA

  • /hive/page_view/dt=2008-06-08/country=CA
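The partition-to-path scheme can be sketched as simple string construction. This is illustrative: it assumes the default warehouse directory mentioned earlier and a `partition_path` helper that is not part of Hive:

```python
# Sketch of how Hive lays out partitions as nested key=value directories
# under the table's directory in the warehouse.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, dt, country):
    return f"{WAREHOUSE}/{table}/dt={dt}/country={country}"

path = partition_path("page_view", "2008-06-08", "USA")
```

Queries that filter on dt and country can then skip every directory whose key=value components don't match — partition pruning without reading any data.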







Started at Yahoo! Research


  • Now runs about 30% of Yahoo!’s jobs

  • Features

    • Expresses sequences of MapReduce jobs
    • Data model: nested “bags” of items
    • Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
    • Easy to plug in Java functions
    • https://pig.apache.org/














Schema and type checking


  • Translating into efficient physical dataflow

    • (i.e., sequence of one or more MapReduce jobs)
  • Exploiting data reduction opportunities

    • (e.g., early partial aggregation via a combiner)
  • Executing the system-level dataflow

    • (i.e., running the MapReduce jobs)
  • Tracking progress, errors, etc.
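The "early partial aggregation via a combiner" step above is worth a concrete sketch: each mapper collapses its local (word, 1) pairs before anything crosses the network. A pure-Python illustration (not Pig's implementation):

```python
from collections import Counter

def combine(mapper_output):
    """Combiner: pre-aggregate one mapper's (word, 1) pairs locally."""
    return Counter(word for word, _ in mapper_output)

# Raw output of two hypothetical mapper nodes.
node1 = [("a", 1), ("a", 1), ("b", 1)]
node2 = [("a", 1), ("b", 1)]
combined1, combined2 = combine(node1), combine(node2)

# The reducer then merges the much smaller partial counts.
final = combined1 + combined2
shipped_pairs = len(combined1) + len(combined2)   # 4 pairs shipped instead of 5
```

The saving grows with skew: a word appearing a million times on one node ships as a single partial count.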





An open-source, distributed, column-oriented database built on top of HDFS, based on BigTable


  • Modeled on Google’s Bigtable

  • Row/column store

  • Billions of rows, millions of columns

  • Column-oriented - nulls are free

  • Untyped - stores byte[]



Tables have one primary index, the row key.


  • No join operators.

  • Scans and queries can select a subset of available columns, perhaps by using a wildcard.

  • There are three types of lookups:

    • Fast lookup using row key and optional timestamp.
    • Full table scan
    • Range scan from region start to end.
  • Limited atomicity and transaction support.

    • HBase supports multiple batched mutations of single rows only.
    • Data is unstructured and untyped.
  • Not accessed or manipulated via SQL.



Tables are sorted by Row


  • A table schema only defines its column families.

    • Each family consists of any number of columns
    • Each column consists of any number of versions
    • Columns only exist when inserted, NULLs are free.
    • Columns within a family are sorted and stored together
  • Everything except table names is byte[]

  • (Row, Family:Column, Timestamp) → Value
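The (Row, Family:Column, Timestamp) → Value map above can be modeled as a nested dictionary. This is a toy model, not HBase's storage format; the row and column names echo the API examples later in the deck:

```python
# Toy model of HBase's data model: sparse map keyed by (row, family:column),
# each cell holding multiple timestamped byte[] versions.
table = {}

def put(row, column, timestamp, value):
    table.setdefault((row, column), {})[timestamp] = value

def get(row, column, timestamp=None):
    versions = table.get((row, column), {})
    if not versions:
        return None            # absent columns cost nothing: no NULLs are stored
    ts = timestamp if timestamp is not None else max(versions)  # newest by default
    return versions.get(ts)

put("enclosure1", "animal:type", 100, b"zebra")
put("enclosure1", "animal:type", 200, b"lion")
latest = get("enclosure1", "animal:type")                 # newest version
old = get("enclosure1", "animal:type", timestamp=100)     # time-travel read
```

Sparsity is the point: a row stores only the columns actually inserted, which is why "NULLs are free".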







Retrieve a cell


  • Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();

  • Retrieve a row

  • RowResult = table.getRow("enclosure1");

  • Scan through a range of rows

  • Scanner s = table.getScanner(new String[] { "animal:type" });





Coordination: An act that multiple nodes must perform together.


  • Examples:

    • Group membership
    • Locking
    • Publisher/Subscriber
    • Leader Election
    • Synchronization
  • Getting node coordination correct is very hard!



An open source, high-performance coordination service for distributed applications.


  • Exposes common services through a simple interface:

    • naming
    • configuration management
    • locks & synchronization
    • group services
    • … developers don't have to write them from scratch
  • Build your own on it for specific needs.
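One of the services listed above, leader election, has a classic ZooKeeper recipe: each candidate creates an ephemeral sequential znode, and the lowest sequence number leads. The sketch below is a pure-Python simulation of that recipe (a real client would use a library such as Apache Curator or kazoo):

```python
import itertools

counter = itertools.count()
znodes = {}   # znode name -> owning client session

def volunteer(client):
    """Create an ephemeral sequential znode under /election."""
    name = f"/election/n_{next(counter):010d}"
    znodes[name] = client
    return name

def leader():
    """The client owning the lowest sequence number is the leader."""
    return znodes[min(znodes)]

def disconnect(client):
    """Ephemeral znodes vanish with their session, triggering re-election."""
    for name in [n for n, c in znodes.items() if c == client]:
        del znodes[name]

volunteer("A"); volunteer("B"); volunteer("C")
first_leader = leader()        # "A" holds the lowest sequence number
disconnect("A")                # A's session dies
next_leader = leader()         # leadership passes to "B"
```

In the real recipe each candidate watches only the znode immediately before its own, avoiding a thundering herd when the leader dies.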




  • Configuration Management

    • Cluster member nodes bootstrap their configuration from a centralized source in an unattended way
    • Easier, simpler deployment/provisioning
  • Distributed Cluster Management

    • Node join / leave
    • Node statuses in real time
  • Naming service – e.g. DNS

  • Distributed synchronization - locks, barriers, queues

  • Leader election in a distributed system.

  • Centralized and highly reliable (simple) data registry



  • ZooKeeper Service is replicated over a set of machines

  • All machines store a copy of the data (in memory)

  • A leader is elected on service startup

  • Each client connects to a single ZooKeeper server and maintains a TCP connection.

  • Clients can read from any ZooKeeper server; writes go through the leader and require majority consensus.



Maintain a stat structure with version numbers for data changes, ACL changes and timestamps.


  • Version numbers increase with changes

  • Data is read and written atomically
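The stat structure's version numbers enable conditional (compare-and-set) writes: a set fails if the caller's expected version is stale. The class below is a simulation of that behavior, not ZooKeeper's real API:

```python
# Sketch of ZooKeeper-style versioned writes: the znode's stat version
# increments on every data change, and a conditional set is rejected
# when another writer got there first.
class ZNode:
    def __init__(self, data):
        self.data = data
        self.version = 0

    def set_data(self, data, expected_version):
        if expected_version != self.version:
            return False          # stale version: someone else wrote first
        self.data = data
        self.version += 1
        return True

node = ZNode(b"config-v1")
ok = node.set_data(b"config-v2", expected_version=0)        # succeeds
stale = node.set_data(b"config-oops", expected_version=0)   # version is now 1
```

This optimistic-concurrency pattern is how clients build locks and atomic counters on top of plain znodes.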



Sequential Consistency: Updates are applied in order


  • Atomicity: Updates either succeed or fail

  • Single System Image: A client sees the same view of the service regardless of the ZK server it connects to.

  • Reliability: Updates persist once applied, until overwritten by some client.

  • Timeliness: The clients’ view of the system is guaranteed to be up-to-date within a certain time bound. (Eventual Consistency)




  • Companies:

    • Yahoo!
    • Zynga
    • Rackspace
    • LinkedIn
    • Netflix
    • and many more…


Used within Twitter for service discovery


  • How?

    • Services register themselves in ZooKeeper
    • Clients query the production cluster for service “A” in data center “XYZ”
    • An up-to-date host list for each service is maintained
    • Whenever new capacity is added, clients automatically become aware of it
    • Also, enables load balancing across all servers.


  • The Google File System
  • The Hadoop Distributed File System
  • MapReduce: Simplified Data Processing on Large Clusters
  • Bigtable: A Distributed Storage System for Structured Data
  • PNUTS: Yahoo!'s Hosted Data Serving Platform
  • Dynamo: Amazon's Highly Available Key-value Store
  • Spanner: Google's Globally Distributed Database
  • Centrifuge: Integrated Lease Management and Partitioning for Cloud Services (Microsoft)
  • ZAB: A simple totally ordered broadcast protocol (Yahoo!)
  • Paxos Made Simple by Leslie Lamport
  • Eventually Consistent by Werner Vogels (CTO, Amazon)
  • http://www.highscalability.com/


