Power point
CaliforniaAn introduction to
Class Presentation by
Damon A. Runion
MIS 2321 - Spring 2017
Hello and welcome to An Introduction to Hadoop
Data Everywhere
“Every two days now we create as much information as we did from the dawn of civilization up until 2003”
Eric Schmidt
then CEO of Google
Aug 4, 2010
Read this quote. That data is something like 4 exabytes.
The Hadoop Project
Originally based on papers published by Google in 2003 and 2004
Hadoop started in 2006 at Yahoo!
- Top level Apache Foundation project
- Large, active user base, user groups
- Very active development, strong development team
One way to do that analysis is through Hadoop
Who Uses Hadoop?
Rackspace for log processing. Netflix for recommendations. LinkedIn for social graph. SU for page recommendations.
Hadoop Components
Storage
Self-healing
high-bandwidth
clustered storage
Processing
Fault-tolerant
distributed
processing
HDFS
MapReduce
HDFS cluster/healing. MapReduce
HDFS Basics
HDFS is a filesystem written in Java
- Sits on top of a native filesystem
- Provides redundant storage for massive amounts of data
- Use cheap(ish), unreliable computers
Let’s talk about HDFS
HDFS Data
- Data is split into blocks and stored on multiple nodes in the cluster
- Each block is usually 64 MB or 128 MB (conf)
- Each block is replicated multiple times (conf)
- Replicas stored on different data nodes
- Large files, 100 MB+
What is MapReduce?
MapReduce is a method for distributing a task across multiple nodes
Automatic parallelization and distribution
- Each node processes data stored on that node (processing goes to the data, unlike Databases where data is brought to the query engine)