Hadoop
- It's an open source software platform for
distributed storage
anddistributed processing
oflarge datasets
oncomputer clusters
- A tool to
analyze
andtransform
massive amount of data -
Hadoop filesystem:
HDFS
-
Started with Google in 2003
Google Filesystem (GFS)
: for distributed storageMapReduce
: for distributed processing
Hadoop Ecosystem
-
Core Hadoop Ecosystem
-
HDFS
- Hadoop Distributed File System
- Consolidate the content of various hard drivers into a giant single drive
- Backup, replication and restore mechanisms
Yarn
- Yet Another Resource Negotiator
- Manages the computing resources on the system
- Decides what gets to run, when and where
Mesos
- Alternative to Yarn (solves the same problem)
MapReduce
- Programming model to process data across the cluster
- Mappers: transforms the data
- Reducers: aggregate the data together
Pig
- Alternative to write mapreduce code (instead of using java, python, etc)
- Good for simple scripts and tasks
Hive
- Allows fetching data from HDFS using SQL queries
- It's a translation layer
- Hive can be used with MapReduce or Tez
Spark
- Run queries to process or fetch data on your cluster
- Can be used with yarn or mesos
- Spark scripts can be written in Python, Java or Scala
- Handles SQL queries, do machine learning, streams data in real time
Tez
- Uses same technique as Spark
- Crete plans to execute the queries
HBase
- A way to expose the data on the cluster to transaction platforms (other systems)
- NoSQL database (Wide-Column Datastore)
Storm
- A way to process streaming data. E.g., data from sensors
Oozie
- A way to schedule jobs on the cluster
Zookeeper
- Coordination of everything on the cluster
- Shares the state across the cluster
- E.g., which nodes are down/up
Ambari
- It's a dashboard to manage all the hadoop components
- Have a view about the current state of the cluster
- Used by the Hortonworks hadoop distribution
Scoop
- Tie the hadoop fs into a relational database
- It's a connector between hadoop and legacy databases
Flume
- Listen to web logs coming from web servers in real time and save in the HDFS
-
Kafka
- Data in/out of the cluster
-
-
External Storage
-
MySQL
- A way to expose the data for real time usable
Cassandra
- A way to expose the data for real time usable
MongoDB
- A way to expose the data for real time usable
-
Hbase
- It's part of the hadoop core
-
-
Query Engines
-
Drill
- SQL queries
- Fetch data across multiple NoSQL databases
- Consolidate the data together
Hue
- Queries
Phoenix
- SQL queries
- Has some additional guarantees
Presto
- Queries
Zeppelin
- Notebook UI approach
-
Hive
- It's part of the hadoop core
-
Hadoop Distributions
- Hadoop distributions are used to provide scalable, distributed computing against on-premises and cloud-based file store data
- Distributions are composed of commercially packaged and supported editions of open-source
Apache Hadoop-related
projects -
Distributions provide access to applications, query/reporting tools, machine learning and data management infrastructure components.
-
Distributions
Cloudera Hadoop
(CDH)- A cloud solution
- https://www.cloudera.com/downloads/cdh.html
Hortonworks Data Platform
(HDP)- Bought by cloudera
- https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html
MapR