
Hadoop

  • It's an open-source software platform for distributed storage and distributed processing of large datasets on computer clusters
  • A tool to analyze and transform massive amounts of data
  • Hadoop filesystem: HDFS

  • Grew out of ideas Google published starting in 2003 (the Google File System paper in 2003, the MapReduce paper in 2004)

  • Google File System (GFS): for distributed storage
  • MapReduce: for distributed processing

Hadoop Ecosystem

  • Core Hadoop Ecosystem

  • HDFS

    • Hadoop Distributed File System
    • Consolidates the content of many hard drives across the cluster into one giant virtual drive
    • Backup, replication and restore mechanisms
  • YARN
    • Yet Another Resource Negotiator
    • Manages the computing resources on the system
    • Decides what gets to run, when and where
  • Mesos
    • Alternative to YARN (solves the same problem)
  • MapReduce
    • Programming model to process data across the cluster
    • Mappers: transform the data
    • Reducers: aggregate the data together
  • Pig
    • Alternative to writing MapReduce code directly (in Java, Python, etc.)
    • Good for simple scripts and tasks
  • Hive
    • Allows fetching data from HDFS using SQL queries
    • It's a translation layer
    • Hive can be used with MapReduce or Tez
  • Spark
    • Run queries to process or fetch data on your cluster
    • Can be used with YARN or Mesos
    • Spark scripts can be written in Python, Java or Scala
    • Handles SQL queries, machine learning and real-time data streaming
  • Tez
    • Uses some of the same techniques as Spark
    • Creates plans to execute the queries
  • HBase
    • A way to expose the data on the cluster to transactional platforms (other systems)
    • NoSQL database (Wide-Column Datastore)
  • Storm
    • A way to process streaming data. E.g., data from sensors
  • Oozie
    • A way to schedule jobs on the cluster
  • Zookeeper
    • Coordination of everything on the cluster
    • Shares the state across the cluster
    • E.g., which nodes are down/up
  • Ambari
    • It's a dashboard to manage all the Hadoop components
    • Gives a view of the current state of the cluster
    • Used by the Hortonworks Hadoop distribution
  • Sqoop
    • Ties HDFS into relational databases
    • It's a connector between Hadoop and legacy databases
  • Flume
    • Listens to web logs coming from web servers in real time and saves them in HDFS
  • Kafka

    • Data in/out of the cluster
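
The MapReduce entry above (mappers transform the data, reducers aggregate it) can be illustrated with a minimal, single-machine sketch in plain Python. This is not Hadoop's API: the function names `mapper`, `shuffle`, `reducer` and `word_count` are hypothetical, and the grouping step that a real cluster performs across nodes is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map step: transform one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key (a stand-in for the framework's shuffle/sort phase)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on big clusters", "data on HDFS"]
    print(word_count(sample))
```

On a real cluster the mappers and reducers would run in parallel on different nodes, with the input lines read from HDFS; the sketch only shows how the two roles divide the work.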

  • External Storage

  • MySQL

    • A way to expose the data for real-time use
  • Cassandra
    • A way to expose the data for real-time use
  • MongoDB
    • A way to expose the data for real-time use
  • HBase

    • Already listed as part of the core Hadoop ecosystem

  • Query Engines

  • Drill

    • SQL queries
    • Fetch data across multiple NoSQL databases
    • Consolidate the data together
  • Hue
    • Web UI for writing and running queries
  • Phoenix
    • SQL queries
    • Runs on top of HBase and offers some additional guarantees
  • Presto
    • Distributed SQL queries across multiple data sources
  • Zeppelin
    • Notebook UI approach
  • Hive

    • Already listed as part of the core Hadoop ecosystem
  • Search Engines

Hadoop Distributions

  • Hadoop distributions are used to provide scalable, distributed computing against on-premises and cloud-based file store data
  • Distributions are composed of commercially packaged and supported editions of open-source Apache Hadoop-related projects
  • Distributions provide access to applications, query/reporting tools, machine learning and data management infrastructure components.

  • Distributions

  • Cloudera Hadoop (CDH)
  • Hortonworks Data Platform (HDP)
  • MapR