Apache Spark

  • Engine for processing large-scale data (a faster alternative to MapReduce)
  • It's a way to distribute data processing loads across machines
  • Divide and conquer approach
  • Uses a DAG (directed acyclic graph) engine to optimize the workflow, like Pig on Tez: it finds the most efficient order of tasks to produce the desired output
  • Can be coded in Python, Java, or Scala (a minimal PySpark driver is sketched after this list)
  • A huge ecosystem is built on top of it for ML, AI, BI, etc.
  • Spark does not need to run on Hadoop; it has its own built-in (standalone) cluster manager
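
A minimal PySpark driver program, as a sketch: it assumes pyspark is installed (pip install pyspark) and runs locally, with no Hadoop cluster required.

```python
from pyspark.sql import SparkSession

# Build the entry point; "local[*]" uses one worker thread per CPU core.
spark = (SparkSession.builder
         .appName("word-count-sketch")
         .master("local[*]")
         .getOrCreate())

# A tiny word count over an in-memory list (no input files needed).
lines = spark.sparkContext.parallelize(["spark is fast", "spark is distributed"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('distributed', 1)]
spark.stop()
```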

Flow

  • Driver Program: the script to be executed; it creates the SparkContext
  • Cluster Manager: Spark's standalone manager or YARN (when running inside Hadoop); decides how the processing is distributed (the sketch after this list shows how the driver selects it)
  • Executor: the worker process that carries out the workload; it holds a cache and runs tasks
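
How the driver points at a cluster manager and sizes its executors, as a sketch; the master URLs and resource values below are illustrative assumptions, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("flow-sketch")
         # .master("yarn")                 # run inside Hadoop, let YARN distribute the work
         # .master("spark://host:7077")    # run on Spark's standalone cluster manager (hypothetical host)
         .master("local[*]")               # run locally for testing
         .config("spark.executor.memory", "2g")  # memory per executor (illustrative)
         .config("spark.executor.cores", "2")    # cores per executor (illustrative)
         .getOrCreate())

spark.stop()
```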

Components

  • Spark Core: the base engine and the RDD (Resilient Distributed Dataset) abstraction

  • Additional libraries

  • Spark Streaming: ingest data as it's being produced
  • Spark SQL: a SQL interface to Spark, letting you treat it like a huge relational database (see the sketch after this list)
  • MLlib: machine learning and data mining
  • GraphX: Graphs handling (e.g., network information, social networks)
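
A small Spark SQL sketch, assuming the same local setup as above; the table name and sample rows are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").master("local[*]").getOrCreate()

# Build a DataFrame from in-memory rows and expose it as a SQL table.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])
df.createOrReplaceTempView("people")

# Query Spark as if it were a relational database.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```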

Execution Plan

  • Based on the job, an execution plan is created
  • The job is broken into stages (stage boundaries fall where data needs to be reorganized, i.e., shuffled)
  • The stage is broken into tasks

  • Tasks within the same stage can run in parallel

  • Therefore they can be distributed across the cluster
  • A set of tasks can run on different executors
  • Transitions between stages require the data to sync up again (a shuffle) before continuing, as in the sketch below
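
A sketch of how a shuffle splits a job into stages, assuming a local run: reduceByKey is a wide transformation, so Spark starts a new stage there, and toDebugString prints the lineage with the stage boundary (the ShuffledRDD).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stages-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = (sc.parallelize(range(1000), numSlices=4)  # 4 partitions -> 4 parallel tasks per stage
         .map(lambda x: (x % 10, x))             # narrow transformation, stays in the same stage
         .reduceByKey(lambda a, b: a + b))       # wide transformation, forces a shuffle / new stage

# The indentation in the lineage marks where one stage ends and the next begins.
print(rdd.toDebugString().decode("utf-8"))
print(rdd.collect()[:3])

spark.stop()
```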