Apache Spark
- Engine for processing large-scale data (faster alternative to MapReduce)
- It's a way to distribute data processing loads across machines - divide and conquer approach
- Uses a DAG engine (directed acyclic graph) to optimize the workflow, like Pig on Tez. It finds the most efficient order to run the tasks needed to produce the desired output
- Can be coded in Python, Java, or Scala
- It has a huge ecosystem on top of it for ML, AI, BI, etc.
- Spark does not need to run on Hadoop; it also has its own built-in cluster manager
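A minimal sketch of such a job in Python (PySpark), assuming a local Spark installation and a hypothetical `input.txt` file; each transformation is added to the DAG, and only the final action triggers execution:

```python
from pyspark import SparkConf, SparkContext

# "local[*]" uses Spark's built-in cluster manager on all local cores,
# so no Hadoop/YARN cluster is required.
conf = SparkConf().setMaster("local[*]").setAppName("WordCount")
sc = SparkContext(conf=conf)

# Transformations build the DAG; nothing runs until collect() forces
# Spark to plan and execute the job.
counts = (
    sc.textFile("input.txt")                 # hypothetical input file, split into partitions
      .flatMap(lambda line: line.split())    # one record per word
      .map(lambda word: (word, 1))           # (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)       # combine counts per word
)
print(counts.collect())

sc.stop()
```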
Flow
- Driver Program: invokes the Spark context. The driver program is the script to be executed
- Cluster Manager: Spark's own or YARN (when inside of Hadoop) - handles how to distribute the processing
- Executor: the workload. It has cache and tasks
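A small sketch of how these pieces show up in a driver script, assuming PySpark is installed; the master URL and executor settings below are illustrative, not prescriptive:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("driver-example")               # this script is the driver program
    .master("local[4]")                      # cluster manager: local threads here;
                                             # use "yarn" when running inside Hadoop
    .config("spark.executor.memory", "2g")   # resources given to each executor
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

# The driver builds the job; executors do the actual work on the partitions.
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)
print(rdd.sum())

spark.stop()
```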
Components
- Spark Core: RDD definitions
- Additional libraries:
  - Spark Streaming: ingest data as it's being produced
  - Spark SQL: SQL interface to Spark. Spark as a huge relational database
  - MLLib: Machine Learning and Data Mining
  - GraphX: graph handling (e.g., network information, social networks)
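As an example of one of these libraries, a Spark SQL sketch that treats Spark like a relational database; the data and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").master("local[*]").getOrCreate()

# Build a small DataFrame in memory (illustrative data).
people = spark.createDataFrame(
    [("Ana", 34), ("Bruno", 29), ("Carla", 41)],
    ["name", "age"],
)

# Register it as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```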
Execution Plan
- Based on the job, an execution plan is created
- The job is broken into stages (which account for when data needs to be reorganized)
- Each stage is broken into tasks
- Tasks within the same stage can run in parallel
  - Therefore they can be distributed across the cluster
  - A set of tasks can be run on different executors
- The transition between stages needs the data to sync up again before continuing
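A sketch of how a stage boundary appears in practice, assuming local PySpark: the data reorganization forced by reduceByKey (a shuffle) splits the job into two stages, and the RDD lineage printed by toDebugString() shows that boundary.

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local[*]").setAppName("stages"))

pairs = (
    sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
      .map(lambda x: (x, 1))            # narrow transformation: same stage
      .reduceByKey(lambda a, b: a + b)  # shuffle: starts a new stage
)

# Lineage with the shuffle boundary visible (PySpark returns bytes here).
print(pairs.toDebugString().decode("utf-8"))

sc.stop()
```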