Dataset
- In Spark 1.x, the
RDD
was the primary application programming interface (API), but as of Spark 2.x use of the Dataset
API is encouraged
- The RDD technology is not deprecated though, it still underlies the Dataset API
- This API can only be used in compiled languages (not for python)
- RDDs can be converted into Dataset
.toDS()
Schema
- Datasets wrap a data type. E.g.,
Dataset[Person]
, Dataset[(String, Int)]
- The schema is inferred at compile time
Efficiency
- Datasets are more efficient than RDDs
- The tasks are serialized more efficiently with optimal execution plans (defined at compile time)
SparkSession
- For Datasets, a
SparkSession
is needed (instead of SparkContext)
- The SparkSession is used to issue SQL queries
- Stop the session when you're done
DataFrame
- A Dataframe is a DataSet of Row objects (
DataSet[Row]
)
- It's just like a Dataset, but does not contain a typed schema
- The schema is inferred at runtime