Dataset

  • In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x, use of the Dataset API is encouraged
  • The RDD technology is not deprecated, though; it still underlies the Dataset API
  • The Dataset API is only available in compiled languages (Scala and Java), not in Python
  • RDDs can be converted into Datasets with .toDS() (see the sketch below)
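
A minimal Scala sketch of the conversion, assuming a local SparkSession; the object name and the sample data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ToDsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ToDsExample")
      .master("local[*]")
      .getOrCreate()

    // spark.implicits brings .toDS() into scope for RDDs and Seqs
    import spark.implicits._

    // A plain RDD of (word, count) pairs
    val rdd = spark.sparkContext.parallelize(Seq(("spark", 3), ("rdd", 1)))

    // Convert the RDD into a typed Dataset[(String, Int)]
    val ds = rdd.toDS()
    ds.show()

    spark.stop()
  }
}
```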

Schema

  • Datasets wrap a specific data type, e.g. Dataset[Person] or Dataset[(String, Int)]
  • The schema is known at compile time, because it comes from the wrapped type (see the example below)
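
A sketch of a typed Dataset in Scala, assuming a hypothetical Person case class with name and age fields:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// The case class defines the schema; it is known to the compiler
case class Person(name: String, age: Int)

object SchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SchemaExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset[Person]: field names and types are checked at compile time
    val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
    people.filter(p => p.age > 30).show()

    // A Dataset can also wrap a tuple type: Dataset[(String, Int)]
    val pairs: Dataset[(String, Int)] = Seq(("a", 1), ("b", 2)).toDS()
    pairs.printSchema()

    spark.stop()
  }
}
```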

Efficiency

  • Datasets are more efficient than RDDs
  • Tasks are serialized more efficiently, and optimized execution plans are built before the job runs, since the schema is already known (see the sketch below)
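
As an illustration, explain() prints the plan that Spark builds before running the job (a minimal sketch with made-up data):

```scala
import org.apache.spark.sql.SparkSession

object PlanExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PlanExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(("spark", 3), ("dataset", 5)).toDS()

    // The schema is known, so Spark can plan and optimize the query
    // before executing it; explain() prints the resulting physical plan
    ds.filter(_._2 > 3).explain()

    spark.stop()
  }
}
```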

SparkSession

  • For Datasets, a SparkSession is needed (instead of a SparkContext)
  • The SparkSession is also used to issue SQL queries
  • Stop the session with .stop() when you're done (see the example below)
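
A minimal sketch of the SparkSession lifecycle, including a SQL query over a temporary view (the view name and data are made up):

```scala
import org.apache.spark.sql.SparkSession

object SessionExample {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) a SparkSession; no SparkContext is created by hand
    val spark = SparkSession.builder()
      .appName("SessionExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Register some data as a temporary view so it can be queried with SQL
    val words = Seq(("spark", 3), ("session", 2)).toDF("word", "count")
    words.createOrReplaceTempView("words")

    // The SparkSession issues the SQL query
    spark.sql("SELECT word FROM words WHERE count > 2").show()

    // Stop the session when done
    spark.stop()
  }
}
```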

DataFrame

  • A DataFrame is a Dataset of Row objects (Dataset[Row])
  • It is just like a Dataset, but the rows are untyped, so there is no compile-time schema
  • The schema is inferred at runtime (see the example below)
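
A sketch contrasting an untyped DataFrame with a typed Dataset, reusing the illustrative Person case class from the earlier sketch:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Person(name: String, age: Long)

object FrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FrameExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is just Dataset[Row]; its schema is only known at runtime
    val df: DataFrame = Seq(("Alice", 29L), ("Bob", 35L)).toDF("name", "age")
    df.printSchema()

    // Row fields are accessed by name or position, with no compile-time checks
    val firstRow: Row = df.first()
    println(firstRow.getAs[String]("name"))

    // A DataFrame can be turned back into a typed Dataset with .as[...]
    val people: Dataset[Person] = df.as[Person]
    people.show()

    spark.stop()
  }
}
```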