
Resilient Distributed Dataset (RDD)

  • Spark stores data in memory as RDDs (Resilient Distributed Datasets)
  • Spark tries to keep as much as possible in RAM (not on HDFS)
  • Up to 100x faster than MapReduce
  • To a programmer, an RDD looks just like any other data structure
# RDD from in-memory object
nums = sc.parallelize([1, 2, 3, 4])

# RDD from a text file
nums = sc.textFile("file://home/user/nums.txt")
nums = sc.textFile("s3n://home/user/nums.txt")
nums = sc.textFile("hdfs://home/user/nums.txt")

# RDD from a Hive query (via HiveContext)
nums = HiveContext(sc).sql("SELECT name, age FROM users")

# also from JDBC, Cassandra, HBase, Elasticsearch, CSV, etc.
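
For example, a JDBC table can be loaded through a SQLContext and then treated as an RDD. This is only a sketch: the connection URL and table name are placeholders, and the JDBC driver jar must be on Spark's classpath.

# Sketch: RDD from a JDBC source (placeholders throughout)
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
users_df = sqlContext.read.format("jdbc").options(
    url="jdbc:postgresql://dbhost:5432/mydb",  # hypothetical connection URL
    dbtable="users"                            # hypothetical table name
).load()
users = users_df.rdd  # the underlying RDD of Row objects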

RDD transformations

  • RDD transformations are fully parallelizable (examples below)

  • map: 1:1 relation (one output element per input element)
  • flatMap: 1:N relation (each input element can produce zero or more outputs)
  • filter: keep only the elements that match a predicate
  • distinct: remove duplicate elements
  • sample: take a random sample
  • union, intersection, subtract, cartesian: combine two RDDs
rdd = sc.parallelize([1, 2, 3, 4])
squaredRDD = rdd.map(lambda x: x*x) # [1, 4, 9, 16]
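
A few more transformations in the same style (a sketch; the values in the comments are what collect() would return for this input):

rdd = sc.parallelize([1, 2, 2, 3, 4, 5])
rdd.flatMap(lambda x: [x, x * 10])  # [1, 10, 2, 20, 2, 20, 3, 30, 4, 40, 5, 50]
rdd.filter(lambda x: x % 2 == 0)    # [2, 2, 4]
rdd.distinct()                      # [1, 2, 3, 4, 5] (order not guaranteed)
rdd.sample(False, 0.5)              # random ~50% sample, without replacement
rdd.union(sc.parallelize([6, 7]))   # [1, 2, 2, 3, 4, 5, 6, 7]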

RDD actions

  • The data is only processed when an RDD action is called
  • Nothing actually happens in your driver program until an action is called (lazy evaluation)

  • collect: return all the elements of the RDD to the driver program
  • count: count the items in the RDD
  • countByValue: count the occurrences of each unique value
  • take: return the first n elements
  • top: return the n largest elements
  • reduce: combine all the elements with a given function
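
Putting the actions together on a small RDD (a sketch; the values in the comments assume this exact input, and nothing runs until each action is invoked):

rdd = sc.parallelize([3, 1, 2, 2, 4])
rdd.collect()                    # [3, 1, 2, 2, 4] (everything sent to the driver)
rdd.count()                      # 5
rdd.countByValue()               # {3: 1, 1: 1, 2: 2, 4: 1}
rdd.take(2)                      # [3, 1] (first two elements)
rdd.top(2)                       # [4, 3] (two largest elements)
rdd.reduce(lambda a, b: a + b)   # 12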