Resilient Distributed Dataset (RDD)
- Spark stores data in memory as RDDs (Resilient Distributed Datasets)
- Spark tries to keep as much as possible in RAM (not on HDFS)
- Up to 100x faster than MapReduce
- To a programmer, an RDD looks just like any other data structure
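The snippets below assume an existing SparkContext named sc (the pyspark shell provides one automatically); a minimal sketch for creating one yourself, with a placeholder app name:
# minimal SparkContext setup (assumed by the snippets below)
from pyspark import SparkContext
sc = SparkContext(master="local[*]", appName="rdd-notes")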
# RDD from in-memory object
nums = sc.parallelize([1, 2, 3, 4])
# RDD from a text file
nums = sc.textFile("file://home/user/nums.txt")
nums = sc.textFile("s3n://home/user/nums.txt")
nums = sc.textFile("hdfs://home/user/nums.txt")
# RDD from a Hive context
from pyspark.sql import HiveContext
nums = HiveContext(sc).sql("SELECT name, age FROM users")
# also from JDBC, Cassandra, HBase, Elasticsearch, CSV, etc.
RDD transformations
- RDD transformations are completely parallelizable
- map: 1:1 relation between input and output
- flatMap: any relationship between input and output (one input can yield zero or more outputs)
- filter: keep only the elements that match a predicate
- distinct: unique values
- sample: take a random sample of the data
- union, intersection, subtract, cartesian: combine RDDs
rdd = sc.parallelize([1, 2, 3, 4])
squaredRDD = rdd.map(lambda x: x*x) # [1, 4, 9, 16]
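A quick sketch of a few more of the transformations listed above on toy data (expected outputs shown in the comments):
rdd = sc.parallelize([1, 2, 2, 3, 4])
evens = rdd.filter(lambda x: x % 2 == 0)      # [2, 2, 4]
uniques = rdd.distinct()                      # [1, 2, 3, 4] (order not guaranteed)
doubled = rdd.flatMap(lambda x: [x, x * 10])  # [1, 10, 2, 20, 2, 20, 3, 30, 4, 40]
combined = evens.union(uniques)               # all elements of both RDDs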
RDD actions
- The data is only processed when an RDD action is called
- Nothing actually happens in your driver program until an action is called (lazy evaluation; see the sketch after this list)
- collect: return the results to the driver program as a list
- count: count the items in the RDD
- countByValue: count the occurrences of each unique value
- take: take the first n rows
- top: take the top n rows (in descending order)
- reduce: combine the values with a function
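A minimal sketch of these actions on toy data; the map on the first line is only executed once one of the actions below forces it (lazy evaluation). Expected results are shown in the comments:
rdd = sc.parallelize([3, 1, 2, 2]).map(lambda x: x)  # nothing runs yet (lazy)
rdd.collect()                    # [3, 1, 2, 2] - triggers the computation
rdd.count()                      # 4
rdd.countByValue()               # {3: 1, 1: 1, 2: 2}
rdd.take(2)                      # [3, 1] - first 2 elements
rdd.top(2)                       # [3, 2] - largest 2 elements, descending
rdd.reduce(lambda a, b: a + b)   # 8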