Spark

Mastering Apache Spark

Open courses

BerkeleyX: CS105x Introduction to Apache Spark

RDD implementation

An RDD is a read-only, partitioned collection of records.

Basic properties

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
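The five properties above can be pictured as fields on a small record. The following is a minimal sketch in plain Python, not Spark's actual class: `ToyRDD`, its field names, and the `host0`/`host1` locations are hypothetical and exist only to show how the pieces fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""
    # 1. A list of partitions (here simply partition indices).
    partitions: List[int]
    # 2. A function for computing each split.
    compute: Callable[[int], Iterator]
    # 3. A list of dependencies on other RDDs.
    dependencies: List["ToyRDD"] = field(default_factory=list)
    # 4. Optionally, a partitioner for key-value RDDs (key -> partition index).
    partitioner: Optional[Callable[[object], int]] = None
    # 5. Optionally, preferred locations for each split (split -> host names).
    preferred_locations: Optional[Callable[[int], List[str]]] = None


# Hypothetical source data already laid out in two partitions.
data = [["a", "b"], ["c"]]

source = ToyRDD(
    partitions=[0, 1],
    compute=lambda i: iter(data[i]),
    preferred_locations=lambda i: [f"host{i}"],
)

# A derived dataset: same partitions, computed from its single parent.
upper = ToyRDD(
    partitions=source.partitions,
    compute=lambda i: (x.upper() for x in source.compute(i)),
    dependencies=[source],
)

result = [list(upper.compute(i)) for i in upper.partitions]
print(result)  # [['A', 'B'], ['C']]
```

The derived dataset stores no records of its own: it only remembers its parent and how to compute each split from it.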

Creation

  • Scala collections
  • External datasets
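In Spark these two routes correspond to `SparkContext.parallelize` (local collection) and `SparkContext.textFile` (external dataset). The sketch below imitates both in plain Python so it runs without a cluster; `toy_parallelize`, `toy_text_file`, and the slicing rule are illustrative assumptions, not Spark's implementation.

```python
import os
import tempfile


def toy_parallelize(data, num_slices):
    """Split a local collection into num_slices contiguous partitions."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


def toy_text_file(path, num_slices):
    """Read a text file and partition its lines (one record per line)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return toy_parallelize(lines, num_slices)


# From a local collection.
parts = toy_parallelize(range(1, 11), 3)
print(parts)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

# From an external dataset (a temporary text file stands in for HDFS etc.).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\nd\n")
    line_parts = toy_text_file(path, 2)

print(line_parts)  # [['a', 'b'], ['c', 'd']]
```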

RDD operations

  • Transformations
  • Actions
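The key difference is that transformations are lazy (they only record what to do), while actions trigger the computation and return a result. A minimal plain-Python sketch of that behaviour follows; `LazyDataset` and its methods are hypothetical names that mirror Spark's `map`, `filter`, `collect`, and `count`, not the real classes.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, actions execute them."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        # Replay the recorded operations over the source records.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded pipeline and return a value.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []


def double(x):
    calls.append(x)  # record each invocation to observe laziness
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # 0: nothing has run yet

collected = ds.collect()
print(collected)   # [6, 8]
print(len(calls))  # 4: double ran once per record, during the action
```

Calling a second action on `ds` would replay the whole pipeline again, which is the cost that caching is meant to avoid.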

RDD caching

Storage levels: disk, memory

persist(), cache()
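In Spark, `cache()` on an RDD is shorthand for `persist()` with the default storage level, and the data is only materialised when the first action runs. The sketch below models that in plain Python with an in-memory store only; `CachedDataset` and `expensive` are hypothetical, and disk storage is deliberately left out.

```python
class CachedDataset:
    """Toy dataset that can keep its computed records in memory."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the records
        self._persisted = False
        self._cached = None

    def persist(self):
        # Only marks the dataset; nothing is computed or stored yet.
        self._persisted = True
        return self

    def cache(self):
        # Shorthand for persist() in this toy model.
        return self.persist()

    def collect(self):
        if self._cached is not None:
            return list(self._cached)      # served from the cache
        records = list(self._compute())    # recompute from scratch
        if self._persisted:
            self._cached = records         # stored on the first action
        return list(records)


runs = []


def expensive():
    runs.append(1)  # count how often the computation really runs
    return [x * x for x in range(4)]


uncached = CachedDataset(expensive)
uncached.collect()
uncached.collect()
runs_without_cache = len(runs)  # 2: recomputed for every action

runs.clear()
cached = CachedDataset(expensive).cache()
first = cached.collect()
second = cached.collect()
runs_with_cache = len(runs)     # 1: second action reads the cache

print(runs_without_cache, runs_with_cache)  # 2 1
```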

Narrow dependencies

Wide dependencies
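With a narrow dependency, each child partition is computed from a single parent partition (as with `map`). With a wide dependency, a child partition may need records from every parent partition, which requires a shuffle (as with `groupByKey`). The plain-Python sketch below contrasts the two; `narrow_map`, `wide_group_by_key`, and the modulo partitioner on integer keys are illustrative assumptions, not Spark's code.

```python
def narrow_map(parent_partitions, fn):
    """Narrow: child partition i is built from parent partition i only."""
    return [[fn(record) for record in part] for part in parent_partitions]


def wide_group_by_key(parent_partitions, num_partitions):
    """Wide: every child partition may read from every parent partition."""
    children = [dict() for _ in range(num_partitions)]
    for part in parent_partitions:            # scan all parents (the shuffle)
        for key, value in part:
            target = key % num_partitions     # toy partitioner, integer keys
            children[target].setdefault(key, []).append(value)
    return children


parent = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

mapped = narrow_map(parent, lambda kv: (kv[0], kv[1].upper()))
print(mapped)   # [[(1, 'A'), (2, 'B')], [(1, 'C'), (3, 'D')]]

grouped = wide_group_by_key(mapped, 2)
print(grouped)  # [{2: ['B']}, {1: ['A', 'C'], 3: ['D']}]
```

Key 1 appears in both parent partitions, so the child partition that holds it cannot be built without reading both: that cross-partition movement is what makes the dependency wide.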
