Spark Notes

Introduction

Learned this at Laioffer; it came in handy for the third Yelp project in 628.

Basics

Apache Spark: "Apache" is pronounced roughly "uh-PATCH-ee" (the last syllable sounds like "cheap" without the p).

Spark_Optimizer

Data type RDD: a Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster.

Spark allows two distinct kinds of operations by the user: transformations and actions.

  • Transformations are operations that are not executed at the time you write and run the code in a cell; they only run once you have called an action.
  • Examples of Transformations: select, distinct, groupBy, sum, orderBy, filter, limit
  • Actions are commands that are computed by Spark right at the time of their execution.
  • Examples of Actions: show, collect, count, save
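The lazy/eager split behaves much like Python generators: building the pipeline does nothing, and work only happens when something consumes it. A plain-Python analogy (not Spark itself, just a sketch of the idea):

```python
log = []

def keep_even(x):
    log.append(x)          # record that work actually happened
    return x % 2 == 0

# "Transformation": build a lazy pipeline; nothing executes yet.
evens = (x for x in range(5) if keep_even(x))
assert log == []           # no elements visited so far

# "Action": consuming the generator finally triggers the computation.
result = list(evens)
assert result == [0, 2, 4]
assert log == [0, 1, 2, 3, 4]
```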

%[coding language] switches the cell between Python/R/SQL

display(xxx) is used to display a DataFrame

  • What makes this function powerful: we can very easily create more sophisticated graphs by clicking the graphing icon that appears below the output

xxx.explain() shows the physical plan behind the transformations

xxx.first() fetches the header (first row)

xxx.take(n) fetches the first n rows

Importing external packages into Databricks

Datasets/DataFrames are faster than RDDs, because Spark knows what the data looks like (the schema).

reduceByKey beats groupByKey: it does an aggregation step inside each partition before the shuffle, shrinking the amount of data transferred.
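The saving can be sketched in plain Python by counting how many records would cross the shuffle boundary: groupByKey ships every (key, value) pair, while reduceByKey first combines values per key inside each partition. All names here are illustrative, not Spark internals:

```python
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 1), ("a", 1)],   # records on worker 1
    [("a", 1), ("b", 1), ("b", 1)],   # records on worker 2
]

# groupByKey-style: every record is shuffled as-is.
shuffled_group = [pair for part in partitions for pair in part]

# reduceByKey-style: combine locally per partition first (map-side combine).
def local_combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = [pair for part in partitions for pair in local_combine(part)]

print(len(shuffled_group))    # 6 records would cross the network
print(len(shuffled_reduce))   # 4 records after local aggregation
```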

Join is the slowest operation.

spark.read.json automatically gives a DataFrame

df.show() # display the table

df.count() # how many rows

df.printSchema() # inspect the schema

df.createOrReplaceTempView(Name) # turns it into a table, so you can use SQL on it afterwards

spark.sql("SELECT…") # a SQL statement; results fetched this way come back as a DataFrame

pyspark.sql.Row # used to build a DataFrame

spark.createDataFrame(NAME)

VectorAssembler can turn columns into a vector

pyspark

PySpark tutorial

How to change a column's type:
crimeSunday = crimeSunday.withColumn('X', crimeSunday['X'].cast('float'))

Databricks syntax

dbutils.fs.rm('FileStore/tables/628/review_100k.json',True) # delete a file you uploaded yourself

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when republishing!