# Introduction

Learned at LaiOffer; it came in handy for the third Yelp assignment in 628.

# Basics

Apache Spark: "Apache" is pronounced roughly "uh-PATCH-ee" (like "a-pa-cheap" without the final p).

Spark_Optimizer

Spark supports two distinct kinds of user operations: transformations and actions.

• Transformations are lazy: they are not computed when you write and run the code in a cell; they only get executed once you call an action.
• Examples of transformations: select, distinct, groupBy, sum, orderBy, filter, limit
• Actions are commands that Spark computes right away, at the time they are executed.
• Examples of actions: show, collect, count, save

%[coding language] (e.g. %python, %r, %sql) switches a cell between Python/R/SQL

display(xxx) shows a DataFrame

• What makes this function powerful: we can very easily create more sophisticated graphs by clicking the graphing icon below the output

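xxx.explain() prints the physical plan behind the transformations

xxx.first() returns the first row (the header line, when a raw file with a header is loaded as text)

xxx.take(n) returns the first n rows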

Importing external packages in Databricks

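Datasets/DataFrames are usually faster than RDDs, because Spark knows the schema of the data and can optimize the query.

reduceByKey vs. groupByKey: reduceByKey aggregates within each partition before the shuffle, which reduces the amount of data sent over the network.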
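To make the shuffle saving concrete, here is a plain-Python simulation (not Spark code, and not how Spark implements it internally) that counts how many records would cross the shuffle in each style. The partitions and function names are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (key, value) records spread over two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("c", 1)],
]

def shuffle_group_by_key(parts):
    """groupByKey style: every record crosses the shuffle unchanged."""
    shuffled = [rec for part in parts for rec in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def shuffle_reduce_by_key(parts):
    """reduceByKey style: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition is sent on.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = shuffle_group_by_key(partitions)
reduced_totals, reduced_shuffled = shuffle_reduce_by_key(partitions)
```

Both versions give the same totals, but on this toy input the groupByKey style moves 8 records and the reduceByKey style moves 5. The gap grows with the number of repeated keys per partition.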

Join is typically one of the slowest operations, since it usually requires a shuffle.

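df.show() # prints the first rows as a text table

df.count() # number of rows

df.printSchema() # column names and types

df.createOrReplaceTempView(Name) # registers the DataFrame as a temporary view that can be queried with SQL afterwards

spark.sql("SELECT…") # runs a SQL statement; the result comes back as a DataFrame

pyspark.sql.Row # used to build the rows of a DataFrame

spark.createDataFrame(NAME)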

VectorAssembler combines several columns into a single vector column

PySpark tutorial

# Databricks syntax

dbutils.fs.rm('FileStore/tables/628/review_100k.json',True) # deletes a file you uploaded yourself