# 日常

## 2021/11/11

Ordinal Regression

## 2021/11/10

Linear Model还假设 the mean and variance for the outcome is unrelated (i.e. variance is a constant) 所以当我们的residual plot出现喇叭的时候，其实有可能是poisson分布？方差随着y变大而变大。

## 2021/10/31

Lasso 和 Stepwise 来选变量都会导致biased parameter。所以我们做inference的时候会出问题。但是我们可以只用它来做prediction，那还尚可。

## 2021/10/27 ## 2021/10/14

Logistic里面不太好加interaction，因为不好解释，interaction是作用在log-odds上面的。如果变回到pi的level，interaction的关系会变得很复杂。

Least Square和正态没关系。我们可以有一个lm模型，但是没有正态假设。

Transforming Y can affect both linearity and equal variance, but transforming X can affect only linearity.

• 假设是一个指数变换。$Y^{\prime} = e^{Y}$。那么会有效果是越大的Y越容易有大的方差。

## 2021/10/13

Linear Regression里面随着data变多，模型一直保持不变。所以才会有$\hat{\beta}$ 正态收敛到均值。但是GUIDE不同。分类树算法随着数据不断变多之后，树会发生变化，所以不存在limiting distribution的说法。

## 2021/10/10 变异系数

• 判断准则随便找了一个：0.1/0.4/1 弱变异 - 低等变异 - 中等变异 - 强变异

## 2021/10/09 方差齐性检验

1. 【正态敏感】方差比：大除小，得到一个F值。（所以可以是单边检验）
2. 【正态敏感】Hartley检验：用于多组方差的检验，用多组中最大的方差除最小的方差，得到一个F值。
• 这里用的F table是一个 $F_{MAX}$ 的 sample distribution。
• 假设正态，并且每组的数量相同。不足之处是正态不满足的时候不够robust，有时候小p值会来自于非正态。
3. 【正态敏感】Bartlett检验：一个诡异的构造，出来一个卡方检验
4. 【正态敏感】Cochran’s C test: 检验的其实是方差的outlier，是否有某一个variance超出去了。统计量十分简单，就是方差除方差和的ratio。
• 需要每组数量相同。
• 比起Hartley，Cochran用了更多的数据
5. 【正态不敏感】Levene检验：将每个值先转换为为该值与其组内均值的偏离程度绝对值，然后再用转换后的偏离程度去做方差分析（常规ANOVA），即组间方差/组内方差。
6. 【正态不敏感】Brown–Forsythe检验：修改版的Levene。偏离程度用median刻画（其他方法还有用trimmed mean）。比Levene更robust

## 2021/10/03

• 因为在人烟稀少的地方，可能p = 0.51，但是因为只有0/1，我们就会把他当成p=1来看待。所以重新采样的过程可以使得数据的influence point没那么夸张
• 这个重新采样的过程也只适用于那些对于influencial point敏感的算法，比如说linear regression。

• 后来上课时讲到，其实可以用IPW来补偿，也没啥问题。如果每个州的取样相同的话，那么一个加州人可能代表了100个加州人，而一个威斯康辛人可能只代表10个威州人。

• 但是收敛速度这个东西并不是一个统计学的名词，我们该怎么来刻画他呢？

## 2021/09/28

Importance Score利用了Multiple Correction，所以会导致变量多的时候，IS的大小会变（但是因为是同样的修正系数，所以我猜相对位置应该不会变）。

## 2021/09/20 Loh’s 761

Logistic的S-curve是因为对应的 $X$ 是连续型变量，所以对于离散的 $X$ ，都没法看到S-Curve。但是S-Curve的斜率不是 $\beta$ ，因为方程其实是：

$y = \frac{exp\{\beta_0 + \beta_1 X \}}{1 + exp\{\beta_0 + \beta_1 X \}}$

summary(vector) # 这个命令会看到NA的数量

Interaction 有可能全局不显著，但是在某个subspace里面显著 -> GUIDE有了用武之地

Linear Regression 的 Interpretation的问题：有时候一个simple lm系数是正的变量，因为其他变量的加入而变成了负系数。这时我们如果解释为负面作用，岂不是闹笑话。

• 所以有时一个variable解释的过度了，就需要其他的变量来修正。lm的目的是准确的拟合以及预测，而并不是为了解释coef interpretation。就像鞋用来穿，而不是用来吃一样。
• 仔细一想，树好像和其他的变量关系不大，但LDA会受影响。

Reference would change the R-output，when there exists non-convergence.

• 你不知道什么时候R会双手一举，说老子不算了

GUIDE的importance score：

• 简单来讲算法是weighted chi-squared test
• 为何能给出置信度：因为Loh做了一个permutation test，去获得这个未知的分布。重复这个过程10000次，难怪importance score算得很慢。

GUIDE的interface是跟别人互动，本来我以为是个累赘，结果被老罗硬生生说成了一个feature。他说为了避免用户背住所有option，不如一个一个问。所以我在想或许我也可以做一个R版本的互动。

• 想象一下，在一个三维空间中，你有一些点，他们形成了近似一条直线（Multicollinearity）。当你要用一个平面去拟合的时候，就像是跷跷板，中间的支撑很不稳定，所以方差很大。但是如果未来的点还在这条直线上的话，那么就没什么大问题。

## 2021/09/19 998 Homework

Nested Design: The levels of one factor (e.g., factor B) are similar but not identical for different levels of another factor (e.g., A). 比如A1对应了B123，A2对应了B1234… 这时我们就会说 the levels of factor B nested under the levels of factor A.

• 在这种setting下，random effect 和 fixed effect 的ANOVA表变得不同。对应的F-test也会不一样。
• 比如说A fixed, B random。A的F-test是要用 MSA / MSB 而不是 MSA / MSE

The Two-Stage Nested Design: Page 604 in Douglas’s Textbook.

# B: B用的是组内序号1，2，3.所以不同组会出现同样的数字1，2，3
# B.id：这里用的是绝对序号1，2，3...n。所以不同组不会出现同样的数字
# 以下两种方法都可以，结果相同
lmer(y ~ A + (1|A:B))
lmer(y ~ A + (1|B.id)

## 2021/09/14 Miaoyan’s 602

A unit change in $X_j$ is associated with a $\beta_j$ average change in $Y$, while holding all other predictors ﬁxed.

• 听上去没毛病，但是问题来了，如果设计矩阵X之间存在correlation呢？如果有cor，你很难做到在改变一个的同时，不改变其他X。所以这个定义干脆是用不了。

“The only way to ﬁnd out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively ”—— Fred Mosteller and John Tukey , paraphrasing George Box ### Dependent Tests

Consider testing all pairwise differences in means between four groups; this is $𝑁=4⋅3/2=6$ tests. Each of the six $p$-values alone is uniformly distributed. But they are all positively correlated: if (on a given attempt) group A by chance has particularly low mean, then A-vs-B comparison might yield a low $p$-value (this would be a false positive). But in this situation it is likely that A-vs-C, as well as A-vs-D, will also yield low $p$-values. So the $p$-values are obviously non-independent and moreover they are positively correlated between each other.

• 可以见得这种comparison testing，如果有一个显著了，会使得其他的comparison也显著。

Imagine I am testing whether A = 0 and also whether B = 0 against one-tailed alternatives (A > 0 and B > 0). Further imagine that B depends on A. For example, imagine I want to know if a population contains more women than men, and also if the population contains more ovaries than testes. Clearly knowing the p-value of the first question changes our expectation of the p-value for the second. Both p-values change in the same direction, and this is PRD. But if I instead test the second hypothesis that population 2 has more testes than ovaries, our expectation for the second p-value decreases as the first p-value increases. This is not PRD.

• PRD: Positive Regression Dependency
• 这个dependence感觉有点偏理论了，先溜了。

## 2021/09/13

Interpreting Residual and Null Deviance in GLM R

• 两种Deviance都是由$-2\log{\Lambda}$ 算出来的。
• Null Deviance比较的是全模型和只有intercept
• Residual Deviance 比较的是全模型和full model
• 全模型指的是Saturated Model: 每个数据有每个数据的参数，df = n拉满

Rank the variable (Importance Score):

• Why Rank? Based on what criteria? 我目前认为这个方法或许不该出现，因为他给出了我误以为有用但其实没用的信息。就像我对于Correlation过分重视，但后来发现他只能描述线性关系，就没那么帅了。

• 所以如果变量过多的话，Bonferroni之后没有显著的怎么办？
• 我想一个节点上面找到了p最小的那个划分，不管它是否显著，其实对于最终效果都是好事。树本身对于一个节点的false positive不是那么关心，因为只要最后分的好即可。

GUIDE对于numeric变量的NA处理方式是划为一类，我觉得不太妙。但是想到了「Age of Spouse」问题，我又觉得好像确实没法用机器来做。对于这种可能填了会很错的事情，还是不填比较好。

• 给一个没有伴侣的人找到了伴侣的年龄。人类会觉得是个离谱的错误。但是机器在分类的时候也会觉得这个问题很离谱吗？

LASSO没法给出Confidence Interval，它的beta估计也是biased的。人们用它只是因为好发论文？

• 对于不同的bootstrap sample，beta的变化可能会很大。所以CI其实也没啥意义。

Human Bias:

• 我看到了一个curve pattern，可能会手动添加一个二次项，这个就属于bias。但是GUIDE等机器学习方法看不到图形形状，所以就不会出现这个问题。它们可以用CV来解决问题，但是很明显人不能成百上千次的重复观察来消除这种bias。
• 所以我曾想让机器有看的能力，现在看起来不一定是个好事？

## 2021/09/12 老李’s Question

• 列昂惕夫逆矩阵：在计算总投入的时候，要加上其他的因素，比如人工费

## 2021/09/11 Miaoyan’s 602

Correlation Model: Inference on pearson correlation coefficient: Mixed-Effects Model: ## 2021/09/10 Miaoyan’s 602

Regression: Joint distribution of outcomes given all covariates. 每一个数据都是有用的，合在一起产生了一个Joint分布。所以在我们使用$(X'X)^{-1}$的时候，其实每一个行X都贡献了一些作用。

Some other Regressions:

• Binary Regression - Logistic Regression: Log-Odds ~

• Bernoulli -> Binomial Logistic Regression: response variable变为【成功了几次】
• Binary Regression - Probit Regression: $\operatorname{Pr}(Y=1 \mid X)=\Phi\left(X^{T} \beta\right)$

• Possion Regression: Log(y) ~

• 这三个模型属于Non-Gaussian Model: There is no error term, there’s just an unknown probability. The logistic model is a probability model. And of course, the error is not iid distributed. 随机性被Bernoulli分布刻画走了

Partial Regression: Attempts to show the effect of adding another variable to a model that already has one or more independent variables. Also referred to as added variable plots, adjusted variable plots, and individual coefficient plots.

Partial Correlation: Formally, the partial correlation between $X$ and $Y$ given a set of $n$ controlling variables $Z = {Z_1, Z_2, ..., Z_n}$, written $\rho_{XY.Z}$, is the correlation between the residuals $e_X$ and $e_Y$ resulting from the linear regression of $X$ with $Z$ and of $Y$ with $Z$, respectively. • $V$ 刻画的是参数如何被使用所导致的方差
• The variance function relates the means to the variances
• 其实$V$是为了描述方差和均值之间的关系。因为他们只有没关系，才符合linear regression的假设。

A problem with logistic regression (as well as other GLMs) is overdispersion. Overdispersion occurs when error (residuals) are more variable than expected from the theorized distribution. In case of logistic regression, the theorized error distribution is the binomial distribution. The variance of binomial distribution is a function of its mean (or the parameter $p$). If there is overdispersion, the coefficient estimates will be more conﬁdent (smaller standard error values) than they should be. One can detect overdispersion by comparing the residual deviance with the degrees of freedom. If these two numbers are close, there is no overdispersion. Residual variation much larger than degree of freedom indicates overdispersion. Underdispersion, detected by lower residual deviance than the degrees of freedom, is also possible but more rare. In our example in Listing 11, there is no indication of overdispersion.

• residual deviance 和 df 会在glm的summary里面给出来，相除即可得到dispersion。

• Over dispersion是个坏事情

A simple solution for overdispersion is to estimate an additional parameter indicating the amount of the oversidpersion. With glm(), this is done so-called ‘quasi’ families, i.e., in logistic regression we specify family=quasibinomial instead of binomial.

$\hat{\beta}=\left(\boldsymbol{X}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{Y}, \quad \operatorname{Cov}(\boldsymbol{Y} \mid \boldsymbol{X})=\sigma^{2} \boldsymbol{\Sigma}$

## 2021/09/09 Loh’s 761

I have been thinking about this day and night for 36 years. 1951-1985-2021 老罗在今年9月就70岁了。

Parameter 多的时候，虽然很容易有显著的变量，但是来个Bonferroni Correction，就都凉了。

• 很容易显著是因为Under $H_0$, p-value is uniformly distributed. 所以大约有5%的变量会是显著的
• 尤其是你还要考虑到变量之间的交互作用

Parameter多，但是数据量也大的时候呢，会怎么样？

• 我认为数据量越大的时候，Pattern的寻找就会越简单。就像$\beta=0.001$。数据量小的时候可能检验不出来，所以我们会认为$\beta=0$这个假设没法拒绝。但是数据量只要够大，$0.001$这一点点的区别也能找出来。这就带来一个问题：我们需要重新定义我们的原假设，让他对一些没什么必要的noise不要太关注。

• Holm step-down：从小到大排序后，只要满足$P_{k}<\frac{\alpha}{m+1-k}$ 即拒绝。可以看出第一步和Bonferroni是一样的，但是之后就会变的比Bonferroni稳强。

• Hochberg step-up: 跟Holm判断方法几乎一样，只不过不同的是holm从最小的开始，只要p值不够小就停下来。而Hochberg是从最大的开始，直到p值小才停。

# Hochberg在能用的时候比Holm更强
pp = c(0.015,0.015,0.02,0.05)
#  0.06 0.06 0.06 0.06
#  0.04 0.04 0.04 0.05
# 这个例子中，pp最小的不够小，所以0.015*4 = 0.06>0.05不够拒绝的条件，Holm会认为这四个都不显著
# 但是第三个pp因为0.02*2 = 0.04<0.05够了拒绝的条件，Hochberg会认为第三个显著，而前两个由于monotone也显著。
• 所以Hochberg需要加上条件：It holds only under non-negative dependence。这个要求在Sidak对于Bonferroni的优化中也见到过：$1-(1-\alpha)^{1/m} \Leftrightarrow \alpha/m$
• Bonferroni和Holm适用于各种dependence，但是在positive related的时候会过于保守。这时可以试试Resampling Procedure。【待看】【The procedures of Romano and Wolf (2005a,b) dispense with this condition and are thus more generally valid.】
• Harmonic mean p-value procedure 【待看】

### Multiple Testing

Union-Intersection Testing: 要想拒绝一堆假设，只需要拒绝一个即可。这时

$H_{I}: \bigcap_{i=1}^{m} H_{i} \text { versus } K_{U}: \bigcup_{i=1}^{m} K_{i}$

• 我们一般处于这个状态，所以需要Bonferroni等一系列办法。

Intersection-Union Testing: 要想拒绝一堆假设，必须全部拒绝。这时

$H_{U}: \bigcup_{i=1}^{m} H_{i} \text { versus } K_{I}: \bigcap_{i=1}^{m} K_{i}$

• 当我们处于这个状态时，不需要调整。

An interesting feature of intersection-union tests is that no multiplicity adjustment is necessary to control the size of a test but the individual hypotheses cannot be tested at levels higher than the nominal signiﬁcance level either (Berger, 1982).

• FWEC: FWER under the complete null (FWEC) is the probability that at least one type-I error occurs given that all nulls are true. 所有$H_0$都对的时候，出现一个错误的概率。

• FWEP: FWER under a set of null (FWEP) is the probability that at least one type-I error occurs given that a subset of nulls are true. 只要有一些$H_0$ 对了即可。

• Weak Control of FWER: FWEC $\le \alpha$

• Strong Control of FWER: FWEP $\le \alpha$ for all subsets of null hypotheses.

Closure Principle : The closure principle proposed by Marcus, Peritz and Gabriel (1976) plays a key role in the theory of multiple testing and provides a foundation for virtually all multiple testing methods arising in pharmaceutical applications. This principle has been used to construct a variety of stepwise testing procedures.

• The closure principle states that an FWER-controlling testing procedure can be constructed by testing each hypothesis in the closed family using a suitable local $\alpha$-level test. This procedure rejects a hypothesis if all intersection hypotheses containing this hypothesis are rejected by the associated local tests.
• Closure不一定具备下面的性质，每一个都能举出反例。

Monotone procedures: A monotone procedure rejects a hypothesis whenever it rejects another hypothesis with a larger p-value. For example, if $p_i < p_j$ then the rejection of $H_j$ automatically implies the rejection of $H_i$ . Monotonicity helps to avoid logical inconsistencies; as such it is an essential requirement for multiple testing procedures. When a procedure does not have this property, monotonicity needs to be enforced by updating adjusted p-values.

Consonant procedures: A closed testing procedure is termed consonant (Gabriel, 1969) if the rejection of an intersection hypothesis $H_I$ with $I \subseteq\{1, \ldots, m\}$ and $|I|>1$ always leads to the rejection of at least one $H_J$ implied by $H_I$ , i.e., $H_J$ with $J \subset I$. It is possible for a nonconsonant procedure to reject the global null hypothesis $H_I$ , $I \subseteq\{1, \ldots, m\}$, without rejecting any other intersection hypotheses.

alpha-exhaustive procedures: An $\alpha$-exhaustive procedure is a closed testing procedure based on intersection hypothesis tests the size of which is exactly $\alpha$ (Grechanovsky and Hochberg, 1999). In other words, $P\left(\text { Reject } H_{I}\right)=\alpha$ for any intersection hypothesis $H_I$ , $I \subseteq\{1, \ldots, m\}$. If a procedure is not α-exhaustive, one can construct a uniformly more powerful procedure by setting the size of all intersection hypothesis tests at $\alpha$.

Partitioning Principle: The ﬁrst step involves partitioning the union of the hypotheses into mutually exclusive hypotheses. Since the hypotheses are disjoint, each one of them can be tested at level $\alpha$ without compromising the FWER control. 比如说：

$H_{1}: \mu_{1} \leq \mu_{0}, \quad H_{2}: \mu_{2} \leq \mu_{0}$

\begin{aligned} &H_{1}^{*}: \mu_{1} \leq \mu_{0} \quad \text { and } \quad \mu_{2} \leq \mu_{0} \\ &H_{2}^{*}: \mu_{1} \leq \mu_{0} \quad \text { and } \quad \mu_{2}>\mu_{0} \\ &H_{3}^{*}: \mu_{1}>\mu_{0} \quad \text { and } \quad \mu_{2} \leq \mu_{0} \end{aligned}

ANCOVA：ANOVA只能处理cat变量的X，regression一般处理continuous的X，ANCOVA介于这两者之间。我们一般会用dummy0-1来处理cat，这就是ANCOVA。在R中，可以用car::Anova(fit_lm, type = 'III') 来运行。

• 直接summary(aov)用的是type-I，是不对的。

# 关于GUIDE

GUIDE dsc file的路径问题，可以通过修改dsc中的第一行，把数据的相对路径改成绝对路径来解决（记得带引号）。这样一劳永逸。

Minimum Node Size可能是被老罗设置成为了1%的total sample size。我看到老罗有一个30层的树，这可能说明他设置的maximum level比30还要大。

KL03的NA解决办法：

C Method: 遍历所有变量对，选出resubstitution error最小的那一组。在fit LDA的时候如果有NA的话就直接扔掉对应的数据，用其他的数据进行拟合。在predict（resubstitution）的时候，所有的NA都被填成了mean和mode。

High leverage point 在GUIDE里面不是问题，因为二分法不在乎数值的具体大小（相对即可）。

Missing value flag在GUIDE里面可以帮助划分。没有的话NA只会全去某一侧，如果有的话可能missing code A会去左侧，missing code B会去右侧。

Party&Partykit在利用weight的时候会要求是正整数，因为算法是将同样的数据复制w遍。这样会导致在那一个点处的样本方差 = 0。

• 或许以后需要了解一下其他算法的缺点，defend for GUIDE

# STAT641

Never adjust for post-randomized variable. It might introduce a non-zero term.

• 做完随机分组之后，不能再收集任何其他信息了。这大概也是为什么那些变量叫做Baseline，就是告诉你要在base的时候收集。
• 这也是为什么实验做完之后，我们最好不要只观察仍存活的病人组。因为存活也是一个variable。
• Cook拿出了一个COVID的paper，对只选择了打一针疫苗的人群进行观察。这也是post-randomization variable。

• 不需要假定方差相同，因为只会节省我们两个自由度左右，收益过小。而如果假设错了，会有很大的负收益。所以小亏。
• 甚至不需要Normality，因为数据量大的时候可以用CLT顶

x = rnorm(100000)
hist(x,breaks = 30) # 这很normal
y = abs(x)
hist(y,breaks = 30) # 这个趋势不太妙，单调下降
z = sqrt(y)
hist(z,breaks = 30) # 我们会发现，又有点像normal了。
# 但是这个图本身有问题，要想使这个图表现好，需要同方差 + 正态性。但这个图是用来检测同方差的。所以有时候这个图表现不好，我们以为是异方差，结果其实是因为没有正态性。confounding。 # 旧版

17.11.13
##为什么我们要用R里面的anova函数比较多个模型？
https://stats.stackexchange.com/questions/115304/interpreting-output-from-anova-when-using-lm-as-input

①First of all, you may be perfectly satisfied with the summary output, and that’s fine. However, the ANOVA table may offer some advantages. First, if you have a categorical / factor variable with more than two levels, the summary output is hard to interpret. It will give you tests of individual levels against the reference level, but won’t give you a test of the factor as a whole

②Another reason you might prefer to look at an ANOVA table is that it allows you to use information about the possible associations between your independent variables and your dependent variable that gets thrown away by the t-tests in the summary output. Consider your own example, you may notice that the p-values from the two don’t match (e.g., for v1, the $p$-value in the summary output is 0.93732, but in the ANOVA table it’s 0.982400). The reason is that your variables are not perfectly uncorrelated

The result of this is that there are sums of squares that could be attributed to more than one of the variables. The t-tests are equivalent to ‘type III’ tests of the sums of squares, but other tests are possible. The default ANOVA table uses ‘type I’ sums of squares, which can allow you to make more precise–and more powerful–tests of your hypotheses

Anova的目的：partition the sums of squares

##Further understanding of Type 123 test of RSS:
https://stats.stackexchange.com/questions/20452/how-to-interpret-type-i-type-ii-and-type-iii-anova-and-manova/20455#20455

Lets imagine that there are just two factors A and B (and we’ll throw in the A*B interaction later to distinguish type II SS). Further, lets imagine that nij不完全一样。
Now your two factors are correlated with each other.The problem with your factors being correlated is that there are sums of squares that are associated with both A and B.

When computing an ANOVA (or any other linear regression), we want to partition the sums of squares. A partition puts all sums of squares into one and only one of several subsets. (For example, we might want to divide the SS up into A, B and error.) However, since your factors (still only A and B here) are not orthogonal there is no unique partition of these SS. In fact, there can be very many partitions, and if you are willing to slice your SS up into fractions (e.g., “I’ll put .5 into this bin and .5 into that one”), there are infinite partitions. A way to visualize this is to imagine the MasterCard symbol: The rectangle represents the total SS, and each of the circles represents the SS that are attributable to that factor, but notice the overlap between the circles in the center, those SS could be given to either circle. The question is: How are we to choose the ‘right’ partition out of all of these possibilities? Let’s bring the interaction back in and discuss some possibilities:
Type I SS:
• SS(A)
• SS(B|A)
• SS(A*B|A,B)
Type II SS:
• SS(A|B)
• SS(B|A)
• SS(A*B|A,B)
Type III SS:
• SS(A|B,AB)
• SS(B|A,A
B)
• SS(A*B|A,B)

Notice how these different possibilities work. Only type I SS actually uses those SS in the overlapping portion between the circles in the MasterCard symbol. That is, the SS that could be attributed to either A or B, are actually attributed to one of them when you use type I SS (specifically, the one you entered into the model first). In both of the other approaches, the overlapping SS are not used at all. Thus, type I SS gives to A all the SS attributable to A (including those that could also have been attributed elsewhere), then gives to B all of the remaining SS that are attributable to B, then gives to the AB interaction all of the remaining SS that are attributable to AB, and leaves the left-overs that couldn’t be attributed to anything to the error term.

Type III SS only gives A those SS that are uniquely attributable to A, likewise it only gives to B and the interaction those SS that are uniquely attributable to them. The error term only gets those SS that couldn’t be attributed to any of the factors. Thus, those ‘ambiguous’ SS that could be attributed to 2 or more possibilities are not used. If you sum the type III SS in an ANOVA table, you will notice that they do not equal the total SS. In other words, this analysis must be wrong, but errs in a kind of epistemically conservative way. Many statisticians find this approach egregious, however government funding agencies (I believe the FDA) requires their use.

The type II approach is intended to capture what might be worthwhile about the idea behind type III, but mitigate against its excesses. Specifically, it only adjusts the SS for A and B for each other, not the interaction. However, in practice type II SS is essentially never used. You would need to know about all of this and be savvy enough with your software to get these estimates, and the analysts who are typically think this is bunk.
There are more types of SS (I believe IV and V). They were suggested in the late 60’s to deal with certain situations, but it was later shown that they do not do what was thought. Thus, at this point they are just a historical footnote.

As for what questions these are answering, you basically have that right already in your question:
• Estimates using type I SS tell you how much of the variability in Y can be explained by A, how much of the residual variability can be explained by B, how much of the remaining residual variability can be explained by the interaction, and so on, in order.
• Estimates based on type III SS tell you how much of the residual variability in Y can be accounted for by A after having accounted for everything else, and how much of the residual variability in Y can be accounted for by B after having accounted for everything else as well, and so on. (Note that both go both first and last simultaneously; if this makes sense to you, and accurately reflects your research question, then use type III SS.)

2017.11.14
##What does Standardized Residuals mean?

Intuition:every residual has different variance, so we need to consider standardized residuals.

When you compare the cells, the standardized residual makes it easy to see which cells are contributing the most to the chi-square value, and which are contributing the least.If your sample is large enough, the standardized residual can be roughly compared to a z-score. Standardization can work even if your variables are not normally distributed.
• If the residual is less than -2, the cell’s observed frequency is less than the expected frequency.
• Greater than 2 and the observed frequency is greater than the expected frequency. 之所以用2是因为95%双边正态

2017.11.15

##One Sample T-test:

The test statistic is calculated as:
R code： t.test(a, mu=120)

2017.11.19
###为什么我们要对数据进行变换（比如log）：

2017.11.20
###Forward and Backward selection 的主要差别：

Forward selection has drawbacks, including the fact that each addition of a new feature may render one or more of the already included feature non-significant (p-value>0.05). An alternate approach which avoids this is backward selection.The backward method is generally the preferred method, because the forward method produces so-called suppressor effects. These suppressor effects occur when predictors are only significant when another predictor is held constant.

2017.11.25

### Added Variable Plot (Partial Regression Plot)

For illustration I will take a less complex regression model Y=β1+β2X2+β3X3+ϵ where the predictor variables X1 and X2 may be correlated. Let’s say the slopes β2 and β3 are both positive so we can say that (i) Y increases as X2 increases, if X3 is held constant, since β2 is positive; (ii) Y increases as X3 increases, if X2 is held constant, since β3 is positive.

Note that it’s important to interpret multiple regression coefficients by considering what happens when the other variables are held constant (“ceteris paribus”). Suppose I just regressed Y against X2 with a model Y=β′1+β′2X2+ϵ′. My estimate for the slope coefficient β′2, which measures the effect on Y of a one unit increase in X2 without holding X3 constant, may be different from my estimate of β2 from the multiple regression - that also measures the effect on Y of a one unit increase in X2, but it does hold X3 constant. The problem with my estimate β′2^ is that it suffers from omitted-variable bias if X2 and X3 are correlated.

To understand why, imagine X2 and X3 are negatively correlated. Now when I increase X2 by one unit, I know the mean value of Y should increase since β2>0. But as X2 increases, if we don’t hold X3 constant then X3 tends to decrease, and since β3>0 this will tend to reduce the mean value of Y. So the overall effect of a one unit increase in X2 will appear lower if I allow X3 to vary also, hence β2′<β2. Things get worse the more strongly X2 and X3 are correlated, and the larger the effect of X3 through β3 - in a really severe case we may even find β2′<0 even though we know that, ceteris paribus, X2 has a positive influence on Y!

A lot of the value of an added variable plot comes at the regression diagnostic stage, especially since the residuals in the added variable plot are precisely the residuals from the original multiple regression. This means outliers and heteroskedasticity can be identified in a similar way to when looking at the plot of a simple rather than multiple regression model. Influential points can also be seen - this is useful in multiple regression since some influential points are not obvious in the original data before you take the other variables into account. In my example, a moderately large X2 value may not look out of place in the table of data, but if the X3 value is large as well despite X2 and X3 being negatively correlated then the combination is rare. “Accounting for other predictors”, that X2 value is unusually large and will stick out more prominently on your added variable plot.

∗∗ More technically they would be the residuals from running two other multiple regressions: the residuals from regressing Y against all predictors other than X2 go on the vertical axis, while the residuals from regression X2 against all other predictors go on the horizontal axis. This is really what the legends of “Y given others” and “X2 given others” are telling you. Since the mean residual from both of these regressions is zero, the mean point of (X2 given others, Y given others) will just be (0, 0) which explains why the regression line in the added variable plot always goes through the origin. But I often find that mentioning the axes are just residuals from other regressions confuses people (unsurprising perhaps since we now are talking about four different regressions!) so I have tried not to dwell on the matter. Comprehend them as “X2 given others” and “Y given others” and you should be fine.

2017.12.3

### Imbalanced Classification Problems

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

Definition: An imbalanced classification problem is one in which the dependent variable has imbalanced proportion of classes. In other words, a data set that exhibits an unequal distribution between its classes is considered to be imbalanced.

Reason: Below are the reasons which leads to reduction in accuracy of ML algorithms on imbalanced data sets:
1. ML algorithms struggle with accuracy because of the unequal distribution in dependent variable.
2. This causes the performance of existing classifiers to get biased towards majority class.
3. The algorithms are accuracy driven i.e. they aim to minimize the overall error to which the minority class contributes very little.
4. ML algorithms assume that the data set has balanced class distributions.
5. They also assume that errors obtained from different classes have same cost (explained below in detail).

Solution：Below are the methods used to treat imbalanced datasets:
1. Undersampling
2. Oversampling
3. Synthetic Data Generation
4. Cost Sensitive Learning

2017.12.5

①The mid P is less conservative (that is more powerful)
②For larger samples the P value obtained from a χ² test with Yates’ correction will correspond to the conventional approach, and the P value from the uncorrected test will correspond to the mid P value.——应该是说小样本因为离散程度更高，所以需要mid p-value.大样本一般不用mid p-value

### Fisher Exact Test

http://www2.fiu.edu/~howellip/Fisher.pdf

①单边p值：算出一个若H0成立时候的均值。若观测值大于均值，那么P值就是大于等于观测值的概率。如果观测值小于均值，那么P值就是小于等于观测值的概率。
②双边p值：对于所有小于观测值的情况，进行积分。

2017.12.7

### 各种Independence之间的关系

https://onlinecourses.science.psu.edu/stat504/node/108
1. Mutual independence – all variables are independent from each other, denoted (A,B,C)
or A⊥⊥B⊥⊥C
.
2. Joint independence – two variables are jointly independent of the third, denoted (AB,C)
or AB⊥⊥C
.
3. Marginal independence – two variables are independent while ignoring the third, e.g., θAB=1, denoted (A,B)
.
4. Conditional independence – two variables are independent given the third, e.g., θAB(C=k)=1 for all k=1,2,…,K, denoted (AC,BC)
or A⊥⊥B|C
.
5. Homogeneous associations – conditional (partial) odds-ratios don’t depend on the value of the third variable, denoted (AB,AC,BC)
.
Before we look at the details, here is a summary of the relationships among these models:
▪ Mutual independence implies joint independence, i.e., all variables are independent of each other.
▪ Joint independence implies marginal independence, i.e., one variable is independent of the other two.
▪ Marginal independence does NOT imply joint independence.
▪ Marginal independence does NOT imply conditional independence.
▪ Conditional independence does NOT imply marginal independence.

2017.12.11

2018.1.23

### logistic回归的两种报错原因：

http://blog.csdn.net/csqazwsxedc/article/details/52033506

2018.1.27

### 如何确定Experimental Unit：

http://www.isogenic.info/html/6__experimental_unit.html#more

2018.1.29

①iid个体中每一个数据点都提供了和前一个一样多的信息。而如果有相关性，比如说ρ=0.99. 那么基本上第一个数据点就代表了全部信息。后面的数据点与第一个相比都没有提供足够多的信息。
②当我们说到充分统计量的时候，老师说每一个x向量的Sample-Size都是n。如果我们有多个数据比如说一个做了10次，一个做了20次。如果是iid的话，我们应该把他们放到一起变成30个。这样减少方差，没必要分成两个来考虑。

2018.1.30
Fangfang：在2-sample-t-test中，两个分布可以不用都是Normal。只要他们都是对称的，或者skewed的程度差不多就可。

# 新版

• Gamma 函数不是单调递增的 • 检验同方差：Bartlett’s test（采用均值和MSE。不是很robust, 最好没有outlier，很需要正态假设）, the Levene’s test（采用中位数和MAE，更robust）
• One-way-anova中计算test-power的方式还可以用OC Curves（operating characteristic）$\Phi^2=\frac{n\delta^2}{2a\sigma^2}$其中$\delta$指的是组间最大差值:$max(t_i)-min(t_i)$ • one sample t-test 分母为什么是$\sqrt{n}$ ? 因为$\bar{X}$ 的方差是$\sigma^2 / n$

• Boxplot上下的线应该是1.5IQR，但是会停在1.5IQR以内最大值/最小值（所以有时候那条线可能会比1.5IQR短）

• E(X1 + X2) = E(X1) + E(X2) 即使X1, X2不是独立的。因为在double integral的时候$\int X_1f(X_1, X_2) dX_1dX_2$ 会把X2积掉，从而只剩下X1的边缘分布。

• lm的斜率：cov(X,Y) / var(X)

PCA只是Maximize Variance，如果我们想知道什么方向是最Skewness的，我们可以分解更高阶矩

MDS(Multidimensional Scaling)就像是把一个橘子皮撕开，张开在2维平面，而PCA不动橘子，只是改变我们观察的角度。

Loh’s Bootstrap Project