统计学习笔记

介绍

记录一下学统计的时候遇到的一些问题

日常

2021/11/11

Ordinal Regression

2021/11/10

Linear Model 还假设 the mean and variance of the outcome are unrelated (i.e. the variance is a constant)。所以当我们的 residual plot 出现喇叭形的时候,其实有可能是 Poisson 分布?方差随着 y 变大而变大。

2021/10/31

Lasso 和 Stepwise 来选变量都会导致biased parameter。所以我们做inference的时候会出问题。但是我们可以只用它来做prediction,那还尚可。

2021/10/27

拉格朗日乘数法:约束条件的等高线与曲线相切 -> 两者的梯度平行。梯度平行的定义即是:$\nabla f=\lambda \nabla g$

拉格朗日乘数法遇到不等式:

2021/10/19

在我们impute categorical的时候,我们可以先变成dummy matrix,再impute mean。这样会获得一个weighted版本的imputation。
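下面是一个小草稿(数据和变量名 x 都是我随手设的),演示把 categorical 变量先变成 dummy matrix、再对每一列 impute 观测均值:缺失的行会被填成各个类别的观测频率,也就是文中说的 weighted 版本的 imputation。

# 一个带缺失的 categorical 变量(示例数据)
x <- factor(c("a", "b", "b", "c", NA, NA))

# 先对非缺失的行生成 dummy matrix(去掉 intercept)
d <- model.matrix(~ x - 1, data = data.frame(x = x))
d_full <- matrix(NA, nrow = length(x), ncol = ncol(d),
                 dimnames = list(NULL, colnames(d)))
d_full[!is.na(x), ] <- d

# 每一列用观测到的均值(即各类别的频率)去填 NA
col_means <- colMeans(d_full, na.rm = TRUE)
for (j in seq_len(ncol(d_full))) {
  d_full[is.na(d_full[, j]), j] <- col_means[j]
}
d_full  # 缺失的两行被填成 (0.25, 0.5, 0.25),相当于按类别频率加权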

2021/10/14

如果有NA的话,在plug-in mean和mode之后,可以考虑再添加一列indicator(是否为imputed),这样可能会稍微好一点。(好像是Loh在2019年参加的会议提到的想法)

Logistic里面不太好加interaction,因为不好解释,interaction是作用在log-odds上面的。如果变回到pi的level,interaction的关系会变得很复杂。

Least Square和正态没关系。我们可以有一个lm模型,但是没有正态假设。

只有Random Sample才能去引入那些正态假设,如果我们有了Population数据,就没法做inference了。或者说这样做是不ethical的。

老罗举了个好玩的例子,你来上课又不听光看,就像你去健身房不锻炼,光看别人锻炼一样。哈哈哈哈

Transforming Y can affect both linearity and equal variance, but transforming X can affect only linearity.

  • 假设是一个指数变换 $Y^{\prime} = e^{Y}$。那么效果是越大的 $Y$ 越容易有大的方差。

2021/10/13

Linear Regression里面随着data变多,模型一直保持不变。所以才会有 $\hat{\beta}$ 正态收敛到均值。但是GUIDE不同。分类树算法随着数据不断变多之后,树会发生变化,所以不存在limiting distribution的说法。

如果把伴侣年龄给impute了,那这套数据集就无法代表整体的population。因为整体的population中,有40%是没有伴侣的。

2021/10/10 变异系数

变异系数:Coefficient of variation / relative standard deviation (CV / RSD)

  • 判断准则随便找了一个:CV < 0.1 弱变异,0.1–0.4 低等变异,0.4–1 中等变异,> 1 强变异

2021/10/09 方差齐性检验

转自这篇CSDN

方法:

  1. 【正态敏感】方差比:大除小,得到一个F值。(所以可以是单边检验)
  2. 【正态敏感】Hartley检验:用于多组方差的检验,用多组中最大的方差除最小的方差,得到一个F值。
    • 这里用的 F table 是一个 $F_{MAX}$ 的 sampling distribution。
    • 假设正态,并且每组的数量相同。不足之处是正态不满足的时候不够robust,有时候小p值会来自于非正态。
  3. 【正态敏感】Bartlett检验:一个诡异的构造,出来一个卡方检验
  4. 【正态敏感】Cochran’s C test: 检验的其实是方差的outlier,是否有某一个variance超出去了。统计量十分简单,就是方差除方差和的ratio。
    • 需要每组数量相同。
    • 比起Hartley,Cochran用了更多的数据
  5. 【正态不敏感】Levene检验:将每个值先转换为该值与其组内均值的偏离程度绝对值,然后再用转换后的偏离程度去做方差分析(常规ANOVA),即组间方差/组内方差。
  6. 【正态不敏感】Brown–Forsythe检验:修改版的Levene。偏离程度用median刻画(其他方法还有用trimmed mean)。比Levene更robust

方差不齐次的时候,将ANOVA变成了Kruskal–Wallis one-way analysis of variance:比较中位数。
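上面这几种检验在 R 里大多有现成函数,下面是一个小示例(数据随手模拟;leveneTest 来自 car 包,center = median 时就是 Brown–Forsythe 版本):

library(car)  # leveneTest

set.seed(1)
df <- data.frame(
  y = c(rnorm(30, sd = 1), rnorm(30, sd = 2), rnorm(30, sd = 3)),
  g = factor(rep(c("A", "B", "C"), each = 30))
)

bartlett.test(y ~ g, data = df)               # Bartlett:对正态敏感
leveneTest(y ~ g, data = df, center = mean)   # Levene:用组内均值的绝对偏差
leveneTest(y ~ g, data = df, center = median) # Brown–Forsythe:用中位数,更robust
kruskal.test(y ~ g, data = df)                # 方差不齐/非正态时的非参数替代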

2021/10/03

把numeric的数据变成categorical感觉有一个好处:边缘效应?如果有一个变量在小于5和大于10的表现相似,numerical的方法无法检测出来,但是categorical可以。

对于某一些0/1的influence point,我们是否可以将它重新采样为0.6/0.4?

  • 因为在人烟稀少的地方,可能 p = 0.51,但是因为只有0/1,我们就会把它当成 p = 1 来看待。所以重新采样的过程可以使得数据的 influential point 没那么夸张
  • 这个重新采样的过程也只适用于那些对 influential point 敏感的算法,比如说 linear regression。

在老罗讲COVID data的时候,说数据集是一天天积累起来的,数据其实不是iid。因为随着时间的推移,可能我们有了更好的医疗条件,我们也有可能面临的不是同一种病毒(比如说 Delta Variant)

老罗用GUIDE发现了一个95% missing 的数据其实很重要。仔细一想,95%NA的数据怎么用?我可能根本不会用。

老罗开始上课的时候抨击了ACS采数据的时候用5%的人口来采,这样就导致各个州的数据集大小相差很大。

  • 后来上课时讲到,其实可以用IPW来补偿,也没啥问题。如果每个州的取样相同的话,那么一个加州人可能代表了100个加州人,而一个威斯康辛人可能只代表10个威州人。

老罗88年写了个文章,说 $\chi^2$ 检验收敛到 $\chi^2$ 分布的速率是不同的,与 df 有关系。所以那些 continuity correction 都有些许扯淡,因为所有事情都应该 depend on df。

  • 但是收敛速度这个东西并不是一个统计学的名词,我们该怎么来刻画他呢?

2021/09/28

Importance Score利用了Multiple Correction,所以会导致变量多的时候,IS的大小会变(但是因为是同样的修正系数,所以我猜相对位置应该不会变)。

2021/09/21

胡家欣推荐了两个好用的画图包:patchwork、GGally

如果有 Complete separation 出现,Logistic 的 summary 会出现很大的 estimate 和很大的 std.error。原因是要找到一条斜率很大、几乎垂直的线。而斜率越大,对应 X 的 $\beta$ 也就越大。
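一个可以复现这个现象的小实验(数据是我随手造的):x 的正负完全决定 y,glm 会给出巨大的 estimate 和 std.error,并伴随 warning。

# x < 0 时 y 全是 0,x > 0 时 y 全是 1:complete separation
x <- c(-3, -2, -1, 1, 2, 3)
y <- c( 0,  0,  0, 1, 1, 1)

fit <- glm(y ~ x, family = binomial)
summary(fit)  # 系数和标准误都会非常大,并出现 fitted probabilities 0/1 的警告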

2021/09/20 Loh’s 761

Logistic 的 S-curve 是因为对应的 $X$ 是连续型变量,所以对于离散的 $X$,都没法看到 S-Curve。但是 S-Curve 的斜率不是 $\beta$,因为方程其实是:

$$y = \frac{\exp\{\beta_0 + \beta_1 X\}}{1 + \exp\{\beta_0 + \beta_1 X\}}$$

所以求导之后,得到的不是 $\beta$

summary(vector) # 这个命令会看到NA的数量

Interaction 有可能全局不显著,但是在某个subspace里面显著 -> GUIDE有了用武之地

一个思路:以后可以更加focus on interpretation,因为老罗就能把一棵平平无奇的树,讲的很动人。

为什么R的小数是以2.2E-16来作为分界线,而不是其他的什么数?
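我猜这个 2.2e-16 就是 double 的 machine epsilon:R 在格式化很小的 p 值时,默认用它作为下限。可以自己验证一下:

.Machine$double.eps
# [1] 2.220446e-16

# R 打印很小的 p 值时默认以 machine epsilon 作为下限
format.pval(1e-300, digits = 2)
# [1] "< 2.2e-16"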

在我们比较不同的卡方检验时(比如在GUIDE的splitting variable selection),自由度不同时很难比较。就算p-value可以计算出来,但是因为数字很小,就会产生一个收敛的问题。不同自由度的卡方分布收敛速率不同。

如果有20维的数据,why should there be a straight line? 【线性模型的拷问】

Linear Regression 的 Interpretation的问题:有时候一个simple lm系数是正的变量,因为其他变量的加入而变成了负系数。这时我们如果解释为负面作用,岂不是闹笑话。

  • 所以有时一个variable解释的过度了,就需要其他的变量来修正。lm的目的是准确的拟合以及预测,而并不是为了解释coef interpretation。就像鞋用来穿,而不是用来吃一样。
  • 仔细一想,树好像和其他的变量关系不大,但LDA会受影响。

The choice of reference level would change the R output when there is non-convergence.

  • 你不知道什么时候R会双手一举,说老子不算了

GUIDE的importance score:

  • 简单来讲算法是weighted chi-squared test
  • 为何能给出置信度:因为Loh做了一个permutation test,去获得这个未知的分布。重复这个过程10000次,难怪importance score算得很慢。

GUIDE的interface是跟别人互动,本来我以为是个累赘,结果被老罗硬生生说成了一个feature。他说为了避免用户背住所有option,不如一个一个问。所以我在想或许我也可以做一个R版本的互动。

为什么Multicollinearity虽然导致很大的方差,却可能估计得还不错?

  • 想象一下,在一个三维空间中,你有一些点,他们形成了近似一条直线(Multicollinearity)。当你要用一个平面去拟合的时候,就像是跷跷板,中间的支撑很不稳定,所以方差很大。但是如果未来的点还在这条直线上的话,那么就没什么大问题。

2021/09/19 998 Homework

Nested Design: The levels of one factor (e.g., factor B) are similar but not identical for different levels of another factor (e.g., A). 比如A1对应了B123,A2对应了B1234… 这时我们就会说 the levels of factor B nested under the levels of factor A.

  • 在这种setting下,random effect 和 fixed effect 的ANOVA表变得不同。对应的F-test也会不一样。
  • 比如说A fixed, B random。A的F-test是要用 MSA / MSB 而不是 MSA / MSE

The Two-Stage Nested Design: Page 604 in Douglas’s Textbook.

# B: B 用的是组内序号 1,2,3,所以不同组会出现同样的数字 1,2,3
# B.id: 这里用的是绝对序号 1,2,3...n,所以不同组不会出现同样的数字
# 以下两种写法等价,结果相同
lmer(y ~ A + (1|A:B))
lmer(y ~ A + (1|B.id))

2021/09/14 Miaoyan’s 602

关于Linear Regression里面的参数如何解读,经典的说法是:

A unit change in $X_j$ is associated with a $\beta_j$ average change in $Y$, while holding all other predictors fixed.

  • 听上去没毛病,但是问题来了,如果设计矩阵X之间存在correlation呢?如果有cor,你很难做到在改变一个的同时,不改变其他X。所以这个定义干脆是用不了。

所以最后的解决办法是:

“The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.” —— Fred Mosteller and John Tukey, paraphrasing George Box

检测Multicollinearity的方法:Variance Inflation Factor
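一个计算 VIF 的小示例(数据随手模拟,用的是 car::vif;也可以按定义 $1/(1-R_j^2)$ 手动验证):

library(car)

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # 与 x1 高度相关
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)
vif(fit)  # x1、x2 的 VIF 会非常大

# 手动验证:VIF_j = 1 / (1 - R^2_j),R^2_j 是 x_j 对其他 X 回归的 R^2
1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)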

Dependent Tests

一个Positive的例子:

Consider testing all pairwise differences in means between four groups; this is $N = 4 \cdot 3/2 = 6$ tests. Each of the six $p$-values alone is uniformly distributed. But they are all positively correlated: if (on a given attempt) group A by chance has a particularly low mean, then the A-vs-B comparison might yield a low $p$-value (this would be a false positive). But in this situation it is likely that A-vs-C, as well as A-vs-D, will also yield low $p$-values. So the $p$-values are obviously non-independent and moreover they are positively correlated with each other.

  • 可以见得这种comparison testing,如果有一个显著了,会使得其他的comparison也显著。

一个 Negative (Positive) 的例子:

Imagine I am testing whether A = 0 and also whether B = 0 against one-tailed alternatives (A > 0 and B > 0). Further imagine that B depends on A. For example, imagine I want to know if a population contains more women than men, and also if the population contains more ovaries than testes. Clearly knowing the p-value of the first question changes our expectation of the p-value for the second. Both p-values change in the same direction, and this is PRD. But if I instead test the second hypothesis that population 2 has more testes than ovaries, our expectation for the second p-value decreases as the first p-value increases. This is not PRD.

  • PRD: Positive Regression Dependency
  • 这个dependence感觉有点偏理论了,先溜了。

2021/09/13

Interpreting Residual and Null Deviance in GLM R

  • 两种 Deviance 都是由 $-2\log{\Lambda}$ 算出来的。
  • Null Deviance 比较的是全模型和只有 intercept 的模型
  • Residual Deviance 比较的是全模型和我们实际拟合的模型(fitted model)
  • 全模型指的是 Saturated Model:每个数据有每个数据的参数,df = n 拉满

Rank the variable (Importance Score):

  • Why Rank? Based on what criteria? 我目前认为这个方法或许不该出现,因为他给出了我误以为有用但其实没用的信息。就像我对于Correlation过分重视,但后来发现他只能描述线性关系,就没那么帅了。

树在每一个节点的划分其实都做了multiple testing:

  • 所以如果变量过多的话,Bonferroni之后没有显著的怎么办?
  • 我想一个节点上面找到了p最小的那个划分,不管它是否显著,其实对于最终效果都是好事。树本身对于一个节点的false positive不是那么关心,因为只要最后分的好即可。

GUIDE对于numeric变量的NA处理方式是划为一类,我觉得不太妙。但是想到了「Age of Spouse」问题,我又觉得好像确实没法用机器来做。对于这种可能填了会很错的事情,还是不填比较好。

  • 给一个没有伴侣的人找到了伴侣的年龄。人类会觉得是个离谱的错误。但是机器在分类的时候也会觉得这个问题很离谱吗?

LASSO没法给出Confidence Interval,它的beta估计也是biased的。人们用它只是因为好发论文?

  • 对于不同的bootstrap sample,beta的变化可能会很大。所以CI其实也没啥意义。

Human Bias:

  • 我看到了一个curve pattern,可能会手动添加一个二次项,这个就属于bias。但是GUIDE等机器学习方法看不到图形形状,所以就不会出现这个问题。它们可以用CV来解决问题,但是很明显人不能成百上千次的重复观察来消除这种bias。
  • 所以我曾想让机器有看的能力,现在看起来不一定是个好事?

2021/09/12 老李’s Question

完全消耗系数计算

  • 列昂惕夫逆矩阵:在计算总投入的时候,要加上其他的因素,比如人工费

2021/09/11 Miaoyan’s 602

Correlation Model:

Inference on Pearson correlation coefficient:

Mixed-Effects Model:


2021/09/10 Miaoyan’s 602

Regression: Joint distribution of outcomes given all covariates. 每一个数据都是有用的,合在一起产生了一个 Joint 分布。所以在我们使用 $(X'X)^{-1}$ 的时候,其实每一行 X 都贡献了一些作用。

Some other Regressions:

  • Binary Regression - Logistic Regression: Log-Odds ~

    • Bernoulli -> Binomial Logistic Regression: response variable变为【成功了几次】
  • Binary Regression - Probit Regression: $\operatorname{Pr}(Y=1 \mid X)=\Phi\left(X^{T} \beta\right)$

  • Poisson Regression: Log(y) ~

  • 这三个模型属于Non-Gaussian Model: There is no error term, there’s just an unknown probability. The logistic model is a probability model. And of course, the error is not iid distributed. 随机性被Bernoulli分布刻画走了
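上面这几种 GLM 在 R 里都可以用 glm() 指定不同的 family 来拟合,下面是一个只为演示写法的小草稿(数据随手模拟):

set.seed(1)
n <- 200
x <- rnorm(n)

# Logistic regression:log-odds 是 x 的线性函数
y_bin <- rbinom(n, size = 1, prob = plogis(-0.5 + x))
glm(y_bin ~ x, family = binomial(link = "logit"))

# Probit regression:Pr(Y = 1 | X) = Phi(X'beta)
glm(y_bin ~ x, family = binomial(link = "probit"))

# Poisson regression:log(E[Y]) 是 x 的线性函数
y_cnt <- rpois(n, lambda = exp(0.2 + 0.5 * x))
glm(y_cnt ~ x, family = poisson(link = "log"))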

Partial Regression: Attempts to show the effect of adding another variable to a model that already has one or more independent variables. Also referred to as added variable plots, adjusted variable plots, and individual coefficient plots.

Partial Correlation: Formally, the partial correlation between $X$ and $Y$ given a set of $n$ controlling variables $Z = \{Z_1, Z_2, ..., Z_n\}$, written $\rho_{XY.Z}$, is the correlation between the residuals $e_X$ and $e_Y$ resulting from the linear regression of $X$ with $Z$ and of $Y$ with $Z$, respectively.

  • $V$ 刻画的是参数如何被使用所导致的方差
    • The variance function relates the means to the variances
    • 其实 $V$ 是为了描述方差和均值之间的关系。因为它们只有没关系,才符合 linear regression 的假设。

About the dispersion:

A problem with logistic regression (as well as other GLMs) is overdispersion. Overdispersion occurs when errors (residuals) are more variable than expected from the theorized distribution. In case of logistic regression, the theorized error distribution is the binomial distribution. The variance of the binomial distribution is a function of its mean (or the parameter $p$). If there is overdispersion, the coefficient estimates will be more confident (smaller standard error values) than they should be. One can detect overdispersion by comparing the residual deviance with the degrees of freedom. If these two numbers are close, there is no overdispersion. Residual variation much larger than degree of freedom indicates overdispersion. Underdispersion, detected by lower residual deviance than the degrees of freedom, is also possible but more rare. In our example in Listing 11, there is no indication of overdispersion.

  • residual deviance 和 df 会在glm的summary里面给出来,相除即可得到dispersion。

  • Over dispersion是个坏事情

    A simple solution for overdispersion is to estimate an additional parameter indicating the amount of the overdispersion. With glm(), this is done with the so-called ‘quasi’ families, i.e., in logistic regression we specify family=quasibinomial instead of binomial.
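一个检查 overdispersion 的小草稿(这里用随手模拟的计数数据和 Poisson 模型来演示;logistic 的情形把 quasipoisson 换成 quasibinomial 即可):

set.seed(1)
n <- 200
x <- rnorm(n)
# 用 negative binomial 生成计数数据,却用 Poisson 去拟合 -> overdispersion
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1)

fit <- glm(y ~ x, family = poisson)

# dispersion 的粗略估计:residual deviance / 残差自由度,远大于 1 说明 overdispersion
deviance(fit) / df.residual(fit)

# 简单的补救:换成 quasi 族,标准误会被放大
summary(glm(y ~ x, family = quasipoisson))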

对于广义的误差矩阵形式(非独立,非同方差)的Linear Regression:

$$\hat{\beta}=\left(\boldsymbol{X}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{X}\right)^{-1} \boldsymbol{X}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{Y}, \quad \operatorname{Cov}(\boldsymbol{Y} \mid \boldsymbol{X})=\sigma^{2} \boldsymbol{\Sigma}$$
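按这个公式手算一遍的小草稿(Σ 是我随手设的 AR(1) 相关阵),顺便验证它等价于先做 whitening 再跑普通最小二乘:

set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))

# 一个已知的协方差结构:AR(1),rho = 0.6
rho   <- 0.6
Sigma <- rho^abs(outer(1:n, 1:n, "-"))

# 按 Cov(Y|X) = sigma^2 * Sigma 生成数据(这里 sigma = 1)
L    <- chol(Sigma)            # Sigma = t(L) %*% L
beta <- c(2, 3)
y    <- X %*% beta + drop(t(L) %*% rnorm(n))

# GLS: beta_hat = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Sinv <- solve(Sigma)
solve(t(X) %*% Sinv %*% X, t(X) %*% Sinv %*% y)

# 等价写法:先 whitening(W 满足 W Sigma W' = I),再跑 OLS
W  <- solve(t(L))
wy <- drop(W %*% y)
wX <- W %*% X
coef(lm(wy ~ wX - 1))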


2021/09/09 Loh’s 761

I have been thinking about this day and night for 36 years. 1951-1985-2021 老罗在今年9月就70岁了。

Parameter 多的时候,虽然很容易有显著的变量,但是来个Bonferroni Correction,就都凉了。

  • 很容易显著是因为 Under $H_0$, p-value is uniformly distributed. 所以大约有5%的变量会是显著的
  • 尤其是你还要考虑到变量之间的交互作用

Parameter多,但是数据量也大的时候呢,会怎么样?

  • 我认为数据量越大的时候,Pattern的寻找就会越简单。就像 $\beta=0.001$。数据量小的时候可能检验不出来,所以我们会认为 $\beta=0$ 这个假设没法拒绝。但是数据量只要够大,$0.001$ 这一点点的区别也能找出来。这就带来一个问题:我们需要重新定义我们的原假设,让他对一些没什么必要的noise不要太关注。

解决Multiple Testing FWER 的方法:

  • Holm step-down:从小到大排序后,只要满足 $P_{k}<\frac{\alpha}{m+1-k}$ 即拒绝。可以看出第一步和 Bonferroni 是一样的,但是之后就会变得比 Bonferroni 更强。

  • Hochberg step-up: 跟 Holm 判断方法几乎一样,只不过不同的是 Holm 从最小的开始,只要 p 值不够小就停下来。而 Hochberg 是从最大的开始,直到 p 值够小才停。

# Hochberg在能用的时候比Holm更强
pp = c(0.015,0.015,0.02,0.05)
p.adjust(pp,method = 'holm')
# [1] 0.06 0.06 0.06 0.06
p.adjust(pp,method = 'hochberg')
# [1] 0.04 0.04 0.04 0.05
# 这个例子中,pp最小的不够小,所以0.015*4 = 0.06>0.05不够拒绝的条件,Holm会认为这四个都不显著
# 但是第三个pp因为0.02*2 = 0.04<0.05够了拒绝的条件,Hochberg会认为第三个显著,而前两个由于monotone也显著。
  • 所以Hochberg需要加上条件:It holds only under non-negative dependence。这个要求在Sidak对于Bonferroni的优化中也见到过:$1-(1-\alpha)^{1/m} \Leftrightarrow \alpha/m$
  • Bonferroni和Holm适用于各种dependence,但是在positive related的时候会过于保守。这时可以试试Resampling Procedure。【待看】【The procedures of Romano and Wolf (2005a,b) dispense with this condition and are thus more generally valid.】
  • Harmonic mean p-value procedure 【待看】

Multiple Testing

Union-Intersection Testing: 要想拒绝一堆假设,只需要拒绝一个即可。这时

$$H_{I}: \bigcap_{i=1}^{m} H_{i} \text { versus } K_{U}: \bigcup_{i=1}^{m} K_{i}$$

  • 我们一般处于这个状态,所以需要Bonferroni等一系列办法。

Intersection-Union Testing: 要想拒绝一堆假设,必须全部拒绝。这时

$$H_{U}: \bigcup_{i=1}^{m} H_{i} \text { versus } K_{I}: \bigcap_{i=1}^{m} K_{i}$$

  • 当我们处于这个状态时,不需要调整。

    An interesting feature of intersection-union tests is that no multiplicity adjustment is necessary to control the size of a test but the individual hypotheses cannot be tested at levels higher than the nominal significance level either (Berger, 1982).

From Multiple Comparison Procedures:

  • FWEC: FWER under the complete null (FWEC) is the probability that at least one type-I error occurs given that all nulls are true. 所有 $H_0$ 都对的时候,出现一个错误的概率。

  • FWEP: FWER under a set of null (FWEP) is the probability that at least one type-I error occurs given that a subset of nulls are true. 只要有一些 $H_0$ 对了即可。

  • Weak Control of FWER: FWEC $\le \alpha$

  • Strong Control of FWER: FWEP $\le \alpha$ for all subsets of null hypotheses.

Closure Principle : The closure principle proposed by Marcus, Peritz and Gabriel (1976) plays a key role in the theory of multiple testing and provides a foundation for virtually all multiple testing methods arising in pharmaceutical applications. This principle has been used to construct a variety of stepwise testing procedures.

  • The closure principle states that an FWER-controlling testing procedure can be constructed by testing each hypothesis in the closed family using a suitable local $\alpha$-level test. This procedure rejects a hypothesis if all intersection hypotheses containing this hypothesis are rejected by the associated local tests.
  • Closure不一定具备下面的性质,每一个都能举出反例。

Monotone procedures: A monotone procedure rejects a hypothesis whenever it rejects another hypothesis with a larger p-value. For example, if $p_i < p_j$ then the rejection of $H_j$ automatically implies the rejection of $H_i$. Monotonicity helps to avoid logical inconsistencies; as such it is an essential requirement for multiple testing procedures. When a procedure does not have this property, monotonicity needs to be enforced by updating adjusted p-values.

Consonant procedures: A closed testing procedure is termed consonant (Gabriel, 1969) if the rejection of an intersection hypothesis $H_I$ with $I \subseteq\{1, \ldots, m\}$ and $|I|>1$ always leads to the rejection of at least one $H_J$ implied by $H_I$, i.e., $H_J$ with $J \subset I$. It is possible for a nonconsonant procedure to reject the global null hypothesis $H_I$, $I \subseteq\{1, \ldots, m\}$, without rejecting any other intersection hypotheses.

$\alpha$-exhaustive procedures: An $\alpha$-exhaustive procedure is a closed testing procedure based on intersection hypothesis tests the size of which is exactly $\alpha$ (Grechanovsky and Hochberg, 1999). In other words, $P\left(\text{Reject } H_{I}\right)=\alpha$ for any intersection hypothesis $H_I$, $I \subseteq\{1, \ldots, m\}$. If a procedure is not $\alpha$-exhaustive, one can construct a uniformly more powerful procedure by setting the size of all intersection hypothesis tests at $\alpha$.

Partitioning Principle: The first step involves partitioning the union of the hypotheses into mutually exclusive hypotheses. Since the hypotheses are disjoint, each one of them can be tested at level $\alpha$ without compromising the FWER control. 比如说:

$$H_{1}: \mu_{1} \leq \mu_{0}, \quad H_{2}: \mu_{2} \leq \mu_{0}$$

就可以被拆分为:

$$\begin{aligned} &H_{1}^{*}: \mu_{1} \leq \mu_{0} \quad \text { and } \quad \mu_{2} \leq \mu_{0} \\ &H_{2}^{*}: \mu_{1} \leq \mu_{0} \quad \text { and } \quad \mu_{2}>\mu_{0} \\ &H_{3}^{*}: \mu_{1}>\mu_{0} \quad \text { and } \quad \mu_{2} \leq \mu_{0} \end{aligned}$$


ANCOVA:ANOVA只能处理cat变量的X,regression一般处理continuous的X,ANCOVA介于这两者之间。我们一般会用dummy0-1来处理cat,这就是ANCOVA。在R中,可以用car::Anova(fit_lm, type = 'III') 来运行。

  • 直接summary(aov)用的是type-I,是不对的。
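一个对比 Type I 和 Type III 的小示例(不平衡的两因子数据是随手造的):summary(aov) 的结果依赖变量进入模型的顺序,而 car::Anova(type = 'III') 不依赖(按惯例最好配合 contr.sum 这类对比使用):

library(car)

set.seed(1)
df <- data.frame(
  A = factor(rep(c("a1", "a2"), times = c(20, 40))),
  B = factor(sample(c("b1", "b2"), 60, replace = TRUE))
)
df$y <- rnorm(60, mean = (df$A == "a2") + 0.5 * (df$B == "b2"))

# Type I(sequential):A 先进和 B 先进,SS 和 p 值不同
summary(aov(y ~ A + B, data = df))
summary(aov(y ~ B + A, data = df))

# Type III:与进入顺序无关(这里顺手把全局 contrasts 改成 contr.sum)
options(contrasts = c("contr.sum", "contr.poly"))
Anova(lm(y ~ A * B, data = df), type = "III")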

关于GUIDE

description file可以放入更多的信息:每一行虽然有默认输出,但是GUIDE只读取第一个Entry,之后你可以任意添加comments,而不需要考虑用什么comment符号。

未来是不是可以加入一个回去重选的功能?要不然反复输入同一个命令,还是有些annoying的。

GUIDE dsc file的路径问题,可以通过修改dsc中的第一行,把数据的相对路径改成绝对路径来解决(记得带引号)。这样一劳永逸。

Minimum Node Size可能是被老罗设置成为了1%的total sample size。我看到老罗有一个30层的树,这可能说明他设置的maximum level比30还要大。

KL03的NA解决办法:

C Method: 遍历所有变量对,选出resubstitution error最小的那一组。在fit LDA的时候如果有NA的话就直接扔掉对应的数据,用其他的数据进行拟合。在predict(resubstitution)的时候,所有的NA都被填成了mean和mode。

High leverage point 在GUIDE里面不是问题,因为二分法不在乎数值的具体大小(相对即可)。

Missing value flag在GUIDE里面可以帮助划分。没有的话NA只会全去某一侧,如果有的话可能missing code A会去左侧,missing code B会去右侧。

Party&Partykit在利用weight的时候会要求是正整数,因为算法是将同样的数据复制w遍。这样会导致在那一个点处的样本方差 = 0。

  • 或许以后需要了解一下其他算法的缺点,defend for GUIDE

如果左右孩子的class name list过长,可能会导致overlap。我们这时候可以用左右不同颜色来区分。当然我认为老罗的方法更好,太长的直接truncate为 $S_1$

STAT641

Never adjust for post-randomized variable. It might introduce a non-zero term.

  • 做完随机分组之后才收集的变量,不能再拿来做adjustment了。这大概也是为什么那些变量叫做Baseline,就是告诉你要在base的时候收集。
  • 这也是为什么实验做完之后,我们最好不要只观察仍存活的病人组。因为存活也是一个variable。
  • Cook拿出了一个COVID的paper,对只选择了打一针疫苗的人群进行观察。这也是post-randomization variable。

如果adjusted for more covariates, 不会使得type-I变化,不过会使得type-II变小,因为控制了方差。

对于t-test来说:

  • 不需要假定方差相同,因为只会节省我们两个自由度左右,收益过小。而如果假设错了,会有很大的负收益。所以小亏。
  • 甚至不需要Normality,因为数据量大的时候可以用CLT顶

为什么线性模型的lm有一个图是 $\sqrt{|\text{Standardized Residuals}|}$:为什么要取这个平方根?

x = rnorm(100000)
hist(x,breaks = 30) # 这很normal
y = abs(x)
hist(y,breaks = 30) # 这个趋势不太妙,单调下降
z = sqrt(y)
hist(z,breaks = 30) # 我们会发现,又有点像normal了。
# 但是这个图本身有问题,要想使这个图表现好,需要同方差 + 正态性。但这个图是用来检测同方差的。所以有时候这个图表现不好,我们以为是异方差,结果其实是因为没有正态性。confounding。

对于 LRT 和 Score Test 来说,同一个问题不管用哪种 parameterization,都会得到一样的结果。但是 Wald Test 对如何 parameterization 很敏感。

旧版

17.11.13
##为什么我们要用R里面的anova函数比较多个模型?
https://stats.stackexchange.com/questions/115304/interpreting-output-from-anova-when-using-lm-as-input

①First of all, you may be perfectly satisfied with the summary output, and that’s fine. However, the ANOVA table may offer some advantages. First, if you have a categorical / factor variable with more than two levels, the summary output is hard to interpret. It will give you tests of individual levels against the reference level, but won’t give you a test of the factor as a whole

②Another reason you might prefer to look at an ANOVA table is that it allows you to use information about the possible associations between your independent variables and your dependent variable that gets thrown away by the t-tests in the summary output. Consider your own example, you may notice that the p-values from the two don’t match (e.g., for v1, the p-value in the summary output is 0.93732, but in the ANOVA table it’s 0.982400). The reason is that your variables are not perfectly uncorrelated.

The result of this is that there are sums of squares that could be attributed to more than one of the variables. The t-tests are equivalent to ‘type III’ tests of the sums of squares, but other tests are possible. The default ANOVA table uses ‘type I’ sums of squares, which can allow you to make more precise–and more powerful–tests of your hypotheses

Anova的目的:partition the sums of squares

##Further understanding of Type I/II/III tests of SS:
https://stats.stackexchange.com/questions/20452/how-to-interpret-type-i-type-ii-and-type-iii-anova-and-manova/20455#20455

Let’s imagine that there are just two factors A and B (and we’ll throw in the A*B interaction later to distinguish type II SS). Further, let’s imagine that the $n_{ij}$ 不完全一样。
Now your two factors are correlated with each other. The problem with your factors being correlated is that there are sums of squares that are associated with both A and B.

When computing an ANOVA (or any other linear regression), we want to partition the sums of squares. A partition puts all sums of squares into one and only one of several subsets. (For example, we might want to divide the SS up into A, B and error.) However, since your factors (still only A and B here) are not orthogonal there is no unique partition of these SS. In fact, there can be very many partitions, and if you are willing to slice your SS up into fractions (e.g., “I’ll put .5 into this bin and .5 into that one”), there are infinite partitions. A way to visualize this is to imagine the MasterCard symbol: The rectangle represents the total SS, and each of the circles represents the SS that are attributable to that factor, but notice the overlap between the circles in the center, those SS could be given to either circle.

The question is: How are we to choose the ‘right’ partition out of all of these possibilities? Let’s bring the interaction back in and discuss some possibilities:
Type I SS:
• SS(A)
• SS(B|A)
• SS(A*B|A,B)
Type II SS:
• SS(A|B)
• SS(B|A)
• SS(A*B|A,B)
Type III SS:
• SS(A|B,A*B)
• SS(B|A,A*B)
• SS(A*B|A,B)

Notice how these different possibilities work. Only type I SS actually uses those SS in the overlapping portion between the circles in the MasterCard symbol. That is, the SS that could be attributed to either A or B, are actually attributed to one of them when you use type I SS (specifically, the one you entered into the model first). In both of the other approaches, the overlapping SS are not used at all. Thus, type I SS gives to A all the SS attributable to A (including those that could also have been attributed elsewhere), then gives to B all of the remaining SS that are attributable to B, then gives to the AB interaction all of the remaining SS that are attributable to AB, and leaves the left-overs that couldn’t be attributed to anything to the error term.

Type III SS only gives A those SS that are uniquely attributable to A, likewise it only gives to B and the interaction those SS that are uniquely attributable to them. The error term only gets those SS that couldn’t be attributed to any of the factors. Thus, those ‘ambiguous’ SS that could be attributed to 2 or more possibilities are not used. If you sum the type III SS in an ANOVA table, you will notice that they do not equal the total SS. In other words, this analysis must be wrong, but errs in a kind of epistemically conservative way. Many statisticians find this approach egregious, however government funding agencies (I believe the FDA) requires their use.

The type II approach is intended to capture what might be worthwhile about the idea behind type III, but mitigate against its excesses. Specifically, it only adjusts the SS for A and B for each other, not the interaction. However, in practice type II SS is essentially never used. You would need to know about all of this and be savvy enough with your software to get these estimates, and the analysts who are typically think this is bunk.
There are more types of SS (I believe IV and V). They were suggested in the late 60’s to deal with certain situations, but it was later shown that they do not do what was thought. Thus, at this point they are just a historical footnote.

As for what questions these are answering, you basically have that right already in your question:
• Estimates using type I SS tell you how much of the variability in Y can be explained by A, how much of the residual variability can be explained by B, how much of the remaining residual variability can be explained by the interaction, and so on, in order.
• Estimates based on type III SS tell you how much of the residual variability in Y can be accounted for by A after having accounted for everything else, and how much of the residual variability in Y can be accounted for by B after having accounted for everything else as well, and so on. (Note that both go both first and last simultaneously; if this makes sense to you, and accurately reflects your research question, then use type III SS.)

2017.11.14
##What does Standardized Residuals mean?

Intuition: every residual has a different variance, so we need to consider standardized residuals.

When you compare the cells, the standardized residual makes it easy to see which cells are contributing the most to the chi-square value, and which are contributing the least. If your sample is large enough, the standardized residual can be roughly compared to a z-score. Standardization can work even if your variables are not normally distributed.
• If the residual is less than -2, the cell’s observed frequency is less than the expected frequency.
• Greater than 2 and the observed frequency is greater than the expected frequency. 之所以用2是因为95%双边正态
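在 R 里,chisq.test 的返回值里就带有这些 residual(下面的列联表是随便填的):

tab <- matrix(c(30, 10,
                20, 40), nrow = 2, byrow = TRUE)

res <- chisq.test(tab)
res$expected   # 期望频数
res$residuals  # Pearson residuals: (O - E) / sqrt(E)
res$stdres     # standardized residuals,大样本下可近似当 z-score,绝对值 > 2 值得注意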

2017.11.15

##One Sample T-test:

为了检验所给数据是否来自均值为 μ0 的总体。需要给出一个原假设 H0: μ = μ0

The test statistic is calculated as t = (x̄ − μ0) / (s/√n).
R code: t.test(a, mu=120)

2017.11.19
###为什么我们要对数据进行变换(比如log):
有时候我们希望残差能满足高斯马尔科夫假设。

2017.11.20
###Forward and Backward selection 的主要差别:

Forward selection has drawbacks, including the fact that each addition of a new feature may render one or more of the already included features non-significant (p-value > 0.05). An alternate approach which avoids this is backward selection. The backward method is generally the preferred method, because the forward method produces so-called suppressor effects. These suppressor effects occur when predictors are only significant when another predictor is held constant.

2017.11.25

Added Variable Plot (Partial Regression Plot)

https://stats.stackexchange.com/questions/125561/what-does-an-added-variable-plot-partial-regression-plot-explain-in-a-multiple

可以告诉我们如果变量之间存在线性相关的话,不能在回归模型中单独对Y和某一个X画图。可能会有Omitted-Variable Bias.

主要原理:纵坐标画Y对除了我们关心的Xi之外其他的X拟合出来的残差,横坐标画我们所关心的Xi对其他所有X拟合出来的残差。R中car包里的avPlots可以实现这个功能。

For illustration I will take a less complex regression model Y=β1+β2X2+β3X3+ϵ where the predictor variables X2 and X3 may be correlated. Let’s say the slopes β2 and β3 are both positive so we can say that (i) Y increases as X2 increases, if X3 is held constant, since β2 is positive; (ii) Y increases as X3 increases, if X2 is held constant, since β3 is positive.

Note that it’s important to interpret multiple regression coefficients by considering what happens when the other variables are held constant (“ceteris paribus”). Suppose I just regressed Y against X2 with a model Y=β′1+β′2X2+ϵ′. My estimate for the slope coefficient β′2, which measures the effect on Y of a one unit increase in X2 without holding X3 constant, may be different from my estimate of β2 from the multiple regression - that also measures the effect on Y of a one unit increase in X2, but it does hold X3 constant. The problem with my estimate β′2^ is that it suffers from omitted-variable bias if X2 and X3 are correlated.

To understand why, imagine X2 and X3 are negatively correlated. Now when I increase X2 by one unit, I know the mean value of Y should increase since β2>0. But as X2 increases, if we don’t hold X3 constant then X3 tends to decrease, and since β3>0 this will tend to reduce the mean value of Y. So the overall effect of a one unit increase in X2 will appear lower if I allow X3 to vary also, hence β2′<β2. Things get worse the more strongly X2 and X3 are correlated, and the larger the effect of X3 through β3 - in a really severe case we may even find β2′<0 even though we know that, ceteris paribus, X2 has a positive influence on Y!

A lot of the value of an added variable plot comes at the regression diagnostic stage, especially since the residuals in the added variable plot are precisely the residuals from the original multiple regression. This means outliers and heteroskedasticity can be identified in a similar way to when looking at the plot of a simple rather than multiple regression model. Influential points can also be seen - this is useful in multiple regression since some influential points are not obvious in the original data before you take the other variables into account. In my example, a moderately large X2 value may not look out of place in the table of data, but if the X3 value is large as well despite X2 and X3 being negatively correlated then the combination is rare. “Accounting for other predictors”, that X2 value is unusually large and will stick out more prominently on your added variable plot.

∗∗ More technically they would be the residuals from running two other multiple regressions: the residuals from regressing Y against all predictors other than X2 go on the vertical axis, while the residuals from regression X2 against all other predictors go on the horizontal axis. This is really what the legends of “Y given others” and “X2 given others” are telling you. Since the mean residual from both of these regressions is zero, the mean point of (X2 given others, Y given others) will just be (0, 0) which explains why the regression line in the added variable plot always goes through the origin. But I often find that mentioning the axes are just residuals from other regressions confuses people (unsurprising perhaps since we now are talking about four different regressions!) so I have tried not to dwell on the matter. Comprehend them as “X2 given others” and “Y given others” and you should be fine.
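上面提到 R 中 car 包的 avPlots,一个小示例(数据用 mtcars,只是拿来演示),顺手手动验证一下「残差对残差」的构造,以及 added variable plot 的斜率等于多元回归里对应的系数:

library(car)

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
avPlots(fit)  # 每个变量一张 added-variable plot

# 手动构造 wt 的 added-variable plot:两组残差
e_y <- resid(lm(mpg ~ hp + disp, data = mtcars))  # Y given others
e_x <- resid(lm(wt  ~ hp + disp, data = mtcars))  # wt given others
plot(e_x, e_y)
coef(lm(e_y ~ e_x))[2]  # 这个斜率
coef(fit)["wt"]         # 等于多元回归中 wt 的系数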

2017.12.3

Imbalanced Classification Problems

https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

Definition: An imbalanced classification problem is one in which the dependent variable has imbalanced proportion of classes. In other words, a data set that exhibits an unequal distribution between its classes is considered to be imbalanced.

Reason: Below are the reasons which leads to reduction in accuracy of ML algorithms on imbalanced data sets:
1. ML algorithms struggle with accuracy because of the unequal distribution in dependent variable.
2. This causes the performance of existing classifiers to get biased towards majority class.
3. The algorithms are accuracy driven i.e. they aim to minimize the overall error to which the minority class contributes very little.
4. ML algorithms assume that the data set has balanced class distributions.
5. They also assume that errors obtained from different classes have same cost (explained below in detail).

Solution:Below are the methods used to treat imbalanced datasets:
1. Undersampling
2. Oversampling
3. Synthetic Data Generation
4. Cost Sensitive Learning
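前两种方法可以用 sample() 很朴素地实现,下面是一个小草稿(数据和变量名都是随手设的;Synthetic Data Generation 常见的是 SMOTE 之类,这里不展开):

set.seed(1)
# 模拟一个不平衡的数据集:大约 5% 的正类
df <- data.frame(y = rbinom(1000, 1, 0.05), x = rnorm(1000))

pos <- df[df$y == 1, ]
neg <- df[df$y == 0, ]

# Undersampling:把多数类抽到和少数类一样多
under <- rbind(pos, neg[sample(nrow(neg), nrow(pos)), ])

# Oversampling:把少数类有放回地抽到和多数类一样多
over <- rbind(neg, pos[sample(nrow(pos), nrow(neg), replace = TRUE), ])

table(under$y)
table(over$y)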

2017.12.5

Advantages about Mid P-Value

①The mid P is less conservative (that is more powerful)
②For larger samples the P value obtained from a χ² test with Yates’ correction will correspond to the conventional approach, and the P value from the uncorrected test will correspond to the mid P value.——应该是说小样本因为离散程度更高,所以需要mid p-value.大样本一般不用mid p-value

Fisher Exact Test

http://www2.fiu.edu/~howellip/Fisher.pdf
让我重新了解了p值的概念。
以下讨论的 p 值都建立在原假设 H0: π = π0 之上。
我们往往定义 p 值的时候,曲线都是随着 x 的变化,中间高两边低。所以很自然地,就会误以为,p 值是对 x 等于或者大于(小于)某一特定值处的 y 进行求和。殊不知,其实 p 值关心的更多是 y!如果给出一个中间低两边高的分布,那么
①单边p值:算出一个若 H0 成立时候的均值。若观测值大于均值,那么 P 值就是大于等于观测值的概率。如果观测值小于均值,那么 P 值就是小于等于观测值的概率。
②双边p值:对所有出现概率小于等于观测值出现概率的情况,进行求和。

2017.12.7

各种Independence之间的关系

https://onlinecourses.science.psu.edu/stat504/node/108
1. Mutual independence – all variables are independent from each other, denoted (A, B, C) or A⊥⊥B⊥⊥C.
2. Joint independence – two variables are jointly independent of the third, denoted (AB, C) or AB⊥⊥C.
3. Marginal independence – two variables are independent while ignoring the third, e.g., θAB = 1, denoted (A, B).
4. Conditional independence – two variables are independent given the third, e.g., θAB(C=k) = 1 for all k = 1, 2, …, K, denoted (AC, BC) or A⊥⊥B|C.
5. Homogeneous associations – conditional (partial) odds-ratios don’t depend on the value of the third variable, denoted (AB, AC, BC).
Before we look at the details, here is a summary of the relationships among these models:
▪ Mutual independence implies joint independence, i.e., all variables are independent of each other.
▪ Joint independence implies marginal independence, i.e., one variable is independent of the other two.
▪ Marginal independence does NOT imply joint independence.
▪ Marginal independence does NOT imply conditional independence.
▪ Conditional independence does NOT imply marginal independence.

2017.12.11
在回归分析图中,经常有一道红线。红线是x-y 坐标关系的非参数拟合,并不是均值。

2018.1.23

logistic回归的两种报错原因:

http://blog.csdn.net/csqazwsxedc/article/details/52033506

2018.1.27

如何确定Experimental Unit:

http://www.isogenic.info/html/6__experimental_unit.html#more

帮助区别了一个动物/一群动物,甚至有时候还可以在一个实验中存在多种EU

2018.1.29
今天和Bean聊天:
①iid个体中每一个数据点都提供了和前一个一样多的信息。而如果有相关性,比如说ρ=0.99. 那么基本上第一个数据点就代表了全部信息。后面的数据点与第一个相比都没有提供足够多的信息。
②当我们说到充分统计量的时候,老师说每一个x向量的Sample-Size都是n。如果我们有多个数据比如说一个做了10次,一个做了20次。如果是iid的话,我们应该把他们放到一起变成30个。这样减少方差,没必要分成两个来考虑。

2018.1.30
Fangfang:在2-sample-t-test中,两个分布可以不用都是Normal。只要他们都是对称的,或者skewed的程度差不多就可。

新版

  • Gamma 函数不是单调递增的
  • 检验同方差:Bartlett’s test(采用均值和MSE。不是很robust, 最好没有outlier,很需要正态假设), the Levene’s test(采用中位数和MAE,更robust)
  • One-way-anova中计算test-power的方式还可以用OC Curves(operating characteristic):$\Phi^2=\frac{n\delta^2}{2a\sigma^2}$,其中 $\delta$ 指的是组间最大差值:$\max(t_i)-\min(t_i)$
  • 在计算线性回归的方差时,用 残差平方和/n 而不是除以 n−1,因为原假设下残差均值为 0

为什么在公式 $y=\beta_0+\beta_1x$ 中,$\hat{\beta_0} \ne \bar{y}$?
答:因为首先我们知道拟合的直线肯定会经过 $(\bar{x},\bar{y})$,即 $\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$,而不一定有 $\bar{x} = 0$。除非列正交,导致 $\mathbb{1}_n^{\top}x = 0$。

在所有缺失 x 但是不缺 y 的数据中,将 x 填成 $\bar{x}$,只会改变拟合直线的截距而不改变斜率。
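可以用一个小模拟验证这句话(数据随手造的):填完 $\bar{x}$ 之后,斜率估计和只用 complete cases 完全一样,截距一般会变(取决于缺 x 那部分的 y 均值)。

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x_mis <- x
x_mis[sample(n, 30)] <- NA  # 30 个 x 缺失,但 y 都在

cc  <- lm(y ~ x_mis)        # complete cases(lm 默认丢掉含 NA 的行)
imp <- lm(y ~ ifelse(is.na(x_mis), mean(x_mis, na.rm = TRUE), x_mis))

coef(cc)
coef(imp)  # 斜率与 complete-case 拟合完全一致,截距一般不同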

  • one sample t-test 分母为什么是 $\sqrt{n}$?因为 $\bar{X}$ 的方差是 $\sigma^2 / n$

  • Boxplot上下的线应该是1.5IQR,但是会停在1.5IQR以内最大值/最小值(所以有时候那条线可能会比1.5IQR短)

  • E(X1 + X2) = E(X1) + E(X2) 即使X1, X2不是独立的。因为在double integral的时候 $\int X_1 f(X_1, X_2)\, dX_1 dX_2$ 会把X2积掉,从而只剩下X1的边缘分布。

对于二维数据,什么时候PCA跟Linear-Regression重合?
因为两条线都过X_mean, Y_mean,所以只需要斜率

  • PCA1的斜率:loading_y / loading_x
  • lm的斜率:cov(X,Y) / var(X)
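一个二维的小示例(数据随手造的),分别算出 PCA 第一主成分方向的斜率和 lm 的斜率,可以看到两者一般并不相等(PC1 的斜率通常比 OLS 的更陡一些):

set.seed(1)
x <- rnorm(200)
y <- 1 + 0.8 * x + rnorm(200, sd = 0.3)

# PCA 第一主成分的斜率:loading_y / loading_x
pc <- prcomp(cbind(x, y))   # prcomp 默认已做中心化
v1 <- pc$rotation[, 1]
v1["y"] / v1["x"]

# lm 的斜率:cov(X, Y) / var(X)
cov(x, y) / var(x)
coef(lm(y ~ x))[2]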

PCA只是Maximize Variance,如果我们想知道哪个方向的skewness最大,我们可以分解更高阶矩

MDS(Multidimensional Scaling)就像是把一个橘子皮撕开,张开在2维平面,而PCA不动橘子,只是改变我们观察的角度。

实对称阵【需要对称】不同特征值的特征向量正交。

Loh’s Bootstrap Project

一个错误:对一个样本我想获得很多SD版本的CI,比如说我做了500次Bootstrap,我得到了一个SD,我用这个SD去套用了这500次的数据。暗含的意思是我觉得这500次数据的SD都是这个SD。但其实不是,因为SD取决于样本,而样本已经变了。正确的方法应该是,对于每一个bootstrap样本,都进行nested bootstrap来获取SD
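按照正确的做法,一个 nested (double) bootstrap 的最小草稿(以样本均值为例,B 和 B_inner 的取值只是示意):

set.seed(1)
x <- rexp(50)          # 原始样本
B <- 200; B_inner <- 100

boot_est <- numeric(B)
boot_sd  <- numeric(B)
for (b in 1:B) {
  xb <- sample(x, replace = TRUE)   # 外层 bootstrap 样本
  boot_est[b] <- mean(xb)
  # 每个外层样本各自再做一层 bootstrap,估计「这个样本自己的」SD
  inner <- replicate(B_inner, mean(sample(xb, replace = TRUE)))
  boot_sd[b] <- sd(inner)
}

# 每个 bootstrap 样本都配上自己的 SD,而不是所有样本共用同一个 SD
head(cbind(boot_est, boot_sd))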

一个错误:计算Coverage的时候,比如我有4000个样本,获得了4000个不同的calibrated-alpha。我算了一下这4000个alpha的均值,然后再取新的4000个样本,用这个均值alpha来计算coverage。老师说不应当,应该直接用这4000个样本配合上对应的4000个不同的alpha直接计算。

我们不能通过R-output有几个变量带星就说明他们significant,要用 Bonferroni correction…

