% Introduction to R and Biostatistics
% Leonardo Collado Torres
% Nov 9, 2012
Hopefully
Ideally
LCG's main focus is in developing researchers.
Richard Royall, Statistical Evidence, 1997
Science looks to statistics for help in interpreting data.
Statistics is supposed to provide objective methods for representic scientific data as evidence and for measuring the strength of that evidence.
Statistics serves science in other ways as well, contributing to both efficiency and objectivity through theories for the design of experiments and decision-making, for example.
But its most important task is to provide objective quantitative alternatives to personal judgement for integrating the evidence produced by experiments and observational studies.
There is a disease that you are interested in and you have a single diagnostic test.
Lets say that you have two incompatible cases:
Bayes theorem
Let \( A, B \) be sets contained in the sample space. Then:
\[ P(A | B) = \frac{P(A \cap B)}{P(B)} \]
if \( P(B) > 0 \).
Note that if \( P(A) > 0 \) then
\[ P(B | A) = \frac{P(B \cap A)}{P(A)} \]
and thus \( P(B | A) P(A) = P(B \cap A) \)
But \( P(A \cap B) = P(B \cap A) \) so
\[ P(A |B ) = \frac{ P(B | A) P(A)}{P(B)} \]
Multiplication rule
Let \( A, B \) be sets contained in the sample space. Then:
\[ P(B) = P(B|A)P(A) + P(B|A^c)P(A^c) \]
Using Bayes theorem and the multiplication rule, we have:
\[ P(D|T^+) = \frac{ P(T^+ | D) P(D)}{ P(T^+ | D) P(D) + P(T^+ | D^c) P(D^c)} \]
As \( P(D^c) = 1 - P(D) \) and substituting the values we assumed earlier, we get that
\[ P(D|T^+) = \frac{0.95 P(D)}{0.95 P(D)+ 0.02 (1 - P(D))} \]
Therefore, our answer to question 1 depends on \( P(D) \), which we call the prior probability.
So you can think of \( P(D) \) as the combination of all prior knowledge and your beliefs, and \( P(D|T^+) \) as your new (updated) beliefs after observing the data. In this case, after seeing Juan get a positive test.
We need to know the possible actions that we can take and their risks (consequences).
For example, if the treatment is harmless, inexpensive and not treating a sick person leads to terrible consequences, then we will go for it and treat Juan.
So we need other information.
The idea behind determining the evidence and it's measure of strength is so that you may use it along with other information you may have to do what you think is best.
As you can see, there are three main divisions of statistical thinking.
But you don't have to choose a side. Use what is out there if you are clear about what you are asking and the methods you use are compatible.
The wrong thing to do is to claim something when what you used doesn't answer that question!
Seber and Lee, Linear Regression Analysis, 2003
Seber and Lee, Linear Regression Analysis, 2003
Faraway, Linear Models with R, 2005
Once we have a statistical model we can
estimate parameters and give a measure of uncertainty.
As one of my professors said, 80% of statistics is linear models.
That is a big question that we will ask ourselves very frequently.
In other words, are we observing something unexpected?
This is the problem of hypothesis testing with the null hypothesis being chance.
To be objective
To estimate a parameter of interest and give a measure of uncertainty
To explain relationships between variables
To test your hypothesis
Biostatistics is statistics applied to biological problems.
No matter what you do in science, you will need statistics at some point or another.
I highly recommend it as your first R user interface.
It's very friendly: menus, all the windows you want, highlighting, …
Once you learn to use RStudio you might want to try out other options which generally involve quite a bit of customization. Or you might be happy ever after with RStudio!
You will notice some R code in the next slides followed by it's output.
If you want to get the same results as me, just copy and paste them into R
Note that
# This symbol is used as a comment sign.
# Anything after will not be considered R code.
x <- 1
y <- c(1, 2, 3)
x + y
## [1] 2 3 4
z <- 1:3
y - z
## [1] 0 0 0
1:20
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
## [13] 13 14 15 16 17 18 19 20
Take Royall's example from the beginning. We know that:
\[ P(D|T^+) = \frac{0.95 P(D)}{0.95 P(D)+ 0.02 (1 - P(D))} \]
So, what is the \( P(D|T^+) \) if \( P(D) = 0.001 \)?
Well, lets use R as a calculator!
0.95 * 0.001/(0.95 * 0.001 + 0.02 * (1 -
0.001))
## [1] 0.04539
We want more. We want to check if \( P(D|T^+) \) is greater when \( P(D) = 0.001, 0.20, \) or \( 0.5 \).
To simplify things, we can write a function and check that it works:
pr.d.given.t <- function(pr.d) {
0.95 * pr.d/(0.95 * pr.d + 0.02 * (1 -
pr.d))
}
pr.d.given.t(0.001)
## [1] 0.04539
##
Now we can find the maximum one sequentially
pr.d.given.t(0.001) < pr.d.given.t(0.2)
## [1] TRUE
pr.d.given.t(0.001) < pr.d.given.t(0.5)
## [1] TRUE
pr.d.given.t(0.2) < pr.d.given.t(0.5)
## [1] TRUE
pr.d.given.t(0.5)
## [1] 0.9794
Not enough!
Lets look at \( P(D|T^+) \) for all possible values of \( P(D) \)
##
pr.d <- seq(0.001, 0.999, by = 0.001)
plot(pr.d, pr.d.given.t(pr.d), type = "l",
col = "blue")
apropos("norm")
## [1] "dlnorm" "dnorm"
## [3] "halfnorm" "norm"
## [5] "normalizePath" "plnorm"
## [7] "pnorm" "qlnorm"
## [9] "qnorm" "qqnorm"
## [11] "qqnorm.default" "qqnorml"
## [13] "rlnorm" "rnorm"
`?`(pnorm)
args(rnorm)
## function (n, mean = 0, sd = 1)
## NULL
Let \( X \) be the random variable that has a Normal distribution with mean \( \mu \) and variance \( \sigma^2 \). Then it's probability density function (pdf) is given by the following equation:
\[ f_X(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2 \sigma^2} \right) \]
If \( Z \sim N(0, 1) \) then it's density is:
\[ f_Z(z) = \frac{1}{\sqrt{2\pi}} \exp \left(- \frac{z^2}{2} \right) \]
Very famous and highly useful because of the Central Limit Theorem (check wiki and this useful applet).
x <- seq(-5, 5, by = 0.01)
args(dnorm)
## function (x, mean = 0, sd = 1, log = FALSE)
## NULL
y <- dnorm(x, 0, 1)
##
plot(x, y, type = "l", col = "forest green")
That was neat, because we know the parameters which describe the population.
But in reality, we only observe a sample. And from the sample, we get statistics such as the sample mean and sample variance.
First, lets generate two random samples.
args(rnorm)
## function (n, mean = 0, sd = 1)
## NULL
set.seed(101)
x <- rnorm(100)
set.seed(102)
y <- rnorm(100, 2)
Exploratory Data Analysis is the first step when working backwards.
It relies heavily on plots
plot(x, y)
boxplot(x, y, col = "light blue")
hist(x, freq = FALSE, col = "skyblue")
With estimated density line
hist(x, freq = FALSE, col = "skyblue")
lines(density(x), col = "blue")
Overlaying two histograms
# Create histogram and density line for
# sample 1 (x)
hist(x, xlim = range(c(x, y)) * 1.2, ylim = c(0,
0.55), main = "Density plots of two samples",
xlab = "", density = 2, freq = FALSE,
border = "skyblue", col = "skyblue")
lines(density(x), col = "blue")
# Add histogram and density line for
# sample 2 (y)
hist(y, xlim = range(c(x, y)), ylim = c(0,
0.55), main = "", density = 2, freq = FALSE,
border = "pink", col = "pink", add = TRUE)
lines(density(y), col = "red")
##
# Add legend
legend("top", legend = c("x", "y"),
col=c("blue", "red"), lty=1, bty="n",
density = 20, border = c("skyblue", "pink"),
fill = c("skyblue", "pink"), cex=0.8)
# Example from my code a 140.776 exercise
##
qqnorm(x)
qqline(x, col = "red")
One way to promote reproducibility is using self-contained documents.
Meaning that they have code and words explaining the results all in one single place.
An example? Look at intro_R_Biostat_LCG_2012.Rmd
Normally, we use it to create PDF reports for ourselves and collaborators.
Well, it normally involves learning \LaTeX and the Sweave
.
These files can look very complicated and take quite a bit of practice before you can get comfortable using them.
You can learn how to use R with RStudio from the beginning and easily export your homeworks either in HTML or PDF format using knitr
Take a look at the same material in a webpage format here.
It's always good to start with the best practices, specially if they are very simple
If you have questions, feel free to contact me!
##
sessionInfo()
## R version 2.15.1 (2012-06-22)
## Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
##
## locale:
## [1] C/en_US.UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats graphics grDevices
## [4] utils datasets methods
## [7] base
##
## other attached packages:
## [1] faraway_1.0.5 knitr_0.8
##
## loaded via a namespace (and not attached):
## [1] digest_0.5.2 evaluate_0.4.2
## [3] formatR_0.6 plyr_1.7.1
## [5] stringr_0.6.1 tools_2.15.1