install.packages("tidyverse")
This lecture, as the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.
Welcome! I am very excited to have you in our one-term (i.e. half a semester) course on Statistical Computing course number (140.776) offered by the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health.
Iβm excited to be back this year to teach β140.776 Statistical Computingβ at @jhubiostat @JohnsHopkinsSPH π
This year, I decided to start with some music πΆ. I like part of the lyrics of this song, which talks about overcoming challenges, [β¦]https://t.co/H5Uq6QhH4D
1/3 𧡠pic.twitter.com/pkRjefcokcβ π²π½ Leonardo Collado-Torres (@lcolladotor) August 27, 2024
This course is designed for ScM and PhD students at Johns Hopkins Bloomberg School of Public Health. I am pretty flexible about permitting outside students, but I want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works.
The primary goal of the course is to teach you practical programming and computational skills required for the research and application of statistical methods.
This class is not designed to teach the theoretical aspects of statistical or computational methods, but rather the goal is to help with the practical issues related to setting up a statistical computing environment for data analyses, developing high-quality R packages, conducting reproducible data analyses, best practices for data visualization and writing code, and creating websites for personal or project use.
Assumptions and pre-requisites
The course is designed for students in the Johns Hopkins Biostatistics Masters and PhD programs. However, we do not assume a significant background in statistics. Specifically we assume:
1. You know the basics of at least one programming language (e.g. R or Python)
- If itβs not R, we assume that you are willing to spend the time to learn R
- You have heard of things such as control structures, functions, loops, etc
- Know the difference between different data types (e.g. character, numeric, etc)
- Know the basics of plotting (e.g. what is a scatterplot, histogram, etc)
2. You know the basics of computing environments
- You have access to a computing environment (i.e. locally on a laptop or working in the cloud)
- You generally feel comfortable with installing and working with software
3. You know the basics of statistics
- The central dogma (estimates, standard errors, basic distributions, etc.)
- Key statistical terms and methods
- Differences between estimation vs testing vs prediction
- Know how to fit and interpret basic statistical models (e.g. linear models)
4. You know the basics of reproducible research
- Difference between replication and reproducible. If not, check this excellent paper by Prasad Patil et al.: https://doi.org/10.1038/s41562-019-0629-z.
- Know how to cite references (e.g. like in a publication)
- Somewhat familiar with tools that enable reproducible research (In complete transparency, we will briefly cover these topics in the first week, but depending on your comfort level with them, this may impact whether you choose to continue with the course).
Since the target audience for this course is advanced students in statistics we will not be able to spend significant time covering these concepts and technologies. To give you some idea about how these prerequisites will impact your experience in the course, we will be using R for nearly all classes, we will be turning in all assignments via R Markdown documents and you will be encouraged (not required) to use git/GitHub to track changes to your code over time. The majority of the assignments will involve learning the practical issues around performing data analyses, building software packages, building websites, etc all using the R programming language. Data analyses you will perform will also often involve significant data extraction, cleaning, and transformation. We will learn about tools to do all of this, but hopefully most of this sounds familiar to you so you can focus on the concepts we will be teaching around best practices for statistical computing.
Some resources that may be useful if you feel you may be missing pieces of this background:
- Statistics - Mathematical Biostatistics Bootcamp I (Coursera); Mathematical Biostatistics Bootcamp II (Coursera)
- Basic Data Science - Cloud Data Science (Leanpub); Data Science Specialization (Coursera)
- Version Control - Github Learning Lab; Happy Git and Github for the useR
- Rmarkdown - Rmarkdown introduction
- R 101 LIBD rstats club blog post: https://research.libd.org/rstatsclub/2018/12/24/r_101/
- Introductory R videos from the LIBD rstats club such as these videos:
Getting set up
You must install R and RStudio (available from Posit) on your computing environment in order to complete this course.
These are two different applications that must be installed separately before they can be used together:
R
is the core underlying programming language and computing engine that we will be learning in this courseRStudio
is an interface into R that makes many aspects of using and programming R simpler
Both R
and RStudio
are available for Windows, macOS, and most flavors of Unix and Linux. Please download the version that is suitable for your computing setup.
Throughout the course, we will make use of numerous R add-on packages that must be installed over the Internet. Packages can be installed using the install.packages()
function in R. For example, to install the tidyverse
package, you can run
in the R console.
How to Download R for Windows
Go to https://cran.r-project.org and
Click the link to βDownload R for Windowsβ
Click on βbaseβ
Click on βDownload R 4.4.1 for Windowsβ
The version in the video is not the latest version of R. Please download the latest version.
How to Download R for the Mac
Goto https://cran.r-project.org and
Click the link to βDownload R for (Mac) OS Xβ.
Click on βR-4.4.1.pkgβ
The version in the video is not the latest version of R. Please download the latest version.
How to Download RStudio
Goto https://rstudio.com and
Click on βProductsβ in the top menu
Then click on βRStudioβ in the drop down menu
Click on βRStudio Desktopβ
Click the button that says βDOWNLOAD RSTUDIO DESKTOPβ
Click the button under βRStudio Desktopβ Free
Under the section βAll Installersβ choose the file that is appropriate for your operating system.
NOTE: The video shows how to download RStudio for the Mac but you should download RStudio for whatever computing setup you have
How to Download Git
Install Git
on your computer following the detailed instructions at https://happygitwithr.com/install-git, which will depend on your operating system.
Create your GitHub account
Create a personal GitHub account following the instructions at https://docs.github.com/en/get-started/signing-up-for-github/signing-up-for-a-new-github-account.
Following Jeff Leekβs advice on How to be a modern scientist that was part of my teamβs bootcamps at https://lcolladotor.github.io/bioc_team_ds/how-to-be-a-modern-scientist.html, try to choose a username that will be the same one you use on your email and other work-related social media platforms such as Twitter, Bluesky, etc.
Learning Objectives
The goal is by the end of the class, students will be able to:
- Install and configure software necessary for a statistical programming environment
- Discuss generic programming language concepts as they are implemented in a high-level statistical language
- Write and debug code in base R and the tidyverse (and integrate code from Python modules)
- Build basic data visualizations using R and the tidyverse
- Discuss best practices for coding and reproducible research, basics of data ethics, basics of working with special data types, and basics of storing data
Course logistics
The only option is to take the course is in-person (140.776.01).
All communication for the course is going to take place on one of three platforms:
Courseplus: for discussion, sharing resources, collaborating, and announcements
Github: for getting access to course materials (e.g. lectures, project assignments)
- Course Github: https://github.com/lcolladotor/jhustatcomputing
Lectures: lectures will be in person
- All in person lectures to be recorded and posted online after class ends
- If for some reason I am sick or not capable of coming onsite, I will send out a Zoom link for everyone to attend remotely for that day.
The primary communication for the class will go through Courseplus. That is where we will post course announcements, answer common questions, and as the primary means of communication between course participants and course instructors.
If you are registered for the course, you should have access to Courseplus now. Once you have access you will also be able to find all material and dates/times of drop-in hours. Any Zoom links will be posted on Courseplus.
Course Staff
The course instructor this year is Leonardo Collado Torres who is an Investigator at the Lieber Institute for Brain Development and an Assistant Professor in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. I was actually a student of this course in Fall 2012 when Roger D. Peng was the instructor of this course and Jenna Krall was the only teaching assistant. I first taught this course in 2023. I myself started learning and teaching about R in 2008 and we have a lot of material in the LIBD rstats club that you might be interested in.
At the Lieber Institute for Brain Development (LIBD), my group works on understanding the roots and signatures of disease (particularly psychiatric disorders) by zooming in across dimensions of gene activity. We achieve this by studying gene expression at all expression feature levels (genes, exons, exon-exon junctions, and un-annotated regions) and by using different gene expression measurement technologies (bulk RNA-seq, single cell/nucleus RNA-seq, and spatial transcriptomics) that provide finer biological resolution and localization of gene expression. We work closely with collaborators from LIBD as well as from Johns Hopkins University (JHU) and other institutions, which reflects the cross-disciplinary approach and diversity in expertise needed to further advance our understanding of high throughput biology.
Every day I use R and Bioconductor, and on some days I write R packages. Occasionally I write blog posts about them and other tools. Iβm a co-founder of the LIBD rstats club and the CDSB community of R and Bioconductor developers in Mexico and Latin America, that we described at the R Consortium website. In the past, I also served on the Bioconductor Community Advisory Board.
If you want, you can find me on Twitter or Bluesky.
Teaching Assistants
We also have three amazing TAs this year:
- Jiaxin (Jessi) Huang (jhuan206@jh.edu): I am a second-year master student in the Biostatistics program, specializing in longitudinal data analysis. My current research focuses on exploring and identifying factors associated with long asymptomatic malaria infections. In my free time, I enjoy traveling, cooking, outdoor activities and watching shows.
- Phyllis Wei (ywei43@jhu.edu). She is a fourth year Ph.D. student in Biostatistics. She develops methods to help understand the genetic basis of complex traits and diseases and enhance disease risk prediction models through data integration. Outside of biostatistics, she enjoys hiking, baking, and visiting museums.
- Yu Lu (ylu136@jh.edu): I am a third-year Ph.D. student in Biostatistics. My work is focused on developing functional data analysis methods inspired by physical activity data. In my free time, I enjoy the outdoor but also cuddling with my cat, Finland. (Didnβt manage to make him wear a harness and take a walk outside yet)
Assignment Due Dates
All course assignment due dates appear on the Schedule and Syllabus.
Grading
Philosophy
We believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. We are super excited to talk to you about ideas, work out solutions with you, and help you to figure out how to produce professional data analyses. We do not think that graduate school grades are important for this purpose. This means that we do not care very much about graduate student grades.
That being said, we have to give you a grade so they will be:
- A - Excellent - 90%+
- B - Passing - 80%+
- C - Needs improvement - 70%+
We rarely give out grades below a C and if you consistently submit work, and do your best you are very likely to get an A or a B in the course.
When I was a JHBSPH student in 2011-2016, I had a scholarship from my country π²π½ which had specific grade requirements, so I recognize that while most employers wonβt care about your grades over your grad school projects, you might have strong reasons for aiming for high grades.
Relative weights
The grades are based on three projects (plus one entirely optional project to help you get set up). The breakdown of grading will be
- 33% for Project 1
- 33% for Project 2
- 34% for Project 3
If you submit an project solution, it is your own work, and it meets a basic level of completeness and effort you will get 100% for that project. If you submit a project solution, but it doesnβt meet basic completeness and effort you will receive 50%. If you do not submit an solution you will receive 0%.
Submitting assignments
Please write up your project solutions using R Markdown. In some cases, you will compile a R Markdown file into an HTML file and submit your HTML file to the dropbox on Courseplus. In other cases, you may create an R package or website. In all of the above, when applicable, show all your code and provide as much explanation / documentation as you can.
For each project, we will provide a time when we download the materials. We will assume whatever version we download at that time is what you are turning in.
Reproducibility
We will talk about reproducibility a bit during class, and it will be a part of the homework assignments as well. Reproducibility of scientific code is very challenging, so the faculty and TAs completely understand difficulties that arise. But we think that it is important that you practice reproducible research. In particular, your project assignments should perform the tasks that you are asked to do and create the figures and tables you are asked to make as a part of the compilation of your document. We will have some pointers for some issues that have come up as we announce the projects.
Code of Conduct
We are committed to providing a welcoming, inclusive, and harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), political beliefs/leanings, or technology choices. We do not tolerate harassment of course participants in any form. Sexual language and imagery is not appropriate for any work event, including group meetings, conferences, talks, parties, Twitter, and other online media. This code of conduct applies to all course participants, including instructors and TAs, and applies to all modes of interaction, both in-person and online, including GitHub project repos, Slack channels, and Twitter.
I was also part of the Bioconductor Code of Conduct committee for a few years. You might find the Bioconductor Code of Conduct useful as it is translated into different languages by native speakers of said languages: https://bioconductor.github.io/bioc_coc_multilingual/.
Course participants violating these rules will be referred to leadership of the Department of Biostatistics and the Title IX coordinator at JHU and may face expulsion from the class.
All class participants agree to:
- Be considerate in speech and actions, and actively seek to acknowledge and respect the boundaries of other members.
- Be respectful. Disagreements happen, but do not require poor behavior or poor manners. Frustration is inevitable, but it should never turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. Course participants should be respectful both of the other course participants and those outside the course.
- Refrain from demeaning, discriminatory, or harassing behavior and speech. Harassment includes, but is not limited to: deliberate intimidation; stalking; unwanted photography or recording; sustained or willful disruption of talks or other events; inappropriate physical contact; use of sexual or discriminatory imagery, comments, or jokes; and unwelcome sexual attention. If you feel that someone has harassed you or otherwise treated you inappropriately, please alert Leonardo Collado Torres.
- Take care of each other. Refrain from advocating for, or encouraging, any of the above behavior. And, if someone asks you to stop, then stop. Alert Leonardo Collado Torres if you notice a dangerous situation, someone in distress, or violations of this code of conduct, even if they seem inconsequential.
Need Help?
Please speak with Leonardo Collado Torres or one of the TAs. You can also reach out to Elizabeth Stuart, chair of the department of Biostatistics or Margaret Taub, Ombudsman for the Department of Biostatistics.
You may also reach out to any Hopkins resource for sexual harassment, discrimination, or misconduct:
- JHU Sexual Assault Helpline, 410-516-7333 (confidential)
- University Sexual Assault Response and Prevention website
- Johns Hopkins Compliance Hotline, 844-SPEAK2US (844-733-2528)
- Hopkins Policies Online
- JHU Office of Institutional Equity 410-516-8075 (nonconfidential)
- Johns Hopkins Student Assistance Program (JHSAP), 443-287-7000
- University Health Services, 410-955-1892
- The Faculty and Staff Assistance Program (FASAP), 443-997-7000
Feedback
We welcome feedback on this Code of Conduct.
License and attribution
This Code of Conduct is distributed under a Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Portions of above text comprised of language from the Codes of Conduct adopted by rOpenSci and Django, which are licensed by CC BY-SA 4.0 and CC BY 3.0. This work was further inspired by Ada Initiativeβs ββhow to design a code of conduct for your communityββ and Geek Feminismβs Code of conduct evaluations and expanded by Ashley Johnson and Shannon Ellis in the Jeff Leek group.
Academic Ethics
Students enrolled in the Bloomberg School of Public Health of The Johns Hopkins University assume an obligation to conduct themselves in a manner appropriate to the Universityβs mission as an institution of higher education. A student is obligated to refrain from acts which he or she knows, or under the circumstances has reason to know, impair the academic integrity of the University. Violations of academic integrity include, but are not limited to: cheating; plagiarism; knowingly furnishing false information to any agent of the University for inclusion in the academic record; violation of the rights and welfare of animal or human subjects in research; and misconduct as a member of either School or University committees or recognized groups or organizations.
Students should be familiar with the policies and procedures specified under Policy and Procedure Manual Student-01 (Academic Ethics), available on the schoolβs portal.
The faculty, staff and students of the Bloomberg School of Public Health and the Johns Hopkins University have the shared responsibility to conduct themselves in a manner that upholds the law and respects the rights of others. Students enrolled in the School are subject to the Student Conduct Code (detailed in Policy and Procedure Manual Student-06) and assume an obligation to conduct themselves in a manner which upholds the law and respects the rights of others. They are responsible for maintaining the academic integrity of the institution and for preserving an environment conducive to the safe pursuit of the Schoolβs educational, research, and professional practice missions.
Disability Support Service
Students requiring accommodations for disabilities should register with Student Disability Service (SDS). It is the responsibility of the student to register for accommodations with SDS. Accommodations take effect upon approval and apply to the remainder of the time for which a student is registered and enrolled at the Bloomberg School of Public Health. Once you are f a student in your class has approved accommodations you will receive formal notification and the student will be encouraged to reach out. If you have questions about requesting accommodations, please contact BSPH.dss@jhu.edu
.
Previous versions of the class
Typos and corrections
Feel free to submit typos/errors/etc via the GitHub repository associated with the class: https://github.com/lcolladotor/jhustatcomputing/issues. You will have the thanks of your grateful instructor!
R session information
options(width = 120)
::session_info() sessioninfo
β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
setting value
version R version 4.4.1 (2024-06-14)
os macOS Sonoma 14.5
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-08-27
pandoc 3.2 @ /opt/homebrew/bin/ (via rmarkdown)
β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
package * version date (UTC) lib source
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
colorout * 1.3-0.2 2024-05-03 [1] Github (jalvesaq/colorout@c6113a2)
digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.0)
evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
knitr 1.48 2024-07-07 [1] CRAN (R 4.4.0)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
xfun 0.46 2024-07-18 [1] CRAN (R 4.4.0)
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
[1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ