01 - Welcome!

Overview course information for BSPH Biostatistics 140.776 in Fall 2023
course-admin
module 1
week 1
Author
Affiliations

This lecture, as the rest of the course, is adapted from the version Stephanie C. Hicks designed and maintained in 2021 and 2022. Check the recent changes to this file through the GitHub history.

Welcome! I am very excited to have you in our one-term (i.e. half a semester) course on Statistical Computing course number (140.776) offered by the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. This will be my first time as the lead instructor of a JHBSPH course! πŸ™ŒπŸ½

This course is designed for ScM and PhD students at Johns Hopkins Bloomberg School of Public Health. I am pretty flexible about permitting outside students, but I want everyone to be aware of the goals and assumptions so no one feels like they are surprised by how the class works.

Note

The primary goal of the course is to teach you practical programming and computational skills required for the research and application of statistical methods.

This class is not designed to teach the theoretical aspects of statistical or computational methods, but rather the goal is to help with the practical issues related to setting up a statistical computing environment for data analyses, developing high-quality R packages, conducting reproducible data analyses, best practices for data visualization and writing code, and creating websites for personal or project use.

Assumptions and pre-requisites

The course is designed for students in the Johns Hopkins Biostatistics Masters and PhD programs. However, we do not assume a significant background in statistics. Specifically we assume:

1. You know the basics of at least one programming language (e.g. R or Python)

  • If it’s not R, we assume that you are willing to spend the time to learn R
  • You have heard of things such as control structures, functions, loops, etc
  • Know the difference between different data types (e.g. character, numeric, etc)
  • Know the basics of plotting (e.g. what is a scatterplot, histogram, etc)

2. You know the basics of computing environments

  • You have access to a computing environment (i.e. locally on a laptop or working in the cloud)
  • You generally feel comfortable with installing and working with software

3. You know the basics of statistics

  • The central dogma (estimates, standard errors, basic distributions, etc.)
  • Key statistical terms and methods
  • Differences between estimation vs testing vs prediction
  • Know how to fit and interpret basic statistical models (e.g. linear models)

4. You know the basics of reproducible research

  • Difference between replication and reproducible. If not, check this excellent paper by Prasad Patil et al.: https://doi.org/10.1038/s41562-019-0629-z.
  • Know how to cite references (e.g. like in a publication)
  • Somewhat familiar with tools that enable reproducible research (In complete transparency, we will briefly cover these topics in the first week, but depending on your comfort level with them, this may impact whether you choose to continue with the course).

Since the target audience for this course is advanced students in statistics we will not be able to spend significant time covering these concepts and technologies. To give you some idea about how these prerequisites will impact your experience in the course, we will be using R for nearly all classes, we will be turning in all assignments via R Markdown documents and you will be encouraged (not required) to use git/GitHub to track changes to your code over time. The majority of the assignments will involve learning the practical issues around performing data analyses, building software packages, building websites, etc all using the R programming language. Data analyses you will perform will also often involve significant data extraction, cleaning, and transformation. We will learn about tools to do all of this, but hopefully most of this sounds familiar to you so you can focus on the concepts we will be teaching around best practices for statistical computing.

Tip

Some resources that may be useful if you feel you may be missing pieces of this background:

Getting set up

You must install R and RStudio on your computing environment in order to complete this course.

These are two different applications that must be installed separately before they can be used together:

  • R is the core underlying programming language and computing engine that we will be learning in this course

  • RStudio is an interface into R that makes many aspects of using and programming R simpler

Both R and RStudio are available for Windows, macOS, and most flavors of Unix and Linux. Please download the version that is suitable for your computing setup.

Throughout the course, we will make use of numerous R add-on packages that must be installed over the Internet. Packages can be installed using the install.packages() function in R. For example, to install the tidyverse package, you can run

install.packages("tidyverse")

in the R console.

How to Download R for Windows

Go to https://cran.r-project.org and

  1. Click the link to β€œDownload R for Windows”

  2. Click on β€œbase”

  3. Click on β€œDownload R 4.3.1 for Windows”

Warning

The version in the video is not the latest version of R. Please download the latest version.

Video Demo for Downloading R for Windows

How to Download R for the Mac

Goto https://cran.r-project.org and

  1. Click the link to β€œDownload R for (Mac) OS X”.

  2. Click on β€œR-4.3.1.pkg”

Warning

The version in the video is not the latest version of R. Please download the latest version.

Video Demo for Downloading R for the Mac

How to Download RStudio

Goto https://rstudio.com and

  1. Click on β€œProducts” in the top menu

  2. Then click on β€œRStudio” in the drop down menu

  3. Click on β€œRStudio Desktop”

  4. Click the button that says β€œDOWNLOAD RSTUDIO DESKTOP”

  5. Click the button under β€œRStudio Desktop” Free

  6. Under the section β€œAll Installers” choose the file that is appropriate for your operating system.

Warning

NOTE: The video shows how to download RStudio for the Mac but you should download RStudio for whatever computing setup you have

Video Demo for Downloading RStudio

How to Download Git

Install Git on your computer following the detailed instructions at https://happygitwithr.com/install-git, which will depend on your operating system.

Create your GitHub account

Create a personal GitHub account following the instructions at https://docs.github.com/en/get-started/signing-up-for-github/signing-up-for-a-new-github-account.

Following Jeff Leek’s advice on How to be a modern scientist that was part of my team’s bootcamps at https://lcolladotor.github.io/bioc_team_ds/how-to-be-a-modern-scientist.html, try to choose a username that will be the same one you use on your email and other work-related social media platforms such as Twitter.

Learning Objectives

The goal is by the end of the class, students will be able to:

  1. Install and configure software necessary for a statistical programming environment
  2. Discuss generic programming language concepts as they are implemented in a high-level statistical language
  3. Write and debug code in base R and the tidyverse (and integrate code from Python modules)
  4. Build basic data visualizations using R and the tidyverse
  5. Discuss best practices for coding and reproducible research, basics of data ethics, basics of working with special data types, and basics of storing data

Course logistics

Unlike 2022, the only option is to take the course is in-person (140.776.01).

All communication for the course is going to take place on one of three platforms:

  • Courseplus: for discussion, sharing resources, collaborating, and announcements

  • Github: for getting access to course materials (e.g. lectures, project assignments)

  • Lectures: most lectures will be in person

    • All in person lectures to be recorded and posted online after class ends
    • If for some reason I am sick or not capable of coming onsite, I will send out a zoom link for everyone to attend remotely for that day. I already know that I will be teaching remotely on Tuesday August 29 and Thursday August 31st, due to prior travel arrangments I had made before knowing when I would be teaching this class.

The primary communication for the class will go through Courseplus. That is where we will post course announcements, answer common questions, and as the primary means of communication between course participants and course instructors.

Important

If you are registered for the course, you should have access to Courseplus now. Once you have access you will also be able to find all material and dates/times of drop-in hours. Any Zoom links will be posted on Courseplus.

Course Staff

The course instructor this year is Leonardo Collado Torres who is an Investigator at the Lieber Institute for Brain Development and is an Associate in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. I was actually a student of this course in Fall 2012 when Roger D. Peng was the instructor of this course and Jenna Krall was the only teaching assistant. I myself started learning and teaching about R in 2008 and we have a lot of material in the LIBD rstats club that you might be interested in.

At the Lieber Institute for Brain Development (LIBD), my group works on understanding the roots and signatures of disease (particularly psychiatric disorders) by zooming in across dimensions of gene activity. We achieve this by studying gene expression at all expression feature levels (genes, exons, exon-exon junctions, and un-annotated regions) and by using different gene expression measurement technologies (bulk RNA-seq, single cell/nucleus RNA-seq, and spatial transcriptomics) that provide finer biological resolution and localization of gene expression. We work closely with collaborators from LIBD as well as from Johns Hopkins University (JHU) and other institutions, which reflects the cross-disciplinary approach and diversity in expertise needed to further advance our understanding of high throughput biology.

Every day I use R and Bioconductor, and on some days I write R packages. Occasionally I write blog posts about them and other tools. I’m a co-founder of the LIBD rstats club and the CDSB community of R and Bioconductor developers in Mexico and Latin America, that we described at the R Consortium website. In the past, I also served on the Bioconductor Community Advisory Board and the advisory board for rOpenSci’s Statistical Software Peer Review.

If you want, you can find me on Twitter.

Teaching Assistants

We also have three amazing TAs this year:

  • Emily Norton (enorton7@jhmi.edu). She is a second year Ph.D. student in Biostatistics, with interest in understanding how genetics and environment interact to impact someone’s risk of disease. She currently validates polygenic risk scores and, in combination with other risk factors, translates disease risk into a clinically-relevant quantity. Aside from research, she enjoys swing dancing, camping, and crafting.
  • Joe Sartini (jsartin1@jhu.edu). He is a third year Ph.D. student in Biostatistics, with interest in models for precision medicine applications. Currently, the focus of his work is extracting meaningful insights from time-series produced by Continuous Glucose Monitoring and other devices worn by Type 2 diabetics. Outside of research, he enjoys participating in endurance sports and weightlifting.
  • Phyllis Wei (ywei43@jhu.edu). She is a third year Ph.D. student in Biostatistics. She develops methods to help understand the genetic basis of complex traits and diseases and enhance disease risk prediction models through data integration. Outside of biostatistics, she enjoys hiking, baking, and visiting museums.

Assignment Due Dates

All course assignment due dates appear on the Schedule and Syllabus.

Grading

Philosophy

We believe the purpose of graduate education is to train you to be able to think for yourself and initiate and complete your own projects. We are super excited to talk to you about ideas, work out solutions with you, and help you to figure out how to produce professional data analyses. We do not think that graduate school grades are important for this purpose. This means that we do not care very much about graduate student grades.

That being said, we have to give you a grade so they will be:

  • A - Excellent - 90%+
  • B - Passing - 80%+
  • C - Needs improvement - 70%+

We rarely give out grades below a C and if you consistently submit work, and do your best you are very likely to get an A or a B in the course.

When I was a JHBSPH student in 2011-2016, I had a scholarship from my country πŸ‡²πŸ‡½ which had specific grade requirements, so I recognize that while most employers won’t care about your grades over your grad school projects, you might have strong reasons for aiming for high grades.

Relative weights

The grades are based on three projects (plus one entirely optional project to help you get set up). The breakdown of grading will be

  • 33% for Project 1
  • 33% for Project 2
  • 34% for Project 3

If you submit an project solution, it is your own work, and it meets a basic level of completeness and effort you will get 100% for that project. If you submit a project solution, but it doesn’t meet basic completeness and effort you will receive 50%. If you do not submit an solution you will receive 0%.

Submitting assignments

Please write up your project solutions using R Markdown. In some cases, you will compile a R Markdown file into an HTML file and submit your HTML file to the dropbox on Courseplus. In other cases, you may create an R package or website. In all of the above, when applicable, show all your code and provide as much explanation / documentation as you can.

For each project, we will provide a time when we download the materials. We will assume whatever version we download at that time is what you are turning in.

Reproducibility

We will talk about reproducibility a bit during class, and it will be a part of the homework assignments as well. Reproducibility of scientific code is very challenging, so the faculty and TAs completely understand difficulties that arise. But we think that it is important that you practice reproducible research. In particular, your project assignments should perform the tasks that you are asked to do and create the figures and tables you are asked to make as a part of the compilation of your document. We will have some pointers for some issues that have come up as we announce the projects.

Code of Conduct

We are committed to providing a welcoming, inclusive, and harassment-free experience for everyone, regardless of gender, gender identity and expression, age, sexual orientation, disability, physical appearance, body size, race, ethnicity, religion (or lack thereof), political beliefs/leanings, or technology choices. We do not tolerate harassment of course participants in any form. Sexual language and imagery is not appropriate for any work event, including group meetings, conferences, talks, parties, Twitter, and other online media. This code of conduct applies to all course participants, including instructors and TAs, and applies to all modes of interaction, both in-person and online, including GitHub project repos, Slack channels, and Twitter.

I was also part of the Bioconductor Code of Conduct committee for a few years. You might find the Bioconductor Code of Conduct useful as it is translated into different languages by native speakers of said languages: https://bioconductor.github.io/bioc_coc_multilingual/.

Course participants violating these rules will be referred to leadership of the Department of Biostatistics and the Title IX coordinator at JHU and may face expulsion from the class.

All class participants agree to:

  • Be considerate in speech and actions, and actively seek to acknowledge and respect the boundaries of other members.
  • Be respectful. Disagreements happen, but do not require poor behavior or poor manners. Frustration is inevitable, but it should never turn into a personal attack. A community where people feel uncomfortable or threatened is not a productive one. Course participants should be respectful both of the other course participants and those outside the course.
  • Refrain from demeaning, discriminatory, or harassing behavior and speech. Harassment includes, but is not limited to: deliberate intimidation; stalking; unwanted photography or recording; sustained or willful disruption of talks or other events; inappropriate physical contact; use of sexual or discriminatory imagery, comments, or jokes; and unwelcome sexual attention. If you feel that someone has harassed you or otherwise treated you inappropriately, please alert Stephanie Hicks.
  • Take care of each other. Refrain from advocating for, or encouraging, any of the above behavior. And, if someone asks you to stop, then stop. Alert Stephanie Hicks if you notice a dangerous situation, someone in distress, or violations of this code of conduct, even if they seem inconsequential.

Need Help?

Please speak with Leonardo Collado Torres or one of the TAs. You can also reach out to Elizabeth Stuart, chair of the department of Biostatistics or Margaret Taub, Ombudsman for the Department of Biostatistics.

You may also reach out to any Hopkins resource for sexual harassment, discrimination, or misconduct:

Feedback

We welcome feedback on this Code of Conduct.

License and attribution

This Code of Conduct is distributed under a Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. Portions of above text comprised of language from the Codes of Conduct adopted by rOpenSci and Django, which are licensed by CC BY-SA 4.0 and CC BY 3.0. This work was further inspired by Ada Initiative’s β€˜β€™how to design a code of conduct for your community’’ and Geek Feminism’s Code of conduct evaluations and expanded by Ashley Johnson and Shannon Ellis in the Jeff Leek group.

Academic Ethics

Students enrolled in the Bloomberg School of Public Health of The Johns Hopkins University assume an obligation to conduct themselves in a manner appropriate to the University’s mission as an institution of higher education. A student is obligated to refrain from acts which he or she knows, or under the circumstances has reason to know, impair the academic integrity of the University. Violations of academic integrity include, but are not limited to: cheating; plagiarism; knowingly furnishing false information to any agent of the University for inclusion in the academic record; violation of the rights and welfare of animal or human subjects in research; and misconduct as a member of either School or University committees or recognized groups or organizations.

Students should be familiar with the policies and procedures specified under Policy and Procedure Manual Student-01 (Academic Ethics), available on the school’s portal.

The faculty, staff and students of the Bloomberg School of Public Health and the Johns Hopkins University have the shared responsibility to conduct themselves in a manner that upholds the law and respects the rights of others. Students enrolled in the School are subject to the Student Conduct Code (detailed in Policy and Procedure Manual Student-06) and assume an obligation to conduct themselves in a manner which upholds the law and respects the rights of others. They are responsible for maintaining the academic integrity of the institution and for preserving an environment conducive to the safe pursuit of the School’s educational, research, and professional practice missions.

Disability Support Service

Students requiring accommodations for disabilities should register with Student Disability Service (SDS). It is the responsibility of the student to register for accommodations with SDS. Accommodations take effect upon approval and apply to the remainder of the time for which a student is registered and enrolled at the Bloomberg School of Public Health. Once you are f a student in your class has approved accommodations you will receive formal notification and the student will be encouraged to reach out. If you have questions about requesting accommodations, please contact BSPH.dss@jhu.edu.

Previous versions of the class

Typos and corrections

Feel free to submit typos/errors/etc via the GitHub repository associated with the class: https://github.com/lcolladotor/jhustatcomputing2023/issues. You will have the thanks of your grateful instructor!

R session information

options(width = 120)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.5
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Mexico_City
 date     2023-08-29
 pandoc   3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 colorout      1.2-2   2023-05-06 [1] Github (jalvesaq/colorout@79931fd)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 htmltools     0.5.6   2023-08-10 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.0)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.0)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
 rmarkdown     2.24    2023-08-14 [1] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 xfun          0.40    2023-08-09 [1] CRAN (R 4.3.0)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────