If you are new to my team or to LIBD, you might find this page useful. Note that each LIBD principal investigator (PI) might decide to do things differently. Some of the content here comes from jaffe_onboarding, which was actively used from Sep 8, 2019 until Sep 2020.
In your first few weeks, you will be acquiring a lot of information about how we work, how we organize our work, where things are located, and the scientific background on the topic(s) you will be working on. You will also be configuring your computer so you can work efficiently with the resources we have available. As part of this process, you will likely get a LIBD laptop and will need to spend time configuring it before you can use efficient workflow techniques.
As you’ll see, there is a lot of information on this page and there’s going to be lots to learn for quite some time. We even have continuous learning activities, as new things come out all the time! Some of us have been doing this longer than others, but we all had to learn somewhere and want to make it easier for you.
If you are overwhelmed, schedule a Data Science guidance session (DSgs) and we’ll help you get started by pointing you to the relevant parts appropriate to you. For example, we can help you make a list of useful resources or videos to watch from the LIBD rstats club schedule, as well as from the team presentations schedule; both are available as YouTube playlists. These videos help clone ourselves so we can reach more people. This follows the same idea about writing a blog post if 3 people ask the same question (more on this further below).
When in doubt, schedule a DSgs! We only list when we are available, so if we are not, you’ll notice it on Calendly and don’t need to worry about it. Planning ahead is always useful and you should feel free to ask us for help. We like to help and the DSgs help both you and us: we want to make it easy for you to schedule sessions, while we also need to do other things and thus need to limit when we are available.
This might be your first job experience which is very exciting, but also different from being a student. You will thus need to learn to work in our team and maybe adapt some of your behavior. Ultimately, communication will be your friend to avoid any misunderstandings such as different expectations. There are also some rules you will have to follow, which are aimed at making sure we all are on the same playing field. The following are some team guidelines I have, but you should also read the LIBD Employee Handbook which you can find through Paylocity under the “Company” section after you log in.
- Work 40 hours per week. The employee handbook states: “The Lieber Institute’s normal business hours are 8:30 a.m. to 5:30 p.m., Monday through Friday with an hour included for lunch and breaks.” That’s exactly 8 work hours per day, and 40 work hours per week.
- You can measure this by installing the free version of RescueTime.
- Minimize all distractions during the 8 work hours.
- Avoid checking social media or any other lurking activities. You can put your phone in “work mode” (“focus” for iPhone) to silence all notifications as well. If you need to do something else, notify me or request time off through Paylocity when appropriate.
- Aim to achieve a weekly productivity score on RescueTime of 70 or above (ideally over 80).
- I have `terminal` (iterm2), and `rstudio` listed as “very productive”. Then `mail.libd.org` as “productive”. I have `zoom.us` as “very distracting” since I’m personally trying to de-incentivize having meetings all day, but for you that might be different.
- Independently perform basic job responsibilities without instruction.
- Keep a notebook with you to take notes when necessary.
- Maintain an ongoing list of daily to do items.
- Optionally post your to dos on the respective Slack channels by 9:30 am each morning (10:30 am when we have 9-10 am meetings). Doing so can help the most in your first few months on the team, so we can identify how to best help you.
- Write code daily and version control it on Git/GitHub. If writing academic papers, then post equivalent updates on Slack.
- Make at least 10 commits per day, with 15 per day being the ideal number. That’s from 5 commits per 2 hour coding session as noted at “project feedback”, with at least 3 such sessions per day.
- If you spend time writing part of an academic manuscript, then post bullet point updates on the respective Slack channel.
- If you spend more than 15 minutes trying to solve something and are not making any more progress, ask a question about it on a Slack channel (not direct messages) with as much information needed to help you.
- Don’t use direct messages (DMs) on Slack, as multiple people can very likely help you and they won’t be able to do so through DMs.
- For R code, remember to use `reprex` like we mention at “how to ask for help”.
- You could also request help sessions from team members through the Data Science guidance sessions system.
- Communicate your reasoning and thought process. Respond to your colleagues in appropriate time windows.
- Don’t assume others know what you are thinking: be as verbose as possible. Avoid one liner replies on emails and Slack. Respond to Slack messages within a business hour and to emails from colleagues within a business day. CC me on all replies you send via email.
- Write a summary after every 1 on 1 meeting. Doing so reduces the burden on memorization.
- Write a bullet point list or make some GitHub issues after every 1 on 1 meeting you have with me or other colleagues.
- Comply with your “at the office” schedule.
- The exact days will depend on which days you have approval for working remotely (if any at all).
- Read at least one relevant scientific manuscript per week.
- Write a summary about it on the appropriate Slack channel. Relevant manuscripts are those explaining software/methods you are using, those related to the biological question you are working on, or new software/methods we might want to use for the projects you are working on.
- Work on your main projects at least some portion each day.
- Avoid cramming work the day before a meeting for a given project. Working the same number of hours spread throughout the week allows you to identify questions and problems you might need help with, and to have the time to get the required help.
- I like having blocks of 2 hours where I work on a given project during 1 block of time. Switching projects too frequently is detrimental since it takes some time to refresh your memory and load the relevant data.
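To make the daily commit guideline above concrete, here is a minimal shell sketch that counts the commits made since midnight in a repository and compares them against the target of 10. It builds a throwaway repository with three empty commits so it can run anywhere; the demo user name, email, and commit messages are hypothetical. In a real project you would run only the `git rev-list` line from inside your own repository.

```shell
# Sketch: count today's commits in a throwaway repository.
set -eu

repo_dir="$(mktemp -d)"   # scratch repository so nothing real is touched
cd "${repo_dir}"
git init -q .

# Make three empty commits to stand in for part of a coding session
# (the user name and email below are placeholders)
for i in 1 2 3; do
    git -c user.name="Demo User" -c user.email="demo@example.com" \
        commit -q --allow-empty -m "Demo commit ${i}"
done

# Count the commits reachable from HEAD that were made since midnight
commits_today="$(git rev-list --count --since=midnight HEAD)"
daily_target=10           # team guideline: at least 10 commits per day

echo "Commits today: ${commits_today} (target: ${daily_target})"
if [ "${commits_today}" -lt "${daily_target}" ]; then
    echo "Remaining to reach the target: $((daily_target - commits_today))"
fi
```

Note that this counts commits from every author on the current branch; adding `--author="Your Name"` to the `git rev-list` call restricts the count to your own work.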
Below is a list of papers that might provide some useful background information for bulk RNA-seq or DNA methylation. We will also give you a more updated list of background papers to read depending on the project you will be working on. You might find some relevant papers and videos on the team presentations schedule and companion YouTube playlist. For example, check this presentation by Louise A. Huuki-Myers on the Cobos et al, Nature Communications, 2020 deconvolution paper.
- RNA-Seq: a revolutionary tool for transcriptomics
- Advancing RNA-Seq analysis
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks
- TopHat: discovering splice junctions with RNA-Seq
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
- Systematic evaluation of spliced alignment programs for RNA-seq data
- Assessment of transcript reconstruction methods for RNA-seq
- Transcriptome and genome sequencing uncovers functional variation in humans
- Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories
Note that we don’t work much with DNAm data nowadays, unlike the jaffe_onboarding days. However, this list might still be useful for some.
- Measuring cell-type specific differential methylation in human brain tissue
- From promises to practical strategies in epigenetic epidemiology
- A data-driven approach to preprocessing Illumina 450K methylation array data
- Epigenetic memory in induced pluripotent stem cells
- The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores
- Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies
- Analysing and interpreting DNA methylation data
- DNA methylation: roles in mammalian development
- The Key Role of Epigenetics in Human Disease Prevention and Mitigation
- Charting a dynamic DNA methylation landscape of the human genome
- Global Epigenomic Reconfiguration During Mammalian Brain Development
- Divergent neuronal DNA methylation patterns across human cortical development: Critical periods and a unique role of CpH methylation
We use quite a bit of different software and while we list out some here below, you will want to check the LIBD rstats club schedule and companion YouTube playlist for more recent videos that might be relevant to your situation. For example, this video talks about RStudio and key packages like:
You will need to learn the basics of `linux` as we work a lot with the high performance computing cluster named JHPCE, which is a `linux` cluster. A computing cluster is basically a collection of machines (with no monitors) that have more memory than your regular laptop and that you can use for hours/days. That way, you are not limited by the specific properties of your laptop for most of our work.
We use `R` quite a bit, and that’s actually how this website was made. Here are some useful resources, though you will eventually want to check the R/Bioconductor Data Science bootcamps and related resources.
- Download and install R from CRAN.
- R 101 LIBD rstats blog post by Carrie Wright.
- Download RStudio Desktop.
- Using the RStudio Terminal.
- RStudio cheatsheets for many things including regular expressions, `purrr` for functional programming, etc.
- Building Tidy Tools workshop materials by Charlotte Wickham and Hadley Wickham for the rstudio 2019 conference. Materials from other workshops are available here.
- R Markdown documentation website announced by Alison Hill.
- Cookbook for R graphs, 2nd edition with useful
- Updating R blog post, though also check more recent resources on the LIBD rstats club Google Sheet.
- Tutorial for building R websites by Emily C Zabor. More on R Markdown websites here, which is how this website was made.
- You can use RStudio for Python syntax highlighting and to run code through the integrated terminal window on JHPCE. `reticulate` might be of interest if you are working with Python from R.
- If you are writing an R package that will use Python internally, check out
- If you want to open `SummarizedExperiment` objects in Python as `AnnData` objects, you might be interested in
Overall, it’s important to ask for help. We try to follow the advice from the you must try, and then you must ask blog post. Also try to ask in a Slack channel where others might be able to benefit from your answers in the future. If enough people ask about something, we’ll try to write a blog post as advised here by David Robinson. For more details, see the “how to ask for help” page.
When on a team meeting Zoom call and, most importantly, in one on one meetings, please have your camera on. Otherwise it feels like you are talking to the void. There are of course good reasons to turn off your camera, like if your internet connection is slow and you are presenting. Turning your camera on also incentivizes you to pay closer attention. Thank you!
- LieberInstitute is our GitHub organization account. After you’ve made your GitHub account let your PI, Bill Ulrich or Leonardo Collado-Torres know about your username through Slack so they can add you to the relevant teams.
- GitHub issues are a useful way to specify tasks for a given project. As an example, check the libdRSE project.
- git to know git: an 8 minute introduction LIBD rstats blog post by Amy Peterson.
- Happy Git and GitHub for the useR book by Jenny Bryan, Jim Hester and other TAs.
- Videos by Jacob Fiksel on how to use Git for Windows and Mac.
- BFG Repo-Cleaner for fixing GitHub repositories that have data files (for example, large files) that you want to delete from your history. Use the JHPCE LIBD module for bfg with `module load bfg`.
- Commit together with co-authors such that everyone gets recognized GitHub activity for joint work.
- Closing issues using keywords in commit messages.
- Merging a pull request. Also check this comment by Yihui Xie on how to best submit pull requests. Yihui Xie wrote two blog posts related to that particular series of pull requests here. Feel free to make pull requests to edit the contents of this website!
- Syncing a fork.
- How can I undo `git reset --hard HEAD~1`?
- Contributing to the LIBD rstats club blog post if you want to write another LIBD rstats blog post or use blogdown.
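Two of the GitHub features linked above, crediting co-authors and closing issues from commit messages, both come down to how the commit message is written. The following shell sketch shows one way to do it in a throwaway repository. The issue number `#12`, the names, and the email addresses are all hypothetical, and GitHub only acts on these lines once the commit is pushed to the repository's default branch.

```shell
# Sketch: a commit message with a closing keyword and a co-author trailer.
set -eu

repo_dir="$(mktemp -d)"   # scratch repository so nothing real is touched
cd "${repo_dir}"
git init -q .

# Each -m adds a new paragraph to the commit message.
# "Closes #12" is a GitHub closing keyword (issue number is a placeholder).
# The "Co-authored-by:" trailer must sit at the end of the message,
# separated from the text above it by a blank line.
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit -q --allow-empty \
    -m "Fix sample ID parsing" \
    -m "Closes #12" \
    -m "Co-authored-by: Jane Doe <jane.doe@example.com>"

# Print the full message of the most recent commit
commit_message="$(git log -1 --format=%B)"
echo "${commit_message}"
```

For the co-author to get recognized activity, the email in the trailer needs to be one associated with their GitHub account.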
- We have multiple Slack workspaces, though the main one we use is the JHU Genomics Slack which has over 500 members across many JHU and LIBD genomics teams. Ask your PI or Leonardo Collado-Torres to add you to that Slack workspace. Our main team channel is
- Using Slack for Academic Departmental Communication blog post written by Leo and Stephanie Hicks.
- Preferably ask questions on a project channel such that others involved in the project are able to contribute answers. If it’s a more general question, try asking in the `#libd_helpdesk` channel. Beyond our team, you might want to ask for R help at `#rstats` and JHPCE help at `#jhpce`. You might also want to check out `#langmead_rss`, among many others.
- For a new project, create a private channel with the prefix `libd_` then invite the team members who are working on the project or might be able to help.
- To integrate a GitHub repository with a Slack channel, type this command in the channel: `/github subscribe LieberInstitute/RepoName`.
- Slack supports Markdown syntax for your messages. In particular, we often use the backtick for inline code and the code chunk syntax for multi-line code. Examples:
`inline code` example ``` multi line code example ```
inline code example
multi line code example
- We have a team Google Calendar. Ask Leonardo Collado-Torres via Slack to add you to the calendar with your Google account email address (could be your libd one or could be a gmail account). The calendar is linked to `#libd_team_lcolladotor` and sends reminders to that channel 30 minutes before an event happens.
- Tackling Twitter as a Graduate Student blog post.
The “config files” page contains some relevant links that can help you find some software we frequently use or configuration files. Overall, we frequently use:
- RStudio: for writing R/Python code, executing it at JHPCE through the integrated terminal.
- Cyberduck or WinSCP for browsing JHPCE files and previewing PDF files.
- Git: for version controlling files. See the install git chapter from Happy Git and GitHub for the useR.
- GitHub: for sharing our code within the team and externally, plus to transfer code between your laptop and JHPCE.
- TextMate or Sublime Text or some other text editor that has code syntax highlighting. RStudio has this too.
- JHPCE modules: for having a common installation location at JHPCE for software we use. See jhpce_mod_source and jhpce_module_config for more details.
Since most analyses we run involve resources larger than those in our laptops, we typically use a high performance computing environment, particularly JHPCE. Note that the only official pronunciation for JHPCE is by its letters J-H-P-C-E.
One of the first things you’ll do is get a JHPCE account: under “sponsoring organization” scroll all the way down till you find the LIBD PIs such as “Martinowich” (applicable also for members of Stephanie Page and Kristen Maynard’s teams) or “Torres” for Leonardo Collado-Torres. You will receive an email with a private link to schedule your appointment at one of the training sessions for new users. They typically have space for 8 people and do the training sessions once every 2 weeks, so you will want to do this early as it can take a while to set up. Even if you are not going to write code but want to see some plots others have made, you will need to get your JHPCE account.
Once you have your JHPCE account, you will want to set up your laptop and JHPCE to work efficiently. I highly encourage you to use some of the configuration files from the “config files” page.
- Check the LIBD rstats club videos on JHPCE laptop configuration. I also highly encourage you to schedule a Data Science guidance session to have someone check your configuration files.
- Set up password-less login through ssh key pairs. You will need 4 key pairs.
- One key pair for your laptop to the JHPCE login node.
- One key pair from the JHPCE login node to the compute nodes (typically created by JHPCE admins).
- One key pair from your laptop to GitHub; see their documentation.
- One key pair from JHPCE to GitHub.
- You should edit your JHPCE `~/.bashrc` and create your JHPCE `~/.inputrc` file following the information we have at the “config files” page.
- You might want to do something similar for your macOS `~/.bashrc` file. Though I lately use a `~/.zshrc` file on my macOS computer.
- You should specify your email on your JHPCE `~/.sge_request` file following the information we have at the “config files” page.
- You should edit your laptop’s and JHPCE `~/.gitconfig` files following the information we have at the “config files” page.
- We will want to email BitSupport to make sure that your default user group is not `users`, but a more appropriate group such as
- You will likely want to configure your laptop & JHPCE to be able to use `rmate` to open JHPCE files from the terminal window in TextMate / Sublime Text / other text editors. This will involve editing your laptop’s `~/.ssh/config` as well as your JHPCE’s `~/.ssh/config` files. See the video below for more information about this process.
- You might be interested in configuring `rmote` to emulate the RStudio plot and help file panels. This will also involve editing your laptop’s `~/.ssh/config` as well as your JHPCE’s
- You will likely want to edit your laptop’s and JHPCE `~/.Renviron` configuration files following the information we have at the “config files” page.
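As a rough illustration of the ssh key pair and `~/.ssh/config` steps in the list above, the shell sketch below generates one key pair and writes a matching configuration stanza. Everything happens in a scratch directory, so it never touches your real `~/.ssh`. Several values are assumptions rather than team settings: the `jhpce` alias, the `jhpce01.jhsph.edu` host name, the `your_jhpce_username` placeholder, the key file name, and the `RemoteForward` port (52698 is the default used by `rmate`). Treat the “config files” page as the reference for the actual files.

```shell
# Sketch: create one ssh key pair and a matching ssh config stanza
# in a scratch directory (not in your real ~/.ssh).
set -eu

demo_dir="$(mktemp -d)"
key_file="${demo_dir}/id_ed25519_jhpce"   # hypothetical key file name

# Generate the laptop-to-login-node key pair when ssh-keygen is available.
# -N "" sets an empty passphrase; you may prefer a passphrase plus ssh-agent.
if command -v ssh-keygen >/dev/null 2>&1; then
    ssh-keygen -q -t ed25519 -N "" -C "laptop-to-jhpce" -f "${key_file}"
fi

# Write the stanza that would go in your laptop's ~/.ssh/config.
# Host alias, host name, and user below are assumptions for illustration.
cat > "${demo_dir}/config" <<EOF
Host jhpce
    HostName jhpce01.jhsph.edu
    User your_jhpce_username
    IdentityFile ${key_file}
    # Forward the default rmate port so JHPCE files open in your editor
    RemoteForward 52698 localhost:52698
EOF

cat "${demo_dir}/config"
```

With a stanza like this in place, `ssh jhpce` would be enough to log in. The public key (the `.pub` file) still has to be added to `~/.ssh/authorized_keys` on the JHPCE side, and a second key pair would be registered with GitHub following their documentation.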
Now that you have your JHPCE account and can access files, you’ll want to get familiarized with many parts of JHPCE. You can see the archive of questions people have asked through `bithelp` and/or the
- Environment modules are useful so you don’t have to install some common software from scratch, such as `R` which you can load using `module load conda_R`.
- Use LIBD modules which you can setup following these instructions. jhpce_mod_source and jhpce_module_config are our GitHub repositories for our LIBD modules.
- Edit your bashrc file for a nicer terminal experience blog post. See the “config files” page for the latest common configuration files we use.
- sgejobs (documentation): contains helper functions for writing and interacting with SGE jobs at JHPCE.
- Array and Sequential Cluster Jobs. See also
- Setting up your computer for bioinformatics/biostatistics and a compedium of resources 2012 blog post. Includes some Mac and Windows tools.
- Login to the cluster, request a node and change to your project directory in a single command 2013 blog post.
- me: Bad rm, don’t delete stuff I didn’t want to delete! (rm: well, I do what you tell me to do!) 2012 blog post.
- Automatically coloring your R output in the terminal using colorout blog post on `colorout` which you can access through jalvesaq/colorout nowadays and install with
- See “config files” page for the latest configuration files.
There are a lot of miscellaneous things about using JHPCE that might save you some confusion if you know them now:
- There’s a 100G storage limit on your home directory `/users/[yourusername]/`, which will likely fill up very quickly. Most of us have directories under the `/dcl02` filesystems, where there is far more space.
- You generally will want to include the `-cwd` flag in bash scripts. Without it, output files are written to your home directory, regardless of where you submit the script.
- For scripts/jobs where you are generating large files, you will need to change the maximum writable file size. This is the `h_fsize` flag for bash scripts, which defaults to 10G. See https://jhpce.jhu.edu/knowledge-base/how-to/ for more details.
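The `-cwd` and `h_fsize` flags mentioned above are usually set as `#$` directive lines at the top of an SGE job script. The sketch below writes such a script to a scratch directory and only checks its syntax, since `qsub` and `module` exist only on the cluster. The script name, job name, the 100G value, and the R command are illustrative assumptions, not team defaults.

```shell
# Sketch: write a small SGE job script and check its syntax locally.
set -eu

job_dir="$(mktemp -d)"
job_script="${job_dir}/demo_job.sh"   # hypothetical script name

# Lines starting with "#$" are read by SGE as qsub options;
# to the shell they are plain comments.
cat > "${job_script}" <<'EOF'
#!/bin/bash
# Write output files in the directory the job was submitted from
#$ -cwd
# Raise the maximum writable file size above the 10G default (example value)
#$ -l h_fsize=100G
# Job name (placeholder)
#$ -N demo_job

# Load R from the environment modules, then run some R code
module load conda_R
Rscript -e 'sessionInfo()'
EOF

# Parse the script without executing it, then show it
sh -n "${job_script}"
cat "${job_script}"
```

On JHPCE you would submit a script like this with `qsub demo_job.sh` from your project directory. The `sgejobs` package listed above provides helpers for writing these scripts.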
- Cyberduck for accessing files remotely. With it, you can open any R files at JHPCE with RStudio in your laptop, then open a terminal from RStudio so you can execute your code.
- Alfred for quickly finding files among many other powerful workflows. I highly recommend paying for the power pack, to get access to the most advanced (and time saving) features.
- Textmate setup (Mac only) LIBD rstats 2018 blog post; it’s an alternative to using RStudio. The blog post also explains `rmate` which now also exists as a LIBD module that you can load with `module load rmate` at JHPCE. There’s also a more recent video on this topic on the LIBD rstats club Google Sheet.
- Mac keyboard shortcuts.
- WinSCP for browsing JHPCE files, previewing PDF files, opening R files on RStudio, etc.
- putty (which comes with WinSCP), for accessing JHPCE through a terminal window.
- MobaXTerm combines SSH and SFTP functionality and is fairly simple to set up and use.
- git for windows which includes Git Bash. Check rstudio issue 2224 for specifics on how to install Git Bash such that it will work with RStudio terminal.
For both macOS and Windows, you might be interested in this older (2012) blog post about my computer setup.
For example, you might enjoy the video about Bioconductor or the one on
- Where do I start using Bioconductor? blog post.
- How to ask for help for Bioconductor packages blog post.
- Bioconductor workflows such as rnaseqGene and sequencing are very helpful to get started in the RNA sequencing (RNA-seq) field.
- BioC2019 workshops are available here. More recently, you might want to use Orchestra to find recent workshops and even access them interactively.
- limma: is one of the main packages we use for analyzing data, particularly with the limma-voom method. Check the Bioconductor workflows that explain limma in more detail here.
- GenomicRanges: is the package we use for interacting with genomic data across ranges. Check the introduction for more information.
- SummarizedExperiment: is the package we use for creating and interacting with
- bsseq: is highly useful for analyzing DNA methylation data, particularly whole genome bisulfite sequencing (WGBS) data.
- Biostrings: is useful for dealing with sequence information such as the human genome sequence.
- jaffelab (documentation): contains several functions we’ve written over the years that we find helpful for different analyses.
- sgejobs (documentation): contains helper functions for writing and interacting with SGE jobs at JHPCE. We have several videos on `sgejobs` at the rstats club schedule.
- jhpce_mod_source and jhpce_module_config are our GitHub repositories for our LIBD modules.
- shinycsv: allows interactively exploring a table.
- megadepth by David Zhang, Christopher Wilks, et al is an R package for dealing with BigWig files. Check also the older recount.bwtool.
- LIBD RNA-seq processing pipeline: SPEAQeasy, and SPEAQeasy-example. See also the initial version.
- globus: our Globus share endpoints such that others can access our data.
- LIBD rstats club described in more detail on this blog post.
- Joint genomics meeting: check the `#joint_group_meeting` channel for more information about this bi-weekly (every 2 weeks) meeting across multiple JHU labs.
- Baltimore R Ladies: check the `#r-ladies` Slack channel as well as meetup.com/rladies-baltimore/. They also have a website with code from earlier sessions and you can find the slides here.
- JHU Genomics events: check their calendar here. Every year they organize a symposium around October where you can present a poster if you want. They also organize other workshops which you might be interested in signing up for.
- BioC: announced here.
- ASHG: website.
- CSHL Biology of Genomes (BoG) or Biological Data Science (BDS): through the CSHL website.
- RStudio conf: website.
- Women in Data Science (has multiple meetings), website.
- Joint Statistical Meetings (JSM): through the ASA website.
See the conferences/courses of interest Google Sheet for more up to date information.
- North East Market: it’s on 2101 E Monument St which is just a few minutes away walking. You can find a lot of variety there including a Korean stall along the south wall. That stall is pay-by-weight and lots of the dishes are delicious.
- School of Public Health 9th floor salad bar Garden Plate: walk two blocks north to 615 N Wolfe St or K 3-4 on the Hopkins East Baltimore Campus map, then take an elevator on the south side of the building to the 9th floor. The elevators on the north side don’t go all the way up to the 9th floor.
- Kabobi: Afghan food, really quick for take-out. Popular choices are the lamb or chicken rice bowls, chopped salads, and the veggie rice bowls.
- Greenhouse: This is in the neighboring Preclinical Teaching Building (PCTB); it’s on K2 on the Hopkins East Baltimore Campus map. Have someone show you the inside way, or go outside and it’s just inside the Wolfe/Monument corner entrance. It’s a pay-by-weight buffet with salad bar and hot options. Not amazing, but a great lower cost option, especially if you are going for the lighter weight foods like salad!
- Koco - Korean food on Tuesdays and Fridays. A lab favorite! Favorites are bulgogi (beef) and tofu dishes.
- Chowhound - burgers (incl. veggie).
- Other good ones: Greek truck, Kabob/falafel truck in front of Starbucks
- Food court inside the hospital with Subway, Einstein Bros, and others. It’s on G-H 4 on the Hopkins East Baltimore campus map. There’s a second one at I6 on the same map.
- Dunkin, Subway, and Chinese places along Monument St.