In the anonymous team survey from 2021-03, several responses revealed that as a team we have some work to do to be more organized at JHPCE / GitHub, as well as in documenting our work. This chapter attempts to clarify what you can do to get feedback.

## 13.1 JHPCE file organization

### 13.1.1 Background history

As noted in the survey mentioned earlier, files at JHPCE could be organized better. We have multiple disks at JHPCE and, for historical reasons, some projects live at locations such as /dcl01/lieber/ajaffe/lab and similar directories such as /dcl01/lieber/ajaffe/lab/brainseq_phase2, while other projects live inside directories named after individuals such as /dcl01/lieber/ajaffe/Nick. This makes it challenging to locate all the projects.

Even after re-organizing files, we’ll have multiple “group home” directories given that we have multiple disks and no disk is large enough to store all the data.

JHPCE admins send quarterly invoices that keep track of expenses in two ways:

• for storage, they compute the size of the main paths such as /dcl01/lieber/ajaffe
• for compute hours, they keep track of the memory and compute hours used by each individual for each compute node queue (namely, shared and bluejay)

In 2021 (this work is currently in progress), we are re-organizing files such that JHPCE will compute storage one level deeper than they used to, such as /dcl01/lieber/ajaffe/dir01 and /dcl01/lieber/ajaffe/dir02. The idea is that this extra level of information will give LIBD Finance more detail on which grant or project code to charge for the different storage costs we incur at JHPCE. We also want to avoid personal directories, thus the “group home” directories will become the root for each disk such as /dcl01/lieber/ajaffe/. Overall, this new organization will also help us identify old data that could be deleted.
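To get a feel for what storage accounting one level below the disk root looks like, here is a minimal sketch using `du` on a throwaway directory. It is not how the JHPCE admins compute the invoices (their method is not described here); the sandbox and the contents of `dir01` and `dir02` are made up for illustration.

```shell
# Build a throwaway stand-in for a group disk such as /dcl01/lieber/ajaffe
# (hypothetical contents; nothing here touches real JHPCE paths).
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/dir01" "$sandbox/dir02"
head -c 200000 /dev/zero > "$sandbox/dir01/big_file.bin"
head -c 1000 /dev/zero > "$sandbox/dir02/small_file.bin"

# Report disk usage in KiB for the disk root and for each directory one
# level below it. `-d 1` limits the depth; `sort -n` orders lines by size.
usage_report="$(du -k -d 1 "$sandbox" | sort -n)"
echo "$usage_report"
```

Each output line pairs a size with a path, so a per-project directory maps directly to one line of the report.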

Another issue we have with the current file organization at JHPCE is that we don’t know which data has to be backed up and which doesn’t. Files can be classified into:

• raw-data: FASTQ, images, sample phenotype tables, etc
• code: ideally version-controlled through GitHub
• processed-data: files generated by running some code on the raw-data or other processed data

### 13.1.2 Organization since 2021

The new organization therefore looks like this:

jhpce \$ tree -a
.
├── group_home_disk1
│   └── projectName_grantNumber
│       ├── .git
│       ├── .gitignore
│       ├── code
│       │   ├── 01_some_analysis
│       │   │   ├── 01_some_code.R
│       │   │   └── 02_more_code.sh
│       │   ├── 02_another_analysis
│       │   │   ├── 01_initial_code.R
│       │   │   └── 02_plot_hello.R
│       │   └── 03_image_processing
│       │       └── 01_img_process.sh
│       ├── plots
│       │   └── 02_another_analysis
│       │       └── hello.pdf
│       ├── processed-data
│       │   ├── 01_some_analysis
│       │   │   └── some_file.Rdata
│       │   ├── 02_another_analysis
│       │   │   └── some_data.Rdata
│       │   └── 03_image_processing
│       │       ├── sample1_processed.tiff
│       │       └── sample2_processed.tiff
│       ├── projectName_grantNumber.Rproj
│       └── raw-data
│           ├── FASTQ
│           │   ├── sample1.fastq.gz
│           │   └── sample2.fastq.gz
│           ├── images
│           │   ├── sample1_raw.tiff
│           │   └── sample2_raw.tiff
│           └── sample_phenotype.csv
└── group_home_disk2
    ├── backup_projectName_grantNumber
    │   ├── code
    │   ├── plots
    │   ├── processed-data
    │   └── raw-data
    ├── projectHello_grantRANDOM14515
    └── projectTest_grantRANDOM0123

24 directories, 19 files
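
A project skeleton matching the tree above could be created along these lines. This is a sketch, assuming a throwaway root made with `mktemp -d` in place of a real group home directory; the project name is the placeholder used in the tree.

```shell
# Stand-in for a group home such as group_home_disk1 (hypothetical sandbox).
group_home="$(mktemp -d)"
project="$group_home/projectName_grantNumber"

# One sub-directory per analysis step, mirrored across code, plots and
# processed-data so outputs can be traced back to the code that made them.
# (The tree above only shows a plots directory for 02_another_analysis.)
for step in 01_some_analysis 02_another_analysis 03_image_processing; do
    mkdir -p "$project/code/$step" "$project/plots/$step" \
        "$project/processed-data/$step"
done
mkdir -p "$project/raw-data/FASTQ" "$project/raw-data/images"

# Placeholder files for the project-level documentation and ignore rules.
touch "$project/README.md" "$project/.gitignore"

# List the directories that were created.
find "$project" -type d | sort
```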

In this example, /group_home_disk1/projectName_grantNumber has been backed up across JHPCE disks to /group_home_disk2/backup_projectName_grantNumber. Although it is not visible in the view above, the backup will not be group-writable, unlike the original location. For some projects we might just back up the raw data; for others we’ll back up the code, plots and processed-data. /group_home_disk2 also contains its own unique projects that are not backed up at /group_home_disk1 (projectHello_grantRANDOM14515 and projectTest_grantRANDOM0123).
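
The cross-disk backup described above could look roughly like the following. This is a sketch under assumptions: the tool used for the real backups is not specified here, so plain `cp` and `chmod` stand in for it, and both "disks" are throwaway directories.

```shell
# Two throwaway directories stand in for the two group home disks.
disk1="$(mktemp -d)"
disk2="$(mktemp -d)"
src="$disk1/projectName_grantNumber"
dest="$disk2/backup_projectName_grantNumber"

# A tiny stand-in project whose raw data is group-writable.
mkdir -p "$src/raw-data"
echo "sample_id,age" > "$src/raw-data/sample_phenotype.csv"
chmod -R g+w "$src"

# Copy only the raw data (some projects back up just this part), keeping
# permissions and timestamps, then drop group write access on the backup.
mkdir -p "$dest"
cp -Rp "$src/raw-data" "$dest/"
chmod -R g-w "$dest"

ls -l "$dest/raw-data"
```

Removing group write access after the copy keeps the original location editable by the team while protecting the backup from accidental changes.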

/group_home_disk1/projectName_grantNumber has several components.

• .git: it is version controlled at that level using Git.
• .gitignore: specifies which files to ignore, typically raw-data and processed-data. It could also ignore plots, though smaller plots could be version controlled using git add -f (-f forces Git to version that file even if it’s ignored in a .gitignore file).
• README.md: describes the project and some main features of the project. It should include the JHPCE location for the project as noted in the project documentation further below.
• code: contains directories that are ordered by using a two-digit leading ID (or three digits if the project is huge).
• 01_some_analysis and 02_another_analysis: contain scripts that also use a two-digit leading ID to help organize the files in the order they should be used for re-generating the processed data and plots.
• plots: contains a directory matching the directory names used in code such that it is clear which code generates which plot. Here for plots/02_another_analysis/hello.pdf we can identify that the code for this plot lives at code/02_another_analysis/02_plot_hello.R.
• processed-data: also contains a directory matching the directory names used in code.
• raw-data: contains the raw data for the project. If the raw data is backed up elsewhere, then we can use symbolic links to point to the location of the data. For example, /group_home_disk1/projectName_grantNumber/raw-data/FASTQ could be a soft link to /dcl02/lieber/ajaffe/Nina/Joel_R01/Year1/fastq. This will help differentiate the raw data our group is responsible for backing up from the raw data that the LIBD RNA-seq core facility or other entities are responsible for backing up. It also contains the sample phenotype information such as /group_home_disk1/projectName_grantNumber/raw-data/sample_phenotype.csv that is required for any analyses of the raw data.
• projectName_grantNumber.Rproj: an RStudio project file, which makes it easier to write R code.
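
Two of the components above can be sketched in a few commands: ignoring data directories while force-adding a small plot, and pointing raw-data/FASTQ at data stored elsewhere. Everything runs in a throwaway directory; the `.gitignore` entries and the `core_facility/fastq` location are illustrative assumptions, not the contents of a real project.

```shell
# Throwaway project plus a stand-in for FASTQ files kept by another group.
sandbox="$(mktemp -d)"
project="$sandbox/projectName_grantNumber"
external_fastq="$sandbox/core_facility/fastq"   # hypothetical location
mkdir -p "$project/plots/02_another_analysis" "$project/raw-data" \
    "$project/processed-data" "$external_fastq"
echo "placeholder" > "$external_fastq/sample1.fastq.gz"
echo "placeholder" > "$project/plots/02_another_analysis/hello.pdf"

# raw-data/FASTQ becomes a symbolic (soft) link to the external location.
ln -s "$external_fastq" "$project/raw-data/FASTQ"

# Ignore the data and plot directories (assumed entries for illustration).
cd "$project"
printf '%s\n' 'raw-data/' 'processed-data/' 'plots/' > .gitignore

# Start version control and force-add one small plot despite the ignore rule.
git init -q .
git add .gitignore
git add -f plots/02_another_analysis/hello.pdf

# Files Git now tracks: the ignore file and the force-added plot.
tracked_files="$(git ls-files)"
echo "$tracked_files"
```

The link keeps the project layout uniform while making it visible, through `ls -l`, that the FASTQ files are owned and backed up somewhere else.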

The project organization detailed above enables us to write code inside a project in such a way that if the disk where we store the project changes, none of the code has to change. That’s because we can use the here package as illustrated in the 2020-04-17 LIBD rstats club session (notes).
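
The here package covers R code. For the shell scripts under code (such as 02_more_code.sh), a similar effect can be sketched by locating the project root at run time instead of hard-coding the disk. The `find_project_root` helper below is a hypothetical illustration, not part of here or of any existing project template; it assumes the project root is the closest parent directory that contains a `.git` entry.

```shell
# Walk up from a starting directory until a directory containing .git is
# found, and print it. Returns non-zero if no project root exists.
find_project_root() {
    dir="$(cd "$1" && pwd -P)" || return 1
    while [ "$dir" != "/" ]; do
        if [ -e "$dir/.git" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        dir="$(dirname "$dir")"
    done
    return 1
}

# Demo in a throwaway project; an empty .git directory marks the root.
sandbox="$(mktemp -d)"
project="$sandbox/projectName_grantNumber"
mkdir -p "$project/.git" "$project/code/01_some_analysis" \
    "$project/processed-data/01_some_analysis"

# A script living in code/01_some_analysis builds its output path from the
# detected root, so the path stays valid if the project moves to another disk.
root="$(find_project_root "$project/code/01_some_analysis")"
out_dir="$root/processed-data/01_some_analysis"
echo "$out_dir"
```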

While we currently don’t have a project following the project structure detailed here, the closest one is LieberInstitute/spatialDLPFC. Once we re-organize files, we’ll link here to proper examples (TODO) as well as a template project (TODO). See also the following slides describing the file re-organization at JHPCE (TODO).

## 13.2 Project documentation

As a minimum, each project should include a README.md file in the root folder for the project, such as /group_home_disk1/projectName_grantNumber/README.md. This README should include the following sections:

• title
• abstract / summary
• how to cite the project
• overview of the files and how they are organized
• internal notes
  • JHPCE internal location
  • Slack channel
  • Any internal reminders, like setting the umask
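
A starting point for such a README could be generated as follows. This is only a sketch: the heading names mirror the list above, while the placeholder text in angle brackets and the throwaway output directory are assumptions to be replaced for a real project.

```shell
# Write a README.md skeleton into a throwaway project directory.
project="$(mktemp -d)/projectName_grantNumber"
mkdir -p "$project"

cat > "$project/README.md" <<'EOF'
# <project title>

## Abstract

<short summary of the project>

## How to cite

<citation for the project>

## Files

<overview of code, plots, processed-data and raw-data>

## Internal notes

* JHPCE location: <path to the project at JHPCE>
* Slack channel: <channel name>
* Reminders: <for example, which umask to set>
EOF

cat "$project/README.md"
```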

Additionally, nested directories could have their own README.md files. For example, /group_home_disk1/projectName_grantNumber/raw-data/README.md could describe where the raw-data has been backed up, with a note such as “The raw data has been backed up at /group_home_disk2/backup_projectName_grantNumber/raw-data as of 2021-04-16.” Other README.md files could show results, such as this example from brainseq_phase2.

If the project is quite big, it might make sense to make a documentation website such as the one we made for SPEAQeasy. The advantage of such websites is that they enable visitors to search and locate information more rapidly, though they take longer to write.
