14 Organizing your work

In the anonymous team survey from 2021-03, several responses revealed that as a team we have some work to do to be more organized at JHPCE / GitHub, as well as in documenting our work. This chapter attempts to clarify what you can do to get feedback.

14.1 JHPCE file organization

14.1.1 Background history

As noted in the survey mentioned earlier, files at JHPCE could be organized better. We have multiple disks at JHPCE and, for historical reasons, some projects live at locations such as /dcl01/lieber/ajaffe/lab and similar directories such as /dcl01/lieber/ajaffe/lab/brainseq_phase2, while other projects live inside directories named after individuals such as /dcl01/lieber/ajaffe/Nick. This makes it challenging to locate all the projects.

Even after re-organizing files, we’ll have multiple “group home” directories given that we have multiple disks and no disk is large enough to store all the data.

JHPCE admins send quarterly invoices that keep track of expenses in two ways:

for storage, they compute the size of the main paths such as /dcl01/lieber/ajaffe
for compute hours, they keep track of the memory and compute hours used by each individual for each compute node queue (namely, shared and bluejay)

In 2021, we started to re-organize files as JHPCE’s admins now compute storage 1 level deeper than they used to, such as /dcl01/lieber/ajaffe/dir01 and /dcl01/lieber/ajaffe/dir02. The idea is that this extra level of information enables LIBD Finance to have more detailed information on which grant or project code to charge for the different storage costs we incur at JHPCE. We also want to avoid personal directories, thus the “group home” directories will become the root for each disk such as /dcl01/lieber/ajaffe/. Overall, this new organization can help us identify old data that could be deleted.

Another issue we have with the current file organization at JHPCE is that we don’t know which data has to be backed up and which doesn’t. Files can be classified into:

raw-data: FASTQ, images, sample phenotype tables, etc
code: ideally version-controlled through GitHub
processed-data: files generated by running some code on the raw-data or other processed data

14.1.2 Organization since 2021

Thus the new organization is like this.

jhpce $ tree -a
.
├── group_home_disk1
│   └── projectName_grantNumber/repositoryName
│       ├── .git
│       ├── .gitignore
│       ├── README.md
│       ├── code
│       │   ├── run_all.sh
│       │   ├── update_style.R
│       │   ├── 01_some_analysis
│       │   │   ├── 01_some_code.R
│       │   │   └── 02_more_code.sh
│       │   ├── 02_another_analysis
│       │   │   ├── 01_initial_code.R
│       │   │   └── 02_plot_hello.R
│       │   └── 03_image_processing
│       │       └── 01_img_process.sh
│       ├── plots
│       │   └── 02_another_analysis
│       │       └── hello.pdf
│       ├── processed-data
│       │   ├── 01_some_analysis
│       │   │   └── some_file.Rdata
│       │   ├── 02_another_analysis
│       │   │   └── some_data.Rdata
│       │   └── 03_image_processing
│       │       ├── sample1_processed.tiff
│       │       └── sample2_processed.tiff
│       ├── repositoryName.Rproj
│       └── raw-data
│           ├── FASTQ
│           │   ├── sample1.fastq.gz
│           │   └── sample2.fastq.gz
│           ├── README.md
│           ├── images
│           │   ├── sample1_raw.tiff
│           │   └── sample2_raw.tiff
│           └── sample_phenotype.csv
└── group_home_disk2
    ├── backup_projectName_grantNumber/repositoryName
    │   ├── code
    │   ├── plots
    │   ├── processed-data
    │   └── raw-data
    ├── projectHello_grantRANDOM14515
    └── projectTest_grantRANDOM0123

24 directories, 19 files

In this example, /group_home_disk1/projectName_grantNumber has been backed up across JHPCE disks to /group_home_disk2/backup_projectName_grantNumber. Not shown in the view above: unlike the original location, the backup will not be group-writable. For some projects we might just back up the raw-data, for others we’ll back up the code, plots and processed-data. /group_home_disk2 also contains its own unique projects that are not backed up at /group_home_disk1 (projectHello_grantRANDOM14515 and projectTest_grantRANDOM0123).

/group_home_disk1/projectName_grantNumber has several components.

.git: it is version controlled at that level using Git.
.gitignore: specifies what files to ignore. Typically, raw-data and processed-data. It could also ignore plots though smaller plots could be version controlled using git add -f (-f forces Git to version that file even if it’s ignored in a .gitignore file).
README.md: describes the project and some main features of the project. It should include the JHPCE location for the project as noted in the project documentation further below.
code: contains directories that are ordered by using a two digit leading ID (or three if the project is huge).
- 01_some_analysis and 02_another_analysis: contain scripts also using a two digit leading ID to help organize the files in the order they should be used for re-generating the processed data and plots.
plots: contains a directory matching the directory names used in code such that it is clear which code generates which plot. Here for plots/02_another_analysis/hello.pdf we can identify that the code for this plot lives at code/02_another_analysis/02_plot_hello.R.
processed-data: also contains a directory matching the directory names used in code.
raw-data: contains the raw data for the project. If the raw data is backed up elsewhere, then we can use symbolic links to point to the location of the data. For example, /group_home_disk1/projectName_grantNumber/repositoryName/raw-data/FASTQ could be a soft link to /dcl02/lieber/ajaffe/Nina/Joel_R01/Year1/fastq. This will be helpful to differentiate what is the raw data our group is responsible for backing up against the raw data the LIBD RNA-seq core facility or other entities are responsible for backing up. It also contains the sample phenotype information such as /group_home_disk1/projectName_grantNumber/repositoryName/raw-data/sample_phenotype.csv that is required for any analyses of the raw data.
repositoryName.Rproj: an RStudio project file, which makes it easier to write R code.
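
As a rough sketch of how the layout above could be created from scratch, the commands below build the skeleton in a temporary directory. The temporary GROUP_HOME, the core_facility_fastq directory standing in for raw data that someone else backs up, and the exact .gitignore entries are assumptions made for illustration, not the team's actual setup.

```shell
#!/bin/bash
# Sketch only: build the directory skeleton described above in a scratch location.
# GROUP_HOME is a hypothetical stand-in for a real "group home" directory.
GROUP_HOME="$(mktemp -d)"
REPO="${GROUP_HOME}/projectName_grantNumber/repositoryName"

# code, plots and processed-data share the same numbered subdirectory names
for analysis in 01_some_analysis 02_another_analysis; do
    mkdir -p "${REPO}/code/${analysis}" \
        "${REPO}/plots/${analysis}" \
        "${REPO}/processed-data/${analysis}"
done
mkdir -p "${REPO}/raw-data"

# Hypothetical external location where another group backs up the FASTQ files
EXTERNAL_FASTQ="${GROUP_HOME}/core_facility_fastq"
mkdir -p "${EXTERNAL_FASTQ}"

# raw-data/FASTQ becomes a symbolic link instead of a second copy of the data
ln -s "${EXTERNAL_FASTQ}" "${REPO}/raw-data/FASTQ"

# Ignore the large directories; a small plot can still be versioned with: git add -f
printf '%s\n' 'raw-data/' 'processed-data/' 'plots/' > "${REPO}/.gitignore"

# Show the result
find "${REPO}" | sort
```

Because every path is built relative to REPO, the same commands work on any disk once GROUP_HOME points at the right "group home" directory.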

The project organization detailed above enables us to write code inside a project in such a way that if the disk where we store the project changes, none of the code has to change. That’s because we can use the here package as illustrated in the 2020-04-17 LIBD rstats club session (notes).

14.1.3 Project template

You might be interested in checking out the template_project, which shows these organization guidelines in action.

Some current examples are:

14.2 Project documentation

As a minimum, each project should include a README.md file in the root folder for the project, such as /group_home_disk1/projectName_grantNumber/repositoryName/README.md. This README.md should include the following sections:

title
abstract / summary
how to cite the project
overview of the files and how they are organized
internal notes
- JHPCE internal location
- Slack channel
- Any internal reminders, like setting the umask
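
One way to start such a file is to generate a skeleton with the sections listed above and fill it in afterwards. The following is a minimal sketch; PROJECT_DIR is a hypothetical scratch location and the heading wording is only a suggestion.

```shell
#!/bin/bash
# Sketch only: write a README.md skeleton with the sections listed above.
# PROJECT_DIR is a hypothetical scratch location, not a real JHPCE path.
PROJECT_DIR="$(mktemp -d)"

cat > "${PROJECT_DIR}/README.md" << 'EOF'
# Title

## Abstract / summary

## How to cite

## Files and organization

## Internal notes

* JHPCE location:
* Slack channel:
* Reminders (for example, setting the umask):
EOF

# List the section headings that were written
grep '^#' "${PROJECT_DIR}/README.md"
```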

Additionally, nested directories could have their own README.md files. For example, /group_home_disk1/projectName_grantNumber/repositoryName/raw-data/README.md is one such README.md file which can describe where the raw-data has been backed up, such as saying “The raw data has been backed up at /group_home_disk2/backup_projectName_grantNumber/repositoryName/raw-data as of 2021-04-16.”. Other README.md files could show results such as this example from brainseq_phase2.

If the project is quite big, it might make sense to make a documentation website such as the one we made for SPEAQeasy. The advantage of such websites is that they enable visitors to search and locate information more rapidly, though they take longer to write.

14.3 Setting JHPCE file permissions

The following video on how to move files and set up permissions might be useful. It covers the same material shown here.

At JHPCE we have multiple user groups. Everyone within a user group has the same permissions, such that everyone is equal. That is, there is no user group admin role. As there are many JHPCE users, the typical approach is to make a user group and restrict read and/or write access to that user group. We used to do this with lieber_jaffe, but the user group grew quite large over time and not everyone necessarily should have had read/write access to everything.

This leads to having multiple user groups. One option is to create a user group for every specific project, but this requires constantly asking JHPCE’s admins (through BitSupport) for changes, since they are the only ones who can add/remove people to user groups, or even create them.

This is where Access Control Lists (ACLs) come into play. See “An introduction to Linux Access Control Lists (ACLs)” for more details. If you have small user groups, such as those representing specific members of a given team/lab, then with ACLs you can say that:

team 1 should be given write and read access to project 1
teams 1 and 2 should be given write and read access to project 2
teams 1 and 2 should be given write and read access to project 3 and team 3 only given read access

Some examples with actual teams are:

Team (user group) lieber_lcolladotor is working on a private project.
Team (user group) lieber_lcolladotor is working on a private project along with team (user group) lieber_marmaypag.
Teams (user groups) lieber_lcolladotor and lieber_marmaypag created some files that we want to share with everyone at LIBD (lieber user group) but don’t want others to accidentally delete or modify them.

For more details about ACLs at JHPCE check the Granting File Permissions using ACLs page.

14.3.1 Useful commands

To check who belongs to a specific group, use getent group groupName. For example:

$ getent group lieber_lcolladotor
lieber_lcolladotor:*:4217:jstolz,aseyedia,neagles,lcollado,gpertea1,lhuuki,bpardo

Note that 4217 is the actual user group ID. Sometimes we need to know that ID to run some ACL commands.

To find all the user groups for a specific person, use the groups command. For example:

$ groups lcollado
lcollado : lieber_jaffe swdev leekgroup recount3 lieber docker lieber_lcolladotor epi stanley rnaseq lieber_marmaypag lieber_snps lieber_martinowich users libdandme lieber_cmc lieber_gursini lieber_moods
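
The getent output is a colon-separated line (name, password placeholder, group ID, comma-separated members), so the ID and the member list can be pulled out with cut. The sketch below parses the example line shown above instead of calling getent, so that it runs anywhere; is_member is a hypothetical helper that matches whole user names only.

```shell
#!/bin/bash
# Sketch only: parse one line in the "name:password:GID:members" format
# printed by getent group. On JHPCE the line would come from:
#   group_line="$(getent group lieber_lcolladotor)"
group_line='lieber_lcolladotor:*:4217:jstolz,aseyedia,neagles,lcollado,gpertea1,lhuuki,bpardo'

# Field 3 is the numeric group ID, field 4 the comma-separated member list
gid="$(printf '%s\n' "${group_line}" | cut -d: -f3)"
members="$(printf '%s\n' "${group_line}" | cut -d: -f4)"

# is_member USER: succeeds only on an exact match of one member name
is_member() {
    printf '%s\n' "${members}" | tr ',' '\n' | grep -qxF -- "$1"
}

echo "Group ID: ${gid}"
if is_member lcollado; then
    echo "lcollado is a member"
fi
if ! is_member collado; then
    echo "collado is not a member"
fi
```

Splitting the member list on commas and matching whole lines avoids partial matches between similar user names.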

14.3.2 dcs04 scripts

At /dcs04, we use the nfs4_setfacl and nfs4_getfacl commands to either set or get (list) the current ACL settings. This is different from the commands we use at /dcl01 or other disks, and depends on how JHPCE’s admins configured specific disks. You might find this email exchange I had with JHPCE’s admins in 2021 useful to understand more details about these commands.
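
The ACL entries used by the scripts in this section follow the pattern type:flags:principal:permissions, for example A:g:hickslab@cm.cluster:RWX. The sketch below only prints the nfs4_setfacl commands for a small access plan instead of running them; build_ace and the plan itself are hypothetical, while the cm.cluster domain and the RWX / RX permission sets are taken from the scripts.

```shell
#!/bin/bash
# Sketch only: build the ACL entries (ACEs) used at /dcs04 and print the
# nfs4_setfacl commands instead of running them (a dry run).
# DOMAIN matches the scripts in this chapter; build_ace is a hypothetical helper.
DOMAIN="cm.cluster"

# build_ace FLAGS GROUP PERMISSIONS -> A:FLAGS:GROUP@DOMAIN:PERMISSIONS
#   A = allow entry; flag g = the principal is a group;
#   flags f, d, i = inherited by new files, by new directories, inherit-only
build_ace() {
    printf 'A:%s:%s@%s:%s\n' "$1" "$2" "${DOMAIN}" "$3"
}

# Hypothetical access plan: one "group permissions" pair per line
plan='lieber_lcolladotor RWX
lieber_marmaypag RWX
lieber RX'

printf '%s\n' "${plan}" | while read -r group perms; do
    # Files do not get execute (X) here, mirroring the RW / R entries in the scripts
    file_perms="$(printf '%s' "${perms}" | tr -d 'X')"
    echo "directories: nfs4_setfacl -a \"$(build_ace g "${group}" "${perms}")\""
    echo "inherited:   nfs4_setfacl -a \"$(build_ace gfdi "${group}" "${perms}")\""
    echo "files:       nfs4_setfacl -a \"$(build_ace g "${group}" "${file_perms}")\""
done
```

Printing the commands first makes it easier to review which group gets which access before anything is changed on disk.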

Ultimately, my exchanges with JHPCE’s admins led to two scripts we use frequently when setting file permissions. One of them is for the collaboration between lieber_lcolladotor, lieber_marmaypag and hickslab. The second one is for a collaboration between lieber_lcolladotor and lieber_moods, that we want lieber to be able to see. The files are:

/dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_spatialteam.sh
- Usage: sh /dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_spatialteam.sh /dcs04/lieber/yourTeam/someProject_LIBDcode/yourRepository
/dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_moods.sh
- Usage: sh /dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_moods.sh /dcs04/lieber/yourTeam/someProject_LIBDcode/yourRepository

A very common problem arises from copying files to JHPCE (for example, through Cyberduck), as copying files doesn’t respect the ACL settings we specify. This also happens when you use mv to move files that already existed at JHPCE. So after uploading a file or moving it with mv, you will likely need to re-run these permission scripts.

14.3.2.1 spatialteam example

/dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_spatialteam.sh

#!/bin/bash

echo "**** Updating permissions for $1 ****"
date
echo ""

if [[ $HOSTNAME == compute-* ]] || [[ $HOSTNAME == transfer-* ]]; then
    echo "**** Note that warning/error messages are expected for files and directories that you are not the owner of."
    echo "The expected warning/error messages are: "
    echo "    'chgrp: changing group of ‘some_JHPCE_file_path’: Operation not permitted'"
    echo " or 'chmod: changing permissions of ‘some_JHPCE_file_path’: Operation not permitted'."
    echo "If for many files you are not the owner (creator of), you will get lots of these warning/error messages, this is expected!"
    echo "Error or warnings with another syntax are likely real. ****"
    echo ""
    echo "You will need to re-run this script anytime you upload files to JHPCE through Cyberduck / WinSCP as they break the ACLs."
    echo "Every new team member on a given project will likely also need to run this script once."
    echo ""
    echo "For more details about setting permissions at JHPCE using ACLs, please check https://lcolladotor.github.io/bioc_team_ds/organizing-your-work.html#setting-jhpce-file-permissions."
    echo ""
    echo "This message will be displayed for 90 seconds before the script proceeds."
    echo "That way you will have time enough to read it and/or copy it."
    sleep 90
    
    echo ""
    echo "**** Setting read (R), write (W), and execute (X) permissions for hickslab ****"
    sleep 5
    date
    
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:hickslab@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:hickslab@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:hickslab@cm.cluster:RW" {} \;
    
    echo ""
    echo "**** Setting read (R), write (W), and execute (X) permissions for lieber_lcolladotor ****"
    sleep 5
    date
    
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:lieber_lcolladotor@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:lieber_lcolladotor@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:lieber_lcolladotor@cm.cluster:RW" {} \;
    
    echo ""
    echo "**** Setting read (R), write (W), and execute (X) permissions for lieber_marmaypag ****"
    sleep 5
    date
    
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:lieber_marmaypag@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:lieber_marmaypag@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:lieber_marmaypag@cm.cluster:RW" {} \;
    
    ## To move away from lieber_jaffe
    echo ""
    if getent group lieber_lcolladotor | grep -q "\b${USER}\b"; then
        echo "**** Running chgrp lieber_lcolladotor ****"
        sleep 5
        date
        chgrp lieber_lcolladotor -R ${1}
    elif getent group lieber_marmaypag | grep -q "\b${USER}\b"; then
        echo "**** Running chgrp lieber_marmaypag ****"
        sleep 5
        date
        chgrp lieber_marmaypag -R ${1}
    elif getent group hickslab | grep -q "\b${USER}\b"; then
        echo "**** Running chgrp hickslab ****"
        sleep 5
        date
        chgrp hickslab -R ${1}
    else
        echo "**** Skipping chgrp step ****"
    fi
    
    ## For setting the group sticky bit
    echo ""
    echo "**** Setting the group sticky bit ****"
    sleep 5
    date
    find ${1} -user ${USER} -type d | xargs chmod g+s
    
    ## Check settings
    echo ""
    echo "**** Checking the nfs4 (ACLs) settings ****"
    sleep 5
    date
    nfs4_getfacl ${1}
else
    echo "**** This script can only work on a qrsh / qsub session. It does not work on a login node since it does not have access to the nfs4_setfacl and nfs4_getfacl commands.****"
fi

14.3.2.2 lieber_moods example

/dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_moods.sh

#!/bin/bash

echo "**** Updating permissions for $1 ****"
date
echo ""

if [[ $HOSTNAME == compute-* ]] || [[ $HOSTNAME == transfer-* ]]; then
    echo "**** Note that warning/error messages are expected for files and directories that you are not the owner of."
    echo "The expected warning/error messages are: "
    echo "    'chgrp: changing group of ‘some_JHPCE_file_path’: Operation not permitted'"
    echo " or 'chmod: changing permissions of ‘some_JHPCE_file_path’: Operation not permitted'."
    echo "If for many files you are not the owner (creator of), you will get lots of these warning/error messages, this is expected!"
    echo "Error or warnings with another syntax are likely real. ****"
    echo ""
    echo "You will need to re-run this script anytime you upload files to JHPCE through Cyberduck / WinSCP as they break the ACLs."
    echo "Every new team member on a given project will likely also need to run this script once."
    echo ""
    echo "For more details about setting permissions at JHPCE using ACLs, please check https://lcolladotor.github.io/bioc_team_ds/organizing-your-work.html#setting-jhpce-file-permissions."
    echo ""
    echo "This message will be displayed for 90 seconds before the script proceeds."
    echo "That way you will have time enough to read it and/or copy it."
    sleep 90
    
    echo ""
    echo "**** Setting read (R), write (W), and execute (X) permissions for lieber_moods ****"
    sleep 5
    date

    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:lieber_moods@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:lieber_moods@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:lieber_moods@cm.cluster:RW" {} \;
    
    echo ""
    echo "**** Setting read (R), write (W), and execute (X) permissions for lieber_lcolladotor ****"
    sleep 5
    date
    
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:lieber_lcolladotor@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:lieber_lcolladotor@cm.cluster:RWX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:lieber_lcolladotor@cm.cluster:RW" {} \;
    
    echo "" 
    echo "**** Setting read (R) and execute (X) permissions for lieber ****"
    sleep 5
    date
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:g:lieber@cm.cluster:RX" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -a "A:gfdi:lieber@cm.cluster:RX" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -a "A:g:lieber@cm.cluster:R" {} \;
    
    echo "" 
    echo "**** Removing permissions for lieber_marmaypag ****"
    sleep 5
    date
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -x "A:g:lieber_marmaypag@cm.cluster:rwaDxtcy" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -x "A:gfdi:lieber_marmaypag@cm.cluster:rwaDxtcy" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -x "A:g:lieber_marmaypag@cm.cluster:rwatcy" {} \;

    find ${1} -user ${USER} -type d -exec nfs4_setfacl -x "A:g:4218:rwaDxtcy" {} \;
    find ${1} -user ${USER} -type d -exec nfs4_setfacl -x "A:gfdi:4218:rwaDxtcy" {} \;
    find ${1} -user ${USER} -type f -exec nfs4_setfacl -x "A:g:4218:rwatcy" {} \;
    
    ## To move away from lieber_jaffe
    echo ""
    if getent group lieber_moods | grep -q "\b${USER}\b"; then
        echo "**** Running chgrp lieber_moods ****"
        sleep 5
        date
        chgrp lieber_moods -R ${1}
    elif getent group lieber_lcolladotor | grep -q "\b${USER}\b"; then
        echo "**** Running chgrp lieber_lcolladotor ****"
        sleep 5
        date
        chgrp lieber_lcolladotor -R ${1}
    else
        echo "**** Skipping chgrp step ****"
    fi
    
    ## For setting the group sticky bit
    echo ""
    echo "**** Setting the group sticky bit ****"
    sleep 5
    date
    find ${1} -user ${USER} -type d | xargs chmod g+s
    
    ## Check settings
    echo ""
    echo "**** Checking the nfs4 (ACLs) settings ****"
    sleep 5
    date
    nfs4_getfacl ${1}
else
    echo "**** This script can only work on a qrsh / qsub session. It does not work on a login node since it does not have access to the nfs4_setfacl and nfs4_getfacl commands.****"
fi

14.4 Moving files across JHPCE disks

When you are moving files from other disks into /dcs04, you will likely need up to 4 files:

one file listing the directories you want to move
- Optional, but useful if you are moving more than one directory into a particular destination directory and want to use a single array job for moving the directories.
one script for moving files to /dcs04 with rsync
- We don’t use mv since it’s error prone and can have terrible consequences when it fails mid run.
one script for updating the permissions
- Even if the script simply runs one of the permissions scripts covered in the earlier section, you might want to have a script you can qsub as scripts that update permissions for many files can take hours to run.
one script for deleting the original files
- You will likely want to wait a few days / weeks before running this, to make sure that you have successfully moved everything you needed.
- If you don’t have write permissions to all the original files, you will need to request JHPCE’s admins to run this for you.

14.4.1 qSVA example

In this example, we moved 3 different /dcl01 directories into /dcs04, hence the use of an array job. These 3 directories were all related to the qSVA R21 project that had the internal LIBD code 3080, hence why all of them were relocated to /dcs04/lieber/lcolladotor/qSVA_LIBD3080. To store the scripts and log files associated with the relocation, I used the _jhpce_org directory which I typically use for storing these types of files.

We also wanted to list the owners of all original files (thanks Bill Ulrich for this code!) since sometimes having the owner is useful to know who to ask for help with a particular file.

/dcs04/lieber/lcolladotor/qSVA_LIBD3080/_jhpce_org files:

qSVA_dcl01_dirs.txt

It contains the full paths to the 3 original directories that were relocated.

/dcl01/ajaffe/data/lab/degradation
/dcl01/ajaffe/data/lab/qsva_brain
/dcl01/lieber/ajaffe/lab/degradation_experiments

move_qSVA_dcl01.sh

It’s an array job with 3 tasks, one per directory (#$ -t 1-3). Note the increase of the maximum file size (h_fsize=400G), in case we are moving some very large files. To reduce the burden on the network, we only copied 2 directories at a time (#$ -tc 2; number of concurrent tasks).

Before transferring all the data, we compute the total data size (in TB) of the transfer. We use the -sk --apparent-size options to the du command in order to compute the un-compressed disk size, since some disk systems automatically compress some files, which can lead to different file size calculations across disk systems.
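
The unit conversion can be checked in isolation: du -sk reports kilobytes, and dividing by 1024^3 turns kilobytes into terabytes. The sketch below wraps the same awk expression in a hypothetical helper, kb_to_tb, and feeds it a made-up du-style line.

```shell
#!/bin/bash
# Sketch only: convert the kilobyte count reported by du -sk into TB,
# using the same awk expression as the script. 1 TB = 1024^3 KB here.
kb_to_tb() {
    awk '{$1 = $1 / (1024^3); print $1, "TB"}'
}

# A made-up du-style line: 1610612736 KB is 1.5 * 1024^3 KB
printf '1610612736\t/some/hypothetical/dir\n' | kb_to_tb
# prints: 1.5 TB

# On a real directory (GNU du), the full pipeline would be:
#   du -sk --apparent-size /path/to/dir/ | kb_to_tb
```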

Once we have listed the owner of all files, we then move the files using rsync which will not preserve the original owner. The owner of all the transferred files (at /dcs04) will be the person who did the transfer. While doing this, we also change the unix group to the new one we want people to use. In this case, lieber_lcolladotor (the original files were most likely under lieber_jaffe). A nice thing about using rsync is that if it fails in the middle of a run, it can be resumed without problems later on.

Once we have completed the transfer across disks, we move the original files using mv to a trash directory (here trash_qSVA). This frees up the original directory name, which we can then use to create a soft link from the original location to the new location. Creating this soft link (symlink) helps users of the original files locate these files even if we moved them across disks (assuming they still have permissions to read the new files).

#!/bin/bash
#$ -cwd
#$ -l bluejay,mem_free=2G,h_vmem=2G,h_fsize=400G
#$ -N move_qSVA_dcl01
#$ -o logs/move_qSVA_dcl01.$TASK_ID.txt
#$ -e logs/move_qSVA_dcl01.$TASK_ID.txt
#$ -m e
#$ -t 1-3
#$ -tc 2

echo "**** Job starts ****"
date

echo "**** JHPCE info ****"
echo "User: ${USER}"
echo "Job id: ${JOB_ID}"
echo "Job name: ${JOB_NAME}"
echo "Hostname: ${HOSTNAME}"
echo "Task id: ${SGE_TASK_ID}"

## List current modules for reproducibility
module list

## Locate directory
ORIGINALDIR=$(awk "NR==${SGE_TASK_ID}" qSVA_dcl01_dirs.txt)
echo "Processing sample ${ORIGINALDIR}"
date

BASEDIR=$(basename ${ORIGINALDIR})
ORIGINALHOME=$(dirname ${ORIGINALDIR})

## Determine amount of data to transfer
du -sk --apparent-size ${ORIGINALDIR}/ | awk '{$1=$1/(1024^3); print $1, "TB";}'

## List owners of all the files in the original path
find ${ORIGINALDIR}/ -exec ls -l {} \; | grep -v total | tr -s ' ' | cut -d ' ' -f3,9-

## Copy from dcl01 to dcs04
rsync -rltgvh --chown=:lieber_lcolladotor ${ORIGINALDIR}/ /dcs04/lieber/lcolladotor/qSVA_LIBD3080/${BASEDIR}/

## Label as trash the files that were moved
mkdir -p ${ORIGINALHOME}/trash_qSVA/
mv ${ORIGINALDIR} ${ORIGINALHOME}/trash_qSVA/

## Create link
ln -s /dcs04/lieber/lcolladotor/qSVA_LIBD3080/${BASEDIR} ${ORIGINALDIR}

echo "**** Job ends ****"
date

## This script was made using sgejobs version 0.99.1
## available from http://research.libd.org/sgejobs/
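
The trash-and-link step at the end of the script above can be rehearsed on throwaway directories before touching real data. This is a minimal sketch: SCRATCH, old_disk and new_disk are hypothetical stand-ins for the original and new disks, and cp stands in for the rsync copy.

```shell
#!/bin/bash
# Sketch only: rehearse the "move to trash, then link" step on scratch directories.
# SCRATCH, old_disk and new_disk are hypothetical stand-ins for /dcl01 and /dcs04.
SCRATCH="$(mktemp -d)"
ORIGINALDIR="${SCRATCH}/old_disk/lab/degradation"
NEWDIR="${SCRATCH}/new_disk/qSVA/degradation"

mkdir -p "${ORIGINALDIR}" "${NEWDIR}"
echo "example" > "${ORIGINALDIR}/file.txt"
cp -p "${ORIGINALDIR}/file.txt" "${NEWDIR}/file.txt"   # stands in for the rsync copy

ORIGINALHOME="$(dirname "${ORIGINALDIR}")"
BASEDIR="$(basename "${ORIGINALDIR}")"

# Move the original out of the way, which frees up its name
mkdir -p "${ORIGINALHOME}/trash"
mv "${ORIGINALDIR}" "${ORIGINALHOME}/trash/"

# The old path now points at the new location
ln -s "${NEWDIR}" "${ORIGINALDIR}"

ls -l "${ORIGINALHOME}"
cat "${ORIGINALDIR}/file.txt"
```

The variables are quoted here so that the rehearsal also works for paths that contain spaces.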

update_permissions.sh

By default, directories under /dcs04/lieber/lcolladotor have read, write and execute permissions to the lieber_marmaypag group. Here though, we first remove the permissions to lieber_marmaypag, then we give read and execute permissions to both the lieber and lieber_marmaypag unix user groups.

We (redundantly) set the group of all files to be lieber_lcolladotor and then set the setgid bit on all directories (the scripts call it the group sticky bit), so that any new files created within these directories inherit the group (lieber_lcolladotor in this case). Finally, we use nfs4_getfacl to check that the ACLs were set up correctly.

Note how we use #$ -hold_jid with the same name of the script from the move step earlier (here move_qSVA_dcl01) so we can qsub this script immediately after the move one, but wait for them to finish running before this one starts.

#!/bin/bash
#$ -cwd
#$ -l bluejay,mem_free=2G,h_vmem=2G,h_fsize=400G
#$ -N update_permissions_qSVA
#$ -o logs/update_permissions.txt
#$ -e logs/update_permissions.txt
#$ -m e
#$ -hold_jid move_qSVA_dcl01

echo "**** Job starts ****"
date

echo "**** JHPCE info ****"
echo "User: ${USER}"
echo "Job id: ${JOB_ID}"
echo "Job name: ${JOB_NAME}"
echo "Hostname: ${HOSTNAME}"
echo "Task id: ${SGE_TASK_ID}"

## List current modules for reproducibility
module list


MAINDIR="/dcs04/lieber/lcolladotor/qSVA_LIBD3080"

## Remove default permissions
find ${MAINDIR} -type d -exec nfs4_setfacl -x "A:g:lieber_marmaypag@cm.cluster:rwaDxtcy" {} \;
find ${MAINDIR} -type d -exec nfs4_setfacl -x "A:gfdi:lieber_marmaypag@cm.cluster:rwaDxtcy" {} \;
find ${MAINDIR} -type f -exec nfs4_setfacl -x "A:g:lieber_marmaypag@cm.cluster:rwatcy" {} \;

## Set new permissions
find ${MAINDIR} -type d -exec nfs4_setfacl -a "A:g:lieber@cm.cluster:RX" {} \;
find ${MAINDIR} -type d -exec nfs4_setfacl -a "A:gfdi:lieber@cm.cluster:RX" {} \;
find ${MAINDIR} -type f -exec nfs4_setfacl -a "A:g:lieber@cm.cluster:R" {} \;

find ${MAINDIR} -type d -exec nfs4_setfacl -a "A:g:lieber_marmaypag@cm.cluster:RX" {} \;
find ${MAINDIR} -type d -exec nfs4_setfacl -a "A:gfdi:lieber_marmaypag@cm.cluster:RX" {} \;
find ${MAINDIR} -type f -exec nfs4_setfacl -a "A:g:lieber_marmaypag@cm.cluster:R" {} \;

## To move away from lieber_jaffe
chgrp lieber_lcolladotor -R ${MAINDIR}

## For setting the group sticky bit
find ${MAINDIR} -type d | xargs chmod g+s

## Check settings
nfs4_getfacl ${MAINDIR}

echo "**** Job ends ****"
date

## This script was made using sgejobs version 0.99.1
## available from http://research.libd.org/sgejobs/

delete_qSVA_dcl01.sh

This script also uses #$ -hold_jid, so in theory we could submit it (qsub) immediately after the move script. However, it’s best to wait a few days or weeks before running the script that deletes the original files under trash (here trash_qSVA), to make sure that we are not missing any files and did not encounter unexpected scenarios when moving files across disks.
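
One simple check that could be run during that waiting period is to compare the list of files under the trash directory with the list under the new location. The sketch below is an assumption about how such a check might look, not part of the original workflow; list_files and same_file_list are hypothetical helpers, and the comparison covers file names only, not file contents.

```shell
#!/bin/bash
# Sketch only: compare the relative file lists of two directory trees.
# list_files and same_file_list are hypothetical helpers, not part of the scripts above.

# list_files DIR: print every regular file under DIR, relative to DIR, sorted
list_files() {
    (cd "$1" && find . -type f | sort)
}

# same_file_list DIR_A DIR_B: succeeds when both trees contain the same file names
same_file_list() {
    [ "$(list_files "$1")" = "$(list_files "$2")" ]
}

# Demonstration on scratch directories standing in for trash and the new location
SCRATCH="$(mktemp -d)"
mkdir -p "${SCRATCH}/trash/sub" "${SCRATCH}/new/sub"
touch "${SCRATCH}/trash/a.txt" "${SCRATCH}/trash/sub/b.txt"
touch "${SCRATCH}/new/a.txt" "${SCRATCH}/new/sub/b.txt"

if same_file_list "${SCRATCH}/trash" "${SCRATCH}/new"; then
    echo "File lists match"
else
    echo "File lists differ: do not delete yet"
fi
```

Matching names do not prove matching contents; a dry run of rsync with its --dry-run and --checksum options is a stricter, slower alternative.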

#!/bin/bash
#$ -cwd
#$ -l bluejay,mem_free=2G,h_vmem=2G,h_fsize=100G
#$ -N delete_qSVA_dcl01
#$ -o logs/delete_qSVA_dcl01.txt
#$ -e logs/delete_qSVA_dcl01.txt
#$ -m e
#$ -hold_jid move_qSVA_dcl01

echo "**** Job starts ****"
date

echo "**** JHPCE info ****"
echo "User: ${USER}"
echo "Job id: ${JOB_ID}"
echo "Job name: ${JOB_NAME}"
echo "Hostname: ${HOSTNAME}"
echo "Task id: ${SGE_TASK_ID}"

## List current modules for reproducibility
module list

## Delete dcl01 data
rm -fr /dcl01/ajaffe/data/lab/trash_qSVA/
rm -fr /dcl01/lieber/ajaffe/lab/trash_qSVA/

echo "**** Job ends ****"
date

## This script was made using sgejobs version 0.99.1
## available from http://research.libd.org/sgejobs/

14.4.2 HumanPilot example

This is a simpler example, since we only moved one directory. So there’s no need for an array job or a file listing the directories we want to move. This project was done without any grant support, hence why we associated the LIBD 001 code with it. It was done in collaboration with 10x Genomics, hence why we decided to move it to /dcs04/lieber/lcolladotor/with10x_LIBD001. As in the previous example, I created a _jhpce_org directory to store the scripts related to relocating files and updating permissions. We also don’t have a script for deleting the files, since we knew that we didn’t have write permissions on all the original files and thus would need help from JHPCE’s admins to delete them. This scenario is common when one or more people who were involved in the initial project leave LIBD prior to the file relocation, as you can’t ask them to run the delete script to delete the files they control. However, since we did have read access to all files, we were able to relocate them without needing support from JHPCE’s admins.

/dcs04/lieber/lcolladotor/with10x_LIBD001/_jhpce_org files:

move_HumanPilot_dcl02.sh

This script is similar to the qSVA example, but there is no array job. It was adapted from it, and thus we still specify the ORIGINALDIR environment variable, but we do it manually this time instead of using awk to do so.

Most of the rest is the same, but here the trash directory is just called trash, as it became easier to do so and explain to JHPCE admins which directories we wanted deleted later on.

#!/bin/bash
#$ -cwd
#$ -l bluejay,mem_free=2G,h_vmem=2G,h_fsize=400G
#$ -N move_HumanPilot_dcl02
#$ -o logs/move_HumanPilot_dcl02.txt
#$ -e logs/move_HumanPilot_dcl02.txt
#$ -m e

echo "**** Job starts ****"
date

echo "**** JHPCE info ****"
echo "User: ${USER}"
echo "Job id: ${JOB_ID}"
echo "Job name: ${JOB_NAME}"
echo "Hostname: ${HOSTNAME}"
echo "Task id: ${SGE_TASK_ID}"

## List current modules for reproducibility
module list

## Locate directory
ORIGINALDIR="/dcl02/lieber/ajaffe/SpatialTranscriptomics/HumanPilot"
echo "Processing directory ${ORIGINALDIR}"
date

BASEDIR=$(basename ${ORIGINALDIR})
ORIGINALHOME=$(dirname ${ORIGINALDIR})
NEWDIR="/dcs04/lieber/lcolladotor/with10x_LIBD001/${BASEDIR}"

## Determine amount of data to transfer
du -sk --apparent-size ${ORIGINALDIR}/ | awk '{$1=$1/(1024^3); print $1, "TB";}'

## List owners of all the files in the original path
find ${ORIGINALDIR}/ -exec ls -l {} \; | grep -v total | tr -s ' ' | cut -d ' ' -f3,9-

## Copy from dcl02 to dcs04
rsync -rltgvh --chown=:lieber_lcolladotor ${ORIGINALDIR}/ ${NEWDIR}/

## Label as trash the files that were moved
mkdir -p ${ORIGINALHOME}/trash/
mv ${ORIGINALDIR} ${ORIGINALHOME}/trash/

## Create link
ln -s ${NEWDIR} ${ORIGINALDIR}

echo "**** Job ends ****"
date

## This script was made using sgejobs version 0.99.1
## available from http://research.libd.org/sgejobs/

update_permissions.sh

This is a much simpler bash script which we didn’t qsub and basically just reminds us of the specific command we used to update the permissions. That way, if anyone asks us what we ran in the future, we know exactly how we updated the permissions.

#!/bin/bash

sh /dcs04/lieber/lcolladotor/_jhpce_org_LIBD001/update_permissions_spatialteam.sh /dcs04/lieber/lcolladotor/with10x_LIBD001