Data management on O2

As an HMS Core, we get storage on O2 for free. We are not the biggest user, but we are in the top 10, so it is smart for us to be good citizens and keep our footprint as small as possible. Ways to reduce our footprint include:

Reduce space used by active consults

delete redundant files
- a special case: some sequencing facilities (cough, BPF, cough) send us both foo.fastq.gz and foo.fastq.bz2 files for the same raw data; we can delete the bz2 files
run bcbio analyses so temporary work folders are put on scratch
- the brute-force approach is to set up on scratch with everything as a symlink, run the analysis, and move the final folder over to the PI's project folder
- a more elegant way is to set the output folder of the bcbio run to be in the project folder in the PI's folder
keep project folders tidy: delete things you are no longer using (this is a judgement call and not really enforced, but it can become an issue once the project is complete)
compress raw data files
- bcbio can handle gzipped and bzipped (bz2) files, so compress those raw fastqs!
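To make the first and last points concrete, here is a minimal sketch of a read-only scan for cleanup candidates. It assumes raw files follow the foo.fastq.gz / foo.fastq.bz2 naming used above (plus the common .fq variant); the function name `find_cleanup_candidates` is made up for illustration. It only reports files and never deletes or compresses anything, so check the output by hand before acting on it.

```python
from pathlib import Path


def find_cleanup_candidates(root):
    """Scan root recursively for FASTQ files that waste space.

    Returns a tuple (redundant_bz2, uncompressed):
      redundant_bz2: *.fastq.bz2 / *.fq.bz2 files with a sibling .gz copy
      uncompressed:  plain *.fastq / *.fq files that could be compressed
    Nothing is deleted or modified (hypothetical helper, report only).
    """
    redundant_bz2 = []
    uncompressed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith((".fastq.bz2", ".fq.bz2")):
            # foo.fastq.bz2 -> foo.fastq.gz (only the last suffix is swapped)
            gz_twin = path.with_suffix(".gz")
            if gz_twin.is_file():
                redundant_bz2.append(path)
        elif path.suffix in (".fastq", ".fq"):
            uncompressed.append(path)
    return redundant_bz2, uncompressed


# Example use (the path is a placeholder, not a real project):
# redundant, plain = find_cleanup_candidates("/path/to/project")
# for p in redundant:
#     print("has a .gz twin, candidate for deletion:", p)
# for p in plain:
#     print("uncompressed, candidate for gzip:", p)
```

The scan matches on file names only; it does not confirm that the .gz and .bz2 copies hold identical reads, so spot-check before deleting.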

Get the data out of our main storage area

Return data from completed analyses to the researcher

can be by
1. Globus - HMS (and FSARC) provide Globus for secure data transfer. See Globus - for sending data to clients.
2. upload to the researcher's server - preferred if access to the server is simple. A good example would be the research data storage that HMS PIs have access to or an FTP server. Passwords, logins and occasionally VPN access are typically required.
3. "sneaker net" - downloading onto a drive and handing it off to the researcher (Rare, not preferred)

I recommend avoiding things like Dropbox, Google Drive or Box unless the data is small. They aren't really built for this purpose.

Archive the data

We have access to standby storage on O2 (/n/standby/cores/bcbio/). The standby directory is only accessible from the transfer node. For projects that are too small to bother returning to the researcher, or that we think we may want to access again, we can tar.gz them and store them here. Leave a symlink in the original directory to allow easy restoration of the project.

Once you have restored the project, delete the standby file.

Once you are finished with the project, rearchive it.

Please don't keep an archived copy of the directory in two places plus the expanded folder. Duplicated data is wasted space and makes John cry.
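The archive step could be scripted roughly as follows. This is a sketch under assumptions, not an established team script: the function name `archive_project` is made up, the symlink is placed next to the project folder under the archive's file name, and the expanded folder is deliberately left in place so you can check the archive before removing it yourself. On O2 it would have to run on the transfer node, since that is the only place the standby directory is visible.

```python
import os
import tarfile
from pathlib import Path


def archive_project(project_dir, standby_dir):
    """Write <standby_dir>/<name>.tar.gz and leave a symlink beside the project.

    Returns (archive_path, link_path). The expanded project folder is NOT
    deleted here; remove it manually once the archive has been checked.
    (Hypothetical helper; name and layout are assumptions.)
    """
    project_dir = Path(project_dir).resolve()
    standby_dir = Path(standby_dir).resolve()
    archive = standby_dir / (project_dir.name + ".tar.gz")
    if archive.exists():
        raise FileExistsError(f"{archive} already exists")

    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths in the tarball relative to the project name
        tar.add(project_dir, arcname=project_dir.name)

    # Sanity check: the archive can be reopened and contains the project folder
    with tarfile.open(archive, "r:gz") as tar:
        if project_dir.name not in tar.getnames():
            raise RuntimeError(f"{archive} is missing {project_dir.name}")

    # Symlink in the original location, pointing at the archived copy
    link = project_dir.parent / archive.name
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    os.symlink(archive, link)
    return archive, link


# Example use (the project path is a placeholder):
# archive_project("/path/to/PI_folder/project", "/n/standby/cores/bcbio/")
```

To restore, extract the tarball that the symlink points to back into the PI's folder, then delete the standby file as described above.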

How to decide what to do with the data

With the caveat that every project is different, here are some general guidelines to help with your decision making.

Is the data "large" (>500GB)? As much as possible, we'd prefer to get rid of these ASAP
Will you need to access the data and derived files again? If yes, tidy up any unnecessary files and archive

The following points can help you judge how likely we are to need the data again:

Is the analysis published? If it's published, it's likely we won't be using it again. Ask the researcher what they want returned to them and delete the rest.
If the data is unpublished, did the analysis work? If the data is garbage, we likely won't be using it again. Ask the researcher what they want returned to them and delete the rest.
Did the consult end well? If it didn't, it's likely we won't be using it again. Ask the researcher what they want returned to them (with a time warning in case they don't respond) and delete the rest.
Is the consult unique in its approach, i.e. do you think it's something we might come back to in the future for other analyses? If yes, tidy it up and archive. If it is something like a standard RNA-seq analysis, we probably don't need it.
How long has it been since the data was accessed? If it's been more than 2 years, we likely won't be using it again. Ask the researcher what they want returned to them and delete the rest.
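Two of these questions (size and time since last access) can be checked mechanically. The sketch below is only a starting point: `summarize_project` is a made-up helper, ">500GB" is read as decimal gigabytes, and the newest access or modification time of any file stands in for "last accessed", which is an assumption because access times are not always updated reliably on shared filesystems. The other questions still need human judgement.

```python
import os
import time

LARGE_BYTES = 500 * 10**9             # ">500GB" guideline, decimal GB assumed
STALE_SECONDS = 2 * 365 * 24 * 3600   # "older than 2 years" guideline


def summarize_project(project_dir, now=None):
    """Report apparent size and staleness of a project folder.

    Sizes are apparent file sizes (st_size), not blocks used on disk.
    Symlinks are not followed, so linked raw data is not counted.
    (Hypothetical helper; thresholds mirror the guidelines above.)
    """
    now = time.time() if now is None else now
    total = 0
    newest = 0.0
    for dirpath, _dirnames, filenames in os.walk(project_dir):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            total += st.st_size
            # Most recent access or modification seen so far
            newest = max(newest, st.st_atime, st.st_mtime)
    return {
        "bytes": total,
        "is_large": total > LARGE_BYTES,
        # An empty folder has no timestamps to judge, so it is not flagged
        "is_stale": newest > 0 and (now - newest) > STALE_SECONDS,
    }


# Example use (the path is a placeholder):
# print(summarize_project("/path/to/PI_folder/project"))
```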

Globus howto

For sending data to clients, see Globus - for sending data to clients. For downloading data with Globus, see the Globus section in [Data Management](https://github.com/hbc/knowledgebase/blob/master/admin/download_data.md#globus).

Sample script for older projects.

Subject: Delete/return data? re: (list project) PIname_c018_20K_cells_Samples_9-28-17

Hi xxx –

I hope this email finds you well. We have data from your lab’s project. Do you have the data you need from this? Due to storage constraints, we must remove the data from our HMS O2 storage. We will delete the data if you are all set. The project is from 2019 and is labeled:

PI1/Contact1 - Test RNASeq of human brain HBC12345

If you do not have the data and want to retain the raw and derived data in whole or in part, please let us know so we can facilitate transferring the data back to you.

Is two weeks enough before we delete the data?

We look forward to hearing from you.

Thanks and best,

If they say yes, they'd like the data, start by asking the PI/postdoc for their Globus ID, or ask them to get one. Once you have it, move on to explaining how to use it. See Globus for more scripts and a suggested process.
