This repo provides helper scripts that get the latest data from a FlyBase database, then dockerise and run
https://github.com/grivaz/FlyBaseAnnotationHelper and
https://huggingface.co/cgrivaz/FlyBaseGeneAbstractClassifier
See those projects for information on how the code works. NOTE: To generate the data using this repo you need access to a FlyBase postgres database. If you do not have access, you must generate the data another way and give the files the same names.
- Clone this repository.
- Build the docker image OR pull it from Docker Hub:
  - docker build . -t gene-identifier
  - docker pull flybase/harvdev-gene-identifier
NOTE: The examples below use gene-identifier as the image name. If you pulled the image, use flybase/harvdev-gene-identifier in the commands instead.
- USER - FlyBase postgres db user name
- PGPASSWORD - FlyBase postgres password
- SERVER - FlyBase postgres server (if using a local db instance you need to use host.docker.internal)
- DB - FlyBase postgres db name
- PORT - FlyBase postgres port
- MONDAY_DATE - start date used to generate new Pubs in FlyBase.
- GI_DATA_INPUT - local directory to store files needed to run gene-identifier (*optional, can change docker command directly)
- GI_DATA_OUTPUT - local directory to put output from gene-identifier (*optional)
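
Before running the database-backed commands it can help to confirm that the variables listed above are set. The following is a minimal sketch, not part of this repo: the function name check_gi_env is hypothetical, and it only tests that each variable is non-empty, not that the value is valid.

```shell
# Hypothetical helper (not part of this repo): report any required
# variable from the list above that is unset or empty.
check_gi_env() {
    missing=0
    for name in USER PGPASSWORD SERVER DB PORT MONDAY_DATE; do
        # Look up the variable whose name is stored in $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "not set: $name" >&2
            missing=1
        fi
    done
    # Exit status 0 when everything is set, 1 otherwise.
    return "$missing"
}
```

One possible use is check_gi_env && docker run ..., so a command only starts when every variable is present. Note that most shells already set USER to the local account name, so this check cannot tell whether USER holds the FlyBase database user.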
- Generate a list of Dmel and Hsap current gene synonyms (fb_synonym_latest.tsv)
  docker run --rm -p$PORT:$PORT -v $GI_DATA_INPUT:/src/input/ -e SERVER=$SERVER -e PGPASSWORD=$PGPASSWORD -e USER=$USER -e DB=$DB -e PORT=$PORT --entrypoint /usr/bin/python3 gene-identifier src/get_synonyms_batch.py --filepath /src/input/
- Generate a list of Dmel and Hsap gene unique names (currentDmelHsap.txt)
  docker run --rm -p$PORT:$PORT -v $GI_DATA_INPUT:/src/input/ -e SERVER=$SERVER -e PGPASSWORD=$PGPASSWORD -e USER=$USER -e DB=$DB -e PORT=$PORT --entrypoint /usr/bin/python3 gene-identifier src/get_gene_uniquenames.py --filepath /src/input/
- Get the PMC ids file (PMC-ids.csv)
  docker run --rm -v $GI_DATA_INPUT:/src/input/ --entrypoint /usr/bin/bash gene-identifier src/get_PMC.sh
- Get the PMCs to examine (new_pub_dbxrefs.txt)
  docker run --rm -p$PORT:$PORT -v $GI_DATA_INPUT:/src/input/ -e SERVER=$SERVER -e PGPASSWORD=$PGPASSWORD -e MONDAY_DATE=$MONDAY_DATE -e USER=$USER -e DB=$DB -e PORT=$PORT --entrypoint /usr/bin/python3 gene-identifier src/get_new_pubs.py --filepath /src/input/
  Note: you can also create this file by hand by adding a list of PMC identifiers.
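
Once the steps above have run, the input directory should hold the four named files. The sketch below checks for them before the gene identifier is started. It is an illustration only: check_gi_inputs is a hypothetical name, and it assumes the scripts write the files under exactly these names directly in the directory mounted at /src/input/.

```shell
# Hypothetical helper (not part of this repo): verify that the four data
# files named above exist and are non-empty in the given directory.
check_gi_inputs() {
    dir="$1"
    status=0
    for f in fb_synonym_latest.tsv currentDmelHsap.txt PMC-ids.csv new_pub_dbxrefs.txt; do
        # -s is true only for a file that exists and has a size above zero.
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            status=1
        fi
    done
    # Exit status 0 when all four files are present, 1 otherwise.
    return "$status"
}
```

For example, check_gi_inputs "$GI_DATA_INPUT" prints one line to standard error per missing file and returns a non-zero status if any are absent.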
- Run the gene identifier code (interactive mode):
  docker run --rm -v $GI_DATA_INPUT:/src/input -e SERVER=$SERVER -e PGPASSWORD=$PGPASSWORD -e USER=$USER -e DB=$DB -e PORT=$PORT -v $GI_DATA_OUTPUT:/usr/src/app/output_files -it gene-identifier
  - If the input files have not been created yet, create them:
    - python3 src/get_synonyms_batch.py --filepath /src/input/
    - python3 src/get_gene_uniquenames.py --filepath /src/input/
    - python3 src/get_new_pubs.py --filepath /src/input/
    - sh src/get_PMC.sh
  - Change to the FlyBaseAnnotationHelper directory by running cd FlyBaseAnnotationHelper
  - Execute the command python3 update_resources.py
  - Execute the command python3 annotation_helper.py /usr/src/app/output_files/new_pub_dbxrefs.txt
  - The output file can be found in the output directory: $GI_DATA_OUTPUT outside of docker and /usr/src/app/output_files inside docker.
- Run the code on the command line locally (via GoCD etc.)
  - Get the files needed by following the data file steps above, or via alternative methods.
  - docker run --rm -v $GI_DATA_INPUT:/src/input -v $GI_DATA_OUTPUT:/src/output --entrypoint /usr/bin/bash gene-identifier src/run_gene_identifier.sh