prepare_transcripts: build pyfasta .gdx/.flat sidecars eagerly so downstream tasks don't write to staged inputs #70

@pinin4fjords

Summary

prepare_transcripts writes annotation/transcripts_sequence.fa but doesn't build its pyfasta index. Instead, the downstream RiboCode steps (RiboCode, RiboCode_onestep) trigger pyfasta's lazy index build on first read of the FASTA, which writes transcripts_sequence.fa.gdx and transcripts_sequence.fa.flat next to it.

That breaks any deployment where the annotation directory is shared, read-only, or otherwise not the consumer's own writable working directory:

  • read-only /mnt or NFS mounts on shared HPC infrastructure
  • container bind mounts mounted :ro
  • workflow engines that stage the annotation as a symlink into each consumer task (writes follow the symlink back to the producer, so parallel consumers race)
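
To make the failure mode concrete, here is a minimal sketch of a check a consumer could run before opening the FASTA. The helper name missing_sidecars is hypothetical and not part of RiboCode or pyfasta. It only tests whether the two sidecar files named above exist; it does not check whether they are up to date.

```python
import os

# Sidecar suffixes named in this issue; they are appended to the FASTA path.
SIDECAR_SUFFIXES = (".gdx", ".flat")


def missing_sidecars(fasta_path):
    """Return the sidecar paths that do not exist yet for fasta_path.

    Hypothetical helper: a non-empty result means a lazy index build
    (and therefore a write next to the FASTA) is to be expected.
    """
    return [
        fasta_path + suffix
        for suffix in SIDECAR_SUFFIXES
        if not os.path.exists(fasta_path + suffix)
    ]
```

If this returns a non-empty list for a FASTA in a read-only or shared directory, the first downstream read is likely to fail or race.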

Suggested fix

Add one line at the end of processTranscripts(...) in prepare_transcripts.py so the indexes are built once in the producing call, before the annotation is published:

# Eagerly build pyfasta indexes so downstream readers don't write to the
# (possibly read-only) staged annotation directory.
GenomeSeq(os.path.join(out_dir, "transcripts_sequence.fa"))

GenomeSeq.__init__ already calls Fasta(filename, key_fn=get_chrom) with the same key function the downstream code uses, so the .gdx/.flat files it produces are byte-identical to what would otherwise be written lazily at the first downstream read.
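
One way to check the byte-identical claim would be to hash the sidecars from an eager build and from a lazy build and compare the results. The sketch below uses only the standard library; sidecar_digests is a hypothetical helper, and it assumes both sidecar files already exist.

```python
import hashlib


def sidecar_digests(fasta_path, suffixes=(".gdx", ".flat")):
    """Map each sidecar suffix to the SHA-256 hex digest of its contents.

    Hypothetical helper for comparing an eagerly built index against a
    lazily built one; raises FileNotFoundError if a sidecar is missing.
    """
    digests = {}
    for suffix in suffixes:
        with open(fasta_path + suffix, "rb") as handle:
            digests[suffix] = hashlib.sha256(handle.read()).hexdigest()
    return digests
```

Running it on two copies of the annotation, one indexed by prepare_transcripts and one indexed by a downstream read, should give equal dictionaries if the claim holds.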

Optionally, gate this behind a --prebuild-indexes flag if you'd rather keep the current behaviour by default.
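
A rough sketch of how such an opt-in flag could be wired up with argparse. The existing argument parser of prepare_transcripts is not shown in this issue, so build_parser, maybe_prebuild, and the build_index callable are all illustrative assumptions; in the real code build_index would be the GenomeSeq(...) call above.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the proposed flag."""
    parser = argparse.ArgumentParser(prog="prepare_transcripts")
    parser.add_argument(
        "--prebuild-indexes",
        action="store_true",
        help="build the pyfasta .gdx/.flat sidecars eagerly",
    )
    return parser


def maybe_prebuild(args, fasta_path, build_index):
    """Call build_index(fasta_path) only when the flag was given.

    build_index is injected so this sketch does not depend on pyfasta.
    Returns True if the index build was triggered.
    """
    if args.prebuild_indexes:
        build_index(fasta_path)
        return True
    return False
```

With store_true the flag defaults to False, so existing invocations keep the current lazy behaviour.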

Happy to send a PR.
