Design RNA smFISH oligonucleotide probes from the command line. One command to install, one command to design probes.
Key features:
- Pre-built genome indices for 7 organisms — no index building needed
- Automatic gene sequence download from NCBI
- Multi-layer off-target detection (genome alignment, transcriptome BLAST, repeat masking, expression weighting, and more)
- Adaptive probe length to normalize Tm across the probe set
- Protocol presets (smfish, merfish, dna-fish, etc.)
- Automated probe validation with PASS/FLAG/FAIL recommendations
Tested on macOS and Linux with Python 3.10+. Works on HPC/cluster servers via SSH — no sudo, Docker, or conda needed. For Windows, use WSL.
curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash
Restart your shell, then verify:
efishent --check
Installation options
With BLAST+ and transcriptome tools (for transcriptome-level off-target filtering):
curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --with-blast
Custom install path:
curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --prefix /path/to/install
Update:
efishent --update
Development install:
git clone https://github.com/BBQuercus/eFISHent.git
cd eFISHent/
./install.sh --deps-only
uv venv && source .venv/bin/activate
uv pip install -e .
Uninstall:
curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --uninstall
Or simply: rm -rf ~/.local/efishent
The fastest way to design probes — genome indices are downloaded automatically:
efishent --genome hg38 --gene-name "ACTB" --organism-name "homo sapiens" --preset smfish
That's it. This downloads the pre-built human genome index on first use and designs smFISH probes for ACTB.
| Organism | Aliases |
|---|---|
| Human | hg38, GRCh38, human |
| Mouse | mm39, GRCm39, mouse |
| Zebrafish | danRer11, GRCz11, zebrafish |
| Rat | rn7, GRCr8, rat |
| Drosophila | dm6, BDGP6, fly |
| C. elegans | ce11, WBcel235, worm, elegans |
| Yeast | sacCer3, R64, yeast |
efishent --list-genomes # List all available genomes
efishent --download-genome hg38 # Pre-download for offline use
Indices are cached in ~/.local/efishent/indices/ by default. Override with --index-cache-dir /path/to/dir or the EFISHENT_INDEX_DIR environment variable.
Three ways to provide the target sequence:
| Method | Example |
|---|---|
| Gene name + organism | --gene-name "ACTB" --organism-name "homo sapiens" |
| Ensembl ID | --ensembl-id ENSG00000128272 --organism-name "homo sapiens" |
| FASTA file | --sequence-file ./my_gene.fasta |
For organisms without a pre-built index, provide your own reference genome:
# Build indices once (can take 30-60 min for large genomes)
efishent --reference-genome <genome.fa> --build-indices True
# Design probes
efishent --reference-genome <genome.fa> --gene-name <gene> --organism-name <organism>
Downloading genomes and annotations
For any organism, download the genome FASTA and GTF annotation from Ensembl or UCSC. Prefer primary_assembly if available, otherwise toplevel. Unzip with gunzip.
Example for human (GRCh38):
# Reference genome
wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# GTF annotation (for intergenic filtering, rRNA filtering, expression weighting)
wget https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.115.gtf.gz
gunzip Homo_sapiens.GRCh38.115.gtf.gz
Ensembl GTFs use gene_biotype while GENCODE uses gene_type; eFISHent supports both.
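To show what the difference between the two attribute names looks like in practice, here is a minimal sketch of reading the biotype from a single GTF record. It is not eFISHent's own parser: the function name `gene_biotype` and the two sample records (with placeholder coordinates) are made up for illustration.

```python
import re
from typing import Optional

# Accept either attribute key: Ensembl's gene_biotype or GENCODE's gene_type.
_BIOTYPE_RE = re.compile(r'\b(?:gene_biotype|gene_type) "([^"]+)"')


def gene_biotype(gtf_line: str) -> Optional[str]:
    """Return the gene biotype from one GTF record, or None if it has none."""
    if gtf_line.startswith("#"):
        return None  # header/comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete 9-column GTF record
    match = _BIOTYPE_RE.search(fields[8])  # column 9 holds the attributes
    return match.group(1) if match else None


# Hypothetical records with placeholder coordinates, one per naming style.
ensembl = ('7\tensembl\tgene\t1\t100\t.\t-\t.\t'
           'gene_id "ENSG00000075624"; gene_name "ACTB"; gene_biotype "protein_coding";')
gencode = ('chr7\tHAVANA\tgene\t1\t100\t.\t-\t.\t'
           'gene_id "ENSG00000075624.17"; gene_type "protein_coding"; gene_name "ACTB";')

print(gene_biotype(ensembl), gene_biotype(gencode))
```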
Reference transcriptome (optional, for BLAST cross-validation):
gffread Homo_sapiens.GRCh38.115.gtf -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -w transcriptome.fa
# Append rRNA sequences (18S/28S/5.8S are NOT in standard GTFs)
efetch -db nucleotide -id NR_003286.4 -format fasta >> transcriptome.fa # 45S pre-rRNA
efetch -db nucleotide -id NR_023363.1 -format fasta >> transcriptome.fa # 5S rRNA
The major rRNA genes exist in ~300 tandem copies in unassembled regions, so they're absent from standard GTFs. Including them in the transcriptome FASTA ensures the BLAST filter catches probes binding these abundant sequences.
Count table (optional, for expression-weighted filtering):
Download a normalized RNA-seq dataset (FPKM/TPM) for your cell line from GEO or Expression Atlas. The file needs Ensembl gene IDs in column 1 and normalized counts in column 2.
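Before passing a downloaded table to eFISHent, it can help to check that it has the two-column layout described above. The sketch below is one way to do that. It assumes a tab-separated file and assumes that a non-numeric first row is a header; neither detail is specified here, and `check_count_table` is a hypothetical helper, not part of eFISHent.

```python
import csv
import io


def check_count_table(text: str, delimiter: str = "\t") -> list:
    """Parse a count table and return (gene_id, value) pairs.

    Raises ValueError when a row does not have a gene ID in column 1
    and a number in column 2.
    """
    pairs = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"line {line_no}: expected at least 2 columns")
        gene_id = row[0].strip()
        try:
            value = float(row[1])
        except ValueError:
            if line_no == 1:
                continue  # Assumption: a non-numeric first row is a header.
            raise ValueError(f"line {line_no}: {row[1]!r} is not a number") from None
        if not gene_id:
            raise ValueError(f"line {line_no}: empty gene ID")
        pairs.append((gene_id, value))
    return pairs


# Made-up example values, for illustration only.
sample = "gene_id\tTPM\nENSG00000075624\t850.2\nENSG00000111640\t1200\n"
for gene_id, value in check_count_table(sample):
    print(gene_id, value)
```

For a real file, pass its contents with something like `check_count_table(open("count_table.tsv").read())`.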
Use --preset to apply optimized parameters for common FISH protocols:
| Preset | Description |
|---|---|
| smfish | Standard smFISH (18-22nt probes, adaptive length, 10% formamide) |
| merfish | MERFISH encoding probes (tight Tm, 30% formamide) |
| dna-fish | DNA FISH (longer probes, relaxed specificity) |
| strict | Maximum specificity (low k-mer tolerance, low-complexity filter) |
| relaxed | Maximum probe yield (permissive thresholds + rescue filters) |
| exogenous | Exogenous genes such as GFP, Renilla, and other reporters (no k-mer filter, strict BLAST) |
Use --preset list to see details. Explicit arguments override preset values.
Workflow
flowchart TD
A["Gene Sequence<br/><i>FASTA file or NCBI download</i>"] --> B["Generate Candidate Probes<br/><i>Sliding window (adaptive or fixed length)</i>"]
B --> C["Basic Filtering<br/><i>TM, GC, homopolymers, low-complexity</i>"]
C --> D["Genome Alignment<br/><i>Bowtie2 (default) or Bowtie</i><br/><i>+ repeat masking, intergenic, Tm scoring</i>"]
D --> E{"Transcriptome<br/>provided?"}
E -- Yes --> F["Transcriptome BLAST<br/><i>Off-target detection (TrueProbes params)</i>"]
E -- No --> G["K-mer Filtering<br/><i>Jellyfish frequency count</i>"]
F --> G
G --> H["Secondary Structure<br/><i>deltaG prediction (RNAstructure)</i>"]
H --> H2["Accessibility Scoring<br/><i>RNA folding (optional)</i>"]
H2 --> I["Quality-Weighted Optimization<br/><i>Greedy or optimal (MILP) + gap filling + Tm refinement</i>"]
I --> K["Validation Report<br/><i>Quality scores, off-target genes, recommendations</i>"]
K --> J["Final Probe Set"]
style A fill:#e1f5fe
style J fill:#e8f5e9
- Candidate probes are generated from the input sequence using a sliding window. When --adaptive-length is enabled, probe lengths are adjusted based on local GC content to normalize Tm.
- Basic filtering removes probes failing sequence criteria: melting temperature, GC content, homopolymer runs, optionally low-complexity regions, and optionally G-quadruplex motifs (--filter-g-quadruplex).
- Probes are aligned to the reference genome using Bowtie2 (sensitive local alignment with OligoMiner/Tigerfish parameters). Optional filters refine off-target counting: repeat masking, intergenic filtering, thermodynamic scoring, and expression weighting.
- If a reference transcriptome is provided, probes are BLASTed against expressed transcripts to catch off-targets that genome alignment alone may miss (e.g., splice junctions).
- Short k-mers are counted using Jellyfish; probes with frequently occurring k-mers are discarded.
- Secondary structure is predicted using a nearest-neighbor thermodynamic model; probes with too-stable structures are filtered.
- If --accessibility-scoring is enabled, target RNA accessibility is scored using RNA folding predictions.
- Quality-weighted optimization selects non-overlapping probes maximizing coverage. A gap-filling pass covers remaining regions and Tm uniformity refinement swaps outlier probes.
- The output includes per-probe quality scores, off-target gene names, expression risk, and PASS/FLAG/FAIL recommendations.
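The first, second, and eighth steps above can be sketched in a few lines of Python. This is a simplified illustration, not eFISHent's implementation: it leaves out Tm, alignment, and quality weighting, and it uses a plain earliest-end greedy pass for selection. All function names are hypothetical, and reading the homopolymer limit as "runs longer than the limit fail" is an assumption.

```python
import re


def candidates(seq, min_len, max_len):
    """Yield (start, end, probe) for every window of each allowed length."""
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            yield start, start + length, seq[start:start + length]


def gc_percent(probe):
    """GC content of a probe as a percentage."""
    return 100.0 * sum(base in "GC" for base in probe) / len(probe)


def passes_basic(probe, min_gc, max_gc, max_homopolymer):
    """Apply the GC and homopolymer checks (max_homopolymer=0 disables the latter)."""
    if not min_gc <= gc_percent(probe) <= max_gc:
        return False
    if max_homopolymer == 0:
        return True
    # Assumption: a run of more than max_homopolymer identical bases fails.
    return re.search(r"(.)\1{%d,}" % max_homopolymer, probe) is None


def greedy_select(windows, spacing):
    """Pick non-overlapping windows by earliest end, leaving `spacing` nt gaps."""
    chosen, next_free = [], 0
    for start, end, probe in sorted(windows, key=lambda w: (w[1], w[0])):
        if start >= next_free:
            chosen.append((start, end, probe))
            next_free = end + spacing
    return chosen


# Made-up 40 nt target, for illustration only.
target = "ATGGCTAGCTAGGATCCGATCGATTTTTTTGCGCTAGCAT"
kept = [w for w in candidates(target, 10, 12) if passes_basic(w[2], 40, 60, 5)]
for start, end, probe in greedy_select(kept, spacing=2):
    print(start, end, probe)
```

Sorting by end position is the classic interval-scheduling heuristic: it maximizes the number of non-overlapping windows but ignores probe quality, which the quality-weighted optimization described above takes into account.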
eFISHent produces three files per run:
| File | Description |
|---|---|
| GENE_HASH.fasta | Final probes in FASTA format |
| GENE_HASH.csv | Detailed probe table (see columns below) |
| GENE_HASH.txt | Run parameters and command for reproducibility |
The HASH is a unique identifier based on the parameters used — rerunning with the same parameters reuses cached results.
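The general idea of a parameter-derived identifier can be sketched as follows. How eFISHent actually computes the HASH is not described here, so the SHA-256 digest, the 10-character truncation, and the `run_id` helper are all assumptions chosen for illustration.

```python
import hashlib
import json


def run_id(gene, params):
    """Build a GENE_HASH-style name from a gene and a parameter dictionary."""
    # Sorting keys makes the serialization independent of argument order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]
    return f"{gene}_{digest}"


first = run_id("ACTB", {"min_length": 18, "max_length": 22, "preset": "smfish"})
same = run_id("ACTB", {"preset": "smfish", "max_length": 22, "min_length": 18})
other = run_id("ACTB", {"min_length": 18, "max_length": 24, "preset": "smfish"})
print(first == same, first == other)
```

Because the identifier depends only on the parameter values, an identical run maps to the same file names and can be served from cache, while changing any parameter produces a new name.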
Output CSV columns
| Column | Description |
|---|---|
| name | Probe identifier |
| sequence | Probe nucleotide sequence |
| start, end | Position along the target gene |
| length | Probe length in nucleotides |
| GC | GC content (%) |
| TM | Predicted melting temperature (deg C) |
| deltaG | Secondary structure free energy (kcal/mol) |
| kmers | Maximum k-mer count in reference genome |
| count | Genome off-target hit count |
| txome_off_targets | Transcriptome off-target count (when --reference-transcriptome is used) |
| off_target_genes | Off-target gene names with hit counts, e.g., ACTG1(3), MYH9(1) |
| worst_match | Best off-target match quality, e.g., 95%/20bp/0mm |
| expression_risk | Expression risk for off-target genes, e.g., ACTG1:HIGH(850) |
| quality | Composite quality score (0-100) |
| recommendation | PASS, FLAG(reason), or FAIL |
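For downstream triage, the probe table can be read with Python's standard csv module. The sketch below groups probe names by recommendation. The sample rows, including the probe names, sequences, scores, and the FLAG reason text, are invented for illustration and only use a subset of the columns listed above.

```python
import csv
import io
from collections import defaultdict

# Invented rows; a real GENE_HASH.csv contains more columns.
SAMPLE = """name,sequence,quality,recommendation
ACTB-1,ACGTGCTAGCTAGGCTAGCA,92,PASS
ACTB-2,TTGCAGCTAGCTAGGATCCA,71,FLAG(off-target)
ACTB-3,GGGGCTAGCATCGATCGGGG,35,FAIL
"""


def group_by_recommendation(handle):
    """Map PASS/FLAG/FAIL to the probe names carrying that recommendation."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        # "FLAG(reason)" collapses to "FLAG"; PASS and FAIL are unchanged.
        verdict = row["recommendation"].split("(", 1)[0]
        groups[verdict].append(row["name"])
    return dict(groups)


print(group_by_recommendation(io.StringIO(SAMPLE)))
```

With a real run, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("GENE_HASH.csv", newline="")`.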
Analyze an existing probe set with comprehensive metrics and a PDF report:
efishent \
--reference-genome <genome.fa> \
--sequence-file <gene.fa> \
--analyze-probeset <probes.fasta>
Analysis report contents
| Plot | Description |
|---|---|
| Lengths | Distribution of probe lengths |
| Melting temperatures | Boxplot of calculated Tm values |
| GC Content | Boxplot of GC percentages |
| G quadruplet | Count of G-quadruplet motifs per probe |
| K-mer count | Maximum k-mer frequency in genome |
| Free energy | Predicted secondary structure stability (deltaG) |
| Off target count | Number of off-target binding sites per probe |
| Binding affinity | Probe-to-probe similarity matrix (potential cross-hybridization) |
| Gene coverage | Visual map of probe positions along the target sequence |
| Parameter | Description |
|---|---|
| --reference-genome | Path to reference genome FASTA |
| --genome | Use a pre-built genome index (e.g., hg38, mm39, zebrafish) |
| --gene-name | Gene name for automatic sequence download from NCBI |
| --organism-name | Organism name (used with --gene-name or --ensembl-id) |
| --sequence-file | Path to target gene FASTA file |
| --preset | Parameter preset: smfish, merfish, dna-fish, strict, relaxed, exogenous |
| --threads | Number of threads for parallel processing |
| --is-plus-strand | Strand orientation of the gene of interest |
| --is-endogenous | Whether the gene is endogenous to the organism |
| Parameter | Description |
|---|---|
| --min-length, --max-length | Probe length range in nucleotides |
| --spacing | Minimum distance between probes |
| --min-tm, --max-tm | Melting temperature range |
| --min-gc, --max-gc | GC content range (%) |
| --formamide-concentration | Formamide concentration (%) |
| --na-concentration | Sodium ion concentration (mM) |
| --adaptive-length | Adjust probe length by local GC to normalize Tm |
| --max-homopolymer-length | Max homopolymer run (default: 5, 0 to disable) |
| --filter-low-complexity | Filter dinucleotide repeats and low entropy regions |
| --filter-g-quadruplex | Filter G-quadruplex motifs in target |
| --max-deltag | Secondary structure free energy threshold |
| --target-regions | Target region: exon (default), intron, both, cds-only, utr-only |
| --accessibility-scoring | Score target RNA accessibility via RNA folding |
| --optimization-method | greedy (default, fast) or optimal (MILP, max coverage) |
| --optimization-time-limit | Time limit in seconds for optimal solver |
| --sequence-similarity | Max allowed inter-probe similarity (%) to avoid cross-hybridization |
Off-target filtering parameters
Genome alignment (default):
| Parameter | Description |
|---|---|
| --max-off-targets | Maximum genome hits per probe (default: 0) |
| --aligner | bowtie2 (default) or bowtie (legacy) |
| --mask-repeats | Ignore off-targets in repetitive regions (uses dustmasker) |
| --intergenic-off-targets | Ignore off-targets outside annotated genes (requires --reference-annotation) |
| --off-target-min-tm | Min Tm (deg C) for an off-target to count. Set to hybridization temp to rescue thermodynamically unstable hits (default: 0) |
| --filter-rrna | Remove probes hitting rRNA genes (requires --reference-annotation) |
Transcriptome BLAST (optional):
| Parameter | Description |
|---|---|
| --reference-transcriptome | Transcriptome FASTA for BLAST cross-validation |
| --max-transcriptome-off-targets | Max transcriptome hits per probe (default: 0) |
| --blast-identity-threshold | Min % identity for BLAST hit (default: 75) |
| --min-blast-match-length | Min effective alignment length (default: max(18, 0.8 * min_probe_length)) |
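As a worked example of the default for --min-blast-match-length, the snippet below evaluates max(18, 0.8 * min_probe_length) for a few minimum probe lengths. Whether eFISHent rounds the result is not stated here, so the value is left unrounded; the function name is hypothetical.

```python
def default_min_blast_match_length(min_probe_length):
    """Evaluate the documented default: max(18, 0.8 * min_probe_length)."""
    return max(18, 0.8 * min_probe_length)


# The 18 nt floor applies until 0.8 * length exceeds 18, i.e. above 22.5 nt.
for min_len in (18, 22, 30, 45):
    print(min_len, default_min_blast_match_length(min_len))
```

For typical smFISH probes (18-22 nt) the floor of 18 applies, while for the 45-50 nt long-probe case the threshold scales to roughly 36 nt.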
Expression weighting (optional):
| Parameter | Description |
|---|---|
| --reference-annotation | GTF annotation file |
| --encode-count-table | Normalized RNA-seq count table (FPKM/TPM) |
| --max-expression-percentage | Top expression percentile to exclude |
| --max-probes-per-off-target | Cap on probes hitting same off-target gene (default: 0 = disabled, recommended: 5) |
Index and cache parameters
| Parameter | Description |
|---|---|
| --build-indices | Build genome indices (bowtie2, jellyfish, BLAST) |
| --download-genome | Pre-download a genome index for offline use |
| --list-genomes | List available pre-built genomes |
| --index-cache-dir | Override index cache directory (default: ~/.local/efishent/indices/). Also settable via EFISHENT_INDEX_DIR |
| --kmer-length | K-mer length for Jellyfish filtering |
| --max-kmers | Max k-mer occurrences in genome before discarding probe |
| --save-intermediates | Keep all intermediate files for debugging |
Full examples
smFISH with pre-built index (simplest):
efishent --genome hg38 --gene-name "ACTB" --organism-name "homo sapiens" --preset smfish --threads 8
smFISH with full off-target filtering:
efishent \
--reference-genome ./hg-38.fa \
--reference-annotation ./hg-38.gtf \
--reference-transcriptome ./transcriptome.fa \
--gene-name "GAPDH" \
--organism-name "homo sapiens" \
--preset smfish \
--mask-repeats True \
--intergenic-off-targets True \
--filter-rrna True \
--max-probes-per-off-target 5 \
--threads 8
Long probes (45-50nt) with optimal solver:
efishent \
--reference-genome ./hg-38.fa \
--gene-name "norad" \
--organism-name "homo sapiens" \
--is-plus-strand True \
--optimization-method optimal \
--min-length 45 \
--max-length 50 \
--formamide-concentration 45 \
--threads 8
Exogenous gene (GFP, Renilla, etc.):
efishent \
--reference-genome ./hg38.fa \
--reference-transcriptome ./transcriptome.fa \
--reference-annotation ./hg38.gtf \
--sequence-file "./renilla.fasta" \
--preset exogenous \
--threads 8
Expression-weighted off-target filtering:
efishent \
--reference-genome ./hg-38.fa \
--reference-annotation ./hg-38.gtf \
--ensembl-id ENSG00000128272 \
--organism-name "homo sapiens" \
--is-plus-strand False \
--max-off-targets 5 \
--encode-count-table ./count_table.tsv \
--max-expression-percentage 20 \
--threads 8
Rescue probes with thermodynamic and repeat masking filters:
efishent \
--reference-genome ./hg-38.fa \
--reference-annotation ./hg-38.gtf \
--sequence-file ./my_gene.fasta \
--mask-repeats True \
--intergenic-off-targets True \
--off-target-min-tm 37 \
--threads 8
Have questions? Open an issue on GitHub.
