PROBEst

St. Petersburg tool for generating nucleotide probes with specified properties


PROBEst is a tool designed for generating nucleotide probes with specified properties. It uses a wrapped evolutionary algorithm to iteratively refine the probes by introducing mutations and evaluating their performance, ensuring that the final output is both specific to the target and universally applicable across related sequences.

In general, the tool is useful for generating probes that match a long list of targets while still maintaining good specificity.

Download and installation

Installation

git clone https://github.com/CTLab-ITMO/PROBEst.git
cd PROBEst
bash setup/install.sh
# or, if you prefer to install with pip (may cause conflicts):
# pip install -e .
# validate the installation with:
# bash setup/test_generator.sh

Usage

Generation

PROBEst can be run using the following command:

# conda activate probest
python pipeline.py \
  -i {INPUT} \ # FASTA (.fa / .fa.gz); if omitted, the first file from the `-tb` directory is used
  -tb {TRUE_BASE} \ # optional; BLAST DB or FASTA directory for oligo improvement; if omitted, the true base is built from a per-sequence split of `-i`
  -fb [FALSE_BASE ...] \ # same as `-tb`, but for off-target search; optional; omit to skip the off-target search
  -c {CONTIG_TABLE} \ # .tsv table with BLAST database information (optional; defaults to `{OUTPUT}/contigs.tsv` when using FASTA directories)
  -o {OUTPUT} # output directory

BLASTn databases and the contig table are produced by prep_db.sh (or built automatically when -tb / -fb point at directories of FASTA files). After each automatic build, the contig table is deduplicated by contig ID: for duplicate rows from merged FASTAs or repeated runs, the last mapping is kept.
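The deduplication step above can be sketched as follows. This is a minimal illustration, assuming a two-column TSV that maps contig IDs to sequence headers; the exact table layout used by PROBEst may differ:

```python
import csv
import io


def dedup_contig_table(tsv_text: str) -> str:
    """Keep only the last mapping for each contig ID (first column)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    last = {}
    for row in rows:
        if row:
            last[row[0]] = row  # later rows overwrite earlier ones
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(last.values())
    return out.getvalue()
```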

Key arguments:

  • -i INPUT: Input FASTA file (or a directory with FASTA / fasta.gz files) for the initial probe set generation.
  • -tb TRUE_BASE: BLASTn database path or directory of FASTA files for primer adjusting (directories are converted under {OUTPUT}/.blast_db/). Optional: omit to build the true base from -i by writing one .fna per sequence under {OUTPUT}/.true_base_from_input/ (then -i is required).
  • -fb FALSE_BASE: BLASTn database path(s) or FASTA directories for non-specific testing. Optional: if omitted, off-target blastn is skipped and an empty negative_hits.tsv is used each iteration (the rest of the pipeline still runs).
  • -c CONTIG_TABLE: .tsv table with BLAST database information (optional; defaults to {OUTPUT}/contigs.tsv when using FASTA directories).
  • -o OUTPUT: Output path for results.
  • -t THREADS: Number of threads to use.
  • -a ALGORITHM: Algorithm for probe generation (FISH or primer).
  • --initial_generator: Tool for initial probe set generation (primer3 or oligominer, default: primer3).

For a full list of arguments, run:

python pipeline.py --help

For parameter selection, grid search is implemented. You can specify parameters in a JSON file (see, for example, data/test/general/param_grid_light.json) and run:

python test_parameters.py \
  -p {JSON}
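A parameter grid is a JSON object mapping pipeline arguments to lists of candidate values. The sketch below reuses flags shown in the example usage further down; treat it as a hypothetical illustration and consult data/test/general/param_grid_light.json for the actual format:

```json
{
  "--PRIMER_PICK_PRIMER": [3, 5],
  "--PRIMER_NUM_RETURN": [5, 10],
  "-N": [3, 5]
}
```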

Example usage:

python pipeline.py \
  -i data/test/general/test.fna \
  -o data/test/general/output \
  -tb data/test/general/fasta_base/true_base \
  -fb data/test/general/fasta_base/false_base_1 \
      data/test/general/fasta_base/false_base_2 \
  -a FISH \
  --PRIMER_PICK_PRIMER 5 \
  --PRIMER_NUM_RETURN 5 \
  -N 3 \
  --visualize True --AI True

Manual preparation

You can pre-build BLASTn databases yourself (e.g. when inputs are already makeblastdb outputs). To create true_base, false_base, and contig_table from FASTA files, use:

bash scripts/generator/prep_db.sh \
  -n {DATABASE_NAME} \
  -c {CONTIG_TABLE} \
  -t {TMP_DIR} \
  [FASTA]

Arguments:

  • -n DATABASE_NAME: Name of the output BLAST database (required).
  • -c CONTIG_TABLE: Output file to store contig names and their corresponding sequence headers (required).
  • -t TMP_DIR: Temporary directory for intermediate files (optional, defaults to ./.tmp).
  • FASTA: List of input FASTA files (gzipped or uncompressed).
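For many FASTA inputs it can be convenient to assemble the prep_db.sh invocation programmatically. A minimal sketch (the paths and the default temporary directory mirror the arguments listed above; pass the result to subprocess.run):

```python
from pathlib import Path


def prep_db_command(db_name, contig_table, fastas, tmp_dir=".tmp"):
    """Build the argv list for scripts/generator/prep_db.sh."""
    cmd = [
        "bash", "scripts/generator/prep_db.sh",
        "-n", db_name,      # output BLAST database name
        "-c", contig_table, # output contig table
        "-t", tmp_dir,      # temporary directory
    ]
    cmd += [str(Path(f)) for f in fastas]  # gzipped or uncompressed FASTAs
    return cmd
```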

Web Application

PROBEst includes a user-friendly web interface for probe generation. To launch the web app, run:

python app/app.py

For detailed web app documentation, see app/README.md

Algorithm

Algorithm Steps

  1. Prepare BLASTn databases

  2. Select File for Probe Generation (INPUT)

  3. Select Files for Universality Check (TRUE_BASE)

  4. Select Files for Specificity Check (FALSE_BASE)

  5. Select Layouts and Run Wrapped Evolutionary Algorithm (pipeline.py)

    a. Primer3 Generation

    b. BLASTn Check

    c. Parsing

    d. Mutation in Probe

    e. AI corrections (if --AI is enabled)

    f. De-degeneration (if --degeneration is enabled)
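The wrapped evolutionary loop (steps a–f) can be sketched as a select-and-mutate cycle. The fitness function, mutation scheme, and survivor count below are toy placeholders, not PROBEst's actual scoring or BLASTn-based checks:

```python
import random

BASES = "ACGT"


def mutate(probe: str, rate: float = 0.05, rng=None) -> str:
    """Point-mutate each position with probability `rate`."""
    rng = rng or random.Random(0)
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in probe)


def evolve(initial, fitness, generations=10, keep=5, rng=None):
    """Toy loop: score probes, keep the best, mutate survivors, repeat."""
    rng = rng or random.Random(0)
    population = list(initial)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)  # evaluate (stands in for BLASTn checks)
        survivors = population[:keep]               # quality filtering
        population = survivors + [mutate(p, rng=rng) for p in survivors]  # mutation step
    return max(population, key=fitness)
```

Because survivors are retained each generation, the best fitness in the population never decreases across iterations.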

---
config:
  layout: elk
  look: classic
---
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': 'white',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
  subgraph inputs
  A
  A1
  T1
  T3
  end

  A([Initial probe generation]):::input -- primer3 --> B2(initial probe set):::probe
  A -- oligominer --> B2
  A1([Custom probes]):::input --> B2

  T1([Target sequences]):::input -- blastn-db --> T2[(target)]
  T3([Offtarget sequences]):::input -- blastn-db --> T4[(offtarget)]

  subgraph database
  T2
  T4
  end

  T2 --> EA
  T4 --> EA
  B2 --> EA

  EA[evolutionary algorithm] --> T11(results):::probe

  classDef empty width:0px,height:0px;
  classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
  classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
---
config:
  layout: elk
  look: classic
---
%%{init: {
  'layout': 'elk',
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': 'white',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
  subgraph evolutionary algorithm
    subgraph hits
      TP
      TN
    end

    B(probe set):::probe --> TP[target]
    B --> TN[offtarget]
    B1 -- mutations --> B

    TP -- coverage --> T6[universality]
    TP -- duplications --> T7[multimapping]
    TN ---> T8[specificity]

    subgraph check
    T6
    T7
    T8
    M1
    E3
    end

    B --- E6[ ]:::empty --> M1[modeling]
    TP --- E6

    M1 --- E3[ ]:::empty
    T6 --- E3
    T7 --- E3
    T8 --- E3
    E3 -- quality prediction --> B1(filtered probe set):::probe
  end
  B1 --> T11(results):::probe

  classDef empty width:0px,height:0px;
  classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
  classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;

Project Structure

---
config:
  theme: neutral
  look: classic
---
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#90EE90',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
    PROBEst([PROBEst]) --> src[src/]
    PROBEst --> scripts[scripts/]
    PROBEst --> tests[tests/]
    PROBEst --> app[app/]

    subgraph folders
    src
    scripts
    tests
    app
    end
    
    src --> C[benchmarking]
    src --> A[generation]
    tests --> A
    
    scripts --> D[preprocessing]
    scripts --> B[database parsing]
    D --> A
    
    app --> E[web interface]
    E --> A

Testing

  • To check the installation: bash setup/test_generator.sh

  • For developers: use pytest

Citation

If you use the PROBEst LLM pipeline for the extraction of research data, please cite:

BibTeX:

@article{202511.2140,
	doi = {10.20944/preprints202511.2140.v1},
	url = {https://doi.org/10.20944/preprints202511.2140.v1},
	year = 2025,
	month = {November},
	publisher = {Preprints},
	author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
	title = {Efficient and Verified Extraction of the Research Data Using LLM},
	journal = {Preprints}
}

Plain text: Serdiukov, A., Dragvelis, V., Smutin, D., Taldaev, A., & Muravyov, S. (2025). Efficient and Verified Extraction of the Research Data Using LLM. Preprints. https://doi.org/10.20944/preprints202511.2140.v1

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contribution

We welcome contributions from the community! Please read the Contribution Guidelines for more details.

Wiki

The tool has its own Wiki pages with detailed information on usage cases, data descriptions, and other necessary information.
