PROBEst

St. Petersburg tool for generating nucleotide probes with specified properties


PROBEst is a tool designed for generating nucleotide probes with specified properties. It uses a wrapped evolutionary algorithm to iteratively refine the probes by introducing mutations and evaluating their performance, ensuring that the final output is both specific to the target and universally applicable across related sequences.

In general, the tool is useful for generating probes that match a long list of targets while still maintaining good specificity.

Download and installation

Installation

git clone https://github.com/CTLab-ITMO/PROBEst.git
cd PROBEst
bash setup/install.sh
# or, if you prefer to install with pip (may cause conflicts):
# pip install -e .
# validate the installation with:
# bash setup/test_generator.sh

Usage

Generation

PROBEst can be run using the following command:

# conda activate probest
python pipeline.py \
  -i {INPUT} \ # FASTA (.fa / .fa.gz); if omitted, the first file from the `-tb` directory is used
  -tb {TRUE_BASE} \ # optional; BLAST DB or FASTA directory for oligo improvement; if omitted, the true base is built from a per-sequence split of `-i`
  -fb [FALSE_BASE ...] \ # same as `-tb`, but for off-target search; optional; omit to skip the off-target search
  -c {CONTIG_TABLE} \ # .tsv table with BLAST database information (optional; defaults to `{OUTPUT}/contigs.tsv` when using FASTA directories)
  -o {OUTPUT} # output directory

BLASTn databases and the contig table are produced by prep_db.sh (or built automatically when -tb / -fb point at directories of FASTA files). After each automatic build, the contig table is deduplicated by contig ID: for duplicate rows from merged FASTAs or repeated runs, the last mapping is kept.
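The deduplication step above can be sketched as follows. This is a minimal illustration, assuming a two-column TSV that maps contig IDs to sequence headers; the exact table layout used by PROBEst may differ:

```python
import csv
import io


def dedup_contig_table(tsv_text: str) -> str:
    """Keep only the last mapping for each contig ID (first column)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    last = {}
    for row in rows:
        if row:
            last[row[0]] = row  # later rows overwrite earlier ones
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(last.values())
    return out.getvalue()
```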

Key arguments:

  • -i INPUT: Input FASTA file (or a directory with FASTA / fasta.gz files) for the initial probe set generation.
  • -tb TRUE_BASE: BLASTn database path or directory of FASTA files for primer adjusting (directories are converted under {OUTPUT}/.blast_db/). Optional: omit to build the true base from -i by writing one .fna per sequence under {OUTPUT}/.true_base_from_input/ (then -i is required).
  • -fb FALSE_BASE: BLASTn database path(s) or FASTA directories for non-specific testing. Optional: if omitted, off-target blastn is skipped and an empty negative_hits.tsv is used each iteration (the rest of the pipeline still runs).
  • -c CONTIG_TABLE: .tsv table with BLAST database information (optional; defaults to {OUTPUT}/contigs.tsv when using FASTA directories).
  • -o OUTPUT: Output path for results.
  • -t THREADS: Number of threads to use.
  • -a ALGORITHM: Algorithm for probe generation (FISH or primer).
  • --initial_generator: Tool for initial probe set generation (primer3 or oligominer, default: primer3).

For a full list of arguments, run:

python pipeline.py --help

For parameter selection, grid search is implemented. You can specify parameters in a JSON file (see, for example, data/test/general/param_grid_light.json) and run:

python test_parameters.py \
  -p {JSON}
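A parameter grid is a JSON object mapping pipeline arguments to lists of candidate values. The sketch below reuses flags shown in the example usage further down; treat it as a hypothetical illustration and consult data/test/general/param_grid_light.json for the actual format:

```json
{
  "--PRIMER_PICK_PRIMER": [3, 5],
  "--PRIMER_NUM_RETURN": [5, 10],
  "-N": [3, 5]
}
```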

Example usage:

python pipeline.py \
  -i data/test/general/test.fna \
  -o data/test/general/output \
  -tb data/test/general/fasta_base/true_base \
  -fb data/test/general/fasta_base/false_base_1 \
      data/test/general/fasta_base/false_base_2 \
  -a FISH \
  --PRIMER_PICK_PRIMER 5 \
  --PRIMER_NUM_RETURN 5 \
  -N 3 \
  --visualize True --AI True

Manual preparation

You can pre-build BLASTn databases yourself (e.g. when inputs are already makeblastdb outputs). To create true_base, false_base, and contig_table from FASTA files, use:

bash scripts/generator/prep_db.sh \
  -n {DATABASE_NAME} \
  -c {CONTIG_TABLE} \
  -t {TMP_DIR} \
  [FASTA]

Arguments:

  • -n DATABASE_NAME: Name of the output BLAST database (required).
  • -c CONTIG_TABLE: Output file to store contig names and their corresponding sequence headers (required).
  • -t TMP_DIR: Temporary directory for intermediate files (optional, defaults to ./.tmp).
  • FASTA: List of input FASTA files (gzipped or uncompressed).
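For many FASTA inputs it can be convenient to assemble the prep_db.sh invocation programmatically. A minimal sketch (the paths and the default temporary directory mirror the arguments listed above; pass the result to subprocess.run):

```python
from pathlib import Path


def prep_db_command(db_name, contig_table, fastas, tmp_dir=".tmp"):
    """Build the argv list for scripts/generator/prep_db.sh."""
    cmd = [
        "bash", "scripts/generator/prep_db.sh",
        "-n", db_name,      # output BLAST database name
        "-c", contig_table, # output contig table
        "-t", tmp_dir,      # temporary directory
    ]
    cmd += [str(Path(f)) for f in fastas]  # gzipped or uncompressed FASTAs
    return cmd
```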

Web Application

PROBEst includes a user-friendly web interface for probe generation. To launch the web app, run:

python app/app.py

For detailed web app documentation, see app/README.md

Algorithm

Algorithm Steps

  1. Prepare BLASTn databases

  2. Select File for Probe Generation (INPUT)

  3. Select Files for Universality Check (TRUE_BASE)

  4. Select Files for Specificity Check (FALSE_BASE)

  5. Select Layouts and Run Wrapped Evolutionary Algorithm (pipeline.py)

    a. Primer3 Generation

    b. BLASTn Check

    c. Parsing

    d. Mutation in Probe

    e. AI corrections (if --AI is enabled)

    f. De-degeneration (if --degeneration is enabled)
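The wrapped evolutionary loop (steps a–f) can be sketched as a select-and-mutate cycle. The fitness function, mutation scheme, and survivor count below are toy placeholders, not PROBEst's actual scoring or BLASTn-based checks:

```python
import random

BASES = "ACGT"


def mutate(probe: str, rate: float = 0.05, rng=None) -> str:
    """Point-mutate each position with probability `rate`."""
    rng = rng or random.Random(0)
    return "".join(rng.choice(BASES) if rng.random() < rate else b for b in probe)


def evolve(initial, fitness, generations=10, keep=5, rng=None):
    """Toy loop: score probes, keep the best, mutate survivors, repeat."""
    rng = rng or random.Random(0)
    population = list(initial)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)  # evaluate (stands in for BLASTn checks)
        survivors = population[:keep]               # quality filtering
        population = survivors + [mutate(p, rng=rng) for p in survivors]  # mutation step
    return max(population, key=fitness)
```

Because survivors are retained each generation, the best fitness in the population never decreases across iterations.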

---
config:
  layout: elk
  look: classic
---
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': 'white',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
  subgraph inputs
  A
  A1
  T1
  T3
  end

  A([Initial probe generation]):::input -- primer3 --> B2(initial probe set):::probe
  A -- oligominer --> B2
  A1([Custom probes]):::input --> B2

  T1([Target sequences]):::input -- blastn-db --> T2[(target)]
  T3([Offtarget sequences]):::input -- blastn-db --> T4[(offtarget)]

  subgraph database
  T2
  T4
  end

  T2 --> EA
  T4 --> EA
  B2 --> EA

  EA[evolutionary algorithm] --> T11(results):::probe

  classDef empty width:0px,height:0px;
  classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
  classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
---
config:
  layout: elk
  look: classic
---
%%{init: {
  'layout': 'elk',
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': 'white',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
  subgraph evolutionary algorithm
    subgraph hits
      TP
      TN
    end

    B(probe set):::probe --> TP[target]
    B --> TN[offtarget]
    B1 -- mutations --> B

    TP -- coverage --> T6[universality]
    TP -- duplications --> T7[multimapping]
    TN ---> T8[specificity]

    subgraph check
    T6
    T7
    T8
    M1
    E3
    end

    B --- E6[ ]:::empty --> M1[modeling]
    TP --- E6

    M1 --- E3[ ]:::empty
    T6 --- E3
    T7 --- E3
    T8 --- E3
    E3 -- quality prediction --> B1(filtered probe set):::probe
  end
  B1 --> T11(results):::probe

  classDef empty width:0px,height:0px;
  classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
  classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;

Project Structure

---
config:
  theme: neutral
  look: classic
---
%%{init: {
  'theme': 'base',
  'themeVariables': {
    'fontFamily': 'arial',
    'fontSize': '16px',
    'primaryColor': '#fff',
    'primaryBorderColor': '#FFAC1C',
    'primaryTextColor': '#000',
    'lineColor': '#000',
    'secondaryColor': '#90EE90',
    'tertiaryColor': '#fff',
    'subgraphBorderStyle': 'dotted'
  },
  'flowchart': {
    'curve': 'monotoneY',
    'padding': 15
  }
}}%%

graph LR
    PROBEst([PROBEst]) --> src[src/]
    PROBEst --> scripts[scripts/]
    PROBEst --> tests[tests/]
    PROBEst --> app[app/]

    subgraph folders
    src
    scripts
    tests
    app
    end
    
    src --> C[benchmarking]
    src --> A[generation]
    tests --> A
    
    scripts --> D[preprocessing]
    scripts --> B[database parsing]
    D --> A
    
    app --> E[web interface]
    E --> A

Testing

  • To check the installation: bash setup/test_generator.sh

  • For developers: use pytest

Citation

If you use the PROBEst LLM pipeline for the extraction of research data, please cite:

BibTeX:

@article{202511.2140,
	doi = {10.20944/preprints202511.2140.v1},
	url = {https://doi.org/10.20944/preprints202511.2140.v1},
	year = 2025,
	month = {November},
	publisher = {Preprints},
	author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
	title = {Efficient and Verified Extraction of the Research Data Using LLM},
	journal = {Preprints}
}

Plain text: Serdiukov, A., Dragvelis, V., Smutin, D., Taldaev, A., & Muravyov, S. (2025). Efficient and Verified Extraction of the Research Data Using LLM. Preprints. https://doi.org/10.20944/preprints202511.2140.v1

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contribution

We welcome contributions from the community! Please read the Contribution Guidelines for more details.

Wiki

The tool has its own Wiki pages with detailed information on usage cases, data descriptions, and other necessary information.
