PROBEst is a tool designed for generating nucleotide probes with specified properties. It uses a wrapped evolutionary algorithm to iteratively refine the probes by introducing mutations and evaluating their performance, ensuring that the final output is both specific to the target and universally applicable across related sequences.
In general, the tool is useful for generating probes that match a long list of targets while still retaining good specificity.
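The mutate-and-evaluate loop described above can be illustrated with a toy sketch. This is not PROBEst's implementation: the fitness function (`toy_score`, which counts matches against a single hypothetical target), the mutation rate, and the population size are all assumptions chosen only to show the shape of an evolutionary refinement loop.

```python
import random

BASES = "ACGT"


def mutate(probe: str, rate: float, rng: random.Random) -> str:
    """Return a copy of `probe` where each base is resampled with probability `rate`."""
    return "".join(
        rng.choice(BASES) if rng.random() < rate else base
        for base in probe
    )


def toy_score(probe: str, target: str) -> int:
    """Hypothetical fitness: number of positions where probe and target agree."""
    return sum(p == t for p, t in zip(probe, target))


def evolve(probes, target, iterations=20, top_n=5, rate=0.1, seed=0):
    """Toy evolutionary loop: mutate, score, keep the best `top_n` probes."""
    rng = random.Random(seed)
    population = list(probes)
    for _ in range(iterations):
        # Every current probe contributes three mutated offspring.
        offspring = [mutate(p, rate, rng) for p in population for _ in range(3)]
        # Parents stay in the pool, so the best score never decreases.
        candidates = set(population) | set(offspring)
        population = sorted(
            candidates, key=lambda p: (-toy_score(p, target), p)
        )[:top_n]
    return population


if __name__ == "__main__":
    target = "ACGTACGTACGTACGTACGT"
    start = ["A" * 20, "C" * 20]
    best = evolve(start, target)
    print(best[0], toy_score(best[0], target))
```

In the real tool the score comes from BLASTn hits against the target and off-target databases rather than from a direct string comparison.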
git clone https://github.com/CTLab-ITMO/PROBESt.git
cd PROBESt
bash setup/install.sh
# or if you prefer to install with pip (may cause conflicts):
# pip install -e .
# validate installation with
# setup/test_generator.sh
PROBEst can be run using the following command:
#conda activate probest
python pipeline.py \
-i {INPUT} \
-tb {TRUE_BASE} \
-fb [FALSE_BASE ...] \
-c {CONTIG_TABLE} \
-o {OUTPUT}
BLASTn databases and the contig table are produced by prep_db.sh (or built automatically when `-tb` / `-fb` point at directories of FASTA files). After each automatic build, the contig table is deduplicated by contig ID (when merged FASTAs or repeated runs produce duplicate rows, the last mapping is kept).
- `-i INPUT`: Input FASTA file, plain or gzipped (or a directory with fasta / fasta.gz files) for the initial probe set generation. If omitted, the first file from the `-tb` directory is used.
- `-tb TRUE_BASE`: BLASTn database path or directory of FASTA files for primer adjustment (directories are converted under `{OUTPUT}/.blast_db/`). Optional: omit to build the true base from `-i` by writing one `.fna` per sequence under `{OUTPUT}/.true_base_from_input/` (then `-i` is required).
- `-fb FALSE_BASE`: BLASTn database path(s) or FASTA directories for non-specific testing. Optional: if omitted, off-target `blastn` is skipped and an empty `negative_hits.tsv` is used each iteration (the rest of the pipeline still runs).
- `-c CONTIG_TABLE`: .tsv table with BLAST database information (optional; defaults to `{OUTPUT}/contigs.tsv` when using FASTA directories).
- `-o OUTPUT`: Output path for results.
- `-t THREADS`: Number of threads to use.
- `-a ALGORITHM`: Algorithm for probe generation (`FISH` or `primer`).
- `--initial_generator`: Tool for initial probe set generation (`primer3` or `oligominer`, default: `primer3`).
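The "last mapping wins" deduplication of the contig table can be sketched as follows. The column layout is an assumption (a tab-separated file with the contig ID in the first column); the real table produced by PROBEst may contain different or additional columns, and the function names here are hypothetical.

```python
import csv
import io


def dedupe_contig_rows(rows, key_column=0):
    """Keep one row per contig ID; when an ID repeats, the last row wins."""
    latest = {}
    for row in rows:
        if not row:  # skip blank lines
            continue
        # Re-assigning an existing key keeps its original position in the
        # dict but replaces the stored row with the most recent one.
        latest[row[key_column]] = row
    return list(latest.values())


def dedupe_contig_tsv(text: str) -> str:
    """Deduplicate a TSV given as a string and return the cleaned TSV."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = dedupe_contig_rows(reader)
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    table = "contig_1\tgenome_A\ncontig_2\tgenome_A\ncontig_1\tgenome_B\n"
    print(dedupe_contig_tsv(table), end="")
```

Here `contig_1` appears twice, so only its later mapping to `genome_B` survives.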
For a full list of arguments, run:
python pipeline.py --help
For parameter selection, grid search is implemented. You can specify parameters in a JSON file (see, for example, data/test/general/param_grid_light.json) and run
python test_parameters.py \
-p {JSON}
Example usage:
python pipeline.py \
-i data/test/general/test.fna \
-o data/test/general/output \
-tb data/test/general/fasta_base/true_base \
-fb data/test/general/fasta_base/false_base_1 \
data/test/general/fasta_base/false_base_2 \
-a FISH \
--PRIMER_PICK_PRIMER 5 \
--PRIMER_NUM_RETURN 5 \
-N 3 \
--visualize True --AI True
You can pre-build BLASTn databases yourself (e.g. when inputs are already makeblastdb outputs). To create true_base, false_base, and contig_table from FASTA files, use:
bash scripts/generator/prep_db.sh \
-n {DATABASE_NAME} \
-c {CONTIG_NAME} \
-t {TMP_DIR} \
[FASTA]
- `-n DATABASE_NAME`: Name of the output BLAST database (required).
- `-c CONTIG_TABLE`: Output file to store contig names and their corresponding sequence headers (required).
- `-t TMP_DIR`: Temporary directory for intermediate files (optional, defaults to `./.tmp`).
- `FASTA`: List of input FASTA files (gzipped or uncompressed).
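To make the role of the contig table concrete, the sketch below collects contig names and sequence headers from plain or gzipped FASTA files. It is an illustration only, not what prep_db.sh does internally: the three-column layout (contig ID, full header, source file name) and all function names are assumptions.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def rows_from_lines(lines, source_name):
    """Yield (contig_id, full_header, source_name) for each FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            parts = header.split(maxsplit=1)
            # The contig ID is taken to be the first whitespace-delimited token.
            contig_id = parts[0] if parts else ""
            yield contig_id, header, source_name


def contig_rows(fasta_paths):
    """Yield contig table rows for every record in every input file."""
    for path in fasta_paths:
        with open_fasta(path) as handle:
            yield from rows_from_lines(handle, Path(path).name)


def write_contig_table(fasta_paths, out_path):
    """Write the rows as a tab-separated table."""
    with open(out_path, "w") as out:
        for row in contig_rows(fasta_paths):
            out.write("\t".join(row) + "\n")
```

A call such as `write_contig_table(["a.fna", "b.fna.gz"], "contigs.tsv")` would then produce one line per sequence record.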
PROBEst includes a user-friendly web interface for probe generation. To launch the web app, run:
python app/app.py
For detailed web app documentation, see app/README.md.
- Prepare BLASTn databases
- Select File for Probe Generation (`INPUT`)
- Select Files for Universality Check (`TRUE_BASE`)
- Select Files for Specificity Check (`FALSE_BASE`)
- Select Layouts and Run Wrapped Evolutionary Algorithm (`pipeline.py`)
  a. Primer3 Generation
  b. BLASTn Check
  c. Parsing
  d. Mutation in Probe
  e. AI corrections (if `-AI` is enabled)
  f. De-degeneration (if `-degeneration` is enabled)
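The de-degeneration step is not specified further here. Assuming it means replacing a probe that contains IUPAC ambiguity codes with the concrete sequences it encodes, the idea could be sketched as below. The function name is hypothetical, and PROBEst may well select or score the variants differently.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(probe: str):
    """Return every concrete sequence encoded by a degenerate probe."""
    choices = [IUPAC[base] for base in probe.upper()]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    print(expand_degenerate("ACRT"))  # ['ACAT', 'ACGT']
```

The number of variants is the product of the choices at each position, so probes with many `N` positions expand quickly and would need a cap in practice.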
---
config:
layout: elk
look: classic
---
%%{init: {
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': 'white',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
subgraph inputs
A
A1
T1
T3
end
A([Initial probe generation]):::input -- primer3 --> B2(initial probe set):::probe
A -- oligominer --> B2
A1([Custom probes]):::input --> B2
T1([Target sequences]):::input -- blastn-db --> T2[(target)]
T3([Offtarget sequences]):::input -- blastn-db --> T4[(offtarget)]
subgraph database
T2
T4
end
T2 --> EA
T4 --> EA
B2 --> EA
EA[evolutionary algorithm] --> T11(results):::probe
classDef empty width:0px,height:0px;
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
---
config:
layout: elk
look: classic
---
%%{init: {
'layout': 'elk',
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': 'white',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
subgraph evolutionary algorithm
subgraph hits
TP
TN
end
B(probe set):::probe --> TP[target]
B --> TN[offtarget]
B1 -- mutations --> B
TP -- coverage --> T6[universality]
TP -- duplications --> T7[multimapping]
TN ---> T8[specificity]
subgraph check
T6
T7
T8
M1
E3
end
B --- E6[ ]:::empty --> M1[modeling]
TP --- E6
M1 --- E3[ ]:::empty
T6 --- E3
T7 --- E3
T8 --- E3
E3 -- quality prediction --> B1(filtered probe set):::probe
end
B1 --> T11(results):::probe
classDef empty width:0px,height:0px;
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
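The three checks in the diagram above (universality from target coverage, multimapping from duplicated target hits, specificity from off-target hits) might be computed along these lines. The hit representation, a list of `(probe_id, subject_id)` pairs, and the exact formulas are assumptions for illustration and are not taken from PROBEst's scoring code.

```python
from collections import Counter


def probe_metrics(probe_id, target_hits, offtarget_hits, n_targets):
    """Summarise one probe from (probe_id, subject_id) hit pairs."""
    on_target = [s for p, s in target_hits if p == probe_id]
    off_target = {s for p, s in offtarget_hits if p == probe_id}
    per_subject = Counter(on_target)
    return {
        # Fraction of target sequences hit at least once.
        "universality": len(per_subject) / n_targets if n_targets else 0.0,
        # Number of target sequences hit more than once.
        "multimapping": sum(1 for count in per_subject.values() if count > 1),
        # Number of distinct off-target sequences hit (lower is more specific).
        "offtarget_subjects": len(off_target),
    }


if __name__ == "__main__":
    target_hits = [("p1", "t1"), ("p1", "t1"), ("p1", "t2"), ("p2", "t3")]
    offtarget_hits = [("p1", "x1"), ("p1", "x1")]
    print(probe_metrics("p1", target_hits, offtarget_hits, n_targets=4))
```

For `p1` this reports a universality of 0.5 (two of four targets covered), one multimapped target, and one off-target subject.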
---
config:
theme: neutral
look: classic
---
%%{init: {
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': '#90EE90',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
PROBEst([PROBEst]) --> src[src/]
PROBEst --> scripts[scripts/]
PROBEst --> tests[tests/]
PROBEst --> app[app/]
subgraph folders
src
scripts
tests
app
end
src --> C[benchmarking]
src --> A[generation]
tests --> A
scripts --> D[preprocessing]
scripts --> B[database parsing]
D --> A
app --> E[web interface]
E --> A
- To check the installation: `bash setup/test_generator.sh`
- For developers: use `pytest`
If you use the PROBEst LLM pipeline for research data extraction, please cite:
BibTeX:
@article{202511.2140,
doi = {10.20944/preprints202511.2140.v1},
url = {https://doi.org/10.20944/preprints202511.2140.v1},
year = 2025,
month = {November},
publisher = {Preprints},
author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
title = {Efficient and Verified Extraction of the Research Data Using LLM},
journal = {Preprints}
}
Plain text: Serdiukov, A., Dragvelis, V., Smutin, D., Taldaev, A., & Muravyov, S. (2025). Efficient and Verified Extraction of the Research Data Using LLM. Preprints. https://doi.org/10.20944/preprints202511.2140.v1
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions from the community! To contribute, please read the Contribution Guidelines for more details.
The tool has its own Wiki pages with detailed information on use cases, data descriptions, and other necessary information.
