CRISPRware

CRISPRware is a comprehensive toolkit designed to preprocess NGS data and identify, score, and rank guide RNAs (gRNAs) for CRISPR experiments. It supports RNASeq, RiboSeq, ATACSeq, DNASESeq, ChIPSeq, and other genomic preprocessing techniques.

Table of Contents

  1. Installation
  2. Tutorials
  3. Quickstart
  4. Requirements
  5. Leveraging NGS data
  6. Alternate PAMs and scoring methods
  7. Full Commands
  8. References

Installation

If you have not already done so, install one of the package managers miniconda or micromamba.

Linux installation

With conda installed, run the following commands (if you installed micromamba, replace conda with micromamba):

git clone https://github.com/ericmalekos/crisprware crisprware && cd crisprware

conda env create -f environment.yml && conda activate crisprware

macOS installation and troubleshooting

Try running git -h. If you hit the error xcrun: error: invalid developer path ..., you may need to install the Command Line Tools package with xcode-select --install. With this complete, follow the same instructions as for Linux.

You may encounter an error with the score_guides module. In short, RS3 scoring requires a specific version of libomp, which can be installed with the command brew install libomp@11.1.0. This requires Homebrew to be installed.

Docker

Avoid local installation by pulling the latest Docker image and running commands through it:

docker pull ericmalekos/crisprware:latest

docker run crisprware -h

Tutorials

These interactive notebooks demonstrate the use of CRISPRware modules with text explanations and code blocks. They are the best place to start for gaining an understanding of the workflow and capabilities of this software. In each case the first block sets up the environment by pulling the latest version from GitHub.

Full Tutorial: Open In Colab

Covers all major CRISPRware functions.

CRISPRware Rice Genome Tutorial: Open In Colab

End to end processing of rice osa1_r7 genome and gene annotation.

CRISPRware NGS applications: Open In Colab

Examples of retrieving public NGS data and applying to custom gRNA library design.

Quickstart

Input Requirements

  • FASTA File

Optional inputs

  • BED file: A BED file can be provided to specify regions of interest within the genome. This file can help to limit the search space for gRNA identification.
  • GTF/GFF file: A GTF or GFF file can be used to provide gene annotations. This information can be used to filter gRNAs based on specific genomic features such as exons or coding sequences.

CRISPRware workflow

We demonstrate usage with the ce11 chromosome III FASTA and NCBI GTF, included in the tests/test_data/ce11 directory.

Note that the example off-target index is limited to chrIII, not the full ce11 genome:

crisprware index_genome -f tests/test_data/ce11/chrIII_sequence.fasta

We can build gene models from the NCBI GTF:

crisprware preprocess_annotation -g tests/test_data/ce11/chrIII_ce11.ncbiRefSeq.gtf \
-m metagene consensus longest shortest

Default settings generate guides with an NGG PAM:

crisprware generate_guides -f tests/test_data/ce11/chrIII_sequence.fasta \
-k tests/test_data/ce11/chrIII_ce11.ncbiRefSeq.gtf \
--feature CDS

Scoring will take ~5 minutes and uses 8 threads by default. Change this with --threads. --tracr is one of Chen2013, Hsu2013, or both; see RuleSet3 scoring for details.

crisprware score_guides -b chrIII_sequence_gRNA/chrIII_sequence_gRNA.bed \
-i chrIII_sequence_gscan2/chrIII_sequence_gscan2 --tracr Chen2013 --threads 8

Ranking is based on the scoring columns:
-c is matched with -m in order, so this filters out guides with RS3_score_Chen2013 < 0 or specificity_chrIII_sequence_gscan2 < 0.2.
-p 5 65 with -f CDS filters out gRNAs outside the 5th-65th percentile of the CDS.
--output_all outputs TSVs and histograms for each stage of filtering in addition to the final output.

crisprware rank_guides \
-k chrIII_sequence_scoredgRNA/chrIII_sequence_scoredgRNA.bed \
-t tests/test_data/ce11/chrIII_ce11.ncbiRefSeq.gtf \
-f CDS \
-c RS3_score_Chen2013 specificity_chrIII_sequence_gscan2 \
-m 0 0.2 \
-p 5 65 \
-r RS3_score_Chen2013 \
--output_all
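
The pairing of -c with -m and the percentile filter can be pictured with the minimal Python sketch below. It is not CRISPRware's implementation: the function names, the dictionary layout of a guide, the made-up scores, and the way the percentile is computed (cut-site offset from the 5' end of the CDS divided by the CDS length) are all assumptions made for illustration.

```python
def passes_minimums(guide, columns, minimums):
    """Return True if guide[column] >= minimum for every (column, minimum) pair.

    columns and minimums are matched by position, mirroring -c and -m.
    """
    if len(columns) != len(minimums):
        raise ValueError("each filtering column needs exactly one minimum")
    return all(guide[col] >= low for col, low in zip(columns, minimums))


def in_percentile_range(cut_offset, cds_length, low, high):
    """Mirror of -p LOW HIGH (assumed semantics).

    cut_offset is a hypothetical 0-based distance of the cut site from the
    5' end of the CDS; strand handling is left to the caller.
    """
    percentile = 100.0 * cut_offset / cds_length
    return low <= percentile <= high


# Made-up guides on a hypothetical 1000-nt CDS, for illustration only.
columns = ["RS3_score_Chen2013", "specificity_chrIII_sequence_gscan2"]
minimums = [0, 0.2]
cds_length = 1000
guides = [
    {"id": "g1", columns[0]: 0.8, columns[1]: 0.9, "cut_offset": 300},
    {"id": "g2", columns[0]: -0.4, columns[1]: 0.7, "cut_offset": 400},
    {"id": "g3", columns[0]: 1.1, columns[1]: 0.1, "cut_offset": 500},
    {"id": "g4", columns[0]: 0.5, columns[1]: 0.6, "cut_offset": 900},
]

kept = [
    g["id"]
    for g in guides
    if passes_minimums(g, columns, minimums)
    and in_percentile_range(g["cut_offset"], cds_length, 5, 65)
]
print(kept)
```

In this toy input g2 and g3 each fail one minimum and g4 lies beyond the 65th percentile, so only g1 survives.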

Requirements

Memory requirements may be substantial in both the index_genome and score_guides steps. Guidescan2 authors provide compiled indices for some model species in the download section of their website which can be downloaded directly to avoid use of index_genome.
For score_guides we provide a parameter --chunk_size <n>, which can be used to decrease memory usage by processing <n> guides at a time instead of all at once. The default setting is 100000. Increasing this number speeds up processing but increases memory usage; decreasing it slows processing but reduces memory usage.
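
The trade-off behind --chunk_size can be illustrated with a generic chunking pattern. This is a sketch of the general idea only, not the code used by score_guides; the helper names and the stand-in GC-fraction scorer are hypothetical.

```python
from itertools import islice


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def score_in_chunks(guides, score_fn, chunk_size=100000):
    """Score guides chunk by chunk so only one chunk is processed at a time."""
    scores = []
    for chunk in chunked(guides, chunk_size):
        scores.extend(score_fn(chunk))
    return scores


def gc_fraction(batch):
    """Stand-in scorer (hypothetical): GC fraction of each sequence."""
    return [(seq.count("G") + seq.count("C")) / len(seq) for seq in batch]


example = ["GGCC", "ATAT", "GCAT", "AAAA", "CCCC"]
chunk_sizes = [len(c) for c in chunked(example, 2)]
scores = score_in_chunks(example, gc_fraction, chunk_size=2)
print(chunk_sizes, scores)
```

Smaller chunks mean more calls to the scorer (slower) but fewer sequences in flight per call (less memory), which is the behaviour described above.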

Leveraging NGS data

CRISPRware offers a series of modules to preprocess NGS data and determine suitable gRNAs for CRISPR applications.

RNASeq Guided Preprocessing

The preprocess_annotation module takes processed RNASeq TPMs from Kallisto, Salmon, FLAIR, or Mandalorion (passed as mandalorian on the command line) for one or more samples, along with the GTF/GFF gene annotation. All samples should come from the same quantification tool; do not mix Salmon and Kallisto files. If multiple samples are passed, the max, min, median, and mean TPM values for each transcript are determined, and the user can supply minimum cut-offs for any combination of these to filter out lowly expressed isoforms. All detected isoforms (TPM > 0) are kept by default. The user can also set the integer flag --top_n <n>, which filters out all but the n most highly expressed isoforms for each gene. So --top_n 1 retains only the gene model of the most highly expressed isoform, ranked by median TPM by default if multiple RNASeq files are passed (change the metric with --top_n_column). The --tss_window and --tes_window options produce BED files of windows around the transcription start and end sites for choosing dCas targets. The resulting GTFs/BEDs can be used in the generate_guides and rank_guides steps.

crisprware preprocess_annotation -g test_data/chr19_ucsc_mm39.ncbiRefSeq.gtf \
-t quant1.sf quant2.sf quant3.sf \
--type infer \
--median 5 \
--top_n 10 \
--top_n_column median \
--model consensus metagene shortest longest \
--tss_window 300 300 \
--tes_window 300 300

IMPORTANT: ensure the GTF and the TPM files have the same transcript IDs
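
The filtering logic described above (summary statistics per transcript, minimum cut-offs, then the top n isoforms per gene) can be sketched in a few lines of Python. This is an illustration under assumptions, not the preprocess_annotation code: the function names, the in-memory dictionaries, and the TPM values are made up.

```python
from statistics import mean, median


def summarize(tpms):
    """Summary statistics of one transcript's TPM values across samples."""
    return {"min": min(tpms), "max": max(tpms),
            "mean": mean(tpms), "median": median(tpms)}


def filter_isoforms(tpm_table, tx_to_gene, cutoffs=None, top_n=-1,
                    top_n_column="median"):
    """Keep isoforms meeting every cutoff, then the top_n per gene.

    tpm_table:  {transcript_id: [tpm_sample1, tpm_sample2, ...]}
    tx_to_gene: {transcript_id: gene_id}
    cutoffs:    e.g. {"median": 5}; top_n=-1 keeps every passing isoform.
    """
    cutoffs = cutoffs or {}
    per_gene = {}
    for tx, tpms in tpm_table.items():
        stats = summarize(tpms)
        if all(stats[name] >= value for name, value in cutoffs.items()):
            per_gene.setdefault(tx_to_gene[tx], []).append(
                (stats[top_n_column], tx))
    kept = []
    for scored in per_gene.values():
        scored.sort(reverse=True)  # highest-expressed isoform first
        chosen = scored if top_n < 0 else scored[:top_n]
        kept.extend(tx for _, tx in chosen)
    return sorted(kept)


# Made-up TPMs for three samples.
tpm_table = {
    "tx1": [10.0, 12.0, 8.0],
    "tx2": [6.0, 5.0, 7.0],
    "tx3": [0.0, 1.0, 20.0],
    "tx4": [30.0, 25.0, 40.0],
}
tx_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"}

kept = filter_isoforms(tpm_table, tx_to_gene, cutoffs={"median": 5}, top_n=1)
print(kept)
```

Here tx3 is dropped by the median cut-off even though one sample is highly expressed, and tx2 is dropped because only the top isoform per gene is retained.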

RiboSeq Guided Preprocessing

A number of tools exist for calling translated ORFs from RiboSeq. To find gRNAs against these putative coding regions, we can convert the output of these programs into a GTF with annotated coding sequence (CDS) entries and run the pipeline normally.

For ORFs called with RiboTISH, set these options in the ribotish predict command: --inframecount, --blocks, --aaseq, and provide the same GTF that was passed to ribotish. Default settings should work for ORFs called with Price, although gtf_from_price.py has fewer filtering options.

For other RiboSeq ORF callers, raise a GitHub issue and I will address it.
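
For readers unfamiliar with what "a GTF with annotated CDS entries" looks like, here is a minimal Python sketch that formats CDS records for one ORF. It is not how gtf_from_ribotish.py or gtf_from_price.py work internally; the function names, the source label riboseq_orf, and the coordinates are hypothetical. The frame column follows the usual GTF convention (number of bases to skip to reach the next codon start).

```python
def gtf_cds_line(chrom, start, end, strand, gene_id, transcript_id,
                 frame=0, source="riboseq_orf"):
    """Format one 9-column GTF CDS record (1-based, inclusive coordinates)."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if start > end:
        raise ValueError("start must not exceed end")
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    fields = [chrom, source, "CDS", str(start), str(end), ".",
              strand, str(frame), attributes]
    return "\t".join(fields)


def cds_records(chrom, blocks, strand, gene_id, transcript_id):
    """Build CDS lines for a multi-block ORF, computing the frame per block.

    blocks is a list of (start, end) pairs; they are walked in transcript
    order (reversed for the minus strand).
    """
    ordered = sorted(blocks, reverse=(strand == "-"))
    lines, consumed = [], 0
    for start, end in ordered:
        frame = (3 - consumed % 3) % 3
        lines.append(gtf_cds_line(chrom, start, end, strand,
                                  gene_id, transcript_id, frame))
        consumed += end - start + 1
    return lines


# Hypothetical two-block ORF on the plus strand.
records = cds_records("chrIII", [(100, 109), (200, 210)], "+", "geneX", "txX.1")
for record in records:
    print(record)
```

A GTF written this way can then be passed to generate_guides with --feature CDS, as in the Quickstart.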

Full filtering options:

gtf_from_ribotish.py -h

options:
  -h, --help            show this help message and exit
  -r RIBOTISH, --ribotish RIBOTISH
                        Path to the Ribotish predict TSV file
  -i INPUT_GTF, --input_gtf INPUT_GTF
                        Path to the corresponding GTF file
  -o OUTPUT_GTF, --output_gtf OUTPUT_GTF
                        Path to output the new GTF file
  --min_aalen MIN_AALEN
                        Minimum amino acid length
  --min_inframecount MIN_INFRAMECOUNT
                        Minimum in-frame count
  --max_tisqvalue MAX_TISQVALUE
                        Maximum TIS Q-value
  --max_frameqvalue MAX_FRAMEQVALUE
                        Maximum Frame Q-value
  --max_fisherqvalue MAX_FISHERQVALUE
                        Maximum Fisher Q-value
  --select_based_on {AALen,InFrameCount,TISQvalue,FrameQvalue,FisherQvalue}
                        Column to select the best row for each Tid, TisType pair
  --genetype GENETYPE   GeneType to filter, must match a column entry
  --tistype TISTYPE     TisType to filter, must match a column entry


gtf_from_price.py -h

options:
  -h, --help            show this help message and exit
  -i INPUT_TSV, --input_tsv INPUT_TSV
                        Path to the input price TSV file
  -g INPUT_GTF, --input_gtf INPUT_GTF
                        Path to the input GTF file to be used as a reference
  -o OUTPUT_GTF, --output_gtf OUTPUT_GTF
                        Path to output the new GTF file
  -p MIN_P_VALUE, --min_p_value MIN_P_VALUE
                        Minimum p value for filtering
  --min_aalen MIN_AALEN
                        Minimum amino acid length
  --tis_type TIS_TYPE   Tis Type to filter
  --start_codon START_CODON
                        start codon to filter

Alternate PAMs and scoring methods

Default crisprware generate_guides settings are equivalent to the following (the short forms of these flags are -p, -l, -w, -5 and -3):

crisprware generate_guides \
-f <fasta> \
--pam NGG \
--sgRNA_length 20 \
--context_window 4 6 \
--active_site_offset_5=-4 \
--active_site_offset_3=-4

All IUPAC ambiguity codes are allowed and will be automatically expanded, e.g. NGG -> AGG, TGG, CGG, GGG. Note that context_window[0] extends the sequence in the 5' direction and context_window[1] in the 3' direction. The active_site_offsets are calculated relative to the PAM-protospacer position; negative values should be passed with an '=' sign, e.g. --active_site_offset_5=-4, as described in the generate_guides help.
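
The IUPAC expansion mentioned above can be reproduced with a short Python sketch. The lookup table is the standard IUPAC nucleotide code; the function name expand_pam is hypothetical and this is not necessarily how CRISPRware performs the expansion.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_pam(pam):
    """Expand a PAM containing ambiguity codes into all concrete sequences."""
    choices = [IUPAC[base] for base in pam.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_pam("NGG"))
print(expand_pam("TTTV"))
```

NGG expands to four concrete PAMs and the Cas12a PAM TTTV to three (TTTA, TTTC, TTTG).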

For Cas12a guide selection, change the crisprware generate_guides settings to:

crisprware generate_guides \
-f <fasta> \
--pam TTTV --pam_5_prime -5 19 -3 23 -l 23 -w 8 3

Here the PAM is 5' of the protospacer, so the --pam_5_prime flag is set and the sgRNA length is increased to 23. The window is resized for compatibility with DeepCpf1 and EnPAMGB scoring, and the final sequence should be 34 nts long.
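
The expected context length is simple arithmetic: upstream window + sgRNA length + downstream window. The Python sketch below checks the numbers used in this README. The function name is hypothetical, and the assumption that the window is counted from the ends of the protospacer (so that the PAM falls inside the window on its side) is inferred from the examples here rather than taken from the source code.

```python
def context_length(sgrna_length, context_window):
    """Total length of the context sequence for one guide.

    context_window is (upstream, downstream), as passed to -w.
    Assumes the window is measured from the protospacer ends, so the
    PAM is part of the window on whichever side it sits.
    """
    upstream, downstream = context_window
    if upstream < 0 or downstream < 0:
        raise ValueError("window sizes must be non-negative")
    return upstream + sgrna_length + downstream


# Default Cas9 settings: -l 20 -w 4 6
print(context_length(20, (4, 6)))
# Cas12a settings: -l 23 -w 8 3
print(context_length(23, (8, 3)))
```

The default Cas9 settings give the 30-nt context used by RuleSet3, and the Cas12a settings give the 34-nt context expected by DeepCpf1 and EnPAMGB.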

Additional scoring methods

For additional on-target scoring, including of Cas12a/Cpf1 guides, first install crisprScore (recommendation: install it in a new conda environment). Once installed, the crisprscore_multi.R script can be used to score guides. The scoring methods have different requirements for the 5'/3' flanking sequence lengths of the input, which are set with the --context_window argument of generate_guides. A method can be applied as long as the flanking sequence is equal to or longer than the flank length it requires. Any number of scoring methods can be applied in a single run, in which case the context window should be as long as the longest required input.

Here is an example of finding and scoring all gRNAs in the exons of the genes ITGA2B and ITGB3, with a 50 bp buffer around each exon. It assumes you have a human genome FASTA and GTF in the current directory:

# extract a bed of gene exons, +/- 50 bps

awk 'BEGIN{FS=OFS="\t"} $0!~/^#/ && $3=="exon" && $9~/gene_name "(ITGA2B|ITGB3)";/ {s=$4-1-50; if(s<0)s=0; e=$5+50; print $1,s,e}' gencode.v49.primary_assembly.annotation.gtf \
| sort -k1,1 -k2,2n \
| bedtools merge -i - > itga2b_itgb3_exons_50bpBuffer.merged.bed
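
If awk or bedtools is not at hand, the same padding and merging can be sketched in plain Python. This mirrors the one-liner above (1-based GTF to 0-based BED, 50 bp padding clipped at 0, merging of overlapping or book-ended intervals); the function names and the toy GTF lines are hypothetical.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)";')


def exon_bed_intervals(gtf_lines, gene_names, buffer_bp=50):
    """Collect padded BED intervals for exons of the requested genes."""
    intervals = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_NAME.search(fields[8])
        if not match or match.group(1) not in gene_names:
            continue
        # GTF is 1-based inclusive; BED is 0-based half-open.
        start = max(0, int(fields[3]) - 1 - buffer_bp)
        end = int(fields[4]) + buffer_bp
        intervals.append((fields[0], start, end))
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals per chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(entry) for entry in merged]


def gtf(chrom, feature, start, end, gene):
    """Build a toy GTF line for the example below."""
    attrs = f'gene_id "{gene}"; gene_name "{gene}";'
    return "\t".join([chrom, "toy", feature, str(start), str(end),
                      ".", "+", ".", attrs])


toy_gtf = [
    "#comment line",
    gtf("chr17", "exon", 100, 200, "ITGB3"),
    gtf("chr17", "exon", 220, 300, "ITGB3"),
    gtf("chr17", "exon", 1000, 1100, "ITGA2B"),
    gtf("chr17", "exon", 5000, 5100, "OTHER"),
    gtf("chr17", "CDS", 120, 180, "ITGB3"),
    gtf("chr1", "exon", 10, 20, "ITGB3"),
]
bed = merge_intervals(exon_bed_intervals(toy_gtf, {"ITGA2B", "ITGB3"}))
for chrom, start, end in bed:
    print(chrom, start, end, sep="\t")
```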

# generate guide RNAs with a 13-bp 5' flank and a 32-bp 3' flank, which includes the NGG PAM

crisprware generate_guides -f GRCh38.primary_assembly.genome.fa \
-k itga2b_itgb3_exons_50bpBuffer.merged.bed \
--coords_as_active_site \
--context_window 13 32

# score with Cas9 methods, notice the last two inputs are the flanks - now excluding the PAM length

crisprscore_multi.R sgRNAs/sgRNAs.bed 1,2,3,4,5,6,7,8,9,10,11,12,13,14 sgRNAs_scored.bed Cas9 13 29

# Perform off-target scoring and format the BED for rank_guides
# pass --drop_duplicates (setting the flag retains duplicates) and set --threshold=-1 to retain all gRNAs

crisprware score_guides -b sgRNAs_scored.bed \
-i Hg38_Index/Hg38_index \
--skip_rs3 \
--drop_duplicates \
--threshold=-1

Example usage for Cas12a

# generate Cas12a guides

crisprware generate_guides \
-f GRCh38.primary_assembly.genome.fa \
-k itga2b_itgb3_exons_50bpBuffer.merged.bed \
--pam TTTV --pam_5_prime -5 19 -3 23 -l 23 -w 8 3 \
--coords_as_active_site \
-o Cas12a

# score with Cas12a methods, note the 5' context goes 8->4, removing the PAM length

crisprscore_multi.R Cas12asgRNAs/Cas12asgRNAs.bed 15,16,17 scored_Cas12asgRNAs.bed Cas12a 4 3

# format for rank_guides

crisprware score_guides -b scored_Cas12asgRNAs.bed --skip_rs3 --skip_gs2
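
The two examples above pass different flank lengths to crisprscore_multi.R than were given to --context_window, because the script expects flanks that exclude the PAM. The conversion can be written as a small Python helper. The function name is hypothetical, and the rule (subtract the PAM length from the window on the PAM side) is inferred from the Cas9 and Cas12a examples in this README.

```python
def crisprscore_flanks(context_window, pam, pam_5_prime=False):
    """Convert a generate_guides context window into PAM-free flank lengths.

    context_window is (upstream, downstream) as passed to -w.
    The PAM length is removed from the side on which the PAM sits.
    """
    upstream, downstream = context_window
    pam_length = len(pam)
    if pam_5_prime:
        flanks = (upstream - pam_length, downstream)
    else:
        flanks = (upstream, downstream - pam_length)
    if min(flanks) < 0:
        raise ValueError("context window is too short to contain the PAM")
    return flanks


# Cas9 example: -w 13 32 with an NGG PAM 3' of the protospacer
print(crisprscore_flanks((13, 32), "NGG"))
# Cas12a example: -w 8 3 with a TTTV PAM 5' of the protospacer
print(crisprscore_flanks((8, 3), "TTTV", pam_5_prime=True))
```

These reproduce the "13 29" and "4 3" arguments used in the commands above.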

Guidescan2 is not compatible with PAMs 5' to protospacers. For off-target scoring in these cases I suggest FlashFry; I am working on a tutorial for FlashFry off-target scoring.

Full Commands

crisprware preprocess_annotation

options:
  -h, --help            show this help message and exit
  -g GTF, --gtf GTF     GTF/GFF file to use for isoform filtering.
  -t [TPM_FILES ...], --tpm_files [TPM_FILES ...]
                        A list of one or more isoform quantification files
                        produced by Salmon, Kallisto or FLAIR (FLAIR outputs
                        counts, not TPMs). The first column should contain
                        only the transcript_id and should exactly match the
                        transcript_ids in --gtf. All transcript_ids in each
                        TPM file must be common across all files and must be
                        found in the GTF file.
  -f {salmon,kallisto,flair,mandalorian,infer}, --type {salmon,kallisto,flair,mandalorian,infer}
                        Specify TPM input type. 'infer' guesses the input type
                        based on the header line. [default: "infer"].
  --mean MEAN           For a given isoform, the mean tpm/count across samples
                        must be at least this to be considered, else discard
                        isoform. [default: 0.0]
  --median MEDIAN       For a given isoform, the median tpm/count across
                        samples must be at least this to be considered, else
                        discard isoform. [default: 0.0]
  --min MIN             For a given isoform, each sample must have at least
                        this tpm/count to be considered, else discard isoform.
                        [default: 0.0]
  --max MAX             For a given isoform, at least one sample must have at
                        least this tpm/count to be considered, else discard
                        isoform. [default: 0.0]
  -n TOP_N, --top_n TOP_N
                        For a given gene, rank all isoforms by median_tpm,
                        keep the top_n ranked isoforms and discard the rest.
                        '-1' to keep all isoforms. [default: -1]
  -c {median,mean,min,max}, --top_n_column {median,mean,min,max}
                        The metric by which to rank and filter top isoforms.
                        Used with '-n' to select expressed transcripts.
                        [default: median]
  -m [{metagene,consensus,longest,shortest} ...], --model [{metagene,consensus,longest,shortest} ...]
                        Whether to output 'metagene', 'consensus', 'longest',
                        'shortest' model. 'longest' and 'shortest' select, for
                        a given gene, the transcript with the longest or
                        shortest CDS, for now noncoding genes are ignored.
                        Output is always after tpm filtering has been applied.
                        Multiple entries are allowed e.g. --model metagene
                        consensus longest [default: None]
  -w  TSS_WINDOW TSS_WINDOW, --tss_window TSS_WINDOW TSS_WINDOW
                        Pass two, space-separated, integers to specifiy the bp
                        window around the TSS as '<upstream>' '<downstream>'.
                        Strand-orientation is inferred, i.e. '<upstream>' will
                        be in the 5' direction of the TSS and <downstream> in
                        the 3' direction. e.g. --tss_window 250 150. [default:
                        None]
  -e  TES_WINDOW TES_WINDOW, --tes_window TES_WINDOW TES_WINDOW
                        Pass two, space-separated, integers to specifiy the bp
                        window around the transcription end site, TES, as
                        '<upstream>' '<downstream>'. Strand-orientation is
                        inferred, i.e. '<upstream>' will be in the 5'
                        direction of the TES and <downstream> in the 3'
                        direction. e.g. --tss_window 0 150. [default: None]
  -x  TX_TO_GENE, --tx_to_gene TX_TO_GENE
                        A TSV with transcript IDs in the first column and Gene
                        IDs in the second. The transcript IDs must match the
                        first column entries of the --quant_files. If this is
                        not provided it will be deduced from the GTF/GFF3 and
                        saved as
                        './annotations/intermediateFiles/tx2gene.tsv'.
  --strip_tx_id         Set this flag if there are transcript IDs in the 
                        quantification files but not in the GTF/GFF3. [default: False]
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output. [default: current directory]

crisprware index_genome

options:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        FASTA file to use as a reference for index creation.
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output. [default: current directory]
  --locations_to_keep [LOCATIONS_TO_KEEP ...]
                        List of BED/GTF files with coordinates to use for
                        index creation. These locations will be used for off-
                        target scoring. If multiple files are passed,
                        coordinates will be merged with a union operation.
                        Leave empty to use entire fasta.
  --feature FEATURE     For any GTF/GFF in '--locations_to_keep', only this
                        feature will be used for determining appropriate
                        sgRNA. The feature should match an entry in the third
                        column of the GTF/GFF. [default: 'transcript']
  -w CONTEXT_WINDOW CONTEXT_WINDOW, --context_window CONTEXT_WINDOW CONTEXT_WINDOW
                        Pass two, space-separated, integers to specifiy the
                        nucleotide window around the --locations_to_keep
                        '<upstream>' '<downstream>'. This can be used to
                        expand the window around the final intervals e.g. '-w
                        1000 1500' expands chr1 2000 3500 -> chr1 1000 5000
                        Good for CRISPRi/a [default: 20 20]

crisprware generate_guides

options:
  -h, --help            show this help message and exit
  -f FASTA, --fasta FASTA
                        FASTA file to use as a reference for sgRNA generation.
  -p PAM, --pam PAM     Protospacer adjacent motif to match. All IUPAC
                        ambiguity codes are accepted as well as standard ATCG.
                        [default: NGG]
  -l SGRNA_LENGTH, --sgRNA_length SGRNA_LENGTH
                        Length of sgRNA to generate. [default: 20]
  -w CONTEXT_WINDOW CONTEXT_WINDOW, --context_window CONTEXT_WINDOW CONTEXT_WINDOW
                        Pass two, space-separated, integers to specifiy the
                        nucleotide window around the sgRNA as '<upstream>'
                        '<downstream>'. This can be used for downstream
                        scoring, For Ruleset3 use -w 4 6 to obtain an
                        appropriate score context. [default: 4 6]
  -5 ACTIVE_SITE_OFFSET_5, --active_site_offset_5 ACTIVE_SITE_OFFSET_5
                        Where cut occurs relative to PAM 5' end. To avoid
                        error, use '=' sign when passing a negative number,
                        e.g. --active_site_offset_5=-1 [default: -4]
  -3 ACTIVE_SITE_OFFSET_3, --active_site_offset_3 ACTIVE_SITE_OFFSET_3
                        Where cut occurs relative to PAM 5' end. [default: -2]
                        To avoid error, use '=' sign when passing a negative
                        number, e.g. --active_site_offset_3=-3 [default: -4]
  -k [LOCATIONS_TO_KEEP ...], --locations_to_keep [LOCATIONS_TO_KEEP ...]
                        List of BED/GTF files with coordinates in which the
                        sgRNA desired. If the sgRNA cutsite does not intersect
                        coordinates in these files they are discarded. Leave
                        blank to keep all sgRNA. e.g. atac_peak.bed genes.gtf
  --feature FEATURE     For any GTF/GFF in '--locations_to_keep', only this
                        feature will be used for determining appropriate
                        sgRNA. The feature should match an entry in the third
                        column of the GTF/GFF. [default: 'exon']
  --join_operation {merge,intersect}
                        How to treat '--locations_to_keep' if multiple files
                        are passed. Either 'merge' or 'intersect' can be used
                        and work as described in Bedtools. If 'merge', sgRNA
                        will be kept if its cutsite intersects an entry in ANY
                        of the files, if 'intersect' the cutsite must
                        intersect an entry in EACH file. [default:
                        'intersect']
  --locations_to_discard [LOCATIONS_TO_DISCARD ...]
                        List of BED/GTF files with coordinates where sgRNA
                        should not target. If the sgRNA cutsite intersects
                        coordinates in these files the sgRNA is discarded.
                        Leave blank to keep all sgRNA. e.g. TSS.bed
                        coding_genes.gtf
  --prefix PREFIX       Prefix to use for sgRNA identifiers. [default: None]
  --gc_range GC_RANGE GC_RANGE
                        Pass two, space-separated, integers to specifiy the
                        percentile range of GC content e.g. '--gc_range 25
                        75'. [default: 0 100]
  --discard_poly_T      Whether to discard polyT (>TTT) sgRNA. Recommend True
                        for PolIII promoters [default: False]
  --discard_poly_G      Whether to discard polyT (>GGGG) sgRNA. [default:
                        False]
  --restriction_patterns [RESTRICTION_PATTERNS ...]
                        Reject sgRNA with these restriction patterns. Also
                        checks 5'flank+sgRNA+3'flank, and reverse complement,
                        if provided. For multiple values, separate by space.
                        e.g. GCGGCCGC TCTAGA CACCTGC
  --flank_5 FLANK_5     include the 5' context of the lentivirus vector. Used
                        in conjunction with --restriction_patterns to remove
                        incompatible sgRNA
  --flank_3 FLANK_3     include the 3' context of the lentivirus vector. Used
                        in conjunction with --restriction_patterns to remove
                        incompatible sgRNA
  --min_chr_length MIN_CHR_LENGTH
                        Minimum chromosome length to consider for sgRNA
                        generation. [default: 20]
  --pam_5_prime         If the PAM is positioned 5' to the protospacer set
                        this flag, e.g. for Cas12a sgRNAs [default: False]
  --coords_as_active_site
                        Whether to output bed coordinates at the active site
                        rather than the coordinates of the entire protospacer.
                        For purposes of keeping or discarding sgRNAs, overlap
                        with the active site coordinates will be used
                        regardless [default: True]
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output. [default: current directory]
  -t THREADS, --threads THREADS
                        Number of threads. [default: 4]

crisprware score_guides

options:
  -h, --help            show this help message and exit
  -b SGRNA_BED, --sgrna_bed SGRNA_BED
                        sgrnas.bed ouput of GenerateGuides.
  -i [GUIDESCAN2_INDICES ...], --guidescan2_indices [GUIDESCAN2_INDICES ...]
                        One or more, space-separate Guidescan2 indices. A
                        specificity score will be calculated against each
                        index separately.
  --tracr {Hsu2013,Chen2013,both}
                        TracrRNA version for cleavage scoring. Either
                        'Hsu2013' or 'Chen2013' or 'both', see
                        https://github.com/gpp-rnd/rs3 for details.
  --threshold THRESHOLD
                        Threshold for Guidescan2 off-target hits. If off-
                        targets are found this distance away the sgRNA will be
                        discarded, i.e. set to 2 to discard any guides with a
                        0, 1 or 2 mismatches from another PAM adjacent
                        sequence. --threshold=-1 to retain all guides
                        [default: 2]
  --mismatches MISMATCHES
                        Number of mismatches for Guidescan2 off-target scoring
                        [default: 3]
  --rna_bulges RNA_BULGES
                        RNA bulges for Guidescan2 off-target scoring [default:
                        0]
  --dna_bulges DNA_BULGES
                        DNA bulges for Guidescan2 off-target scoring [default:
                        0]
  --mode {succinct,complete}
                        Whether Guidescan2 temporary output should be succinct
                        or complete mode [default: 0]
  --alt_pams [ALT_PAMS ...]
                        One or more, space-separate alternative pams for off-
                        target consideration. e.g. NAG
  -d, --drop_duplicates
                        Drop exact duplicate sgRNAs before scoring to save
                        time. Set flag to retain duplicates. [default: True]
  --skip_rs3            Set flag to skip RS3 scoring [default: False]
  --skip_gs2            Set flag to skip Guidescan2 scoring [default: False]
  --min_rs3 MIN_RS3     Minimum cleavage RS3 score. RS3 cleavage scores are
                        formatted as z-scores, so this is interpreted as a
                        standard deviation cutoff. Functionality also
                        available in rank_guides.py. Applying at this stage
                        can increase speed by filtering before off-target
                        scoring. [default: None]
  --chunk_size CHUNK_SIZE
                        Number of sgRNAs to hold in memory for cleavage
                        scoring and off-target filtering. Reduce if memory
                        constrained. Increasing may improve runtime [default:
                        100000]
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output. [default: current directory]
  -k, --keep_tmp        Set flag to keep temporary Guidescan2 output [default:
                        False]
  -t THREADS, --threads THREADS
                        Number of threads [default: 8]

crisprware rank_guides

options:
  -h, --help            show this help message and exit
  -k SCORED_GUIDES, --scored_guides SCORED_GUIDES
                        <score_guides_output>.tsv output from score_guides.
  -t TARGETS, --targets TARGETS
                        BED/GTF/GFF used to select final guides per target.
                        For GTF/GFF, set --target_mode to either 'gene' or
                        'transcript'. For BED, targets are each entry. Use '--
                        number_of_targets' to set the number of guides chosen
                        for each target.
  --target_mode {gene,tx}
                        If a GTF/GFF is used to select targets, sgRNAs can be
                        grouped at either the 'tx' or 'gene' level e.g. '--
                        target_mode gene -n 10' chooses 10 guides per gene, '
                        --target_mode tx -n 10' chooses 10 per transcript
                        [default: gene].
  -f FEATURE, --feature FEATURE
                        If GTF/GFF passed, use this feature for processing
                        e.g. 'exon', 'CDS', '5UTR', etc. The feature appears
                        in the third column of the GTF/GFF [default: CDS].
  -p PERCENTILE_RANGE PERCENTILE_RANGE, --percentile_range PERCENTILE_RANGE PERCENTILE_RANGE
                        Allowable range of guide for each transcript and
                        feature set, e.g. '-p 60 80 -f exon' returns sgRNAs in
                        the 60th to 80th percentile of exons for a given
                        transcript. Default setting returns guides anywhere in
                        the CDS for each transcript [default: 0 100]
  -n NUMBER_OF_GUIDES, --number_of_guides NUMBER_OF_GUIDES
                        Number of guides returned per target.'-1' to keep all
                        guides [default: -1]
  --min_spacing MIN_SPACING
                        The minimum nucleotide space between guides for a
                        given target. e.g. --min_spacing 10, requires guides
                        10 nts appart. 0 to allow overlapping guides.[default:
                        0]
  --output_all          Set flag to save sgRNA-target TSVs at each stage of
                        filtering rather than just the end.[default: False]
  --plot_histogram      Set flag to plot a histogram of the distribution of
                        sgRNAs per target after each filtering step. Sets '--
                        output_all' to True.[default: False]
  -c [FILTERING_COLUMNS ...], --filtering_columns [FILTERING_COLUMNS ...]
                        One or more space-separated column names used for
                        filtering. Uses raw values. e.g. '-c rs3_z_score
                        specificity_Hg38_index'.
  -m [MINIMUM_VALUES ...], --minimum_values [MINIMUM_VALUES ...]
                        A space-separated list of minimum values for each
                        column in passed by --ranking_columns. e.g. '-c
                        rs3_z_score specificity_Hg38_index -m "-1" 0.2'
                        Default is no minimum [default: None]
  -r [RANKING_COLUMNS ...], --ranking_columns [RANKING_COLUMNS ...]
                        One or more space-separated column names used for
                        guide ranking. e.g. '-r rs3_score_Hsu2013
                        rs3_score_Chen2013'.
  -w [COLUMN_WEIGHTS ...], --column_weights [COLUMN_WEIGHTS ...]
                        A space-separated list of weight values for each
                        column in passed by --ranking_columns. e.g. '-c
                        rs3_score specificity_Hg38_index -w 1 0' Default is
                        equal weighting for all ranking columns.
  --normalize_columns   Scale ranking column values to 0 to 1 [default: True]
  -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        Path to output. [default: current directory]
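
To make the interaction of --ranking_columns, --column_weights and --normalize_columns more concrete, here is a minimal Python sketch of one plausible scheme: min-max scale each ranking column to 0-1, take a weighted sum, and sort in descending order. Whether rank_guides combines columns exactly this way is an assumption; the function names and the scores are made up.

```python
def min_max(values):
    """Scale values linearly to the range 0 to 1 (all zeros if constant)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def rank_by_weighted_score(guides, ranking_columns, weights=None,
                           normalize=True):
    """Return guide ids ordered from best to worst combined score."""
    if weights is None:
        weights = [1.0] * len(ranking_columns)  # equal weighting by default
    if len(weights) != len(ranking_columns):
        raise ValueError("one weight is required per ranking column")
    totals = [0.0] * len(guides)
    for column, weight in zip(ranking_columns, weights):
        values = [guide[column] for guide in guides]
        if normalize:
            values = min_max(values)
        for index, value in enumerate(values):
            totals[index] += weight * value
    order = sorted(range(len(guides)), key=lambda i: totals[i], reverse=True)
    return [guides[i]["id"] for i in order]


# Made-up scores for three guides.
guides = [
    {"id": "g1", "rs3_score": 2.0, "specificity_Hg38_index": 0.2},
    {"id": "g2", "rs3_score": 1.0, "specificity_Hg38_index": 1.0},
    {"id": "g3", "rs3_score": 0.0, "specificity_Hg38_index": 0.6},
]
columns = ["rs3_score", "specificity_Hg38_index"]

equal = rank_by_weighted_score(guides, columns)
rs3_only = rank_by_weighted_score(guides, columns, weights=[1, 0])
print(equal, rs3_only)
```

With equal weights g2 ranks first because it is strong on both columns; with weights 1 0 the ranking follows rs3_score alone.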


conda activate <crisprscore env>

crisprscore_multi.R

	Usage: crisprscore.R <path_to_sgrna_bed_file> <comma_separated_method_numbers> <outputfile> <enzyme> <5prime_flank_length> <3prime_flank_length> [--chunk-size <size>] [--debug]
	Example: crisprscore.R input.tsv 1,2,5 output_scored.tsv Cas9 13 29
	Example with debug: crisprscore.R input.tsv 1,2,11 output_scored.tsv Cas12a 4 3 --debug

	Input TSV format:
	Required column: 'context' (can be in any position)
	All other columns are preserved in the output

	Enzyme types:
	Cas9 - Use for SpCas9-based scoring methods (1-14)
	Cas12a - Use for AsCas12a-based scoring methods (15-17)

	Context format:
	For Cas9: [5' flank] + [20nt spacer] + [3nt PAM] + [3' flank]
	For Cas12a: [5' flank] + [4nt PAM] + [23nt spacer] + [3' flank]
	Specify the lengths of your 5' and 3' flanks as arguments
	The script will automatically trim to the required length for each scoring method

	Scoring Methods for Cas9:
	1:  RuleSet1 - SpCas9 (Length: 30)
	2:  Azimuth - SpCas9 (Length: 30)
	3:  DeepHF_WT_U6 - SpCas9 (Length: 23)
	4:  DeepHF_WT_T7 - SpCas9 (Length: 23)
	5:  DeepHF_ESP_U6 - SpCas9 (Length: 23)
	6:  DeepHF_ESP_T7 - SpCas9 (Length: 23)
	7:  DeepHF_HF_U6 - SpCas9 (Length: 23)
	8:  DeepHF_HF_T7 - SpCas9 (Length: 23)
	9:  Lindel - SpCas9 (Length: 65)
	10: CRISPRscan - SpCas9 (Length: 35)
	11: CRISPRater - SpCas9 (Length: 20, spacer only)
	12: DeepSpCas9 - SpCas9 (Length: 30)
	13: RuleSet3_Hsu2013 - SpCas9 (Length: 30)
	14: RuleSet3_Chen2013 - SpCas9 (Length: 30)

	Scoring Methods for Cas12a:
	15: DeepCpf1 - AsCas12a (Length: 34, canonical PAM conversion)
	16: DeepCpf1_noConvert - AsCas12a (Length: 34, no PAM conversion)
	17: EnPAMGB - enAsCas12a (Length: 34)

	Note: CasRx-RF and CRISPRai methods are not currently available

	Optional arguments:
	--chunk-size <size>: Process dataframe in chunks of specified size (default: entire file)
	--debug: Show detailed trimming information for each method (shows what sequence is sent to each scoring function)

References

When using CRISPRware in your research, please cite:

And the score method(s) you used:
