diff --git a/README.md b/README.md index 9e3bb76..941767a 100644 --- a/README.md +++ b/README.md @@ -24,6 +24,13 @@ In summary, ReporTree facilitates and accelerates the production of surveillance _Note: this tool relies on the usage of programs/modules of other developers. DO NOT FORGET TO ALSO [CITE](https://github.com/insapathogenomics/ReporTree/edit/main/README.md#citation) THEM!_ ## News! +#### 2025.10.22 - ReporTree v2.6.0 +We release a new version of ReporTree that, besides fixing some small issues, also brings important new features: +1. ReporTree is much more efficient in filtering allele matrices based on the "--loci-called", "--site-inclusion" or "--loci" arguments. +2. Reports with the percentage of loci called per sample are always produced, independently of whether the "--loci-called" argument was specified. +3. New argument "--zoom-all", which makes it easier to perform a zoom-in analysis of all clusters detected at a given threshold. +4. New argument "--update-cluster-names", which forces ReporTree to update the information in metadata columns that correspond to thresholds from a previous ReporTree run. + #### 2024.12.03 - ReporTree v2.5.4 We release a new version of ReporTree that fixes the cluster and singleton names in the _clusterComposition.tsv_ file when a previous nomenclature is provided. Of note, this issue only affected the _clusterComposition.tsv_ file. All the others were correct and, therefore, this update does not bring major novelties. @@ -142,7 +149,7 @@ _Details about these and other situations are presented in the following table:_ Of note, to increase the flexibility of the nomenclature system, ReporTree also allows users to replace the regular expression for cluster nomenclature (i.e., names starting with “cluster_” or “singleton_”) with other nomenclature of interest (e.g., other official codes for outbreaks, genogroups, etc.), which will be kept afterwards. 
If the cluster/singleton names do not follow ReporTree's regular expression, the rules mentioned in the above table will be applied, except for the particular situation in which the name of a former singleton does not follow ReporTree's regular expression. In this case, if a singleton named 'mycode' is integrated into a cluster containing only new samples, this new cluster will be named 'mycode'. ## Cluster zoom-in and subtree of interest -ReporTree cluster zoom-in (i.e. the application of '--zoom-cluster-of-interest' and '--subtree-of-interest' arguments) was designed to allow a gain of resolution in clusters of interest by dynamically increasing their set of core loci. +ReporTree cluster zoom-in (i.e. the application of the '--zoom-cluster-of-interest', '--subtree-of-interest' or '--zoom-all' arguments) was designed to allow a gain of resolution in clusters of interest by dynamically increasing their set of core loci. The following table summarizes the behavior of the tool according to the user's specifications. In summary: @@ -239,68 +246,92 @@ ReporTree: [MANDATORY] Metadata file (tsv format). To take the most profit of ReporTree functionalities, you must provide this file. -vcf VCF, --vcf VCF Single-column list of VCF files (txt format). This file must comprise the full PATH to each vcf file. -var VARIANTS, --variants VARIANTS - Input table (tsv format) with sample name in the first column and a comma-separated list of variants in the second column with the following regular expression: '\w(\d+)\w' + Input table (tsv format) with sample name in the first column and a comma-separated list of variants in the second column with the + following regular expression: '\w(\d+)\w' -out OUTPUT, --output OUTPUT [OPTIONAL] Tag for output file name (default = ReporTree) - -l LOCI, --loci LOCI [OPTIONAL] List of loci (e.g. cgMLST) that will be used for the clustering analysis. 
If '--zoom-cluster-of-interest' or the '--subtree-of-interest' are set, in the zoom-in/subtree analysis, this list will be extended with additional loci (e.g., accessory - loci in the wgMLST schema) that pass the criteria defined in the '--site-inclusion' argument. If --loci is set, the --site-inclusion argument will be ignored during the first clustering, i.e. it will only be applied at the zoom-in/subtree step. - --list [OPTIONAL] If after your command line you specify this option, ReporTree will list all the possible columns that you can use as input in '--columns_summary_report'. To obtain information about the partition name for other arguments('--frequency-matrix' - and/or '--count-matrix'), please also indicate the type of analysis. NOTE!! The objective of this argument is to help you with the input of some other arguments. So, it will not run reportree.py main functions!! + -l LOCI, --loci LOCI [OPTIONAL] List of loci (e.g. cgMLST) that will be used for the clustering analysis. If '--zoom-cluster-of-interest' or the '-- + subtree-of-interest' are set, in the zoom-in/subtree analysis, this list will be extended with additional loci (e.g., accessory + loci in the wgMLST schema) that pass the criteria defined in the '--site-inclusion' argument. If --loci is set, the --site- + inclusion argument will be ignored during the first clustering, i.e. it will only be applied at the zoom-in/subtree step. + --list [OPTIONAL] If after your command line you specify this option, ReporTree will list all the possible columns that you can use as + input in '--columns_summary_report'. To obtain information about the partition name for other arguments('--frequency-matrix' + and/or '--count-matrix'), please also indicate the type of analysis. NOTE!! The objective of this argument is to help you with the + input of some other arguments. So, it will not run reportree.py main functions!! 
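The interplay between '--loci', '--site-inclusion' and '--loci-called' described in this help text can be illustrated with a minimal sketch. This is a hypothetical helper written for illustration only (not ReporTree's actual implementation), assuming an allele matrix where 0 marks missing data, the default missing code for profiles:

```python
# Hypothetical helper illustrating the documented filtering order:
# restrict to the '--loci' list, then drop loci with too much missing
# data ('--site-inclusion' works on columns), then drop samples with
# too few called loci ('--loci-called' works on rows). 0 = missing.
def clean_allele_matrix(matrix, loci, site_inclusion=0.0, loci_called=0.0, keep=None):
    if keep is not None:                      # mimics the '--loci' restriction
        wanted = set(keep)
        loci = [l for l in loci if l in wanted]
    n_samples = len(matrix)
    # '--site-inclusion': minimum proportion of samples called per locus
    loci = [l for l in loci
            if sum(matrix[s][l] != 0 for s in matrix) >= site_inclusion * n_samples]
    # '--loci-called': keep samples with > the given proportion of called loci
    return {s: {l: row[l] for l in loci} for s, row in matrix.items()
            if loci_called == 0 or sum(row[l] != 0 for l in loci) > loci_called * len(loci)}
```

For example, with `site_inclusion=1.0` only loci called in every sample survive, reproducing the "core profile" behavior the help text describes for '--site-inclusion 1.0'.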
Analysis details: Analysis details - --analysis ANALYSIS Type of clustering analysis (options: grapetree, HC, treecluster). If you provide a tree, genetic clusters will always be obtained with treecluster. If you provide a distance matrix, genetic clusters will always be obtained with HC. If you provide any - other input, it is MANDATORY to specify this argument. + --analysis ANALYSIS Type of clustering analysis (options: grapetree, HC, treecluster). If you provide a tree, genetic clusters will always be obtained + with treecluster. If you provide a distance matrix, genetic clusters will always be obtained with HC. If you provide any other + input, it is MANDATORY to specify this argument. --subset [OPTIONAL] Obtain genetic clusters using only the samples that correspond to the filters specified in the '--filter' argument. - -d DIST, --dist DIST [OPTIONAL] Distance unit by which partition thresholds will be multiplied (example: if -d 10 and -thr 5,8,10-30, the minimum spanning tree will be cut at 50,80,100,110,120,...,300. If -d 10 and --method-threshold avg_clade-2, the avg_clade threshold will - be set at 20). This argument is particularly useful for non-SNP-scaled trees. Currently, the default is 1, which is equivalent to 1 allele distance or 1 SNP distance. [1.0] + -flag, --flag-multiple-clusters + [OPTIONAL] Generate an additional file with information about the + -d DIST, --dist DIST [OPTIONAL] Distance unit by which partition thresholds will be multiplied (example: if -d 10 and -thr 5,8,10-30, the minimum + spanning tree will be cut at 50,80,100,110,120,...,300. If -d 10 and --method-threshold avg_clade-2, the avg_clade threshold will + be set at 20). This argument is particularly useful for non-SNP-scaled trees. Currently, the default is 1, which is equivalent to + 1 allele distance or 1 SNP distance. 
[1.0] Cleaning missing data: Remove loci/positions and samples based on missing content --missing-code MISSING_CODE - [OPTIONAL] Code representing missing data. If different from '0' or 'N', please try to avoid a IUPAC character (even in lower-case). [default: 'N' when '-align' provided and '0' for other inputs] + [OPTIONAL] Code representing missing data. If different from '0' or 'N', please try to avoid an IUPAC character (even in lower- + case). [default: 'N' when '-align' provided and '0' for other inputs] --site-inclusion N_CONTENT - [OPTIONAL: Useful to remove informative sites/loci with excess of missing data] Minimum proportion of samples per site without missing data (e.g. '--site-inclusion 1.0' will only keep loci/positions without missing data, i.e. a core alignment/profile; ' - --site-inclusion 0.0' will keep all loci/positions) NOTE: This argument works on profile/alignment loci/positions (i.e. columns)! [default: 0.0]. + [OPTIONAL: Useful to remove informative sites/loci with excess of missing data] Minimum proportion of samples per site without + missing data (e.g. '--site-inclusion 1.0' will only keep loci/positions without missing data, i.e. a core alignment/profile; '-- + site-inclusion 0.0' will keep all loci/positions) NOTE: This argument works on profile/alignment loci/positions (i.e. columns)! + [default: 0.0]. --loci-called LOCI_CALLED - [OPTIONAL - only works for matrices] Minimum proportion of loci/positions called for SNP/allele matrices (e.g. '--loci-called 0.95' will only keep in the profile matrix samples with > 95% of alleles/positions, i.e. <= 5% missing data). Applied after '-- + [OPTIONAL - only works for matrices] Minimum proportion of loci/positions called for SNP/allele matrices (e.g. '--loci-called + 0.95' will only keep in the profile matrix samples with > 95% of alleles/positions, i.e. <= 5% missing data). Applied after '-- + site-inclusion' argument! 
[default: 0.0] --sample-ATCG-content ATCG_CONTENT - [OPTIONAL - only works for alignment] Minimum proportion (0 to 1) of ATCG in informative sites of the alignment per sample (e.g. '--sample-ATCG-content 1.0' will only keep samples without N's or any non-ATCG code in informative sites) [default: 0 - keep - all samples] + [OPTIONAL - only works for alignment] Minimum proportion (0 to 1) of ATCG in informative sites of the alignment per sample (e.g. ' + --sample-ATCG-content 1.0' will only keep samples without N's or any non-ATCG code in informative sites) [default: 0 - keep all + samples] Alignment processing: Alignment processing - --remove-reference Set only if you want to remove the reference sequence of the alignment (reference name must be provided with the argument '--reference'). + --remove-reference Set only if you want to remove the reference sequence of the alignment (reference name must be provided with the argument '-- + reference'). --use-reference-coords - Set only if you want that column names in the final alignment matrix represent the reference coordinates (reference name must be provided with the argument '--reference') Note: Depending on the alignment size, this argument can make alignment processing - very slow! + Set only if you want that column names in the final alignment matrix represent the reference coordinates (reference name must be + provided with the argument '--reference') Note: Depending on the alignment size, this argument can make alignment processing very + slow! -r REFERENCE, --reference REFERENCE [OPTIONAL] Name of reference sequence. Required if '--remove-reference' and/or '--use-reference-coords' specified. --get-position-correspondence POS_CORR - [OPTIONAL] Request a .tsv with position correspondence between any sequences of your alignment. These should be indicated separated by a comma (e.g. seqA,seqB). To get the position coordinates of all sequences just write 'all'. 
+ [OPTIONAL] Request a .tsv with position correspondence between any sequences of your alignment. These should be indicated + separated by a comma (e.g. seqA,seqB). To get the position coordinates of all sequences just write 'all'. --position-list POS_LIST - [OPTIONAL] .tsv file with the positions of interest to be reported when '--get-position-correspondence' is requested. Each column should correspond to the positions of a sequence and the sequence name should be indicated in the header. If this file is - not provided, all positions of the alignment will be reported. + [OPTIONAL] .tsv file with the positions of interest to be reported when '--get-position-correspondence' is requested. Each column + should correspond to the positions of a sequence and the sequence name should be indicated in the header. If this file is not + provided, all positions of the alignment will be reported. Partitioning with GrapeTree: Specifications to get and cut minimum spanning trees --method GRAPETREE_METHOD "MSTreeV2" [DEFAULT] Alternative:"MSTree (goeBURST)" - --missing HANDLER ONLY FOR MSTree. 0: [DEFAULT] ignore missing data in pairwise comparison. 1: remove column with missing data. 2: treat missing data as an allele. 3: use absolute number of allelic differences. + --missing HANDLER ONLY FOR MSTree. 0: [DEFAULT] ignore missing data in pairwise comparison. 1: remove column with missing data. 2: treat missing + data as an allele. 3: use absolute number of allelic differences. --n_proc NUMBER_OF_PROCESSES Number of CPU processes in parallel use. [5] -thr THRESHOLD, --threshold THRESHOLD - [OPTIONAL] Partition threshold for clustering definition (integer). Different thresholds can be comma-separated (e.g. 5,8,16). Ranges can be specified with a hyphen separating minimum and maximum (e.g. 5,8,10-20). If this option is not set, the script - will perform clustering for all the values in the range 0 to max. 
If you prefer to exclusively use the '-pct_thr' argument, please set '-thr none'. Note: Threshold values are inclusive, i.e. '-thr 7' will consider samples with <= 7 differences as - belonging to the same cluster! + [OPTIONAL] Partition threshold for clustering definition (integer). Different thresholds can be comma-separated (e.g. 5,8,16). + Ranges can be specified with a hyphen separating minimum and maximum (e.g. 5,8,10-20). If this option is not set, the script will + perform clustering for all the values in the range 0 to max. If you prefer to exclusively use the '-pct_thr' argument, please set + '-thr none'. Note: Threshold values are inclusive, i.e. '-thr 7' will consider samples with <= 7 differences as belonging to the + same cluster! -pct_thr PCT_THRESHOLD, --pct_threshold PCT_THRESHOLD - [OPTIONAL] Similar to 'thr' but values are indicated as the proportion of differences to the final allelic schema size or number of informative positions, e.g. '-pct_thr 0.005' corresponds to a threshold of 5 allelic/SNP differences in a matrix with 1000 - loci/sites under analysis). Different values can be comma-separated (e.g. 0.005,0.01,0.1). Ranges CANNOT be specified. This option is particularly useful for dynamic wgMLST analysis for which the size of the schema under analysis is contigent on dataset + [OPTIONAL] Similar to 'thr' but values are indicated as the proportion of differences to the final allelic schema size or number + of informative positions, e.g. '-pct_thr 0.005' corresponds to a threshold of 5 allelic/SNP differences in a matrix with 1000 + loci/sites under analysis. Different values can be comma-separated (e.g. 0.005,0.01,0.1). Ranges CANNOT be specified. This option + is particularly useful for dynamic wgMLST analysis for which the size of the schema under analysis is contingent on dataset 
--matrix-4-grapetree Output an additional allele profile matrix with the header ready for GrapeTree visualization. Set only if you WANT the file! --wgMLST [EXPERIMENTAL] a better support of wgMLST schemes (check GrapeTree github for details). @@ -309,76 +340,129 @@ Partitioning with HC: Specifications to genetic clusters with hierarchical clustering --HC-threshold HCMETHOD_THRESHOLD - [OPTIONAL] List of HC methods and thresholds to include in the analysis (comma-separated). To get clustering at all possible thresholds for a given method, write the method name (e.g. single). To get clustering at a specific threshold, indicate the - threshold with a hyphen (e.g. single-10). To get clustering at a specific range, indicate the range with a hyphen separating minimum and maximum (e.g. single-2-10). If you prefer to exclusively use the '--pct-HC-threshold' argument, please set '--HC- - threshold none'. Note: Threshold values are inclusive, i.e. '--HC-threshold single-7' will consider samples with <= 7 differences as belonging to the same cluster! Default: single (Possible methods: single, complete, average, weighted, centroid, median, - ward) + [OPTIONAL] List of HC methods and thresholds to include in the analysis (comma-separated). To get clustering at all possible + thresholds for a given method, write the method name (e.g. single). To get clustering at a specific threshold, indicate the + threshold with a hyphen (e.g. single-10). To get clustering at a specific range, indicate the range with a hyphen separating + minimum and maximum (e.g. single-2-10). If you prefer to exclusively use the '--pct-HC-threshold' argument, please set '--HC- + threshold none'. Note: Threshold values are inclusive, i.e. '--HC-threshold single-7' will consider samples with <= 7 differences + as belonging to the same cluster! 
Default: single (Possible methods: single, complete, average, weighted, centroid, median, ward) --pct-HC-threshold PCT_HCMETHOD_THRESHOLD - [OPTIONAL] Similar to '--HC-threshold' but the partition threshold for cluster definition is set as the proportion of differences to the final allelic schema size or number of informative positions, e.g. '--pct-HC-threshold single-0.005' corresponds to a + [OPTIONAL] Similar to '--HC-threshold' but the partition threshold for cluster definition is set as the proportion of differences + to the final allelic schema size or number of informative positions, e.g. '--pct-HC-threshold single-0.005' corresponds to a threshold of 5 allelic/SNP differences in a matrix with 1000 loci/sites under analysis. Ranges CANNOT be specified. Partitioning with TreeCluster: Specifications to cut the tree with TreeCluster --method-threshold METHOD_THRESHOLD - List of TreeCluster methods and thresholds to include in the analysis (comma-separated). To get clustering at all possible thresholds for a given method, write the method name (e.g. root_dist). To get clustering at a specific threshold, indicate the - threshold with a hyphen (e.g. root_dist-10). To get clustering at a specific range, indicate the range with a hyphen (e.g. root_dist-2-10). Default: root_dist,avg_clade-1 (List of possible methods: avg_clade, leaf_dist_max, leaf_dist_min, length, - length_clade, max, max_clade, root_dist, single_linkage, single_linkage_cut, single_linkage_union) Warning!! So far, ReporTree was only tested with avg_clade and root_dist! + List of TreeCluster methods and thresholds to include in the analysis (comma-separated). To get clustering at all possible + thresholds for a given method, write the method name (e.g. root_dist). To get clustering at a specific threshold, indicate the + threshold with a hyphen (e.g. root_dist-10). To get clustering at a specific range, indicate the range with a hyphen (e.g. + root_dist-2-10). 
Default: root_dist,avg_clade-1 (List of possible methods: avg_clade, leaf_dist_max, leaf_dist_min, length, + length_clade, max, max_clade, root_dist, single_linkage, single_linkage_cut, single_linkage_union) Warning!! So far, ReporTree was + only tested with avg_clade and root_dist! --support SUPPORT [OPTIONAL: see TreeCluster github for details] Branch support threshold - --root-dist-by-node [OPTIONAL] Set only if you WANT to cut the tree with root_dist method at each tree node distance to the root (similar to root_dist at all levels but just for informative distances) + --root-dist-by-node [OPTIONAL] Set only if you WANT to cut the tree with root_dist method at each tree node distance to the root (similar to root_dist + at all levels but just for informative distances) -root ROOT, --root ROOT - Set root of the input tree. Specify the leaf name to use as output. Alternatively, write 'midpoint', if you want to apply midpoint rooting method. + Set root of the input tree. Specify the leaf name to use as output. Alternatively, write 'midpoint', if you want to apply midpoint + rooting method. ReporTree cluster nomenclature: Cluster nomenclature instructions --nomenclature-file NOMENCLATURE - [OPTIONAL] Intended usage: provide a .tsv file with the nomenclature information to be used (normaly the '*_partitions.tsv' file of a previous ReporTree run) with the following structure: First column should have the sample names; Subsequent columns have - the partitions at any/all levels. Columns matching the column names of the partitions requested in the current run (e.g. MST-7x1.0) will be used to name the current clusters at the respective levels. For more details on how to use this nomenclature - system please visit our github! Alternative usage (at your own risk): you can provide a .tsv file with just a single column with a grouping variable that does not match any partition requested in the current run (e.g. ST). 
In this case, all clusters will - be named according to this column (e.g. ST1.1,ST1.2,ST2.1...). + [OPTIONAL] Intended usage: provide a .tsv file with the nomenclature information to be used (normally the '*_partitions.tsv' file + of a previous ReporTree run) with the following structure: First column should have the sample names; Subsequent columns have the + partitions at any/all levels. Columns matching the column names of the partitions requested in the current run (e.g. MST-7x1.0) + will be used to name the current clusters at the respective levels. For more details on how to use this nomenclature system please + visit our github! Alternative usage (at your own risk): you can provide a .tsv file with just a single column with a grouping + variable that does not match any partition requested in the current run (e.g. ST). In this case, all clusters will be named + according to this column (e.g. ST1.1,ST1.2,ST2.1...). --nomenclature-code-levels CODE_LEVELS - [OPTIONAL and only available for --analysis grapetree or HC] This argument allows getting a nomenclature code combining cluster information at different hierarchical levels (e.g. if '150,30,7' is provided when the analysis is GrapeTree, a code combining - the cluster names at these levels will be generated: C3-C2-C1). The order of levels indicated in the command-line will be kept in the code (in the example, C3 indicates that the sample belongs to cluster_3 at MST-150). Of note, if a '--nomenclature-file' - is provided in subsequent ReporTree runs using the same method and thresholds, the nomenclature code will be kept. You can also add one metadata variable (e.g. Country) to get an extra layer to the code (e.g. C3-C2-C1-Portugal). Partition thresholds can - be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC. 
[OPTIONAL and only available for --analysis grapetree or HC] This argument allows getting a nomenclature code combining cluster + information at different hierarchical levels (e.g. if '150,30,7' is provided when the analysis is GrapeTree, a code combining the + cluster names at these levels will be generated: C3-C2-C1). The order of levels indicated in the command-line will be kept in the + code (in the example, C3 indicates that the sample belongs to cluster_3 at MST-150). Of note, if a '--nomenclature-file' is + provided in subsequent ReporTree runs using the same method and thresholds, the nomenclature code will be kept. You can also add + one metadata variable (e.g. Country) to get an extra layer to the code (e.g. C3-C2-C1-Portugal). Partition thresholds can be + indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and ' + --pct-HC-threshold' for HC. + --update-cluster-names + [OPTIONAL] Activate the cluster update mode. If you set this argument, and the metadata you provided has column names that + correspond to partition thresholds from a previous ReporTree run using the same analysis method as in the current run, ReporTree + will provide updated cluster names in the *metadata_w_partitions.tsv, instead of adding new columns with the prefix subset. This + argument is only applicable if you provide a metadata file. To use this argument you do not need to provide a nomenclature file + nor request a nomenclature code. Please be careful when using this argument in combination with '--subset' because ReporTree will + perform a run on the subset samples you selected AND update the *metadata_w_partitions.tsv with the new cluster names instead of + adding new columns. ReporTree metadata report: Specific parameters to report clustering/grouping information associated to metadata --columns_summary_report COLUMNS_SUMMARY_REPORT - Columns (i.e. 
variables of metadata) to get statistics for the derived genetic clusters or for other grouping variables defined in --metadata2report (comma-separated). If the name of the column is provided, the different observations and the respective - percentage are reported. If 'n_column' is specified, the number of the different observations is reported. For example, if 'n_country' and 'country' are specified, the summary will report the number of countries and their distribution (percentage) per - cluster/group. Exception: if a 'date' column is in the metadata, it can also report first_seq_date, last_seq_date, timespan_days. Check '--list' argument for some help. Default = - n_sequence,lineage,n_country,country,n_region,first_seq_date,last_seq_date,timespan_days [the order of the list will be the order of the columns in the report] + Columns (i.e. variables of metadata) to get statistics for the derived genetic clusters or for other grouping variables defined in + --metadata2report (comma-separated). If the name of the column is provided, the different observations and the respective + percentage are reported. If 'n_column' is specified, the number of the different observations is reported. For example, if + 'n_country' and 'country' are specified, the summary will report the number of countries and their distribution (percentage) per + cluster/group. Exception: if a 'date' column is in the metadata, it can also report first_seq_date, last_seq_date, timespan_days. + Check '--list' argument for some help. Default = + n_sequence,lineage,n_country,country,n_region,first_seq_date,last_seq_date,timespan_days [the order of the list will be the order + of the columns in the report] --partitions2report PARTITIONS2REPORT - Thresholds for which clustering information will be included in a joint report (comma-separated). Other alternatives: 'all' == all partitions; 'stability_regions' == first partition of each stability region as determined by comparing_partitions_v2.py. 
- Note: 'stability_regions' can only be inferred when partitioning TreeCluster or GrapeTree or HC is run for all possible thresholds or when a similar partitions table is provided (i.e. sequential partitions obtained with the same clustering method) [all]. - Partition thresholds can be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC or '--method-threshold' for TreeCluster. + Thresholds for which clustering information will be included in a joint report (comma-separated). Other alternatives: 'all' == all + partitions; 'stability_regions' == first partition of each stability region as determined by comparing_partitions_v2.py. Note: + 'stability_regions' can only be inferred when partitioning TreeCluster or GrapeTree or HC is run for all possible thresholds or + when a similar partitions table is provided (i.e. sequential partitions obtained with the same clustering method) [all]. Partition + thresholds can be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '-- + HC-threshold' and '--pct-HC-threshold' for HC or '--method-threshold' for TreeCluster. --metadata2report METADATA2REPORT Columns of the metadata table for which a separated summary report must be provided (comma-separated) -f FILTER_COLUMN, --filter FILTER_COLUMN - [OPTIONAL] Filter for metadata columns to select the samples to analyze. This must be specified within quotation marks in the following format 'column< >operation< >condition' (e.g. 'country == Portugal'). When more than one condition is specified for a - given column, they must be separated with commas (e.g 'country == Portugal,Spain,France'). When filters include more than one column, they must be separated with semicolon (e.g. 'country == Portugal,Spain,France;date > 2018-01-01;date < 2022-01-01'). 
White spaces are important in this argument, so, do not leave spaces before and after commas/semicolons. + [OPTIONAL] Filter for metadata columns to select the samples to analyze. This must be specified within quotation marks in the + following format 'column< >operation< >condition' (e.g. 'country == Portugal'). When more than one condition is specified for a + given column, they must be separated with commas (e.g. 'country == Portugal,Spain,France'). When filters include more than one + column, they must be separated with semicolon (e.g. 'country == Portugal,Spain,France;date > 2018-01-01;date < 2022-01-01'). White + spaces are important in this argument, so, do not leave spaces before and after commas/semicolons. + --check-cluster-incongruencies + [OPTIONAL and only available for --analysis grapetree or HC] Perform a comparison between the genetic clusters obtained and the + distance matrix to identify incongruencies between the two (e.g. isolates A and B differ by 10 allele differences but are not + clustered at this threshold). --sample_of_interest SAMPLE_OF_INTEREST - [OPTIONAL] List of samples of interest for which summary reports will be created. This list can be a comma-separated list in the command line, or a comma-separated list in a file, or a list in the first column of a tsv file. No headers should be provided - in the input files. If nothing is provided, only the summary reports comprising all samples will be generated. + [OPTIONAL] List of samples of interest for which summary reports will be created. This list can be a comma-separated list in the + command line, or a comma-separated list in a file, or a list in the first column of a tsv file. No headers should be provided in + the input files. If nothing is provided, only the summary reports comprising all samples will be generated. 
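As a rough illustration of the '--filter' syntax described in this help entry, here is a hypothetical toy parser (written for illustration only, not ReporTree's code): clauses are separated by semicolons, fields by single spaces, and multiple values by commas:

```python
# Toy parser for the documented '--filter' format
# 'column< >operation< >condition', e.g.
# 'country == Portugal,Spain,France;date > 2018-01-01'
def parse_filter(expr):
    rules = []
    for clause in expr.split(";"):
        # split on the first two spaces only, so the condition keeps its commas
        column, operation, condition = clause.split(" ", 2)
        rules.append((column, operation, condition.split(",")))
    return rules

parse_filter("country == Portugal,Spain,France;date > 2018-01-01")
# [('country', '==', ['Portugal', 'Spain', 'France']), ('date', '>', ['2018-01-01'])]
```

This also makes visible why stray spaces around commas and semicolons break parsing, as the help text warns.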
--zoom-cluster-of-interest ZOOM - [OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using only the samples that belong to each cluster of the samples of interest at a given distance threshold. This argument takes as input a comma-separated list of - partitions for which you want the zoom-in. Partition thresholds can be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC. This argument requires - that a metadata table was provided with '-m'. The argument '--loci-called' is not applied in the zoom-in analysis, i.e. all samples of the cluster are included. Default: no zoom-in. + [OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using only the samples that belong to each + cluster of the samples of interest at a given distance threshold. This argument takes as input a comma-separated list of + partitions for which you want the zoom-in. Partition thresholds can be indicated in this argument following the same rules as the + arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC. This argument requires that a + metadata table was provided with '-m'. The argument '--loci-called' is not applied in the zoom-in analysis, i.e. all samples of + the cluster are included. Default: no zoom-in. --subtree-of-interest SUBTREE - [OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using the n closest samples of each sample of interest. This argument takes as input a comma-separated list of n's, corresponding to the number of closest samples you want - to include for the samples of interest. This argument requires that a metadata table was provided with '-m'. The argument '--loci-called' is not applied in the subtree analysis, i.e. all samples of the cluster are included. Default: no subtree. 
- --unzip [OPTIONAL and only available for --analysis grapetree or HC] Provide the outputs of '--zoom-cluster-of-interest' and '--subtree-of-interest' in unzipped format. + [OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using the n closest samples of each sample of + interest. This argument takes as input a comma-separated list of n's, corresponding to the number of closest samples you want to + include for the samples of interest. This argument requires that a metadata table was provided with '-m'. The argument '--loci- + called' is not applied in the subtree analysis, i.e. all samples of the cluster are included. Default: no subtree. + --zoom-all ZOOM_ALL [OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using only the samples that belong to each + cluster identified at a given distance threshold. This argument takes as input only one threshold (partition) for which you want + the zoom-in. This threshold can be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for + GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC. This argument requires that a metadata table was provided with + '-m'. The argument '--loci-called' is not applied in the zoom-in analysis, i.e. all samples of the cluster are included. Default: + no zoom-in. + --unzip [OPTIONAL and only available for --analysis grapetree or HC] Provide the outputs of '--zoom-cluster-of-interest', '--subtree-of- + interest' and '--zoom-all' in unzipped format. --frequency-matrix FREQUENCY_MATRIX - [OPTIONAL] Metadata column names for which a frequency matrix will be generated. This must be specified within quotation marks in the following format 'variable1,variable2'. Variable1 is the variable for which frequencies will be calculated (e.g. for - 'lineage,iso_week' the matrix reports the percentage of samples that correspond to each lineage per iso_week). 
If you want more than one matrix you can separate the different requests with semicolon (e.g. 'lineage,iso_week;country,lineage'). If you want - a higher detail in your variable2 and decompose it into two columns you use a colon (e.g. lineage,country:iso_week will report the percentage of samples that correspond to each lineage per iso_week in each country) + [OPTIONAL] Metadata column names for which a frequency matrix will be generated. This must be specified within quotation marks in + the following format 'variable1,variable2'. Variable1 is the variable for which frequencies will be calculated (e.g. for + 'lineage,iso_week' the matrix reports the percentage of samples that correspond to each lineage per iso_week). If you want more + than one matrix you can separate the different requests with a semicolon (e.g. 'lineage,iso_week;country,lineage'). If you want a + higher level of detail in your variable2 and want to decompose it into two columns, you can use a colon (e.g. lineage,country:iso_week will report the + percentage of samples that correspond to each lineage per iso_week in each country) --count-matrix COUNT_MATRIX [OPTIONAL] Same as '--frequency-matrix' but outputs counts and not frequencies - --mx-transpose [OPTIONAL] Set ONLY if you want that the variable1 specified in '--frequency-matrix' or in '--count-matrix' corresponds to the matrix first column. + --mx-transpose [OPTIONAL] Set ONLY if you want the variable1 specified in '--frequency-matrix' or in '--count-matrix' to correspond to the + first column of the matrix. --pivot [OPTIONAL] Set ONLY if you want an additional table for each count/frequency matrix in pivot format. 
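The '--frequency-matrix'/'--count-matrix' behavior described above maps naturally onto pandas cross-tabulation. The sketch below only illustrates the expected semantics under invented column names and data; it is not ReporTree's own code:

```python
import pandas as pd

# Toy metadata table; 'lineage' and 'iso_week' are illustrative column names.
meta = pd.DataFrame({
    "lineage":  ["B.1", "B.1", "B.2", "B.1"],
    "iso_week": ["2021-W01", "2021-W01", "2021-W01", "2021-W02"],
})

# '--count-matrix lineage,iso_week' analogue: counts of each lineage per iso_week
counts = pd.crosstab(meta["iso_week"], meta["lineage"])

# '--frequency-matrix lineage,iso_week' analogue: percentage of samples of each
# lineage per iso_week (each row sums to 100)
freq = pd.crosstab(meta["iso_week"], meta["lineage"], normalize="index") * 100
```

With '--mx-transpose', variable1 would index the rows instead, which in this sketch would amount to swapping the two arguments of `crosstab`.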
Stability regions: @@ -387,10 +471,12 @@ Stability regions: -AdjW ADJUSTEDWALLACE, --AdjustedWallace ADJUSTEDWALLACE Threshold of Adjusted Wallace score to consider an observation for method stability analysis [0.99] -n N_OBS, --n_obs N_OBS - Minimum number of sequencial observations that pass the Adjusted Wallace score to be considered a 'stability region' (i.e. a threshold range in which cluster composition is similar) [5] + Minimum number of sequential observations that pass the Adjusted Wallace score to be considered a 'stability region' (i.e. a + threshold range in which cluster composition is similar) [5] -o ORDER, --order ORDER [Set only if you provide your own partitions table] Partitions order in the partitions table (0: min -> max; 1: max -> min) [0] - --keep-redundants Set ONLY if you want to keep all samples of each cluster of the most discriminatory partition (by default redundant samples are removed to avoid the influence of cluster size) + --keep-redundants Set ONLY if you want to keep all samples of each cluster of the most discriminatory partition (by default redundant samples are + removed to avoid the influence of cluster size) ``` diff --git a/reportree.py b/reportree.py index d59d9eb..68659c2 100644 --- a/reportree.py +++ b/reportree.py @@ -13,10 +13,13 @@ import textwrap import datetime as datetime from datetime import date +import glob import pandas +import numpy +from collections import defaultdict -version = "2.5.4" -last_updated = "2024-12-03" +version = "2.6.0" +last_updated = "2025-10-22" reportree_script = os.path.realpath(__file__) reportree_path = reportree_script.rsplit("/", 1)[0] @@ -49,6 +52,8 @@ def run_metadata_report(args, partitions2report_final, partitions_status): extra_metadata += " --mx-transpose " if args.pivot: extra_metadata += " --pivot " + if args.update_clusters: + extra_metadata += " --update-cluster-names " if partitions_status == "new": partitions = args.output + "_partitions.tsv" no_partitions = False @@ -305,7 
+310,7 @@ def infer_partitions_from_list(analysis, partitions, metadata, thresholds, outpu partitions2report_lst.append(part) partitions2report = ",".join(partitions2report_lst) - + return partitions2report, partitions2report_lst def all_partitions_available(method_threshold, include_node): @@ -544,6 +549,23 @@ def get_clusters_interest(samples, matrix, metadata, day, partitions): return clusters_of_interest, not_in_partitions +def get_all_clusters(matrix, partitions, log): + """ get all clusters for zoom-all + input: partitions table + output: list of all clusters """ + + partitions_mx = pandas.read_table(matrix) + clusters_of_interest = [] + + if partitions in partitions_mx.columns: + counts = partitions_mx[partitions].value_counts() + clusters_of_interest = counts[counts > 1].index.tolist() + else: + print_log("\tThreshold " + str(partitions) + " not found in partitions table...", log) + + return clusters_of_interest + + def loci_called2metadata(metadata_original, out, thr, analysis): """ adds column with percentage of called loci to metadata table """ @@ -582,7 +604,7 @@ def get_closest_samples(sample, n, hamming, samples_in_partitions): i = 0 for d in distances: if d <= max_dist: - closest_samples.append(samples[i]) + closest_samples.append(str(samples[i])) i += 1 code = "run" else: @@ -702,15 +724,19 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ conversion = {} # cluster_names_db[cluster partitions] = cluster nomenclature mx = mx_original.set_index(mx_original.columns[0], drop = False) + changes = {} + split_clusters = {} + merged_clusters = {} + previous_samples_set = set(previous_samples) clusters = list(mx[partition].unique()) changes = {} split_clusters = {} - for cluster in clusters: - merged_clusters = {} - cluster_mx = mx[mx[partition] == cluster] + grouped = mx.groupby(partition) + + for cluster, cluster_mx in grouped: cluster_samples = cluster_mx[cluster_mx.columns[0]].values.tolist() - new_samples = 
set(cluster_samples) - set(previous_samples) + new_samples = set(cluster_samples) - previous_samples_set if len(new_samples) == len(cluster_samples): # all cluster samples are new if len(new_samples) == 1: max_singleton += 1 @@ -720,11 +746,11 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ new_cluster_name = "cluster_" + str(max_cluster) conversion[cluster] = new_cluster_name change_info = "new","-","new",new_cluster_name,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if "new" not in changes.keys(): + if "new" not in changes: changes["new"] = [] changes["new"].append(change_info) elif len(new_samples) == 0: # cluster does not have new samples - for sample_list in db_samples.keys(): + for sample_list in db_samples: old_samples = sample_list.split(",") old_cluster = str(db_samples[sample_list]) cluster_intersection = set(cluster_samples) & set(old_samples) @@ -734,13 +760,13 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ if len(exclusive_old) == 0 and len(exclusive_new) == 0: # cluster remains exactly the same conversion[cluster] = old_cluster change_info = old_cluster,str(len(old_samples)),"kept (none)",old_cluster,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if old_cluster not in changes.keys(): + if old_cluster not in changes: changes[old_cluster] = [] changes[old_cluster].append(change_info) elif len(exclusive_old) > 0 and len(exclusive_new) == 0: # cluster was split and did not merge - if old_cluster not in split_clusters.keys(): + if old_cluster not in split_clusters: split_clusters[old_cluster] = 0 - if old_cluster not in changes.keys(): + if old_cluster not in changes: changes[old_cluster] = [] if len(cluster_samples) == 1: # it is a singleton now max_singleton += 1 @@ -758,7 +784,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ cluster_old_mx = 
mx_new[mx_new[mx_new.columns[0]].isin(list(cluster_samples))] all_clusters = sorted(cluster_old_mx[old_partition].unique()) all_clusters = "_".join(all_clusters) - if all_clusters not in merged_clusters.keys(): + if all_clusters not in merged_clusters: max_cluster += 1 new_cluster_name = "cluster_" + str(max_cluster) merged_clusters[all_clusters] = new_cluster_name @@ -766,7 +792,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ new_cluster_name = merged_clusters[all_clusters] conversion[cluster] = new_cluster_name change_info = all_clusters,"multiple clusters","new (split_merge)",new_cluster_name,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if all_clusters not in changes.keys(): + if all_clusters not in changes: changes[all_clusters] = [] changes[all_clusters].append(change_info) else: @@ -783,7 +809,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ cluster_old_mx = mx_new[mx_new[mx_new.columns[0]].isin(list(cluster_samples))] all_clusters = sorted(cluster_old_mx[old_partition].unique()) all_clusters = "_".join(all_clusters) - if all_clusters not in merged_clusters.keys(): + if all_clusters not in merged_clusters: max_cluster += 1 new_cluster_name = "cluster_" + str(max_cluster) merged_clusters[all_clusters] = new_cluster_name @@ -791,7 +817,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ new_cluster_name = merged_clusters[all_clusters] conversion[cluster] = new_cluster_name change_info = all_clusters,"multiple clusters","new (merge)",new_cluster_name,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if all_clusters not in changes.keys(): + if all_clusters not in changes: changes[all_clusters] = [] changes[all_clusters].append(change_info) else: @@ -822,13 +848,13 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ else: # it was already a cluster that increased 
conversion[cluster] = old_cluster change_info = old_cluster,str(len(old_samples)),"kept (increase)",old_cluster,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if old_cluster not in changes.keys(): + if old_cluster not in changes: changes[old_cluster] = [] changes[old_cluster].append(change_info) elif len(exclusive_old) > 0 and len(exclusive_new) == len(new_samples): # cluster was split, increased and did not merge - if old_cluster not in split_clusters.keys(): + if old_cluster not in split_clusters: split_clusters[old_cluster] = 0 - if old_cluster not in changes.keys(): + if old_cluster not in changes: changes[old_cluster] = [] split_clusters[old_cluster] += 1 new_cluster_name = old_cluster + "." + str(split_clusters[old_cluster]) @@ -839,7 +865,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ cluster_old_mx = mx_new[mx_new[mx_new.columns[0]].isin(list(cluster_samples))] all_clusters = sorted(cluster_old_mx[old_partition].unique()) all_clusters = "_".join(all_clusters) - if all_clusters not in merged_clusters.keys(): + if all_clusters not in merged_clusters: max_cluster += 1 new_cluster_name = "cluster_" + str(max_cluster) merged_clusters[all_clusters] = new_cluster_name @@ -847,7 +873,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ new_cluster_name = merged_clusters[all_clusters] conversion[cluster] = new_cluster_name change_info = all_clusters,"multiple clusters","new (split_merge_increase)",new_cluster_name,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if all_clusters not in changes.keys(): + if all_clusters not in changes: changes[all_clusters] = [] changes[all_clusters].append(change_info) else: @@ -864,7 +890,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ cluster_old_mx = mx_new[mx_new[mx_new.columns[0]].isin(list(cluster_samples))] all_clusters = 
sorted(cluster_old_mx[old_partition].unique()) all_clusters = "_".join(all_clusters) - if all_clusters not in merged_clusters.keys(): + if all_clusters not in merged_clusters: max_cluster += 1 new_cluster_name = "cluster_" + str(max_cluster) merged_clusters[all_clusters] = new_cluster_name @@ -872,7 +898,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ new_cluster_name = merged_clusters[all_clusters] conversion[cluster] = new_cluster_name change_info = all_clusters,"multiple clusters","new (merge_increase)",new_cluster_name,str(len(cluster_samples)),str(len(new_samples)),",".join(list(new_samples)) - if all_clusters not in changes.keys(): + if all_clusters not in changes: changes[all_clusters] = [] changes[all_clusters].append(change_info) else: @@ -880,7 +906,7 @@ def determine_new_cluster_names(previous_samples, db_samples, old_partition, mx_ old,old_l,modification,new,new_l,n_new_samples,new_samples_name = change if old == all_clusters and new != new_cluster_name: changes[all_clusters].append(change_info) - if cluster not in conversion.keys(): + if cluster not in conversion: if len(cluster_samples) == 1: max_singleton += 1 new_cluster_name = "singleton_" + str(max_singleton) @@ -1034,12 +1060,14 @@ def solve_nomenclature_code_in_args(args,day): # metadata2report if args.metadata2report != "none": + metadata2report = str(args.metadata2report).split(",") if "nomenclature_code" in args.metadata2report: - metadata2report = str(args.metadata2report).split(",") for i in range(len(metadata2report)): if metadata2report[i] == "nomenclature_code": metadata2report[i] = "nomenclature_code_" + str(day) - args.metadata2report = ",".join(metadata2report) + else: + metadata2report.append("nomenclature_code_" + str(day)) + args.metadata2report = ",".join(metadata2report) else: args.metadata2report = "nomenclature_code_" + str(day) @@ -1096,6 +1124,7 @@ def rename_clusters_subsets(analysis,partitions,tag): mx.replace(new_name, regex=True, 
inplace=True) mx.to_csv(partitions, index = False, header=True, sep = "\t") + # running the pipeline ---------- @@ -1209,6 +1238,7 @@ def main(): input, it is MANDATORY to specify this argument.") group1.add_argument("--subset", dest="subset", required=False, action="store_true", help="[OPTIONAL] Obtain genetic clusters using only the samples that correspond to the filters specified in the \ '--filter' argument.") + group1.add_argument("-flag", "--flag-multiple-clusters", dest="flag_clusters", required=False, action="store_true", help="[OPTIONAL] Generate an additional file with information about the ") group1.add_argument("-d", "--dist", dest="dist", required=False, default=1.0, type=float, help="[OPTIONAL] Distance unit by which partition thresholds will be multiplied (example: if -d 10 and \ -thr 5,8,10-30, the minimum spanning tree will be cut at 50,80,100,110,120,...,300. If -d 10 and --method-threshold avg_clade-2, the avg_clade threshold will be set \ at 20). This argument is particularly useful for non-SNP-scaled trees. Currently, the default is 1, which is equivalent to 1 allele distance or 1 SNP distance. [1.0]") @@ -1309,6 +1339,11 @@ def main(): Of note, if a '--nomenclature-file' is provided in subsequent ReporTree runs using the same method and thresholds, the nomenclature code will be kept. You can also add one metadata variable \ (e.g. Country) to get an extra layer to the code (e.g. C3-C2-C1-Portugal). Partition thresholds can be indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' \ for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC.") + group7.add_argument("--update-cluster-names", dest="update_clusters", required=False, action="store_true", help="[OPTIONAL] Activate update clusters mode. 
If you set this argument, and the metadata you provided \ + has column names that correspond to partition thresholds from a previous ReporTree run using the same analysis method as in the current run, ReporTree will provide updated cluster names in \ + the *metadata_w_partitions.tsv, instead of adding new columns with the prefix subset. This argument is only applicable if you provide a metadata file. To use this argument you do not need to \ + provide a nomenclature file nor request a nomenclature code. Please be careful when using this argument in combination with '--subset' because ReporTree will perform a run on the subset samples \ + you selected AND update the *metadata_w_partitions.tsv with the new cluster names instead of adding new columns.") ## reportree @@ -1332,6 +1367,9 @@ def main(): they must be separated with commas (e.g 'country == Portugal,Spain,France'). When filters include more than one column, they must be separated with semicolon (e.g. \ 'country == Portugal,Spain,France;date > 2018-01-01;date < 2022-01-01'). White spaces are important in this argument, so, do not leave spaces before and after \ commas/semicolons.") + group8.add_argument("--check-cluster-incongruencies", dest="incongruencies", required=False, action="store_true", help="[OPTIONAL and only available for --analysis grapetree or HC] Perform \ + a comparison between the genetic clusters obtained and the distance matrix to identify incongruencies between the two (e.g. isolates A and B differ by 10 allele differences \ + but are not clustered at this threshold).") group8.add_argument("--sample_of_interest", dest="sample_of_interest", required=False, default="all", help="[OPTIONAL] List of samples of interest for which summary reports will be created. This \ list can be a comma-separated list in the command line, or a comma-separated list in a file, or a list in the first column of a tsv file. No headers should be provided in the \ input files. 
If nothing is provided, only the summary reports comprising all samples will be generated.") @@ -1344,8 +1382,13 @@ def main(): closest samples of each sample of interest. This argument takes as input a comma-separated list of n's, corresponding to the number of closest samples you want to include for \ the samples of interest. This argument requires that a metadata table was provided with '-m'. The argument '--loci-called' is not applied in the subtree analysis, i.e. \ all samples of the cluster are included. Default: no subtree.") - group8.add_argument("--unzip", dest="unzip", required=False, action="store_true", help="[OPTIONAL and only available for --analysis grapetree or HC] Provide the outputs of '--zoom-cluster-of-interest' and \ - '--subtree-of-interest' in unzipped format.") + group8.add_argument("--zoom-all", dest="zoom_all", required=False, default="no", help="[OPTIONAL and only available for --analysis grapetree or HC] Repeat the analysis using only the samples that belong \ + to each cluster identified at a given distance threshold. This argument takes as input only one threshold (partition) for which you want the zoom-in. This threshold can be \ + indicated in this argument following the same rules as the arguments '-thr' and '-pct_thr' for GrapeTree or '--HC-threshold' and '--pct-HC-threshold' for HC. This argument \ + requires that a metadata table was provided with '-m'. The argument '--loci-called' is not applied in the zoom-in analysis, i.e. all samples of the cluster are included. 
\ + Default: no zoom-in.") + group8.add_argument("--unzip", dest="unzip", required=False, action="store_true", help="[OPTIONAL and only available for --analysis grapetree or HC] Provide the outputs of '--zoom-cluster-of-interest', \ + '--subtree-of-interest' and '--zoom-all' in unzipped format.") group8.add_argument("--frequency-matrix", dest="frequency_matrix", required=False, default="no", help="[OPTIONAL] Metadata column names for which a frequency matrix will be generated. This \ must be specified within quotation marks in the following format 'variable1,variable2'. Variable1 is the variable for which frequencies will be calculated (e.g. for \ 'lineage,iso_week' the matrix reports the percentage of samples that correspond to each lineage per iso_week). If you want more than one matrix you can separate the \ @@ -1419,13 +1462,18 @@ def main(): day = str(datetime.datetime.today().strftime("%Y-%m-%d")) if args.metadata == "none": - if args.zoom != "no" or args.subtree != "no": - sys.exit("\nYou have asked for '--zoom-cluster-of-interest' and/or '--subtree-of-interest'... I also need a metadata table :-)\n") + if args.zoom != "no" or args.subtree != "no" or args.zoom_all != "no": + sys.exit("\nYou have asked for '--zoom-cluster-of-interest' and/or '--subtree-of-interest' and/or '--zoom-all'... I also need a metadata table :-)\n") else: print_log("\nWARNING!! You did not provide a metadata file... metadata_report.py will not be run! Provide such a file if you want to take the most profit of ReporTree :-)", log) metadata_col = [] + if args.update_clusters: + sys.exit("\nYou have asked for '--update-cluster-names'... I also need a metadata table :-)\n") else: - metadata_mx = pandas.read_table(args.metadata) + metadata_mx = pandas.read_table(args.metadata, dtype = str) + first_col = metadata_mx.columns[0] + if metadata_mx[first_col].duplicated().any(): + sys.exit("\nYou have duplicated entries in your metadata file! 
I cannot proceed :-(\n") metadata_col = [col.replace(" ", "_") for col in metadata_mx.columns] # adapting inputs to report nomenclature_code @@ -1434,8 +1482,8 @@ def main(): # solve incompatible input if args.loci != "none" and args.N_content != 0.0: - if args.zoom == "no" and args.subtree == "no": - sys.exit("\nThe arguments '--loci' and '--site-inclusion' were specified without '--zoom-cluster-of-interest' or '--subtree-of-interest'... I am confused :-(\n") + if args.zoom == "no" and args.subtree == "no" and args.zoom_all == "no": + sys.exit("\nThe arguments '--loci' and '--site-inclusion' were specified without '--zoom-cluster-of-interest' or '--subtree-of-interest' or '--zoom-all'... I am confused :-(\n") # reportree workflow ---------- @@ -1638,12 +1686,12 @@ def main(): else: # can continue using the allele profile analysis = args.analysis profile = args.allele_profile - if args.zoom != "no" or args.subtree != "no": + if args.zoom != "no" or args.subtree != "no" or args.zoom_all != "no": future_zoom = True print_log("\nProfiles file provided -> will run partitioning_" + analysis + ".py:\n", log) elif args.alignment != "": # ALIGNMENT ---> PROCESS INPUT - if args.vcf != "": + if args.vcf != "": print_log("\nAlignment and VCF files specified... 
I am confused :-(\n", log) sys.exit(1) @@ -1657,7 +1705,7 @@ def main(): else: # can continue using the alignment analysis = args.analysis - if args.zoom != "no" or args.subtree != "no": + if args.zoom != "no" or args.subtree != "no" or args.zoom_all != "no": future_zoom = True print_log("\nAlignment file provided -> will run alignment_processing.py and partitioning_" + analysis + ".py:\n", log) @@ -1768,6 +1816,16 @@ def main(): cmds.append("nomenclature") + # FLAG INCONGRUENCIES ---------- + + if args.incongruencies: + if analysis == "HC" or analysis == "grapetree": + print_log("\nChecking for incongruencies between clustering and distance matrix...", log) + loci_report = args.output + "_loci_report.tsv" + else: + print_log("\nRequest for checking clustering incongruencies is not compatible with this analysis, so it will not be performed :-(", log) + + # running comparing partitions if "stability_regions" in args.partitions2report and "all" in args.partitions2report: print_log("\t'stability_regions' and 'all' options cannot be simultaneously specified in --partitions2report... 
I am confused :-(", log) @@ -1823,17 +1881,17 @@ def main(): nomenclature_change2summary(args.output) # samples of interest - s_of_interest = args.sample_of_interest - if s_of_interest != "all" and args.metadata != "none": - print("START SAMPLES INTEREST ANALYSIS") + if future_zoom: print_log("\n\n\n\n****************************** PROCESSING SAMPLES OF INTEREST ******************************\n\n", log) - print_log("Filtering partitions_summary.tsv according to samples of interest...", log) - samples_of_interest, singletons, do_not_exist = info_samples_interest(s_of_interest, args.output + "_partitions_summary.tsv", args.output + "_partitions.tsv", args.output) - - # UPDATE METADATA ---------- - print("ADD CATEGORY") - metadata = interest2metadata(args.output, samples_of_interest) + if args.zoom != "no" or args.subtree != "no": + s_of_interest = args.sample_of_interest + print_log("Filtering partitions_summary.tsv according to samples of interest...", log) + samples_of_interest, singletons, do_not_exist = info_samples_interest(s_of_interest, args.output + "_partitions_summary.tsv", args.output + "_partitions.tsv", args.output) + + # UPDATE METADATA ---------- + print("ADD CATEGORY") + metadata = interest2metadata(args.output, samples_of_interest) # ZOOM-IN ---------- @@ -1844,7 +1902,7 @@ def main(): partitions4zoom, partitions4zoom_lst = infer_partitions_from_list(analysis, args.output + "_partitions.tsv", metadata_col, args.zoom, args.output, args.dist, log) clusters_of_interest, not_in_partitions = get_clusters_interest(samples_of_interest, args.output + "_partitions.tsv", args.output + "_metadata_w_partitions.tsv", day, partitions4zoom_lst) if len(partitions4zoom_lst) == 0: - print_log("\tNone of the requested partitions for zoom-in is valid!... We are sorry but cluster zoom-in will be done! 
Please check that you have correctly indicated the list of partitions in the '--zoom-cluster-of-interest' argument.", log) + print_log("\tNone of the requested partitions for zoom-in is valid!... We are sorry but cluster zoom-in will not be done! Please list the right partitions in the '--zoom-cluster-of-interest' argument.", log) else: for part in partitions4zoom_lst: if part in clusters_of_interest.keys(): @@ -1859,6 +1917,27 @@ def main(): else: print_log("\tAt the requested partition " + str(part) + ", we did not find any cluster with at least one sample of interest to zoom-in.", log) + + # ZOOM-IN ALL ---------- + + if args.zoom_all != "no": + print_log("Checking cluster zoom-all requests...", log) + if "," in args.zoom_all: + print_log("\tMore than one threshold indicated in '--zoom-all'. No zoom-all will be performed :-(", log) + else: + partitions4zoom, partitions4zoom_lst = infer_partitions_from_list(analysis, args.output + "_partitions.tsv", metadata_col, args.zoom_all, args.output, args.dist, log) + clusters_of_interest = get_all_clusters(args.output + "_partitions.tsv", partitions4zoom, log) + if len(partitions4zoom_lst) == 0: + print_log("\tThe requested partition for zoom-all is not valid!... 
We are sorry but cluster zoom-all will not be done!", log) + else: + part = partitions4zoom + for cluster in clusters_of_interest: + tag_subset = str(part) + "_" + str(cluster) + filter_subset = part + " == " + str(cluster) + info = "zoom-in",tag_subset,filter_subset + if info not in new_subset_filters: + new_subset_filters.append(info) + # TREE INTEREST ---------- if args.subtree != "no" and args.metadata != "": @@ -1869,6 +1948,7 @@ def main(): sample_col = metadata.columns[0] for n in n4subtree: for sample in samples_of_interest: + sample = str(sample) closest_samples, code = get_closest_samples(sample, n, hamming, samples_in_partitions) if code == "all_samples": print_log("\tSubtree of interest requested for n = " + str(n) + " will not be done because this includes all samples in the dataset!", log) @@ -1885,6 +1965,7 @@ def main(): # SUBSET ---------- zooms_file = open(args.output + "_zooms.txt", "w+") + zoom_df = pandas.DataFrame() zoom = True for type_analysis,tag_subset,filter_subset in new_subset_filters: out_folder = args.output diff --git a/scripts/alignment_processing.py b/scripts/alignment_processing.py index 04816e1..b1a570a 100644 --- a/scripts/alignment_processing.py +++ b/scripts/alignment_processing.py @@ -16,8 +16,8 @@ import datetime as datetime from Bio import SeqIO, AlignIO, Align, Alphabet -version = "1.5.1" -last_updated = "2024-09-27" +version = "1.5.2" +last_updated = "2024-10-20" # functions ---------- diff --git a/scripts/metadata_report.py b/scripts/metadata_report.py index 62528cd..e7aad88 100644 --- a/scripts/metadata_report.py +++ b/scripts/metadata_report.py @@ -15,12 +15,12 @@ from pandas.api.types import is_datetime64_any_dtype as is_datetime import datetime -version = "1.2.0" -last_updated = "2024-11-06" +version = "1.4.0" +last_updated = "2025-10-22" # functions ---------- -def partitions2metadata(partitions_name, partitions, mx_metadata, partitions2report, filters, log): +def partitions2metadata(partitions_name, partitions, 
mx_metadata, partitions2report, filters, update_clusters, log):
 	""" create a combined matrix with metadata and partitions information
@@ -54,10 +54,9 @@ def partitions2metadata(partitions_name, partitions, mx_metadata, partitions2rep
 	mx_metadata.insert(index_no, "year_original", year_original)
 	index_no = mx_metadata.columns.get_loc("date")
 	mx_metadata["date"] = pandas.to_datetime(mx_metadata["date"], errors = "coerce")
-	mx_metadata["year"] = mx_metadata["date"].dt.year
+	mx_metadata["year"] = mx_metadata["date"].dt.year.astype("Int64")
 	year = mx_metadata.pop("year")
 	mx_metadata.insert(index_no + 1, "year", year)
-	index_no = mx_metadata.columns.get_loc("date")
 	if "iso_week_nr" not in mx_metadata.columns and "iso_year" not in mx_metadata.columns and "iso_week" not in mx_metadata.columns:
 		isoyear = mx_metadata["date"].dt.isocalendar().year
 		isoweek = mx_metadata["date"].dt.isocalendar().week
@@ -93,28 +92,51 @@ def partitions2metadata(partitions_name, partitions, mx_metadata, partitions2rep
 	a = mx_metadata.set_index(sample_column, drop = True)
 	b = mx_partitions.set_index(sample_column_part, drop = True)
-	if len(set(a.columns) & set(b.columns)) > 0:
-		b = b.add_prefix("subset_")
-		possible_subset = True
-		mx_partitions.columns = [sample_column_part] + b.columns.tolist()
-		mx_partitions.to_csv(partitions_name, index = False, header=True, sep ="\t")
-	if partitions2report == "all": # add all partitions
-		c = pandas.concat([a, b], axis=1)
-
-	else: # add specific set of partitions
-		required_partitions = partitions2report.split(",")
-		c = a
-		for column_name in required_partitions:
-			if column_name in b.columns:
-				c = pandas.concat([c,b[column_name]], axis = 1)
-			else:
-				print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!")
-				print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!", file = log)
-
+	if not update_clusters:
+		if len(set(a.columns) & set(b.columns)) > 0:
+			b = b.add_prefix("subset_")
+			possible_subset = True
+			mx_partitions.columns = [sample_column_part] + b.columns.tolist()
+			mx_partitions.to_csv(partitions_name, index = False, header=True, sep ="\t")
+		if partitions2report == "all": # add all partitions
+			c = pandas.concat([a, b], axis=1)
+		else: # add specific set of partitions
+			required_partitions = partitions2report.split(",")
+			c = a
+			for column_name in required_partitions:
+				if column_name in b.columns:
+					c = pandas.concat([c,b[column_name]], axis = 1)
+				else:
+					print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!")
+					print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!", file = log)
+	else:
+		shared_columns = list(set(a.columns) & set(b.columns))
+		new_columns = b.columns.difference(a.columns)
+		b_shared = b[shared_columns]
+		common_samples = a.index.intersection(b.index)
+		new_samples = b.index.difference(a.index)
+
+		a.loc[common_samples, shared_columns] = b_shared.loc[common_samples]
+		new_rows = b_shared.loc[new_samples]
+		a = pandas.concat([a, new_rows], axis=0)
+
+		c = a.copy()
+
+		if partitions2report == "all": # add all partitions
+			valid_cols = b.columns.intersection(new_columns)
+			c = pandas.concat([c, b.loc[b.index.intersection(c.index), valid_cols].reindex(c.index)], axis=1)
+		else:
+			required_partitions = partitions2report.split(",")
+			for column_name in required_partitions:
+				if column_name in b.columns and column_name not in shared_columns:
+					valid_rows = b.index.intersection(c.index)
+					c = pandas.concat([c, b.loc[valid_rows, [column_name]].reindex(c.index)], axis=1)
+				else:
+					print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!")
+					print("\t\t" + column_name + " will not be reported because it was not found in the partitions table!!", file=log)
 	else:
 		c = mx_metadata.set_index(mx_metadata.columns[0], drop = True)
-
 	# filtering table according to specifications
 	c_filtered = c.reset_index(drop = False)
 	c_filtered.rename(columns={c_filtered.columns[0]: mx_metadata.columns[0]}, inplace=True)
@@ -225,7 +247,8 @@ def partitions_summary(complete_metadata, partitions, partitions2report, summary
 	summary = {"partition": [], "cluster": [], "cluster_length": [], "samples": []} # dictionary for final dataframe
 	order_columns = ["partition", "cluster", "cluster_length", "samples"] # list with the order of columns in final dataframe
-	complete_metadata = complete_metadata.fillna("EMPTY")
+	str_cols = complete_metadata.select_dtypes(include=["object", "string"]).columns
+	complete_metadata[str_cols] = complete_metadata[str_cols].fillna("EMPTY")
 	if partitions2report == "all":
 		if isinstance(partitions, pandas.DataFrame):
@@ -335,7 +358,8 @@ def col_summary(main_column, complete_metadata, columns_summary_report, sample_c
 	absent_columns = []
-	complete_metadata = complete_metadata.fillna("EMPTY")
+	str_cols = complete_metadata.select_dtypes(include=["object", "string"]).columns
+	complete_metadata[str_cols] = complete_metadata[str_cols].fillna("EMPTY")
 	if main_column in complete_metadata.columns:
 		order_columns = [main_column]
 		summary = {main_column: []} # dictionary for final dataframe
@@ -707,6 +731,9 @@ def main():
 					they must be separated with commas (e.g 'country == Portugal,Spain,France'). When filters include more than one column, they must be separated with semicolon (e.g. \
 					'country == Portugal,Spain,France;date > 2018-01-01;date < 2022-01-01'). White spaces are important in this argument, so, do not leave spaces before and after \
 					commas/semicolons.")
+	parser.add_argument("--update-cluster-names", dest="update_clusters", required=False, action="store_true", help="[OPTIONAL] Activate update clusters mode. If you set this argument, and the metadata \
+					you provided has column names that correspond to partition thresholds from a previous ReporTree run using the same analysis method as in the current run, we will provide updated \
+					cluster names in the *metadata_w_partitions.tsv, instead of adding new columns with the prefix subset. This argument is only applicable if you provide a partitions file.")
 	parser.add_argument("--frequency-matrix", dest="frequency_matrix", required=False, default="no", help="[OPTIONAL] Metadata column names for which a frequency matrix will be generated. This must \
 					be specified within quotation marks in the following format 'variable1,variable2'. Variable1 is the variable for which frequencies will be calculated (e.g. for \
 					'lineage,iso_week' the matrix reports the percentage of samples that correspond to each lineage per iso_week). If you want more than one matrix you can separate the \
@@ -744,14 +771,19 @@ def main():
 	# analysis of the partitions file
+	update_clusters = False
 	if args.partitions != "": # partitions table provided
 		print("Getting information from the partitions table: " + str(args.partitions))
 		print("Getting information from the partitions table: " + str(args.partitions), file = log)
 		partitions_name = args.partitions
 		partitions = pandas.read_table(args.partitions, dtype = str)
+		if args.update_clusters:
+			update_clusters = True
 	else:
 		partitions = args.partitions
 		partitions_name = ""
+		if args.update_clusters:
+			print("\nYou requested to update cluster names in metadata_w_partitions.tsv, but no partitions file was provided. So, I cannot do it :-(")
 	# adding partitions to metadata
@@ -764,7 +796,7 @@ def main():
 			if " " in col:
 				new_name = col.replace(" ", "_")
 				col_rename[new_name] = col
-	complete_metadata, possible_subset = partitions2metadata(partitions_name, partitions, mx_metadata, args.partitions2report, args.filter_column, log)
+	complete_metadata, possible_subset = partitions2metadata(partitions_name, partitions, mx_metadata, args.partitions2report, args.filter_column, update_clusters, log)
 	sample_column = complete_metadata.columns[0]
diff --git a/scripts/partitioning_HC.py b/scripts/partitioning_HC.py
index ace75ec..ab035f2 100644
--- a/scripts/partitioning_HC.py
+++ b/scripts/partitioning_HC.py
@@ -20,8 +20,8 @@
 sys.setrecursionlimit(10000) # please increase this number, if you are getting the error "RecursionError: maximum recursion depth exceeded while calling a Python object"
-version = "1.8.0"
-last_updated = "2024-09-09"
+version = "1.9.0"
+last_updated = "2025-10-21"
 # functions ----------
@@ -161,15 +161,9 @@ def filter_mx(matrix, mx, filters, matrix_type, log):
 	samples = mx[sample_column].tolist()
 	if matrix_type == "dist":
 		pairwise_dist_mx = matrix.set_index(matrix.columns[0], drop = True)
-		columns_interest = []
-
-		for sample in pairwise_dist_mx.columns:
-			if sample in samples:
-				columns_interest.append(sample)
+		columns_interest = [col for col in pairwise_dist_mx.columns if col in samples]
+		df = pairwise_dist_mx.loc[columns_interest, columns_interest].reset_index()
-		df = pairwise_dist_mx[columns_interest].loc[columns_interest]
-		df = df.reset_index(drop=False)
-
 	else:
 		df = matrix[matrix[matrix.columns[0]].isin(samples)]
@@ -410,17 +404,17 @@ def main():
 			if float(args.samples_called) == 0.0:
 				print("Keeping the sites/loci present in the loci list.")
 				print("Keeping the sites/loci present in the loci list.", file = log)
-				for col in allele_mx.columns[1:]:
-					if col not in loci2include:
-						allele_mx = allele_mx.drop(columns=col)
+				cols_to_keep = [allele_mx.columns[0]] + [col for col in allele_mx.columns[1:] if col in loci2include]
+				allele_mx = allele_mx[cols_to_keep]
 			else:
 				print("Keeping the sites/loci present in the loci list and those with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...")
 				print("Keeping the sites/loci present in the loci list and those with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...", file = log)
-				for col in allele_mx.columns[1:]:
-					if col not in loci2include:
-						values = allele_mx[col].values.tolist()
-						if (len(values)-values.count("0"))/len(values) < float(args.samples_called):
-							allele_mx = allele_mx.drop(columns=col)
+				cols_to_check = [col for col in allele_mx.columns[1:] if col not in loci2include]
+				subset = allele_mx[cols_to_check]
+				called_proportion = (subset != "0").sum() / len(subset)
+				columns_to_keep = called_proportion[called_proportion >= float(args.samples_called)].index.tolist()
+				final_cols = [col for col in allele_mx.columns if col == allele_mx.columns[0] or col in columns_to_keep or col in loci2include]
+				allele_mx = allele_mx[final_cols]
 			pos_t1 = len(allele_mx.columns[1:])
 			print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.")
 			print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.", file = log)
@@ -429,10 +423,12 @@
 			if float(args.samples_called) > 0.0:
 				print("Keeping the sites/loci with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...")
 				print("Keeping the sites/loci with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...", file = log)
-				for col in allele_mx.columns[1:]:
-					values = allele_mx[col].values.tolist()
-					if (len(values)-values.count("0"))/len(values) < float(args.samples_called):
-						allele_mx = allele_mx.drop(columns=col)
+				cols_to_check = allele_mx.columns[1:]
+				subset = allele_mx[cols_to_check]
+				called_proportion = (subset != "0").sum(axis=0) / len(subset)
+				columns_to_keep = called_proportion[called_proportion >= float(args.samples_called)].index.tolist()
+				final_cols = [allele_mx.columns[0]] + [col for col in cols_to_check if col in columns_to_keep]
+				allele_mx = allele_mx[final_cols]
 			pos_t1 = len(allele_mx.columns[1:])
 			print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.")
 			print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.", file = log)
@@ -448,24 +444,25 @@ def main():
 	# cleaning allele matrix (rows) ----------
-	if float(args.loci_called) > 0.0 and float(args.samples_called) < 1.0:
+	if float(args.loci_called) >= 0.0:
 		print("Cleaning the profile matrix using a threshold of >" + str(args.loci_called) + " alleles/positions called per sample...")
 		print("Cleaning the profile matrix using a threshold of >" + str(args.loci_called) + " alleles/positions called per sample...", file = log)
-		report_allele_mx = {}
 		len_schema = len(allele_mx.columns) - 1
+
+		missing_counts = allele_mx.isin(["0"]).sum(axis=1)
+		called_counts = len_schema - missing_counts
+		report_allele_mx = {}
 		report_allele_mx["samples"] = allele_mx[allele_mx.columns[0]]
-		report_allele_mx["missing"] = allele_mx.isin(["0"]).sum(axis=1)
-		report_allele_mx["called"] = len_schema - allele_mx.isin(["0"]).sum(axis=1)
-		report_allele_mx["pct_called"] = (len_schema - allele_mx.isin(["0"]).sum(axis=1)) / len_schema
+		report_allele_mx["missing"] = missing_counts
+		report_allele_mx["called"] = called_counts
+		report_allele_mx["pct_called"] = called_counts / len_schema
 		report_allele_df = pandas.DataFrame(data = report_allele_mx)
-		if float(args.loci_called) != 1.0:
-			flt_report = report_allele_df[report_allele_df["pct_called"] > float(args.loci_called)]
-		else:
-			flt_report = report_allele_df[report_allele_df["pct_called"] == float(args.loci_called)]
+
+		flt_report = report_allele_df[report_allele_df["pct_called"] >= float(args.loci_called)]
+
 		pass_samples = flt_report["samples"].values.tolist()
 		report_allele_df.to_csv(args.out + "_loci_report.tsv", index = False, header=True, sep ="\t")
diff --git a/scripts/partitioning_grapetree.py b/scripts/partitioning_grapetree.py
index ea2535d..4900fab 100644
--- a/scripts/partitioning_grapetree.py
+++ b/scripts/partitioning_grapetree.py
@@ -25,8 +25,8 @@
 grapetree = partitioning_grapetree_script.rsplit("/", 1)[0] + "/GrapeTree/grapetree.py"
 python = sys.executable
-version = "1.5.0"
-last_updated = "2024-07-12"
+version = "1.6.0"
+last_updated = "2025-10-21"
 # additional functions ----------
@@ -335,19 +335,18 @@ def main():
 		if float(args.samples_called) == 0.0:
 			print("Keeping the sites/loci present in the loci list.")
 			print("Keeping the sites/loci present in the loci list.", file = log)
-			for col in mx_allele.columns[1:]:
-				if col not in loci2include:
-					mx_allele = mx_allele.drop(columns=col)
+			cols_to_keep = [mx_allele.columns[0]] + [col for col in mx_allele.columns[1:] if col in loci2include]
+			mx_allele = mx_allele[cols_to_keep]
 		else:
 			print("Keeping the sites/loci present in the loci list and those with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...")
 			print("Keeping the sites/loci present in the loci list and those with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...", file = log)
-			for col in mx_allele.columns[1:]:
-				if col not in loci2include:
-					values = mx_allele[col].values.tolist()
-					if (len(values)-values.count("0"))/len(values) < float(args.samples_called):
-						mx_allele = mx_allele.drop(columns=col)
-					elif (len(values)-values.count(0))/len(values) < float(args.samples_called):
-						mx_allele = mx_allele.drop(columns=col)
+			cols_to_check = [col for col in mx_allele.columns[1:] if col not in loci2include]
+			subset = mx_allele[cols_to_check]
+			called_proportion = (subset != "0").sum() / len(subset)
+			columns_to_keep = called_proportion[called_proportion >= float(args.samples_called)].index.tolist()
+			final_cols = [col for col in mx_allele.columns if col == mx_allele.columns[0] or col in columns_to_keep or col in loci2include]
+			mx_allele = mx_allele[final_cols]
+
 		pos_t1 = len(mx_allele.columns[1:])
 		print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.")
 		print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.", file = log)
@@ -356,12 +355,12 @@
 		if float(args.samples_called) > 0.0:
 			print("Keeping the sites/loci with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...")
 			print("Keeping the sites/loci with information in >= " + str(float(args.samples_called) * 100) + "%% of the samples...", file = log)
-			for col in mx_allele.columns[1:]:
-				values = mx_allele[col].values.tolist()
-				if (len(values)-values.count("0"))/len(values) < float(args.samples_called):
-					mx_allele = mx_allele.drop(columns=col)
-				elif (len(values)-values.count(0))/len(values) < float(args.samples_called):
-					mx_allele = mx_allele.drop(columns=col)
+			cols_to_check = mx_allele.columns[1:]
+			subset = mx_allele[cols_to_check]
+			called_proportion = (subset != "0").sum(axis=0) / len(subset)
+			columns_to_keep = called_proportion[called_proportion >= float(args.samples_called)].index.tolist()
+			final_cols = [mx_allele.columns[0]] + [col for col in cols_to_check if col in columns_to_keep]
+			mx_allele = mx_allele[final_cols]
 		pos_t1 = len(mx_allele.columns[1:])
 		print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.")
 		print("\tFrom the " + str(pos_t0) + " loci/positions, " + str(pos_t1) + " were kept in the matrix.", file = log)
@@ -380,25 +379,26 @@
 	# cleaning allele matrix (rows) ----------
-	if float(args.loci_called) > 0.0:
+	if float(args.loci_called) >= 0.0:
 		print("Cleaning the profile matrix using a threshold of >" + str(args.loci_called) + " alleles/positions called...")
 		print("Cleaning the profile matrix using a threshold of >" + str(args.loci_called) + " alleles/positions called...", file = log)
 		allele_mx = pandas.read_table(allele_filename, dtype = str)
-		report_allele_mx = {}
 		len_schema = len(allele_mx.columns) - 1
-
+
+		missing_counts = allele_mx.isin(["0"]).sum(axis=1)
+		called_counts = len_schema - missing_counts
+
+		report_allele_mx = {}
 		report_allele_mx["samples"] = allele_mx[allele_mx.columns[0]]
-		report_allele_mx["missing"] = allele_mx.isin(["0"]).sum(axis=1)
-		report_allele_mx["called"] = len_schema - allele_mx.isin(["0"]).sum(axis=1)
-		report_allele_mx["pct_called"] = (len_schema - allele_mx.isin(["0"]).sum(axis=1)) / len_schema
+		report_allele_mx["missing"] = missing_counts
+		report_allele_mx["called"] = called_counts
+		report_allele_mx["pct_called"] = called_counts / len_schema
 		report_allele_df = pandas.DataFrame(data = report_allele_mx)
-		if float(args.loci_called) != 1.0:
-			flt_report = report_allele_df[report_allele_df["pct_called"] > float(args.loci_called)]
-		else:
-			flt_report = report_allele_df[report_allele_df["pct_called"] == float(args.loci_called)]
+
+		flt_report = report_allele_df[report_allele_df["pct_called"] >= float(args.loci_called)]
+
 		pass_samples = flt_report["samples"].values.tolist()
 		print("\tFrom the " + str(len(allele_mx[allele_mx.columns[0]].values.tolist())) + " samples, " + str(len(pass_samples)) + " were kept in the profile matrix.")
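Reviewer note: the same vectorized column-filtering pattern appears four times in this patch (twice in partitioning_HC.py, twice in partitioning_grapetree.py), and is the main source of the efficiency gain announced for v2.6.0. A minimal, self-contained sketch of the technique follows; the toy matrix, sample IDs, and threshold are invented for illustration, and missing alleles are encoded as the string "0" as in ReporTree profile matrices:

```python
import pandas as pd

# Toy allele matrix: first column holds sample IDs, remaining columns are loci.
# Missing calls are encoded as the string "0", as in ReporTree profiles.
allele_mx = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "locus1": ["1", "2", "1", "3"],   # called in 4/4 samples
    "locus2": ["0", "0", "0", "1"],   # called in 1/4 samples
    "locus3": ["1", "0", "2", "2"],   # called in 3/4 samples
})

samples_called = 0.5  # keep loci called in >= 50% of the samples

# Vectorized replacement for the old per-column loop: build one boolean
# matrix, take a single column-wise sum, then select columns once.
cols_to_check = allele_mx.columns[1:]
subset = allele_mx[cols_to_check]
called_proportion = (subset != "0").sum(axis=0) / len(subset)
columns_to_keep = called_proportion[called_proportion >= samples_called].index.tolist()
final_cols = [allele_mx.columns[0]] + [col for col in cols_to_check if col in columns_to_keep]
allele_mx = allele_mx[final_cols]

print(list(allele_mx.columns))  # ['sample', 'locus1', 'locus3']
```

Compared with the previous approach, which iterated over every column in Python and called `DataFrame.drop` once per discarded locus, this does a constant number of pandas operations regardless of schema size, which matters for cgMLST schemas with thousands of loci.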