Author: Osamu Gotoh
Contact: Osamu Gotoh o.gotoh@aist.go.jp
Last updated: 11/01/2017
Intron length distribution (ILD) is a specific feature of a genome. This suite includes four programs related to ILD: fitild, compild, decompild, and plotild.
- Given an experimentally observed ILD, fitild calculates the maximum likelihood estimate of the parameters of a statistical model consisting of one, two, or three components of Frechet or lognormal distributions. The goodness of fit is evaluated by three statistics: the residual root mean square error, AIC, and BIC. As an option, the fit can be confirmed visually in a graph displayed in a separate window.
- Compild calculates a distance matrix for a given set of experimental or model ILDs. Typically, the outputs of fitild are used as the model ILDs. In each run, one of seven different measures of distance may be chosen.
- Decompild decomposes a model ILD into individual components and outputs 1) fractional contribution, 2) mode, 3) median, 4) width (= 3/4 quantile - 1/4 quantile) of each component, and 5) boundary between i-th and (i+1)-th components.
- Plotild displays one or more experimental and/or model ILDs in a single window. Hence, plotild is useful for visually comparing the ILDs of several species.
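The three fit statistics reported by fitild can be illustrated with their textbook definitions. The following is a minimal Python sketch, not code from this suite: the function names are hypothetical, and how fitild itself counts free parameters and observations, or which values enter the residuals, is an assumption not stated in this document.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def rmse(observed, predicted):
    """Root mean square of the residuals between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(aic(-100.0, 3))
    print(bic(-100.0, 3, 100))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

For both AIC and BIC a lower value indicates a better trade-off between fit and model size, which is how they are normally used to choose among one-, two-, and three-component models.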
This suite of programs has so far been compiled only with the g++ compiler on a Linux system. The external dependencies are the GNU Scientific Library (GSL) and libLBFGS. In addition, Gnuplot is assumed to be installed on the system.
- bin: binaries
- doc: documents
- perl: perl scripts
- sample: sample ild files
- src: source codes
- table: text files used by programs and scripts
- gnm2tab: list of genomes with some characteristics
- IldModel_1.txt: list of one-component Frechet models
- IldModel_2.txt: list of two-component Frechet models
- IldModel_3.txt: list of three-component Frechet models
% ./configure [--help]
% make
% make install
The input to fitild is a simple text file, each line of which should be:
intron_length frequency
The lines should be sorted numerically by intron_length in ascending order. A header line is optional.
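The input format above can be checked before running fitild. The following Python sketch is only an illustration under stated assumptions: `read_ild` is a hypothetical helper, fields are assumed to be separated by whitespace, intron_length is assumed to be an integer, and only a single leading non-numeric line is accepted as the optional header.

```python
def read_ild(lines):
    """Parse 'intron_length frequency' records from an iterable of text lines.

    Returns a list of (intron_length, frequency) tuples.
    Raises ValueError on a malformed line or on descending intron lengths.
    """
    records = []
    seen_first = False
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        is_first = not seen_first
        seen_first = True
        try:
            length = int(fields[0])
            freq = float(fields[1])
        except (ValueError, IndexError):
            if is_first:
                continue  # tolerate one optional header line
            raise ValueError(
                f"line {lineno}: expected 'intron_length frequency'"
            )
        if records and length < records[-1][0]:
            raise ValueError(
                f"line {lineno}: intron_length is not in ascending order"
            )
        records.append((length, freq))
    return records


if __name__ == "__main__":
    # Made-up records, for illustration only.
    example = ["length freq", "50 3", "51 7", "60 1"]
    print(read_ild(example))
```

To check a real file, pass an open file object instead of the list, since iterating over a file yields its lines.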
- (A)
% fitild [-N|G] [other options] -d IldModel.txt ild
- (B)
% fitild [-N|G] [other options] ild initial_parameter_values
-a: as is; no optimization is attempted
-b[1|2]: BFGS (Broyden-Fletcher-Goldfarb-Shanno) method
-c#: degree of convergence (1e-4)
-d IldModel.txt: list of template ild parameters
-f: Fletcher-Reeves conjugate gradient method
-g[out]: graphic output (screen)
-h: display help
-i: use template of the same identifier as the input ild file
-k#: don't use template with sample size less than # (0)
-l#: lower bound of plot in log_10 (1)
-m: multiple methods in series (default)
-n[1|2]: Nelder-Mead simplex method (2)
-p: Polak-Ribière conjugate gradient method
-r#: maximum number of iterations (1000)
-s#: step size in GSL minimization function (1e-2)
-u#: upper bound of plot in log_10 (6)
-x#: number of bins (100)
-G: geometric model (Frechet model)
-L#: lower bound of intron length (0)
-N: lognormal model (Frechet model)
-V[#]: verbose output
-U#: upper bound of intron length (inf)
- Frechet model: one of the following three forms, where a: amplitude, m: position, t: scale, and k: shape parameter.
  - 1.0 m_1 t_1 k_1 (one component)
  - a_1 m_1 t_1 k_1 m_2 t_2 k_2 (two components)
  - a_1 m_1 t_1 k_1 m_2 t_2 k_2 a_2 m_3 t_3 k_3 (three components)
- lognormal model: one of the following three forms, where a: amplitude, m: position, and s: scale parameter.
  - 1.0 m_1 s_1 (one component)
  - a_1 m_1 s_1 m_2 s_2 (two components)
  - a_1 m_1 s_1 m_2 s_2 a_2 m_3 s_3 (three components)
- geometric model: one of the following three forms, where a: amplitude, q: common ratio, and d: shift parameter.
  - 1.0 q_1 d_1 (one component)
  - a_1 q_1 d_1 q_2 d_2 (two components)
  - a_1 q_1 d_1 q_2 d_2 a_2 q_3 d_3 (three components)
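To make the roles of the Frechet parameters m (position), t (scale), and k (shape) concrete, the following Python sketch evaluates a single component and derives its mode, median, and interquartile width. It assumes the standard three-parameter Frechet distribution; whether fitild and decompild use exactly this parameterization is an assumption, and all function names are hypothetical. The numerical values are those passed to fitild in the ascasuum.ild example of this README.

```python
import math


def frechet_pdf(x, m, t, k):
    """Density of a three-parameter Frechet distribution (zero for x <= m)."""
    if x <= m:
        return 0.0
    z = (x - m) / t
    return (k / t) * z ** (-1.0 - k) * math.exp(-(z ** (-k)))


def frechet_quantile(p, m, t, k):
    """Inverse of the CDF F(x) = exp(-((x - m) / t) ** -k), for 0 < p < 1."""
    return m + t * (-math.log(p)) ** (-1.0 / k)


def frechet_mode(m, t, k):
    """Location of the density maximum."""
    return m + t * (k / (1.0 + k)) ** (1.0 / k)


def summarize_component(m, t, k):
    """Mode, median, and width (3/4 quantile minus 1/4 quantile)."""
    q1 = frechet_quantile(0.25, m, t, k)
    q3 = frechet_quantile(0.75, m, t, k)
    return {
        "mode": frechet_mode(m, t, k),
        "median": frechet_quantile(0.5, m, t, k),
        "width": q3 - q1,
    }


if __name__ == "__main__":
    # One-component parameters: m_1, t_1, k_1 (amplitude 1.0 is implied).
    summary = summarize_component(-408.40, 991.29, 2.6236)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

A negative position parameter, as in this example, simply shifts the lower end of the support below zero, so every positive intron length has nonzero density.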
- (A)
% compild [options] -d IldModel.txt
- (B)
% compild [options] ild1 ild2 ...
-c: cosine distance
-e: Euclid distance
-f: Euclid^2 distance
-j: Jaccard distance
-k: Kullback-Leibler distance
-m: Manhattan distance
-s: Jensen-Shannon distance
-x#: maximum intron length
-H#: minimum frequency
-O: output each pair in one line
-Q#: horizontal at # quantile points
-i[a|e|f|g|l|p]: input mode (B) only
  a: alternative
  e: every pair (default)
  f: first and others
  l: last and others
  p: alternative
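Several of the distance measures listed above can be written down from their textbook definitions for two discrete distributions defined on the same bins. The Python sketch below is an illustration only: the function names are hypothetical, and compild's actual binning, normalization, logarithm base, and handling of empty bins are not described in this document and may differ.

```python
import math


def normalize(freqs):
    """Scale non-negative frequencies so that they sum to 1."""
    total = float(sum(freqs))
    if total <= 0.0:
        raise ValueError("frequencies must have a positive sum")
    return [f / total for f in freqs]


def manhattan(p, q):
    """Sum of absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def euclid(p, q):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def kullback_leibler(p, q):
    """KL divergence of p from q in nats; infinite if q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kullback_leibler(p, mid) + 0.5 * kullback_leibler(q, mid)


if __name__ == "__main__":
    # Made-up binned frequencies, for illustration only.
    p = normalize([5, 3, 2, 0])
    q = normalize([1, 2, 3, 4])
    print(manhattan(p, q), euclid(p, q))
    print(kullback_leibler(p, q), jensen_shannon(p, q))
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and always finite, which makes it convenient when some bins are empty in one of the two ILDs.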
-N: lognormal (Frechet)
-g out: write into out (display on screen)
-l#: lower bound of plot in log_10 (1)
-s dir: directory of ild files
-u#: upper bound of plot in log_10 (6)
-x#: number of bins (100)
-C: plot cumulative probability (pdf)
-L#: shortest intron length (0)
-G#: geometric model (Frechet model)
-N: lognormal model (Frechet model)
-U#: longest intron length (inf)
% fitild -g ascasuum.ild 1.0 -408.40 991.29 2.6236
% fitild -d IldModel_2.txt gallgall.ild
% compild -k -if acidrich.ild ascasuum.ild dicdis.ild
% compild -H1000 -d IldModel_1.txt
% decompild IldModel_3.txt
% plotild -u4 acidrich.ild dictdisc.ild -d IldModel_2.txt acidrich dictdis
Gotoh, O. "Modeling one thousand intron length distributions with fitild" (in preparation).