
Transitioning from RUM 1.x to RUM 2.0.3

There are many exciting enhancements in RUM 2.0. The user interface and the job management system have been completely rewritten to give the user more control over jobs and to improve reliability. The installation method now follows the de facto Perl standard. You can now easily check the status of a running job, stop a job, and restart one from where it left off. Some performance improvements have been made, and RUM 2 should run slightly faster for most input files.

If you have used RUM 1.x previously, we recommend reading this document in its entirety in order to familiarize yourself with the new features in RUM 2.

Standard Perl Installation

RUM's installation process now follows the convention of using Makefile.PL. Please see the README.md file in your distribution or the [Installation wiki page](Installing RUM) for installation instructions.

Cleaner User Interface

The command-line interface for RUM has changed in the following ways:

  • RUM_runner.pl has been renamed to rum_runner.

  • Command-line options now follow Unix conventions: long options start with two dashes, e.g. --index or --strand-specific, and short options have one dash, e.g. -o. Run rum_runner help for usage information. An example invocation follows this list.

  • rum_runner now has multiple actions it can perform: running a pipeline, checking the status of a job, stopping a job, resuming a stopped job, killing a job, and performing other common actions. See the Actions section below for a full list and a description of each.
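
For example, where a RUM 1.x job was started with RUM_runner.pl and single-dash options, a RUM 2 invocation might look like this (the index path and input file are placeholders):

rum_runner align \
  -o my-job \
  --name my-job \
  --index ~/rum-indexes/hg19 \
  --chunks 10 \
  reads.fq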

Actions

rum_runner now has several actions you can run; a typical job lifecycle is sketched after this list. An action is specified by the first command-line argument:

  • rum_runner align ...: Run the pipeline on a specified output directory.

  • rum_runner status -o *dir*: Check on the status of a job in a specified directory.

  • rum_runner stop -o *dir*: Stop a running job.

  • rum_runner resume -o *dir*: Resume a job that crashed or was stopped. This attempts to restart the job from the last step that completed successfully, avoiding unnecessary work.

  • rum_runner kill -o *dir*: Stop a job if it's running, and remove all traces of it. This is useful if you realize that you ran a job incorrectly and want to start over from scratch.

  • rum_runner clean -o *dir*: Remove intermediate output files from a specified output directory. Useful if you ran rum_runner align --no-clean ....

  • rum_runner help [action]: Get help. Use rum_runner help *action* to get help for a particular action.

  • rum_runner version: Print out the version of RUM.
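
To illustrate, here is a typical job lifecycle using only the actions above (paths and file names are placeholders):

# start a job
rum_runner align -o sample01 --name sample01 --index ~/rum-indexes/mm9 --chunks 10 reads.fq

# check on it, or stop it, from another shell
rum_runner status -o sample01
rum_runner stop -o sample01

# pick up from the last completed step
rum_runner resume -o sample01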

New index structure

In RUM 2, the structure of the indexes has changed, and we have introduced a new tool called rum_indexes that should be used to manage your indexes.

Each index is now stored in its own directory, and you indicate to RUM which index you want to use simply by giving the path to that directory. We recommend keeping all of your index directories under one common parent directory. When you run rum_indexes, you specify the location of this parent directory with the --prefix option.

If you have existing RUM 1.x indexes, you will need to either migrate them to the new structure or reinstall them. If your indexes are available on our server, it may be easier to simply reinstall them. You can do that by running

rum_indexes --prefix *master-index-dir*

where master-index-dir is the parent directory where you want all of your indexes to go. This will show you a list of available indexes and prompt you for which indexes you want to install.
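
For example, assuming you keep your indexes under ~/rum-indexes:

rum_indexes --prefix ~/rum-indexes

Each installed index then lives in its own subdirectory (e.g. ~/rum-indexes/mm9), and that subdirectory is the path you later pass to rum_runner's --index option.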

If you have custom indexes, you can migrate them by running

rum_indexes --prefix *master-index-dir* --migrate CONFIG_FILES

where CONFIG_FILES is a list of one or more RUM 1.x configuration files. This will read in the configuration files you list, and convert all of those indexes into the new structure. Each migrated index will be placed in a subdirectory of master-index-dir.
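
For example, if your RUM 1.x configuration files were named rum.config_mm9 and rum.config_hg19 (illustrative names; use whatever your config files are actually called), you would run:

rum_indexes --prefix ~/rum-indexes --migrate rum.config_mm9 rum.config_hg19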

Job Status

You can now get a simple report showing the progress of the pipeline by running rum_runner status -o *output-dir*.

It will show you whether you're in the "Preprocessing", "Processing", or "Postprocessing" phase. In the processing phase, it will show you each step, along with an X for each chunk that has completed that step. It should look something like this:

Processing in 10 chunks
-----------------------
XXXXXXXXXX Run bowtie on genome
XXXXXXXXXX Run bowtie on transcriptome
XXXXXXXXXX Separate unique and non-unique mappers from genome bowtie output
XXXXXXXXXX Separate unique and non-unique mappers from transcriptome bowtie
           output
XXXXXXXXXX Merge unique mappers together
XXXXXXXXXX Merge non-unique mappers together
XXXXXXXXXX Make a file containing the unmapped reads, to be passed into
           blat
XXX  X X X Run blat on unmapped reads
XXX  X X X Run mdust on unmapped reads
         X Parse blat output
         X Merge bowtie and blat results
           Clean up RUM files
           Produce RUM_Unique
           Sort RUM_Unique by location
           Sort cleaned non-unique mappers by ID
           Remove duplicates from NU
           Create SAM file
           Create non-unique stats
           Sort RUM_NU
           Generate quants

This shows that all 10 chunks are past the "Make a file containing the unmapped reads..." step, 6 of them are past the "Run mdust..." step, and one is past the "Merge bowtie and blat results" step.

In the postprocessing phase, the output should look something like this:

Postprocessing
--------------
X Merge RUM_NU files
X Make non-unique coverage
X Merge RUM_Unique files
X Compute mapping statistics
X Make unique coverage
X Finish mapping stats
X Merge SAM headers
X Concatenate SAM files
X Merge novel exons
X Merge quants
  make_junctions
  Sort junctions (all, bed) by location
  Sort junctions (all, rum) by location
  Sort junctions (high-quality, bed) by location
  Get inferred internal exons

There is only one column of X's, since the postprocessing phase is not run in parallel.

Restarting

If you start running a job and it stops for some reason, simply running rum_runner resume -o *dir* should make it pick up from the last step it successfully ran. For example, suppose you run a job like this:

rum_runner align \
  -o sample01 \
  --name sample01 \
  --index ~/rum-indexes/mm9 \
  --chunks 30 \
  ~/samples/sample01/forward.fq ~/samples/sample01/reverse.fq

rum_runner will save the settings for the job, including the name, number of chunks, input files, and any other parameters, to the file sample01/rum_job_config. As it runs, it will keep track of the state of the pipeline, based on which intermediate files are present. Then if you stop the job or it fails for some reason, you should be able to restart it simply by running

rum_runner resume -o sample01

It will load the settings from sample01/rum_job_config, examine the intermediate files to figure out what state it was in when it stopped, and then restart from there. If you're running it in a single chunk in a terminal, it will tell you which steps it is skipping:

(skipping) Run bowtie on genome
(skipping) Run bowtie on transcriptome
(skipping) Separate unique and non-unique mappers from genome bowtie output
(skipping) Separate unique and non-unique mappers from transcriptome bowtie
              output
(skipping) Merge unique mappers together
(running)  Merge non-unique mappers together
(running)  Make a file containing the unmapped reads, to be passed into
              blat
(running)  Run blat on unmapped reads
(running)  Run mdust on unmapped reads
(running)  Parse blat output
(running)  Merge bowtie and blat results

Restarting on a Cluster

Restarting RUM alignments on a compute cluster (e.g. with --qsub) is the same as restarting local runs:

rum_runner resume -o sample01

The difference is that you will not see the status messages; they will be written to a log instead. You can always see the current status of the run by tailing the log files or by running rum_runner status -o sample01.
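
For example, given the default log layout described in the Logging section below, you could follow the main log or the log for chunk 1 with:

tail -f sample01/log/rum.log
tail -f sample01/log/rum_1.log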

Logging

RUM now uses the popular Log::Log4perl module for logging if you have it installed, and falls back to a simpler home-grown logging module if you don't. In either case, all of the log files are placed in a log subdirectory of the job directory.

By default RUM will write "info" level log messages to one set of files, and "error" level messages to another set. The info level files are:

  • rum.log - Main log file; contains output from the preprocessing phase and from the master process that monitors the chunks.

  • rum_N.log, where N is a chunk number - Contains output from processing a single chunk of reads. If you don't run with multiple chunks, this output is folded into rum.log.

  • rum_postproc.log - Output from the postprocessing phase.

The corresponding error log files are rum_errors.log, rum_errors_N.log (where N is a chunk number), and rum_postproc.log. If everything goes well, these error log files should be empty.
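
A quick way to confirm a clean run is to list any non-empty error logs; for example:

find sample01/log -name 'rum_errors*' -size +0c

If this prints nothing, the main and per-chunk error logs are all empty.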

Error Handling, Job Cancellation

RUM 2.x should handle errors and allow you to stop a running job more smoothly than RUM 1.x did. For example, if you are running RUM in a terminal, simply hitting CTRL-C should kill the parent process as well as all subprocesses. If you find that it doesn't, please open an issue.

If you are running a job in the background or on a cluster, use rum_runner stop to stop it. See rum_runner help stop for more information.

Third-Party Libraries

Autodie

You will now need the autodie Perl module. If you are using Perl >= 5.10.1, it should already be installed. If not, you should be able to install it quickly by running:

cpan -i autodie
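
If you are unsure whether autodie is already available, this one-liner exits quietly when the module is installed and prints a "Can't locate" error when it is not:

perl -Mautodie -e1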

Log::Log4perl

Log::Log4perl is recommended, but not required. You should be able to install it by running:

cpan -i Log::Log4perl
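
As with autodie, you can check whether the module is already installed with a one-liner:

perl -MLog::Log4perl -e1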

If you are able to install Log::Log4perl, you can fine-tune the logging output by modifying the conf/rum_logging.conf file in the RUM distribution. This is very useful when developing RUM, but is not necessary for normal usage. If Log::Log4perl is not installed, RUM will use its own logging system, which prints the most important log messages to the log files described in the Logging section above. So if you aren't able to install Log::Log4perl, don't worry: you will most likely still get all the logging output you need for normal (non-development) usage.

See http://mschilli.github.com/log4perl/ for more information about the module.

Better SAM output

We have revised the format of the SAM output file so that it should validate against the Picard SAM parser.

New RUM job report file

As of 2.0.2, you will find a file called rum_job_report.txt in the output directory when you run a job. This file lists the parameters that were used to run the job, along with timestamps for some major milestones: when preprocessing started and finished, when each chunk finished, and when postprocessing finished.

Performance improvements

RUM 2.0.3 includes some changes that should improve performance for many jobs.

We have added a default limit of 100 non-unique Bowtie alignments per read. This can make processing Bowtie's output much faster, and it may result in fewer non-unique mappings, which can further improve performance.

Prior to RUM 2.0.3, RUM would wait for Bowtie and Blat to finish before starting to process their output. As of RUM 2.0.3, it reads the output of those programs as it is being produced. In many situations this reduces the amount of I/O and results in better CPU utilization.

We have attempted to reduce the running time of the "Quantify novel exons" step by changing the data structure that the program uses to accumulate quantification counts.

The running time will vary heavily from job to job, but we have seen improvements ranging from 17% to 24% in total CPU time (across all chunks) and from 14% to 24% in wall-clock time, using typical input files mapped against the hg19 index.

Alignment changes

Several of the changes slightly affect the alignment algorithm. By default, we now cap the number of alignments produced by Bowtie at 100. The purpose of this change is to avoid situations where we would have to consider thousands of combinations of Bowtie mappings, resulting in poor performance. The effect of this change is that RUM may now suppress some ambiguous Bowtie alignments that used to be included in the results.

We also fixed a bug in the code that processes the Bowtie genome alignments. Prior to 2.0.3, we were accidentally suppressing many non-unique alignments for paired reads.

The net effect of both of these changes is that RUM 2.0.3 may produce more alignments for some reads and fewer alignments for others. We analyzed the changes in the number of alignments per read for a typical job, consisting of approximately 87 million reads of length 101 mapped against the hg19 index. We placed each read in a class according to the number of alignments that RUM produced for it: "none", "unique", or "non-unique". For this data set, 99.985% of the reads were put in the same class by both versions of RUM.

The following table shows the number of reads that moved from one class to another, for each combination of classes:

Old mapping   New mapping   Number of reads   Percent of all reads for job
unique        unique               68655126                      78.08057%
none          none                 10011229                      11.38564%
non-unique    non-unique            9249449                      10.51928%
unique        non-unique               9641                       0.01096%
non-unique    unique                   2514                       0.00286%
unique        none                      410                       0.00047%
non-unique    none                      173                       0.00020%
none          non-unique                 18                       0.00002%
none          unique                      3                       0.00000%

Other Changes

  • We have added many unit and integration tests.

  • Intermediate output files produced for the chunks now go in *output-dir*/chunks.

  • Log files all go in *output-dir*/log.

  • The mechanism for running jobs on an SGE cluster has been made extensible, so that we may be able to support other types of clusters in the future. You should be able to run RUM on any Linux-based cluster using the --preprocess, --process, --postprocess, and --chunks options. Please see Running a job on a cluster for more details; a rough sketch follows this list.
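
The following is only a sketch of how those phase options could be combined, not a definitive recipe; it assumes (based on the Restarting section above) that each invocation reuses the settings saved in rum_job_config, and all paths are placeholders. See Running a job on a cluster for the invocations your scheduler actually needs.

# submit as separate cluster jobs, in order
rum_runner align -o sample01 --name sample01 --index ~/rum-indexes/mm9 \
  --chunks 30 --preprocess forward.fq reverse.fq
rum_runner align -o sample01 --process
rum_runner align -o sample01 --postprocess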
