# 0. Installation
- Downloading and decompressing aRNApipe
- aRNApipe files
- Adapting the configuration library
- Migrating aRNApipe to a workload manager other than LSF
- Complete guide to install aRNApipe dependencies
The following sections provide a guide for installing aRNApipe in a cluster environment or on a single machine.
### Downloading and decompressing aRNApipe
The Python scripts and library files of aRNApipe can be downloaded using two approaches:

Using git from the GitHub aRNApipe repository:
- Go to the directory where the aRNApipe scripts will be placed
- Clone the aRNApipe repository:
```
git clone https://github.com/HudsonAlpha/aRNAPipe.git
```
- Once cloned, a folder named aRNAPipe will be created in the working directory with all the files required for running aRNApipe.
Downloading aRNApipe as a zip file from GitHub:
- Go to the directory where the aRNApipe scripts will be placed
- Download aRNApipe from the GitHub repository:
```
wget https://github.com/HudsonAlpha/aRNAPipe/archive/master.zip
```
- Once downloaded, decompress the zip file and rename the decompressed folder:
```
unzip master.zip
mv aRNAPipe-master aRNAPipe
```
The path to the aRNAPipe folder is the pipeline base path and should be provided to the users in order to execute the pipeline (referred to as $BP in the User Installation wiki page).
### aRNApipe files
The schema below shows the structure of the aRNApipe script files and all the included libraries.
The aRNApipe directory is organized as follows:
- ./aRNAPipe: Main folder with the wrapper functions for aRNApipe, the Spider and the reference builder.
- ./aRNAPipe/lib: Python libraries used by the three main applications.
- ./aRNAPipe/R: R scripts used by the Spider module.
  - log_stats.R: Reads the output generated by the function spider_stats.stats_log() and generates a figure showing the timeline of the executed jobs and their memory usage.
  - stats_alg.R: Reads the output generated by the functions spider_stats.stats_htseq() and spider_stats.stats_star() and generates the files required for showing PCA plots and count percentile statistics.
- ./aRNAPipe/template: This folder contains the templates used by aRNApipe.
  - config.txt and samples.list: These templates are used when calling the aRNApipe skeleton mode.
  - TEMPLATE_[X].html: HTML templates used by the Spider to build the reports using JavaScript functions.
- ./aRNAPipe/html: Libraries (JavaScript and stylesheets) required by the web reports. Four main JavaScript libraries are used:
  - jQuery: Fast, small, and feature-rich JavaScript library.
  - DataTables: A table plug-in for jQuery.
  - amCharts: JavaScript/HTML5 charts.
  - Lytebox: Lightweight, cross-browser-compatible JavaScript lightbox and content viewer.
### Adapting the configuration library (config.py)
The configuration library (*./aRNAPipe/lib/config.py*) stores all the system-specific variables. These variables are categorized in 4 main groups:
1) Cluster settings: The variable *mode* sets the library used to manage job submission. When set to LSF, aRNApipe uses the library designed for clusters running the LSF workload manager (sys_LSF.py). The administrator can also set the mode to LOCAL to run the pipeline on a single machine (sys_single.py); in this case all jobs are processed sequentially. Finally, the administrator can select OTHER to use the functions defined in sys_OTHER.py; this library can be edited by the administrator to adapt aRNApipe to other workload managers such as Slurm or PBS:
```python
mode = "LSF"
```
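The mapping from *mode* to scheduler library can be pictured with a small sketch (load_scheduler and the module-name strings are illustrative; the actual wrapper code in aRNApipe may differ):

```python
# Hypothetical sketch of how the 'mode' setting could select the scheduler
# library; not the actual aRNApipe wrapper code.
def load_scheduler(mode):
    # Map each supported mode to its library module name
    libraries = {"LSF": "sys_LSF", "LOCAL": "sys_single", "OTHER": "sys_OTHER"}
    if mode not in libraries:
        raise ValueError("Unsupported mode: " + mode)
    return libraries[mode]

print(load_scheduler("LSF"))
```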
2) Paths to binaries used by aRNApipe: The administrator should verify that all the programs required by aRNApipe are installed and provide the path to each of the program binaries:
```python
# PATHS TO BINARIES USED BY aRNApipe
path_fastqc = "/gpfs/gpfs1/software/fastqc/fastqc"
path_kallisto = "/gpfs/gpfs1/software/kallisto-0.42.4/kallisto"
path_star = "/gpfs/gpfs1/software/STAR_2.5.2b/bin/Linux_x86_64_static/STAR"
path_htseq = "/gpfs/gpfs1/software/HTSeq-0.5.3/bin/htseq-count"
path_picard = "/gpfs/gpfs1/software/picard-tools-1.88"
path_samtools = "/gpfs/gpfs1/software/samtools-1.2/bin/samtools"
path_varscan = "/gpfs/gpfs1/software/varscan/VarScan.v2.3.6.jar"
path_gtf2gp = "/gpfs/gpfs1/software/ucsc/gtfToGenePred"
path_gatk = "/gpfs/gpfs1/software/GATK-3.5/GenomeAnalysisTK.jar"
path_cutadapt = "/gpfs/gpfs1/software/python2.7/bin/cutadapt"
path_trimgalore = "/gpfs/gpfs1/myerslab/reference/genomes/rnaseq_pipeline/bin/trim_galore"
path_starfusion = "/gpfs/gpfs1/software/STAR_2.5.2b/STAR-Fusion/STAR-Fusion"
path_jsplice = "/gpfs/gpfs1/software/jSplice-1.0.1/"
```
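Before launching projects, it can be worth checking that every configured binary actually exists on the system. A minimal sketch (check_paths is a hypothetical helper and the example paths are placeholders, not part of aRNApipe):

```python
import os

# Hypothetical helper: report configured paths that do not exist on disk.
# The example paths are placeholders, not aRNApipe defaults.
def check_paths(paths):
    # Return the subset of configured paths missing from the filesystem
    return {name: p for name, p in paths.items() if not os.path.exists(p)}

paths = {"path_fastqc": "/opt/fastqc/fastqc",
         "path_star": "/opt/STAR/bin/STAR"}
for name, p in check_paths(paths).items():
    print("Missing binary for %s: %s" % (name, p))
```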
The current version of aRNApipe has been designed to work with the following applications:
- Main tools:
  - Quality filtering of reads and adapter trimming using TrimGalore (v0.4.1) and cutadapt (v1.8.1)
  - Quality control of the input FASTQ files using FastQC (v0.10.1)
  - RNA-seq pseudo-alignment using Kallisto (v0.42.4)
  - RNA-seq alignment using STAR (v2.5.2b)
  - Quality control of the alignment performed by STAR using Picard (v1.88)
  - Generation of gene counts using HTSeq (v0.5.3)
  - Generation of exon counts using HTSeq (v0.5.3)
  - Identification of gene fusions using STAR-Fusion (v0.8.0)
  - Variant detection and calling using VarScan (v2.3.6)
  - Variant detection and calling using GATK (v3.5)
  - Differential alternative splicing detection using jSplice (v1.0.1)
- Additional tools:
  - SAMtools (v1.2)
  - gtfToGenePred (UCSC)

At the end of this section the user can find a complete guide to installing the aRNApipe dependencies.
3) STAR options: The administrator can provide default argument sets for STAR that users will be able to select using the star_args argument in the project configuration file. Three options are already provided ('default' STAR arguments, 'encode' recommended arguments for long RNA-seq, and 'fusion' recommended arguments for gene fusion detection):
```python
# STAR options: the keys of this dict are used in the project config files
# to select the corresponding STAR arguments
star_options = {"default": "",
                "encode": "--outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000"}
```
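How a project's star_args setting could resolve against this dict can be sketched as follows (resolve_star_args is illustrative and the 'encode' entry is abbreviated; this is not the actual aRNApipe lookup code):

```python
# Sketch of resolving the star_args project setting against star_options.
# The 'encode' entry is abbreviated here for readability.
star_options = {"default": "",
                "encode": "--outFilterType BySJout --outFilterMultimapNmax 20"}

def resolve_star_args(star_args):
    # Fall back to the 'default' entry when the key is unknown
    return star_options.get(star_args, star_options["default"])

print(resolve_star_args("encode"))
```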
Environment variables: Some of the programs used by aRNApipe may require specific libraries. aRNApipe is able to dynamically set the required environment variables when running a dataset without modifying the current user environment:
```python
# ENVIRONMENT VARIABLES TO CHANGE (ONLY FOR THE PIPELINE EXECUTION)
environment = {"JAVA_HOME": "/usr/java/jdk1.8.0_60/",
               "PYTHONPATH": "/gpfs/gpfs1/software/HTSeq-0.5.3/lib/python",
               "PATH": "/gpfs/gpfs1/software/Python-2.7.2/bin",
               "LD_LIBRARY_PATH": "/gpfs/gpfs1/software/gcc-4.8.2/usr/lib64",
               "PERL5LIB": "/gpfs/gpfs1/software/perl-modules/lib/perl5/5.10.1:/gpfs/gpfs1/software/perl-modules/lib/perl5/5.10.1/lib64/perl5"}
```
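One common way to apply such settings for a single subprocess without touching the caller's environment is to copy and update os.environ; a sketch (pipeline_env is a hypothetical helper and the values are placeholders, not the actual aRNApipe mechanism):

```python
import os

# Sketch: build a modified environment for a single subprocess call,
# leaving the caller's environment untouched. Values are placeholders.
environment = {"JAVA_HOME": "/usr/java/jdk1.8.0_60/",
               "PATH": "/opt/python2.7/bin"}

def pipeline_env(overrides):
    # Copy the current environment and layer the pipeline settings on top
    env = dict(os.environ)
    env.update(overrides)
    return env

env = pipeline_env(environment)
# env could then be passed to subprocess.Popen(..., env=env)
```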
4) GATK: See the GATK subsection in the Reference Builder section of the user guide:
```python
# FILES REQUIRED BY GATK
gatk_multithread = {"RTC": 4, "BR": 4, "PR": 4}
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf", ["1000G_phase1.indels.b37.vcf", "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}
```
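For illustration, the annots_gatk entry for a genome build unpacks into the dbSNP VCF and the list of known-indel VCFs (a sketch using the file names above; not aRNApipe code):

```python
# Sketch: unpack the annots_gatk entry for a genome build into the dbSNP
# file and the known-indel files (file names taken from the config above).
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf",
                           ["1000G_phase1.indels.b37.vcf",
                            "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}

dbsnp, indels = annots_gatk["g1k_v37"]
print(dbsnp)        # the dbSNP VCF
print(len(indels))  # how many known-indel VCFs are configured
```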
### Migrating aRNApipe to a workload manager other than LSF
aRNApipe has been designed to ease migration to cluster environments based on workload managers other than LSF: all the workload-manager-specific functions are concentrated in a single library. When migrating to another workload manager, the administrator must set the parameter *mode* to *OTHER* in the main configuration file *./lib/config.py*. The library file *./lib/sys_OTHER.py* contains five functions that must be edited to use the specific commands of the workload manager being used.
`get_pid(path, secs)`

This function returns the process ID of a job by parsing the output of the job submission command. When submitting a job, aRNApipe writes the command output to a file; the path to this file is provided as the first argument (path). The second argument (secs) sets the time in seconds the function waits before parsing the output file.
When a job is submitted on LSF-based systems using the command bsub, the command writes the following output to the standard shell:
```
Job <435124> is submitted to default queue <normal>.
```
aRNApipe writes this output to a temporary file and parses it using the function get_pid:
```python
def get_pid(path, secs):
    time.sleep(secs)
    try:
        f = open(path, 'r')
        uds = f.readline().rstrip().split(" ")[1].replace(">", "").replace("<", "")
        f.close()
        os.system("rm " + path)
        return uds
    except:
        # error opening or parsing the job submission output file
        return "NA"
```
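As an example of porting this function, a hypothetical Slurm version could parse the output of sbatch, which prints "Submitted batch job <id>" (a sketch under that assumption; adapt to the submission command actually used):

```python
import os
import time

# Hypothetical get_pid for Slurm, assuming the job was submitted with
# sbatch and its output ("Submitted batch job <id>") was written to path.
def get_pid(path, secs):
    time.sleep(secs)
    try:
        f = open(path, 'r')
        # The job ID is the fourth whitespace-separated field
        uds = f.readline().rstrip().split(" ")[3]
        f.close()
        os.remove(path)
        return uds
    except:
        # error opening or parsing the job submission output file
        return "NA"
```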
`job_kill(uds)`

This function kills a job given its job ID:
```python
def job_kill(uds):
    os.system("bkill " + uds + " &>/dev/null")
```
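A hypothetical Slurm counterpart would swap bkill for scancel (a sketch; scancel is the standard Slurm job-cancellation command):

```python
import os

# Hypothetical Slurm version of job_kill: scancel replaces bkill.
def job_kill(uds):
    os.system("scancel " + uds + " > /dev/null 2>&1")
```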
`job_status(path)`

This function returns the status of a job given the path to the log file generated by the workload manager. This log file provides information about the job execution and can be parsed to determine whether the job has terminated successfully, has exited with errors, or is still running. The function returns 1 if the job terminated successfully, 0 if it is still running, and -1 if it exited with errors.
```python
def job_status(path):
    if os.path.exists(path):  # if the log file exists, the job has terminated (successfully or with errors)
        status = ""
        f = open(path, 'r')
        for i in f:
            i = i.strip("\n")
            if i.startswith("Subject:") and i.endswith("Exited"):  # the job exited with an error
                status = "error"
            elif i.startswith("Subject:") and i.endswith("Done"):  # the job terminated successfully
                status = "ok"
            elif status == "error" and i.startswith("TERM_REQUEUE_OWNER"):  # the job was requeued, which means it is still running
                status = "requeued"
        f.close()
        if status == "ok":
            return 1
        elif status == "error":
            return -1
        else:
            return 0
    else:  # the log file does not exist --> job still running
        return 0
```
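Slurm does not append an LSF-style summary to the job log, so a port of job_status would more likely query the scheduler itself (for example via sacct). As a sketch, a hypothetical helper mapping the state strings reported by `sacct -j <id> -o State -n` to the same 1/0/-1 convention:

```python
# Hypothetical helper for a Slurm port of job_status: maps the job state
# reported by "sacct -j <id> -o State -n" to the 1/0/-1 convention above.
def map_slurm_state(state):
    state = state.strip()
    if state == "COMPLETED":
        return 1       # terminated successfully
    if state in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"):
        return -1      # exited with errors
    return 0           # PENDING, RUNNING, REQUEUED, ... -> still running
```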
`submit_job(wt, n, q, output_log, jid, path2script, script, pt, bsub_suffix)`

This function submits a job to the cluster queue system. The arguments are described below:
- wt: Wall time for the job
- n: Number of CPUs that will be used
- q: Queue for the job
- output_log: Path to the output log of the job (used by other functions such as job_status())
- jid: Job identifier
- path2script: Path to the script file or the command to execute, depending on the value of script
- script: If 1, the job is a script whose path is given by path2script; if 0, the job is a command given by path2script
- pt: Path to the project folder (used to generate the output command file that is parsed to obtain the job's PID)
- bsub_suffix: Additional arguments appended to the job submission command (e.g., bsub), generated by the function get_bsub_arg()
Implementation for LSF-based systems (./lib/sys_LSF.py):
```python
def submit_job(wt, n, q, output_log, jid, path2script, script, pt, bsub_suffix):
    r = str(random.randint(0, 10000))  # random number for naming the output command file that is parsed to obtain the job PID
    opts = "-R select[type==any] " + bsub_suffix  # arguments for bsub
    if script == 1:  # script file
        command = 'bsub ' + opts + ' -W ' + wt + " -n " + n + " -q " + q + " -o " + output_log + " -J " + jid + " < " + path2script
    else:  # command
        command = 'bsub ' + opts + ' -W ' + wt + " -n " + n + " -q " + q + " -o " + output_log + " -J " + jid + " '" + path2script + "'"
    os.system(command + " > " + pt + "/temp/temp" + jid + "_" + r + ".txt")
    uds = get_pid(pt + "/temp/temp" + jid + "_" + r + ".txt", 3)
    return uds
```
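For other workload managers the same structure applies with a different submission command. As a sketch, the command string a hypothetical Slurm version could build with sbatch (build_sbatch_command is illustrative, not part of aRNApipe; the flag names are standard sbatch options but should be checked against the local setup):

```python
# Hypothetical sketch of the command string a Slurm version of submit_job
# could build; the surrounding submission and PID-parsing logic would
# mirror the LSF implementation above.
def build_sbatch_command(wt, n, q, output_log, jid, path2script, script, suffix):
    opts = ("--time=" + wt + " --ntasks=" + n + " --partition=" + q +
            " --output=" + output_log + " --job-name=" + jid)
    if suffix:
        opts += " " + suffix
    if script == 1:  # the job is a script file
        return "sbatch " + opts + " " + path2script
    # the job is a plain command, wrapped for sbatch
    return "sbatch " + opts + " --wrap='" + path2script + "'"

print(build_sbatch_command("24:00", "4", "normal", "job.log", "j1", "run.sh", 1, ""))
```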
`get_bsub_arg(pars, L)`

This function parses the module configuration string "a/b/c" from the project configuration file, where a is the number of CPUs per job, b the requested memory in GB (or NA), and c the number of concurrent jobs (or NA). It returns the number of CPUs per job (nproc), the number of jobs (nchild), and a string with the arguments required for requesting a certain amount of memory and spanning a single node:
```python
def get_bsub_arg(pars, L):
    pars = pars.split("/")
    bsub_suffix = list()
    nproc = int(pars[0])
    if pars[1] != "NA":  # memory (GB)
        mem = str(int(pars[1]) * 1024)
        bsub_suffix.append("-R rusage[mem=" + mem + "]")
    if pars[2] != "NA":  # explicit job count --> request a single node
        nchild = int(pars[2])
        bsub_suffix.append("-R span[hosts=1]")
    else:                # no job count --> one CPU per job
        nchild = nproc
        nproc = 1
    bsub_suffix = " ".join(bsub_suffix)
    if nchild > L:       # cap the number of jobs at L
        nchild = L
    return nproc, nchild, bsub_suffix
```
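To make the "a/b/c" behavior concrete, the following self-contained copy of the function shows how two example configuration strings are parsed:

```python
# Self-contained copy of get_bsub_arg, reproduced here to illustrate the
# "a/b/c" configuration format with two worked examples.
def get_bsub_arg(pars, L):
    pars = pars.split("/")
    bsub_suffix = list()
    nproc = int(pars[0])
    if pars[1] != "NA":  # memory (GB)
        mem = str(int(pars[1]) * 1024)
        bsub_suffix.append("-R rusage[mem=" + mem + "]")
    if pars[2] != "NA":  # explicit job count --> request a single node
        nchild = int(pars[2])
        bsub_suffix.append("-R span[hosts=1]")
    else:                # no job count --> one CPU per job
        nchild = nproc
        nproc = 1
    bsub_suffix = " ".join(bsub_suffix)
    if nchild > L:       # cap the number of jobs at L
        nchild = L
    return nproc, nchild, bsub_suffix

# "4/16/2": 4 CPUs per job, 16 GB of memory, 2 concurrent jobs
# -> (4, 2, "-R rusage[mem=16384] -R span[hosts=1]")
print(get_bsub_arg("4/16/2", 10))
# "8/NA/NA": no memory request, 8 jobs of 1 CPU each -> (1, 8, "")
print(get_bsub_arg("8/NA/NA", 10))
```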
### Complete guide to install aRNApipe dependencies
#### FastQC
Execute the following commands to install FastQC in the working directory:
```
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
rm fastqc_v0.11.5.zip
```
Set path_fastqc in the aRNApipe configuration file (config.py):
```python
path_fastqc = '[WORKING_DIRECTORY]/FastQC/fastqc'
```
#### Kallisto
Execute the following commands to install Kallisto in the working directory:
```
wget https://github.com/pachterlab/kallisto/releases/download/v0.42.4/kallisto_linux-v0.42.4.tar.gz
tar -xvf kallisto_linux-v0.42.4.tar.gz
rm kallisto_linux-v0.42.4.tar.gz
```
Set path_kallisto in the aRNApipe configuration file (config.py):
```python
path_kallisto = '[WORKING_DIRECTORY]/kallisto_linux-v0.42.4/kallisto'
```
#### Picard
Execute the following commands to install Picard in the working directory:
```
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.88/picard-tools-1.88.zip
unzip picard-tools-1.88.zip
rm picard-tools-1.88.zip
```
Set path_picard in the aRNApipe configuration file (config.py):
```python
path_picard = '[WORKING_DIRECTORY]/picard-tools-1.88'
```
#### samtools
Execute the following commands to install samtools in the working directory:
```
wget http://downloads.sourceforge.net/project/samtools/samtools/1.3.1/samtools-1.3.1.tar.bz2
tar -xvf samtools-1.3.1.tar.bz2
rm samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --enable-plugins --enable-libcurl --with-plugin-path=$PWD/htslib-1.3.1
make all plugins-htslib
```
Set path_samtools in the aRNApipe configuration file (config.py):
```python
path_samtools = '[WORKING_DIRECTORY]/samtools-1.3.1/samtools'
```
#### VarScan
Execute the following commands to install VarScan in the working directory:
```
mkdir varscan
cd varscan
wget http://downloads.sourceforge.net/project/varscan/VarScan.v2.3.6.jar
```
Set path_varscan in the aRNApipe configuration file (config.py):
```python
path_varscan = '[WORKING_DIRECTORY]/varscan/VarScan.v2.3.6.jar'
```
#### gtfToGenePred
Execute the following commands to install gtfToGenePred in the working directory:
```
mkdir gtfToGenePred
cd gtfToGenePred
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
```
Set path_gtf2gp in the aRNApipe configuration file (config.py):
```python
path_gtf2gp = '[WORKING_DIRECTORY]/gtfToGenePred/gtfToGenePred'
```
#### TrimGalore and cutadapt
Execute the following commands to install TrimGalore in the working directory:
```
wget http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/trim_galore_v0.4.1.zip
unzip trim_galore_v0.4.1.zip
rm trim_galore_v0.4.1.zip
```
Set path_trimgalore in the aRNApipe configuration file (config.py):
```python
path_trimgalore = '[WORKING_DIRECTORY]/trim_galore_zip/trim_galore'
```
TrimGalore requires cutadapt. To install cutadapt, execute the following commands in the working directory:
```
wget https://pypi.python.org/packages/62/bc/77da8a0f0c162831fdccb89306e65cbe14bab7eb72c150afb8e197fa262f/cutadapt-1.8.1.tar.gz
tar -xvf cutadapt-1.8.1.tar.gz
cd cutadapt-1.8.1
python setup.py build
python setup.py install
```
The cutadapt binary is installed in the system's Python bin folder. Find the path and set path_cutadapt in the aRNApipe configuration file (config.py). Example:
```python
path_cutadapt = '[PYTHON_INSTALL_DIR]/python2.7/bin/cutadapt'
```
#### jSplice
Execute the following commands to install jSplice in the working directory:
```
mkdir jSplice-1.0.1
cd jSplice-1.0.1
wget https://www.ethz.ch/content/dam/ethz/special-interest/biol/mhs-biol/molecular-health-sciences-platform-dam/documents/research/krek/jSplice-1.0.1.zip
unzip jSplice-1.0.1.zip
rm jSplice-1.0.1.zip
```
Set path_jsplice in the aRNApipe configuration file (config.py):
```python
path_jsplice = '[WORKING_DIRECTORY]/jSplice-1.0.1/'
```
#### GATK
GATK is only available to registered users. Go to https://software.broadinstitute.org/gatk/download/ and sign in. Once signed in, go to https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.5-0-g36282e4 and download GATK. Execute the following command to extract the GATK jar file:
```
tar -xvf GenomeAnalysisTK-3.5-0-g36282e4.tar.bz2
```
Set path_gatk in the aRNApipe configuration file (config.py):
```python
path_gatk = '[WORKING_DIRECTORY]/GenomeAnalysisTK.jar'
```
#### HTSeq
Execute the following commands to install HTSeq in the working directory:
```
wget https://pypi.python.org/packages/4d/3a/c7e0bf6b9061c85c6e5163ea21a01e95ed76a8f2d79d22e16cf78a74ca8f/HTSeq-0.5.3.tar.gz
tar -xvf HTSeq-0.5.3.tar.gz
cd HTSeq-0.5.3
python setup.py build
python setup.py install
```
Set path_htseq in the aRNApipe configuration file (config.py):
```python
path_htseq = '[WORKING_DIRECTORY]/HTSeq-0.5.3/build/scripts-2.7/htseq-count'
```
#### STAR
Execute the following commands to install STAR in the working directory:
```
wget https://github.com/alexdobin/STAR/archive/2.5.2b.zip
unzip 2.5.2b.zip
```
The unzipped directory already contains precompiled binaries of STAR. Set path_star in the aRNApipe configuration file (config.py):
```python
path_star = '[WORKING_DIRECTORY]/STAR-2.5.2b/bin/Linux_x86_64_static/STAR'
```
#### STAR-Fusion
Execute the following commands to install STAR-Fusion in the working directory:
```
wget https://github.com/STAR-Fusion/STAR-Fusion/releases/download/v0.8.0/STAR-Fusion_v0.8.FULL.tar.gz
tar -xvf STAR-Fusion_v0.8.FULL.tar.gz
```
Set path_starfusion in the aRNApipe configuration file (config.py):
```python
path_starfusion = '[WORKING_DIRECTORY]/STAR-Fusion_v0.8/STAR-Fusion'
```
Check the STAR-Fusion dependencies at https://github.com/STAR-Fusion/STAR-Fusion/wiki.