# 0. Installation
- Downloading and decompressing aRNApipe
- aRNApipe files
- Adapting the configuration library
- Migrating aRNApipe to a workload manager other than LSF
- Complete guide to install aRNApipe dependencies
The following sections provide a guide for installing aRNApipe in a cluster environment or on a single machine.
### Downloading and decompressing aRNApipe
The Python scripts and library files of aRNApipe can be downloaded using two approaches:

Using git from the GitHub aRNApipe repository:
- Go to the directory where the aRNApipe scripts will be placed
- Clone the aRNApipe repository:
```
git clone https://github.com/HudsonAlpha/aRNAPipe.git
```
- Once cloned, a folder named aRNAPipe will be created in the working directory with all the files required for running aRNApipe.
Downloading aRNApipe as a zip file from GitHub:
- Go to the directory where the aRNApipe scripts will be placed
- Download aRNApipe from the GitHub repository:
```
wget https://github.com/HudsonAlpha/aRNAPipe/archive/master.zip
```
- Once downloaded, decompress the zip file and rename the decompressed folder:
```
unzip master.zip
mv aRNAPipe-master aRNAPipe
```
The path to the aRNAPipe folder is the pipeline base path and should be provided to the users in order to execute the pipeline (referred to as $BP in the User Installation wiki page).
### aRNApipe files
The schema below shows the structure of the aRNApipe script files and all the included libraries.
The aRNApipe directory is organized as follows:
- ./aRNAPipe: Main folder with the wrapper functions for aRNApipe, the Spider and the reference builder.
- ./aRNAPipe/lib: Python libraries used by the three main applications.
- ./aRNAPipe/R: R scripts used by the Spider module.
  - log_stats.R: Reads the output generated by the function spider_stats.stats_log() and generates a figure showing the timeline of the executed jobs and their memory usage.
  - stats_alg.R: Reads the output generated by the functions spider_stats.stats_htseq() and spider_stats.stats_star() and generates the files required for showing PCA plots and count percentile statistics.
- ./aRNAPipe/template: This folder contains the templates used by aRNApipe.
  - config.txt and samples.list: These templates are used when calling the aRNApipe skeleton mode.
  - TEMPLATE_[X].html: HTML templates used by the Spider to build the reports using JavaScript functions.
- ./aRNAPipe/html: Libraries (JavaScript and stylesheets) required by the web reports. Four main JavaScript libraries are used:
  - jQuery: Fast, small, and feature-rich JavaScript library.
  - DataTables: A table plug-in for jQuery.
  - amCharts: JavaScript/HTML5 charts.
  - Lytebox: Lightweight, cross-browser-compatible JavaScript lightbox and content viewer.
### Adapting the configuration library (config.py)
The configuration library (*./aRNAPipe/lib/config.py*) stores all the system-specific variables. These variables are categorized in 4 main groups:
1) Cluster settings: The variable *mode* sets the library used to manage job submission. When set to LSF, aRNApipe uses the library designed for clusters running the LSF workload manager (sys_LSF.py). The administrator can also set the mode to LOCAL to run the pipeline on a single machine (sys_single.py); in this case all jobs are processed sequentially. Finally, the administrator can select OTHER to use the functions defined in sys_OTHER.py; this library can be edited by the administrator to adapt aRNApipe to other workload managers such as Slurm or PBS:
```python
mode = "LSF"
```
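The mapping from *mode* to scheduler library can be pictured with a small sketch (load_scheduler and the module-name strings are illustrative; the actual wrapper code in aRNApipe may differ):

```python
# Hypothetical sketch of how the 'mode' setting could select the scheduler
# library; not the actual aRNApipe wrapper code.
def load_scheduler(mode):
    # Map each supported mode to its library module name
    libraries = {"LSF": "sys_LSF", "LOCAL": "sys_single", "OTHER": "sys_OTHER"}
    if mode not in libraries:
        raise ValueError("Unsupported mode: " + mode)
    return libraries[mode]

print(load_scheduler("LSF"))
```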
2) Paths to binaries used by aRNApipe: The administrator should verify that all the programs required by aRNApipe are installed and provide the path to each of the program binaries:
```python
# PATHS TO BINARIES USED BY aRNApipe
path_fastqc = "/gpfs/gpfs1/software/fastqc/fastqc"
path_kallisto = "/gpfs/gpfs1/software/kallisto-0.42.4/kallisto"
path_star = "/gpfs/gpfs1/software/STAR_2.5.2b/bin/Linux_x86_64_static/STAR"
path_htseq = "/gpfs/gpfs1/software/HTSeq-0.5.3/bin/htseq-count"
path_picard = "/gpfs/gpfs1/software/picard-tools-1.88"
path_samtools = "/gpfs/gpfs1/software/samtools-1.2/bin/samtools"
path_varscan = "/gpfs/gpfs1/software/varscan/VarScan.v2.3.6.jar"
path_gtf2gp = "/gpfs/gpfs1/software/ucsc/gtfToGenePred"
path_gatk = "/gpfs/gpfs1/software/GATK-3.5/GenomeAnalysisTK.jar"
path_cutadapt = "/gpfs/gpfs1/software/python2.7/bin/cutadapt"
path_trimgalore = "/gpfs/gpfs1/myerslab/reference/genomes/rnaseq_pipeline/bin/trim_galore"
path_starfusion = "/gpfs/gpfs1/software/STAR_2.5.2b/STAR-Fusion/STAR-Fusion"
path_jsplice = "/gpfs/gpfs1/software/jSplice-1.0.1/"
```
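Before launching projects, it can be worth checking that every configured binary actually exists on the system. A minimal sketch (check_paths is a hypothetical helper and the example paths are placeholders, not part of aRNApipe):

```python
import os

# Hypothetical helper: report configured paths that do not exist on disk.
# The example paths are placeholders, not aRNApipe defaults.
def check_paths(paths):
    # Return the subset of configured paths missing from the filesystem
    return {name: p for name, p in paths.items() if not os.path.exists(p)}

paths = {"path_fastqc": "/opt/fastqc/fastqc",
         "path_star": "/opt/STAR/bin/STAR"}
for name, p in check_paths(paths).items():
    print("Missing binary for %s: %s" % (name, p))
```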
The current version of aRNApipe has been designed to work with the following applications:
- Main tools:
  - Quality filtering of reads and adapter trimming using TrimGalore (v0.4.1) and cutadapt (v1.8.1)
  - Quality control of the input FASTQ files using FastQC (v0.10.1)
  - RNA-seq pseudo-alignment using Kallisto (v0.42.4)
  - RNA-seq alignment using STAR (v2.5.2b)
  - Quality control of the alignment performed by STAR using Picard (v1.88)
  - Generation of gene counts using HTSeq (v0.5.3)
  - Generation of exon counts using HTSeq (v0.5.3)
  - Identification of gene fusions using STAR-Fusion (v0.8.0)
  - Variant detection and calling using VarScan (v2.3.6)
  - Variant detection and calling using GATK (v3.5)
  - Differential alternative splicing detection using jSplice (v1.0.1)
- Additional tools:
  - SAMtools (v1.2)
  - gtfToGenePred (UCSC)

At the end of this section the user can find a complete guide to installing the aRNApipe dependencies.
3) STAR options: The administrator can provide default argument sets for STAR that users will be able to select using the star_args argument in the project configuration file. Three options are already provided ('default' STAR arguments, 'encode' recommended arguments for long RNA-seq, and 'fusion' recommended arguments for gene fusion detection):
```python
# STAR options: the keys of this dict are used in the project config files
# to select the corresponding STAR arguments
star_options = {"default": "",
                "encode": "--outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000"}
```
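How a project's star_args setting could resolve against this dict can be sketched as follows (resolve_star_args is illustrative and the 'encode' entry is abbreviated; this is not the actual aRNApipe lookup code):

```python
# Sketch of resolving the star_args project setting against star_options.
# The 'encode' entry is abbreviated here for readability.
star_options = {"default": "",
                "encode": "--outFilterType BySJout --outFilterMultimapNmax 20"}

def resolve_star_args(star_args):
    # Fall back to the 'default' entry when the key is unknown
    return star_options.get(star_args, star_options["default"])

print(resolve_star_args("encode"))
```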
Environment variables: Some of the programs used by aRNApipe may require specific libraries. aRNApipe is able to dynamically set the required environment variables when running a dataset without modifying the current user environment:
```python
# ENVIRONMENT VARIABLES TO CHANGE (ONLY FOR THE PIPELINE EXECUTION)
environment = {"JAVA_HOME": "/usr/java/jdk1.8.0_60/",
               "PYTHONPATH": "/gpfs/gpfs1/software/HTSeq-0.5.3/lib/python",
               "PATH": "/gpfs/gpfs1/software/Python-2.7.2/bin",
               "LD_LIBRARY_PATH": "/gpfs/gpfs1/software/gcc-4.8.2/usr/lib64",
               "PERL5LIB": "/gpfs/gpfs1/software/perl-modules/lib/perl5/5.10.1:/gpfs/gpfs1/software/perl-modules/lib/perl5/5.10.1/lib64/perl5"}
```
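One common way to apply such settings for a single subprocess without touching the caller's environment is to copy and update os.environ; a sketch (pipeline_env is a hypothetical helper and the values are placeholders, not the actual aRNApipe mechanism):

```python
import os

# Sketch: build a modified environment for a single subprocess call,
# leaving the caller's environment untouched. Values are placeholders.
environment = {"JAVA_HOME": "/usr/java/jdk1.8.0_60/",
               "PATH": "/opt/python2.7/bin"}

def pipeline_env(overrides):
    # Copy the current environment and layer the pipeline settings on top
    env = dict(os.environ)
    env.update(overrides)
    return env

env = pipeline_env(environment)
# env could then be passed to subprocess.Popen(..., env=env)
```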
4) GATK: See the GATK subsection in the Reference Builder section of the user guide:
```python
# FILES REQUIRED BY GATK
gatk_multithread = {"RTC": 4, "BR": 4, "PR": 4}
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf", ["1000G_phase1.indels.b37.vcf", "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}
```
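For illustration, the annots_gatk entry for a genome build unpacks into the dbSNP VCF and the list of known-indel VCFs (a sketch using the file names above; not aRNApipe code):

```python
# Sketch: unpack the annots_gatk entry for a genome build into the dbSNP
# file and the known-indel files (file names taken from the config above).
annots_gatk = {"g1k_v37": ["dbsnp_138.b37.vcf",
                           ["1000G_phase1.indels.b37.vcf",
                            "Mills_and_1000G_gold_standard.indels.b37.vcf"]]}

dbsnp, indels = annots_gatk["g1k_v37"]
print(dbsnp)        # the dbSNP VCF
print(len(indels))  # how many known-indel VCFs are configured
```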
### Migrating aRNApipe to a workload manager other than LSF
aRNApipe has been designed to ease migration to cluster environments based on workload managers other than LSF: all the workload-manager-specific functions are concentrated in a single library. When migrating to another workload manager, the administrator must set the parameter *mode* to *OTHER* in the main configuration file *./lib/config.py*. The library file *./lib/sys_OTHER.py* contains five functions that must be edited to use the specific commands of the workload manager being used.
`get_pid(path, secs)`

This function returns the process ID of a job by parsing the output of the job submission command. When submitting a job, aRNApipe writes the command output to a file; the path to this file is provided as the first argument (path). The second argument (secs) sets the time in seconds the function waits before parsing the output file.
When a job is submitted on LSF-based systems using the command bsub, the command writes the following output to the standard shell:
```
Job <435124> is submitted to default queue <normal>.
```
aRNApipe writes this output to a temporary file and parses it using the function get_pid:
```python
def get_pid(path, secs):
    time.sleep(secs)
    try:
        f = open(path, 'r')
        uds = f.readline().rstrip().split(" ")[1].replace(">", "").replace("<", "")
        f.close()
        os.system("rm " + path)
        return uds
    except:
        # error opening or parsing the job submission output file
        return "NA"
```
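As an example of porting this function, a hypothetical Slurm version could parse the output of sbatch, which prints "Submitted batch job <id>" (a sketch under that assumption; adapt to the submission command actually used):

```python
import os
import time

# Hypothetical get_pid for Slurm, assuming the job was submitted with
# sbatch and its output ("Submitted batch job <id>") was written to path.
def get_pid(path, secs):
    time.sleep(secs)
    try:
        f = open(path, 'r')
        # The job ID is the fourth whitespace-separated field
        uds = f.readline().rstrip().split(" ")[3]
        f.close()
        os.remove(path)
        return uds
    except:
        # error opening or parsing the job submission output file
        return "NA"
```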
`job_kill(uds)`

This function kills a job given its job ID:
```python
def job_kill(uds):
    os.system("bkill " + uds + " &>/dev/null")
```
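A hypothetical Slurm counterpart would swap bkill for scancel (a sketch; scancel is the standard Slurm job-cancellation command):

```python
import os

# Hypothetical Slurm version of job_kill: scancel replaces bkill.
def job_kill(uds):
    os.system("scancel " + uds + " > /dev/null 2>&1")
```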
`job_status(path)`

This function returns the status of a job given the path to the log file generated by the workload manager. This log file provides information about the job execution and can be parsed to determine whether the job has terminated successfully, has exited with errors, or is still running. The function returns 1 if the job terminated successfully, 0 if it is still running, and -1 if it exited with errors.
```python
def job_status(path):
    if os.path.exists(path):  # if the log file exists, the job has terminated (successfully or with errors)
        status = ""
        f = open(path, 'r')
        for i in f:
            i = i.strip("\n")
            if i.startswith("Subject:") and i.endswith("Exited"):  # the job exited with an error
                status = "error"
            elif i.startswith("Subject:") and i.endswith("Done"):  # the job terminated successfully
                status = "ok"
            elif status == "error" and i.startswith("TERM_REQUEUE_OWNER"):  # the job was requeued, which means it is still running
                status = "requeued"
        f.close()
        if status == "ok":
            return 1
        elif status == "error":
            return -1
        else:
            return 0
    else:  # the log file does not exist --> job still running
        return 0
```
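Slurm does not append an LSF-style summary to the job log, so a port of job_status would more likely query the scheduler itself (for example via sacct). As a sketch, a hypothetical helper mapping the state strings reported by `sacct -j <id> -o State -n` to the same 1/0/-1 convention:

```python
# Hypothetical helper for a Slurm port of job_status: maps the job state
# reported by "sacct -j <id> -o State -n" to the 1/0/-1 convention above.
def map_slurm_state(state):
    state = state.strip()
    if state == "COMPLETED":
        return 1       # terminated successfully
    if state in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"):
        return -1      # exited with errors
    return 0           # PENDING, RUNNING, REQUEUED, ... -> still running
```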
`submit_job(wt, n, q, output_log, jid, path2script, script, pt, bsub_suffix)`

This function submits a job to the cluster queue system. The arguments are described below:
- wt: Wall time for the job
- n: Number of CPUs that will be used
- q: Queue for the job
- output_log: Path to the output log of the job (used by other functions such as job_status())
- jid: Job identifier
- path2script: Path to the script file or the command to execute, depending on the value of script
- script: If 1, the job is a script whose path is given by path2script; if 0, the job is a command given by path2script
- pt: Path to the project folder (used to generate the output command file that is parsed to obtain the job's PID)
- bsub_suffix: Additional arguments appended to the job submission command (e.g., bsub), generated by the function get_bsub_arg()
Implementation for LSF-based systems (./lib/sys_LSF.py):
```python
def submit_job(wt, n, q, output_log, jid, path2script, script, pt, bsub_suffix):
    r = str(random.randint(0, 10000))  # random number for naming the output command file that is parsed to obtain the job PID
    opts = "-R select[type==any] " + bsub_suffix  # arguments for bsub
    if script == 1:  # script file
        command = 'bsub ' + opts + ' -W ' + wt + " -n " + n + " -q " + q + " -o " + output_log + " -J " + jid + " < " + path2script
    else:  # command
        command = 'bsub ' + opts + ' -W ' + wt + " -n " + n + " -q " + q + " -o " + output_log + " -J " + jid + " '" + path2script + "'"
    os.system(command + " > " + pt + "/temp/temp" + jid + "_" + r + ".txt")
    uds = get_pid(pt + "/temp/temp" + jid + "_" + r + ".txt", 3)
    return uds
```
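For other workload managers the same structure applies with a different submission command. As a sketch, the command string a hypothetical Slurm version could build with sbatch (build_sbatch_command is illustrative, not part of aRNApipe; the flag names are standard sbatch options but should be checked against the local setup):

```python
# Hypothetical sketch of the command string a Slurm version of submit_job
# could build; the surrounding submission and PID-parsing logic would
# mirror the LSF implementation above.
def build_sbatch_command(wt, n, q, output_log, jid, path2script, script, suffix):
    opts = ("--time=" + wt + " --ntasks=" + n + " --partition=" + q +
            " --output=" + output_log + " --job-name=" + jid)
    if suffix:
        opts += " " + suffix
    if script == 1:  # the job is a script file
        return "sbatch " + opts + " " + path2script
    # the job is a plain command, wrapped for sbatch
    return "sbatch " + opts + " --wrap='" + path2script + "'"

print(build_sbatch_command("24:00", "4", "normal", "job.log", "j1", "run.sh", 1, ""))
```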
`get_bsub_arg(pars, L)`

This function parses the module configuration string "a/b/c" from the project configuration file, where a is the number of CPUs per job, b the requested memory in GB (or NA), and c the number of concurrent jobs (or NA). It returns the number of CPUs per job (nproc), the number of jobs (nchild), and a string with the arguments required for requesting a certain amount of memory and spanning a single node:
```python
def get_bsub_arg(pars, L):
    pars = pars.split("/")
    bsub_suffix = list()
    nproc = int(pars[0])
    if pars[1] != "NA":  # memory (GB)
        mem = str(int(pars[1]) * 1024)
        bsub_suffix.append("-R rusage[mem=" + mem + "]")
    if pars[2] != "NA":  # explicit job count --> request a single node
        nchild = int(pars[2])
        bsub_suffix.append("-R span[hosts=1]")
    else:                # no job count --> one CPU per job
        nchild = nproc
        nproc = 1
    bsub_suffix = " ".join(bsub_suffix)
    if nchild > L:       # cap the number of jobs at L
        nchild = L
    return nproc, nchild, bsub_suffix
```
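To make the "a/b/c" behavior concrete, the following self-contained copy of the function shows how two example configuration strings are parsed:

```python
# Self-contained copy of get_bsub_arg, reproduced here to illustrate the
# "a/b/c" configuration format with two worked examples.
def get_bsub_arg(pars, L):
    pars = pars.split("/")
    bsub_suffix = list()
    nproc = int(pars[0])
    if pars[1] != "NA":  # memory (GB)
        mem = str(int(pars[1]) * 1024)
        bsub_suffix.append("-R rusage[mem=" + mem + "]")
    if pars[2] != "NA":  # explicit job count --> request a single node
        nchild = int(pars[2])
        bsub_suffix.append("-R span[hosts=1]")
    else:                # no job count --> one CPU per job
        nchild = nproc
        nproc = 1
    bsub_suffix = " ".join(bsub_suffix)
    if nchild > L:       # cap the number of jobs at L
        nchild = L
    return nproc, nchild, bsub_suffix

# "4/16/2": 4 CPUs per job, 16 GB of memory, 2 concurrent jobs
# -> (4, 2, "-R rusage[mem=16384] -R span[hosts=1]")
print(get_bsub_arg("4/16/2", 10))
# "8/NA/NA": no memory request, 8 jobs of 1 CPU each -> (1, 8, "")
print(get_bsub_arg("8/NA/NA", 10))
```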
### Complete guide to install aRNApipe dependencies
#### FastQC
Execute the following commands to install FastQC in the working directory:
```
wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
rm fastqc_v0.11.5.zip
```
Set path_fastqc in the aRNApipe configuration file (config.py):
```python
path_fastqc = '[WORKING_DIRECTORY]/FastQC/fastqc'
```
#### Kallisto
Execute the following commands to install Kallisto in the working directory:
```
wget https://github.com/pachterlab/kallisto/releases/download/v0.42.4/kallisto_linux-v0.42.4.tar.gz
tar -xvf kallisto_linux-v0.42.4.tar.gz
rm kallisto_linux-v0.42.4.tar.gz
```
Set path_kallisto in the aRNApipe configuration file (config.py):
```python
path_kallisto = '[WORKING_DIRECTORY]/kallisto_linux-v0.42.4/kallisto'
```
#### Picard
Execute the following commands to install Picard in the working directory:
```
wget http://downloads.sourceforge.net/project/picard/picard-tools/1.88/picard-tools-1.88.zip
unzip picard-tools-1.88.zip
rm picard-tools-1.88.zip
```
Set path_picard in the aRNApipe configuration file (config.py):
```python
path_picard = '[WORKING_DIRECTORY]/picard-tools-1.88'
```
#### samtools
Execute the following commands to install samtools in the working directory:
```
wget http://downloads.sourceforge.net/project/samtools/samtools/1.3.1/samtools-1.3.1.tar.bz2
tar -xvf samtools-1.3.1.tar.bz2
rm samtools-1.3.1.tar.bz2
cd samtools-1.3.1
./configure --enable-plugins --enable-libcurl --with-plugin-path=$PWD/htslib-1.3.1
make all plugins-htslib
```
Set path_samtools in the aRNApipe configuration file (config.py):
```python
path_samtools = '[WORKING_DIRECTORY]/samtools-1.3.1/samtools'
```
#### VarScan
Execute the following commands to install VarScan in the working directory:
```
mkdir varscan
cd varscan
wget http://downloads.sourceforge.net/project/varscan/VarScan.v2.3.6.jar
```
Set path_varscan in the aRNApipe configuration file (config.py):
```python
path_varscan = '[WORKING_DIRECTORY]/varscan/VarScan.v2.3.6.jar'
```
#### gtfToGenePred
Execute the following commands to install gtfToGenePred in the working directory:
```
mkdir gtfToGenePred
cd gtfToGenePred
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
```
Set path_gtf2gp in the aRNApipe configuration file (config.py):
```python
path_gtf2gp = '[WORKING_DIRECTORY]/gtfToGenePred/gtfToGenePred'
```
#### TrimGalore and cutadapt
Execute the following commands to install TrimGalore in the working directory:
```
wget http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/trim_galore_v0.4.1.zip
unzip trim_galore_v0.4.1.zip
rm trim_galore_v0.4.1.zip
```
Set path_trimgalore in the aRNApipe configuration file (config.py):
```python
path_trimgalore = '[WORKING_DIRECTORY]/trim_galore_zip/trim_galore'
```
TrimGalore requires cutadapt. To install cutadapt, execute the following commands in the working directory:
```
wget https://pypi.python.org/packages/62/bc/77da8a0f0c162831fdccb89306e65cbe14bab7eb72c150afb8e197fa262f/cutadapt-1.8.1.tar.gz
tar -xvf cutadapt-1.8.1.tar.gz
cd cutadapt-1.8.1
python setup.py build
python setup.py install
```
The cutadapt binary is installed in the system's Python bin folder. Find the path and set path_cutadapt in the aRNApipe configuration file (config.py). Example:
```python
path_cutadapt = '[PYTHON_INSTALL_DIR]/python2.7/bin/cutadapt'
```
#### jSplice
Execute the following commands to install jSplice in the working directory:
```
mkdir jSplice-1.0.1
cd jSplice-1.0.1
wget https://www.ethz.ch/content/dam/ethz/special-interest/biol/mhs-biol/molecular-health-sciences-platform-dam/documents/research/krek/jSplice-1.0.1.zip
unzip jSplice-1.0.1.zip
rm jSplice-1.0.1.zip
```
Set path_jsplice in the aRNApipe configuration file (config.py):
```python
path_jsplice = '[WORKING_DIRECTORY]/jSplice-1.0.1/'
```
#### GATK
GATK is only available to registered users. Go to https://software.broadinstitute.org/gatk/download/ and sign in. Once signed in, go to https://software.broadinstitute.org/gatk/download/auth?package=GATK-archive&version=3.5-0-g36282e4 and download GATK. Execute the following command to extract the GATK jar file:
```
tar -xvf GenomeAnalysisTK-3.5-0-g36282e4.tar.bz2
```
Set path_gatk in the aRNApipe configuration file (config.py):
```python
path_gatk = '[WORKING_DIRECTORY]/GenomeAnalysisTK.jar'
```
#### HTSeq
Execute the following commands to install HTSeq in the working directory:
```
wget https://pypi.python.org/packages/4d/3a/c7e0bf6b9061c85c6e5163ea21a01e95ed76a8f2d79d22e16cf78a74ca8f/HTSeq-0.5.3.tar.gz
tar -xvf HTSeq-0.5.3.tar.gz
cd HTSeq-0.5.3
python setup.py build
python setup.py install
```
Set path_htseq in the aRNApipe configuration file (config.py):
```python
path_htseq = '[WORKING_DIRECTORY]/HTSeq-0.5.3/build/scripts-2.7/htseq-count'
```
#### STAR
Execute the following commands to install STAR in the working directory:
```
wget https://github.com/alexdobin/STAR/archive/2.5.2b.zip
unzip 2.5.2b.zip
```
The unzipped directory already contains precompiled binaries of STAR. Set path_star in the aRNApipe configuration file (config.py):
```python
path_star = '[WORKING_DIRECTORY]/STAR-2.5.2b/bin/Linux_x86_64_static/STAR'
```
#### STAR-Fusion
Execute the following commands to install STAR-Fusion in the working directory:
```
wget https://github.com/STAR-Fusion/STAR-Fusion/releases/download/v0.8.0/STAR-Fusion_v0.8.FULL.tar.gz
tar -xvf STAR-Fusion_v0.8.FULL.tar.gz
```
Set path_starfusion in the aRNApipe configuration file (config.py):
```python
path_starfusion = '[WORKING_DIRECTORY]/STAR-Fusion_v0.8/STAR-Fusion'
```
Check the STAR-Fusion dependencies at https://github.com/STAR-Fusion/STAR-Fusion/wiki.