DOTS-Finder User Manual

DOTS-Finder is a free program to detect driver genes in tumor genomes. It is licensed under the GNU GPLv3+. See the COPYING.txt file for additional information.

For a full explanation of DOTS-Finder usage and analysis, please download the User Manual and the Input-Output Formats guide.


Installation

DOTS-Finder works on UNIX-based systems. We tested the installation on Linux Debian 3.2.46-1+deb7u1, Scientific Linux 6.5 and Darwin Kernel Version 11.4.2 and 13.0.0. With large files it can consume a great deal of memory, especially if you parallelize the R calculations. We strongly suggest running DOTS-Finder on a server cluster and setting the number of cores according to the available memory.

DOTS-Finder requires a python virtual environment, which is easy to create with virtualenv. You don't need to install virtualenv yourself: DOTS-Finder does it for you.

Before installing DOTS-Finder you need python (tested on version 2.7) and R (any version >2) with the multicore library already available. All other packages and libraries will be installed by DOTS-Finder. For MacOS users, the Xcode Command Line Developer Tools are also required.

1) Uncompress DOTSF.tgz

2) Create a virtualenv with python2.7 in a <dotsfinder_python_home> folder of your choice and activate it:

/usr/bin/python virtualenv.py -p python2.7 <dotsfinder_python_home>

. <dotsfinder_python_home>/bin/activate

3) Install DOTS-Finder from inside the uncompressed folder by typing:

pip install .

DOTS-Finder will be installed inside your python virtualenv along with all the required libraries (numpy, pybedtools, cython) and third-party tools (bedtools, liftOver).

4) Every time you want to use DOTS-Finder from the command line, remember to activate the virtualenv first. If you want to use it inside a shell script, you have to change the PATH variable instead. First, find the path (with the virtualenv activated):

echo $PATH

Then, inside your shell script, set the PATH variable to that value before launching dots_finder:

PATH=new-path
dots_finder -i input -o output

Another easy option is to change the shebang interpreter from:

#!/usr/bin/env python

into:

#!<dotsfinder_python_home>/bin/python
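
The PATH change described in step 4 can also be made from a Python driver script instead of a shell script. The following is only a minimal sketch: the helper names env_with_virtualenv and run_dots_finder are hypothetical, and it assumes that dots_finder is installed in the bin folder of the virtualenv and accepts the -i, -o and -n options shown in this manual.

```python
import os
import subprocess


def env_with_virtualenv(venv_home, base_env=None):
    """Return a copy of the environment with <venv_home>/bin first on PATH.

    venv_home is the <dotsfinder_python_home> folder chosen at installation.
    """
    env = dict(os.environ if base_env is None else base_env)
    venv_bin = os.path.join(venv_home, "bin")
    old_path = env.get("PATH", "")
    # Prepend the virtualenv so its executables are found before system ones.
    env["PATH"] = venv_bin + (os.pathsep + old_path if old_path else "")
    return env


def run_dots_finder(venv_home, input_file, output_dir, prefix):
    """Launch dots_finder and return its exit code (assumed: 0 on success)."""
    cmd = ["dots_finder", "-i", input_file, "-o", output_dir, "-n", prefix]
    return subprocess.call(cmd, env=env_with_virtualenv(venv_home))
```

Only the child process sees the modified PATH, so the calling script and the rest of the shell session are left untouched.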

Simple Example invocation

Inside the sample-files folder you can find two example files: PROVA.marf (500 lines of a maf file in marf format) and LAML_build36.marf (an acute myeloid leukemia maf file in marf format, in hg18 coordinates).

dots_finder -i DOTSF/sample-files/PROVA.marf -o <output_folder> -n pfx
df_liftover -i DOTSF/sample-files/LAML_build36.marf -o <output_folder> -n pfx

For an explanation of the available options:

dots_finder --help


Usage

Requirements to run the analysis

  • MAF file
  • MARF file (see User Manual or Input-Output Formats for specifications)

Scripts description

  • df_liftover (optional if the files are already in Build 37)
  • dots_finder


Supported input formats

DOTS-Finder supports various input formats:

  • TCGA standard MAF 2.3, MAF 2.4
  • MARF format, a custom 13-column subset of MAF described in Input-Output Formats
  • CSV files as produced by ANNOVAR, which can be easily converted to MARF (see Input-Output Formats)


Easy Analysis Reading

DOTS-Finder will create 5 tab-delimited txt files in the output folder. The easiest way to get a first glance at the results is to look at Pfx_OncoGene_Driver.txt and Pfx_TSG_Driver.txt. In the last column you will find "Global_P_Value", an FDR-corrected summary of the whole DOTS-Finder procedure. We consider as drivers the genes with a Global_P_Value <= 0.1.

To understand the rest of the output, please take a look at the Input-Output Formats guide.
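
As an illustration of the threshold above, the following sketch lists the rows of a driver file whose Global_P_Value is at most 0.1. It is not part of DOTS-Finder: the function name read_drivers is hypothetical, and the code assumes that the file starts with a header row containing a column named exactly Global_P_Value. Check the Input-Output Formats guide for the actual layout.

```python
import csv


def read_drivers(path, threshold=0.1, pvalue_column="Global_P_Value"):
    """Return the rows of a tab-delimited driver file that pass the threshold.

    Assumes a header row; each returned row is a dict keyed by column name.
    """
    drivers = []
    with open(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                pvalue = float(row[pvalue_column])
            except (TypeError, ValueError):
                # Skip rows whose p-value is missing or not numeric.
                continue
            if pvalue <= threshold:
                drivers.append(row)
    return drivers
```

For example, read_drivers("<output_folder>/Pfx_TSG_Driver.txt") would return the candidate tumor suppressor rows, with the prefix and folder replaced by the ones used in your run.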


Content of DOTS-Finder

DOTS-Finder ships with a number of resources that it needs to work:

final_ref_sort.bed

A bed file with chromosome, start, end and HUGO Gene Name, based on the RefSeq database

final_knowngene_sort_converted.bed

A bed file with chromosome, start, end, transcript name in UCSC nomenclature and its translation to a gene name

codon_dictionary.pkl

Python dictionary that gives, for each gene, the expected ratio of non-synonymous mutations to total mutations caused by each single nucleotide variation (SNV: A>C, G>T, etc.). The table is derived from the specific codon structure of each gene, taken from NCBI GenBank via the Kazusa website.

domains_dictionary.pkl

Python dictionary that reports, for each gene, the superfamily domain structure taken from the Conserved Domains Database and the UniProt number of amino acids.

length_dictionary.pkl

Python dictionary that reports the length of each gene in bp, computed over the minimum set of exons that covers all the known transcripts in refGene.

hugo_prot_prod_dictionary.pkl

Python dictionary that reports the names of the genes that lead to protein products. DOTS-Finder automatically discards all the genes that don't have this property.

hugo_prot_prod_dictionary_oldsymbol.pkl

Python dictionary that reports all the possible aliases of the official names of the genes with a protein product.

MAdb.Rdata and other Rdata files

Various databases in an easy-to-read format for R. All these resources are based on MutationAssessor.
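
The .pkl resources listed above are ordinary Python pickles, so they can be inspected with the standard library. The sketch below is an assumption-laden illustration: the helper names load_resource and peek are hypothetical, the location of the files inside your installation is not specified here, and the structure of the keys and values should be checked rather than assumed.

```python
import pickle


def load_resource(path):
    """Load one of the .pkl resources and return the unpickled object.

    Only unpickle files that come from a trusted DOTS-Finder distribution.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def peek(mapping, n=3):
    """Return up to n (key, value) pairs, to inspect the dictionary layout."""
    return list(mapping.items())[:n]
```

Because DOTS-Finder was tested on python 2.7, reading these files from Python 3 may require passing encoding="latin1" to pickle.load.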


Output Format

For a full explanation of output tables created by DOTS-Finder, see Input-Output Formats


Memory usage and computing time reference

The table below shows the computing time and maximum memory used by DOTS-Finder in a 4-core analysis. Both depend largely on the file size, but also on the number of genes that are mutated in that particular tumor (in other words, on the tumor-specific mutation rate). Use this information as a reference when choosing the number of cores for your machine. If the program crashes, use fewer cores; if it is too slow, use more.

Tumor     File Size  Patients  Genes Mutated  Mutations  Mutations per Patient  Mutations per Gene  Cores  Max Vmem (GB)  System Time (h:mm:ss)
brca      1.5M       771       13987          41947      54.40596628            2.998999071         4      12.751         1:39:38
cesc      298K       39        5686           9138       234.3076923            1.607105171         4      6.889          1:13:16
coadread  2.6M       224       15726          82216      367.0357143            5.228030014         4      14.978         17:46:12
gbm       738K       291       9534           21332      73.30584192            2.237465911         4      19.107         3:02:09
kirc      976K       417       11907          27524      66.00479616            2.311581423         4      16.824         6:38:19
kirp      271K       100       5229           7342       73.42                  1.404092561         4      6.451          1:09:14
laml      93K        196       1675           2311       11.79081633            1.379701493         4      12.565         0:10:27
lgg       737K       170       10899          23598      138.8117647            2.165152766         4      16.268         4:13:36
luad      3.0M       230       16135          94481      410.7869565            5.855655407         4      18.506         11:28:46
ov        667K       316       9450           18576      58.78481013            1.965714286         4      9.008          3:42:54
paad      217K       34        3555           5895       173.3823529            1.658227848         4      4.163          0:37:20
prad      176K       83        3594           4766       57.42168675            1.326099054         4      4.173          0:37:59
skcm      5.5M       253       17395          181175     716.1067194            10.41534924         4      18.758         20:05:15
stad      2.7M       151       16552          86461      572.589404             5.223598357         4      18.681         11:37:03
thca      278K       323       4887           7279       22.53560372            1.489461838         4      4.472          0:59:11

During the R calculations we typically encounter errors such as "Unable to fork". These are caused by a lack of memory in the multicore process. If the program crashes, consider reducing the number of cores.
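
One rough way to turn the reference table into a starting number of cores is sketched below. The function suggest_cores is hypothetical and rests on a strong assumption that this manual does not state: that memory use grows roughly linearly with the number of cores, so that a 4-core run using a given Max Vmem needs about a quarter of it per core. Treat the result only as a first guess and adjust it if the program crashes or is too slow.

```python
def suggest_cores(available_gb, reference_vmem_gb, reference_cores=4,
                  max_cores=None):
    """Estimate how many cores fit in the available memory.

    reference_vmem_gb is the Max Vmem of a comparable tumor in the table,
    measured with reference_cores cores. Linear scaling is an assumption.
    """
    per_core_gb = reference_vmem_gb / float(reference_cores)
    cores = int(available_gb // per_core_gb)
    cores = max(1, cores)  # always run with at least one core
    if max_cores is not None:
        cores = min(cores, max_cores)
    return cores
```

For instance, taking the luad run as a reference (18.506 GB with 4 cores), a machine with 16 GB of free memory gives suggest_cores(16, 18.506), which evaluates to 3.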


Last modified on Dec 23, 2014 12:11:30 PM