DOTS-Finder User Manual

DOTS-Finder is a free program to detect driver genes in tumor genomes. It is licensed under the GNU GPLv3+. See the COPYING.txt file for additional information.

For a full explanation of DOTS-Finder usage and analysis, please download the User Manual and the Input-Output Formats guide.

Installation

DOTS-Finder works on UNIX-based systems. We tested the installation on Linux Debian 3.2.46-1+deb7u1, Scientific Linux 6.5 and Darwin Kernel Version 11.4.2 and 13.0.0. For large files it can be extremely memory consuming, especially if you parallelize the R calculations. We strongly suggest running DOTS-Finder on a server cluster and setting the number of cores according to the available memory.

DOTS-Finder requires a Python virtual environment, which is very easy to create with virtualenv. You don't need to install virtualenv yourself: DOTS-Finder does it for you!

Before installing DOTS-Finder you need Python (tested on version 2.7) and R (any version >2) with the multicore library already available. All other packages and libraries will be installed by DOTS-Finder. For MacOS users, the Xcode Command Line Developer Tools are also required.

1) Uncompress DOTSF.tgz

2) Create a virtualenv with python2.7 in a folder of your choice, <dotsfinder_python_home>, and activate it:

/usr/bin/python virtualenv.py -p python2.7 <dotsfinder_python_home>

. <dotsfinder_python_home>/bin/activate

3) Install DOTS-Finder from inside the uncompressed folder by typing:

pip install .

DOTS-Finder will be installed inside your Python virtualenv along with all the required libraries (numpy, pybedtools, cython) and third-party tools (bedtools, liftOver).

4) Every time you want to use DOTS-Finder from the command line, remember to activate the virtualenv first. If you want to use it inside a shell script, you have to change the PATH variable. First, find the path (with the virtualenv activated):

echo $PATH

Then, inside your shell script, set the PATH variable to that value before launching dots_finder:

PATH=new-path
dots_finder -i input -o output
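
The same PATH change can also be made from a Python driver script instead of a shell script. The following is only a minimal sketch and is not part of DOTS-Finder: the helper names build_env and run_dots_finder are hypothetical, and it assumes that pip placed the dots_finder executable in <dotsfinder_python_home>/bin, next to the activate script used above.

```python
import os
import subprocess


def build_env(venv_home, base_env=None):
    """Return a copy of the environment with <venv_home>/bin first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    venv_bin = os.path.join(venv_home, "bin")
    old_path = env.get("PATH", "")
    env["PATH"] = venv_bin + os.pathsep + old_path if old_path else venv_bin
    return env


def run_dots_finder(venv_home, input_file, output_folder, prefix):
    """Launch dots_finder with the -i, -o and -n options shown in this manual.

    Assumption: the executable lives in <venv_home>/bin.
    Returns the exit code of the process.
    """
    exe = os.path.join(venv_home, "bin", "dots_finder")
    cmd = [exe, "-i", input_file, "-o", output_folder, "-n", prefix]
    return subprocess.call(cmd, env=build_env(venv_home))
```

Putting the virtualenv folder first on PATH means that a script starting with "#!/usr/bin/env python" resolves to the virtualenv interpreter, which is the effect that activating the virtualenv has in an interactive shell.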

Another easy option is to change the shebang interpreter line from:

#!/usr/bin/env python

into:

#!<dotsfinder_python_home>/bin/python

Simple Example invocation

Inside the sample-files folder you can find two example files: PROVA.marf (500 lines of a MAF file in MARF format) and LAML_build36.marf (an acute myeloid leukemia MAF file in MARF format, in hg18 coordinates).

dots_finder -i DOTSF/sample-files/PROVA.marf -o <output_folder> -n pfx

df_liftover -i DOTSF/sample-files/LAML_build36.marf -o <output_folder> -n pfx

For an explanation of the available commands:

dots_finder --help

Usage

Requirements to run the analysis

MAF file
MARF file (see User Manual or Input-Output Formats for specifications)

Scripts description

df_liftover (optional if the files are already in Build 37)
dots_finder

Supported input formats

DOTS-Finder supports various input formats:

TCGA standard MAF 2.3, MAF 2.4
MARF format. A custom, 13-column subset of MAF, described in Input-Output Formats.
CSV files as produced by annovar can be easily converted to MARF (see Input-Output Formats)

Easy Analysis Reading

DOTS-Finder will create 5 tab-delimited txt files in the output folder. The easiest way to get a first glance at the results is to look at Pfx_OncoGene_Driver.txt and Pfx_TSG_Driver.txt. In the last column you will find "Global_P_Value", an FDR-corrected summary of the whole DOTS-Finder procedure. We consider as drivers the genes with a Global_P_Value <= 0.1.

For the remaining output files, please take a look at the Input-Output Formats guide.
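
The Global_P_Value <= 0.1 rule can be applied to a driver table with a few lines of Python. This is a minimal sketch under stated assumptions: it assumes the file starts with a header row that contains a column named Global_P_Value, and the function name read_drivers, the "Gene" column and the GENE_A/GENE_B/GENE_C rows are made up for the demonstration. Check the Input-Output Formats guide for the real column names.

```python
import csv
import io


def read_drivers(handle, threshold=0.1):
    """Return the rows whose Global_P_Value is <= threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    drivers = []
    for row in reader:
        try:
            p_value = float(row["Global_P_Value"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        if p_value <= threshold:
            drivers.append(row)
    return drivers


# Made-up demonstration table; "Gene" is a placeholder column name.
demo = io.StringIO(
    "Gene\tGlobal_P_Value\n"
    "GENE_A\t0.03\n"
    "GENE_B\t0.40\n"
    "GENE_C\t0.1\n"
)
print([row["Gene"] for row in read_drivers(demo)])
```

For a real run, pass an open file instead of the in-memory table, for example open("<output_folder>/Pfx_TSG_Driver.txt", newline="").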

Content of DOTS-Finder

DOTS-Finder ships with a number of useful resources that it needs to work:

final_ref_sort.bed

A bed file with chromosome, start, end and HUGO Gene Name based on RefSeq database

final_knowngene_sort_converted.bed

A bed file with chromosome, start, end, transcript name in UCSC nomenclature and the corresponding gene name

codon_dictionary.pkl

Python dictionary that gives, for each gene, the expected ratio of non-synonymous mutations to total mutations caused by each single nucleotide variation (SNV: A>C, G>T, etc.). This table is derived from the specific codon structure of each gene, taken from NCBI GenBank via the Kazusa website.

domains_dictionary.pkl

Python dictionary that reports the superfamily domain structure taken from Conserved Domains Database and the Uniprot number of amino acids for each gene.

length_dictionary.pkl

Python dictionary that reports the length of each gene in bp, computed on the minimum set of exons that comprises all the known transcripts in refGene

hugo_prot_prod_dictionary.pkl

Python dictionary that reports the names of the genes that lead to protein products. DOTS-Finder automatically discards all the genes that don't have this property.

hugo_prot_prod_dictionary_oldsymbol.pkl

This python dictionary reports all the possible aliases of the official names of the genes with a protein product.

MAdb.Rdata and other Rdata files

Various databases in an easy-to-read format for R. All these resources are based on MutationAssessor
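
The .pkl resources listed above can be inspected directly. This manual does not document their internal layout, so the sketch below only loads a file and previews a few entries. The names load_resource and preview are hypothetical, and the code assumes that each file is an ordinary pickled Python dictionary and that you run it under Python 3, where pickles written by Python 2.7 may need the latin1 fallback. Only unpickle files that you trust, such as the ones shipped with DOTS-Finder.

```python
import pickle


def load_resource(path):
    """Load one of the shipped .pkl files and return the unpickled object."""
    with open(path, "rb") as handle:
        try:
            return pickle.load(handle)
        except UnicodeDecodeError:
            # Pickles written by Python 2 may contain byte strings that
            # Python 3 cannot decode as ASCII; retry with latin1.
            handle.seek(0)
            return pickle.load(handle, encoding="latin1")


def preview(mapping, n=3):
    """Return the first n (key, value) pairs to reveal the layout."""
    return list(mapping.items())[:n]
```

For example, preview(load_resource("length_dictionary.pkl")) shows which keys the dictionary uses and what kind of value is stored for each of them (adjust the path to wherever the file sits in your installation).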

Output Format

For a full explanation of the output tables created by DOTS-Finder, see Input-Output Formats.

Memory usage and computing time reference

The table below shows the computing time and peak memory usage of DOTS-Finder for an analysis on 4 cores. Both depend largely on the file size, but also on the number of genes that are mutated in that particular tumor (in other words, on the tumor-specific mutation rate). Use this information as a reference when choosing the number of cores for your machine. If the program crashes, use fewer cores; if it is too slow, use more.

Tumor	File Size	Patients	Genes Mutated	Number of Mutations	Mutations per Patient	Mutations per Gene	Cores	Max Vmem (GB)	System Time (h:mm:ss)
brca	1.5M	771	13987	41947	54.40596628	2.998999071	4	12.75	11:39:38
cesc	298K	39	5686	9138	234.3076923	1.607105171	4	6.889	1:13:16
coadread	2.6M	224	15726	82216	367.0357143	5.228030014	4	14.978	17:46:12
gbm	738K	291	9534	21332	73.30584192	2.237465911	4	19.107	3:02:09
kirc	976K	417	11907	27524	66.00479616	2.311581423	4	16.824	6:38:19
kirp	271K	100	5229	7342	73.42	1.404092561	4	6.451	1:09:14
laml	93K	196	1675	2311	11.79081633	1.379701493	4	12.565	0:10:27
lgg	737K	170	10899	23598	138.8117647	2.165152766	4	16.268	4:13:36
luad	3.0M	230	16135	94481	410.7869565	5.855655407	4	18.506	11:28:46
ov	667K	316	9450	18576	58.78481013	1.965714286	4	9.008	3:42:54
paad	217K	34	3555	5895	173.3823529	1.658227848	4	4.163	0:37:20
prad	176K	83	3594	4766	57.42168675	1.326099054	4	4.173	0:37:59
skcm	5.5M	253	17395	181175	716.1067194	10.41534924	4	18.758	20:05:15
stad	2.7M	151	16552	86461	572.589404	5.223598357	4	18.681	11:37:03
thca	278K	323	4887	7279	22.53560372	1.489461838	4	4.472	0:59:11

The errors we typically encounter during R calculations look like "Unable to fork". They are caused by a lack of memory in the multicore process. Consider reducing the number of cores if the program crashes.
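
One rough way to turn the table into a starting number of cores is sketched below. It rests on an assumption that this manual does not state and that we have not verified: that peak memory grows roughly linearly with the number of cores, so that one core costs about a quarter of the Max Vmem measured with 4 cores. The function name suggest_cores is hypothetical; treat the result as a first guess and lower it if "Unable to fork" errors appear.

```python
import math


def suggest_cores(available_gb, ref_vmem_gb, ref_cores=4):
    """Estimate how many cores fit into available_gb of memory.

    Assumption (unverified): peak memory scales linearly with the
    number of cores, so one core costs ref_vmem_gb / ref_cores.
    """
    if available_gb <= 0 or ref_vmem_gb <= 0 or ref_cores <= 0:
        raise ValueError("all arguments must be positive")
    per_core_gb = ref_vmem_gb / float(ref_cores)
    return max(1, int(math.floor(available_gb / per_core_gb)))


# skcm row of the table: Max Vmem 18.758 with 4 cores.
# Suppose the machine has 16 GB available for the analysis.
print(suggest_cores(16, 18.758))
```

Pick the reference row whose file size and number of mutations are closest to your own data set, since memory usage depends on both.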

Last modified on Dec 23, 2014 12:11:30 PM
