DOTS-Finder User Manual
DOTS-Finder is a free program to detect driver genes in tumor genomes. It is licensed under GNU GPLv3+. See the COPYING.txt file for additional information.
For a full explanation of DOTS-Finder usage and analysis, please download the User Manual and Input-Output Formats
Installation
DOTS-Finder works on UNIX-based systems. We tested the installation on Linux Debian 3.2.46-1+deb7u1, Scientific Linux 6.5 and Darwin Kernel Version 11.4.2 and 13.0.0. For large files it can be extremely memory-consuming, especially if you parallelize the R calculations. We strongly suggest running DOTS-Finder on a server cluster and setting the number of cores according to the available memory.
DOTS-Finder requires a python virtual environment, which is very easy to create via virtualenv. You don't need to install virtualenv yourself: DOTS-Finder does it for you!
Before installing DOTS-Finder you need python (tested on version 2.7) and R (any version >2) with the multicore library already available. All other packages and libraries will be installed by DOTS-Finder. For MacOS users, Xcode Command Line Developer Tools is also required.
1) Uncompress DOTSF.tgz
2) Create a virtualenv with python2.7 in a folder of your choice (<dotsfinder_python_home>) and activate it:
/usr/bin/python virtualenv.py -p python2.7 <dotsfinder_python_home>
. <dotsfinder_python_home>/bin/activate
3) Install DOTS-Finder from inside the uncompressed folder by typing:
pip install .
DOTS-Finder will be installed inside your python virtualenv along with all the libraries (numpy, pybedtools, cython) and third-party tools (bedtools, liftOver).
4) Every time you want to use DOTS-Finder, remember to activate the virtualenv (if you use it from the command line). If you want to use it inside a shell script, you have to change the PATH variable. First, find the path (with the virtualenv activated):
echo $PATH
Then, inside your shell script, set the PATH variable to that value before launching dots_finder:
PATH=new-path dots_finder -i input -o output
Another easy option is to change the shebang interpreter from:
#!/usr/bin/env python
into:
#!<dotsfinder_python_home>/bin/python
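If you launch DOTS-Finder from a Python script rather than a shell script, the same PATH idea from step 4 can be sketched as follows. This is only an illustration: the helper names `venv_environment` and `run_dots_finder` are hypothetical and not part of DOTS-Finder, and the sketch assumes that the virtualenv keeps its executables in `<dotsfinder_python_home>/bin`, as a standard virtualenv does.

```python
import os
import subprocess


def venv_environment(venv_home, base_env=None):
    """Return a copy of the environment with <venv_home>/bin first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    venv_bin = os.path.join(venv_home, "bin")
    # Keep the previous PATH (if any) after the virtualenv's bin folder.
    parts = [venv_bin] + ([env["PATH"]] if env.get("PATH") else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


def run_dots_finder(venv_home, input_file, output_folder, prefix):
    """Call dots_finder with the -i, -o and -n options shown in this README."""
    command = ["dots_finder", "-i", input_file, "-o", output_folder, "-n", prefix]
    # The modified environment applies to this child process only.
    return subprocess.call(command, env=venv_environment(venv_home))
```

The environment is changed only for the launched process, so the rest of your script keeps its original PATH.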
Simple Example invocation
Inside the sample-files folder you can find two example files: PROVA.marf (500 lines of a maf file in marf format) and LAML_build36.marf (an acute myeloid leukemia maf file in marf format under hg18 coordinates).
dots_finder -i DOTSF/sample-files/PROVA.marf -o <output_folder> -n pfx
df_liftover -i DOTSF/sample-files/LAML_build36.marf -o <output_folder> -n pfx
For an explanation of the available commands:
dots_finder --help
Usage
Requirements to run the analysis
- MAF file
- MARF file (see User Manual or Input-Output Formats for specifications)
Scripts description
- df_liftover (optional if the files are already in Build 37)
- dots_finder
Supported input formats
DOTS-Finder supports various input formats:
- TCGA standard MAF 2.3, MAF 2.4
- MARF format, a custom 13-column subset of MAF described in Input-Output Formats.
- CSV files as produced by annovar, which can easily be converted to MARF (see Input-Output Formats)
Easy Analysis Reading
DOTS-Finder will create 5 tab-delimited txt files in the output folder. The easiest way to get a first glance at the results is to look at Pfx_OncoGene_Driver.txt and Pfx_TSG_Driver.txt. In the last column you will find "Global_P_Value", an FDR-corrected summary of all the procedures of DOTS-Finder. We consider genes with a Global_P_Value <= 0.1 to be drivers.
To interpret the rest of the output, please take a look at the Input-Output Formats guide.
Content of DOTS-Finder
DOTS-Finder ships a number of useful resources, needed for its work:
final_ref_sort.bed
A bed file with chromosome, start, end and HUGO Gene Name based on RefSeq database
final_knowngene_sort_converted.bed
A bed file with chromosome, start, end, transcript name in UCSC nomenclature and translation in gene name
codon_dictionary.pkl
Python dictionary that gives, for each gene, the expected number of non-synonymous mutations over total mutations caused by a single nucleotide variation (SNV: A>C, G>T etc.). This table derives from the specific codon structure of each gene from NCBI Genbank via the Kazusa website.
domains_dictionary.pkl
Python dictionary that reports the superfamily domain structure taken from Conserved Domains Database and the Uniprot number of amino acids for each gene.
length_dictionary.pkl
Python dictionary that reports the length of the gene in bp as the minimum set of exons that comprises all the known transcripts in refGene
hugo_prot_prod_dictionary.pkl
Python dictionary that reports the gene names of the genes that lead to protein products. DOTS-Finder automatically discards all genes that don't have this property.
hugo_prot_prod_dictionary_oldsymbol.pkl
This python dictionary reports all the possible aliases of the official names of the genes with a protein product.
MAdb.Rdata and other Rdata files
Various databases in an easy-to-read format for R. All these resources are based on MutationAssessor.
Output Format
For a full explanation of output tables created by DOTS-Finder, see Input-Output Formats
Memory usage and computing time reference
The table below shows the computing time and total memory used by DOTS-Finder in a 4-core analysis. Both depend largely on the file size, but also on the number of genes that are mutated in that particular tumor (in other words, on the tumor-specific mutation rate). Use this information as a reference when choosing the number of cores for your machine. If the program crashes, use fewer cores; if it is too slow, use more.
Tumor | File Size | Patients | Genes Mutated | Number of Mutations | Mutations per Patient | Mutations per Gene | Cores | Max Vmem (GB) | System Time (h:mm:ss) |
---|---|---|---|---|---|---|---|---|---|
brca | 1.5M | 771 | 13987 | 41947 | 54.40596628 | 2.998999071 | 4 | 12.75 | 11:39:38 |
cesc | 298K | 39 | 5686 | 9138 | 234.3076923 | 1.607105171 | 4 | 6.889 | 1:13:16 |
coadread | 2.6M | 224 | 15726 | 82216 | 367.0357143 | 5.228030014 | 4 | 14.978 | 17:46:12 |
gbm | 738K | 291 | 9534 | 21332 | 73.30584192 | 2.237465911 | 4 | 19.107 | 3:02:09 |
kirc | 976K | 417 | 11907 | 27524 | 66.00479616 | 2.311581423 | 4 | 16.824 | 6:38:19 |
kirp | 271K | 100 | 5229 | 7342 | 73.42 | 1.404092561 | 4 | 6.451 | 1:09:14 |
laml | 93K | 196 | 1675 | 2311 | 11.79081633 | 1.379701493 | 4 | 12.565 | 0:10:27 |
lgg | 737K | 170 | 10899 | 23598 | 138.8117647 | 2.165152766 | 4 | 16.268 | 4:13:36 |
luad | 3.0M | 230 | 16135 | 94481 | 410.7869565 | 5.855655407 | 4 | 18.506 | 11:28:46 |
ov | 667K | 316 | 9450 | 18576 | 58.78481013 | 1.965714286 | 4 | 9.008 | 3:42:54 |
paad | 217K | 34 | 3555 | 5895 | 173.3823529 | 1.658227848 | 4 | 4.163 | 0:37:20 |
prad | 176K | 83 | 3594 | 4766 | 57.42168675 | 1.326099054 | 4 | 4.173 | 0:37:59 |
skcm | 5.5M | 253 | 17395 | 181175 | 716.1067194 | 10.41534924 | 4 | 18.758 | 20:05:15 |
stad | 2.7M | 151 | 16552 | 86461 | 572.589404 | 5.223598357 | 4 | 18.681 | 11:37:03 |
thca | 278K | 323 | 4887 | 7279 | 22.53560372 | 1.489461838 | 4 | 4.472 | 0:59:11 |
Errors such as "Unable to fork" typically occur during the R calculations. They are caused by a lack of memory in the multicore process. Consider reducing the number of cores if the program crashes.
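The derived columns of the table above are plain ratios, which the following sketch reproduces for the brca row. It also converts the run time to decimal hours, assuming the time column is written as hours:minutes:seconds. The function names are hypothetical and only serve the illustration.

```python
def per_unit(total, count):
    """Return total divided by count as a float."""
    return float(total) / count


def duration_to_hours(text):
    """Convert an h:mm:ss string to decimal hours."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours + minutes / 60.0 + seconds / 3600.0


# Values copied from the brca row of the table.
patients, genes_mutated, mutations = 771, 13987, 41947

mutations_per_patient = per_unit(mutations, patients)    # table: 54.40596628
mutations_per_gene = per_unit(mutations, genes_mutated)  # table: 2.998999071
run_hours = duration_to_hours("11:39:38")                # decimal hours
```

Computing the same ratios for your own input file lets you compare it with the closest row of the table before choosing the number of cores.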