[[TOC(heading=User Manual)]]
= DOTS-Finder User Manual

DOTS-Finder is a free program to detect driver genes in tumor genomes. It is licensed under the GNU GPLv3+. See the COPYING.txt file for additional information.

For a full explanation of DOTS-Finder usage and analysis, please download the '''[/raw-attachment/wiki/projects/DOTS-Finder/Documentation/User%20Manual%20-%20DOTS-Finder.pdf User Manual]''' and '''[/raw-attachment/wiki/projects/DOTS-Finder/Documentation/Input-Output%20Formats.pdf Input-Output Formats]'''.

\\
== Installation

DOTS-Finder works on UNIX-based systems. We tested the installation on Linux Debian 3.2.46-1+deb7u1, Scientific Linux 6.5 and Darwin Kernel Version 11.4.2 and 13.0.0. For large files it can be extremely memory-consuming, especially if you parallelize the R calculations. We strongly suggest running DOTS-Finder on a server cluster and setting the number of cores according to the available memory.

DOTS-Finder requires a Python virtual environment, which is very easy to create via [http://www.virtualenv.org/en/latest/ virtualenv]. You don't need to install virtualenv yourself: DOTS-Finder does it for you!

Before installing DOTS-Finder you need '''python''' (tested on version 2.7) and '''R''' (any version >2) with the '''multicore library''' already at your disposal. All other packages and libraries will be installed by DOTS-Finder.
For MacOS users, the Xcode Command Line Developer Tools are also required.

1) Uncompress DOTSF.tgz

2) Create a virtualenv with python2.7 in a <dotsfinder_python_home> folder of your choice and activate it:
{{{
#!sh
/usr/bin/python virtualenv.py -p python2.7 <dotsfinder_python_home>

. <dotsfinder_python_home>/bin/activate
}}}

3) Install DOTS-Finder from inside the uncompressed folder by typing:
{{{
#!sh
pip install .
}}}
DOTS-Finder will be installed inside your python virtualenv along with all the libraries ([http://www.numpy.org/ numpy], [http://pythonhosted.org/pybedtools/ pybedtools], [http://cython.org/ cython]) and third-party tools ([http://bedtools.readthedocs.org/en/latest/ bedtools], [http://genome.sph.umich.edu/wiki/LiftOver liftOver]).

4) Every time you want to use DOTS-Finder from the command line, remember to activate the virtualenv first. If you want to use it inside a shell script, you have to change the PATH variable. First, find the path (with the virtualenv activated):
{{{
#!sh
echo $PATH
}}}
Then, inside your shell script, set the PATH variable to that value before launching dots_finder:
{{{
#!sh
PATH=new-path
dots_finder -i input -o output
}}}

Another easy option is to change the shebang line from:
{{{
#!/usr/bin/env python
}}}
into:
{{{
#!<dotsfinder_python_home>/bin/python
}}}
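
If DOTS-Finder is launched from a Python script instead of a shell script, the same PATH adjustment can be sketched as follows. This is a minimal sketch and not part of DOTS-Finder: the helper names `virtualenv_env` and `run_dots_finder` are hypothetical, and only the `-i` and `-o` options shown above are used.

```python
import os
import subprocess


def virtualenv_env(venv_home, base_env=None):
    """Return a copy of the environment with <venv_home>/bin first on PATH."""
    # Hypothetical helper: mimics what activating the virtualenv does to PATH.
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(venv_home, "bin")
    old_path = env.get("PATH", "")
    env["PATH"] = bin_dir + (os.pathsep + old_path if old_path else "")
    return env


def run_dots_finder(venv_home, input_file, output_folder):
    """Launch dots_finder with the virtualenv on PATH and return its exit code."""
    # Hypothetical wrapper: the executable is looked up through the modified PATH.
    cmd = ["dots_finder", "-i", input_file, "-o", output_folder]
    return subprocess.call(cmd, env=virtualenv_env(venv_home))
```

Calling `run_dots_finder("<dotsfinder_python_home>", "input", "output")` would then behave like the shell script above, without touching the PATH of the calling process.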

=== Simple Example invocation
Inside the sample-files folder you can find two example files: PROVA.marf (500 lines of a MAF file in MARF format) and LAML_build36.marf (an acute myeloid leukemia MAF file in MARF format, in hg18 coordinates).

{{{
#!sh
dots_finder -i DOTSF/sample-files/PROVA.marf -o <output_folder> -n pfx
}}}

{{{
#!sh
df_liftover -i DOTSF/sample-files/LAML_build36.marf -o <output_folder> -n pfx
}}}

For an explanation of the available commands:

{{{
#!sh
dots_finder --help
}}}

\\
== Usage

=== Requirements to run the analysis

 - MAF file
 - [wiki:./CustomMafSubset MARF file] (see User Manual or Input-Output Formats for specifications)

=== Scripts description

 - df_liftover (optional if the files are already in Build 37)
 - dots_finder

\\
== Supported input formats

DOTS-Finder supports various input formats:

 * TCGA standard [https://wiki.nci.nih.gov/x/xYGDBw MAF 2.3], [https://wiki.nci.nih.gov/x/eJaPAQ MAF 2.4]
 * [wiki:./CustomMafSubset MARF format]. A custom 13-column MAF subset, described in Input-Output Formats.
 * CSV files as produced by [http://www.openbioinformatics.org/annovar/ Annovar], which can be easily converted to MARF (see Input-Output Formats)

\\
== Easy Analysis Reading

DOTS-Finder creates five tab-delimited txt files in the output folder. The easiest way to get a first look at the results is to open Pfx_OncoGene_Driver.txt and Pfx_TSG_Driver.txt. The last column, "Global_P_Value", is an FDR-corrected summary of the whole DOTS-Finder procedure. We consider genes with a Global_P_Value <= 0.1 to be drivers.
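
The Global_P_Value <= 0.1 cut-off can be applied with a few lines of Python. This is a minimal sketch, assuming the driver files start with a header row that contains a column named Global_P_Value. The function name `filter_drivers` is hypothetical, and no other column name is assumed.

```python
import csv


def filter_drivers(lines, threshold=0.1, p_column="Global_P_Value"):
    """Return the rows (as dictionaries) whose p_column value is <= threshold.

    `lines` is any iterable of tab-delimited text lines, such as an open file.
    Assumption: the first line is a header that names the columns.
    """
    drivers = []
    for row in csv.DictReader(lines, delimiter="\t"):
        try:
            p_value = float(row[p_column])
        except (TypeError, ValueError):
            continue  # skip rows without a numeric value (e.g. "NA")
        if p_value <= threshold:
            drivers.append(row)
    return drivers


# Usage (file name taken from this section; adjust the path to your output folder):
# with open("Pfx_OncoGene_Driver.txt") as handle:
#     drivers = filter_drivers(handle)
```

A missing Global_P_Value column raises a `KeyError` instead of silently returning an empty list, so a wrong file or header is noticed immediately.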

To interpret the rest of the output, please take a look at the Input-Output Formats guide.

\\
== Content of DOTS-Finder

DOTS-Finder ships with a number of resources that it needs to run:

'''final_ref_sort.bed'''

A BED file with chromosome, start, end and [http://www.genenames.org/ HUGO Gene Name] based on the [http://www.ncbi.nlm.nih.gov/refseq/ RefSeq] database (find it [source:main/projects/DOTS-Finder/trunk/files/final_ref_sort.bed here]).

'''final_knowngene_sort_converted.bed'''

A BED file with chromosome, start, end, transcript name in [http://genome.ucsc.edu/ UCSC nomenclature] and its translation to a gene name (find it [source:main/projects/DOTS-Finder/trunk/files/final_knowngene_sort_converted.bed here]).

'''codon_dictionary.pkl'''

Python dictionary that gives, for each gene, the expected number of non-synonymous mutations over total mutations caused by a single nucleotide variation (SNV: A>C, G>T etc.). This table derives from the specific codon structure of each gene from [http://www.ncbi.nlm.nih.gov/genbank/ NCBI Genbank] via the [http://www.kazusa.or.jp/codon/ Kazusa website].

'''domains_dictionary.pkl'''

Python dictionary that reports the superfamily domain structure taken from [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml Conserved Domains Database] and the [http://www.uniprot.org/ Uniprot] number of amino acids for each gene.

'''length_dictionary.pkl'''

Python dictionary that reports the length of each gene in bp, computed as the minimum set of exons that comprises all the known transcripts in refGene.

'''hugo_prot_prod_dictionary.pkl'''

Python dictionary that reports the names of the genes that lead to protein products. DOTS-Finder automatically discards all the genes that don't have this property.

'''hugo_prot_prod_dictionary_oldsymbol.pkl'''

This python dictionary reports all the possible aliases of the official names of the genes with a protein product.

'''MAdb.Rdata and other Rdata files'''

Various databases in an easy-to-read format for R. All these resources are based on [http://mutationassessor.org MutationAssessor].
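
The `.pkl` resources listed above can be inspected with Python's standard `pickle` module. This is a minimal sketch for a first look at their content: the helper names `load_resource` and `summarize` are hypothetical, and the internal layout of each dictionary is not assumed (see the User Manual for details). Since DOTS-Finder was tested on python 2.7, it is safest to run this inside the DOTS-Finder virtualenv; loading the files from Python 3 may require extra arguments to `pickle.load`.

```python
import pickle


def load_resource(path):
    """Load one of the .pkl resources and return the unpickled object."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(resource, n=3):
    """Return the type name, the number of entries and the first n sorted keys."""
    # Assumption: the resource is dictionary-like, as stated in this section.
    return type(resource).__name__, len(resource), sorted(resource)[:n]


# Usage (replace the placeholder with the real location of the file):
# lengths = load_resource("<path-to>/length_dictionary.pkl")
# print(summarize(lengths))
```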

\\
== Output Format

For a full explanation of the output tables created by DOTS-Finder, see [/raw-attachment/wiki/projects/DOTS-Finder/Documentation/Input-Output%20Formats.pdf Input-Output Formats].

\\
== Memory usage and computing time reference

The table below shows the computing time and total memory used by DOTS-Finder in a 4-core analysis. Both depend largely on the file size, but also on the number of genes that are mutated in that particular tumor (in other words, on the tumor-specific mutation rate).
Use this information as a reference to choose a number of cores that suits your machine. If the program crashes, use fewer cores; if it is too slow, use more.

||= '''Tumor''' =||= '''File Size''' =||= '''Patients''' =||= '''Genes Mutated''' =||= '''Number of Mutations''' =||= '''Mutations per Patient''' =||= '''Mutations per Gene''' =||= '''Cores''' =||= '''Max Vmem (GB)''' =||= '''System Time (h:mm:ss)''' =||
||=''brca''=||1.5M||771||13987||41947||54.40596628||2.998999071||4||12.75||11:39:38||
||=''cesc''=||298K||39||5686||9138||234.3076923||1.607105171||4||6.889||1:13:16||
||=''coadread''=||2.6M||224||15726||82216||367.0357143||5.228030014||4||14.978||17:46:12||
||=''gbm''=||738K||291||9534||21332||73.30584192||2.237465911||4||19.107||3:02:09||
||=''kirc''=||976K||417||11907||27524||66.00479616||2.311581423||4||16.824||6:38:19||
||= ''kirp'' =||271K||100||5229||7342||73.42||1.404092561||4||6.451||1:09:14||
||=''laml''=||93K||196||1675||2311||11.79081633||1.379701493||4||12.565||0:10:27||
||=''lgg''=||737K||170||10899||23598||138.8117647||2.165152766||4||16.268||4:13:36||
||=''luad''=||3.0M||230||16135||94481||410.7869565||5.855655407||4||18.506||11:28:46||
||=''ov''=||667K||316||9450||18576||58.78481013||1.965714286||4||9.008||3:42:54||
||=''paad''=||217K||34||3555||5895||173.3823529||1.658227848||4||4.163||0:37:20||
||=''prad''=||176K||83||3594||4766||57.42168675||1.326099054||4||4.173||0:37:59||
||=''skcm''=||5.5M||253||17395||181175||716.1067194||10.41534924||4||18.758||20:05:15||
||=''stad''=||2.7M||151||16552||86461||572.589404||5.223598357||4||18.681||11:37:03||
||=''thca''=||278K||323||4887||7279||22.53560372||1.489461838||4||4.472||0:59:11||

The errors we typically encounter during the R calculations look like "Unable to fork". They are caused by a lack of memory in the multicore process. Consider reducing the number of cores if the program crashes.
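
As a rough starting point, the table can be turned into a first guess for the number of cores. The sketch below is only a heuristic and not part of DOTS-Finder: it assumes that peak memory grows linearly with the number of cores, which this manual does not state, and the function name `suggest_cores` is hypothetical. The reference value should come from the table row closest to your own data.

```python
def suggest_cores(available_gb, reference_gb, reference_cores=4, max_cores=None):
    """Guess a core count from available memory and a reference run.

    ASSUMPTION (not from the manual): peak memory scales linearly with the
    number of cores, so one core needs reference_gb / reference_cores.
    """
    per_core_gb = reference_gb / float(reference_cores)
    cores = max(1, int(available_gb // per_core_gb))  # always use at least one core
    if max_cores is not None:
        cores = min(cores, max_cores)  # never exceed the cores the machine has
    return cores


# Example with the skcm row of the table (18.758 GB with 4 cores):
# suggest_cores(32, 18.758) suggests 6 cores on a machine with 32 GB free.
```

If the run still fails with "Unable to fork", lower the value further. Use `dots_finder --help` to see how the number of cores is set.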

\\
== User Support

For questions, please contact giorgio.melloni[at]iit.it.