Genomes to Natural Products Network (GNPN):
Stanford Genome Technology Center


 

Natural Product Gene Cluster Genome Query Tool

 

Description

Computer-aided identification of natural product gene clusters is made possible by the increasing availability of full genome sequences. Here we present NPGCquery, a software package for locating genomic loci based on expert-defined gene function/protein domain criteria. The package contains two major components.

a) A genome query component that takes a user-defined query and returns a list of candidate gene clusters;

b) An annotation component that colors the genes in the clusters based on the gene functions.

 

Combined, these components allow experts to quickly locate candidate natural product gene clusters of interest.

 

 

Tutorial

 

Prerequisites

The package is written in R and depends on a few external R packages. When using the package for the first time, follow the error messages from R and install the missing packages one by one.
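Checking for missing packages can be scripted. The sketch below is not part of genome.query.R: missing_packages is a hypothetical helper, and the package names passed to it are placeholders, since the actual dependencies are the ones R reports when the package is sourced.

```r
# Hypothetical helper (not part of genome.query.R): return the names of
# packages in `pkgs` that cannot be loaded in the current R installation.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder names; replace them with the packages named in R's error messages.
needed <- c("stats", "utils")
to_install <- missing_packages(needed)
if (length(to_install) > 0) install.packages(to_install)
```

Running the check before source('genome.query.R') surfaces all missing packages in one pass instead of one error at a time.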

 

Input files

Two input files are required: a genome annotation file in GFF format and a protein annotation TSV file produced by InterProScan. Details on the GFF file format are available here, and the InterProScan software is available here.

 

See Trichoderma_virens.ASM17099v1.23.gff3 for an example of GFF input file. See Trichoderma_virens.ASM17099v1.23.pep.all.fa.tsv for an example of InterProScan result file.
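To inspect the two inputs before running a query, both files can be read as plain tab-separated text in base R. This is a minimal sketch, not how genome.query.R parses them: read_tab_lines and gff_attribute are hypothetical helpers, the column names are labels chosen here to follow the GFF3 and InterProScan TSV column order, and the example lines are made up.

```r
# Labels for the nine GFF3 columns and the (up to) 15 InterProScan TSV columns.
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")
ipr_cols <- c("protein", "md5", "length", "analysis", "signature",
              "signature_desc", "start", "stop", "score", "status",
              "date", "interpro", "interpro_desc", "go", "pathways")

# Hypothetical helper: parse tab-separated lines into a data frame.
# fill = TRUE pads rows that have fewer fields than column names.
read_tab_lines <- function(lines, col.names, comment.char = "") {
  read.delim(text = lines, header = FALSE, quote = "", fill = TRUE,
             comment.char = comment.char, col.names = col.names,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: extract one key (e.g. "ID") from GFF3 attribute strings.
gff_attribute <- function(attributes, key) {
  value <- sub(paste0("^(?:.*;)?", key, "=([^;]*).*$"), "\\1",
               attributes, perl = TRUE)
  value[!grepl(paste0("(^|;)", key, "="), attributes)] <- NA_character_
  value
}

# Made-up example lines; for real files use readLines('<file name>').
gff_lines <- c("##gff-version 3",
               "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
               "chr1\tdemo\ttranscript\t100\t900\t.\t+\t.\tID=t1;Parent=gene1")
ipr_lines <- "t1\tabc123\t350\tPfam\tPF00067\tCytochrome P450\t10\t340\t1.0E-50\tT\t01-01-2015\tIPR001128\tCytochrome P450"

gff <- read_tab_lines(gff_lines, gff_cols, comment.char = "#")
ipr <- read_tab_lines(ipr_lines, ipr_cols)
transcripts <- gff[gff$type == "transcript", ]
transcript_ids <- gff_attribute(transcripts$attributes, "ID")
```

Comparing transcript_ids with the first column of the InterProScan table is a quick way to confirm that the two files refer to the same protein identifiers.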

 

Example Code

# Load the package
setwd('NPGC_Query')
source('genome.query.R') # contains the NPGC.query code and required functions;
                         # install any required packages that R reports as missing

# step 1: prepare the genome annotation GFF file and the iprscan protein annotation file

# step 2: identify candidate gene cluster loci in the genome that satisfy user-defined queries
gene.ranges = NPGC.query(gff.file = 'Trichoderma_virens.ASM17099v1.23.gff3', # GFF file for existing genome annotation
                         iprscan.tab.file = 'Trichoderma_virens.ASM17099v1.23.pep.all.fa.tsv', # protein function
                         # annotation file produced by iprscan
                         query = list(func = list('oxidoreductase|P450|oxidase|dehydrogenase|oxygenase|reductase', 'O-methyltransferase'),
                                      # the types of gene functions the genomic loci should have or not have;
                                      # alternative functions, i.e. function A OR function B, are listed together with
                                      # | as separator
                                      freq = list(2:15, 1:15)),
                                      # frequency with which each specified function should appear:
                                      # 0 - should not appear, 1 or higher - appear the specified number of times;
                                      # a frequency range is specified by m:n, e.g. 1:5 means 1 to 5 times
                         window.size = 15, # number of neighboring genes to examine
                         out.file = 'Tvirens_relaxed.xlsx', # output file name
                         window.extend = 5, # extra genes to include on both sides of the identified loci
                         gene.definition = 'transcript',  # what defines a gene in the GFF file?
                         proteinID = 'ID') # which ID in the GFF file is used as the protein ID for proteins that appear
                         # in the iprscan result file

# step 3: color-code the genes in the resulting Excel file based on gene function
xlsx.color.NPGC('Tvirens_relaxed.xlsx')

# read the output file "readme_deepAnno.txt" for color code instructions
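The meaning of the func and freq query entries can be illustrated with a small sketch. This is not the code in genome.query.R: window_matches and scan_windows are hypothetical helpers, the annotations are made up, and both the case-insensitive matching and the simple sliding-window scan are assumptions made for illustration.

```r
# Hypothetical helper: does a set of gene annotations satisfy the query?
# Each pattern in `func` is counted across the annotations, and the count
# must fall within the corresponding entry of `freq`.
window_matches <- function(annotations, func, freq) {
  counts <- vapply(func, function(p) {
    sum(grepl(p, annotations, ignore.case = TRUE))
  }, integer(1))
  all(mapply(function(n, allowed) n %in% allowed, counts, freq))
}

# Hypothetical helper: slide a window of `window.size` genes along the
# annotations and return the start index of every window that matches.
scan_windows <- function(annotations, func, freq, window.size) {
  n <- length(annotations)
  if (n == 0) return(integer(0))
  starts <- seq_len(max(1, n - window.size + 1))
  hits <- vapply(starts, function(s) {
    idx <- s:min(n, s + window.size - 1)
    window_matches(annotations[idx], func, freq)
  }, logical(1))
  starts[hits]
}

# Same query as in the tutorial; the annotations below are invented.
func <- list('oxidoreductase|P450|oxidase|dehydrogenase|oxygenase|reductase',
             'O-methyltransferase')
freq <- list(2:15, 1:15)
annotations <- c("Hypothetical protein", "Cytochrome P450",
                 "O-methyltransferase", "Short-chain dehydrogenase",
                 "Hypothetical protein", "Hypothetical protein")

hits <- scan_windows(annotations, func, freq, window.size = 3)
# hits is 2: only the window covering genes 2 to 4 holds two
# oxidoreductase-like genes and one O-methyltransferase
```

Setting a frequency to 0, e.g. freq = list(0, 1:15), would instead require that no gene in the window matches the first pattern.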

 

Output files

Three output files will be produced.

1. An Excel file listing the identified candidate gene clusters and the surrounding genes near the loci. See Tvirens_relaxed.xlsx for an example. The file contains the following columns.

 

chr:             Chromosome ID
gene:            Gene ID
protein.ID:      Protein IDs
Existing.Anno:   Existing annotations of the gene, extracted from the GFF input file
[Variable number of columns]:   Each column gives the occurrence of one of the functions listed in the query
[Variable number of columns]
domains:         Protein annotation extracted from the InterProScan input file
clusterID:       ID assigned to the gene cluster. Nearby genes are included in the table but are not assigned a clusterID.
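Once the output table is loaded into R (reading the .xlsx file needs an Excel reader package and is not shown), the clusterID column can be used to group genes by cluster. The sketch below works on a made-up data frame that carries only a few of the documented columns, and it assumes that genes outside a cluster have a missing or empty clusterID.

```r
# Toy table with a subset of the documented columns; all values are invented.
clusters <- data.frame(
  chr = c("1", "1", "1", "1"),
  gene = c("g1", "g2", "g3", "g4"),
  clusterID = c(NA, "c1", "c1", NA),
  stringsAsFactors = FALSE
)

# Assumption: flanking genes have a missing (NA) or empty clusterID.
in_cluster <- !is.na(clusters$clusterID) & clusters$clusterID != ""

# List of gene names per cluster, dropping the flanking genes.
genes_by_cluster <- split(clusters$gene[in_cluster],
                          clusters$clusterID[in_cluster])
```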

 

2. A second Excel file with the same content as output 1, in which genes are highlighted with colors indicating secondary metabolism related functions. See colored_Tvirens_relaxed.xlsx for an example.

[Figure: colored_Tvirens_relaxed.xlsx.pdf]

 

3. A text file with details on the color code used for the visual annotation of the gene clusters. See readme_deepAnno.txt for an example.

 

 

Download

Tutorial code:         examples_genome_query.R
Tutorial input:        Batch2.5G9.txt
Tutorial output:       readme_deepAnno.txt, Tvirens_relaxed.xlsx, and colored_Tvirens_relaxed.xlsx
Package source code:   genome.query.R
Download as ZIP archive containing all files: NPGC_Query.zip

 

 

Contact

 


 



Inquiries can be addressed to Maureen Hillenmeyer (maureenh at stanford.edu) and Angela Chu (amchu at stanford.edu)
Stanford Genome Technology Center