Genomes to Natural Products Network (GNPN):
Stanford Genome Technology Center


Natural Product Gene Cluster Genome Query Tool



Computer-aided identification of natural product gene clusters has become possible with the increasing availability of full genome sequences. Here we present NPGCquery, a software package for locating genomic loci based on expert-defined gene function and protein domain criteria. The package contains two major components.

a) A genome query component that takes a user-defined query and returns a list of candidate gene clusters;

b) An annotation component that colors the genes in the clusters based on the gene functions.


Combining these, the package allows experts to quickly locate candidate natural product gene clusters of interest.






The package is written in R and depends on a few external packages. When the package is used for the first time, users should follow the error messages from R and install the missing packages one by one.
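As a convenience, the missing packages can be checked in one pass instead of one error message at a time. The sketch below is only an illustration: the helper name missing_packages is hypothetical, and the vector needed holds placeholder names, because the actual dependency list of genome.query.R is not given here. Replace the placeholders with the package names that R reports.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Placeholder names only; substitute the packages named in R's error messages.
needed <- c("stats", "utils")

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

Because the check uses requireNamespace, it does not attach any package; source('genome.query.R') still loads whatever it needs.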


Input files

Two input files are required: a genome annotation file in GFF format and a protein annotation TSV file from InterProScan. Details on the GFF file format are available here, and the InterProScan software is available here.


See Trichoderma_virens.ASM17099v1.23.gff3 for an example GFF input file and Trichoderma_virens.ASM17099v1.23.pep.all.fa.tsv for an example InterProScan result file.
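To inspect the two inputs before running a query, they can be read with base R. The following is a minimal sketch, not the parser used inside genome.query.R. It assumes the nine-column GFF3 layout and the InterProScan 5 TSV column order. The helper names (read_gff, read_iprscan, get_attr) and the tiny inline records are made up for illustration.

```r
# Field names from the GFF3 specification (nine tab-separated columns).
gff_cols <- c("seqid", "source", "type", "start", "end",
              "score", "strand", "phase", "attributes")

# Field names assumed from the InterProScan 5 TSV layout; trailing columns
# are optional, so short rows are padded (fill = TRUE).
ipr_cols <- c("protein", "md5", "length", "analysis", "signature_acc",
              "signature_desc", "start", "stop", "score", "status", "date",
              "interpro_acc", "interpro_desc", "go_terms", "pathways")

# Both readers pass `...` to read.delim, so they accept a file path
# or inline text via `text =`.
read_gff <- function(...) {
  read.delim(..., header = FALSE, comment.char = "#", quote = "",
             col.names = gff_cols, stringsAsFactors = FALSE)
}

read_iprscan <- function(...) {
  read.delim(..., header = FALSE, quote = "", fill = TRUE,
             col.names = ipr_cols, stringsAsFactors = FALSE)
}

# Extract one key (e.g. "ID") from the GFF attributes column; NA if absent.
get_attr <- function(attributes, key) {
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  m <- regmatches(attributes, regexec(pattern, attributes))
  vapply(m, function(x) if (length(x) == 3) x[3] else NA_character_,
         character(1))
}

# Made-up records standing in for the real input files.
gff_text <- paste(
  "##gff-version 3",
  paste("scaffold_1", "demo", "gene", 100, 900, ".", "+", ".", "ID=G1", sep = "\t"),
  paste("scaffold_1", "demo", "transcript", 100, 900, ".", "+", ".", "ID=T1;Parent=G1", sep = "\t"),
  paste("scaffold_1", "demo", "transcript", 1200, 2000, ".", "-", ".", "ID=T2;Parent=G2", sep = "\t"),
  sep = "\n")

ipr_text <- paste(
  paste("T1", "md5a", 350, "Pfam", "PF00067", "Cytochrome P450", 10, 300,
        "1.2E-50", "T", "01-01-2015", "IPR001128", "Cytochrome P450", sep = "\t"),
  paste("T2", "md5b", 200, "Pfam", "PF00891", "O-methyltransferase", 5, 180,
        "3.0E-20", "T", "01-01-2015", sep = "\t"),
  sep = "\n")

gff <- read_gff(text = gff_text)
ipr <- read_iprscan(text = ipr_text)

# Link transcripts to protein annotations through the GFF ID attribute.
gff$ID <- get_attr(gff$attributes, "ID")
transcripts <- gff[gff$type == "transcript", ]
annotated <- merge(transcripts, ipr, by.x = "ID", by.y = "protein")
print(annotated[, c("ID", "seqid", "signature_desc")])
```

With real data, call read_gff('Trichoderma_virens.ASM17099v1.23.gff3') and read_iprscan('Trichoderma_virens.ASM17099v1.23.pep.all.fa.tsv') instead of passing text. The IDs in the two files may need extra cleanup before they match.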


Example Code

# Load the package
source('genome.query.R') # contains the NPGC.query code and required functions;
                         # install the required packages if you get an error message

# step 1: prepare the genome annotation gff file and the iprscan protein annotation file

# step 2: identify candidate gene cluster loci in the genome that satisfy user-defined queries
gene.ranges = NPGC.query(gff.file = 'Trichoderma_virens.ASM17099v1.23.gff3', # gff file for existing genome annotation
                = 'Trichoderma_virens.ASM17099v1.23.pep.all.fa.tsv', # protein function
                         # annotation file produced by iprscan
                         query = list(func = list('oxidoreductase|P450|oxidase|dehydrogenase|oxygenase|reductase', 'O-methyltransferase'),
                                    # the types of gene functions you would like the genomic loci to have or not have;
                                    # alternative functions, i.e. function A OR function B, are listed together with
                                    # | as separator
                                    freq = list(2:15, 1:15)),
                                    # frequency with which each specified function should appear:
                                    # 0 - should not appear, 1 or higher - appear the specified number of times,
                                    # a frequency range is specified by m:n, e.g. 1:5 means 1 to 5 times
                         window.size = 15, # number of neighboring genes to examine
                         out.file = 'Tvirens_relaxed.xlsx', # output file name
                         window.extend = 5, # extra genes to include on both sides of the identified loci
                         gene.definition = 'transcript',  # what defines a gene in the gff file?
                         proteinID = 'ID') # which ID in the gff file is used as the protein ID for proteins that appear
                         # in the iprscan result file


# step 3: color code the genes in the resulting excel file based on gene functions;
# read the output file "readme_deepAnno.txt" for color code instructions
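The func, freq, window.size and window.extend arguments in step 2 can be read as a sliding-window test over consecutive genes. The sketch below illustrates that reading on a made-up list of gene annotations. It is an assumption about the semantics, not the code in genome.query.R. The helper names (window_matches, scan_windows, extend_window) are hypothetical, and matching is done case-insensitively per gene, which the real tool may or may not do.

```r
# TRUE if every query function occurs in the window an allowed number of times.
# `func` is a list of regular expressions, `freq` a list of allowed counts.
window_matches <- function(annotations, func, freq) {
  counts <- vapply(func,
                   function(p) sum(grepl(p, annotations, ignore.case = TRUE)),
                   integer(1))
  all(mapply(function(n, allowed) n %in% allowed, counts, freq))
}

# Slide a window of `window.size` genes along the annotations and return the
# start index of every window that satisfies the query.
scan_windows <- function(annotations, func, freq, window.size) {
  n <- length(annotations)
  if (n < window.size) return(integer(0))
  starts <- seq_len(n - window.size + 1)
  hits <- vapply(starts, function(s) {
    window_matches(annotations[s:(s + window.size - 1)], func, freq)
  }, logical(1))
  starts[hits]
}

# Gene indices of a hit plus `window.extend` neighbors on each side,
# clipped to the ends of the gene list.
extend_window <- function(start, window.size, window.extend, n) {
  max(1, start - window.extend):min(n, start + window.size - 1 + window.extend)
}

# Made-up annotations for six consecutive genes.
genes <- c("hypothetical protein", "cytochrome P450", "O-methyltransferase",
           "short-chain dehydrogenase", "transporter", "kinase")

func <- list("oxidoreductase|P450|oxidase|dehydrogenase|oxygenase|reductase",
             "O-methyltransferase")
freq <- list(2:15, 1:15)

hit_starts <- scan_windows(genes, func, freq, window.size = 3)
print(hit_starts)
print(extend_window(hit_starts[1], window.size = 3, window.extend = 5,
                    n = length(genes)))
```

In this toy example only the window covering genes 2 to 4 contains at least two oxidoreductase-like genes and at least one O-methyltransferase. A frequency of 0 in freq would express that a function must not appear in the window.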


Output files

Three output files will be produced.

1. An Excel file listing the identified candidate gene clusters and the surrounding genes near the loci. See Tvirens_relaxed.xlsx for an example. The file contains the following columns.



Chromosome ID
Number of nucleotides changed
Protein IDs
Existing annotations of the gene extracted from the GFF input file
[Variable number of columns] Each column gives the occurrence of one of the functions listed in the query
[Variable number of columns] Protein annotation extracted from the InterProScan input file
ID assigned to the gene clusters. Nearby genes are included in the table but are not assigned a clusterID.


2. Another Excel file with the same content as output 1, but with genes highlighted in colors indicating secondary metabolism related functions. See colored_Tvirens_relaxed.xlsx for an example.

[Figure: colored_Tvirens_relaxed.xlsx.pdf]


3. A text file with details on the color code used for the visual annotation of the gene clusters. See readme_deepAnno.txt for an example.




Tutorial code: examples_genome_query.R

Tutorial input: Batch2.5G9.txt

Tutorial output: readme_deepAnno.txt, Tvirens_relaxed.xlsx, and colored_Tvirens_relaxed.xlsx

Package source code: genome.query.R

Download as ZIP archive containing all files:






Inquiries can be addressed to Maureen Hillenmeyer (maureenh at and Angela Chu (amchu at
Stanford Genome Technology Center