Genomes to Natural Products Network (GNPN):
Stanford Genome Technology Center


CDS Codon Optimization and Quality Control

 

 

Description

In a synthetic biology pipeline, we synthesize all CDS sequences, assemble them with promoters and terminators, and then express them in a heterologous host, here yeast. Once a set of proteins of interest and their CDS sequences has been chosen, a few in silico steps are still needed to ensure successful DNA synthesis and protein expression in the heterologous host. These typically include the following.

a) Codon optimization: replacing the rarest codons with common ones;

b) Removing undesired restriction sites for smooth downstream cloning;

c) Removing simple polymers (homopolymer runs) and repeat sequences for successful DNA synthesis;

d) Adjusting the AT or CG content for successful DNA synthesis.

 

The CDS-QC software package provides all of these capabilities in one place.
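To make checks (b) to (d) concrete, here is a minimal base-R sketch of how a single sequence could be screened before optimization. The helper names (count_motif, longest_run, gc_percent) and the toy sequence are illustrative assumptions; they are not functions from codon.optimizer.R and may differ from what the package does internally.

```r
# Illustrative QC helpers in base R (hypothetical names, not part of CDS-QC).

# Count occurrences of a fixed motif, including overlapping ones,
# by comparing every window of the motif's length against the motif.
count_motif <- function(dna, motif) {
  k <- nchar(motif)
  n <- nchar(dna)
  if (n < k) return(0L)
  starts <- seq_len(n - k + 1)
  sum(substring(dna, starts, starts + k - 1) == motif)
}

# Length of the longest run of a single repeated base (homopolymer).
longest_run <- function(dna) {
  bases <- strsplit(dna, "")[[1]]
  max(rle(bases)$lengths)
}

# Percentage of bases that are G or C.
gc_percent <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  100 * mean(bases %in% c("G", "C"))
}

# Toy CDS: ATG GGT CTC AAA AAA AAA TAA (contains one BsaI site and a polyA run).
toy <- "ATGGGTCTCAAAAAAAAATAA"
bsai_hits <- count_motif(toy, "GGTCTC") + count_motif(toy, "GAGACC")
poly_len  <- longest_run(toy)
gc        <- gc_percent(toy)
```

A sequence that reports restriction-site hits, long runs, or an extreme CG percentage here is a candidate for the corresponding options of codon.optimizer shown in the tutorial.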

 

 

Tutorial

 

Prerequisites

The package is written in R and depends on a few external R packages. The first time you use it, R will report any that are missing; install them one at a time as the error messages name them.
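If you prefer to install several missing packages in one step, a small helper such as the following can be used. This is only a sketch: install_if_missing is a hypothetical name, the package names must be taken from the error messages R gives you, and packages distributed outside CRAN (for example through Bioconductor) would need their own installer instead of install.packages.

```r
# Hypothetical helper: install only those packages that are not yet available.
install_if_missing <- function(pkgs) {
  # requireNamespace() returns FALSE (quietly) when a package is not installed.
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Usage: pass the package names reported in the error messages.
# "stats" is used here only because it ships with every R installation.
not_found <- install_if_missing(c("stats"))
```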

 

Input files

Input CDS sequences should be provided in the following tab-delimited format, one CDS per line.

CDS_ID1        CDS_sequence1

CDS_ID2        CDS_sequence2

CDS_ID3        CDS_sequence3

 

Example input file Batch2.5G9.txt is included in the CDS_QC folder.
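Before running the optimizer, it can help to confirm that an input file really has this two-column layout. The sketch below reads such a file with base R and flags common problems. The function name read_cds_table and the specific checks (only A/C/G/T, length divisible by 3, unique IDs) are assumptions about what a well-formed CDS list looks like, not rules documented by the package.

```r
# Hypothetical pre-flight check for a tab-delimited CDS list (ID <tab> sequence).
read_cds_table <- function(path) {
  cds <- read.delim(path, header = FALSE,
                    col.names = c("id", "sequence"),
                    colClasses = "character")
  cds$sequence     <- toupper(cds$sequence)
  cds$valid_bases  <- grepl("^[ACGT]+$", cds$sequence)   # no ambiguous symbols
  cds$whole_codons <- nchar(cds$sequence) %% 3 == 0       # complete codons only
  cds$unique_id    <- !duplicated(cds$id)
  cds
}

# Usage with a small temporary file; the second record is deliberately broken.
tmp <- tempfile(fileext = ".txt")
writeLines(c("CDS_ID1\tATGGCTTAA",
             "CDS_ID2\tATGGCNTA"), tmp)
checked <- read_cds_table(tmp)
problems <- checked[!(checked$valid_bases & checked$whole_codons & checked$unique_id), ]
```

Rows that end up in problems are worth inspecting by hand before they are passed to codon.optimizer.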

 

Example Code

# Load the package

setwd('CDS_QC')

source('codon.optimizer.R') # contains the NPGC.query code and the required functions;

                            # install any required packages if you get an error message

 

 

# example 1: optimize the 6 rarest codon types and remove the BsaI and AarI restriction sites,

# while maintaining the protein sequences.

codon.optimizer(CDS.list.file='Batch2.5G9.txt', # input file containing a list of CDS sequences

                N.codon.types.to.change = 6, # number of rarest codon types to optimize

                genetic.table=1, # genetic code table used for translation

                host.species='4932',  # NCBI taxonomy ID of the yeast host

                left.extra='GATCAGCGGCCGC', # protector sequence added to the 5' end of the CDS sequences

                right.extra='CCCGGGAACAC', # protector sequence added to the 3' end of the CDS sequences

                restriction.sites = c(BsaI='GGTCTC', BsaI.rc='GAGACC', # remove BsaI restriction sites

                                      # by providing the target sequence and its reverse

                                      # complement

                                      AarI='CACCTGC', AarI.rc = 'GCAGGTG')) # remove AarI restriction

                                      # sites by providing the target sequence and its reverse

                                      # complement

                                      

 

# example 2: in addition to the task in example 1, also remove simple polyA/polyC/polyG/polyT sequences

# and remove repeat sequences with the unit CCAGAG, while maintaining the protein sequences.

# Note that repeats can be detected with RepeatMasker.

codon.optimizer(CDS.list.file='Batch2.5G9.txt',

                N.codon.types.to.change = 6,

                genetic.table=1,

                host.species='4932',

                left.extra='GATCAGCGGCCGC',

                right.extra='CCCGGGAACAC',

                restriction.sites = c(BsaI='GGTCTC', BsaI.rc='GAGACC',

                                      AarI='CACCTGC', AarI.rc = 'GCAGGTG',

                                      polyA8='AAAAAAAA', polyC8 = 'CCCCCCCC', # remove simple polymers by

                                      # providing polymer sequences; any polymer of this length or

                                      # longer is removed

                                      polyG5='GGGGG', polyT8 = 'TTTTTTTT'), # polyG uses a shorter

                                      # threshold (5) than the other polymers (8)

                repeats = c(rep.CCAGAG='CCAGAG'), # removing the repeat sequences by providing the

                                                  # repeating unit

                tag = 'BsaIAarIPolyRepRemoved') # give a unique label to this optimization setting

 

 

# example 3: in addition to example 1, also increase the CG content of the CDS sequences by removing AT-only sequences.

codon.optimizer(CDS.list.file='Batch2.5G9.txt',

                N.codon.types.to.change = 6,

                genetic.table=1,

                host.species='4932',

                left.extra='GATCAGCGGCCGC',

                right.extra='CCCGGGAACAC',

                restriction.sites = c(BsaI='GGTCTC', BsaI.rc='GAGACC',

                                      AarI='CACCTGC', AarI.rc = 'GCAGGTG',

                                      polyAT4='WWWW'), # W is the degenerate symbol for A or T;

                                                       # here we remove all 4-mers consisting only of A/T

                tag = 'BsaIAarI_CGplus') # give a unique label to this optimization setting
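The restriction.sites argument in the examples above lists each target together with its reverse complement, and example 3 uses the degenerate symbol W. The following base-R sketch shows how such entries could be derived and interpreted. The helper names (reverse_complement, add_reverse_complements, iupac_to_regex) are hypothetical, and how codon.optimizer itself expands degenerate symbols is an assumption here, not something taken from its source.

```r
# Reverse complement of a plain A/C/G/T sequence (degenerate symbols not handled).
reverse_complement <- function(dna) {
  complemented <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

# Append a '<name>.rc' entry for every site; palindromic sites are not duplicated.
add_reverse_complements <- function(sites) {
  rc <- vapply(sites, reverse_complement, character(1))
  names(rc) <- paste0(names(sites), ".rc")
  c(sites, rc[!(rc %in% sites)])
}

# Translate IUPAC nucleotide symbols into a regular expression.
iupac_to_regex <- function(pattern) {
  codes <- c(A = "A", C = "C", G = "G", T = "T",
             W = "[AT]", S = "[CG]", R = "[AG]", Y = "[CT]",
             K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
             H = "[ACT]", V = "[ACG]", N = "[ACGT]")
  symbols <- strsplit(toupper(pattern), "")[[1]]
  stopifnot(all(symbols %in% names(codes)))
  paste(codes[symbols], collapse = "")
}

sites    <- add_reverse_complements(c(BsaI = "GGTCTC", AarI = "CACCTGC"))
at_regex <- iupac_to_regex("WWWW")
has_at4  <- grepl(at_regex, c("GCATTAGC", "GCATGAGC"))
```

Generating the reverse complements programmatically avoids typing them by hand, and the regular expression makes explicit that 'WWWW' stands for every 4-mer built only from A and T.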

 

 

Output files

Two output files will be produced.

1. A text output file provides the modified sequences in the same format as the input file. See new_Batch2.5G9.txt for an example.

2. An Excel output file summarizes the codon optimization and QC for each sequence, with the following columns.

Name             CDS name in the input file
CDS              CDS sequence in the input file
newCDS           Optimized and QCed CDS sequence
Nchanged         Number of nucleotides changed
Nchanged%        Percentage of nucleotides changed
CG%_old          CG content of the input sequence
CG%_new          CG content of the optimized sequence
CAI_old          Codon adaptation index of the input CDS sequence
CAI_new          Codon adaptation index of the modified CDS sequence
sites removed    Summary of the number of sites removed for each input sequence
CDSwExtra        Optimized and QCed sequence plus the protector sequences on the 5' and 3' ends
Length           Length of the resulting sequence

 

See optimized_N6_gt1_h4932_Batch2.xls for an example.
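As a cross-check on the summary columns, the change and CG statistics can be recomputed from an old/new sequence pair in a few lines of base R. This is a sketch under assumptions: summarize_change is a hypothetical helper, and it assumes that Nchanged% is the number of changed nucleotides divided by the CDS length, which the package may define differently.

```r
# Hypothetical re-computation of Nchanged, Nchanged%, CG%_old and CG%_new.
summarize_change <- function(old_cds, new_cds) {
  # Synonymous codon changes never alter the length of the CDS.
  stopifnot(nchar(old_cds) == nchar(new_cds))
  old <- strsplit(toupper(old_cds), "")[[1]]
  new <- strsplit(toupper(new_cds), "")[[1]]
  gc  <- function(bases) 100 * mean(bases %in% c("C", "G"))
  n_changed <- sum(old != new)
  c(Nchanged    = n_changed,
    `Nchanged%` = 100 * n_changed / length(old),
    `CG%_old`   = gc(old),
    `CG%_new`   = gc(new))
}

# Toy pair: AAA -> AAG (Lys) and TTT -> TTC (Phe), protein unchanged.
stats <- summarize_change("ATGAAATTTTAA", "ATGAAGTTCTAA")
```

Comparing these values against the corresponding columns of the Excel output is a quick way to confirm that a run did what was expected.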

 

 

Download

Example code:                         examples_codon_optimze.R

Example input:                        Batch2.5G9.txt

Example output:         new_Batch2.5G9.txt, optimized_N6_gt1_h4932_Batch2.xls, optimizedBsaIAarIPolyRepRemoved_N6_gt1_h4932_Batch2.xls, and optimizedBsaIAarI_CGplus_N6_gt1_h4932_Batch2.xls

Package source code:   codon.optimizer.R

Download as ZIP archive containing all files: CDS_QC.zip

 

 

Contact

 

 

 



 



Inquiries can be addressed to Maureen Hillenmeyer (maureenh at stanford.edu) and Angela Chu (amchu at stanford.edu)