Gene Specification Instructions

From BioE80 Boot
Revision as of 12:29, 3 June 2017 by Acjs (talk | contribs) (Revert mistake)


Creating a high-quality specification sheet is critical both to documenting your design and to integrating it with other components as we build our organism.

To create a new gene specification sheet, enter your gene name in the box below and click 'Create page'. Name your gene with the JCVI (MMSYN1) or EcoCyc accession (EGxxxxx) ID. If you have the same gene as someone else (check the allocation spreadsheet to see if anyone else has your ID), then append your SUID. For example, your gene name will look like MMSYN1_xxxx_jstanford.
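The naming rule above can be sketched as a small helper. This is only an illustration: the function name `gene_page_name` is hypothetical, and the ID patterns (four digits after MMSYN1_, five digits after EG) are assumptions inferred from the examples MMSYN1_0669 and EG10883 on this page.

```python
import re
from typing import Optional

# Assumed ID shapes, inferred from the examples on this page.
JCVI_ID = re.compile(r"MMSYN1_\d{4}")
ECOCYC_ID = re.compile(r"EG\d{5}")


def gene_page_name(gene_id: str, suid: Optional[str] = None) -> str:
    """Build the wiki page name; append the SUID only when the gene is shared."""
    if not (JCVI_ID.fullmatch(gene_id) or ECOCYC_ID.fullmatch(gene_id)):
        raise ValueError(f"unrecognized gene ID: {gene_id!r}")
    return f"{gene_id}_{suid}" if suid else gene_id


print(gene_page_name("MMSYN1_0669", suid="jstanford"))  # MMSYN1_0669_jstanford
```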

Part 1: Author Information

Author Name: Your name! Create a link in the form above, and then create a page at that address with your name, SUID, and any other information you want to share.

Part 1: Basic Information


The identifier given to the gene in either the JCVI-Syn3.0 database (JCVI genes) or in EcoCyc (E. coli genes). MMSYN1_xxxx or EGxxxxx, respectively. Use the JCVI ID for your JCVI genes, and the EcoCyc ID for the equivalent E. coli genes that you find (see below).


The gene name, preferably the standard name of the equivalent gene in the E. coli genome. For many genes, this will be annotated in the JCVI database under 'Current annotation'. If it isn't, you will need to do some more searching. You can search UniProt by BLAST (click the 'BLAST' link in the top left of the UniProt website): this will hopefully identify the protein, which you can use to get back to the name of the gene.


The source organism of the gene (JCVI-Syn3.0 or E. coli). This is just the organism the gene came from. If you're working with your JCVI genes, it should be JCVI. If you're working with the E. coli functional analogues, it should be E. coli.

UniProt ID

A link to the UniProt protein database entry for the relevant gene in either Mycoplasma mycoides subsp. capri (the source of the JCVI genes) or E. coli. For example, P0A7Z4. Search UniProt and select the best/closest result. You will notice that some results are linked to 'reviewed' data, while others may not be. Where a reviewed database entry exists, prefer it.

You can search UniProt by a number of means. You can search UniParc (a sequence database) with the JCVI gene ID, and follow the result links to find the UniProt protein from the original M. mycoides subsp. capri organism. You should see an 'entry' in the search results linking the JCVI gene to M. mycoides, with a UniProtKB link on the right hand side. You can also search directly in UniProtKB using the gene name identified above, along with M. mycoides subsp. capri. A third option is to BLAST the amino acid sequence of your gene (from the database, see below) against UniProt: use the BLAST link in the top left.

If you can't find an exact match against M. mycoides, select the best/closest match (for example, against another Mycoplasma).


A description of the gene's function, its role in the cell, and its partners in the cell. Max (!) 500 words with inline references. Keep it short and sweet.

The best way to approach this is to start from some of the details you've uncovered and work your way out. Once you have an E. coli gene name, EcoCyc usually has good summaries on gene function. You can also google your gene and protein names in order to find out more about it. Elaborate on the function: if you have a 'cyclase', what does that mean? Overall, we're focused on function (what does the gene do, or how does it help something happen), rather than molecular details.


You can find the amino acid sequence for your genes in the database at SynCells.


Length: use Benchling or a text editor to get a count of the number of nucleotides in the sequence.
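If you prefer a script to a text editor, the count can be sketched as follows. The helper name `sequence_length` is hypothetical; it assumes the sequence is plain text or FASTA (header lines starting with ">") and ignores whitespace and line breaks.

```python
def sequence_length(text: str) -> int:
    """Count sequence characters, skipping FASTA headers and whitespace."""
    total = 0
    for line in text.splitlines():
        if line.startswith(">"):  # FASTA header line, not part of the sequence
            continue
        total += len("".join(line.split()))
    return total


# Made-up 15-nucleotide example, for illustration only.
print(sequence_length(">demo\nATGAAA GAA\nTGG TAA\n"))  # 15
```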

In order to find a DNA sequence you will need to codon optimize your amino acid sequence: select a codon for each amino acid, using the frequency of codon usage in a native organism to pick which codon to use. In general, this produces a higher level of expression (more protein), because even though multiple codons can code for the same amino acid, the host organism (like E. coli) might strongly prefer one and so be more efficient at translating it. You don't have to do this by hand: tools exist to do it for you (recall problem set 1).

We codon optimize because different organisms use slightly different codon sets. Even though there are a number of different codons for each amino acid, some organisms prefer one codon over the others, and as a result can translate that codon more efficiently. At the molecular level, this is (generally) because they have more tRNA (the molecule which recognizes a codon and allows it to be translated) for one codon than another. Codon optimization looks at an amino acid sequence, and selects codons based on the native usage of the host organism. For example, E. coli encodes leucine (Leu or L) as CTG roughly half of the time, so a codon optimizer will make sure the sequence encodes approximately half of the leucine amino acids as CTG.
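The selection rule described above (pick each codon with a probability matching the host's usage) can be sketched in a few lines. This is a toy illustration, not the algorithm used by any particular tool: the function name `codon_optimize` is hypothetical, and the usage table covers only four amino acids with rough, approximate fractions that should be replaced by a full E. coli codon usage table before any real use.

```python
import random

# Rough, illustrative usage fractions for four amino acids only (assumption).
CODON_USAGE = {
    "M": {"ATG": 1.0},
    "W": {"TGG": 1.0},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "E": {"GAA": 0.69, "GAG": 0.31},
}


def codon_optimize(protein: str, usage=CODON_USAGE, seed: int = 0) -> str:
    """Choose one codon per amino acid, weighted by host codon usage."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    codons = []
    for aa in protein.upper():
        if aa not in usage:
            raise KeyError(f"no codon usage data for amino acid {aa!r}")
        choices = list(usage[aa])
        weights = [usage[aa][codon] for codon in choices]
        codons.append(rng.choices(choices, weights=weights, k=1)[0])
    return "".join(codons)


dna = codon_optimize("MKEW")
print(len(dna))  # 12: three nucleotides per amino acid
```

Weighted sampling reproduces the host's overall codon mix across a long sequence; a simpler alternative is to always pick the single most frequent codon.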

We want to codon optimize both the JCVI-Syn3.0 genes and the E. coli genes for E. coli, because we will be actually producing the proteins in an E. coli-based cell free mixture. We are producing proteins from different organisms, but we want the DNA to be optimized to work well with the E. coli transcription and translation machinery.

You can perform codon optimization using the IDT Codon Optimization Tool from PS 1.

Amino Acids

Length: use Benchling (import protein) or a text editor character count.

The amino acid sequence for your gene is available in the JCVI database (the Excel spreadsheet). Check out column F: on some computers that column is really narrow and makes it hard to see that there is a full sequence there. Note that we also have a single-letter code for protein sequence: each character is an amino acid. See here for the code.
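As a quick sanity check after copying the sequence out of the spreadsheet, the sketch below counts amino acids and flags characters outside the 20 standard single-letter codes. The helper name `protein_length` is hypothetical, and treating a trailing "*" as a stop marker to be dropped is an assumption about how the sequence might be exported.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def protein_length(sequence: str) -> int:
    """Return the number of amino acids, rejecting non-standard letters."""
    cleaned = "".join(sequence.split()).upper().rstrip("*")
    unknown = set(cleaned) - STANDARD_AMINO_ACIDS
    if unknown:
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return len(cleaned)


# Made-up four-residue peptide, for illustration only.
print(protein_length("MKE W\n"))  # 4
```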

Part 1: Function and Homologs

  • Functional Category: A link to the functional category into which this gene falls. For example, Transcription. This category may already be annotated against the gene in the JCVI database, otherwise select it from the List of Functional Categories.
  • Product: The specific protein product of the gene. For example, RNA polymerase subunit alpha.
  • Module: The 'functional module' into which this gene falls. For example, RNA polymerase (the alpha subunit coordinates with the other subunits to form the polymerase).
  • Closest homologous proteins: The top (max three) homologous proteins to this protein, as identified by BLAST searches. For E. coli, this should be the top results excluding the exact E. coli protein that you searched (probably one or more of the top results).
    • Name, Max score/Query Cover/E-Value/Ident, [link Accession]
    • Name, Max score/Query Cover/E-Value/Ident, [link Accession]
    • Name, Max score/Query Cover/E-Value/Ident, [link Accession]
  • Equivalent E. coli / JCVI functional protein: A link to the identified E. coli functional analog of the protein encoded by the gene, or a link back to the JCVI gene matching this E. coli version. For example, EG10883.
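If you export your BLAST results, the homolog lines in the format above can be generated rather than typed by hand. The sketch below is illustrative only: `BlastHit`, `top_homologs`, and `format_hit` are hypothetical names, the field values are made up, and ranking by max score is an assumption about what "top" means here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    name: str
    max_score: float
    query_cover: float  # percent
    e_value: float
    identity: float  # percent
    accession: str


def top_homologs(hits: List[BlastHit], exclude: Optional[str] = None,
                 limit: int = 3) -> List[BlastHit]:
    """Drop the query protein itself, then keep the best-scoring hits."""
    kept = [h for h in hits if h.accession != exclude]
    return sorted(kept, key=lambda h: h.max_score, reverse=True)[:limit]


def format_hit(hit: BlastHit) -> str:
    """Render: Name, Max score/Query Cover/E-Value/Ident, [link Accession]."""
    return (f"{hit.name}, {hit.max_score:g}/{hit.query_cover:g}%/"
            f"{hit.e_value:.0e}/{hit.identity:g}%, [link {hit.accession}]")


# Placeholder hits; names and accessions are not real database entries.
hits = [
    BlastHit("query protein", 700, 100, 0.0, 100, "ACC_SELF"),
    BlastHit("hypothetical protein A", 650, 99, 1e-50, 87.5, "ACC_A"),
    BlastHit("hypothetical protein B", 400, 95, 1e-30, 60, "ACC_B"),
]
for hit in top_homologs(hits, exclude="ACC_SELF"):
    print(format_hit(hit))
```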

Background: What is a Functional Module?

A functional module is the lowest-level functioning unit above the level of a gene. This could take several forms, depending on which gene you have. For example, if the gene encodes a sub-unit of an enzyme (as in the polymerase example above), the functional module is the enzyme: the sub-unit does nothing by itself, but the enzyme does. Alternatively, your gene may be a member of a metabolic pathway: helping to build a molecule that the cell needs. In this case, the pathway is the functional module: for example, Histidine biosynthesis. If the gene encodes a protein which performs a standalone task, the functional module can be a higher-level grouping of genes/proteins which do related things. For instance, toxin transport, DNA repair, etc.

Choosing an appropriate functional module will require judgement based on your investigation of the gene's function above. Feel free to ask for help or co-ordinate with others who have similar genes. Some other resources which may aid your investigation are:

  • The 'Gene Ontology' classifiers associated with the protein: you can find these under 'GO-Molecular function' and 'GO-Biological process' on UniProt, or on the E. coli functional analog's EcoCyc page (under GO Terms).
  • The E. coli functional analogue's "Component of" data (for multi-subunit enzymes) on EcoCyc.
  • Resources you find while writing the description of the gene.

Background: Finding Equivalent E. coli Genes

There are a number of different approaches to finding the equivalent E. coli genes. For some genes, there may be a direct equivalent in E. coli. For example, the RNA polymerase is very similar (we say conserved) between M. mycoides and E. coli, so there is likely a matching gene for the gene encoding an RNA polymerase subunit in JCVI-Syn3.0. For other genes, there may not be a direct equivalent, but instead there may be a functional equivalent. For example, JCVI-Syn3.0 might gain access to a molecule it needs by scavenging it from the outside environment (acquiring a molecule through a transporter), while E. coli constructs the molecule using biosynthesis. In this case, find the gene(s) that encode the equivalent function.

As you're investigating the equivalent genes, you can try different resources:

  • Homology searches (particularly BLASTp), as above
  • Functional searches on EcoCyc
  • Other databases: InterPro (a protein analysis tool), UniProt, KEGG (a pathway and function database)

Try to identify your gene through several avenues: convergent answers will indicate you've found the best equivalent. Note that many of the databases above will also be useful for identifying your unknown genes.

Background: Unknown Genes

Your unknown gene has been assigned from one of two categories in the JCVI database: 'Unknown' or 'Generic' (see column AJ). We have the least information about the 'Unknown' genes: they are genes for which the JCVI investigators could not find even a potential activity. We have slightly more information about genes categorized 'Generic'. Most generic genes encode an identifiable protein, but the biological role or function of that protein is unclear. For example, a generic gene might encode a kinase, which transfers a phosphate group from high-energy molecules (such as ATP) to a different substrate. While we can determine that the protein encoded by the generic gene performs this role, we don't know which specific high-energy molecule or substrate, or why that reaction is essential to cellular life.

Some of the unknown genes do have associated functional categories and annotations: these may provide a starting point, but don't be constrained by them.

Part 2: Expression

The expression specification you create will be used not just to document what a gene should do, but to drive the testing of your gene. For example, specifying a high level will result in us testing the gene at a high level when your gene is tested with other genes.

Do not spend more than 10 minutes on this per gene. In many cases, there will not be a clear answer, or it may be very hard to find a definite answer.

Expression Level

At which level should the gene be expressed in the cell? Recall problem set 4: the expression level effectively determines how much protein we want to produce. For certain proteins we want a lot; for others, a small amount is enough. As we have discussed, our cell-free test system is significantly different from the JCVI-Syn3.0 cell, so we do not expect to be able to pick exact levels. Instead, we will coarsely estimate the level at which each gene should be expressed, grouping genes into one of three bins: low, medium, or high.

Finding the expression level

How do we determine at what level a gene should be expressed? We can look to measurements made from living cells, simulations of how cells function, as well as our understanding of a gene and the protein it encodes. We have curated several resources which you can use to investigate expression:

  • Mycoplasma genitalium simulation results: although accurate measurements of protein levels in Mycoplasma mycoides are difficult to come by, a whole cell model of a closely related organism, Mycoplasma genitalium, was created by a group at Stanford (see WholeCellViz and the Covert Lab). This model simulates almost all of what is going on inside a single cell, including transcription, translation, and metabolism. We can read out the data on protein levels, and use them to inform the expression level of our genes.
    • The data is at File:MgenitaliumSimProteinCounts.xlsx.
    • Look for your gene or a close relative in the M. genitalium dataset and decide on an expression level.
    • This is the best resource for determining the expression level of your JCVI-Syn3.0 genes.
  • Escherichia coli proteome dataset: a "proteome" is a full measurement of all of the proteins in a cell, or a population of cells. In 2015, Schmidt et al. reported a "quantitative proteome" of E. coli, which measures the levels of all expressed proteins under different growth conditions (see Schmidt 2015). We have selected the data for growth under standard conditions.
    • The data is at File:EcoliProteomicExpressionData.xlsx.
    • Look for your gene or a close relative in the E. coli dataset and decide on an expression level.
    • This is the best resource for determining the expression level of your E. coli functional analogues.
  • Literature research: if neither of the above resources provides a level for your gene, use your judgement and the research you performed in Part 1 to select a level based on the function of the gene.
    • What does the protein interact with, and at what level are those proteins expressed? How much work does it have to do?
    • Use Google to search for your gene and 'expression level' and see what results you find.

We are only seeking a best attempt at the correct expression level: low, medium, or high. Do not worry about getting an exact result, but do document any uncertainty you have below.
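One rough way to turn protein counts from either dataset into the three bins is to rank the genes and split them into thirds. This is only a sketch of one possible rule, not a cutoff defined by the course: the function name `tercile_bins`, the gene names, and the counts below are all placeholders.

```python
from typing import Dict


def tercile_bins(counts: Dict[str, float]) -> Dict[str, str]:
    """Label each gene low/medium/high by its rank within the dataset."""
    ranked = sorted(counts, key=counts.get)  # lowest count first
    n = len(ranked)
    labels = ("low", "medium", "high")
    return {gene: labels[min(3 * i // n, 2)] for i, gene in enumerate(ranked)}


# Placeholder counts, not values from the supplied spreadsheets.
example = {"geneA": 5, "geneB": 20, "geneC": 150,
           "geneD": 400, "geneE": 3000, "geneF": 9000}
print(tercile_bins(example)["geneE"])  # high
```

Rank-based bins avoid choosing absolute thresholds, but they depend on which genes are in the dataset, so treat the label as a starting point and record your uncertainty.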

Unknown genes

As it is unlikely you will find your genes in either supplied dataset, make a best guess based on your investigation into the gene. If all else fails, select high. Document this choice.


In one or two sentences, explain why your gene might need to be expressed at this level. Consider the function of your gene, and the information on what it does and its context that you discovered while writing your gene description. Consider how it fits in with other components of its functional module, and other parts of the cell (based on your answers in Gene Context, below).


Which information did you use to decide on the expression level? How sure or unsure are you about your choice? Keep this brief (one or two sentences), but provide links to any resources you used other than the M. genitalium and E. coli datasets we provided.

Expression Time

At which time should the gene be expressed in the lifecycle of our organism? As above, contextualize the gene in terms of what it does, what has been reported, and what you learned in Part 1. For example, we might need central dogma components working early, but cell division components later on.

Select your expression time from one of:

  • right at beginning
  • early
  • late
  • unknown/impossible to tell


As above, in one or two sentences, explain why your gene might need to be expressed at this time. Consider the factors you looked at for expression level, your gene's function, and its dependencies as documented below.


Document why you made your decision and link to resources you used (if any). Keep it brief, one or two sentences.

Part 2: Gene Context

You have already specified the Functional Category (e.g. Transcription) and the Functional module (e.g. RNA polymerase). Let's zoom out, and put your gene into the cellular context. The gene context information you provide will allow us to build up a dependency network of which genes depend on what others, and we can use this network to drive our testing.

Other Components

Based on the functional module you decided on in Part 1, identify one other gene/component that is required for that functional module to actually work. For example, RNA polymerase has multiple subunits, so if your protein is RNA polymerase subunit alpha, you could specify RNA polymerase subunit beta.

Find the page for the gene encoding the other component on this wiki, and provide a link to it using the Wiki link syntax. This will look like MMSYN1_xxxx or EGxxxxx for JCVI or E. coli genes, respectively. There are two easy ways to find the component page:

  • Search the wiki for the gene name (in our example, rpoB), using the search box in the top right corner.
  • Find the gene in the JCVI Database, and link to the ID.

Be sure to link to the gene for the appropriate organism (i.e., JCVI for JCVI, E. coli for E. coli).


Other components on which the functional module depends, but that are not part of the module itself. For example, the ribosome module might require that the RNA polymerase module is running before it is useful. Specify one possible dependency, if applicable. For example, the Ribosome module is of little use without amino acids to link, so Histidine biosynthesis would be a sensible choice.

  • The resources you used in writing your description of the gene may be helpful in identifying dependencies.

Provide links to any references you used in determining the dependency you identified.


What process does the functional module take part in? Identify one process and provide the inputs and outputs to this process. Whether your functional modules are metabolic, related to central dogma (transcription and translation), or participate in some other cell function will drive what these inputs and outputs are. Use your judgement to pick the appropriate inputs and outputs. Consider the resources you used when annotating your gene in Part 1 in order to find out what process the gene takes part in. Both EcoCyc and UniProt are likely to have useful information.

For example, hexokinase (a component of central metabolism) processes glucose into glucose-6-phosphate, using one ATP (an energy molecule). You could list the process as:

  • Inputs: Glucose, ATP
  • Outputs: Glucose-6P, ADP

As another example, we've learned that rpoA is the alpha-subunit of the RNA polymerase. The functional polymerase module has a number of processes, one being extension of a messenger RNA. We could document this process as follows:

  • Inputs: mRNA(n) + nucleoside triphosphate (NTP)
  • Outputs: mRNA(n+1) + PPi

Provide a link to any resources or references you use in identifying the process.
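If you keep your notes in a script, a process entry can be held as a small record and rendered in the same Inputs/Outputs layout as the examples above. The `Process` class and its `as_bullets` method are hypothetical names used only for illustration; the hexokinase values are taken from the example on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Process:
    module: str
    inputs: List[str]
    outputs: List[str]

    def as_bullets(self) -> str:
        """Render the process in the two-bullet layout used on this page."""
        lines = [
            "  • Inputs: " + ", ".join(self.inputs),
            "  • Outputs: " + ", ".join(self.outputs),
        ]
        return "\n".join(lines)


hexokinase = Process("Hexokinase", ["Glucose", "ATP"], ["Glucose-6P", "ADP"])
print(hexokinase.as_bullets())
```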

Linking to the category

Finally, to finish your assignment, link up your gene with its Functional Category (e.g. Transcription) in the wiki.

To do that, click on the appropriate category on the sidebar at the left of the page, create/edit the page that comes up, and add a link to your gene and write a sentence on what it does. For example:

* [[MMSYN1_0669|rplW]] rplW encodes the 50S ribosomal protein L23, which provides a docking site for trigger factor at the ribosome polypeptide exit tunnel.
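The category entry follows a fixed pattern (a bullet, a piped wiki link from gene ID to gene name, then a one-sentence summary), so it can be generated with a one-line helper. The function name `category_entry` is hypothetical; the example values are the rplW entry shown above.

```python
def category_entry(gene_id: str, gene_name: str, summary: str) -> str:
    """Build one wiki list item linking the gene page under its gene name."""
    return f"* [[{gene_id}|{gene_name}]] {summary}"


print(category_entry(
    "MMSYN1_0669",
    "rplW",
    "rplW encodes the 50S ribosomal protein L23, which provides a docking "
    "site for trigger factor at the ribosome polypeptide exit tunnel.",
))
```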


We have an example specification sheet available for the rplW JCVI gene, MMSYN1_0669.


We will do this - not part of your assignment.

  • Synthesis Score: The synthesis score of your construct: 1, 2, or 3
  • Predicted Translation Rate: Prediction of construct translation rate from the RBS calculator
  • Design Notes and Details: For example, had to use a rare codon to fix folding energy.
  • GenBank File: A link to the GenBank file.