Practical 1

Introduction

You have been assigned a specific gene from a specific organism from this list of genes. Your task is to find out about the organism, the gene, the enzyme it encodes and the reaction that the enzyme catalyses.

The following instructions are not very specific about where on the relevant web pages you will find the items you need. This is because the page designs keep changing and it is difficult to keep instructions up to date. You'll just have to hunt! Also, because of the cross-linking between data items, there's more than one route to the same item, so it's quite easy to go round in circles. Finally, the types and amount of information available are different for different genes, so some of the web pages can appear different depending on what is listed. This does mean that the ways to do things given below are just suggestions; you may find the information following different routes.

Exercise 1: basic information about your gene and organism

Here you will be conducting database searches to find links to your gene and its protein product. The gene designation I have given you is likely to consist of an organism designation (three or four letters) followed by a number (usually a sequence number for the coding sequence in the genome annotation). However, within a particular database, the associated data is likely to be linked to unique and specific identifiers or accession numbers. Look out for (and note for future reference):

Gene id
Protein id

Sometimes one database will include cross-references to the corresponding ids in other databases. One exception to these 'local' designations concerns enzymes: if your gene encodes an enzyme, the enzyme activity is represented by an internationally agreed designation, the EC number, in the format EC n.n.n.n, which is common across all the databases.
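As an illustration of the EC n.n.n.n format, here is a minimal Python sketch that splits an EC number into its four fields and names the top-level enzyme class. The function names parse_ec and enzyme_class are made up for this example and are not part of any database's tools. The sketch assumes that the trailing fields may be given as "-" in partially classified entries.

```python
import re

# Four dot-separated fields, optionally prefixed by "EC".
# Assumption: the last three fields may be "-" when the activity
# is only partially classified (e.g. 2.7.1.-).
EC_PATTERN = re.compile(r"^(?:EC\s*)?(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)$")

# The seven top-level classes of the enzyme nomenclature.
EC_CLASSES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
    "7": "Translocases",
}


def parse_ec(text):
    """Return the four EC fields as a tuple of strings, or None if malformed."""
    match = EC_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.groups()


def enzyme_class(text):
    """Return the name of the top-level class, or None if not recognised."""
    fields = parse_ec(text)
    if fields is None:
        return None
    return EC_CLASSES.get(fields[0])


if __name__ == "__main__":
    for candidate in ["EC 1.1.1.1", "EC2.7.1.-", "not an EC number"]:
        print(candidate, parse_ec(candidate), enzyme_class(candidate))
```

Because the EC number is shared across databases, a normalised form like this is a convenient key when comparing what NCBI, KEGG and UniProtKB say about the same activity.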

Databases to consult

The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This resource in the USA includes a number of well-known databases, including PubMed and GenBank. The global Query page is the point from which you can find all database references to your gene. The two sections of the NCBI site to be consulted today, reached from the query page (or the links below), are:
Gene where you can enter your gene id.
Blast where the starting point for the sequence comparison searches can be found.

You can also consult:

KEGG, the Kyoto Encyclopedia of Genes and Genomes in Japan, which links genome annotations to functional information, in particular enzymes and metabolism.
ExPASy, in Switzerland, which focuses on proteins and enzymes and tools to examine them. In particular:
UniProtKB - containing Swiss-Prot (a manually curated database of information about known proteins) and TrEMBL (automatic annotations from the EBI - not curated and therefore more speculative);
Proteomics tools - for a range of calculations of protein properties;
Enzyme - one of the copies of the enzyme catalogue (to be explored later).
EBI, the European Bioinformatics Institute, based in England, which has mirrors of some of the other databases, but also has a number of its own databases related to functional interpretation of genomic information.

A more extensive list of links is available here

Task 1: Identify your gene

Go to NCBI Query and enter your gene id. The table on the page should refresh and indicate where there are relevant links. Look for:

Genome. This links to the organism genome.
- What can you find out about your organism and its classification?
- Re-enter your gene in the search box to see its chromosomal position and its neighbours. Some of the information on the map is cryptic; you have to click on the gene markers for it to pop up.
- Do nearby genes have related functions?
Gene. This will give the gene_id for the NCBI databases. Note it down for future reference. At this point, you should be able to find the nucleotide sequence of your gene. It's useful to make a copy of this for pasting into other applications such as similarity searches. For this, the FASTA format (a header line beginning with '>' followed by the one-letter base sequence, uninterrupted by numbering etc.) is most widely used, so look for a link that leads you to this.
- What is the functional annotation of your gene? If an enzyme, is there an EC number?
Protein. Note down the protein_id of the protein product (generally the translated nucleotide sequence). Take a copy of the protein sequence; again, the FASTA format (uninterrupted one-letter amino acid codes) is the most useful. If there is information about the protein product, you may also find it on UniProtKB. There are two ways of finding the corresponding protein: inserting the FASTA format protein sequence in the search box on UniProtKB, or trying to get a match between gene or protein ids you already have via the ID mapping search.
Go to KEGG and search using your gene name and/or enzyme EC number to expand on the information above, especially the metabolic context of an enzyme.
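If you want to check the FASTA sequences you copied before pasting them elsewhere, the following minimal Python sketch may help. It reads FASTA-formatted text into (header, sequence) pairs and makes a crude guess at whether a sequence is nucleotide or protein. The function names read_fasta and looks_like_nucleotide and the example records are invented for illustration; real records will have the headers supplied by the database.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined so the sequence is uninterrupted.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def looks_like_nucleotide(sequence):
    """Crude check: True if every character is A, C, G, T, U or N."""
    return set(sequence.upper()) <= set("ACGTUN")


if __name__ == "__main__":
    # Made-up example records, not real database entries.
    example = ">gene_demo hypothetical gene\nATGCTTTCA\nGGATAA\n>protein_demo\nMLSG\n"
    for header, sequence in read_fasta(example):
        kind = "nucleotide" if looks_like_nucleotide(sequence) else "protein"
        print(header, len(sequence), kind)
```

The nucleotide check is deliberately simple: a very short protein made only of A, C, G, T and N residues would be misclassified, so treat it as a sanity check rather than a test.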

Task 2: Similarity searching (BLAST)

Go to NCBI Blast.

Select nucleotide blast and enter the gene_id or the FASTA nucleotide sequence in the search box. Also select the non-redundant nucleotide collection (nr/nt) to search against, but do not enter your organism name at this point; leave the organism box blank. Start the search and wait for the (very large) results page to appear. (The default number of matches reported is 100; this can be increased if you want to see matches to more distantly related sequences.) The summary of the top matches is shown in a diagram near the top of the results. Information appears when you roll the mouse over the diagram, and clicking on it jumps you to the related information. Note that the 'score' relates to the quality of the match: the higher the score, the better it is. The 'E value' (expect value) is the number of matches of equivalent or better quality that you would expect to find purely by chance when searching a database of the size you have searched with a query of the same length; the lower the value, the better the match. For small values it is close to the probability of finding at least one such chance match. An E value of 0.1 or greater indicates the match could well have occurred merely by chance.
- What do you notice about the similar sequences reported? Are they for genes of the same function, or do they differ?
- Are the best matches with genes in closely-related organisms? (There is an option to see the matches arranged as a distance tree; this is like a phylogeny, but not exactly the same as the distances are all relative to the reference sequence rather than all v. all.)
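To make the relationship between score, database size and E value concrete, here is a minimal Python sketch based on the standard formula E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function names are invented for this example. BLAST itself uses length-corrected ("effective") values of m and n, so these numbers are an approximation and will not exactly reproduce the values on a results page.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used here, whereas BLAST applies
    edge-effect corrections to both lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one(e_value):
    """Probability of one or more chance matches: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Same bit score and query, two hypothetical database sizes.
    for database_length in (10**6, 10**9):
        e = expected_hits(50.0, 1000, database_length)
        print(database_length, e, chance_of_at_least_one(e))
```

The sketch shows two things worth remembering when reading results: the same score gives a larger (worse) E value against a larger database, and for small E values the E value and the probability of a chance match are almost identical.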
Repeat the nucleotide blast, but this time confine the search to the genome of your organism.
- Do similar sequences to your gene occur elsewhere within the genome of the organism? If so, are these matches to genes of the same or related function?
Repeat the above exercises, but this time comparing protein sequences. Similarity can be pursued over greater distances with the protein sequences because some nucleotide substitutions are synonymous at the protein level, and the similarity scoring takes account of whether or not substitutions are for closely related amino acids. The protein blast can be performed using Blastp, entering the FASTA protein sequence or protein_id as the query, or using Blastx, entering the gene sequence or id; if you use Blastx, make sure you select the appropriate genetic code for your organism. With both options, search against the non-redundant protein database.
- Have you found more distant, good quality matches (assessed by E values) using a protein search? (You might need to increase the number of reported matches to tell.)
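The effect of synonymous substitutions can be seen in a small Python sketch. Two made-up DNA sequences (not taken from any database) differ at half of their positions yet translate to the same peptide. The sketch assumes the standard genetic code; as noted above, your organism may need a different translation table. The function names translate and identity are invented for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; any trailing partial codon is ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    # Hypothetical sequences chosen to differ only by synonymous changes.
    seq_a = "ATGCTTTCAGGA"
    seq_b = "ATGTTGAGCGGT"
    print("nucleotide identity:", identity(seq_a, seq_b))
    print("protein identity:", identity(translate(seq_a), translate(seq_b)))
```

This is an extreme case, but it illustrates why a protein search can still find a relative whose nucleotide sequence has diverged too far for a nucleotide search to detect.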

Now go to Exercise 2.