Please Note

This data has been generated for Mary Ann Moran, and if you are not in her group please do not use this data. This site is solely for the exchange of data between our groups.

Please read our views on sharing data

Download the reformatted sequences

There are fasta files and quality files for each of the sequences.

Links can be removed at any time.

Data Analysis on the four samples

Files

454 File Name          Number
454Reads_SID408.fna    63
454Reads_SID409.fna    64
454Reads_SID410.fna    65
454Reads_SID411.fna    66

These are the samples that we have worked on. For ease of manipulation, they are renamed as numbers. The numbers are shown in the table.

The numbers are basically arbitrary. We started at 1 a while ago and now we are up to 63. These are mostly 454 sequences, but a few are other metagenomes. It just makes data manipulation and handling a lot easier to have a number instead of a name.

The reformatted sequences can be downloaded from the links on the right.

Initial Sample Preparation

As part of our initial preparation we renumber the sequences. This involves two steps: first removing all the exact duplicate sequences, and second numbering the remaining sequences, starting at an essentially arbitrary integer and incrementing from there.

Exact duplicates

All the 454 libraries contain exact duplicate sequences. These are sequences that are exactly the same sequence and exactly the same length, starting and finishing at the same point. These sequences are removed before the analysis begins because they are an artefact of the sequencing process and not real new sequence. Leaving them in would skew the statistics and other analyses.

File Name              Number   Number of sequences   Number of    Number of sequences
                                from 454              duplicates   after removing duplicates
454Reads_SID408.fna    63       66,534                11,686       54,848
454Reads_SID409.fna    64       58,876                 8,562       50,313
454Reads_SID410.fna    65       16,444                 3,998       12,446
454Reads_SID411.fna    66       38,642                 4,869       33,773

The number under "Number of duplicates" is the number of dropped sequences. One of the sequences is kept (generally the first one that is found in the file).
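The duplicate removal described above can be sketched roughly as follows. This is a minimal illustration and not the script we actually ran: the function names `read_fasta` and `drop_exact_duplicates` are made up for this example, and comparing sequences case-insensitively is an assumption.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record of each distinct sequence and count the rest.

    Two reads are duplicates only if the whole sequence matches, which
    also means they have the same length and the same start and end.
    """
    seen = set()
    kept = []
    dropped = 0
    for header, seq in records:
        key = seq.upper()  # assumption: case differences are not meaningful
        if key in seen:
            dropped += 1
        else:
            seen.add(key)
            kept.append((header, seq))
    return kept, dropped


# Tiny hypothetical library: read2 and read4 are exact copies of read1,
# while read3 is one base shorter and therefore is not a duplicate.
example = """>read1
ACGTACGT
>read2
ACGTACGT
>read3
ACGTACG
>read4
ACGTACGT
""".splitlines()

kept, dropped = drop_exact_duplicates(read_fasta(example))
print(len(kept), dropped)
```

For a real library you would pass an open file handle to `read_fasta` instead of the inline example; the value of `dropped` corresponds to the "Number of duplicates" column.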

For the 454Reads_SID408.fna file, the breakdown of the duplicates is shown in the table below. For example, there were 4,412 different sequences that each appeared two times in the library. Only one copy of each was kept. Similar data is available for the other files.

Number of identical sequences   Number of duplicates
2                               4412
3                               1114
4                               454
5                               225
6                               141
7                               71
8                               49
9                               27
10                              21
11                              15
12                              12
13                              11
14                              8
15                              6
16                              2
17                              4
18                              2
19                              2
20                              1
22                              2
23                              1
57                              1
TOTAL                           11686
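Note that the TOTAL is not the sum of the right-hand column. It is the number of dropped sequences: a sequence that appears n times contributes n - 1 dropped copies, because one copy is kept. A short check of that arithmetic, using the values from the table (the variable names are just for illustration):

```python
# copies of a sequence -> number of different sequences seen that many times
copies_to_count = {
    2: 4412, 3: 1114, 4: 454, 5: 225, 6: 141, 7: 71, 8: 49, 9: 27,
    10: 21, 11: 15, 12: 12, 13: 11, 14: 8, 15: 6, 16: 2, 17: 4,
    18: 2, 19: 2, 20: 1, 22: 2, 23: 1, 57: 1,
}

# One copy of each sequence is kept, so (copies - 1) are dropped.
dropped = sum((copies - 1) * count for copies, count in copies_to_count.items())
print(dropped)  # 11686, the TOTAL in the table

# This also matches the library table: 66,534 reads minus 11,686 dropped.
print(66534 - dropped)  # 54848
```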

Duplicate sequences are not given new numbers (because we don't keep them), but this is a list of the exact dups that you can find in the original fasta files.

Renumbering

We just give sequences a number, and once upon a time we started at one. To avoid confusion, we just continued with these numbers for your sequences, though I can regenerate the numbers starting at one again for you. The main advantage of numbers is a purely geeky thing: it is faster to index and access numbers in databases than it is to index and access letters, so downstream data handling becomes quicker. When you have lots of sequences like this, names basically become meaningless anyway.

For each of the libraries that we worked with, these are the first and last numbers of the sequences. Other sequences are incremental between these.

File Name              Number   First sequence   Last sequence
454Reads_SID408.fna    63       6580839          6635686
454Reads_SID409.fna    64       6635687          6685999
454Reads_SID410.fna    65       6686000          6698445
454Reads_SID411.fna    66       6698446          6732218

As you can see, not very imaginative!
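Because the numbers are consecutive, the first and last numbers of each library tell you how many sequences it holds, and each library starts where the previous one stopped. The sketch below checks that against the de-duplicated counts and shows one way the numbering could be done. The `renumber` helper is hypothetical and is not the script we used.

```python
# library number -> (first sequence, last sequence, sequences after dedup)
libraries = {
    63: (6580839, 6635686, 54848),
    64: (6635687, 6685999, 50313),
    65: (6686000, 6698445, 12446),
    66: (6698446, 6732218, 33773),
}

# Consecutive numbering means last - first + 1 sequences per library.
sizes = {lib: last - first + 1 for lib, (first, last, _) in libraries.items()}
for lib, (_, _, expected) in libraries.items():
    assert sizes[lib] == expected


def renumber(records, start):
    """Replace each header with a consecutive integer beginning at start."""
    return [(str(start + i), seq) for i, (_, seq) in enumerate(records)]


# Hypothetical reads, numbered from the first number of library 63.
renamed = renumber([("readA", "ACGT"), ("readB", "GGCC")], start=6580839)
print(renamed)
```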

Results

At last!

These are the preliminary results that we are generating from this data. As the results flow in, I will include them here.

16S Analysis

The sequences were compared to the 16S database (version 9) using BLASTN. The summarized results in the table below consist of two tables, one showing the first hit for each sequence, and the second showing the counts of each hit. We also have an interface that allows you to choose from the top 10 best hits for each sequence, and finally we have the raw results.

We usually find a need for some manual input in deciding the 16S results. These pages allow you to look through the best 10 hits to each sequence, and provide an interface to choose the best hits. These will be saved to a local file that should be the second link. Note that the second link won't be active until you have been through the files the first time. This will also give you a nice summary of the number of hits.

Manually check results   Parsed results   All results
63                       63               63
64                       64               64
65                       65               65
66                       66               66
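The two summary tables (first hit per sequence, and counts of each hit) can be produced along these lines. This is only a sketch under assumptions: it supposes the BLASTN output is tab-separated with the query ID in the first column and the hit ID in the second, and that the hits for each query are listed best first. The hit names in the example are made up.

```python
from collections import Counter


def first_hits(lines):
    """Return {query: first hit}, keeping only the first line per query."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        best.setdefault(query, subject)  # later hits for a query are ignored
    return best


# Hypothetical tabular output; only the first two columns are used.
sample = [
    "6580839\thitA\t99.0",
    "6580839\thitB\t97.5",
    "6580840\thitA\t98.2",
    "6580841\thitC\t96.0",
]

best = first_hits(sample)          # table 1: first hit for each sequence
counts = Counter(best.values())    # table 2: how often each hit was seen

for hit, n in counts.most_common():
    print(hit, n)
```

The first hit is not always the right answer for 16S, which is why the manual checking pages exist.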

Phage Analysis

We routinely use BLASTX and BLASTN to compare the sequences to the phage database. We then generate a summary based on the hits. From these data we can generate summaries for the BLAST hits that we see:

File   BLASTN   BLASTX
63     counts   counts
64     counts   counts
65     counts   counts
66     counts   counts

We also have the raw data from the BLAST analysis available here:

File   BLASTN             BLASTX
63     summarized (raw)   summarized (raw)
64     summarized (raw)   summarized (raw)
65     summarized (raw)   summarized (raw)
66     summarized (raw)   summarized (raw)

NR Blast

This is the data from the BLAST against the SEED nr.

There are many types of non-redundant DNA and protein databases. Most people are familiar with the ones generated by GenBank called nt for nucleotide and nr for protein. However, many other places also make non-redundant databases. All it means is that you grab sequences from different places and remove the ones that occur more than once (the redundancy).

For all analysis I use the SEED non-redundant database. This is a non-redundant database composed of sequences from GenBank, KEGG, UniProt, the sequencing centers (TIGR, Sanger, JGI, Broad, etc.), and other sources and databases wherever we find them. The SEED is an integrated annotation platform that has more genomes in a single site than any other integration. It has more proteins from more places, because we have a team of people whose job is to hunt down new sequences.

What happens is we grab data from all over the place. Typically the source is given a short name and then a vertical bar and then the id, so fig|83333.1.peg.123 comes from fig, but the identical protein is also in databases with these names:

GeneID is from NCBI, as are NP_ and gi. b0123 is the E. coli b number, eric is from the ERIC database (https://asap.ahabs.wisc.edu), kegg is from KEGG, sp is SwissProt, tr is TrEMBL and uni is UniProt. Not all proteins are present at all sources, and there are more sources than those listed here. Rather than having each of these in our non-redundant file many times, we just make one entry and give it a unique ID. A separate file, called peg.synonyms, keeps track of those matches.

In addition to exactly identical proteins we also allow variation at the 5' end (start) of the proteins, since different sources use different gene calling/trimming algorithms. Generally the 3' end is conserved because the stop site is easy to recognize, but the start site can be different each time. Therefore, we allow up to 30% variation in length (as long as the sequence is identical) and we do not consider the first amino acid (since some sites convert GTG to Met and some do not).

Comparing a 454 library against this database is not a trivial task, and usually takes several days. See http://phage.sdsu.edu/~rob/Pyrosequencing/test_data/ for some test data sets that I developed and compared on a couple of different machines.
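To make the matching rule for synonyms concrete, here is a rough sketch of how two protein sequences might be judged to be the same entry. It is not the SEED code. The function names are invented, and measuring the 30% length variation relative to the longer of the two sequences is an assumption, as is requiring the shorter protein to match the end of the longer one exactly.

```python
def split_id(identifier):
    """Split 'source|id' into its two parts, e.g. fig|83333.1.peg.123."""
    source, _, local_id = identifier.partition("|")
    return source, local_id


def same_protein(a, b, max_length_variation=0.30):
    """Decide whether two protein sequences count as the same entry.

    The first amino acid is ignored (a GTG start may or may not be
    written as Met), the stop end must agree, and the start may differ
    as long as the length difference stays within the allowed fraction.
    """
    a, b = a[1:], b[1:]                        # ignore the first amino acid
    shorter, longer = sorted((a, b), key=len)
    if not longer:
        return True                            # both are empty after trimming
    variation = (len(longer) - len(shorter)) / len(longer)
    if variation > max_length_variation:
        return False
    return longer.endswith(shorter)            # identical up to the stop


print(split_id("fig|83333.1.peg.123"))

# Hypothetical sequences: same protein called with two different starts.
long_call = "MSKVLAAGTWQERPLN"
short_call = "MLAAGTWQERPLN"
print(same_protein(long_call, short_call))

# Same start position, but one source wrote the first residue as V.
print(same_protein("VLAAGTWQERPLN", "MLAAGTWQERPLN"))

# Different stop end, so these are not treated as the same protein.
print(same_protein("MLAAGTWQERPLN", "MLAAGTWQERPLK"))
```

Sequences that pass a test like this would share one entry in the non-redundant file, with the other IDs recorded in peg.synonyms.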

Finally, we use this BLASTX as the basis of our "orf" calling. This is a rough-and-ready technique that seems to work well for the 454 sequences, but should not be relied on! Hence we have different sets of sims:

These are the blast results:

    Raw results, unexpanded and uncalled.

  1. 63
  2. 64
  3. 65
  4. 66

    Expanded, orf-called sims suitable for the SEED.

  1. 4444444.23
  2. 4444444.24
  3. 4444444.25
  4. 4444444.26

    Peg synonyms file. You will need this to expand the raw results.

  1. peg.synonyms