Gene Report

Student's Name

Institution

Course

Date

Introduction

There are two main methods for automatic gene prediction: ab initio methods and comparative methods. Ab initio methods use the DNA sequence as the only input and are referred to as intrinsic methods. There are several features that can be identified in a genomic sequence and used to identify genes computationally. Such features are related either to the signals that regulate the biological mechanisms of gene expression (signal sensors), or to biases in sequence composition in DNA regions that are translated into proteins (content sensors). Signal sensors are typically splice-sites (donor: GTRAGT, acceptor: YAG, branch-site: CTRAY), the start of translation (codon ATG), and the end of translation (codons TGA, TAA, and TAG). The content sensor most commonly used is bias in codon usage: regions of DNA coding for a protein use some codons more frequently than others. Both signal sensors and content sensors must be trained, i.e., we must start from a set of observations (such as known genes) from which we build a sensor model. Predicting a gene therefore involves looking for new features in the genomic sequence that resemble our model. The resemblance can be established in terms of probabilities.
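As an illustration of how signal sensors can be located computationally, the following minimal sketch scans the forward strand of a sequence for the consensus motifs listed above. It assumes the standard IUPAC codes R = A/G and Y = C/T; the function names and the example sequence are hypothetical, and a trained sensor would score candidate sites probabilistically rather than match a fixed pattern.

```python
import re

# IUPAC codes needed for the consensus motifs (R = purine, Y = pyrimidine)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]"}

# Consensus motifs quoted in the text
SIGNALS = {
    "donor": "GTRAGT",
    "acceptor": "YAG",
    "branch": "CTRAY",
    "start": "ATG",
}
STOP_CODONS = ("TGA", "TAA", "TAG")


def motif_to_regex(motif):
    """Turn an IUPAC motif into a regex; the lookahead allows overlapping hits."""
    pattern = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + pattern + "))")


def find_signals(seq):
    """Return {signal name: [0-based start positions]} for the forward strand."""
    seq = seq.upper()
    hits = {
        name: [m.start() for m in motif_to_regex(motif).finditer(seq)]
        for name, motif in SIGNALS.items()
    }
    hits["stop"] = [i for i in range(len(seq) - 2) if seq[i:i + 3] in STOP_CODONS]
    return hits


if __name__ == "__main__":
    # Made-up sequence used only to exercise the scanner
    for name, positions in find_signals("ATGGTAAGTCTGACCAGTAA").items():
        print(name, positions)
```

Exact motif matching of this kind produces many false positives on real genomes, which is why signal sensors are normally combined with content sensors.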

Comparative methods are called extrinsic methods. They include two strategies: those that use homology with sequences from other genes, also called homology-based, and those that make comparisons with genomic sequence from other genomes, also called comparative-genomics-based. Homology-based methods predict a gene from the alignment of a protein sequence, or an RNA sequence in the form of a full-length mRNA, cDNA or EST (expressed sequence tag), with the genome sequence that we want to annotate. The known sequence (also called evidence) guides the prediction. There are several ways of applying homology-based methods. The simplest is to accept the alignment of the known sequence to the genome as the gene prediction. More advanced methods use the known sequence as a guide and try to complete the evidence to yield a complete gene structure. The efficacy of the latter method depends on the number of known gene sequences; hence it is limited by the completeness of biological databases. Comparative-genomics-based methods hypothesise that any sequences conserved between two relatively closely-related genomes are functional and likely to code for a gene.

Methods

Nucleotide BLAST (BLASTn) was used to map the unknown gene sequence.

Results

| Description | Scientific Name | Max Score | Total Score | Query Cover | E value | Per. Ident | Acc. Len | Accession |
|---|---|---|---|---|---|---|---|---|
| Talaromyces marneffei isolate 11CN-20-091 chromosome 8, complete sequence | Talaromyces marneffei | 36828 | 36828 | 100% | 0.0 | 99.91% | 1953797 | CP045660.1 |
| Talaromyces marneffei strain TM4 chromosome 8, complete sequence | Talaromyces marneffei | 36747 | 36747 | 100% | 0.0 | 99.84% | 1967754 | CP015875.1 |
| Talaromyces marneffei ATCC 18224 DUF962 domain protein (PMAA_009450), mRNA | Talaromyces marneffei ATCC 18224 | 1953 | 2112 | 5% | 0.0 | 100.00% | 1142 | XM_002153011.1 |
| Talaromyces marneffei ATCC 18224 DUF962 domain protein (PMAA_009450), mRNA | Talaromyces marneffei ATCC 18224 | 1953 | 2837 | 7% | 0.0 | 100.00% | 1557 | XM_002153010.1 |
| Talaromyces marneffei ATCC 18224 DUF962 domain protein (PMAA_009450), mRNA | Talaromyces marneffei ATCC 18224 | 1953 | 2724 | 7% | 0.0 | 100.00% | 1495 | XM_002153008.1 |
| Talaromyces marneffei ATCC 18224 conserved hypothetical protein (PMAA_009390), partial mRNA | Talaromyces marneffei ATCC 18224 | 1938 | 3151 | 8% | 0.0 | 100.00% | 1725 | XM_002153002.1 |
| Talaromyces marneffei ATCC 18224 short-chain dehydrogenase, putative (PMAA_009410), partial mRNA | Talaromyces marneffei ATCC 18224 | 1592 | 1843 | 4% | 0.0 | 100.00% | 996 | XM_002153004.1 |
| Talaromyces marneffei ATCC 18224 cytochrome P450, putative (PMAA_009420), partial mRNA | Talaromyces marneffei ATCC 18224 | 1413 | 3428 | 9% | 0.0 | 100.00% | 1851 | XM_002153005.1 |
| Talaromyces marneffei ATCC 18224 DUF962 domain protein (PMAA_009450), mRNA | Talaromyces marneffei ATCC 18224 | 1382 | 2634 | 7% | 0.0 | 100.00% | 1441 | XM_002153009.1 |
| Talaromyces marneffei ATCC 18224 benzoate 4-monooxygenase cytochrome P450, putative (PMAA_009430), partial mRNA | Talaromyces marneffei ATCC 18224 | 1177 | 2368 | 6% | 0.0 | 99.84% | 1278 | XM_002153006.1 |

Discussion

Both models share the same architecture, apart from additional or absent inputs, and are trained by gradient descent. We ran a small number of preliminary tests on the data, split randomly into training and test halves, to choose a suitable architecture size, whereas the cross-validation was performed only once. Varying the number of network parameters from roughly 5,000 to 10,000 during these preliminary experiments produced only very small gains in prediction accuracy (at most 0.5 percent). During training, we label an additional ±5 residues around each SCOP boundary definition as boundary residues; for testing, however, the original SCOP definition is used. Because the problem is highly unbalanced, the optimal threshold for assigning boundaries (the one that maximizes the boundary measure below) is usually less than 0.5. We therefore estimate the optimal threshold on the training folds and apply it to the evaluation folds.
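The threshold selection described above can be sketched as follows. This is only an illustration under stated assumptions: the boundary measure is not defined in this section, so the F1 score stands in for it, and the function names, candidate grid and example scores are all hypothetical.

```python
def f1_score(labels, scores, threshold):
    """F1 of the positive (boundary) class when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(labels, scores, candidates=None):
    """Pick the candidate threshold with the highest F1 (lowest one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    return max(candidates, key=lambda t: f1_score(labels, scores, t))


if __name__ == "__main__":
    # Made-up, unbalanced training data: 8 non-boundary residues, 2 boundary residues
    train_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    train_scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.25, 0.30, 0.40]
    threshold = best_threshold(train_labels, train_scores)

    # The threshold chosen on the training data is then reused unchanged
    eval_scores = [0.10, 0.35, 0.20]
    print(threshold, [int(s >= threshold) for s in eval_scores])
```

With few positive examples the network outputs for true boundaries tend to stay low, so a cut-off tuned on the training data can fall well below 0.5.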

BLAST is an acronym for the Basic Local Alignment Search Tool, which refers to a collection of programs that generate alignments between a nucleotide or protein sequence, known as the "query," and nucleotide or protein sequences in a database, known as "subject" sequences. The original BLAST program used a protein "query" sequence to search a protein sequence database. A version using a nucleotide query sequence and a nucleotide sequence database was added shortly afterwards. The introduction of an intermediate layer, in which nucleotide sequences are translated into their corresponding protein sequences according to a specified genetic code, allows cross-comparisons between nucleotide and protein sequences. Specialized BLAST variants allow fast searches of nucleotide databases with very large query sequences, or alignment of a single pair of sequences. BLAST is available both as a standalone program and as a web version from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov). The web version provides searches of the complete genomes of Homo sapiens and several model species, including mouse, rat, fruit fly and Arabidopsis thaliana, which provide the full genomic context for BLAST alignments.
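To make the idea of a local alignment concrete, the sketch below computes the best local alignment score between two sequences with the Smith-Waterman dynamic programming recurrence. This is not the BLAST algorithm itself, which uses a faster heuristic to approximate such alignments; the scoring values (match 2, mismatch -1, gap -2) and the example sequences are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Made-up sequences sharing the short region "ACG"
    print(smith_waterman_score("TTTACGTTTT", "GGACGGG"))
```

Only the shared region contributes to the score, which is what distinguishes a local alignment from a global one that must span both sequences end to end.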

BLASTn (nucleotide BLAST): compares one or more nucleotide query sequences to a subject nucleotide sequence or a nucleotide sequence database. This is useful for determining the evolutionary relationships between various species (see Comparing two or more sequences below).

BLASTx (translated nucleotide sequence searched against protein sequences): compares a nucleotide query sequence, translated in all six reading frames (yielding six protein sequences), to a protein sequence database. Because blastx translates the query sequence in all six reading frames and provides combined significance statistics for hits to different frames, it is particularly useful when the reading frame of the query sequence is unknown or when the sequence contains errors that can cause frame shifts or other coding errors. blastx is therefore often the first analysis performed on a newly determined nucleotide sequence.
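The six-frame translation that blastx and tblastn rely on can be illustrated with a short sketch. It assumes the standard genetic code and an unambiguous A/C/G/T sequence; the frame labels (+1 to +3, -1 to -3), the function names and the example sequence are hypothetical conventions chosen for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops become '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the three forward and three reverse-strand translations."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    # Made-up nine-base sequence used only to show the six frames
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Because a single inserted or deleted base shifts the reading frame, comparing all six translations lets a search still find the protein-level match in whichever frame the coding region continues.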

tBLASTn (protein sequence searched against translated nucleotide sequences): compares a protein query sequence to the six-frame translations of a nucleotide sequence database. tblastn is useful for finding homologous protein-coding regions in unannotated nucleotide sequences, such as expressed sequence tags (ESTs) and draft genome records (HTG), found in the BLAST databases est and htgs, respectively. ESTs are short, single-read cDNA sequences. They make up the largest pool of sequence data for many species and include transcripts of many uncharacterized genes. Since ESTs have no annotated coding sequences, the BLAST protein databases contain no corresponding protein translations. A tblastn search is therefore the only way to search for these potential protein-coding regions at the protein level. The HTG sequences, draft sequences from various genome projects or large genomic clones, are another large source of unannotated coding regions.

BLASTp (protein BLAST): compares one or more protein query sequences to a subject protein sequence or a protein sequence database. This is useful when trying to identify a protein