DNA sequences, Computer Engineering


The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite small, which makes the problem challenging. In the machine learning community, this is known as the imbalanced dataset problem. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
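One of the sampling techniques mentioned above, random undersampling of the majority class, can be sketched as follows. This is only an illustration, not a method prescribed by the assignment: the function name `undersample`, the `ratio` parameter (negatives kept per positive) and the label encoding (1 = binding site, 0 = non-binding site) are all assumptions made for the example.

```python
import random


def undersample(kmers, labels, ratio=1.0, seed=0):
    """Keep every positive k-mer and a random subset of the negatives.

    ratio is the assumed number of negatives to keep per positive.
    Returns a shuffled list of (kmer, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    positives = [k for k, y in zip(kmers, labels) if y == 1]
    negatives = [k for k, y in zip(kmers, labels) if y == 0]

    # Never ask for more negatives than are available.
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    kept_negatives = rng.sample(negatives, n_keep)

    data = [(k, 1) for k in positives] + [(k, 0) for k in kept_negatives]
    rng.shuffle(data)
    return data


# Toy data: one binding site among five non-binding k-mers.
kmers = ["ACACGGGA", "ccttacac", "gaattaat", "tcagatca", "atcaataa", "ttacacaa"]
labels = [1, 0, 0, 0, 0, 0]
balanced = undersample(kmers, labels, ratio=2.0)
```

Undersampling should be applied to the training partition only, so that the test partition keeps the original class distribution.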

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a test dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated in terms of recall, precision, F-measure and recognition rate on both the training dataset and the test dataset.
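The four measures can be computed from the counts of true and false positives and negatives. The sketch below assumes that labels are encoded as 1 (binding site) and 0 (non-binding site), and that "recognition rate" means overall accuracy, that is, the fraction of k-mers labelled correctly; the assignment text does not define the term, so this reading is an assumption. The function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return recall, precision, F-measure and recognition rate.

    y_true and y_pred are equal-length sequences of 0/1 labels.
    Recognition rate is taken to be overall accuracy (an assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against empty denominators when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    recognition_rate = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
        "recognition_rate": recognition_rate,
    }


# Toy example: tp = 1, fn = 1, fp = 1, tn = 2.
scores = evaluate([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

On an imbalanced dataset the recognition rate alone can look high even when no binding site is found, which is why recall, precision and F-measure are reported alongside it.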

It is very important to note that, unlike the traditional way of evaluating a classifier's performance, here a k-mer is counted as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequence. For example, consider the two true binding sites ACACGGGA and ACACGGGA (shown in upper case) in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8-mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions because they have 50% and 75% overlap with the true binding sites in sequence (1), respectively. Conversely, if a classifier labels them as non-binding sites, we count them as incorrect predictions because they have at least 50% overlap with the true binding sites. Take another 8-mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as a misclassification because it has only 25% overlap with the true binding site ACACGGGA.
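The overlap rule can be checked with a short sketch. It assumes that true binding sites are marked in upper case, as in sequence (1), and that overlap is measured as a fraction of the k-mer length; in the example the k-mers and the binding sites are both 8 bases long, so the choice of denominator makes no difference. The function names are hypothetical.

```python
import re


def binding_sites(sequence):
    """Return half-open (start, end) intervals of the upper-case runs."""
    return [m.span() for m in re.finditer(r"[A-Z]+", sequence)]


def overlap_fraction(start, k, sites):
    """Largest fraction of the k-mer [start, start + k) covered by one site."""
    end = start + k
    best = 0
    for site_start, site_end in sites:
        # Negative values mean the two intervals do not intersect.
        best = max(best, min(end, site_end) - max(start, site_start))
    return best / k


def is_motif_instance(start, k, sites, threshold=0.5):
    """Apply the assignment's rule: at least 50% overlap with a true site."""
    return overlap_fraction(start, k, sites) >= threshold


seq = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
sites = binding_sites(seq)

for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    start = seq.find(kmer)  # first occurrence of the k-mer in sequence (1)
    frac = overlap_fraction(start, len(kmer), sites)
    print(kmer, start, frac, is_motif_instance(start, len(kmer), sites))
```

For the three 8-mers discussed above this gives overlaps of 0.5, 0.75 and 0.25, so the first two are motif instances and the third is not. These overlap-based labels are the ground truth against which the classifier's predictions are scored.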

