DNA sequences, Computer Engineering


The dataset provided in this assignment contains a collection of real DNA sequences. The number of true binding sites is quite limited, which makes the problem challenging. In the machine learning community, such data are called imbalanced datasets. Techniques for imbalanced data classification, such as sampling or filtering, can be applied to the biological data. It is a good idea to consult relevant publications to see how effective classifiers for motif recognition can be built.
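As an illustration of the sampling idea, the sketch below randomly undersamples the majority (non-binding) class. It is only one of several possible sampling techniques and is not prescribed by the assignment. The function name `undersample`, the `ratio` parameter and the 0/1 label encoding are assumptions made for this example.

```python
import random


def undersample(examples, labels, ratio=1.0, seed=0):
    """Randomly drop negatives so that negatives <= ratio * positives.

    Assumes labels are 1 (binding site) or 0 (non-binding site).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep every positive and a random subset of the negatives.
    n_keep = min(len(neg), int(ratio * len(pos)))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)
    return [examples[i] for i in kept], [labels[i] for i in kept]
```

Undersampling should be applied to the training data only, so that the testing data keep their original class distribution.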

The whole dataset should be partitioned into a training dataset, used to build the learner models, and a testing dataset, used to evaluate the generalization capability of the classification systems. System performance will be evaluated by the recall, precision, F-measure and recognition rate on both the training dataset and the testing dataset.
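The four measures can be computed from the counts of true positives, false positives, false negatives and true negatives. The sketch below assumes that "recognition rate" means overall accuracy and that the F-measure is the balanced F1 score. The text does not define either term, so both are assumptions, as is the function name `evaluate`.

```python
def evaluate(tp, fp, fn, tn):
    """Return precision, recall, F-measure and recognition rate.

    Assumptions: F-measure is F1, recognition rate is accuracy.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    recognition_rate = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "recognition_rate": recognition_rate,
    }
```

On an imbalanced dataset the recognition rate can be high even when few binding sites are found, which is why precision, recall and F-measure are reported alongside it.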

It is very important to notice that, unlike the traditional way of evaluating a classifier's performance, here a k-mer is counted as a motif instance if its location has at least 50% overlap with a true binding site in the DNA sequences. For example, consider the two true binding sites, both ACACGGGA, shown in upper case in the following DNA sequence.

ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa (1)

Suppose that the 8-mers acaaACAC and ACGGGAtc are classified as binding sites by a learner model. We then count them as correct predictions because they have 50% and 75% overlap, respectively, with the true binding sites in sequence (1). Conversely, if the classifier labels them as non-binding sites, we count them as incorrect predictions because they overlap the true binding sites by at least 50%. Take another 8-mer, GAgaatta, in (1). If a learner model classifies it as a binding site, it is counted as misclassified because it has only 25% overlap with the true binding site ACACGGGA.
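The overlap rule can be sketched as follows. The code assumes that true binding sites are marked by upper-case runs, as in sequence (1), and that overlap is measured as the fraction of the k-mer's positions that fall inside a single true site. The function names `find_sites`, `overlap_fraction` and `is_true_instance` are hypothetical and chosen for this example.

```python
import re


def find_sites(sequence):
    """Return half-open (start, end) spans of upper-case runs.

    Assumes upper case marks true binding sites, as in sequence (1).
    """
    return [m.span() for m in re.finditer(r"[A-Z]+", sequence)]


def overlap_fraction(start, k, sites):
    """Largest fraction of the k-mer at `start` covered by one site."""
    best = 0
    for site_start, site_end in sites:
        shared = min(start + k, site_end) - max(start, site_start)
        best = max(best, shared)  # negative values mean no overlap
    return best / k


def is_true_instance(start, k, sites, threshold=0.5):
    """Apply the 'at least 50% overlap' rule from the text."""
    return overlap_fraction(start, k, sites) >= threshold


seq = "ccttacacaaACACGGGAgaattaatACACGGGAtcagatcaataaa"
sites = find_sites(seq)
for kmer in ("acaaACAC", "ACGGGAtc", "GAgaatta"):
    pos = seq.find(kmer)  # first (case-sensitive) occurrence
    print(kmer, overlap_fraction(pos, len(kmer), sites))
```

For the three 8-mers discussed above, this gives overlap fractions of 0.5, 0.75 and 0.25, so only the first two count as true motif instances under the 50% threshold.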

