How many sequence fragments in dataset

Reference no: EM133083310

INF 503 Large-Scale Data Structures And Organization - Northern Arizona University

For this homework, you will need the High Throughput Sequence reads dataset:

• This file contains approximately 36 million 'reads' (genomic sequence fragments of equal length) from multiple datasets (14 in total)
• Each read is exactly 50 nucleotides (characters) long
• The read set is in FASTA format (see insert)
o The headers are unique and consist of the read ID number (e.g. R1) followed by a series of 'copy number' values giving the number of times this read is present in sample 1, 2, ... (separated by underscores "_")
o The genomic sequences use the alphabet {A, C, G, T, N}
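
The header layout described above can be split on underscores to recover the per-sample copy numbers. The following is a minimal sketch only: the exact header text (assumed here to look like ">R1_5_0_12", with the usual FASTA ">" prefix) and the helper name parse_header are assumptions, not taken from the dataset itself.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: parses a header such as ">R1_5_0_12" (assumed layout)
// and writes up to max_counts copy numbers into counts[].
// Returns the number of copy-number fields that were found.
int parse_header(const char* header, unsigned int* counts, int max_counts) {
    assert(header != nullptr && counts != nullptr);
    // Skip the read ID: everything before the first underscore.
    const char* p = std::strchr(header, '_');
    int n = 0;
    while (p != nullptr && n < max_counts) {
        ++p;  // step past the underscore
        char* end = nullptr;
        unsigned long value = std::strtoul(p, &end, 10);
        if (end == p) break;  // no digits after the underscore: stop
        counts[n++] = static_cast<unsigned int>(value);
        p = std::strchr(end, '_');  // move to the next field, if any
    }
    return n;
}
```

With 14 samples, counts would be an array of 14 unsigned integers, and counts[k] would hold the copy number of the read in sample k + 1.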

Problem #1 Arrays and Classes
Create a class called FASTA_readset. The class holds a single FASTA read dataset (so you will need 14 instances of it) together with all of the functions needed to operate on that set. Use an array data structure to store the genomic sequences of the given read dataset. Store each sequence in a character array (char[ ]) rather than a 'string' object, so that a single dataset is held in an array-of-arrays object. At minimum, the class must contain (15pts):
• A default constructor (zeroes everything out)
• At least one custom constructor (parses the combined file and fills in the actual data)
• A function to alphabetically sort the sequence fragments within the FASTA_readset
• A function to implement a binary search within the fragments of the FASTA_readset
• A single function to compute the statistics for the Readset (see below)
• A destructor
• Comments describing major code blocks and control structures
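
One possible skeleton for the requirements listed above is sketched below. It covers only the constructors, the destructor, and the array-of-arrays storage; the sort, binary search, and statistics functions are left out. The header layout (">R1_5_0_12"), the two-lines-per-record file layout, the 0-based dataset index, and all member names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <string>

class FASTA_readset {
public:
    static const std::size_t READ_LEN = 50;  // every read is 50 characters

    // Default constructor: zeroes everything out.
    FASTA_readset() : reads_(nullptr), copies_(nullptr), size_(0), capacity_(0) {}

    // Custom constructor: scans the combined file and keeps every read whose
    // copy number for `dataset` (0-based, an assumption) is non-zero.
    FASTA_readset(const char* path, int dataset)
        : reads_(nullptr), copies_(nullptr), size_(0), capacity_(0) {
        std::ifstream in(path);
        std::string header, seq;  // std::string is used for line I/O only
        while (std::getline(in, header) && std::getline(in, seq)) {
            unsigned int count = copy_number(header.c_str(), dataset);
            if (count > 0 && seq.size() >= READ_LEN) append(seq.c_str(), count);
        }
    }

    // Destructor: releases every read, then the two outer arrays.
    ~FASTA_readset() {
        for (std::size_t i = 0; i < size_; ++i) delete[] reads_[i];
        delete[] reads_;
        delete[] copies_;
    }

    // Copying is disabled so that two objects never own the same arrays.
    FASTA_readset(const FASTA_readset&) = delete;
    FASTA_readset& operator=(const FASTA_readset&) = delete;

    std::size_t size() const { return size_; }
    const char* read(std::size_t i) const { return reads_[i]; }
    unsigned int copies(std::size_t i) const { return copies_[i]; }

private:
    // Returns the copy number in column `dataset` of the header, or 0 if absent.
    static unsigned int copy_number(const char* header, int dataset) {
        const char* p = header;
        for (int field = 0; field <= dataset; ++field) {
            p = std::strchr(p, '_');
            if (p == nullptr) return 0;
            ++p;  // first character after the underscore
        }
        return static_cast<unsigned int>(std::strtoul(p, nullptr, 10));
    }

    // Stores one read (first READ_LEN characters) and its copy count.
    void append(const char* seq, unsigned int count) {
        if (size_ == capacity_) grow();
        reads_[size_] = new char[READ_LEN + 1];
        std::memcpy(reads_[size_], seq, READ_LEN);
        reads_[size_][READ_LEN] = '\0';
        copies_[size_] = count;
        ++size_;
    }

    // Doubles the capacity of both outer arrays, keeping existing entries.
    void grow() {
        std::size_t new_cap = (capacity_ == 0) ? 1024 : capacity_ * 2;
        char** r = new char*[new_cap];
        unsigned int* c = new unsigned int[new_cap];
        for (std::size_t i = 0; i < size_; ++i) { r[i] = reads_[i]; c[i] = copies_[i]; }
        delete[] reads_;
        delete[] copies_;
        reads_ = r;
        copies_ = c;
        capacity_ = new_cap;
    }

    char** reads_;          // array of READ_LEN-character arrays
    unsigned int* copies_;  // copy count of each stored read
    std::size_t size_;
    std::size_t capacity_;
};
```

Under these assumptions, FASTA_readset d1("reads.fa", 0) would build the object for the first dataset; the file name is a placeholder. Reading the file once per dataset is simple but slow for 14 datasets, so a single pass that fills all 14 objects may be preferable in practice.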

A. Read in the combined dataset and initialize all 14 instances of the FASTA_readset object. Hint: You may want to retain the copy count of each fragment as a separate array.

• How many unique sequence fragments are in each of the 14 datasets?
• How many total sequence fragments are in each dataset (i.e. when you consider copy numbers)?
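
Both counts asked for above can be derived from the per-fragment copy counts of one dataset. The sketch below assumes one reading of the questions: a fragment is 'unique' to a dataset if its copy number there is non-zero, and the 'total' is the sum of those copy numbers. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for one dataset.
struct ReadsetStats {
    std::size_t unique_reads;        // fragments with a non-zero copy count
    unsigned long long total_reads;  // sum of all copy counts
};

// copies[i] is the copy count of fragment i in the dataset of interest.
ReadsetStats compute_stats(const unsigned int* copies, std::size_t n) {
    ReadsetStats s = {0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        if (copies[i] > 0) {
            ++s.unique_reads;
            s.total_reads += copies[i];
        }
    }
    return s;
}
```

The total is kept in a 64-bit counter because summing copy counts over tens of millions of reads could overflow a 32-bit integer.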

B. Without alphabetically sorting any of the data in the FASTA_readset objects, compare the contents of datasets 1 and 2 (i.e. use the fragments in dataset 1 as queries to search in dataset 2). Make sure you continue to consider copy count in your answer.
• What is the 'big O' notation of your search (linear / quadratic / cubic / etc)?
• How long does it take (in seconds) to search for all fragments of dataset 1 within dataset 2? Please note that, depending on the efficiency of your algorithm, this step may take a long time. First estimate the total time using 1,000, 10,000, and 100,000 queries. If the estimated total time is greater than 24 CPU hours, provide the estimate rather than the exact number.
• How many sequence fragments in dataset 1 are also in dataset 2? (estimate if needed)
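
The unsorted comparison and the time estimate described above might be sketched as follows. This is an illustration under assumptions, not a reference solution: the function names are hypothetical, reads are assumed to be 50-character arrays, and the extrapolation assumes the cost per query stays roughly constant. Copy counts are not weighted here; that would mean adding the query's copy count instead of 1 for each match.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

const std::size_t READ_LEN = 50;  // every read is 50 characters

// Scans every target until the query is found: O(n_targets) per query.
bool linear_contains(const char* const* targets, std::size_t n_targets,
                     const char* query) {
    for (std::size_t i = 0; i < n_targets; ++i) {
        if (std::memcmp(targets[i], query, READ_LEN) == 0) return true;
    }
    return false;
}

// Runs the first n_queries searches, stores the number of hits in *matches,
// and returns the elapsed wall-clock time in seconds.
double time_linear_search(const char* const* queries, std::size_t n_queries,
                          const char* const* targets, std::size_t n_targets,
                          std::size_t* matches) {
    auto start = std::chrono::steady_clock::now();
    std::size_t found = 0;
    for (std::size_t q = 0; q < n_queries; ++q) {
        if (linear_contains(targets, n_targets, queries[q])) ++found;
    }
    auto stop = std::chrono::steady_clock::now();
    *matches = found;
    return std::chrono::duration<double>(stop - start).count();
}

// Scales a timed sample up to the full query set (assumes constant cost per query).
double estimate_total_seconds(double sample_seconds, std::size_t sample_queries,
                              std::size_t all_queries) {
    return sample_seconds * (static_cast<double>(all_queries) /
                             static_cast<double>(sample_queries));
}
```

Timing 1,000, 10,000, and 100,000 queries with time_linear_search and comparing the three extrapolations gives a check on whether the per-query cost really is stable. Since each query may scan all of dataset 2, the work grows with the product of the two dataset sizes.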

C. Alphabetically sort the sequence fragments in each of the FASTA_readset objects and implement a binary search function to compare the contents of datasets 1 and 2 (i.e. use the fragments in dataset 1 as queries to search in dataset 2).
• What is the 'big O' notation of your search (linear / quadratic / cubic / etc)?
• How long (in seconds) does it take to search for 1,000 queries? How about 10,000 or 100,000? Does the increase in time make sense? Explain the differences (if any) when compared to the search times obtained as part of 1B.
• How many sequence fragments in dataset 1 are also in dataset 2?
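
A minimal sketch of the sort-then-search approach described above, using the C library's qsort on the array of read pointers and a hand-written binary search. The function names are hypothetical and reads are assumed to be 50-character arrays. Note that memcmp compares raw character codes, which for the alphabet {A, C, G, N, T} coincides with alphabetical order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

const std::size_t READ_LEN = 50;  // every read is 50 characters

// qsort comparator: each element of the sorted array is a char* to one read.
int compare_reads(const void* a, const void* b) {
    const char* ra = *static_cast<const char* const*>(a);
    const char* rb = *static_cast<const char* const*>(b);
    return std::memcmp(ra, rb, READ_LEN);
}

// Sorts the read pointers alphabetically; the sequences themselves do not move.
void sort_reads(char** reads, std::size_t n) {
    std::qsort(reads, n, sizeof(char*), compare_reads);
}

// Binary search over a sorted array of reads: O(log n) comparisons per query.
bool binary_contains(const char* const* sorted, std::size_t n, const char* query) {
    std::size_t lo = 0, hi = n;  // search the half-open range [lo, hi)
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int cmp = std::memcmp(query, sorted[mid], READ_LEN);
        if (cmp == 0) return true;
        if (cmp < 0) hi = mid;  // query sorts before sorted[mid]
        else lo = mid + 1;      // query sorts after sorted[mid]
    }
    return false;
}
```

Sorting only the pointer array leaves a separate copy-count array out of step with the reads. If copy counts are kept in a parallel array, one option is to sort an array of small structs that pair each read pointer with its count.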

Attachment: Data Structures.rar
