For this homework, you will need to use the most recent human genome assembly located on Monsoon:
/common/contrib/classroom/inf503/genomes/human.txt
• This file contains multiple scaffolds that comprise the human genome
• The genome is in FASTA format (see insert)
o The headers are unique and always begin with the ">" character. These can be discarded for this homework.
![812_Human genome.jpg](https://secure.expertsmind.com/CMSImages/812_Human genome.jpg)
o Each line of the genome file is exactly 80 characters long (plus a carriage return character)
o The genomic sequences consist of the following alphabet {A, C, G, T, N}
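The format rules above can be sketched as a small line classifier. This is only an illustration, not part of the assignment: the helper name `classify_line` and the sample text are made up, and the snippet assumes each line ends in a carriage return and/or newline.

```python
# Hypothetical helper; the sample scaffold below is invented for illustration.
ALPHABET = frozenset("ACGTN")  # the alphabet stated in the assignment


def classify_line(raw_line):
    """Return ("header", name) or ("sequence", bases) for one FASTA line."""
    line = raw_line.rstrip("\r\n")  # drop the carriage return / newline
    if line.startswith(">"):
        return "header", line[1:]   # headers can be discarded later
    if not set(line) <= ALPHABET:
        raise ValueError(f"unexpected character in sequence line: {line!r}")
    return "sequence", line


sample = ">scaffold_1\r\nACGTN\r\nGGCA\r\n"
for raw in sample.splitlines(keepends=True):
    print(classify_line(raw))
```

Stripping the line ending before storing a sequence line matters: otherwise the carriage returns would end up inside the genome array and inflate the scaffold lengths.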
Problem #1: Monsoon account creation and workshop
• Navigate to the account creation page for NAU's High Performance Computing Cluster (Monsoon)
• Complete the Self-Paced Workshop
• Obtain and submit the validation codes to self-validate your account
• Take a screenshot of the successful 'confirm user' command (see example below) and submit it as part of your writeup to complete problem #1 of the assignment.
Problem #2: basic text processing
Write code to read, store, and analyze the latest human genome assembly (found at:
/common/contrib/classroom/inf503/genomes/human.txt). At minimum, your code must contain:
• A character array to store the entire human genome in a single data structure
• A separate function to read the human genome file
• A function to compute the number of A, C, G, or T characters in the human genome
• Comments describing major code blocks and control structures
A. Read in and store the human genome. There will be multiple scaffolds (each with a separate header denoted by ">"). Concatenate the entire genome (discarding headers) into a single character array data structure. Collect the following statistics (see below) as you read the file. Hint: you can keep running totals or store scaffold sizes / names in a separate set of arrays.
• How many scaffolds were there?
• What were the longest and shortest scaffolds? Provide the name and length of each.
• What was the average scaffold length?
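One possible shape for part A is sketched below. It is a minimal illustration under stated assumptions, not a reference solution: the function names `read_genome` and `scaffold_stats` are hypothetical, a Python `bytearray` stands in for the "character array", and the input is assumed to be opened in binary mode. The demo uses a tiny made-up genome instead of the real file.

```python
import io


def read_genome(handle):
    """Concatenate every scaffold from a FASTA handle opened in binary mode.

    Returns (genome, names, lengths): one bytearray holding all bases, plus
    parallel lists with each scaffold's name and length.
    """
    genome = bytearray()
    names, lengths = [], []
    for raw in handle:
        line = raw.strip()              # removes "\n" and any "\r"
        if not line:
            continue                    # tolerate blank lines
        if line.startswith(b">"):       # header: start a new scaffold
            names.append(line[1:].decode("ascii"))
            lengths.append(0)
        else:
            if not lengths:
                raise ValueError("sequence data found before any header")
            genome += line              # append bases to the single array
            lengths[-1] += len(line)    # running total for this scaffold
    return genome, names, lengths


def scaffold_stats(names, lengths):
    """Summarize scaffold count, longest, shortest, and average length."""
    if not lengths:
        raise ValueError("no scaffolds were read")
    longest = max(range(len(lengths)), key=lengths.__getitem__)
    shortest = min(range(len(lengths)), key=lengths.__getitem__)
    return {
        "count": len(lengths),
        "longest": (names[longest], lengths[longest]),
        "shortest": (names[shortest], lengths[shortest]),
        "average": sum(lengths) / len(lengths),
    }


# Tiny made-up input standing in for the real genome file.
demo = io.BytesIO(b">s1\nACGT\nAC\n>s2\nGGN\n")
genome, names, lengths = read_genome(demo)
print(scaffold_stats(names, lengths))
```

For the real assembly, the same function could be called as `with open(path, "rb") as fh: read_genome(fh)`. The human genome is roughly 3 billion bases, so the array alone needs about 3 GB of memory, and a line-by-line loop in Python may take several minutes.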
B. Write a function to assess the content of the human genome - count the total number of a given character (A, C, G, or T) in the whole genome.
• What is the 'big O' notation of your search (linear / quadratic / cubic / etc)?
• How long does it take (in seconds) to execute this function? Hint: You will need to use system time within your code to get accurate time estimates.
• What was the GC content of the human genome (percent of C's and G's in the genome)?
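Part B could be approached along the following lines. This is a hedged sketch: `count_base` and `gc_percent` are hypothetical names, the genome is assumed to be a `bytearray` of uppercase bases, and GC content is assumed to be computed over the full array length, including N characters (dividing by the A+C+G+T total instead is another reasonable reading of the question).

```python
import time


def count_base(genome, base):
    """Count one base with a single pass over the array."""
    target = ord(base)
    total = 0
    for value in genome:        # each bytearray element is an int
        if value == target:
            total += 1
    return total


def gc_percent(genome):
    """Percent of C and G characters; the denominator includes N (assumption)."""
    if not genome:
        return 0.0
    gc = count_base(genome, "C") + count_base(genome, "G")
    return 100.0 * gc / len(genome)


# Made-up data standing in for the real genome array.
genome = bytearray(b"ACGTNNGC")

start = time.perf_counter()                 # high-resolution timer
c_count = count_base(genome, "C")
elapsed = time.perf_counter() - start       # elapsed time in seconds

print(f"C count: {c_count}, counted in {elapsed:.6f} s")
print(f"GC content: {gc_percent(genome):.1f}%")
```

Because `count_base` visits every element exactly once, its running time grows in direct proportion to the genome length. The built-in `genome.count(b"C")` performs the same scan in compiled code and is usually far faster than an explicit Python loop.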
![1876_human genome1.jpg](https://secure.expertsmind.com/CMSImages/1876_human genome1.jpg)