Implement a simple utility for drawing dotplots

Reference no: EM131262808

Coding for biologists:

SUBMISSION INSTRUCTIONS

You should submit a single zipped file containing the entire work directory for the assignment.

This should include: all FASTA files, all of your code, and an iPython notebook with the details of your work. All code should be either included in the notebook, or written in separate files that are either imported or run from the notebook via the %run iPython magic (see https://ipython.org/ipython-doc/rel-0.10.2/html/interactive/tutorial.html).

All output and all comments should appear in the notebook. It should be possible to run the entire notebook by running the cells sequentially from the beginning to the end (check that this works by restarting the kernel and working through the notebook from the top). Code and graphic output not linked to (directly or indirectly) from the notebook will not be marked. Only the notebook, Python code, text files required by the software and graphic output produced by the software will be marked. Comment your code thoroughly and format it properly.

MARKING CRITERIA

Your work will be marked based on:
- completeness and correctness: 60%
- quality of the algorithmic solutions (including appropriate use of data and control structures, use of functions, etc.): 30%
- coding style (comments, variable names, readability of code): 10%

Outline

For this assignment, you will implement a simple utility for drawing dotplots comparing two proteins. You can refer to the dotter program and the lecture notes for the Computational Genomics module for inspiration. The assignment is presented as a sequence of stages.

Attempt all questions in the "Required functionality" part before implementing any features marked as "Optional functionality". You can implement any subset you like of the optional functionality. Check that your program runs correctly in the terminal, then use the iPython %run magic to run it from within a notebook. Include sample output for each functionality you implement and any other relevant information in the notebook.

Indicate clearly near the top of the notebook which of the questions you have attempted.

Required functionality

a) Write a dotplot program that reads two proteins from FASTA files specified on the command line (see sys.argv in the Python documentation). The program should output a simple dotplot to the terminal. The dotplot should involve only the first 70 residues of the sequence displayed horizontally and the first 20 residues of the sequence displayed vertically, so as to fit in the standard terminal screen. The first row and the first column should display the two sequences. In the dotplot proper, an asterisk (*) should mark locations corresponding to matching entries, while the rest should be left empty. A sample output (limited here for convenience to 10 residues from one sequence and 5 from the other) should look like:

 TSLWWAPQQR
A     *
K
Q       **
P      *
R         *

Include a sample output in your notebook.
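
The behaviour described in (a) can be sketched as follows. This is only an illustration, not a model answer: the function names read_fasta and text_dotplot are hypothetical, the sketch assumes each FASTA file holds a single protein record, and it trims trailing blanks from each row.

```python
def read_fasta(path):
    """Return the first sequence in a FASTA file as an upper-case string.

    Assumption: only the first record is wanted; later records are ignored.
    """
    residues = []
    seen_header = False
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    break  # a second record starts here, so stop reading
                seen_header = True
            elif line:
                residues.append(line.upper())
    return "".join(residues)


def text_dotplot(horizontal, vertical, width=70, height=20):
    """Build the text dotplot for the first width x height residues."""
    top = horizontal[:width]
    side = vertical[:height]
    # One blank corner cell, then the horizontal sequence as the first row.
    lines = [" " + top]
    for v in side:
        # '*' where the residues are identical, a blank everywhere else.
        row = "".join("*" if h == v else " " for h in top)
        lines.append((v + row).rstrip())
    return "\n".join(lines)


# Reproduces the 10 x 5 sample shown in the assignment text.
print(text_dotplot("TSLWWAPQQR", "AKQPR"))
```

In a full program the two strings would come from read_fasta(sys.argv[1]) and read_fasta(sys.argv[2]); they are hard-coded here so the sketch runs without any files.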

b) Code a simple help message to be displayed when the program is invoked with wrong or insufficient arguments or with the string help on the command line. Run your program from within the notebook to display the help message. To allow for easy modification and translation, the help message should be stored in a separate text file and loaded and displayed upon request.
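
One possible shape for (b), as a hedged sketch: the file name dotplot_help.txt and the helper names load_help and wants_help are assumptions made for illustration. The argument check covers stage (b) only, where exactly two file names are expected; it would need revisiting once the menu in (c) and the options in (e) to (g) are added.

```python
HELP_FILE = "dotplot_help.txt"  # hypothetical file name, kept next to the script


def load_help(path=HELP_FILE):
    """Read the help text from a separate file so it is easy to edit or translate."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        # Fall back to a short notice rather than crashing with a traceback.
        return "Help file '{}' could not be read.".format(path)


def wants_help(args):
    """Decide whether to show the help message.

    args is the argument list without the program name (sys.argv[1:]).
    Assumption for stage (b): a valid call has exactly two FASTA file names.
    """
    return len(args) != 2 or "help" in args


# Typical use at the top of the program:
#     import sys
#     if wants_help(sys.argv[1:]):
#         print(load_help())
#         sys.exit(0)
```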

c) Program a simple menu system of the type found in clustalw that allows the user to specify the names of the input files, obtain help, and quit the program. The menu should be displayed if the program is invoked without command-line arguments, or in any case after a dotplot is produced. You should wait for the user to press the enter key before reverting to the menu, to avoid wiping out the dotplot immediately when running in a terminal. For clarity, print the following line just below the dotplot:

Hit <enter> to return to menu:

Include a screenshot of the menu in the notebook.
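
A menu loop of the kind described in (c) might look like the sketch below. The name menu_loop, the layout of the options dictionary and the injectable ask and show parameters are all assumptions; the last two exist only so that the loop can be exercised without a keyboard.

```python
def menu_loop(options, ask=input, show=print):
    """Display the menu repeatedly until the user chooses 'q'.

    options maps a key (e.g. "1") to a (label, function) pair.
    ask and show default to input and print; they can be replaced in tests.
    """
    while True:
        show("")
        for key, (label, _) in options.items():
            show("  {}. {}".format(key, label))
        show("  q. Quit")
        choice = ask("Your choice: ").strip().lower()
        if choice == "q":
            return
        if choice in options:
            options[choice][1]()  # run the handler, then show the menu again
        else:
            show("Unrecognised choice: {!r}".format(choice))


# Scripted demonstration: choose option 1, then an invalid key, then quit.
answers = iter(["1", "x", "q"])
menu_loop({"1": ("Input files", lambda: print("(would ask for file names)"))},
          ask=lambda prompt: next(answers))
```

The handler that draws a dotplot would finish with input("Hit <enter> to return to menu: ") so that the plot stays on screen until the user is ready.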

d) Implement panning through the sequences to visualise the rest of the dotplot. When a dotplot is displayed, the user should have a choice to press one of five keys to "page" forwards or backwards through either sequence, or return to the main menu. Following this a different portion of the dotplot should be displayed, or the user should be returned to the main menu. For example, a text line printed just below the dotplot should read:

Enter [r]ight, [l]eft, [u]p, [d]own or [m]enu:

The system should be able to handle sequences with a number of residues that isn't a multiple of 20 or 70. Demonstrate this feature in the notebook.
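
The paging arithmetic for (d) can be sketched as below, assuming pages of 70 columns and 20 rows that start at multiples of the page size. The names pan, MOVES and apply_key are hypothetical. Clamping to the last page start is what copes with lengths that are not multiples of 20 or 70: slicing such as sequence[140:210] simply returns the residues that exist.

```python
def pan(offset, length, page, step):
    """Move offset by step pages of size page, staying inside the sequence."""
    # Start of the last page, e.g. 140 for a 150-residue sequence and page 70.
    last_start = max(0, ((length - 1) // page) * page)
    return min(max(offset + step * page, 0), last_start)


# Key -> (column step, row step); [m]enu is handled by the caller.
MOVES = {"r": (1, 0), "l": (-1, 0), "d": (0, 1), "u": (0, -1)}


def apply_key(key, col, row, h_len, v_len, width=70, height=20):
    """Return the new (col, row) offsets after a panning key is pressed."""
    d_col, d_row = MOVES[key]
    return pan(col, h_len, width, d_col), pan(row, v_len, height, d_row)


# A 150 x 45 comparison: paging right twice reaches column 140 and stops there.
col, row = 0, 0
for key in "rrr":
    col, row = apply_key(key, col, row, 150, 45)
print(col, row)
```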

Optional functionality

e) Use a scoring matrix instead of a simple identity check to score corresponding amino acids. Only plot a (*) if the score is above a threshold. The scoring matrix should be stored in a separate file that is loaded as required. The user should be able to select the threshold with a command line option and through the menu; for example mydotplot -t0.3 proteinA.fasta proteinB.fasta should select a threshold value of 0.3. Include sample output in the notebook and comment on the difference with respect to the simpler scoring scheme, if any (you can return to identity matching by choosing the identity matrix as your scoring scheme).
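
The assignment does not fix a file format for the scoring matrix. The sketch below assumes a whitespace-separated table (a header row of residue letters, then one row per residue, with lines starting with # treated as comments), which is how substitution matrices are commonly distributed. The names parse_matrix and is_dot are hypothetical, and treating an unknown residue pair as "no dot" is a further assumption.

```python
def parse_matrix(lines):
    """Parse a scoring matrix into a dict keyed by (row residue, column residue).

    lines is any iterable of text lines, e.g. an open file.
    """
    scores = {}
    columns = None
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        if columns is None:
            columns = fields  # the first data line names the columns
            continue
        row = fields[0]
        for col, value in zip(columns, fields[1:]):
            scores[(row, col)] = float(value)
    return scores


def is_dot(a, b, scores, threshold):
    """True if the score for residues a and b is strictly above the threshold."""
    return scores.get((a, b), float("-inf")) > threshold


# Toy 2 x 2 matrix, for illustration only (not a real substitution matrix).
toy = parse_matrix(["# toy example", "   A  R", "A  4 -1", "R -1  5"])
print(is_dot("A", "A", toy, 0.3), is_dot("A", "R", toy, 0.3))
```

With a file on disk the call would be parse_matrix(open(path)), and an identity matrix stored in the same format recovers the simple matching of stage (a).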

f) Implement filtering with a window of length w.

If you are not implementing (e): only draw a (*) at position (i,j) on the dotplot if the number of matching residues in corresponding positions within windows of length w centred at positions i (respectively j) on the two sequences is above a threshold t. So for instance if w=5 and t=3 a (*) should appear at any given position only if at least 3 corresponding residues within windows of length 5 match (both in the sense that they are the same residue, and that they are in the same position within the window; so for example if the two filtering windows contain "APKTR" and "AKQWR" then A and R count as matches but K does not).

If you are implementing (e): For each position (i,j) in the two sequences, pairs of amino acids in corresponding positions in the filtering windows should be scored using the scoring matrix. These scores should be averaged and compared against the threshold. A (*) should then be printed only if the resulting average score is above the threshold.

In either case you should implement a command line option -f to allow the user to request the use of the filter and specify the length of the window, and an option -t for threshold selection. For instance mydotplot -f5 -t2.0 proteinA.fasta proteinB.fasta should produce a dotplot of protein A vs protein B, filtered with a window of length 5 and a threshold of 2.0. The same functionality should also be accessible through the menu. Invoke your dotplot program on two sample sequences, without and with filtering, include the output in the notebook and comment on the differences.
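
A hedged sketch of the window filter in (f) and of the -f and -t options follows. All function names are hypothetical. Two assumptions are worth flagging: windows are truncated at the ends of the sequences rather than padded, and a pair missing from the scoring matrix scores 0.0. The text says "above a threshold" while its worked example reads "at least 3", so the caller has to choose between > and >= when comparing the returned value with t.

```python
import argparse


def window_pairs(seq_a, seq_b, i, j, w):
    """Residue pairs in corresponding positions of windows centred at i and j."""
    half = w // 2
    pairs = []
    for k in range(-half, w - half):  # w offsets, centred for odd w
        if 0 <= i + k < len(seq_a) and 0 <= j + k < len(seq_b):
            pairs.append((seq_a[i + k], seq_b[j + k]))
    return pairs


def window_matches(seq_a, seq_b, i, j, w):
    """Identity version: number of identical residues in the two windows."""
    return sum(a == b for a, b in window_pairs(seq_a, seq_b, i, j, w))


def window_average(seq_a, seq_b, i, j, w, scores):
    """Scored version: mean matrix score over the two windows.

    i and j must be valid positions, so the window is never empty.
    """
    pairs = window_pairs(seq_a, seq_b, i, j, w)
    return sum(scores.get(pair, 0.0) for pair in pairs) / len(pairs)


def build_parser():
    """Command-line parser accepting e.g. -f5 -t2.0 proteinA.fasta proteinB.fasta."""
    parser = argparse.ArgumentParser(prog="mydotplot")
    parser.add_argument("-f", dest="window", type=int, default=None,
                        metavar="W", help="filter window length")
    parser.add_argument("-t", dest="threshold", type=float, default=None,
                        metavar="T", help="score threshold")
    parser.add_argument("fasta_a")
    parser.add_argument("fasta_b")
    return parser


# The example from the text: only A and R match in the same window position.
print(window_matches("APKTR", "AKQWR", 2, 2, 5))
```

argparse accepts a value attached directly to a single-letter option, which is why -f5 and -t2.0 need no special handling.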

g) Give the user the option to display the dotplot for the entire sequences using a graphic library. I suggest the imshow function from the matplotlib library, but other equivalent choices are also fine (if this library is not present on your system, use the software installer to install python-matplotlib). See https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.imshow

Note that you will not be able to display the sequences with imshow, only the dots will be displayed as an image.

For this to work, you will need to create a two-dimensional array of the appropriate size and set each single entry to 0.0 (black) or 1.0 (white) to differentiate between dots and background. You can pass the keyword argument cmap='gray' to imshow to select a grey scale colormap. If you have implemented point (e) and/or (f), you may want to display the matching score itself as a grey level, instead of creating a black-and-white (two-level) dotplot. It is still useful to set a threshold below which the point is set to white. According to your scoring scheme, you may need to rescale/normalize the thresholded scores for display with imshow (read the function description and the example carefully). Graphic output should be selectable from the command line (via option -g) and from the menu of your program. Include sample output in the notebook (you can get matplotlib images to display directly in the notebook by running the magic %matplotlib inline).
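
The two-level array described above can be sketched as follows, using plain nested lists (which imshow accepts as array-like input) and simple identity matching. The names dot_matrix and show_dotplot are hypothetical, and matplotlib is imported inside show_dotplot so that the array-building part runs even where the library is not installed.

```python
def dot_matrix(horizontal, vertical):
    """Return rows of 0.0 (dot, black) and 1.0 (background, white).

    Row r corresponds to vertical[r], column c to horizontal[c].
    """
    return [[0.0 if h == v else 1.0 for h in horizontal] for v in vertical]


def show_dotplot(matrix):
    """Display the matrix as a grey-scale image (requires matplotlib)."""
    import matplotlib.pyplot as plt

    # vmin/vmax pin the grey scale so 0.0 is black and 1.0 is white even
    # if the matrix happens to contain only one of the two values.
    plt.imshow(matrix, cmap="gray", vmin=0.0, vmax=1.0, interpolation="nearest")
    plt.xlabel("horizontal sequence position")
    plt.ylabel("vertical sequence position")
    plt.show()


# Small check of the array layout; call show_dotplot(matrix) to see the image.
matrix = dot_matrix("TSLWWAPQQR", "AKQPR")
print(len(matrix), len(matrix[0]))
```

For grey-level output, the 0.0 and 1.0 entries would be replaced by scores rescaled to the range 0.0 to 1.0, with sub-threshold positions set to 1.0.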

SUBMISSION CHECKLIST:
- Notebook contains links to all relevant code and all output required
- Notebook runs in a sequence from the first cell to the last with a fresh kernel
- Notebook and software include name of author and/or student number
- No Microsoft Word or other files other than Python code, text and a notebook file, and images generated by the code (with links in the notebook)
- All relevant files are included in the submission as a single .zip file
