Gene Expression and DNA Methylation Assignment

Reference no: EM132419545

Gene Expression and DNA Methylation

Exercise 8.1: Data Preprocessing

The supplement contains a gene expression and a DNA methylation data set of 100 genes from 19 samples. The samples HSC, MPP1, MPP2, CLP, CMP, GMP, MEP, CD4, CD8, B cell, Eryth, Granu and Mono are from blood cells, whereas the samples TBSC, ABSC, MTAC, CLDC, EPro and EDif are from skin tissues.

The values in the data sets are already normalised, but still contain entries with empty or unknown values that need to be removed, as well as multiple entries for one gene that need to be merged before you can work with the data.

(a) Data matrix: The supplement contains the file data_matrix.py with the outline of a DataMatrix class, in which you should implement the following functions:

(1) read_data(): Read tab-separated tables in which the first line gives the column names and the first column gives the row names. Remove rows that have no name or that contain empty or non-numerical values. If several rows share the same name, merge them into a single row by taking the mean at each position of those rows.

(2) get_columns() and get_rows(): Return a dictionary with the column names or row names, respectively, as keys and the corresponding observation lists as values. You need this for later exercises.

(3) not_normally_distributed(alpha, rows): Many statistical analysis methods make assumptions about the underlying distribution. The Shapiro-Wilk test is used to test the null hypothesis that a list of observations comes from a normal distribution.

Use the Shapiro-Wilk test to compute and return the names and p-values of the rows (or columns) with p < alpha. The parameter rows takes a boolean value that specifies whether the test is applied to rows or to columns. You can use the Shapiro-Wilk test from the scipy module.

(4) to_tsv(file_path): Write the processed matrix into a tab-separated file with the same format as the input matrices. The columns should be in the same order as in the input file, and the rows should be in lexicographical order of their names.
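The reading and writing steps described in (1) and (4) could be sketched as follows. This is only an illustration under stated assumptions, not the expected solution: the function names read_matrix and write_matrix are hypothetical, the header line is assumed to contain a leading cell above the row-name column, and NaN or infinite entries are treated as non-numerical.

```python
import csv
import io
import math
from collections import defaultdict


def read_matrix(handle):
    """Return (header, rows) where rows maps each row name to a list of floats."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: header[0] sits above the row names
    grouped = defaultdict(list)
    for record in reader:
        if not record or not record[0].strip():
            continue  # row without a name
        if len(record) != len(header):
            continue  # row with missing cells
        try:
            values = [float(cell) for cell in record[1:]]
        except ValueError:
            continue  # empty or non-numerical entry such as "" or "NA"
        if not all(math.isfinite(v) for v in values):
            continue  # assumption: NaN and inf count as unknown values
        grouped[record[0].strip()].append(values)
    # Merge rows that share a name by taking the mean at each position.
    rows = {
        name: [sum(column) / len(column) for column in zip(*duplicates)]
        for name, duplicates in grouped.items()
    }
    return header, rows


def write_matrix(handle, header, rows):
    """Write the matrix with the original column order and sorted row names."""
    handle.write("\t".join(header) + "\n")
    for name in sorted(rows):
        handle.write("\t".join([name] + [str(v) for v in rows[name]]) + "\n")


# Tiny made-up table: gene A appears twice, C, D and the unnamed row are invalid.
sample = "Gene\tS1\tS2\nB\t1.0\t2.0\nA\t1.0\t3.0\nA\t3.0\t5.0\nC\tNA\t1.0\n\t1.0\t1.0\nD\t\t2.0\n"
header, rows = read_matrix(io.StringIO(sample))
output = io.StringIO()
write_matrix(output, header, rows)
print(output.getvalue())
```

Working on file-like handles rather than paths keeps the sketch easy to try with in-memory strings; a real to_tsv(file_path) would open the path first.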

(b) Process expression and methylation data: In the function exercise_1() in main.py, use your DataMatrix class to read in the expression and methylation tables given in the supplement and write the processed matrices into files. Submit the matrix files with the following names:

• lastname1_lastname2_expression.tsv
• lastname1_lastname2_methylation.tsv
For each input file, report the number of genes and samples whose data does not follow a normal distribution with α = 0.05.
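A minimal sketch of the normality check, assuming the observations are already available as a dictionary like the one returned by get_rows() or get_columns(). The function name not_normal and the example values are made up for illustration; scipy.stats.shapiro is the Shapiro-Wilk implementation from scipy.

```python
from scipy.stats import shapiro


def not_normal(observations, alpha=0.05):
    """Return {name: p_value} for every observation list with p < alpha."""
    flagged = {}
    for name, values in observations.items():
        _, p_value = shapiro(values)  # returns (test statistic, p-value)
        if p_value < alpha:
            flagged[name] = p_value
    return flagged


# Made-up example: one roughly bell-shaped list and one with an extreme outlier.
example = {
    "symmetric": [-2.0, -1.2, -0.7, -0.3, 0.0, 0.3, 0.7, 1.2, 2.0],
    "skewed": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 50.0],
}
flagged = not_normal(example)
print(len(flagged), sorted(flagged))
```

The number to report is then simply the size of the returned dictionary, computed once for rows (genes) and once for columns (samples).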

Exercise 8.2: Correlation Measures

Gene expression and DNA methylation are often investigated using various correlation coefficients. In the last exercise you already used the Pearson correlation, but there are other correlation methods, each with its own pros and cons. Implement the following functions in correlation.py; the Pearson correlation implementation is already given:

(a) rank(X): The Spearman and Kendall correlation coefficients consider the ranking (sort order) of values. To compute the ranking of a value list X:

(1) Compute a sorted version Xs of X, in descending order.
(2) Create a new list Xr that contains the index of each value of X in Xs.
(3) Return Xr.

If X contains a value v multiple times, then all occurrences of v are assigned the mean of their ranks. For example, for a list X = [6, 6, 4, 2, 10] the ranking is Xr = [1.5, 1.5, 3, 4, 0].
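The ranking steps, including the tie rule, can be sketched as below. This is one possible way to do it, not necessarily the intended one; the helper names are illustrative.

```python
def rank(values):
    """Rank values by their 0-based index in descending order, averaging ties."""
    ordered = sorted(values, reverse=True)
    # Collect every index at which a value occurs in the sorted list.
    positions = {}
    for index, value in enumerate(ordered):
        positions.setdefault(value, []).append(index)
    # Tied values share the mean of their indices.
    mean_rank = {value: sum(idx) / len(idx) for value, idx in positions.items()}
    return [mean_rank[value] for value in values]


print(rank([6, 6, 4, 2, 10]))  # [1.5, 1.5, 3.0, 4.0, 0.0]
```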

(b) Spearman correlation(X, Y): The Spearman correlation coefficient calculates a non-parametric correlation by computing the Pearson correlation coefficient on the ranking of two observation lists X and Y of length n. Return the Spearman correlation coefficient for the input lists.
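As an illustration, Spearman's coefficient can be obtained by feeding the rankings into a Pearson function. The pearson function here is a plain stand-in written for this sketch, not the implementation shipped in correlation.py, and it assumes neither list is constant (otherwise the denominator is zero).

```python
from math import sqrt


def rank(values):
    """Descending 0-based ranks with ties averaged."""
    ordered = sorted(values, reverse=True)
    positions = {}
    for index, value in enumerate(ordered):
        positions.setdefault(value, []).append(index)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def pearson(x, y):
    """Stand-in Pearson correlation; assumes non-constant input lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(spread_x * spread_y)


def spearman(x, y):
    """Pearson correlation computed on the rankings of x and y."""
    return pearson(rank(x), rank(y))


# A monotone but non-linear relationship still gives a perfect rank correlation.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```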

(c) Kendall correlation(X, Y): The Kendall correlation coefficient τB computes a non-parametric correlation by counting the concordant and discordant pairs in the ranking of two observation lists X and Y of length n, while considering tied pairs.

(1) Compute the rankings Xr and Yr of the input lists.
(2) Pair the rankings as follows: (Xr,1, Yr,1), (Xr,2, Yr,2), ..., (Xr,n, Yr,n).
(3) Compute the number of concordant pairs nc and discordant pairs nd by going through all (unique) combinations of pairs (Xr,i, Yr,i) and (Xr,j, Yr,j) with i ≠ j. A pair is concordant if
• Xr,i < Xr,j and Yr,i < Yr,j or
• Xr,i > Xr,j and Yr,i > Yr,j.
A pair is discordant if
• Xr,i < Xr,j and Yr,i > Yr,j or
• Xr,i > Xr,j and Yr,i < Yr,j.
Also compute the number of tied pairs nX with Xr,i = Xr,j and the number of tied pairs nY with Yr,i = Yr,j. A pair with both Xr,i = Xr,j and Yr,i = Yr,j counts towards neither nX nor nY.
(4) Compute the Kendall correlation coefficient as

$$\tau_B = \frac{n_c - n_d}{\sqrt{(n_c + n_d + n_X)(n_c + n_d + n_Y)}}$$

Return the Kendall correlation coefficient for the input lists.
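The pair counting can be sketched as follows. Note one deliberate shortcut: the sketch compares the raw values instead of the rankings. This classifies every pair identically, because the descending ranking preserves ties and reverses the order in both lists at once. The function name is illustrative, and the sketch assumes at least one pair is untied in each list so that the denominator is non-zero.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b from concordant, discordant and tied pair counts."""
    n_c = n_d = n_x = n_y = 0
    # combinations(..., 2) visits every unordered pair (i, j) exactly once.
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        if x_i == x_j and y_i == y_j:
            continue  # tied in both lists: counts towards neither n_x nor n_y
        if x_i == x_j:
            n_x += 1  # tied only in X
        elif y_i == y_j:
            n_y += 1  # tied only in Y
        elif (x_i < x_j) == (y_i < y_j):
            n_c += 1  # both lists order the pair the same way
        else:
            n_d += 1
    return (n_c - n_d) / sqrt((n_c + n_d + n_x) * (n_c + n_d + n_y))


# Four concordant pairs, one pair tied in X only, one pair tied in Y only.
print(kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3]))
```

For the example lists the counts are n_c = 4, n_d = 0, n_X = 1 and n_Y = 1, so the coefficient is 4 / √(5 · 5) = 0.8.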

Exercise 8.3: Gene Co-Expression Networks

Co-expression of genes is a possible indicator that those genes are part of the same process or pathway, functionally related, or regulated by the same transcriptional programs.

(a) Network construction: correlation.py contains the already implemented CorrMatrix class, and network.py contains the outline of the CorrNetwork class. In the latter, implement the following functions:

(1) __init__(correlation_matrix, threshold): Use the CorrMatrix to add undirected edges between nodes with absolute correlation ≥ threshold.
(2) to_sif(file_path): The simple interaction format (SIF) is a basic, tab-separated format without header that can be read by many network visualisation tools. Its columns are:
• Column 0: label of the source node
• Column 1: interaction type
• Columns 2+: label of target node(s)
The interaction type should be the correlation value rounded to two decimal places. The file should include interactions only once, meaning that if you already included "n1 0.75 n2", do not include "n2 0.75 n1".
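A rough sketch of the edge selection and SIF output. The interface of the provided CorrMatrix class is not shown here, so the sketch assumes a hypothetical nested dictionary corr[a][b] holding the correlation between a and b; the function name and the example values are made up.

```python
def sif_lines(corr, threshold):
    """Return one SIF line per undirected edge with |correlation| >= threshold."""
    names = sorted(corr)
    lines = []
    for index, source in enumerate(names):
        # Only look at names after the source, so each pair is listed once.
        for target in names[index + 1:]:
            value = corr[source][target]
            if abs(value) >= threshold:
                lines.append(f"{source}\t{value:.2f}\t{target}")
    return lines


# Hypothetical correlations between three genes.
corr = {
    "g1": {"g1": 1.0, "g2": 0.8, "g3": -0.9},
    "g2": {"g1": 0.8, "g2": 1.0, "g3": 0.1},
    "g3": {"g1": -0.9, "g2": 0.1, "g3": 1.0},
}
for line in sif_lines(corr, 0.75):
    print(line)
```

Iterating over sorted names and only pairing each node with later ones avoids both self-loops and the duplicate "n2 0.75 n1" entries. The format specifier .2f always prints two decimals (0.80 rather than 0.8); whether that or round(value, 2) is preferred is a judgment call.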

(b) Network visualisation: In the function exercise_3() in main.py, use your implementation to construct gene co-expression networks for the expression and methylation data tables with the Pearson, Spearman and Kendall correlation coefficients, using threshold = 0.75. This should give you a total of 6 SIF files that you should submit with the following names:

• lastname1_lastname2_expression_network_pearson.sif
• lastname1_lastname2_expression_network_spearman.sif
• lastname1_lastname2_expression_network_kendall.sif
• lastname1_lastname2_methylation_network_pearson.sif
• lastname1_lastname2_methylation_network_spearman.sif
• lastname1_lastname2_methylation_network_kendall.sif
Visualise each network file with the open-source network visualisation software Cytoscape. (You do not have to submit images of the networks, but it will help with the discussion.)

(c) Discussion: Briefly comment on the similarities and differences between the networks. Explain and discuss your results.

Exercise 8.4: Hierarchical Clustering

In the previous task you investigated the correlation between genes. In this task you explore the correlation between samples and use hierarchical clustering to examine if gene expression and DNA methylation can be used to correctly distinguish between tissue types.
Hierarchical clustering methods use a distance metric d(a, b) that measures how similar two individual observations a and b are, and a linkage criterion to determine the similarity of sets of observations and which clusters to combine next.
In this task you are going to implement a bottom-up (also called agglomerative) approach that uses the average linkage criterion and correlation as the metric:

(a) Implementation: Implement the following functions in the CorrelationClustering class in cluster.py:

(1) average_linkage(cluster_A, cluster_B): Given two clusters A and B with m and n observations, respectively, return the average linkage

$$l(A, B) = \frac{1}{m \cdot n} \sum_{i=1}^{m} \sum_{j=1}^{n} |d(A_i, B_j)|$$

(2) cluster(): The bottom-up approach starts with each observation as a single cluster and performs the following steps until there is only one cluster containing all observations:
i. Compute the linkage criterion for all pairs of clusters.
ii. Select the two clusters with the highest linkage criterion.
iii. Add the two clusters and their linkage value to the trace.
iv. Merge the two clusters.

(3) trace_to_tsv(file_path): Write the clustering trace into a tab-separated file, where each line represents a clustering step. Columns 0 and 1 should each contain the names in a cluster, separated by a comma. Column 2 should contain the linkage value.
For example (with <tab> marking a tab character): "HSC, MPP1<tab>MPP2, CLP, CMP<tab>0.93"

You can use the Cluster class in cluster.py to represent clusters.
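The agglomerative procedure described in (1) to (3) might look roughly like this. It is a simplified sketch: clusters are plain tuples of names rather than instances of the provided Cluster class, corr is a hypothetical nested dictionary of pairwise sample correlations, and the example values are invented.

```python
from itertools import combinations


def average_linkage(cluster_a, cluster_b, corr):
    """Mean absolute correlation over all cross-cluster pairs."""
    total = sum(abs(corr[a][b]) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def cluster(names, corr):
    """Merge the two most strongly linked clusters until one remains."""
    clusters = [(name,) for name in names]
    trace = []
    while len(clusters) > 1:
        # Steps i and ii: evaluate all cluster pairs and pick the highest linkage.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], corr),
        )
        a, b = clusters[i], clusters[j]
        # Step iii: record the step. Step iv: replace both clusters by their union.
        trace.append((a, b, average_linkage(a, b, corr)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [a + b]
    return trace


def trace_lines(trace):
    """Format each clustering step as a tab-separated line."""
    return [f"{', '.join(a)}\t{', '.join(b)}\t{value:.2f}" for a, b, value in trace]


# Invented correlations between two blood and two skin samples.
names = ["HSC", "MPP1", "TBSC", "ABSC"]
pairs = {("HSC", "MPP1"): 0.9, ("TBSC", "ABSC"): 0.8, ("HSC", "TBSC"): 0.1,
         ("HSC", "ABSC"): 0.2, ("MPP1", "TBSC"): 0.3, ("MPP1", "ABSC"): 0.2}
corr = {name: {} for name in names}
for (a, b), value in pairs.items():
    corr[a][b] = corr[b][a] = value

for line in trace_lines(cluster(names, corr)):
    print(line)
```

Recomputing every linkage in each round is quadratic per step, which is acceptable for 19 samples; caching linkage values would only matter for much larger inputs.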

(b) Application: In the function exercise_4() in main.py, use your implementation to hierarchically cluster the expression and methylation data tables with the Pearson, Spearman and Kendall correlation coefficients. This should give you a total of 6 TSV files that you should submit with the following names:

• lastname1_lastname2_expression_cluster_pearson.tsv
• lastname1_lastname2_expression_cluster_spearman.tsv
• lastname1_lastname2_expression_cluster_kendall.tsv
• lastname1_lastname2_methylation_cluster_pearson.tsv
• lastname1_lastname2_methylation_cluster_spearman.tsv
• lastname1_lastname2_methylation_cluster_kendall.tsv

(c) Discussion: Can hierarchical clustering be used to differentiate between blood cells and skin tissues? Are there differences between the correlation coefficients or data type? Why?
