Define binary logistic regression with a lasso penalty

Reference no: EM133561601

Statistical Learning

Assignment - High-Dimensional Analysis

A person can be in the process of developing breast cancer, but not show clear signs of this, even after mammography (production of x-ray images of the breast). Sharma et al. (2005) and Aaroe et al. (2010) wished to determine whether gene expression profiles from peripheral blood cells (a blood sample) could be used to predict whether or not a person has breast cancer. They and other researchers were also interested in the types of changes in gene expression that occur during the development of breast cancer.

In both studies, blood was drawn from a set of women who had a suspect initial mammogram but had not yet received a diagnosis of whether the abnormality observed was benign (currently harmless) or malignant (cancerous). Aaroe et al. followed on from the work by Sharma et al. with a larger number of patients and a much larger set of genes. Both datasets were made public, and the Aaroe dataset is the one provided here.

Each patient's mammography results were assessed further by clinicians and a diagnosis made. The patient condition labels (normal or cancer) are stored in Aaroelabels.csv. The batch-normalised, logged and otherwise processed gene expression data are stored in Aaroe.csv. This processed dataset contains gene expression data derived from blood samples from 121 women, processed with microarrays to record values for 11217 probes, most of which represent individual genes.

Your tasks with the dataset are focused on classification of a sample as coming from a patient with breast cancer or without, and identification of genes of potential interest.

You should select one classifier for the task of classification, which you have not used in previous assignments. Probability-based classifiers discussed in this course include linear, quadratic, mixture and kernel density discriminant analysis. Non-probability-based classifiers discussed include k nearest neighbours, neural networks, support vector machines and classification trees. All of these are implemented via various packages available in R. If you wish to use a different method, please check with the lecturer. In addition, you will make use of lasso-penalised logistic regression. Note that you cannot choose another form of logistic regression as your other classifier.

The number of observations is less than the number of variables, and so some form of dimensionality reduction is needed for most forms of probability-based classifier and can be used if desired with the non-probability-based classifiers.

Here we consider analysis of this data to

(i) develop a model which is capable of accurately predicting the class (cancer or normal) of new observations based on a blood sample, without the need for a mammogram or its examination by clinicians

(ii) determine which genes are expressed differently between the two groups, individually, or as part of a combination.

Discriminant analysis/supervised classification can be applied to solve (i), and in combination with feature (predictor) selection, can be used to provide a limited solution to (ii) also. Other methods such as single-variable analysis can also be applied to attempt to answer (ii). You should use R (recommended) or Python for the assignment.

Task:

Question 1. Following this, perform principal component analysis of the gene expression dataset, then report and comment on the results. Detailed results should be submitted via a separate file, including what each principal component direction is composed of in terms of the (transformed) original explanatory variables, with some explanation in the main report about what is in the file. Give a plot or plots showing the individual and cumulative proportions of variance explained by each component. Also produce and include another plot about the principal components which you think would be of interest to clinicians and scientists such as Aaroe et al., along with some explanation and discussion. The R package FactoMineR is a good option for PCA.
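The individual and cumulative proportions of variance follow directly from the singular values s_k of the centred data matrix: component k explains s_k^2 / (s_1^2 + ... + s_r^2) of the total variance. A minimal Python sketch, assuming the singular values have already been obtained from whichever PCA routine is used (the function name and the example values are illustrative and not part of the assignment):

```python
def variance_explained(singular_values):
    """Individual and cumulative proportions of variance per component."""
    variances = [s * s for s in singular_values]  # proportional to the PC variances
    total = sum(variances)
    individual = [v / total for v in variances]
    cumulative = []
    running = 0.0
    for prop in individual:
        running += prop
        cumulative.append(running)
    return individual, cumulative


# Made-up singular values, largest first
individual, cumulative = variance_explained([2.0, 1.0, 1.0])
```

The two returned lists are what a scree plot and a cumulative-variance plot would display against the component index.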

Question 2. Perform single-variable analysis of the dataset, looking for a relationship with the response variable (the class). Use the Benjamini-Hochberg (1995) or Benjamini-Yekutieli (2001) approach to control the false discovery rate to be at most 0.1. Explain the assumptions of this approach and whether or not these are likely to be met by this dataset, along with possible consequences of any violations. Also explain how the method works mathematically, but leave out why (i.e. give something equivalent to pseudocode). Report which genes are then declared significant, along with the resulting threshold in the original p-values. Also give a plot of gene order by p-value versus unadjusted p-value (or the log of these), along with a line indicating the FDR control.

Within the stats package is the function p.adjust, which offers this method. More advanced implementations include the fdrame package in Bioconductor.
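As a rough illustration of the step-up rule behind the Benjamini-Hochberg option: sort the m p-values, find the largest rank k with p_(k) <= k q / m, and declare the k smallest p-values significant. The sketch below is plain Python with made-up p-values; the function name is hypothetical, and it is meant to show the mechanics rather than replace p.adjust.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return (indices declared significant, p-value threshold) at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the LARGEST rank whose p-value is under the line
        if p_values[idx] <= rank * q / m:
            k = rank
    if k == 0:
        return [], None  # nothing is declared significant
    threshold = p_values[order[k - 1]]
    return sorted(order[:k]), threshold


# Made-up p-values for ten hypothetical genes
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
significant, threshold = benjamini_hochberg(p, q=0.1)
```

In this toy example the third and fourth smallest p-values sit above the line k q / m, yet both are still declared significant because the fifth falls below it. The line k q / m is also the one to draw on the ordered p-value plot.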

Question 3. Define binary logistic regression with a lasso penalty mathematically, including the function to be optimised, and briefly introduce a method that can be used to optimise it. Note that this might require a little research.
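One common way to write the problem, with labels y_i in {0, 1} and linear predictor eta_i = b0 + x_i . b, is to minimise -(1/n) sum_i [y_i eta_i - log(1 + exp(eta_i))] + lambda sum_j |b_j|, leaving the intercept b0 unpenalised. glmnet solves this by coordinate descent; the sketch below instead uses proximal gradient descent (gradient step followed by soft-thresholding), chosen only because it is short. All function names and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def linear_predictor(x, b0, b):
    return b0 + sum(xj * bj for xj, bj in zip(x, b))


def objective(X, y, b0, b, lam):
    """Mean negative log-likelihood plus lam * L1 norm of b."""
    nll = 0.0
    for x, yi in zip(X, y):
        eta = linear_predictor(x, b0, b)
        # log(1 + exp(eta)) computed without overflow
        nll += max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - yi * eta
    return nll / len(y) + lam * sum(abs(bj) for bj in b)


def fit_lasso_logistic(X, y, lam, n_iter=2000):
    n, p = len(X), len(X[0])
    # Step size 1/L, using the bound L <= sum_i (1 + ||x_i||^2) / (4 n)
    step = 4.0 * n / sum(1.0 + sum(v * v for v in x) for x in X)
    b0, b = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(linear_predictor(x, b0, b)) - yi for x, yi in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * x[j] for r, x in zip(resid, X)) / n for j in range(p)]
        b0 -= step * grad0  # intercept: plain gradient step, no penalty
        b = [soft_threshold(bj - step * gj, step * lam) for bj, gj in zip(b, grad)]
    return b0, b


# Toy data: feature 0 is informative, feature 1 is close to noise
X = [[-2.0, 0.3], [-1.5, -0.4], [-1.0, 0.5], [-0.5, -0.2],
     [0.5, 0.1], [1.0, -0.3], [1.5, 0.4], [2.0, -0.1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0_hat, b_hat = fit_lasso_logistic(X, y, lam=0.05)
```

The soft-thresholding step is what sets coefficients exactly to zero, which is why the lasso doubles as a gene-selection method. With a large enough lambda every coefficient stays at zero.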

Question 4. Explain the potential benefits and drawbacks of using PCA to reduce the dimensionality of the data before attempting to fit a classifier. Explain why you have chosen to reduce the dimensionality or not to do so for this purpose.

Question 5. Apply each classification method (your choice and lasso logistic regression) to the dataset using R or Python, report the results and interpret them.

For lasso logistic regression in R, I suggest you use the glmnet package, available on CRAN, together with the function cv.glmnet and the family="binomial" option. If you are interested, there is a recording of Trevor Hastie giving a tutorial on the lasso and glmnet. There are other options in Python, including scikit-learn.

Results should include the following:

a) characterisation of each class: parameter estimates or a reasonable alternative.

b) cross-validation (CV)-based estimates of the overall and class-specific error rates: obtained by training the classifier on a large fraction of the whole dataset and then applying it to the remaining data and checking error rates. You may use K-fold CV with K ≥ 5 or leave-one-out cross-validation to estimate performance. Additionally, report the overall apparent error rates (when trained on all the data and applied back to it).

c) For lasso logistic regression, you will need to use cross-validation to estimate the optimal value of λ. Explain how you plan to search over possible values. Then produce and explain a graph of your cost function versus λ. You should also produce a list ordered by importance of the genes included as predictor variables in the optimal classifier, along with their estimated coefficients.
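One possible search strategy for λ is a log-spaced grid that starts at the smallest value for which every coefficient is zero and decreases from there. For the objective with mean negative log-likelihood and an unpenalised intercept, that starting value is max_j |sum_i x_ij (y_i - ybar)| / n. The sketch below is an assumption-laden illustration: the function names, the 100 grid points and the ratio of 0.01 are choices made here, not requirements of the assignment.

```python
import math


def lambda_max(X, y):
    """Smallest lambda at which all lasso-logistic coefficients are zero."""
    n, p = len(y), len(X[0])
    ybar = sum(y) / n
    return max(abs(sum(X[i][j] * (y[i] - ybar) for i in range(n)))
               for j in range(p)) / n


def lambda_grid(lam_max, n_lambda=100, min_ratio=0.01):
    """Log-spaced grid running from lam_max down to min_ratio * lam_max."""
    log_hi = math.log(lam_max)
    log_lo = math.log(lam_max * min_ratio)
    return [math.exp(log_hi + (log_lo - log_hi) * k / (n_lambda - 1))
            for k in range(n_lambda)]


# Toy data with labels coded 0/1
X = [[-2.0, 0.3], [-1.5, -0.4], [-1.0, 0.5], [-0.5, -0.2],
     [0.5, 0.1], [1.0, -0.3], [1.5, 0.4], [2.0, -0.1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]
grid = lambda_grid(lambda_max(X, y))
```

Each grid value would then be scored by cross-validation (for example by binomial deviance or misclassification rate), and the score plotted against log λ.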

For your other classifier, also determine an ordered list of the most important genes, stopping at 50, or earlier if justified. For each classifier, comment on any differences between the apparent and CV-derived overall error rates.
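The difference between apparent and CV-derived error rates can be made concrete with a small K-fold sketch. It is written in plain Python and uses a nearest-centroid rule purely as a stand-in classifier; the helper names, the toy data and the choice of classifier are all assumptions for illustration.

```python
import random


def error_rates(y_true, y_pred):
    """Overall error rate and a dict of class-specific error rates."""
    overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    by_class = {}
    for c in sorted(set(y_true)):
        preds = [p for t, p in zip(y_true, y_pred) if t == c]
        by_class[c] = sum(p != c for p in preds) / len(preds)
    return overall, by_class


def cross_validate(X, y, fit, predict, k=5, seed=0):
    """K-fold CV: every observation is predicted by a model that never saw it."""
    n = len(y)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    y_pred = [None] * n
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            y_pred[i] = predict(model, X[i])
    return error_rates(y, y_pred)


def fit_centroids(X, y):
    """Stand-in classifier: store the mean vector of each class."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        model[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict_centroid(model, x):
    """Assign x to the class with the nearest centroid (squared distance)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))


# Toy, well-separated one-dimensional data
X = [[0.1 * i] for i in range(10)] + [[10.0 + 0.1 * i] for i in range(10)]
y = ["normal"] * 10 + ["cancer"] * 10

cv_overall, cv_by_class = cross_validate(X, y, fit_centroids, predict_centroid, k=5)

# Apparent error: train on all the data and predict the same data
full_model = fit_centroids(X, y)
apparent_overall, _ = error_rates(y, [predict_centroid(full_model, x) for x in X])
```

Any gene filtering or PCA step belongs inside fit, so that it is recomputed on each training fold; otherwise the held-out observations leak into the model and the CV error rate becomes optimistic.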

Question 6. Compare the results from all approaches to analysis of the dataset (PCA, single-variable analysis and the two classifiers). Explain what each approach seems to offer, including consideration of these results as an example. In particular, if you had to suggest 10 genes for the biologists to study further for possible links to this form of cancer, which ones would you prioritise, and what makes you think they are worth studying further?

Notes:

(i) R commands you might find useful:

objects() - gives the current list of objects in memory.
attributes(x) - gives the set of attributes of an object x.

(ii) Please put all your code in a separate text file or files and submit these separately via a single text file or a zip file. You should not give any code in your main report and should not include any raw output - i.e. just include figures (each with a title, axis labels and caption below) and put any relevant numerical output in a table or within the text.

Attachment:- High-Dimensional Analysis.rar

Course: STAT3006 Statistical Learning, University of Queensland