Outliers - reasons for screening data, Advanced Statistics

Outliers - Reasons for Screening Data

Outliers arise for three main reasons: a data entry error, a subject who is not a member of the population that the sample is meant to represent, or a subject who genuinely differs from the rest of the sample. Statistical tests are quite sensitive to outliers, so the problem should be addressed before the analysis is run.

Univariate outliers are easy to detect with z-scores, box plots, histograms, etc. Cases with standard scores beyond ±3 are treated as outliers (consider a cutoff of 4 if n > 100, or 2.5 if n < 10).
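The z-score rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `zscore_outliers`, the default cutoff of 3, and the sample scores are all assumptions made for the example.

```python
import numpy as np

def zscore_outliers(values, cutoff=3.0):
    """Return a boolean mask marking values whose |z-score| exceeds cutoff."""
    x = np.asarray(values, dtype=float)
    # Standardize with the sample standard deviation (ddof=1).
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > cutoff

# Hypothetical scores: twenty ordinary values and one extreme value.
scores = np.array([48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
                   50, 49, 51, 52, 48, 50, 47, 53, 49, 51, 120], dtype=float)

mask = zscore_outliers(scores)      # cutoff of 3, as in the text
print(scores[mask])                 # the cases flagged as univariate outliers
```

The cutoff is a parameter so that it can be changed to 4 or 2.5 according to the sample-size guidance above.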

Multivariate outliers are difficult to detect. Mahalanobis distance is one powerful technique to use in this case (discussed later). The squared distance is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. A case whose chi-square value is significant beyond the p < 0.001 level is treated as a multivariate outlier.
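A rough sketch of the Mahalanobis screening step follows. It assumes NumPy and SciPy are available, and the function name `mahalanobis_outliers`, the simulated data, and the injected case are hypothetical. The squared distance of each case from the sample centroid is compared against the chi-square critical value at p = 0.001, with degrees of freedom equal to the number of variables.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(data, alpha=0.001):
    """Return squared Mahalanobis distances, the chi-square cutoff, and a flag per case."""
    X = np.asarray(data, dtype=float)
    diff = X - X.mean(axis=0)                  # deviations from the centroid
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Squared distance for each row: diff_i' * inv_cov * diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return d2, cutoff, d2 > cutoff

# Hypothetical data: two strongly correlated variables (assumed r = 0.9).
rng = np.random.default_rng(0)
cases = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=200)
# Injected case: within +/-3 on each variable alone, but it contradicts the correlation.
cases = np.vstack([cases, [2.5, -2.5]])

d2, cutoff, flagged = mahalanobis_outliers(cases)
print(round(cutoff, 2), flagged[-1])
```

The injected case illustrates why multivariate outliers are hard to spot: neither of its scores is extreme by itself, yet the combination is very unusual given the correlation between the variables.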

In most cases it is acceptable to drop the outlying case from the sample. If the researcher decides to keep the values in the analysis, steps can instead be taken to reduce their relative influence.
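The text does not say which steps to take. One common option, shown here purely as an assumption, is winsorizing: extreme scores are pulled in to chosen percentile limits instead of being deleted. The function name `winsorize`, the 5th and 95th percentile limits, and the scores are all hypothetical.

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles of the sample."""
    x = np.asarray(values, dtype=float)
    low, high = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, low, high)

# Hypothetical scores with one extreme value.
scores = np.array([48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
                   50, 49, 51, 52, 48, 50, 47, 53, 49, 51, 120], dtype=float)

adjusted = winsorize(scores)
print(scores.mean(), adjusted.mean())   # the mean moves back toward the bulk of the data
```

Every case stays in the sample, so n is unchanged, but the extreme score no longer dominates the mean and variance.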


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd