stats homework using R
Homework- lab 5:
#The central limit theorem (CLT) states that as the sample size gets sufficiently large, the distribution of the sample means will be approximately normal, whatever the shape of the population (as long as its variance is finite).
#In addition, the CLT has been used to justify the fact that for many of our statistics we rely upon computing the mean (not median or trimmed mean) of our samples
#There are a few problems with the CLT.
#1) How large of a sample is needed?
#2) It seems that our experiments with the contaminated normal may contradict this.
#In this homework assignment you will investigate the CLT further.
#PART 1 – The Central Limit Theorem under Normality.
#1.1) Simulate a standard normal population of 1 million people called pop1
#1.2) Draw 5000 samples of size 20 and put these in sam20. Draw 5000 samples of size 50 and put these in sam50.
#1.3) Create variables called sam20means and sam50means that contain the means of the samples. Use a density plot to show the sampling distributions of sam20means and sam50means together.
#1.4) Compare the Standard Error (SE) of the sampling distributions. Which sample size creates better estimates of the population mean (i.e., has the lowest SE)?
#PART 2 – The Central Limit Theorem under Non-Normality
#2.1) Simulate a contaminated normal population of 1 million people called pop2 using cnorm(), where 30% (epsilon=0.3) of the data have an SD of 30 (k=30).
#2.2) Draw 5000 samples of size 30 and put these in sam30. Draw 5000 samples of size 100 and put these in sam100.
#2.3) Create variables called sam30means, sam30tmeans, sam100means, sam100tmeans that represent the means AND trimmed means for the samples.
#2.4) Use a density plot to show the sampling distribution of the means and trimmed means for these variables.
#2.5) Compare the Standard Error (SE) of the sampling distributions.
#2.6) Which would be better here: a larger sample size using the mean as the location estimator OR a smaller sample using the trimmed mean?
#2.7) Which location estimator performs the best, regardless of sample size?
————————————————————————————————————————————————-
Lab 5 lecture notes:
#Lab 5-Contents
# 1. Sampling Distribution of the Mean,
# Median, and Trimmed Mean under Normality
# 2. Sampling Distribution of the Mean,
# Median, and Trimmed Mean under Non-Normality
# 3. The Central Limit Theorem
# Last week we saw that when we had a Normal or Uniform population,
# that the means of random samples taken from that population
#were normally distributed.
#Today we are going to investigate the distributions of the mean,
#median, and trimmed mean from samples coming from Normal
# and non-normal populations.
#———————————————————————————
# 1. Sampling Distribution of the Mean, Median,
# and Trimmed Mean under Normality
#———————————————————————————
#Let’s start by generating a standard normal distribution (mean=0, SD=1) for 1 million subjects
pop1 = rnorm(1000000, mean=0, sd=1)
#We will use this as our population from a normal distribution
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXERCISE 1-1:
#A) Find the mean, median, trimmed mean (using tmean() ), and sd of pop1
#B) Draw a density plot of pop1
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#A)
mean(pop1); median(pop1);
tmean(pop1); sd(pop1)
#B)
plot(density(pop1))
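# tmean() is not part of base R. A minimal sketch of what a trimmed mean does,
# assuming 20% trimming (the default used by Dr. Wilcox's functions, as far as we know).
# Base R's mean(x, trim = 0.2) is used here as a stand-in, and the vector x is made up
# for illustration.

```r
set.seed(1)
x <- c(rnorm(18), 50, -40)        # 18 typical values plus two extreme outliers

# Base R: drop the lowest 20% and highest 20% of values, then average the rest
trimmed <- mean(x, trim = 0.2)

# The same thing by hand: sort, drop g values from each end, average what is left
g <- floor(0.2 * length(x))
by_hand <- mean(sort(x)[(g + 1):(length(x) - g)])

# The outliers pull the ordinary mean around; the trimmed mean ignores them
c(mean = mean(x), median = median(x), trimmed = trimmed, by_hand = by_hand)
```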
#Like we did last week, we are going to want to take random samples
# from our population and then compute a measure of central tendency
#(eg. mean, median, trimmed mean) for each sample and examine
#the distribution of this measure.
#We are going to take 5000 samples of 20 subjects
sam1 = matrix(NA, nrow=20, ncol=5000) # empty (NA) matrix: one row per subject, one column per sample
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-2: Use a loop to draw 5000 samples of size 20 from pop1
# and place the samples in sam1
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
for (ii in 1:5000) {
sam1[ , ii] = sample(pop1, 20, replace=TRUE)
}
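# The loop above can also be written with replicate(), which repeats an expression
# and binds the results into the columns of a matrix. This is only a sketch:
# pop_demo and sam_demo are made-up names, and the sizes are kept small so it runs quickly.

```r
set.seed(2)
pop_demo <- rnorm(10000)   # small stand-in for the population

# 500 samples of size 20; each sample becomes one column of the result
sam_demo <- replicate(500, sample(pop_demo, 20, replace = TRUE))

dim(sam_demo)   # 20 rows (observations) by 500 columns (samples)
```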
# Now that we have our datafile containing all 5000 samples (ie. sam1)
# we can begin to create variables for each of our location measures
#I’ll start us off with the mean
sam1means = apply(sam1, 2, mean) # number 2 = work in the columns rather than rows
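# A tiny illustration of what the second argument of apply() does, using a
# made-up 2 x 3 matrix m. For column means, base R's colMeans() gives the same answer.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column: columns are (1,2), (3,4), (5,6)

row_means <- apply(m, 1, mean)   # 1 = work across each row      -> 3 4
col_means <- apply(m, 2, mean)   # 2 = work down each column     -> 1.5 3.5 5.5

colMeans(m)                      # same values as apply(m, 2, mean)
```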
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-3: Use the apply function to generate
# the variables sam1meds (medians) and sam1tmeans (trimmed mean)
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
sam1meds = apply(sam1, 2, median)
sam1tmeans = apply(sam1, 2, tmean)
#Let’s look at the distributions of each of these location estimators
plot(density(sam1means))
lines(density(sam1meds), col="red")
lines(density(sam1tmeans), col="blue")
abline(v = mean(pop1), lty=2) #Add in a line for the pop1 mean
#??????????????????????????????????????????????????????????????#
#Thought Question 1: Which location estimator performs the best
#for data coming from a normal population? Why?
#??????????????????????????????????????????????????????????????#
# One of the ways we can determine which location estimator
# performs the best is by looking at the standard deviation
# of the estimator across all the samples.
# The estimator with the lowest SD will have the least amount
# of variability across the samples.
# The standard deviation of a location estimator across samples
# is more commonly called the Standard Error, or SE.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
# EXERCISE 1-4: Find the Standard Error of the sample means,
# medians, and trimmed means. Based upon the SE, which
# location estimator is the best for samples coming from
# a normal population?
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
sd(sam1means); sd(sam1meds); sd(sam1tmeans)
#The mean performs the best.
# In real life, we generally cannot go out and collect multiple samples
# from a population, so we estimate the Standard Error of the mean
# from a single sample using a formula:
# SE = sd(sample) / sqrt(sample N)
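# A minimal sketch comparing the formula with the simulation approach, under the
# assumption of a standard normal population. The names pop_demo and sam_demo and
# the reduced sizes are made up for illustration.

```r
set.seed(3)
pop_demo <- rnorm(100000)      # stand-in normal population (mean 0, sd 1)
n <- 20
sam_demo <- replicate(2000, sample(pop_demo, n, replace = TRUE))

empirical_se   <- sd(colMeans(sam_demo))     # SD of the 2000 sample means
formula_se     <- sd(sam_demo[, 1]) / sqrt(n)  # formula applied to ONE sample
theoretical_se <- 1 / sqrt(n)                # population sd / sqrt(n), with sd = 1

c(empirical = empirical_se, formula = formula_se, theoretical = theoretical_se)
```

# The formula value is only an estimate, so it will move around from one sample
# to the next; the empirical value should sit close to 1/sqrt(20).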
#———————————————————————————
# 2. Sampling Distribution of the Mean, Median,
# and Trimmed Mean under Non-Normality
#———————————————————————————
# Normal distributions generally have very few outliers.
# However, when outliers begin to occur more frequently, some of the
# basic assumptions about normal distributions are no longer true
# (as we are about to see).
# One distribution that is like a normal distribution,
# but with more outliers is called a mixed or contaminated
# normal distribution and it is a result of two populations mixing together.
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXAMPLE 1: “a” will be a mix of TWO populations: one with SD=1 and one with SD=2
a=c(rnorm(5000, 0, 1), rnorm(5000, 0, 2))
#Let’s compare this to b, which is from ONE normal population with the same mean and SD as a
b=rnorm(10000, mean(a), sd(a))
plot(density(a))
lines(density(b), col="red")
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#??????????????????????????????????????????????????????????????#
#Thought Question 2: How are a and b from Example 1 different?
#??????????????????????????????????????????????????????????????#
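# One way to explore this question with numbers rather than by eye: count how often
# each variable lands far from its center. This is only a sketch; tail_share() is a
# made-up helper and the 3-SD cutoff is an arbitrary choice.

```r
set.seed(4)
a <- c(rnorm(5000, 0, 1), rnorm(5000, 0, 2))   # mix of two populations
b <- rnorm(10000, mean(a), sd(a))              # one population, same mean and SD

# Share of observations more than 3 standard deviations from the mean
tail_share <- function(x) mean(abs(x - mean(x)) > 3 * sd(x))

c(a = tail_share(a), b = tail_share(b))
```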
#Thankfully, rather than having to create contaminated normal distributions the hard way, we can just use
#a function provided to us by Dr. Wilcox called cnorm()
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Contaminated/Mix Normal Distribution: cnorm(n, epsilon=0.1, k=10)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^#
#Let’s look at the options for the contaminated normal distribution:
#cnorm() combines two normal distributions:
#1) A standard normal (mean=0, sd=1) for a proportion 1-epsilon of the data
#2) A normal with mean=0 and sd=k for a proportion epsilon of the data
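# To make the description above concrete, here is a minimal sketch of a function that
# behaves this way. cnorm_sketch() is a hypothetical stand-in written for illustration;
# it is NOT Dr. Wilcox's source code, and the real cnorm() may differ in its details.

```r
# Hypothetical stand-in for cnorm(); an assumption, not the original implementation
cnorm_sketch <- function(n, epsilon = 0.1, k = 10) {
  stopifnot(epsilon >= 0, epsilon <= 1, k > 0)
  x <- rnorm(n)                            # start with standard normal draws
  contaminated <- runif(n) < epsilon       # flag roughly a proportion epsilon of them
  x[contaminated] <- k * x[contaminated]   # flagged draws now have sd = k
  x
}

set.seed(5)
z_demo <- cnorm_sketch(100000, epsilon = 0.1, k = 10)

# Variance of the mixture is (1 - epsilon) * 1 + epsilon * k^2
c(sd = sd(z_demo), theoretical_sd = sqrt(0.9 * 1 + 0.1 * 10^2))
```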
#If we were trying to re-create the variable a we made in example 1 we would have to do:
z=cnorm(10000, epsilon=0.5, k=2)
plot(density(a))
lines(density(z), col="blue")
#Which looks very very similar to a!
#Let’s create a second population called pop2 from a contaminated normal distribution
pop2 = cnorm(1000000, epsilon=0.1, k=10)
#The mean, sd, and plot of which are:
mean(pop2); sd(pop2); plot(density(pop2))
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#EXERCISE 2:
#A) Create an empty matrix called sam2 to contain 5000 samples
#of 20 observations each
#B) Populate sam2 with 5000 random samples of size 20 from pop2
#C) Compute the mean (sam2means), median (sam2meds),
#and trimmed mean (sam2tmeans) for each sample
#D) Create an overlaid density plot of the three location estimators WITH the pop2
#mean as a vertical line
#E) Find the SE of each location estimator
#F) Based upon the SE, which location estimator is the best
# for samples coming from a contaminated normal distribution
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#
#A)
#B)
#C)
#D)
#E)
#F)
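# A sketch of how parts A-C and E might look, using only base R so that it runs on its own.
# Assumptions: cnorm_sketch() is a hypothetical stand-in for cnorm(), mean(x, trim = 0.2)
# stands in for tmean(), and the population and number of samples are reduced for speed.
# The plot for part D follows the same pattern as the density plot in Section 1.

```r
set.seed(6)
# Hypothetical stand-in for cnorm(); an assumption, not Dr. Wilcox's code
cnorm_sketch <- function(n, epsilon = 0.1, k = 10) {
  x <- rnorm(n)
  flag <- runif(n) < epsilon
  x[flag] <- k * x[flag]
  x
}
pop_demo <- cnorm_sketch(100000, epsilon = 0.1, k = 10)

# A) empty matrix: 20 observations (rows) for each of 2000 samples (columns)
sam_demo <- matrix(NA_real_, nrow = 20, ncol = 2000)

# B) fill each column with one random sample
for (ii in 1:2000) {
  sam_demo[, ii] <- sample(pop_demo, 20, replace = TRUE)
}

# C) one value of each location estimator per sample
demo_means  <- apply(sam_demo, 2, mean)
demo_meds   <- apply(sam_demo, 2, median)
demo_tmeans <- apply(sam_demo, 2, mean, trim = 0.2)

# E) Standard Error = SD of each estimator across the samples
se <- c(mean = sd(demo_means), median = sd(demo_meds), tmean = sd(demo_tmeans))
se
```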
#———————————————————————————
# 3. The Central Limit Theorem
#———————————————————————————
#We’ve discovered a few things today:
#1) When a population comes from a normal distribution,
# the mean will be the best location estimator of the samples
#2) When a population comes from a mixed/contaminated normal distribution,
# the trimmed mean is the best location estimator
# These observations are related to the Central Limit Theorem (CLT)
# that is discussed in Section 5.3 of the book (page 85)
# The CLT states that as the sample size gets sufficiently large,
# the distribution of the sample means will be approximately normal,
# whatever the shape of the population (as long as its variance is finite).
# We saw a demonstration of this last week when we looked at the means
# from the uniform distribution.
# The CLT has been used to justify the fact that for many of our statistics
# we rely upon computing the mean (not median or trimmed mean) of our samples
#There are a few problems with the CLT.
#1) How large of a sample do we need?
#2) It seems that our experiments with the contaminated normal may contradict this.
#In the homework you will investigate this further
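# One possible way to put a number on "how large of a sample": track how far the
# distribution of the sample means is from normal as n grows. The sketch below uses
# excess kurtosis (about 0 for a normal distribution, positive for heavy tails).
# Assumptions: cnorm_sketch() is a hypothetical stand-in for cnorm(), samples are drawn
# straight from the distribution rather than from a stored population, and the sample
# sizes are arbitrary choices.

```r
set.seed(7)
# Hypothetical stand-in for cnorm(); an assumption, not Dr. Wilcox's code
cnorm_sketch <- function(n, epsilon = 0.1, k = 10) {
  x <- rnorm(n)
  flag <- runif(n) < epsilon
  x[flag] <- k * x[flag]
  x
}

# Excess kurtosis of a vector of sample means
excess_kurtosis <- function(m) {
  d <- m - mean(m)
  mean(d^4) / mean(d^2)^2 - 3
}

sizes <- c(10, 50, 200)
reps <- 5000
kurt <- sapply(sizes, function(n) {
  draws <- matrix(cnorm_sketch(n * reps), nrow = n)   # each column is one sample
  excess_kurtosis(colMeans(draws))                    # kurtosis of the 5000 means
})
names(kurt) <- paste0("n=", sizes)
kurt
```

# If the CLT is doing its job, the values should shrink toward 0 as n increases,
# but for a contaminated normal they can stay noticeably above 0 at small n.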