Project 6 - Statistics and Bootstrapping

NAME : RUCHIN PATEL

USC ID : 9520665364

EMAIL : ruchinpa@usc.edu

STATISTICS AND BOOTSTRAPPING

PROBLEM STATEMENT

  • Compute the sample mean, m, and the sample variance, $s^2$, of the empirical sample distribution.
  • Use the data to generate a discrete approximation to the Cumulative Distribution Function – the empirical distribution, $F_{X}^{*}(x)$. Plot this distribution.
  • By splitting the data into equal-size intervals (0-5, 6-10, etc.), generate a discrete approximation to the distribution and determine the values of the Probability Mass Function for this discrete approximation.
  • Use the bootstrapping technique to generate M bootstrap sets of samples based on the empirical distribution found in part b), with each set containing n independent samples from the empirical distribution (repetition allowed). Compute the sample mean and sample variance for each of the bootstrap sample sets; call these $m_{i}^*$ and $s_{i}^{*2}$ for i = 1,...,M respectively. Use M = 50 and M = 100.
  • Estimate the MSE by: MSE($m^*$) = $\frac{1}{M} \sum_{i=1}^{M} (m_{i}^* - m)^2$. We take this value to be an estimate of the MSE of the sample mean for the overall population distribution.
  • Do a similar evaluation of the MSE of the bootstrap sample variance $s^{*2}$ as follows: MSE($s^{*2}$) = $\frac{1}{M} \sum_{i=1}^{M} (s_{i}^{*2} - s^2)^2$. (A compact sketch of this whole computation follows below.)
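
The whole bootstrap-MSE procedure above can be sketched compactly in NumPy. This is a minimal sketch, not the assignment code; the helper name bootstrap_mse and the rng argument are illustrative:

import numpy as np

def bootstrap_mse(data, M, rng=None):
    # Draw M resamples of size n = len(data) with replacement, compute each
    # resample's mean m_i* and variance s_i*^2, and average the squared
    # deviations from the full-sample statistics m and s^2.
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    m, s2 = data.mean(), data.var()                  # full-sample statistics
    resamples = rng.choice(data, size=(M, n), replace=True)
    m_star = resamples.mean(axis=1)                  # m_i*, i = 1..M
    s2_star = resamples.var(axis=1)                  # s_i*^2
    mse_mean = np.mean((m_star - m) ** 2)
    mse_var = np.mean((s2_star - s2) ** 2)
    return mse_mean, mse_var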

THEORETICAL EXPLORATION

  • $\textbf{CDF}$: The CDF of a random variable is denoted by $F_{X}(x)$ and is known as the cumulative distribution function.

    • $F_{X}(x)$ = P(X <= x) where X is a random variable and the CDF is the probability of the random variable X being less than or equal to a particular value x.
  • $\textbf{PMF}$: The pmf of a random variable X is denoted by P(X = x) where X is the random variable and PMF gives us the probability of a random variable X taking the value x.

  • $\textbf{Sample Mean and Variance}$: The output from a simulation run is some observation of measured data X, which is itself a random variable. We use this measured data to estimate some quantity of interest being studied, $\theta$; usually E[X] = $\theta$. There is an underlying distribution for X which may or may not be known. We call this underlying distribution the population distribution, and denote the population mean by $\mu$ = E[X] and the population variance by $\sigma^2$ = E[$(X - \mu)^2$]. The population distribution could be found by running an infinite number of simulations or by having some analytical understanding of the physical system being simulated. We typically collect independent samples or runs, which result in a set of observations {$X_i : i = 1,...,n$}, all of which are independent, identically distributed random variables with mean $\theta$. We then find the average $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and use it as an estimator of $\theta$.

    • $\textbf{Sample Mean}$: The sample mean is defined as follows: $\bar{X} = \frac{1}{n} * \sum_{i=1}^{n} (X_i)$

      • The mean squared error – the expected value of the squared difference between $\bar{X}$ and $\mu$ – is just the variance of $\bar{X}$, i.e.

        • MSE($\bar{X}$) = E[$(\bar{X} - \mu)^2$] = $\frac{\sigma^2}{n}$
    • $\textbf{Sample Variance}$: The sample variance is defined as follows: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

      • Note that the term (n-1) is in the denominator to make the sample variance an unbiased estimator of $\sigma^2$.
  • $\textbf{Interval Estimates}$: If the observed value of the sample mean is $\bar{X}$ = m and the observed sample variance is $S^2$ = $s^2$, then we can use $(m - \frac{s}{\sqrt{n}}z_{\frac{\alpha}{2}},\; m + \frac{s}{\sqrt{n}}z_{\frac{\alpha}{2}})$ as the 100(1 - $\alpha$)% confidence interval for $\mu$ (see the sketch after this list).

    • $z_{\frac{\alpha}{2}}$ is the value of the RV Z, distributed standard normal N(0,1), chosen in such a way that Pr{ $-z_{\frac{\alpha}{2}} < Z < z_{\frac{\alpha}{2}}$ } = 1 - $\alpha$.
  • $\textbf{Empirical Distributions}$: Given a set of samples {$X_i : i = 1,...,n$} drawn from the distribution $F_X$ ($X_i$ is the RV for the $i^{th}$ sample and $x_i$ is the actual value measured), we are interested in finding an estimate of some statistic $\theta$.

    • Almost always, however, we do not know the distribution $F_X$, or even $\mu$ or $\sigma^2$. In that case we can find the empirical CDF as follows (a sketch of this computation also follows this list):

      • $F_{n}^{*}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(X_i \le x)$, where $\delta$ is an indicator function.
      • $\lim_{n\to\infty} F_{n}^{*} = F_{X}$. This means that as the number of samples in the empirical distribution increases, the empirical CDF converges to the true CDF.
  • $\textbf{Bootstrapping}$: The bootstrapping technique uses the empirical distribution to define the discrete sample space of the distribution of each $X_i$. We generate a set of M bootstrap samples from the empirical distribution with replacement, calculate the sample mean and sample variance of each bootstrap sample, and then use them to estimate the population mean and variance.
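
A minimal sketch of the empirical CDF and the z-based interval estimate described above. The helper names ecdf and z_interval are illustrative, not part of the assignment code:

import numpy as np
from scipy.stats import norm

def ecdf(samples, x):
    # F_n*(x) = (1/n) * sum of indicators delta(X_i <= x)
    return np.mean(np.asarray(samples) <= x)

def z_interval(samples, alpha=0.05):
    # 100(1 - alpha)% confidence interval for the population mean
    samples = np.asarray(samples)
    n = len(samples)
    m = samples.mean()
    s = samples.std(ddof=1)          # unbiased sample standard deviation
    z = norm.ppf(1 - alpha / 2)      # z_{alpha/2} for N(0, 1)
    half = z * s / np.sqrt(n)
    return (m - half, m + half)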

In [335]:
samples = [37.12,2.78,1.33,33.55,45.39,9.25,28.32,35.62,28.00,4.56, \
           5.27,12.83,7.90,18.18,23.26,31.54,30.99,13.40,0.60,34.10, \
           8.45,3.98,33.25,31.10,28.67,12.55,30.92,27.88,31.44,2.28, \
           18.52,15.43,26.48,4.48,3.35,26.16,6.93,4.57,39.74,29.95, \
           28.96,32.79,19.91,1.86,7.12,27.49,32.62,20.71,33.32,11.33, \
           7.63,8.75,6.81,30.33,0.17,22.79,13.27,34.10,1.11,1.94, \
           0.27,0.14,30.43,30.57,35.38,33.72,24.10,36.62,5.01,0.24, \
           31.03,4.65,32.20,1.68,8.90,6.89,10.08,0.76,2.40,0.16, \
           36.22,24.87,25.84,5.34,1.92,2.3,33.56,24.03,1.3,8.53, \
           4.06,5.21,25.69,28.44,13.29,27.92,28.95,36.40,1.05,1.43 
          ]
samples = np.array(samples)
samples = np.sort(samples)

a.)

In [336]:
m = np.mean(samples)
s_2 = np.var(samples)    # note: np.var defaults to ddof=0 (biased); pass ddof=1 for the (n-1) sample variance
st_dev = np.std(samples)
print('Sample mean m of the sample is: ',m)
print('Sample variance s_2 of the sample is: ',s_2)
print('Standard deviation of the sample is: ',st_dev)
Sample mean m of the sample is:  17.647100000000002
Sample variance s_2 of the sample is:  175.45997059
Standard deviation of the sample is:  13.246130400611342
Sample Mean (m) | Sample Variance ($s^2$)
----------------|-------------------------
17.647          | 175.46
  • As we can see from the above values, the sample variance, and hence the standard deviation, is large relative to the mean, so the individual samples are spread widely. With a larger sample, the sample mean would estimate the population mean more accurately, since the standard error of the mean, $s/\sqrt{n}$, shrinks as $n$ grows (the short sketch below illustrates this).
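
A small illustration of the shrinking standard error; a minimal sketch assuming a hypothetical stand-in population (the uniform distribution here is illustrative only):

import numpy as np

rng = np.random.default_rng(42)
population = rng.uniform(0, 40, size=100_000)   # hypothetical stand-in population

for n in (100, 1_000, 10_000):
    draws = rng.choice(population, size=n, replace=False)
    se = draws.std(ddof=1) / np.sqrt(n)         # standard error of the sample mean
    print(f'n={n:6d}  mean={draws.mean():7.3f}  std err={se:.3f}')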

b.)

In [337]:
values = generate_CDF(samples)
  • The above plot shows the distribution of the empirical sample we have, along with the corresponding CDF.

  • The CDF in the above figure is the true CDF of the empirical distribution for the following reasons:

    • It satisfies all the properties that a true CDF must satisfy:

      • 0 <= $F_{X}(x)$ <= 1. As we can see, the maximum value of our CDF is 1 and the minimum is 0, so this constraint is satisfied.

      • $\lim_{x\to\infty} F_{X}(x) = 1$. This is in fact true because, as per the graph, the CDF corresponding to the maximum sample value is 1.

      • $\lim_{x\to-\infty} F_{X}(x) = 0$, and for a non-negative-valued RV, $F_X(x) = 0$ for all $x < 0$. This property is also fulfilled by the CDF calculated above.

      • It is also evident that this CDF is non-decreasing.

  • Thus we can say that the CDF plotted above is the true CDF of our empirical sample. These properties can also be verified programmatically, as the sketch below shows.
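
A minimal verification sketch, assuming `values` still holds the (cdf, bins, patches) tuple returned by generate_CDF above:

import numpy as np

cdf = np.asarray(values[0])                       # cumulative histogram from generate_CDF
assert np.all((cdf >= 0) & (cdf <= 1)), 'CDF values must lie in [0, 1]'
assert np.all(np.diff(cdf) >= -1e-12), 'CDF must be non-decreasing'
assert np.isclose(cdf[-1], 1.0), 'CDF must reach 1 at the largest sample'
print('All CDF properties hold.')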

c.)

In [338]:
orig_cdf = values[0]
r = np.array(list(range(1,21)))
interval = r*5               # positions 5, 10, ..., 100
interval = interval - 1      # 0-based indices 4, 9, ..., 99: the last point of each group of 5
In [339]:
discrete_cdf = orig_cdf[interval]            # CDF value at the right edge of each interval
discrete_intervals = values[1][interval]     # corresponding bin-edge values

r2 = np.array(list(range(0,20)))
interval2 = r2*5                             # 0-based indices 0, 5, ..., 95: left edges
discrete_intervals2 = values[1][interval2]
In [340]:
generate_discrete_CDF(samples,discrete_intervals2,discrete_cdf)
In [341]:
disc_pmf = calc_disc_pmf(discrete_cdf)
plt.figure(figsize=(9,4), dpi=300)
plt.bar(discrete_intervals,disc_pmf,color='red',edgecolor='blue')
plt.title('The PMF of the discrete empirical sample with intervals of 5 samples')
plt.show()
  • $\textbf{Methodology}$: The way in which I created this distribution is as follows:

    • Step 1: Sorted the entire empirical sample so that it starts at the lowest value and ends at the highest value.
    • Step 2: Divided the hundred sorted samples into groups of 5, taking the value at the end of each group from the sorted array.

      • Positions considered (1-based): [5,10,15,20,25,.....,100]

      • The question then arises of how to calculate the CDF and PMF at the values occurring at these particular positions. The CDF at $x_i$ is the running sum of the PMF up to $x_i$:

        • $P(X \le x_i) = P(X = x_i) + P(X = x_{i-1}) + \dots + P(X = x_0)$, so conversely $P(X = x_i) = F(x_i) - F(x_{i-1})$ (see the np.diff sketch after the sum check below).
x       | PMF  | CDF
--------|------|-----
1.95    | 0.19 | 0.19
4.2125  | 0.08 | 0.27
6.475   | 0.06 | 0.33
8.7375  | 0.08 | 0.41
11.0    | 0.03 | 0.44
13.2625 | 0.05 | 0.49
15.525  | 0.01 | 0.50
17.7875 | 0.01 | 0.51
20.05   | 0.02 | 0.53
22.3125 | 0.01 | 0.54
24.575  | 0.05 | 0.59
26.8375 | 0.04 | 0.63
29.1    | 0.09 | 0.72
31.3625 | 0.10 | 0.82
33.625  | 0.08 | 0.90
35.8875 | 0.05 | 0.95
38.15   | 0.03 | 0.98
40.4125 | 0.01 | 0.99
42.675  | 0.00 | 0.99
44.9375 | 0.01 | 1.00
Total   | 1.00 |
  • The above table gives us the discrete distribution of the sample. The PMF is consistent because it sums to one, and the CDF is consistent because its final value is 1 and it is a non-decreasing function.

  • Also, the graph of the PMF shows that the PMF derived from the CDF has the same shape as the original bimodal distribution of the entire empirical sample, which indicates that our calculations are indeed correct.

In [342]:
print('To prove that this pmf is correct the sum should be one: ',np.round(np.sum(disc_pmf),decimals=0))
To prove that this pmf is correct the sum should be one:  1.0
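
Equivalently, the PMF can be recovered from the discrete CDF in one step with np.diff; a minimal sketch, assuming discrete_cdf and disc_pmf from the cells above:

import numpy as np

# PMF as consecutive differences of the CDF; prepend 0 so pmf[0] = cdf[0]
pmf_check = np.diff(discrete_cdf, prepend=0.0)
print(np.allclose(pmf_check, disc_pmf))   # should print True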

d.)

In [343]:
emp_dist = pd.DataFrame(samples)                  # empirical distribution as a DataFrame
bs_samps_50 = generate_bootstrap(emp_dist,50)     # M = 50 resamples of size n = 100
bs_param_50 = calc_mean_var(bs_samps_50)          # (means m_i*, variances s_i*^2)

bs_samps_100 = generate_bootstrap(emp_dist,100)   # M = 100 resamples
bs_param_100 = calc_mean_var(bs_samps_100)
print('The sample mean of 50 bootstrap samples are: ',bs_param_50[0])
print()
print()
print('The sample variance of 50 bootstrap samples are: ',bs_param_50[1])
print()
print()
print()
print('-----------------------------------------------------------------')
print('-----------------------------------------------------------------')
print('The sample mean of 100 bootstrap samples are: ',bs_param_100[0])
print()
print()
print('The sample variance of 100 bootstrap samples are: ',bs_param_100[1])
The sample mean of 50 bootstrap samples are:  [17.671499999999995, 17.518700000000003, 18.421399999999995, 16.91199999999999, 15.840699999999996, 16.949700000000004, 16.90490000000001, 18.198300000000003, 16.426499999999997, 17.043700000000005, 19.459199999999996, 18.845000000000006, 17.3512, 19.90519999999999, 17.832999999999995, 16.957200000000004, 16.838300000000004, 17.3079, 17.365699999999993, 17.426200000000005, 16.47679999999999, 18.680799999999987, 18.30119999999999, 16.0888, 16.3022, 15.472300000000002, 18.480699999999995, 17.0898, 18.249899999999997, 17.2479, 19.155800000000003, 16.691599999999998, 16.9805, 13.948799999999999, 17.2947, 20.019, 17.049699999999994, 15.677, 17.5763, 16.9049, 19.1307, 18.53, 16.363099999999992, 18.262599999999992, 15.755999999999997, 18.3369, 16.058499999999995, 18.208399999999997, 19.6067, 18.731400000000004]


The sample variance of 50 bootstrap samples are:  [176.76595875000004, 167.71920331, 169.92979404000002, 166.47478800000002, 151.30844851, 163.46762890999997, 183.68284099000005, 180.67069210999995, 169.4565887500001, 177.46739531000011, 177.14363736000004, 176.87968899999998, 155.11934255999998, 177.12934696, 171.71616100000003, 174.48519016000006, 173.53412611000005, 168.79680858999996, 168.63392851000003, 175.72552955999998, 169.27885375999998, 172.71490735999996, 170.86186456, 173.38114856000004, 169.15463116000006, 180.67647371000004, 159.83077851000004, 169.75184596000005, 151.51853498999995, 187.24466458999999, 178.37413836, 185.85493143999997, 166.41513074999992, 166.42379456000003, 180.59746291000008, 173.30244499999992, 161.11592891000006, 185.774359, 175.44372931, 161.87852899, 185.77746650999998, 169.818534, 183.35247538999994, 185.27922523999987, 176.83400599999996, 170.45164538999998, 165.12120475000003, 174.70656744, 187.80107811000002, 181.37733603999996]



-----------------------------------------------------------------
-----------------------------------------------------------------
The sample mean of 100 bootstrap samples are:  [15.984699999999993, 18.2246, 18.301300000000005, 16.190700000000007, 17.721199999999993, 19.7354, 16.575499999999995, 17.962200000000006, 18.0991, 18.251900000000003, 18.658999999999995, 16.7105, 17.52729999999999, 15.663300000000003, 19.303799999999995, 18.286799999999996, 16.786699999999996, 17.279999999999998, 20.243499999999997, 17.004500000000004, 17.226299999999995, 18.816199999999995, 17.745599999999996, 16.5439, 16.655599999999996, 16.186600000000002, 17.775099999999995, 16.496699999999997, 16.852799999999995, 17.0338, 17.141199999999998, 18.632399999999997, 19.0997, 17.8154, 18.613700000000005, 19.9058, 17.0489, 18.5834, 18.922000000000004, 19.078099999999996, 18.2938, 16.029899999999994, 18.6275, 17.7782, 16.926000000000005, 19.2508, 16.2754, 17.513599999999993, 18.591499999999996, 18.124899999999997, 17.6158, 18.2475, 15.745599999999998, 18.246299999999994, 15.988499999999998, 15.185199999999998, 20.101499999999998, 20.683600000000002, 18.894600000000004, 18.547599999999992, 16.365099999999998, 18.3586, 18.5129, 17.157600000000006, 17.422499999999996, 15.305000000000005, 17.331699999999994, 18.715100000000007, 18.301299999999998, 17.396599999999992, 17.314299999999992, 17.512900000000002, 21.80749999999999, 16.004299999999994, 19.372200000000003, 13.084099999999994, 17.160799999999995, 19.906400000000012, 17.240499999999997, 18.483399999999996, 17.796699999999994, 16.201400000000003, 14.512400000000008, 18.612500000000004, 15.569800000000008, 18.0083, 18.1474, 17.5172, 16.9556, 15.784699999999987, 16.58989999999999, 18.0189, 18.0325, 19.7163, 19.091799999999992, 18.9782, 18.585500000000007, 18.281000000000002, 16.428, 16.5026]


The sample variance of 100 bootstrap samples are:  [171.2917549099999, 176.11542284000004, 181.42459530999994, 178.42732050999996, 166.59885255999995, 149.11341884, 164.05888874999997, 169.48543716000003, 191.49088019, 204.57767139000003, 196.68488099999996, 163.79647274999996, 182.87235171000003, 169.47238211000004, 182.68431556, 187.15412375999998, 165.77820810999998, 173.24837000000002, 166.62803275000005, 188.25456074999994, 191.52672531, 165.32527356000003, 185.58893664000007, 174.88882179000004, 176.90477464000014, 166.46797043999993, 210.42869699000005, 185.88181410999997, 184.34670415999997, 159.08439756000004, 178.1098165600001, 168.46490424, 162.16207091, 185.26632683999986, 163.87028131, 167.10578035999998, 173.00526979000003, 197.6337604400001, 197.19917800000007, 163.70036939, 180.2617735600001, 166.98653298999994, 169.7096987499999, 191.46664275999998, 169.48614600000008, 205.27284135999994, 159.42440683999996, 210.01456503999995, 169.33915074999993, 171.04338299000003, 178.53496435999995, 168.45300075000006, 174.6877966400001, 171.37995131000008, 183.06017675000007, 164.70151096, 185.05756075000008, 172.42361104000003, 182.47774884000006, 190.72359224, 185.44307699000004, 178.33002804, 188.16972459000002, 176.86990624000018, 167.56298675, 178.9440509999999, 166.07352211, 163.0184969899999, 171.20574531, 189.3501344399998, 180.04445250999999, 162.56550258999997, 162.64683875000003, 166.15993450999997, 176.1653771599999, 160.97999419000004, 157.30645536000006, 158.52903504000002, 194.18823274999994, 182.75070043999992, 166.02002011000002, 145.26263003999998, 171.07805623999997, 189.27787075000006, 179.97068795999996, 188.26744011, 195.35177724000005, 167.74599816000006, 184.36758664, 154.52327890999996, 164.11069298999996, 174.47806978999995, 177.22813075, 189.53098131000002, 168.16640076000007, 189.73029075999992, 180.21227874999997, 156.31152699999998, 170.21289600000003, 177.09493524]

e.)

In [344]:
calc_MSE(m,bs_param_50[0],'sample mean')
calc_MSE(s_2,bs_param_50[1],'sample variance')
The Mean squared error for  sample mean  with  50  bootstrap samples is:  1.5411745538000003
The Mean squared error for  sample variance  with  50  bootstrap samples is:  82.38986747361052

f.)

In [345]:
calc_MSE(m,bs_param_100[0],'sample mean')
calc_MSE(s_2,bs_param_100[1],'sample variance')
The Mean squared error for  sample mean  with  100  bootstrap samples is:  1.8798845175000003
The Mean squared error for  sample variance  with  100  bootstrap samples is:  164.7216368930458
Bootstrap size (M) | MSE of sample mean | MSE of sample variance
-------------------|--------------------|------------------------
50                 | 1.75               | 172.27
100                | 1.78               | 133.03
  • $\textbf{Analysis of parts (e.) and (f.)}$ :

    • The MSE of the sample mean is roughly the same for M = 50 and M = 100. This is because the sample mean is an unbiased estimator of the population mean, and hence the bootstrap sample means give fairly good estimates of the population mean, which in our case is the mean of the empirical distribution. For this reason the MSE for the sample means is very low.

    • For the sample variance, however, the MSE is larger when the number of bootstrap sets is small, and it decreases as M grows: with more bootstrap sets the sample variances cluster more tightly around the empirical-distribution variance, so the MSE also decreases (the short experiment below probes this).
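
One quick way to probe this claim is to recompute the MSE for several values of M; a minimal sketch reusing the hypothetical bootstrap_mse helper sketched under the problem statement (same seed per M so the runs are comparable):

for M in (50, 100, 500, 1000):
    mse_m, mse_v = bootstrap_mse(samples, M, rng=0)
    print(f'M={M:5d}  MSE(m*)={mse_m:7.3f}  MSE(s*^2)={mse_v:9.3f}')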

In [346]:
dist = bs_param_50[1]
P_hat = np.mean(dist)
std_sampling_dist = np.std(dist)
density_prop = {"color": "green"}
hist_prop = {"alpha": 0.3, "color": "red"}
plot_densityCurve(dist,density_prop,hist_prop,0,50,P_hat,std_sampling_dist,s_2)

dist = bs_param_100[1]
P_hat = np.mean(dist)
std_sampling_dist = np.std(dist)
density_prop = {"color": "orange"}
hist_prop = {"alpha": 0.3, "color": "red"}
plot_densityCurve(dist,density_prop,hist_prop,0,100,P_hat,std_sampling_dist,s_2)
  • $\textbf{Conclusion}$

    • The MSE of the sample variance decreases as we increase the number of bootstrap samples. As we can see from the figures, the distribution of the sample variance when M = 100 looks more normal, and since the mean of the bootstrap sample variances is quite near the empirical variance line shown in purple, the MSE in the M = 100 case is small.

    • Thus, having a larger number of bootstrap samples would definitely reduce our MSE and give us good approximations of the true population parameters.

FUNCTION DEFINITIONS

In [334]:
#########Function definitions#############
import numpy as np
import sklearn as sk
import scipy.stats as sci
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import matplotlib.mlab as mlab
import seaborn as sns
import warnings
import scipy.stats as st
import operator
from scipy.stats import binom
from scipy.stats import geom
import pandas as pd
warnings.filterwarnings('ignore')
%matplotlib inline
fontP = FontProperties()
fontP.set_size('small')
np.random.seed(42)   # seed the global RNG; np.random.RandomState(seed=42) alone creates a generator object without seeding anything

def generate_CDF(samples):
    
    plt.figure(figsize=(9,4), dpi=300)
    plt.hist(samples,bins=20,density=True,color='green', \
             alpha = 0.5,label = 'empirical dist')
    plt.legend(loc='upper center')
    plt.twinx()
    # cumulative, normalised histogram: a discrete approximation to the CDF
    orig_cdf,bins,patches = plt.hist(samples,cumulative=True, \
                                     bins = 100,density = True, \
                                     color='orange',alpha = 0.3, \
                                     edgecolor='orange',label='CDF')
    
    plt.title('The CDF plot of the given empirical distribution')
    plt.legend(loc='best', bbox_to_anchor=(0.23, 0.5, 0.5, 0.5))
    plt.show()
    
    return orig_cdf,bins,patches

def generate_discrete_CDF(samples,intervals,cdf):
    
    plt.figure(figsize=(9,4), dpi=300)
    plt.hist(samples,bins=20,density=True,color='red', \
             alpha = 0.3,label = 'empirical dist')
    plt.legend(loc='upper center', \
               bbox_to_anchor=(0, 0.5, 0.5, 0.5))
    plt.twinx()
    plt.bar(intervals,cdf,align = 'edge', \
            width =2.3,alpha = 0.3, label = 'Discrete CDF')
    plt.title('The discrete CDF plot of \
              the given empirical distribution')
    plt.legend(loc='best', \
               bbox_to_anchor=(0.2, 0.5, 0.5, 0.5))
    plt.show()
    
def calc_disc_pmf(cdf):
    
    # PMF of interval i is F(x_i) - F(x_{i-1}); the running sum of disc_pmf
    # always equals cdf[i-1], so each step peels off one consecutive difference
    disc_pmf = []
    disc_pmf.append(cdf[0])
    for i in range(1,cdf.shape[0]):
        pmf = cdf[i] - np.sum(disc_pmf)
        disc_pmf.append(pmf)
        
    return disc_pmf

def generate_bootstrap(emp_dist,M):
    
    # draw M resamples, each of n = 100 points, from the empirical
    # distribution with replacement
    bootstrap_samples = []
    for i in range(0,M):
        bs = emp_dist.sample(n=100, replace=True)
        bootstrap_samples.append(bs)
        
    return bootstrap_samples

def calc_mean_var(bootstrap):
    # sample mean m_i* and (ddof=0) variance s_i*^2 of each resample
    m = []
    s_2 = []
    for i in range(0,len(bootstrap)):
         
        m.append(np.mean(bootstrap[i][0]))    # column 0 holds the resampled values
        s_2.append(np.var(bootstrap[i][0]))
        
    return m,s_2

def calc_MSE(m,param,s1):
    # MSE estimate: average squared deviation of each bootstrap statistic
    # from the full-sample statistic m (or s^2)
    M = len(param)
    total = 0 
    for i in range(0,M):
        sq_err = np.square(param[i] - m)
        total = total + sq_err
        
    MSE = total/M
    
    print('The Mean squared error for ',s1, \
          ' with ',M,' bootstrap samples is: ',MSE)
    return MSE
    
    
def plot_densityCurve(*args):
    
    # args: (dist, density_prop, hist_prop, _, M, P_hat, std_dev, true_var)
    plt.figure(figsize=(9,4), dpi=300)
    sns.distplot(args[0],kde_kws=args[1],hist_kws=args[2])
    plt.axvline(args[5], color='yellow', \
                linestyle='-.', linewidth=1,label='sample mean of var')
    plt.axvline(args[5]-args[6], \
                color='black', linestyle=':', linewidth=1, \
                label='1 standard dev')
    plt.axvline(args[5]+args[6], \
                color='black', linestyle=':', linewidth=1)
    plt.axvline(args[7], color='purple', \
                linestyle='-.', linewidth=2,label='True var')
    plt.axvline(args[5]-(1.96*args[6]), \
                color='black',linewidth=2, \
                label='95% confidence line')
    plt.axvline(args[5]+(1.96*args[6]), \
                color='black',linewidth=2)
    plt.legend()
    plt.title("The sampling distribution of sample \
              variance with bootstrap sample number: "+str(args[4]))
    plt.show()