PROJECT - 7 Generating Continuous Random Variables

NAME : Ruchin Patel

USC ID: 9520665364

Email : ruchinpa@usc.edu

Problem Statement:

  • Discuss that if Y is uniformly distributed over (0,1), then X = -ln(1-Y) or X = -ln(Y) will be exponentially distributed with parameter $\lambda$ = 1.

  • Generate an Erlang distribution using the above approach, i.e. by summing exponential random variables.

  • Do the same using a faster approach.

  • Use the rejection method to generate a random variable whose pdf is the same as (or similar to) that of an Erlang-3 distribution.

  • Compare timings of all the methods.

THEORETICAL EXPLORATION

  • $\textbf{Exponential Distribution}$: An exponentially distributed continuous random variable is the continuous analogue of the geometric discrete random variable. Consider a time interval t = n$\delta t$, i.e. we discretize t into n small intervals, and let the probability that an event occurs in each interval be p = $\lambda \delta t$.

    • The waiting time until the first event in such a setting is modelled by an exponential random variable.

    • CDF: $F_T(t) = 1 - e^{-\lambda t}$

    • PDF: $\lambda e^{-\lambda t}$

  • $\textbf{Erlang Distribution}$: This distribution is a special case of the gamma distribution. The Erlang distribution is simply the sum of n independent exponential random variables; when we sum n such variables, the result is known as an Erlang-n distribution.

    • CDF: $F_{S_n}(x) = 1 - \sum_{j=0}^{n-1} \frac{(\lambda x)^j e^{-\lambda x}}{j!}$

    • pdf: $f_{S_n}(x) = \frac{(\lambda x)^{n-1} \lambda e^{-\lambda x}}{(n-1)!}$

    • E[X] = $\frac{n}{\lambda}$. In our case n = 3 and $\lambda$ = 1, hence E[X] = 3

    • ${\sigma}^2$ = $\frac{n}{{\lambda}^2}$. Again, for the Erlang-3 distribution in our case, ${\sigma}^2$ = 3
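
As a quick numerical check of these moments, the sketch below (the seed and variable names are my own choices) sums three independent Exp(1) draws and compares the sample mean and variance against the theoretical value 3:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
# Erlang-3 with lambda = 1: the sum of three independent Exp(1) variables
s = rng.exponential(scale=1.0, size=(100_000, 3)).sum(axis=1)
mean, var = s.mean(), s.var()
```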

  • $\textbf{Rejection Method}$: the rejection method generates a random variable with a given pdf by repeatedly sampling from another, easy-to-generate random variable and accepting or rejecting each candidate sample.

    • The steps for the rejection method are mentioned in section 'e'.

a.)

  • If Y is uniformly distributed over (0,1), then X = -ln(1-Y) will be exponentially distributed with parameter $\lambda$. We prove this in the following section.

    • It is to be noted that the exponential pdf is given by: $\lambda e^{-\lambda x}$

    • Y ~ U(0,1) and (0 < y < 1)

    • Thus $f_{Y}(y)$ = 1 for (0 < y < 1)

    • X = -ln(1-Y)

      • Thus -X = ln(1-Y)

      • $e^{-X}$ = $e^{ln(1-Y)}$

      • $e^{-X}$ = 1 - Y

      • Y = 1 - $e^{-X}$ = $g^{-1}(X)$

      • Thus $\frac{dY}{dX}$ = $e^{-X}$

      • Thus $f_{X}(x)$ = $f_{Y}(g^{-1}(x)) * |\frac{dY}{dX}|$

      • $f_{X}(x)$ = 1 * $e^{-x}$

      • Thus $f_{X}(x)$ = $1 \cdot e^{-1 \cdot x}$

      • Comparing with the exponential pdf we can say that $f_{X}(x)$ = $\lambda e^{-\lambda x}$ with $\lambda$ = 1

      • Thus we proved that X = -ln(1-Y) with Y ~ U(0,1) is an exponential random variable with parameter $\lambda$ = 1
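
A minimal numerical sketch of this result (seed and sample size are arbitrary choices of mine): draw uniforms, apply X = -ln(1-Y), and check that the sample moments match Exp(1), which has mean 1 and variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
y = rng.uniform(0.0, 1.0, 100_000)
x = -np.log(1.0 - y)  # inverse transform of the Exp(1) CDF 1 - e^{-x}
mean, var = x.mean(), x.var()
```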

  • If Y is uniformly distributed over (0,1), then X = -ln(Y) will be exponentially distributed with parameter $\lambda$. We prove this in the following section.

    • It is to be noted that the exponential pdf is given by: $\lambda e^{-\lambda x}$

    • Y ~ U(0,1) and (0 < y < 1)

    • Thus $f_{Y}(y)$ = 1 for (0 < y < 1)

    • X = -ln(Y)

      • Thus -X = ln(Y)

      • $e^{-X}$ = $e^{ln(Y)}$

      • $e^{-X}$ = Y

      • Y = $e^{-X}$ = $g^{-1}(X)$

      • Thus $\frac{dY}{dX}$ = -$e^{-X}$

      • Thus $f_{X}(x)$ = $f_{Y}(g^{-1}(x)) * |\frac{dY}{dX}|$

      • $f_{X}(x)$ = 1 * $e^{-x}$

      • Thus $f_{X}(x)$ = $1 \cdot e^{-1 \cdot x}$

      • Comparing with the exponential pdf we can say that $f_{X}(x)$ = $\lambda e^{-\lambda x}$ with $\lambda$ = 1

      • Thus we proved that X = -ln(Y) with Y ~ U(0,1) is also an exponential random variable with parameter $\lambda$ = 1
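
Beyond matching moments, the claim can be sketched with a goodness-of-fit check (the seed is mine; `scipy.stats.kstest` against the standard exponential CDF is one reasonable choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
x = -np.log(rng.uniform(0.0, 1.0, 50_000))
# Kolmogorov-Smirnov distance between the empirical CDF of x and the Exp(1) CDF
ks = stats.kstest(x, 'expon')
```

A small KS statistic (well below a few percent for this sample size) indicates the empirical distribution is indistinguishable from Exp(1).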

b.)

In [301]:
X = generate_samples(1000)
s1 = 'Derived distribution with 1000 samples'
pdf = plot_distribution(X,25,'red','green',s1)
In [302]:
def r():
    return generate_Erlang_c(3,1000)
In [303]:
exp = np.random.exponential(scale=1.0, size=10000)
s1 = 'True exponential distribution with 10000 samples'
plot_distribution(exp,25,'orange','blue',s1)
  • $\textbf{Results}$: As shown in the above two graphs, X = -ln(Y) where Y ~ U(0,1) is an exponentially distributed random variable with $\lambda$ = 1.

  • If the graphs were not labelled, we could not even tell the derived distribution and the true exponential distribution apart.

c.)

In [322]:
erl = np.random.gamma(3,1,size=10000)
s1 = 'True Erlang distribution with n=3 and sample size 10000'
plot_distribution(erl,25,'blue','red',s1)
print('Sample mean: ',np.mean(erl))
print('Sample variance: ',np.var(erl))
Sample mean:  3.0068620353587936
Sample variance:  3.0426630403653543
Statistic      Theoretical
True Mean      3.0
True Variance  3.0
  • The above figure shows a true Erlang distribution with n = 3, along with its sample mean and sample variance.

  • As mentioned in the theoretical section, the sample mean and sample variance are both very close to the theoretical value of 3.

In [371]:
S_n = generate_Erlang_c(3,1000)
s1 = 'Erlang distribution with n=3 and 1000 samples using approach in part (c)'
plot_distribution(S_n[0],25,'green','orange',s1)
print('Sample mean: ',np.mean(S_n[0]))
print('Sample variance: ',np.var(S_n[0]))
Sample mean:  2.9600271230485027
Sample variance:  3.101702115605387
  • This is the Erlang distribution calculated by the formula mentioned above.

  • As we can see the distribution is very similar to that of the true Erlang distribution.

  • The mean and variance are also very close to the theoretical values, as follows:

Statistic  Sample  Theoretical
Mean       2.96    3.0
Variance   3.101   3.0

d.)

  • Taking the product of the uniforms and then the log of it gives an Erlang distribution, for the following reasons:

    • $S_n$ = $-\frac{1}{\lambda}\ln\left(\prod_{i=1}^{n} U_i\right)$ where [$\lambda$ = 1]

    • $S_n$ = -ln($\prod_{i=1}^{n} U_i$)

    • $S_n$ = $\sum_{i=1}^{n} -ln(U_i)$

    • Let $U_i$ = $Y_i$

    • Therefore, $S_n$ = $\sum_{i=1}^{n} -ln(Y_i)$

    • But from part (a) we know that if X = -ln(Y) with Y ~ U(0,1), then X is an exponential random variable.

    • Therefore, $S_n$ = $\sum_{i=1}^{n} X_i$, which is nothing but the same equation used in part (c).

    • Hence this approach works.
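
The derivation above can be sketched as a vectorized sampler (the function name, seed, and array shapes are my own; NumPy's `prod` over an axis does the product step in one call):

```python
import numpy as np

def erlang_log_product(n, m, rng):
    """Erlang-n (lambda = 1) samples via -log of a product of n uniforms."""
    u = rng.uniform(0.0, 1.0, size=(m, n))
    return -np.log(u.prod(axis=1))  # one log per sample instead of n

samples = erlang_log_product(3, 100_000, np.random.default_rng(2))
```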

In [354]:
S_n = generate_Erlang_d(3,1000)
s1 = 'Erlang distribution with n=3 and 1000 samples using approach in part (d)'
plot_distribution(S_n[0],25,'orange','green',s1)
print('Sample mean: ',np.mean(S_n[0]))
print('Sample variance: ',np.var(S_n[0]))
Sample mean:  3.040329017839072
Sample variance:  3.1060686081701587
Statistic  Sample Stats[Part(c)]  Sample Stats[Part(d)]  Theoretical
Mean       2.96                   3.04                   3.0
Variance   3.101                  3.10                   3.0
  • $\textbf{Results}$: As seen in the above table, the results from the methods in parts (c) and (d) are very similar to each other. Both compute the same sum of exponentials; the method in part (d) merely arrives at it with fewer operations, which makes it faster.

e.)

  • In order to generate the Erlang distribution with n = 3 by rejection, we first choose a probability density function g that resembles it and then scale g by a constant c so that cg(x) bounds the Erlang-3 pdf.

  • To do this the following steps are taken.

    • $f_{S_n}$ = $\frac{(\lambda x)^{n-1} \lambda e^{-\lambda x}}{(n-1)!}$

    • In our case $\lambda = 1$ and n = 3

    • Thus, $f_{S_n}$ = $\frac{x^2 e^{-x}}{2}$

    • We select g(x) = ${\lambda}^{'} e^{-{\lambda}^{'} x}$. Here we choose ${\lambda}^{'} = \frac{1}{3}$ so that the mean of the proposal, $\frac{1}{{\lambda}^{'}} = 3$, matches the mean of the Erlang-3 distribution.

    • Thus g(x) = $\frac{1}{3} e^{\frac{-x}{3}}$

    • Thus, $\frac{f(x)}{g(x)}$ = $\frac{3}{2} x^2 e^{\frac{-2x}{3}}$

    • Now, in order to find the x that maximizes the above ratio, we take its derivative and equate it to zero.

    • $\frac{d}{dx}[\frac{f(x)}{g(x)}]$ = 0

    • Thus, $3 x e^{\frac{-2x}{3}}\left[1 - \frac{x}{3}\right]$ = 0

    • Thus, x = 3 (the root x = 0 is a minimum of the ratio)

    • Thus c = $\frac{f(3)}{g(3)}$

    • Thus c = $\frac{3^3 e^{-2}}{2}$

    • Thus $\frac{f(x)}{cg(x)}$ = $\frac{x^2 e^{\frac{-2x}{3}}}{3^2 e^{-2}}$

  • After calculating all this the algorithm works as follows:

    • Generate Y ~ Exp(${\lambda}^{'} = \frac{1}{3}$), i.e. an exponential with rate 1/3 and mean 3

    • R ~ U(0,1)

    • if R <= $\frac{f(y)}{cg(y)}$:

      • Accept X = Y
    • Else, reject Y and repeat the process.
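
The steps above can be sketched as a standalone function (function name and seed are my own; note that NumPy's `scale` parameter is the mean $1/{\lambda}^{'} = 3$, not the rate):

```python
import numpy as np

def erlang3_rejection(m, rng):
    """Accept-reject sampler for Erlang-3 (lambda = 1) with an Exp(1/3) proposal."""
    out = []
    while len(out) < m:
        y = rng.exponential(scale=3.0)  # proposal Y with rate 1/3, i.e. mean 3
        r = rng.uniform()
        # acceptance probability f(y) / (c g(y)) = y^2 e^{-2y/3} / (3^2 e^{-2})
        if r <= (y * y * np.exp(-2.0 * y / 3.0)) / (9.0 * np.exp(-2.0)):
            out.append(y)
    return np.array(out)

samples = erlang3_rejection(20_000, np.random.default_rng(3))
```

The ratio equals 1 exactly at y = 3 (where f and cg touch), so it never exceeds 1 and each candidate is accepted with probability 1/c on average.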

In [382]:
exp_3 = np.random.exponential(scale=3, size=10000)
s1 = 'The bounding function of Erlang-3 distribution'
generate_bounding_func(erl,exp_3,85,'orange','green',s1)
In [359]:
S_n_rej = rejection_method_erlang(1000)
In [384]:
s1 = 'The Erlang distribution generated using rejection method'
plot_distribution(S_n_rej[0],25,'purple','cyan',s1)
print('Sample Mean: ',np.mean(S_n_rej[0]))
print('Sample Var: ',np.var(S_n_rej[0]))
Sample Mean:  2.948029640278609
Sample Var:  2.779604810824049
Statistic  Sample Stats[Part(c)]  Sample Stats[Part(d)]  Sample Stats[Part(e)]  Theoretical
Mean       2.96                   3.04                   2.95                   3.0
Variance   3.101                  3.10                   2.77                   3.0
  • The sample mean and variance are again very close to the theoretical values, and the statistics of all the parts are close to each other. This is because no matter which approach we follow, the final result is the Erlang-3 distribution, and hence the statistics agree.

f.)

In [367]:
calculate_time(250000)
Time taken by part c is:  14.203761100769043
Sample Mean for part (c.):  2.997613511152498
Sample Var for part (c.):  2.99937472719769
Time taken by part d is:  10.4392671585083
Sample Mean for part (d.):  2.9995716808968074
Sample Var for part (d.):  2.9767343394223227
Time taken by part e is:  34.6606559753418
Sample Mean: for part (e.):  3.0092366924203406
Sample Var: for part (e.):   2.8278709480877864
Statistic   Sample Stats[Part(c)]  Sample Stats[Part(d)]  Sample Stats[Part(e)]  Theoretical
Mean        2.99                   2.99                   3.01                   3.0
Variance    2.99                   2.97                   2.82                   3.0
Time Taken  14.2 seconds           10.44 seconds          34.66 seconds          -
N           250,000                250,000                250,000                250,000
  • The approach in part (c) is slower than the approach in part (d), as the results above show: approach (c) takes 14.2 seconds while approach (d) takes 10.44 seconds.

    • This is because in approach (c) we take the log of each of the 3 uniform random values and then sum them: three logarithm evaluations plus the additions to combine them.

    • Whereas in approach (d) we multiply the 3 uniform random values and take a single log of the product. Since the logarithm is the expensive operation, evaluating one log per sample instead of three makes approach (d) faster.

  • Approach (e), the rejection method, is the slowest. This is inherent to the algorithm: whenever U(0,1) > $\frac{f(x)}{cg(x)}$ the candidate is rejected and the whole process has to be repeated, so on average c candidates must be generated per accepted sample.
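
The gap between (c) and (d) can be reproduced in spirit with a vectorized sketch (the seed is mine and `perf_counter` is one way to time; absolute timings will vary by machine, so only the agreement of the two results is checked):

```python
import numpy as np
from time import perf_counter

rng = np.random.default_rng(4)  # arbitrary seed
N = 250_000
u = rng.uniform(size=(N, 3))

t0 = perf_counter()
part_c = (-np.log(u)).sum(axis=1)     # part (c): three logs, then sum
t_c = perf_counter() - t0

t0 = perf_counter()
part_d = -np.log(u.prod(axis=1))      # part (d): one product, then a single log
t_d = perf_counter() - t0
```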

  • $\textbf{Conclusions}$:

    • We can generate an Erlang-n distribution by summing n exponential random variables with parameter $\lambda$.

    • There are various approaches to do this, some faster than others. In our case, the approach where we take the product of n independent uniform variables and then the log of it is the fastest.

    • However, there are times when we do not have built-in functions to generate these random variables. In such cases we use the rejection method discussed in section (e). Through this method we can generate any random variable X with pdf $f_X(x)$ using another simple, efficient random variable whose scaled pdf bounds it. It takes more time to run, but it works and gives the required results.

Function Definition

In [386]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import warnings
from time import time

warnings.filterwarnings('ignore')
%matplotlib inline
fontP = FontProperties()
fontP.set_size('small')
np.random.seed(42)  # seed the global RNG (calling np.random.RandomState alone has no effect)



def generate_samples(n):
    
    # Inverse-transform: X = -ln(Y) with Y ~ U(0,1) is Exp(1)
    Y = np.array(np.random.uniform(0,1,n))
    X = -1*np.log(Y)
    return X

def plot_distribution(X,b,hist_col,kde_col,s):
    
    plt.figure(figsize=(9,4), dpi=300)
    orig_pdf,bins,patches = plt.hist(X,bins=b,density = True, \
                                       color=hist_col,alpha = 0.3 \
                                    )
    
    plt.plot(bins[0:len(bins)-1],orig_pdf,color = kde_col)
    plt.xlim(0,10)
    plt.title(s)
    plt.show()
    

def generate_Erlang_c(n,m):
    
    S_n = []
    start = time() # Get start time
    for i in range(0,m):
        X_n = []
        for j in range(0,n):
            Y = np.array(np.random.uniform(0,1,1))
            X = -1*np.log(Y)
            X_n.append(X)
        
        S_i = np.sum(X_n)
        S_n.append(S_i)
    end = time() # Get end time
    tot = end - start
    
    return np.array(S_n),tot



def generate_Erlang_d(n,m):
    
    S_n = []
    start = time() # Get start time
    for i in range(0,m):
        U = []
        for j in range(0,n):
             U.append(np.random.uniform(0,1,1))
        
        S_i = -1 * np.log(np.prod(U))
        S_n.append(S_i)
    
    end = time() # Get end time
    tot = end - start
    
    return np.array(S_n),tot

  
def generate_bounding_func(erl,exp,b,hist_col,kde_col,s):
    
    plt.figure(figsize=(9,4), dpi=300)
    
    
    exp_pdf,bins_exp,patches = plt.hist(exp,bins=b,density = True, \
                                       color='green',alpha = 0 \
                                       )
    
    erl_pdf,bins_erl,patches = plt.hist(erl,bins=b,density = True, \
                                       color=hist_col,alpha = 0.3 \
                                       ,label = 'erlang-3 dist')
    
    
    
    exp_pdf = exp_pdf*1.82  # scale by c = (3^3)*e^{-2}/2 ~ 1.82 so that c*g bounds f
    plt.plot(bins_exp[0:len(bins_exp)-1],exp_pdf,color = kde_col,label='1.82*exp with lambda=1/3 pdf')
    plt.plot(bins_erl[0:len(bins_erl)-1],erl_pdf,color = 'red',label='erlang-3 pdf')
    plt.xlim(0,10)
    plt.title(s)
    plt.legend()
    plt.show()
    
def rejection_method_erlang(n):
    
    S_n = []
    p = 0
    
    start = time() # Get start time
    while(p < n):
        
        X = np.random.exponential(scale=3, size=1)  # proposal Y ~ Exp(rate 1/3): scale = mean = 3
        R = np.random.uniform(0,1,size=1)
        
        den = 3*3*np.exp(-2)            # c*g normalization: 3^2 * e^{-2}
        num = X*X*np.exp(-X*(2/3))      # x^2 * e^{-2x/3}
    
        if(R <= (num/den)):
            S_n.append(X[0])
            p += 1
    end = time() # Get end time
    tot = end-start
    
    return S_n,tot
    
def calculate_time(N):
    
    S_n = generate_Erlang_c(3,N)
    print('Time taken by part c is: ',S_n[1])
    s1 = 'Erlang distribution with n=3 and '+str(N)+' samples using approach in part (c)'
    plot_distribution(S_n[0],25,'green','orange',s1)
    print('Sample Mean for part (c.): ',np.mean(S_n[0]))
    print('Sample Var for part (c.): ',np.var(S_n[0]))
    
    S_n = generate_Erlang_d(3,N)
    print('Time taken by part d is: ',S_n[1])
    s1 = 'Erlang distribution with n=3 and '+str(N)+' samples using approach in part (d)'
    plot_distribution(S_n[0],25,'orange','green',s1)
    print('Sample Mean for part (d.): ',np.mean(S_n[0]))
    print('Sample Var for part (d.): ',np.var(S_n[0]))
    
    S_n_rej = rejection_method_erlang(N)
    print('Time taken by part e is: ',S_n_rej[1])
    s1 = 'The Erlang distribution generated using rejection method'
    plot_distribution(S_n_rej[0],25,'purple','cyan',s1)
    print('Sample Mean: for part (e.): ',np.mean(S_n_rej[0]))
    print('Sample Var: for part (e.):  ',np.var(S_n_rej[0]))