Chi-square distribution and degrees of freedom

Chi-square test is the goodness of fit between a set (or vector) of observed values and expected values. Chi-square is one among other distributions.

Typical chi-square distribution can be plotted with following R code:

x = seq(0, 30, length.out = 200)
plot(x,dchisq(x=x, df=2), col=33, type = 'l', xlab = 'Degrees of Freedom', ylab='Density for the chi^2 distribution')
for (i in 3:15){
  ix = dchisq(x=x, df=i)

Resulting in this


With higher Degrees of Freedom, the density of chi-square distribution is getting smaller.

We could also plot distribution function, quantile function and random generation, besides density function, switching dchisq command to pchisq or qchisq or rchisq.


Playing with Regression prediction and MSE measure

In previous post, I have discussed on how to create a sample data-set in R. So let’s use the created data-set from previous post and start playing with Regression predictions.

With each prediction we want to measure, which one scores better the new values and where over-fitting start.

First we will create training and test subset for model prediction. We will use training dataset for model to train and later we will use test data set to actually test our model.

indices <- sample(1000,400)
train <- dat_set[indices, ]
test <-dat_set[-indices, ]

Once we have our subsets loaded, we can visualize a simple scatter plot, to get idea of the point dispersion and what kind of MSE we will be handling. For scatterplot we will use ggplot2 with simple geom_point function.

ggplot()+geom_point(data=train, aes(x=x, y=y))+ggtitle("TRAINING SET")
ggplot()+geom_point(data=test, aes(x=x, y=y))+ggtitle("TEST SET")

Now we can deploy lm function on training dataset to learn and score the model. we will be predicting variable Y (dependent) with independent variable X.

model <- lm(y~x,  data=train)

Lm function shows  (run: model or run: summary(model))

lm(formula = y ~ x, data = train)

(Intercept)            x  
      5.267        2.065

which can easily be tested. let us define x = 5 and observe the value:

predict(model, data.frame(x))


This means that trained model will predict value Y with 15.59 when new value X =5 will be introduced into the model.

I will add some endpoints to draw a linear regression function. This function will serve as the idea how linear, quadratic, cube, etc. functions will behave when predicting better and accurate results.

x <-  c(-5,30)
x_predict <- predict(model, data.frame(x))
endpoints <- data.frame(x, x_predict)

I hard coded the arbitrary endpoints based on the TRAIN graph (see above). in this case -5,30 as two points on the graphs (min and max endpoints). I let the prediction find the other points.

Now let’s plot the scatterplot together with the linear regression line and see the points vs. the line.

  geom_point(data=train, aes(x=x, y=y))+
  geom_line(data=endpoints, aes(x=x,y=x_predict), color='brown', 
  ggtitle("TRAINING SET")

Seeing this one might get the quick idea how the “problem” or the dispersion of the dots is not nearly linear.


Now let’s play with prediction model and  find the best polynomial line fitting the dots. All the time we will also look into the problem of over-fitting the regression function.

We will introduce MSE measure. MSE is mean square error measures the average of the squares of the “errors”, that is, the difference between the estimator and what is estimated.

For this purpose we will calculate use our x values  against the predictions on our model.

#Calculating MSE
x <- test$x
p <- predict(model, data.frame(x)) # or: predict(model, test)

Based on this, we can see the predicted values for each x, which we will use for calculating the MSE.


Continuing we take the mean squares of difference between vector of predictions (that is: actual Y values) and vector of observed values corresponding to the inputs to the function which generated the predictions (that is: predicted model of X).

sum((test$y - predict(model,data.frame(x)))^2)
#AVG - test mean square error
mse_test_value <- mean((test$y - predict(model,data.frame(x)))^2)


Now what we have a variable (mse_value) we can actually search for minimal MSE value among any next polynomial regression function.

So in first step we created linear function, now let’s create a quadratic function for train dataset.

# Quadratic
model_Q <- lm(y~x+I(x^2), data=train)

For this quadratic function we create function for ggplot:

#function for quadratic
f_Q <- function(x) {

This function is added to previous scatter plot with quadratic function.

#plotting training set with quadratic model for quadratic function
  geom_point(data=train, aes(x=x, y=y))+
  geom_line(data=endpoints, aes(x=x,y=x_predict), color='brown', 
  stat_function(data=data.frame(x=c(-5,30)), aes(x=x), fun=f_Q,
color='blue', size=1)+
  ggtitle("TRAINING SET")

Plot shows on training data that quadratic function sways from linear function.


But before we get ahead of ourselves, let’s repeat the calculation of MSE for linear vs. quadratic function and see where MSE measure is better.

#calculate test MSE for quadratic
mean((test$y-predict(model_Q, test))^2)

MSE Calculation is (for my case of dataset)


where as for linear function:

mse_test <- mean((test$y - predict(model,data.frame(x)))^2)


MSE fits slightly better for linear function as for quadratic model. For cubic function (or degree 3) we use following function:

model_3 <- lm(formula=y~poly(x,3, raw=T), data=train)
mse_3 <- mean((test$y-predict(model_3,test))^2)

with result of:


we see the cubic fit function is not as good as quadratic or linear and we may start already over-fitting the model. So let’s loop through the linear to degree of 12 of polynomial function and see the fit scores (MSE). Using following code:

for(i in 1:13) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse <- mean((test$y-predict(model,test))^2)

Returns following results:

[1] 93.16901
[1] 94.35723
[1] 95.07303
[1] 95.23098
[1] 94.79259
[1] 96.55435
[1] 108.6518
[1] 132.8873
[1] 130.5214
[1] 212.9898
[1] 169.7865
[1] 7321.596
[1] 226.708

And we see that at degree 5 the function fit is 94.79 as of degree of 2 with 94.35, but still linear model outfits  / outperforms all other functions. What happens with function of degree 7 or higher is what we call over-fitting.


For better visualization I have reduced the function to only 10 iterations, mainly because at the degree of 11 and higher we get extreme outliers and high values.

for(i in 1:10) {
  model <- lm(formula=y~poly(x,i, raw=T), data=train)
  mse_v[i] <- mean((test$y-predict(model,test))^2)

#visualize MSE
y_m <- mse_v
x_m <- 1:10
mse_p <- data.frame(x_m, y_m)

  geom_point(data=mse_p, aes(x=x_m, y=y_m), size=2)+
  geom_line(data=mse_p, aes(x=x_m, y=y_m), size=1)

Graph shows how the MSE is growing with each higher degree.


For the last part, let’s compare trained data to test data with all the functions from linear to function of degree 10.

Therefore we introduce following function:

mse_calc <- function(train,test){
            for(i in 1:10) {
              model <- lm(formula=y~poly(x,i, raw=T), data=train)
              mse[i] <- mean((test$y-predict(model,test))^2)

making following visualization even better with for loop:

x<- 1:10
for(i in 1:10){
  ind <- sample(1000,500)
  train <- dat_set[ind,]
  test <- dat_set[-ind,]
  mse_poly <- data.frame(x,y)
   plot<-plot+geom_point(data=mse_poly, aes(x,y), size=3)
   plot<-plot+geom_line(data=mse_poly, aes(x,y))

Making following outstanding plot:


Couple of words on for loop. In each loop we generate random subset of training and test data, which is predicted for each step and compared with each other, resulting in calculation of MSE. 10 times we calculate linear MSE fit measure for random values, 10 times we calculate quadratic MSE fit measure for random values, 10 times …. for degree 3 to degree 10. We see that over fitting start already at degree 4  and at degree 5 it just explodes.





Random permutations and t-test

Let’s assume that we have two groups (= datasets) of observations. And we are interested if two sets of data are significantly different from each other. In first place we will compare the means of each group and for these two means we will calculate Welch two sample t-test.  I will not go into details on how t-test is being calculated, just briefly mentioning H0 and H1 hypothesis. A test statistic either exactly follows or closely approximates a t-distribution under the null hypothesis, respectively.  For each case, degrees of freedom are calculated.

Our test data for calculating difference between means of each group is:

#b is group Black and w is group White
b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,
w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,

#calculate difference between means
#Welch two sample t-test

Difference in Mean is 4.9 (4,903111) and T-Test shows statistically significant difference between two groups (p = 0.007, df = 39.1, t = 3.65)

	Welch Two Sample t-test

data:  b and w
t = 3.6582, df = 39.113, p-value = 0.0007474

Now let presume that the degrees of freedom arise from residuals from sum of squares in case of T-test and can be understood also as before and after conditions of calculations. So it is an independent ways by which a dynamic system can move, without violating any constraint.

So by computations of random permutation we can test if this 4.9 difference in mean is a small or big difference. And remember, by staying in boundaries of original degrees of freedom, we can demonstrate what and how statistically significant our difference really is.

In this case we will replace and switch couple of numbers (labels) between the groups in order to show, that this statistical significance has little or no importance.

Following is R code for swapping the values, storing difference in mean and plotting the values.

diff_data <- list()
for(i in 1:5000){
 b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,21.28,
 w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,26.88,
 #create permutation
 b_r <- sample(b,5)
 w_r <- sample(w,5)
 b_new <- replace(b, b_r, w_r)
 w_new <- replace(w, w_r, b_r)
 #diff<- round((mean(b_new, na.rm=TRUE)-mean(w_new, na.rm=TRUE)), digits=2)
 x <- round((mean(b_new, na.rm=TRUE)-mean(w_new, na.rm=TRUE)), digits=2)
 y <- 1
 diff_data[[i]] <- c(x,y)
diff_data_v <-"rbind", diff_data))
diff <- data.frame(diff,0)
diff_2 <- rbind(diff, c(4.903111, 175))
 geom_histogram(data=diff_data_v, aes(x=V1), color="brown", 
binwidth = 0.05)+
 geom_line(data=diff_2, aes(x=diff, y=X0), color= 'REd', size =1)+
 labs(title="Random Permutation test",x="AVG Diff", y = "Count")

Little explanation to the code. Loop is doing 5000 permutations of in batches of 5 replacements between the groups and calculating the new mean.


Plot clearly shows number of occurrences when the differences between means was so high (4.9) and hence statistically significant. This occurred only in ca. 0,02% of cases, making or repeating this test with permutations relatively questionable if is would be statistically significant next time.

d49 <- diff_data_v$V1 >= 4.9


So watch out next time you do t-test.

Generating sample data in R

When testing random functions or predictions in R it is usually a good thing to have some sample or random data. A lot of libraries and base libraries in R are equipped with good sample data, but let me show you a nice way of generating  a data frame of random data.

We will generate random data using rnorm function (random generation for the normal distribution with mean equal to defined mean). We will apply a linear function to random values using sapply function (applying a function to list or vector or array of values). Similar functions are lapply or vapply.

x <- rnorm(1000,10,5)
y <- sapply(x, function(x) rnorm(1,2*x+6,10))
dat_set <- data.frame(x,y)

After this, we can visulize the dataset dat_set to see the dispersion.

ggplot()+geom_point(data=dat_set, aes(x=x, y=y),size=1, color='brown')

Visualization looks like:

2016-01-04 21_31_23-RStudio

One can tell that initial data distribution follows the linear function of y=2x+6 with applied (using sapply) y-coordinated values.