Random permutations and t-test

Let's assume that we have two groups (= datasets) of observations, and that we are interested in whether the two sets of data are significantly different from each other. In the first place we will compare the means of the two groups, and for these two means we will calculate Welch's two-sample t-test. I will not go into detail on how the t-test is calculated, beyond briefly mentioning the hypotheses: H0 states that the two group means are equal, and H1 that they differ. Under the null hypothesis the test statistic either exactly follows or closely approximates a t-distribution, and in each case the degrees of freedom are calculated.

Our test data for calculating the difference between the means of the two groups is:

#b is group Black and w is group White
b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,
21.28,25.76,26.88,31.36,21.28,26.88,32.48,20.16,22.40,19.04,
34.72,22.40,28.00,31.36,23.52,30.24)
w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,
26.88,21.28,25.76,14.56,24.64,22.40,26.88,20.16,22.40)

# calculate the difference between the means
diff <- mean(b) - mean(w)
diff
# Welch two-sample t-test
t.test(b, w)

The difference in means is 4.9 (4.903111), and the t-test shows a statistically significant difference between the two groups (p = 0.0007, df = 39.1, t = 3.66):

	Welch Two Sample t-test

data:  b and w
t = 3.6582, df = 39.113, p-value = 0.0007474
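
To make the numbers in this output less of a black box, here is a minimal sketch that recomputes the Welch statistic and its degrees of freedom by hand, using the usual Welch-Satterthwaite approximation. The variable names (`se2_b`, `se2_w`, `t_stat`, `welch_df`, `p_value`) are my own and not part of the original post; the result should agree with the `t.test()` output above.

```r
# Data from the post: b is group Black and w is group White
b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,
       21.28,25.76,26.88,31.36,21.28,26.88,32.48,20.16,22.40,19.04,
       34.72,22.40,28.00,31.36,23.52,30.24)
w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,
       26.88,21.28,25.76,14.56,24.64,22.40,26.88,20.16,22.40)

n_b <- length(b)
n_w <- length(w)

# squared standard error of each group mean
se2_b <- var(b) / n_b
se2_w <- var(w) / n_w

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(b) - mean(w)) / sqrt(se2_b + se2_w)

# Welch-Satterthwaite approximation of the degrees of freedom
welch_df <- (se2_b + se2_w)^2 /
  (se2_b^2 / (n_b - 1) + se2_w^2 / (n_w - 1))

# two-sided p-value from the t-distribution
p_value <- 2 * pt(-abs(t_stat), welch_df)

c(t = t_stat, df = welch_df, p = p_value)
```

Note that the degrees of freedom here depend on the group variances as well as on the group sizes, which is why they are not a whole number.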


Now let us presume that, in the case of the t-test, the degrees of freedom arise from the residual sums of squares, and that they can also be understood as the conditions before and after the calculation. In other words, they are the number of independent ways in which a dynamic system can move without violating any constraint.

So by computing random permutations we can test whether this difference in means of 4.9 is small or large. And remember: by staying within the boundaries of the original degrees of freedom, we can demonstrate whether and how statistically significant our difference really is.

In this case we will swap a couple of values (labels) between the groups in order to show that this statistical significance has little or no importance.

The following R code swaps the values, stores the difference in means, and plots the results.

library(ggplot2)

b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,21.28,
25.76,26.88,31.36,21.28,26.88,32.48,20.16,22.40,19.04,34.72,22.40,
28.00,31.36,23.52,30.24)
w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,26.88,
21.28,25.76,14.56,24.64,22.40,26.88,20.16,22.40)
# observed difference in means, kept in its own variable
obs_diff <- mean(b) - mean(w)

diff_data <- list()
for (i in 1:5000) {
  # create permutation: pick 5 positions in each group and swap their values
  # (replace() expects positions, so we sample indices rather than values)
  b_idx <- sample(length(b), 5)
  w_idx <- sample(length(w), 5)
  b_new <- replace(b, b_idx, w[w_idx])
  w_new <- replace(w, w_idx, b[b_idx])
  x <- round(mean(b_new) - mean(w_new), digits = 2)
  y <- 1
  diff_data[[i]] <- c(x, y)
}
diff_data_v <- as.data.frame(do.call("rbind", diff_data))
# two points that draw a vertical marker at the observed difference
diff_2 <- data.frame(diff = c(obs_diff, obs_diff), X0 = c(0, 175))
ggplot() +
  geom_histogram(data = diff_data_v, aes(x = V1), color = "brown",
                 binwidth = 0.05) +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_line(data = diff_2, aes(x = diff, y = X0), color = "red", linewidth = 1) +
  labs(title = "Random Permutation test", x = "AVG Diff", y = "Count")

A short explanation of the code: the loop runs 5000 permutations, each swapping a batch of 5 values between the groups, and calculates the new difference in means.
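
One detail worth checking in a swap like this is that `replace()` takes positions, not values, as its second argument. The following is a minimal sketch of a single swap by position on toy data; the vectors `b_toy` and `w_toy`, the seed, and the batch size of 2 are made up purely for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# toy groups, not the data from the post
b_toy <- c(10, 20, 30, 40, 50)
w_toy <- c(1, 2, 3, 4)

# sample positions (indices) in each group
b_idx <- sample(length(b_toy), 2)
w_idx <- sample(length(w_toy), 2)

# swap the values found at those positions
b_new <- replace(b_toy, b_idx, w_toy[w_idx])
w_new <- replace(w_toy, w_idx, b_toy[b_idx])

# the group sizes are unchanged ...
length(b_new) == length(b_toy)
length(w_new) == length(w_toy)
# ... and the pooled values are the same, only their labels moved
identical(sort(c(b_new, w_new)), sort(c(b_toy, w_toy)))
```

If values were passed instead of positions, `replace()` would treat them as indices, which can lengthen the vector and pad it with `NA`.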

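[Figure: random_permutation_test, a histogram of the permuted differences in means with the observed difference marked in red]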

The plot clearly shows how often the difference between the means was as high as the observed one (4.9), and hence statistically significant. In the run reported here this occurred in only ca. 0.02% of cases, which makes it relatively questionable whether repeating this test with permutations would be statistically significant next time.

# count the permutations with a difference in means of at least 4.9
d49 <- diff_data_v$V1 >= 4.9
sum(d49)
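
For comparison, here is a minimal sketch of the more common form of a two-sample permutation test, which reshuffles all group labels on every iteration instead of swapping five values at a time. This is a swapped-in variant and not the procedure used in this post; the names `pooled`, `n_perm`, `perm_diffs` and `p_perm`, as well as the seed, are my own assumptions. In this form, the share of shuffled differences at least as extreme as the observed one is read as the p-value.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

b <- c(30.24,22.40,23.52,29.12,30.24,34.72,26.88,23.52,22.40,
       21.28,25.76,26.88,31.36,21.28,26.88,32.48,20.16,22.40,19.04,
       34.72,22.40,28.00,31.36,23.52,30.24)
w <- c(23.52,24.64,16.80,13.44,23.52,17.92,21.28,16.80,24.64,
       26.88,21.28,25.76,14.56,24.64,22.40,26.88,20.16,22.40)

obs_diff <- mean(b) - mean(w)
pooled <- c(b, w)
n_b <- length(b)
n_perm <- 5000

# shuffle all labels: the first n_b shuffled values play the role of
# group b, the remaining ones the role of group w
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n_b]) - mean(shuffled[-(1:n_b)])
})

# two-sided p-value; the +1 counts the observed arrangement itself
p_perm <- (sum(abs(perm_diffs) >= abs(obs_diff)) + 1) / (n_perm + 1)
p_perm
```

Because every shuffle keeps the group sizes at 25 and 18, the only thing that changes between iterations is which values carry which label.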

 

So watch out the next time you run a t-test.
