Effect of normalization of data

Clustering (distance-based methods in particular) can depend on how the data is normalized. When a distance measure is used, the same data can produce different clustering results, or even a different clustering model, depending on the units and scales of its variables.

A simple everyday example, measurement units, can already produce two different results. For this purpose we will create two R data frames, one with metric and one with imperial units.

person <- c("1","2","3","4","5")
age <- c(30,50,30,50,30)
height_cm <- c(180,186,166,165,191)
height_feet <- c(5.91,6.1,5.45,5.42,6.26)
weight_kg <- c(70,90,60,74,104)
weight_stone <- c(11,14.1,9.4,11.6,16.3)

sample1 <- data.frame(person, age, height_cm, weight_kg)
sample2 <- data.frame(person, age, height_feet, weight_stone)
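
Before plotting anything, a quick look at the pairwise Euclidean distances already hints at where the problem will appear. This is a small sketch of my own, not part of the original script; it uses only the age and height columns:

# pairwise Euclidean distances on (age, height) in the two unit systems;
# the character column "person" is dropped first
dist(sample1[, c("age", "height_cm")])
dist(sample2[, c("age", "height_feet")])
# in centimeters the height differences dominate the distances, in feet the
# age differences dominate; e.g. person 2's nearest neighbour should come
# out differently in the two unit systems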

I have seen people argue that even a simple scatter plot will look different depending on the measurement unit. So let us try this:

# create two simple scatter plots
library(calibrate)  # textxy() for point labels comes from the calibrate package

#sample1
plot(sample1$age, sample1$height_cm, main="Sample1", xlab="Age", 
ylab="Height (cm)", pch=19) 
textxy(sample1$age, sample1$height_cm,sample1$person, cex=0.9)

#sample2
plot(sample2$age, sample2$height_feet, main="Sample2", xlab="Age", 
ylab="Height (feet)", pch=19) 
textxy(sample2$age, sample2$height_feet,sample2$person, cex=0.9)

With the following result:

[Figure: scatter plots of Sample1 (height in cm) and Sample2 (height in feet) against age]

One can see there is little or no difference whether height is in centimeters or in feet. So what the authors of the book Finding Groups in Data: An Introduction to Cluster Analysis propose in the chapter stating that “changing the measurement units may even lead one to see a very different clustering” simply does not hold water.

Excerpt from their book:

[Figures: two scatter plots excerpted from the book]

Both are carefully manipulated scatter plots constructed to support their theory. If you rescale the age axis between Figure 3 and Figure 4 and keep the height proportions the same on both graphs, you should not get such a “manipulated” picture.

Adding a line to both graphs and swapping the X-axis with the Y-axis, one still cannot produce such a huge difference!

#what if we switch X with Y, since the conversion of height gives a different ratio
library(car)  # scatterplot() comes from the car package

#sample1
plot(sample1$height_cm, sample1$age, main="Sample1", xlab="Height (cm)", 
ylab="Age", pch=19) 
textxy(sample1$height_cm,sample1$age,sample1$person, cex=0.9)

#sample2
plot(sample2$height_feet,sample2$age, main="Sample2", xlab="Height (feet)", 
ylab="Age", pch=19) 
textxy(sample2$height_feet,sample2$age,sample2$person, cex=0.9)

#additional sample
scatterplot(age ~ height_cm, data=sample1,xlab="Height (cm)", ylab="age",
main="Sample 1", labels = person)
scatterplot(age ~ height_feet, data=sample2,xlab="Height (feet)", ylab="age",
main="Sample 2", labels = person)

 

Comparing both graphs again reveals the same results:

[Figure: scatter plots with axes swapped, Sample1 vs. Sample2]

So far so good. But what if we are doing distance-based clustering? Well, the story changes drastically. For the sake of the example, I will assume there are two clusters and run k-means, where each of the n observations belongs to the cluster with the nearest mean. For a given set of observations, k-means partitions them into sets so as to minimize the distance between each point in a cluster and its center. The quantity being minimized is the within-cluster sum of squares (WCSS), which at the same time maximizes the separation between the cluster centers.
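For reference, the quantity being minimized can be written out and later checked against the kmeans() output. This is a minimal sketch of my own (the helper name wcss is mine, not part of the original script):

# within-cluster sum of squares (WCSS): for each cluster, the sum of squared
# Euclidean distances from its points to the cluster centroid
wcss <- function(data, clusters) {
  sum(sapply(unique(clusters), function(k) {
    pts      <- as.matrix(data[clusters == k, , drop = FALSE])
    centroid <- colMeans(pts)
    sum(sweep(pts, 2, centroid)^2)
  }))
}
# after fitting, wcss(sample1, fit1$cluster) should match fit1$tot.withinss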

In step 1 we will use the columns Age and Height and observe the results. In step 2 we will also add Weight and observe the results.

##########
#step 1
##########
sample1 <- data.frame(age, height_cm)
sample2 <- data.frame(age, height_feet)

# sample1
fit1 <- kmeans(sample1, 2) 
# sample2
fit2 <- kmeans(sample2, 2) 

#compare
fit1
fit2

##########
#step 2
##########
sample1 <- data.frame(age, height_cm, weight_kg)
sample2 <- data.frame(age, height_feet, weight_stone)

# sample1
fit1 <- kmeans(sample1, 2) 
# sample2
fit2 <- kmeans(sample2, 2) 

#compare
fit1
fit2

In step 1, the within-cluster sums of squares for dataset Sample1 are 200 and 327, whereas for Sample2 they are 0.33 and 0.23. Calculating between_SS / total_SS gives 47% for Sample1 and 99% for Sample2, a difference that calls for prior normalization.
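These figures can be read directly from the fitted kmeans objects, for example:

# per-cluster WCSS and the between_SS / total_SS ratio quoted above
fit1$withinss
fit1$betweenss / fit1$totss
fit2$withinss
fit2$betweenss / fit2$totss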

In step 2 the same difference is present, although relatively smaller: between_SS / total_SS is 63.1% for Sample1 and 92.4% for Sample2.

In both cases (step 1 and step 2), the same observations would end up belonging to different cluster centers.

Now we will introduce normalization of both weight and height, since both can be measured in different units.

#~~~~~~~~~~~~~~~~~~~
# normalizing data
# height & weight
#~~~~~~~~~~~~~~~~~~~
library(clusterSim)  # data.Normalization() comes from the clusterSim package

height_cm_z   <- data.Normalization(sample1$height_cm,type="n1",normalization="column")
height_feet_z <- data.Normalization(sample2$height_feet,type="n1",normalization="column")

weight_kg_z   <- data.Normalization(sample1$weight_kg,type="n1",normalization="column")
weight_stone_z <- data.Normalization(sample2$weight_stone,type="n1",normalization="column")
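
If clusterSim is not at hand, the same kind of standardization (type "n1", i.e. subtracting the mean and dividing by the standard deviation) can be reproduced with base R's scale(); a sketch, with the _z2 names being my own:

# z-score standardization with base R; should agree with the n1 normalization above
height_cm_z2   <- as.numeric(scale(sample1$height_cm))
height_feet_z2 <- as.numeric(scale(sample2$height_feet))

# once standardized, the measurement unit no longer matters: the two vectors
# should be (almost) identical, up to rounding in the feet values
all.equal(height_cm_z2, height_feet_z2, tolerance = 0.05)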

##########
#step 1
##########
sample1 <- data.frame(age, height_cm_z)
sample2 <- data.frame(age, height_feet_z)


# sample1
fit1 <- kmeans(sample1, 2) 
# sample2
fit2 <- kmeans(sample2, 2) 

#compare
fit1
fit2


##########
#step 2
##########
sample1 <- data.frame(age, height_cm_z, weight_kg_z)
sample2 <- data.frame(age, height_feet_z, weight_stone_z)

# sample1
fit1 <- kmeans(sample1, 2) 
# sample2
fit2 <- kmeans(sample2, 2) 

#compare
fit1
fit2

 

We have now shown that both data samples return the same result: when the data is normalized, the clustering vectors of the two samples are essentially the same, with only minor differences possible. In the end, the fitted clustering returns the same results for both.
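Since k-means labels the clusters arbitrarily, one way to verify this agreement is to cross-tabulate the two clustering vectors:

# with matching partitions, all counts fall on one "diagonal" of the table
# (the labels 1/2 themselves may be swapped between the two fits)
table(cm = fit1$cluster, feet = fit2$cluster)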

Going back to the original question of whether the measurement unit can give different results: in terms of plotting the data, we have shown this to be wrong; in terms of cluster analysis we have done only half of the proof. As a reverse check, we will test, for Sample1, the original data against the normalized data. Here we will see the importance of design size, the test of variance and variance inequality.

#~~~~~~~~~~~~~~~~~~~
#
# comparing normalized 
# and non-normalized
# data sample
#~~~~~~~~~~~~~~~~~~~

#sample1 with cm
sample1 <- data.frame(age, height_cm)
sample1_z <- data.frame(age, height_cm_z)

#sample1 with feet
sample2 <- data.frame(age, height_feet)
sample2_z <- data.frame(age, height_feet_z)

# design size
var(sample1$height_cm)
var(sample1_z$height_cm_z)

var(sample2$height_feet)
var(sample2_z$height_feet_z)

#test variance of sample1 height and sample2 height (non-normalized)
var.test(sample1$height_cm, sample2$height_feet)

#test variance of sample1 height and sample2 height (normalized)
var.test(sample1_z$height_cm_z, sample2_z$height_feet_z)

The test of equality of variances between non-normalized and normalized data shows that the normalized data outperforms the non-normalized. It reveals that in distance-based clustering, the variable with smaller values (here feet compared with centimeters) will produce a bigger design effect, so normalization is recommended. In the second test the ratio of the variances equals 1, meaning the variances are equal (with p-value = 1).

[Figure: var.test output showing the ratio of variances and the p-value]

To finalize the test, let's rerun the fit for clustering with 2 clusters.

#comparison of Fit function for clustering with 2 clusters (on Sample1)
# sample1
fit1 <- kmeans(sample1, 2) 
fit1_z <- kmeans(sample1_z, 2) 

#compare
fit1
fit1_z

The result is:

#non-normalized
Clustering vector: [1] 1 1 2 2 1
Within cluster sum of squares by cluster: [1] 327.3333 200.5000 (between_SS / total_SS =  48.7 %)

#normalized
Clustering vector: [1] 2 1 2 1 2
Within cluster sum of squares by cluster: [1] 1.605972 2.286963  (between_SS / total_SS =  99.2 %)

To summarize: the result of the normalized clustering is much better, since between_SS / total_SS is higher, meaning that far more of the total variance in the data set is explained by the clustering. We can say that cluster membership based on the normalized height is much better at minimizing the differences within each group while maximizing them between the groups.

But to make things more complicated: you should also normalize age, and then you will get completely different cluster centers 🙂
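As a small illustration of that last remark (my own sketch, not part of the original script), standardizing every column, age included, with scale() and re-fitting:

# standardize age, height and weight together, then re-run k-means;
# with all variables on a comparable scale the cluster centres shift again
sample1_all_z <- as.data.frame(scale(data.frame(age, height_cm, weight_kg)))
fit_all_z     <- kmeans(sample1_all_z, 2)
fit_all_z$centers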

Complete code is available here: Effect of normalization of data.

Fear of advanced analytics

I have heard it too many times: “We have tried this and it is no good” or “We tried it and it is not working for us”, and several other excuses that have little or no direct relation to what has been done in the industry or what can be done.

Yes, I am pointing at advanced analytics. And by advanced I mean everything that stretches beyond frequency tables. Even though much of multivariate analysis and statistics is just frequency tables viewed from different angles, in general it is everything that goes beyond a pivot table in Excel. Besides bi- or multivariate statistics, advanced analytics can also be any mixture of multivariate statistics, data mining, sampling and probability theory, and the ever-more-popular statistical learning, machine learning, etc.

Given the many excuses, I generally find the following to be the biggest barriers when creating, establishing and managing any kind of advanced analytics.

1) Lack of knowledge is the primary barrier to accomplishing and establishing even a basic culture of analytics in an organization.

2) Fear of knowledge is closely accompanied by lack of knowledge. People (especially decision makers) are normally afraid of the unknown, due to a lack of rationality.

3) Ego, pride, prejudice... all the psychological aspects of letting someone else do “something” tricky that they themselves will not or might not understand.

4) Being fine with what we have is just another lame excuse. If you come to the decision makers and give them numbers, the argument “we are fine with what we have” will soon not hold water.

5) “There will be an insignificant lift.” But there will be a lift. Remember, if you can make a lift of 0.1% using advanced analytics compared to simple frequencies, it is still a lift. With a couple more steps you will get to 1%, 2% or maybe more. It is still a lift that mathematics and statistics are producing in your favor. So don't neglect it!

6) “We have tried it and it is not working.” You haven't tried enough. Sometimes you need to change a single parameter in your process and there you go. Think about it this way: it might also be that you didn't have the appropriate knowledge or understanding of the problem, and after letting it evolve through time you might be able to tackle it again.

7) “We don’t have enough data” or “We don’t have the right data”. A reason that is often used to justify a lot of decisions. But have you exhausted the data you have? It might be that a very popular algorithm or method is not applicable to your dataset, which means you have to try other methods. The quantity of data is also just a lame excuse. Once you have exhausted most of the possibilities with your current data, go and start collecting new data.

8) “The business model is too complex.” Well, break it into smaller and more manageable rules in order to apply any knowledge extracted from the data. Some legacy business rules (I like to call them boutique rules) usually cost more to implement and maintain than they give back in return. It is up to you to include or exclude them, but remember, extracting information from data is usually done to understand, in other words to reduce, the complexity behind it.

9) “We don’t want to pay for software.” OK, what is the next lame excuse?

10) It can also be a problem of persuading the right people and the right decision makers. When you do that, don't overcomplicate, don't go too deep into details and, most of all, try to stick to simple benefits. A SWOT analysis might be too cliché, but it does have several good aspects. And remember: build a prototype. A model. A use case. People like to see the trade-offs, and it sparks curiosity.

There are many other fears people tend to hide behind (as an excuse) to evade advanced analytics; I have listed most of them. And normally there is also a personal reason lurking behind the fact that departments and companies have difficulties moving toward more advanced and yet more efficient analytics. Compare cloud computing today with five years ago. What a leap.

If there were a Latin term for the fear of advanced analytics (or statistics), I would certainly name it.

Faux pas of data science

Data scientist has been the “sexiest” job of the past years, mostly thanks to all the buzzword bubble created around it. With the emergence of so-called Big Data came the need for a new term. It is very much like the creation of the phrase business intelligence (BI); some of you might still remember when it used to be called a decision support system (DSS). Regardless of the name, it evolved slowly alongside computer science and the entrance of computers into daily life.

I am fine with the DSS or BI naming; it still captures the gist of how and when the acquisition and transformation of raw data into meaningful and useful information can help support the business.

I am also fine with the slow evolution from decision support to research to data mining to machine learning to data science. For me, it is still just crunching the numbers, knowing mathematics and statistics, all the “non-fancy” stuff such as cleaning, normalizing and de-duplicating data, exploring and exploring some more, peer reviews, and diving into the data again, until coming to the “fancy” part of drawing conclusions and helping business people with their decisions.

What I am not fine with is the following:

  1. Data science combines all the standard practices and knowledge a statistician must know!
  2. Data science is sexy for the part of knowing and understanding the algorithms of multivariate statistics, for making predictions and for finding patterns in the data. This is sexy, but to get to this point one must be a mathematician/statistician with many years of experience. The rest is just crap! Assuring data quality (no business wants to hear that and nobody wants to do it; yet in reality, if your data is of poor quality, don't expect good-quality results), sitting countless hours with one or two variables to work out their behavior, correlation and causality, diving into the literature to find a smoothing algorithm that assures a better result, etc. Well, it is not really crap, but it is what the “buzzword” people don't really like to mention!
  3. With Big Data come big, big, big problems. Eventual consistency is probably the biggest lie ever (the abuse is similar to that of the statistical significance of the p-value); having inconsistent data is a big challenge. Big Data made a big promise that a lot of data scientists couldn't deliver (not for lack of knowledge, but usually for lack of time or money). Big Data never cared to look at the relational model; it was never meant for businesses to adopt in order to extract relevant information. But again, this was not the fault of data scientists, but of slowly adapting businesses. Stories about the 4 Vs (volume, velocity, variety, value) can be misleading, mainly because the technology of the 4 Vs is usually a separate story from the actual research and mining of the data (unless you are dealing with stream analysis or pushing new models into your business daily; but even week-old data will usually be sufficient to prove a point).
  4. Everyone wants to be a data scientist. Yes, and I want a pony. No, no, I want a rainbow unicorn. Being a data scientist takes dedication: reading piles of books full of formulas (usually hard to understand, but they actually make sense!), sitting with random data sets, switching between assorted mathematical/statistical/database/scripting programs and languages in order to, well, just prepare the data.
  5. All the new technologies are boosting the egos of non-data-scientists with the fake vision that a simple prediction of your company's sales can be done with a couple of clicks. I can't argue with that. My only question is: would the result of this five-minute drag-and-drop prediction be of any relevance? Or even correct?
  6. Everyone likes data scientists. But nobody likes statisticians. Or mathematicians. The former supposedly abuse the data and lie about the results, and the latter are philosophers with countless formulas proving the existence of life at the fifteenth decimal place. But the reality is: data scientist = statistician + mathematician. So get over it! I still vividly remember how, 20+ years ago, “data science” was neglected and its reputation was... well, it wasn't.
  7. “R and Python are the next best things I have to learn.” Well, don't, if you don't intend to use them. Go and learn something more useful. Spanish, for example. R has been around in the community for decades; it wasn't invented just recently. So has Python. And we have been using both to support business decisions. If you would like to learn R, ask yourself: 1) Do I know any statistics? and 2) Can I explain the difference between Naive Bayes and the Pearson correlation coefficient? If you answer both in the negative, I suggest you start learning Spanish.
  8. Programming is, in many aspects, very close to the theory of statistics. Sampling, for example, is one of those areas where good programming knowledge will boost your abilities in data sampling and different approaches to probability theory.
  9. Salaries are relative. Data scientists can earn very good salaries, especially those who are able to combine a) knowledge of statistics/mathematics with b) computer literacy (programming, data manipulation) and c) a very good understanding of business processes. A lot of the knowledge and understanding comes from experience and repetitive work; the rest comes with determination and intelligence.
  10. It is hard to be a data scientist in a mid-size to big company, but much easier in a small one or as a freelancer.

So next time you use the term data science or data scientist, or you label yourself as one, keep in mind a couple of the points above. And unless you have done some kind of research for years and still get a kick out of it, please don't call it a sexy job. You might offend someone.

SQL Saturday Vienna 2016

On the first of April 2016, SQL Saturday Vienna took place; despite the fact that it was Friday, the first of April, nobody ended up being an April fool. The Austrian SQL community and Microsoft Austria did a great job hosting and bringing in a lot of attendees keen on getting new information and knowledge, with a great agenda, making this the third SQL Saturday Vienna.

My personal selection of trolling photos.

A good dessert after the evening of dining out with all the speakers.

20160331_211706_resized

Miloš Radivojevic (@MilosSQL) impersonating someone with a rather sexy Sonnenbrille during the event's lunch.

20160401_125541_resized

And a smile-only photo with Magie.

20160401_105452_resized