Faux pas of data science

Data scientist has been the “sexiest” job of the past few years, mostly thanks to the buzzword bubble created around it. With the emergence of so-called Big Data came the need for a new term. This is very similar to the creation of the phrase business intelligence (BI). Some of you might still remember that it used to be called decision support system (DSS). Regardless of the name, the field evolved slowly alongside computer science and the entrance of computers into daily life.

I am fine with the DSS or BI naming; it still encapsulates the gist of how and when the acquisition and transformation of raw data into meaningful and useful information can help support business.

I am also fine with the slow evolution from decision support to research to data mining to machine learning to data science. For me, it is still just crunching the numbers and knowing mathematics and statistics: all the “non-fancy” stuff such as cleaning, normalizing and de-duplicating data, then exploring and exploring some more, then peer-to-peer reviews and diving into the data again, until you come to the “fancy” part of drawing conclusions and going to the business people to help them with their decisions.

What I am not fine with is the following:

  1. Data science combines all the standard practices and knowledge a statistician must know!
  2. Data science is sexy for the part of knowing and understanding the algorithms for multivariate statistics, for making predictions and for finding patterns in the data. This is sexy, but to get to this point, one must be a mathematician/statistician with many years of experience. The rest is just crap! Assuring data quality (no business wants to hear about it and nobody wants to do it, but in reality, if your data is of poor quality, don’t expect good-quality results), sitting countless hours with one or two variables and working out their behavior, correlation and causality, diving into the literature to find a smoothing algorithm that gives a better result, etc. Well, this is not really crap, but it is usually what the “buzzword” people don’t like to mention!
  3. With Big Data come big, big, big problems. Eventual consistency is probably the biggest lie ever (the abuse is similar to the abuse of the p-value as a measure of statistical significance). Having inconsistent data represents a big challenge. Big Data made a big promise on which a lot of data scientists couldn’t deliver (not for lack of knowledge, but usually for lack of time or money). Big Data never cared to look into the relational model. It was never meant for businesses to adopt in order to extract relevant information. But again, this was not the fault of data scientists, but of slowly adapting businesses. Stories about the 4 Vs (volume, velocity, variety, value) can be misleading, mainly because the technology behind the 4 Vs is usually a separate story from the real research and mining of data (unless you are dealing with stream analysis or pushing new models into your business daily; but even week-old data will be sufficient for proving a point).
  4. Everyone wants to be a data scientist. Yes, and I want a pony. No, no. I want a rainbow unicorn. Being a data scientist is dedication: it is reading piles of books with formulas (usually hard to understand, but they actually make sense!), sitting with random data sets, and switching between random mathematical/statistical/database/scripting programs and languages in order to, well, just prepare the data.
  5. All the new technologies are boosting the ego of non-data-scientists with the fake vision that a simple prediction of your company’s sales can be done with a couple of clicks. I can’t argue with that. My only question is: would the result of this five-minute drag-and-drop prediction be of any relevance? Or correct?
  6. Everyone likes data scientists. But nobody likes statisticians. Or mathematicians. The former are usually seen as abusive toward data and as lying about the results, and the latter as philosophers with countless formulas proving the existence of life to the fifteenth decimal place. But the reality is, data scientist = statistician + mathematician. So get over it! I still vividly remember how, 20+ years ago, “data science” was neglected and its reputation was… well, it wasn’t.
  7. R and Python are the next best thing I have to learn. Well, don’t, if you don’t intend to use them. Go and learn something more useful. Spanish, for example. R has been around for decades (it dates from the early 1990s, and its predecessor S from the 1970s); it wasn’t invented just recently. The same goes for Python. And we have been using both for the purpose of supporting business decisions. If you would like to learn R, ask yourself: 1) Do I know any statistics? and 2) Can I explain the difference between Naive Bayes and the Pearson correlation coefficient? If you answer both in the negative, I suggest you start learning Spanish.
  8. Programming is in a lot of aspects very close to the theory of statistics. Sampling, for example, is one of those areas where good programming knowledge will boost your abilities in data sampling and in different approaches to probability theory.
  9. Salaries are relative. Data scientists can get very good salaries, especially those who are able to combine a) knowledge of statistics/mathematics with b) computer literacy (programming, data manipulation) and c) a very good understanding of business processes. A lot of knowledge and understanding comes from experience and repetitive work, the rest from determination and intelligence.
  10. It is hard to be a data scientist in a medium-sized or big company! But it is much easier in a small one or as a freelancer.
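
To make point 8 a bit more concrete, here is a minimal sketch, assuming plain Python and its standard `random` module, of reservoir sampling: drawing a uniform random sample of k items from a stream whose length is not known in advance. The function name `reservoir_sample` and its parameters are illustrative names, not taken from any particular library, and this is only one of many possible sampling approaches.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # a seed makes the sample reproducible
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a randomly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Draw 5 values from a stream of 100,000 without holding it all in memory.
sample = reservoir_sample(range(100_000), k=5, seed=42)
print(sample)
```

The statistics (every item ends up in the sample with equal probability) and the programming (a single pass, constant memory) are really the same idea seen from two sides.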

So next time you use the term data science or data scientist, or you label yourself as one, keep in mind a couple of the points above. And unless you have done some kind of research for years and still get a kick out of it, please, don’t call it a sexy job. You might offend someone.


2 thoughts on “Faux pas of data science”

  1. Good post!

    So my personal journey was a lot of maths, mainly pure, then getting my degree in Philosophy. I always worked in data & analysis heavy roles, taking on more responsibility and scope of work. A few years back I had to buff my stats when I built some models for predicting default etc. This was when I learnt R. I studied more around R & stats and continued using R and building models.

    I’m now a “Lead Data Scientist” and I like to poke fun at myself by wearing jeans, a geeky t-shirt and a blazer so I “look the part”, but I can’t yet bring myself to go Mac. My stats are stronger than those of people in BI and I know enough to hire the next people in, who will be stronger on the modelling side, but my main focus is developing initial infrastructure to support data science within the company, building the first models, and building a team of data scientists.

    I would suggest that people who don’t want to be data scientists but want to do BI better should learn R (or Python): it is a fantastic data analysis tool, providing analysts with the means to achieve more in their day jobs through scripted and reproducible data manipulation and data visualisations.

    I don’t think everyone should be a data scientist, and I’m still very tongue-in-cheek about my own status as a Data Scientist, but I do think more BI people should be learning R (and if they learn some stats along the way then woohoo) as it can really help them do their jobs better.


  2. Thank you, Steff, for sharing your experience and for drawing a line between a data scientist and a BI person. The two roles can in many ways be very similar, but the main difference is in the ways and methods they use in order to draw a conclusion.

