Data scientist has been the “sexiest” job for the past few years, mostly thanks to the buzzword bubble created around it. With the emergence of so-called Big Data came the need for a new term. This is very similar to how the phrase business intelligence (BI) was coined. Some of you might still remember that it used to be called a decision support system (DSS). Regardless of the name, the field evolved slowly alongside computer science and the arrival of computers in daily life.
I am fine with either name, DSS or BI; both capture the gist of how and when the acquisition and transformation of raw data into meaningful and useful information can help support a business.
I am also fine with the slow evolution from decision support to research to data mining to machine learning to data science. For me, it is still just crunching numbers and knowing mathematics and statistics: all the “non-fancy” stuff such as cleaning, normalizing, and de-duplicating data, then exploring and exploring some more, going through peer reviews, and diving into the data again, until you reach the “fancy” part of drawing conclusions and helping business people with their decisions.
What I am not fine with is the following:
- Data science combines all the standard practices and knowledge a statistician must know!
- Data science is sexy for the part about knowing and understanding the algorithms for multivariate statistics, for making predictions, and for finding patterns in the data. This is sexy, but to get to this point, one must be a mathematician/statistician with many years of experience. The rest is just crap! Assuring data quality (no business wants to hear about it, and nobody wants to do it; in reality, if your data is of poor quality, don’t expect good-quality results), sitting for countless hours with one or two variables to figure out their behavior, correlation, and causality, diving into the literature to find a smoothing algorithm that gives a better result, and so on. Well, this is not really crap, but it is usually what the “buzzword” people don’t like to mention!
- With Big Data come big, big, big problems. Eventual consistency is probably the biggest lie ever (the abuse is similar to that of the statistical significance of the p-value). Having inconsistent data is a big challenge. Big Data made a big promise on which a lot of data scientists couldn’t deliver (not for lack of knowledge, but usually for lack of time or money). Big Data never cared to look into the relational model. It was never meant for businesses to adopt in order to extract relevant information. But again, this was not the fault of data scientists, but of slowly adapting businesses. Stories about the 4Vs (volume, velocity, variety, value) can be misleading, mainly because the technology behind the 4Vs is usually a separate story from the actual research and mining of data (unless you are dealing with stream analysis or pushing new models into your business daily; otherwise, even week-old data will be sufficient for proving a point).
- Everyone wants to be a data scientist. Yes, and I want a pony. No, no. I want a rainbow unicorn. Being a data scientist takes dedication: reading piles of books with formulas (usually hard to understand, but they actually make sense!), sitting with random data sets, and switching between random mathematical/statistical/database/scripting programs and languages in order to, well, just prepare the data.
- All the new technologies are boosting the egos of non-data-scientists with the fake vision that a simple prediction of your company’s sales can be done with a couple of clicks. I can’t argue with that. My only question is: would the result of this five-minute drag-and-drop prediction be of any relevance? Or correct?
- Everyone likes data scientists. But nobody likes statisticians. Or mathematicians. The former are usually abusive toward data and lie about the results, and the latter are philosophers with countless formulas proving the existence of life to the fifteenth decimal place. But the reality is, data scientist = statistician + mathematician. So get over it! I still vividly remember how, 20+ years ago, the “data science” of the time was neglected and its reputation was… well, it wasn’t.
- R and Python are the next best thing I have to learn. Well, don’t, if you don’t intend to use them. Go and learn something more useful. Spanish, for example. R has been in the community for the past 30+ years; it wasn’t invented just recently. Neither was Python. And we have been using both for the purpose of supporting business decisions. If you would like to learn R, ask yourself: 1) Do I know any statistics? and 2) Can I explain the difference between Naive Bayes and the Pearson correlation coefficient? If your answer to both is negative, I suggest you start learning Spanish.
- Programming is in many respects very close to statistical theory. Sampling, for example, is one of those areas where good programming knowledge will boost your abilities in data sampling and in different approaches to probability theory.
- Salaries are relative. Data scientists can get very good salaries, especially those who are able to combine a) knowledge of statistics/mathematics with b) computer literacy (programming, data manipulation) and c) a very good understanding of business processes. A lot of the knowledge and understanding comes from experience and repetitive work, the rest from determination and intelligence.
- It is hard to be a data scientist in a medium-sized or big company! But it is much easier in a small one or as a freelancer.
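
To make the point about sampling and programming a bit more concrete, here is a minimal sketch in Python using only the standard library. The function names and the toy customer data are made up for illustration; it simply shows three common ways of drawing a sample (simple random, stratified, and reservoir sampling from a stream), and is not meant as the one right way to do it.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, k, seed=0):
    """Draw k rows without replacement, each row equally likely."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum (at least one row each)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir


# Hypothetical data: 900 "small" customers and 100 "large" ones.
customers = [("small", i) for i in range(900)] + [("large", i) for i in range(100)]

srs = simple_random_sample(customers, 50)
strat = stratified_sample(customers, key=lambda row: row[0], fraction=0.05)
res = reservoir_sample(iter(customers), 50)
```

The stratified version guarantees that the small "large" group is represented in proportion, which a simple random sample only does on average. The reservoir version is the kind of trick where programming knowledge helps most: it never needs to know how long the stream is or hold it in memory.
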
So next time you use the term data science or data scientist, or label yourself as one, keep in mind a couple of the points above. And unless you have done some kind of research for years and still get a kick out of it, please don’t call it a sexy job. You might offend someone.