on 27 SEP by Nathan Piccini, Marketing Manager
What are Data Scientists?
How do you distinguish a genuine data scientist from a dressed-up business analyst, BI professional, or someone in another related role?
Truth be told, the industry does not have a standard definition of a data scientist. You’ve probably heard jokes like “a data scientist is a data analyst living in Silicon Valley”.
Finding an “effective” data scientist is difficult. Finding people in the role of a data scientist can be equally difficult. Note the use of “effective” here. I use this word to highlight the fact that there are people who possess some data science skills yet may not be the best fit for a data science role. The irony is that even the people looking to hire data scientists might not fully understand data science. There are still job advertisements in the market that describe a traditional data analyst or business analyst role while labeling it a “Data Scientist” position.
Instead of giving a list of data science skills with bullet points, I will highlight the difference between some of the data-related roles.
Consider the following scenario:
Shop-Mart and Bulk-Mart are two competitors in the retail setting. Someone high up in the management chain asks this question: “How many Shop-Mart customers also go to Bulk-Mart?” Replace Shop-Mart and Bulk-Mart with WalMart, Target, Safeway, or any retail outlets that you know of. The question might be of interest to the management of one of these stores or even a third party. The third party could be a market research or consumer behavior company interested in gathering actionable insights about consumer behavior.
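To make the question concrete, here is a minimal sketch of the overlap calculation in Python. The customer IDs and the `overlap_summary` helper are made up for illustration; real data would need matching identifiers across both stores, which is often the hard part.

```python
# Hypothetical customer IDs. In practice these might come from loyalty
# programs, payment data, or a third-party consumer panel.
shop_mart_customers = {"c001", "c002", "c003", "c004", "c005"}
bulk_mart_customers = {"c003", "c005", "c006", "c007"}


def overlap_summary(store_a, store_b):
    """Return (number of shared customers, share of store_a they represent)."""
    shared = store_a & store_b  # set intersection: customers seen at both
    share = len(shared) / len(store_a) if store_a else 0.0
    return len(shared), share


shared_count, share_of_shop_mart = overlap_summary(
    shop_mart_customers, bulk_mart_customers
)
print(f"{shared_count} shared customers ({share_of_shop_mart:.0%} of Shop-Mart)")
```

The raw count answers the question as asked. Each role described next would take the analysis further from this starting point.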
How Professionals in Different Data-Related Roles will Approach the Problem
Traditional BI/Reporting Professional: The BI professional generates reports from structured data using SQL and some kind of reporting services (SSRS for example) and sends the data back to management. Management asks more questions based on the data that was sent, and the cycle continues. Insights about the data are most likely not included in the reports. A person in this role will be experienced mostly in database-related skills.
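As a rough illustration of the kind of query behind such a report, the sketch below runs SQL against an in-memory SQLite database from Python. The `visits` table, its columns, and the sample rows are assumptions made for this example, not a real schema.

```python
import sqlite3

# In-memory database with a hypothetical visits table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (customer_id TEXT, store TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("c001", "Shop-Mart"),
        ("c002", "Shop-Mart"),
        ("c002", "Bulk-Mart"),
        ("c002", "Shop-Mart"),  # repeat visit, must not be double counted
        ("c003", "Shop-Mart"),
        ("c003", "Bulk-Mart"),
        ("c004", "Bulk-Mart"),
    ],
)

# Count customers who appear at both stores at least once.
query = """
SELECT COUNT(*) FROM (
    SELECT customer_id
    FROM visits
    WHERE store IN ('Shop-Mart', 'Bulk-Mart')
    GROUP BY customer_id
    HAVING COUNT(DISTINCT store) = 2
) AS shoppers_at_both
"""
both_stores = conn.execute(query).fetchone()[0]
print(both_stores)
conn.close()
```

The report stops at the number. Why the overlap exists, or what to do about it, is left to the reader.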
Data Analyst: In addition to doing what the BI professional does, a data analyst will also keep other factors like seasonality, segmentation, and visualization in mind. What if certain trends in shopping behavior are tied to seasonality? What if the trends are different across gender, demographics, geography, or product category? A data analyst will slice and dice the data to understand and annotate the report. Aside from database skills, a data analyst will have an understanding of some of the common visualization tools.
Business Analyst: A business analyst possesses the skills of the BI professional and the data analyst, plus domain knowledge and an understanding of the business. A business analyst may also have some basic skills in forecasting.
Data Mining or Big Data Engineer: A data miner does the job of the data analyst, possibly from unstructured data if needed, and also has MapReduce and other big data skills. An understanding of common issues in running jobs on large-scale data and in debugging MapReduce jobs is needed.
Statistician (a Traditional One): A statistician pulls data from a database or obtains it from any of the roles mentioned above and performs statistical analysis. This person ensures the quality of data and correctness of the conclusions by using standard practices like choosing the right sample size, confidence level, level of significance, type of test, and so on.
In the past, statisticians traditionally did not come from a computer science background, which is needed for writing code to implement statistical models. The situation has changed: statistics students now graduate with strong programming skills and a decent foundation in computer science. This enables them to perform tasks that earlier statisticians were not trained for.
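Two of the standard practices mentioned above, choosing a sample size and reporting a confidence interval, can be sketched with Python's standard library. This is a simplified illustration that assumes a simple random sample and the normal approximation for a proportion; the function names and the survey numbers are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin_of_error, confidence=0.95, p=0.5):
    """Smallest n so the interval half-width is at most margin_of_error.

    p=0.5 is the most conservative guess when the true proportion is unknown.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)


def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# How many Shop-Mart customers to survey for a +/- 5 point margin at 95%?
needed = sample_size_for_proportion(0.05)

# Suppose 120 of 400 surveyed customers said they also shop at Bulk-Mart.
low, high = proportion_confidence_interval(120, 400)
print(needed, round(low, 3), round(high, 3))
```

For small samples or proportions near 0 or 1, a statistician would likely pick a different interval method than this one.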
Program/Project Manager: The program or project manager looks at all the data provided by the professionals mentioned so far, aligns these findings with the business, and influences the leadership to take appropriate action. This person possesses communication and presentation skills and can influence without authority.
Ironically, a PM is influencing business decisions using the data and insights provided by others. If the person does not have a knack for understanding data, chances are that they will not be able to influence others to make the best decisions.
Putting It All Together
The rise of online services has brought a paradigm shift in the software development life cycle and in how businesses iterate over successive features and products. Having a separate data puller, analyst, statistician, and project manager is just not possible anymore. Now the mantra is: ship, experiment, learn, adapt, and repeat. This situation has resulted in the birth of a new role, the data scientist.
A data scientist should have the skills of all the individuals mentioned so far. In addition to the skills mentioned above, a data scientist should have rapid prototyping and programming, machine learning, visualization, and hacking skills.
Domain Knowledge and Soft Skills Are Equally Important As Technical Skills
The importance of domain knowledge and soft skills, like communication and influencing without authority, is severely underestimated by both hiring managers and aspiring data scientists. Insights without domain knowledge can potentially mislead the consumers of these insights. Correct insights without the ability to influence decision making are just as bad as having no insights.
All of what I have said above is based on my own tenure as a data scientist at a major search engine and later with the advertising platform within the same company. I learned that sometimes people asking the question may not understand what they want to know. This sounds preposterous yet it happens way too often. Very often a bozo will start digging into something that is not related to the issue at hand just to prove that he/she is relevant. A data scientist encounters such HIPPOs (Highly Paid Person’s Opinions) that are somewhat unrelated to the problem and are very often a big distraction from the problem at hand.
A data scientist should possess the right soft skills to manage situations such as people asking irrelevant, distracting questions that are outside the scope of the task at hand. This is hard, especially in situations where the person asking the question is several levels up the corporate ladder and is known to have an ego. It is a data scientist’s responsibility to manage up and around while presenting and communicating insights.
Suggested Skills a Data Scientist Should Possess
Curiosity About Data and Passion For Domain: If you are not passionate about the domain or business, and if you are not curious about data, then it is unlikely that you will succeed in a data scientist role. If you are working as a data scientist with an online retailer, you should be hungry to crunch and munch from the smorgasbord (of data of course) to know more. If your curiosity does not keep you awake, no skill in the world can help you succeed.
Soft Skills: Communication and influencing without authority are necessary skills. Understand the minimum action that has the maximum impact. Too many findings are as bad as no findings at all. The ability to scoop information out of partners and customers, even from the unwilling ones, is extremely important. The data you are looking for may not be sitting in one single place. You may have to beg, borrow, steal, and do whatever it takes to get the data.
Being a good storyteller is also something that helps. Sometimes the insights obtained from data are counter-intuitive. If you’re not a good storyteller, it will be difficult to convince your audience.
Math/Theory: Machine Learning algorithms, statistics, and probability 101 are fundamental to data science. This includes understanding probability distributions, linear regression, statistical inference, hypothesis testing, and confidence intervals. Learning optimization, such as gradient descent, would be the icing on the cake.
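To show how two of these topics connect, here is a minimal sketch of gradient descent fitting a simple linear regression in plain Python. The data, learning rate, and step count are chosen for illustration only; in practice you would normally use a library or the closed-form solution.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = slope * x + intercept by minimizing mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data generated from y = 2x + 1, with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

A learning rate that is too large makes the updates diverge, and one that is too small makes them crawl, which is a good first thing to experiment with.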
Computer Science/Programming: You should know at least one scripting language or a statistical tool such as R. There are plenty of resources to get started. Data Science Dojo provides numerous free tutorials on getting started with Python and R to go along with its data science bootcamp. You can also learn programming basics from sites like Codecademy and LearnPython.
It’s necessary to possess decent algorithms and data structures skills in order to write code that can analyze a lot of data efficiently. You may not be a production code developer, but you should be able to write decent code.
Database management and SQL skills are also helpful, as this is where you will be fetching your data to build models. It also doesn’t hurt to understand Microsoft Excel or another spreadsheet software.
Big Data and Distributed Systems: You need to understand basic MapReduce concepts, Hadoop and the Hadoop Distributed File System (HDFS), and at least one language like Hive or Pig. Some companies have their own proprietary implementations of these languages. Knowledge of tools like Mahout and of cloud platforms offering XaaS, such as Azure and AWS, would be helpful. Once again, big companies have their own XaaS, so you may be working on variants of any of these.
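The basic MapReduce idea can be illustrated in a single Python process. This sketch is not the Hadoop API. The `map_phase`, `shuffle`, and `reduce_phase` names and the sample records are hypothetical, and a real framework would run these phases in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical visit records: (customer_id, store).
records = [
    ("c001", "Shop-Mart"),
    ("c002", "Shop-Mart"),
    ("c002", "Bulk-Mart"),
    ("c003", "Bulk-Mart"),
    ("c001", "Shop-Mart"),
    ("c004", "Shop-Mart"),
    ("c004", "Bulk-Mart"),
]


def map_phase(record):
    """Emit (key, value) pairs; here the key is the customer."""
    customer_id, store = record
    yield customer_id, store


def shuffle(pairs):
    """Group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(customer_id, stores):
    """Collapse one customer's visits into a count of distinct stores."""
    return customer_id, len(set(stores))


mapped = [pair for record in records for pair in map_phase(record)]
reduced = [reduce_phase(key, values) for key, values in shuffle(mapped).items()]
both = sorted(cid for cid, n_stores in reduced if n_stores == 2)
print(both)
```

Because each key is reduced independently, the same logic can be spread over many workers, which is what makes the model suitable for large-scale data.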
Data Visualization: Possess the ability to create simple yet elegant and meaningful visualizations. Personally, R packages like ggplot2, lattice, and others have helped me in most cases, but there are other packages that you can use. In some cases, you might want to use D3.
There are many skills needed to become a full-fledged data scientist. In reality, a data scientist should be a well-rounded data machine with the skills to take on just about any project. It may take years for you to learn all the concepts, and even longer to master them. Make sure you are able to check off each of the skills listed above, and you’ll be well on your way to data science stardom.