Super-charge your Data Science method - the importance of a hypothesis

January 11, 2020 Niall Napier

It's been some time since I was in an academic setting, and even longer since I dusted off my lab book and documented the hypothesis, method, and conclusion of the prescribed experiment of the day. I never excelled at lab classes, and yet now, years later and in quite a different setting, I find myself drawn to the scientific method. In fact, as machine learning has come to dominate the world of data science over the last few years, it feels like the scientific method has had a resurgence in popularity.

Recently I was fortunate to be invited to a nearby university to view some of the projects their analytics students had been working on over the past semester. As well as being struck by the enthusiasm and creativity of their projects, I was pleasantly surprised by the balance between scientific method and the applied nature of their work. Their hypotheses were formed from problems that companies everywhere are trying to solve for their own customers:

Interpreting review text to provide better recommendations;
Identifying objects from space to make life at ground level safer;
Using government data to optimize capacity issues for public bodies.

Each presentation started with a hypothesis. The method varied from project to project, and was often redefined and reinvented along the way, but each presentation ended with a conclusion on the results observed. I couldn't help but form my own follow-up hypotheses and questions for the students, challenging the source data and exploring the context of the problem for new ideas.

In the day-to-day, it's easy to get lost in the data and subsequently misinterpret results. Exploratory analysis can be wonderfully liberating and can prompt ideas and lines of inquiry that would never have existed otherwise. In isolation, however, it risks wandering aimlessly through the data without any sense of direction. The hypothesis in each presentation helped the students to stay focused on a clear objective and to measure success, regardless of the outcome.

Measuring success bears repeating. One of the things I've seen analytics teams struggle with is how to demonstrate their contribution to the business. Sometimes this contribution gets lost when a path of investigation doesn't yield positive results. In fact, there is often still a lot of value in the work conducted: being able to show that the original direction wasn't the optimal one, and that a course correction or a new approach is needed, can be just as important as an analysis that proves you got things right the first time. Setting the question out in a hypothesis and documenting the results can help demonstrate where time and effort were spent and what was learned over the course of the project.

Having a hypothesis can be the difference between performing exploratory analysis and conducting data science. Both are important in understanding the data landscape and deriving insight, but they are very different ways of looking at the data.

In a recent seminar, Cassie Kozyrkov reminded the audience that if we're conducting real data science then being able to independently test the hypothesis is key. Sometimes that means withholding not only a test set of data, but also a secret test set that even the scientists are not allowed to see, something to prevent them from gaming the experiment to provide optimal outcomes. It's a common trick used by Kaggle and other data science competitions to give participants enough data to experiment with, but not so much that they can cheat in the contest.
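To make the idea of a secret test set a little more concrete, here is a minimal Python sketch of a three-way split. It is only an illustration under my own assumptions: the function name split_with_holdout, the default fractions, and the fixed seed are all hypothetical, and it is not taken from the seminar or from how Kaggle actually partitions its data.

```python
import random


def split_with_holdout(records, test_frac=0.2, secret_frac=0.1, seed=0):
    """Shuffle records and split them into train, test, and secret holdout sets.

    The function name, fractions, and seed are illustrative assumptions.
    """
    if test_frac < 0 or secret_frac < 0 or test_frac + secret_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_secret = round(len(shuffled) * secret_frac)
    n_test = round(len(shuffled) * test_frac)

    # The secret set is carved off first so it can be handed to someone
    # outside the analysis team before any modelling starts.
    secret = shuffled[:n_secret]
    test = shuffled[n_secret:n_secret + n_test]
    train = shuffled[n_secret + n_test:]
    return train, test, secret


# Example usage with 100 dummy record ids.
train, test, secret = split_with_holdout(range(100))
print(len(train), len(test), len(secret))
```

In practice the secret portion would be stored somewhere the analysts cannot reach and only used once, to check the final conclusion against the original hypothesis.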

Whatever your motives, take a step back and ask ‘how will we know when we're done?’

How do you track value when performing your analysis? Do hypotheses feature in your approach or do you use a different technique? Let me know in the comments below, or on Twitter @eageranalyst.