After a lot of prodding, we’ve finally decided to open-source some of the conversations between professional data scientists and their mentees on SharpestMinds’ internal Slack. We hope this will let more people benefit from the expertise on SharpestMinds, even if they don’t have access to the community directly. You can view an earlier Slack conversation we published here.
When most people start learning data science, the data they work with is time-independent. They predict the survival probabilities of passengers on the Titanic, identify hand-written characters in the MNIST dataset, or carry out some other, similar task.
To solve these problems, you usually begin by *randomly* assigning each of your samples to one of two different datasets: one that will be used to train your model, and another that will be used to validate its performance. That validation step is important because it’s what allows you to make claims like, “I expect this prediction to be accurate to within 10% of the true value, 19 times out of 20.”
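To make that concrete, here’s a minimal sketch of a random split followed by a “19 times out of 20” style check. The toy data, the one-parameter model, and the helper names `random_split` and `coverage_within` are all made up for illustration; any real model and dataset would slot in the same way.

```python
import random

def random_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def coverage_within(y_true, y_pred, tolerance=0.10):
    """Fraction of predictions that land within `tolerance` (relative) of the truth."""
    hits = sum(abs(p - t) <= tolerance * abs(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy, time-independent data (an assumption): y is roughly 3 * x plus a little noise.
noise = random.Random(42)
data = [(x, 3.0 * x * (1 + noise.gauss(0, 0.04))) for x in range(1, 201)]

train, val = random_split(data)

# Stand-in "model": estimate the slope from the training set only.
slope = sum(y / x for x, y in train) / len(train)

val_true = [y for _, y in val]
val_pred = [slope * x for x, _ in val]
coverage = coverage_within(val_true, val_pred)
print(f"{coverage:.0%} of validation predictions are within 10% of the true value")
```

If that fraction comes out at 95% or higher on data the model never saw, the “19 times out of 20” claim is at least plausible for new samples drawn the same way.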
That doesn’t work for time series data, though: if you’re trying to predict seasonal effects, stock market fluctuations or customer churn behavior, you’ll quickly realize that randomly assigning data to training and validation sets destroys the information that was contained in the original dataset’s time ordering.
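The usual workaround is to split by time instead of at random, so the model only ever trains on the past and is validated on the future. Here’s a rough sketch of that idea; the function names `chronological_split` and `expanding_window_folds` are hypothetical, and the “series” is just a list of placeholder numbers.

```python
def chronological_split(series, validation_fraction=0.2):
    """Keep the earliest observations for training and the latest for validation."""
    n_val = int(round(len(series) * validation_fraction))
    cut = len(series) - n_val
    return series[:cut], series[cut:]

def expanding_window_folds(n_samples, n_folds, min_train):
    """Yield (train_indices, validation_indices) pairs that never look ahead."""
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))

# Placeholder for 20 time-ordered observations (oldest first).
series = list(range(100, 120))

train, val = chronological_split(series)
folds = list(expanding_window_folds(len(series), n_folds=3, min_train=8))

for train_idx, val_idx in folds:
    print(f"train on 0..{train_idx[-1]}, validate on {val_idx[0]}..{val_idx[-1]}")
```

Each fold trains on a longer stretch of history and validates on the block that comes right after it. If you use scikit-learn, `TimeSeriesSplit` in `sklearn.model_selection` produces folds in the same spirit.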
So how can you provide confidence intervals (or, more precisely, prediction intervals) for your time series predictions?
Or, as a SharpestMinds mentee recently asked on our internal Slack community:

Chiemi had previously found this approach, but it works only for gradient boosting regressors. She wanted a more general solution.
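To give a flavor of what “more general” can mean, here’s one model-agnostic idea sketched in plain Python. To be clear, this isn’t taken from the Slack thread: it’s a simple illustration that assumes your forecast errors behave roughly the same over time. You replay history one step at a time, collect the out-of-sample errors, and use their empirical quantiles as an interval around the next forecast. The names `walk_forward_errors`, `empirical_interval` and `naive_forecast` are made up, and the naive “repeat the last value” forecaster is only a stand-in for whatever model you actually use.

```python
import math
import random

def walk_forward_errors(series, forecast_fn, n_initial):
    """One-step-ahead errors, where each forecast only sees data before time t."""
    errors = []
    for t in range(n_initial, len(series)):
        errors.append(series[t] - forecast_fn(series[:t]))
    return errors

def empirical_interval(point_forecast, errors, coverage=0.95):
    """Shift the point forecast by the lower and upper empirical error quantiles."""
    ordered = sorted(errors)
    tail = (1.0 - coverage) / 2.0
    lo_idx = math.floor(tail * (len(ordered) - 1))
    hi_idx = math.ceil((1.0 - tail) * (len(ordered) - 1))
    return point_forecast + ordered[lo_idx], point_forecast + ordered[hi_idx]

def naive_forecast(history):
    """Stand-in model (an assumption): tomorrow looks like today."""
    return history[-1]

# Toy series (an assumption): a random walk of 300 points.
rng = random.Random(0)
series = [0.0]
for _ in range(299):
    series.append(series[-1] + rng.gauss(0, 1))

errors = walk_forward_errors(series, naive_forecast, n_initial=50)
point = naive_forecast(series)
lower, upper = empirical_interval(point, errors)
print(f"next value: {point:.2f}, rough 95% interval: [{lower:.2f}, {upper:.2f}]")
```

Because `forecast_fn` is just a function of the history, you can swap in any regressor without touching the interval logic. The catch is that the interval is only as trustworthy as the assumption that past errors look like future ones.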
The first great suggestion came from SharpestMinds mentor Ray Reng, who’s a genuine data science Slack superhero:

Here’s a clickable version of the link he provided. I’ll be honest, I did not know about this function, and it looks incredibly handy.
Next came SharpestMinds alum (and now data scientist 🚀) Khai Win:


The Jason Brownlee post she’s linking to is here (highly recommend!).
Finally, mentee Christian Fagan also proposed a really interesting strategy based on Bayesian intervals — it’s more advanced, so worth checking out if you’re adventurous and a lover of math:

(again, here it is in clickable form).
And that’s it! Just a short one today, but I thought all the different perspectives and tools that were suggested here would be helpful if you’re taking a look at your own time series problem.
Until next time :)