Embarrassing Myself for Your Education: Learning data science by processing my 37,000 unread emails

Christopher Tavolazzi
5 min read · Jun 24, 2024

Ever feel like you’ve been overwhelmed by an email tsunami?

Hello dear reader,

I want to thank you for being here. You could spend your time anywhere on the web, and yet you have found your way here. I hope your journey here has been safe and satisfying.

You are reading an article that will change both of our lives.

You see, I have an embarrassing problem:

I have over 37,000 unread emails.

I never meant for it to get this bad. I always said I would get to them when I had time. Only, the magical extra bit of time never manifested, and now my email inbox looks like a field of landmines, daring me to take a step.

But here’s the thing: what if this monumental failure of inbox management could be turned into something positive? What if we could transform this digital landfill into a goldmine of learning opportunities?

That’s right, folks. We’re going to learn data science by diving headfirst into this email abyss. Why? Because sometimes the best way to learn is by solving real problems — and boy, do I have a real problem.

The Grand Plan: A Data Science Odyssey

So, what’s the plan? We’re going to embark on a journey through the wonderful world of data science, using my embarrassing email situation as our dataset.

Here’s a sneak peek of what we’ll be covering:

1. Setting Up Shop:

We’ll start by getting our hands dirty with the tools of the trade. Python, here we come! We’ll download the email data (pray for my hard drive) and set up our environment. It’s like cleaning your room before a big project, except the room is digital, and the mess is… well, you know.

2. Data Exploration / Email Archaeology:

We’ll dig into this digital sediment, uncovering patterns and insights. How many emails are actually from Netflix? How many times have I ignored my mother? (Sorry, Mom!) We’ll create visualizations that will either impress you or make you question my life choices.

3. Machine Learning Magic:

We’ll teach a computer to do what I clearly couldn’t — decide which emails are important. We’ll build a spam detector, because apparently, I need all the help I can get.

4. Natural Language Processing:

We’ll dive into NLP, analyzing the content of these neglected messages. Maybe we’ll discover I’ve been ignoring a long-lost relative trying to give me an inheritance. (Spoiler: probably not)

5. Network Analysis:

We’ll map out my email connections. Who knows, we might discover I’m only three degrees of separation from Elon Musk. (Note to self: Check if any of those unread emails are from Tesla)

6. Time Series Analysis (Predicting My Future Neglect):

We’ll analyze email patterns over time. Will we be able to predict when I’ll hit 100,000 unread emails? Stay tuned to find out!

7. Building a Better Inbox:

Finally, we’ll take everything we’ve learned and build a system to manage emails more effectively.

The irony of creating an email management system after accumulating 37,000 unread emails is not lost on me.
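
To make step 2 a little more concrete, here is a minimal sketch of the "how many emails are actually from Netflix?" question. It uses only Python's standard library. The function name count_sender_domains and the sample From headers are made up for illustration; they are not taken from the real inbox, and the real analysis may end up looking quite different.

```python
from collections import Counter
from email.utils import parseaddr


def count_sender_domains(from_headers):
    """Tally how many messages came from each sender domain."""
    counts = Counter()
    for header in from_headers:
        # parseaddr splits 'Name <user@host>' into (name, address)
        _, address = parseaddr(header)
        if "@" not in address:
            continue  # skip malformed or empty From headers
        domain = address.rsplit("@", 1)[1].lower()
        counts[domain] += 1
    return counts


# Made-up From headers, purely for illustration
sample_headers = [
    "Netflix <info@mailer.netflix.com>",
    "Mom <mom@example.com>",
    "Netflix <info@mailer.netflix.com>",
    "not an address",
]

domain_counts = count_sender_domains(sample_headers)
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n}")
```

Counting by domain rather than by full address is a deliberate simplification: it groups every mailer address a company uses under one name, at the cost of lumping together unrelated people who share a provider.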

Why Should You Care?

You might be thinking, “Why should I follow this digital hoarder on their data science journey?” Well, dear reader, here’s why:

1. Learn by Doing: We’re going to tackle real-world problems with real (and really messy) data. No pristine datasets here!
2. Laugh While You Learn: Let’s face it, my situation is ridiculous. But we’re going to have fun with it. If we can’t laugh at ourselves, who can we laugh at?
3. Practical Skills: By the end of this journey, you’ll have a toolbox of data science skills that you can apply to your own projects (and maybe your own neglected inbox).
4. Community: We’re in this together. Share your own email horror stories, suggest analyses, or just point and laugh. It’s all welcome here.

The Road Ahead

In the coming articles, we’ll dive deep into each aspect of our data science journey. We’ll write code, crunch numbers, and hopefully, by the end, I’ll have both a manageable inbox and a new set of skills. And you, dear reader, will have front-row seats to this spectacular adventure.

Our first stop: setting up our data science environment and taking our first peek into the abyss… I mean, dataset.

We’ll be using Python, because that’s what all the cool data scientists use (or so I’m told). We’ll set up libraries like pandas for data manipulation, matplotlib for visualization, and scikit-learn for machine learning.
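
Before going further, it can help to confirm those three libraries are actually installed. The sketch below is one possible way to check, using only the standard library. The helper name check_environment and the REQUIRED mapping are invented for this example, and the pip hint assumes pip is your installer.

```python
from importlib.util import find_spec

# Package name -> import name for the libraries mentioned above.
# Note: scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED = {
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}


def check_environment(required=REQUIRED):
    """Return {package name: True/False} for whether each import can be found."""
    return {pkg: find_spec(module) is not None for pkg, module in required.items()}


status = check_environment()
for pkg, installed in status.items():
    print(f"{pkg}: {'ok' if installed else 'missing'}")

missing = [pkg for pkg, installed in status.items() if not installed]
if missing:
    # Assumes pip; adjust for conda or another package manager
    print("Try: pip install " + " ".join(missing))
```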

And, before you ask, YES, we will be using AI models to code.

This series will make HEAVY use of Claude and ChatGPT to handle the boilerplate and the early rounds of ideation. Like most people, I’ve found current AI models most helpful for coming up with lots of “ok” ideas that can be reworked into good ones.

Here’s what our first bit of code might look like (courtesy of Claude 3.5 Sonnet):

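import pandas as pd
import matplotlib.pyplot as plt

# Load our email data (assuming we've already extracted it to a CSV
# with at least 'date' and 'subject' columns)
emails = pd.read_csv('my_37000_emails_of_shame.csv')

# Let's see what we're dealing with
print(emails.head())

# Plot emails over time, because why not visualize our procrastination?
# utc=True copes with mixed time zones; errors='coerce' turns bad dates into NaT
emails['date'] = pd.to_datetime(emails['date'], utc=True, errors='coerce')
# size() counts every row per day, including emails with an empty subject
daily_counts = emails.dropna(subset=['date']).set_index('date').resample('D').size()
daily_counts.plot()
plt.title("My Daily Email Avoidance")
plt.ylabel("Emails received")
plt.show()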

Don’t worry if this looks like gibberish to you. We’ll be breaking down every step, learning as we go. And who knows? By the end of this series, you might be the one explaining to me why my code is a mess (as if my inbox wasn’t enough).
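
The snippet above assumes the CSV already exists. Here is a rough sketch of how that extraction step might look, assuming the mail has first been exported to an mbox file (the article hasn't said yet how the data will be downloaded, so that part is a guess). It relies only on Python's built-in mailbox and csv modules. The function name mbox_to_csv and the 'sender' column are made up for this example; 'date' and 'subject' match the columns the snippet above expects.

```python
import csv
import mailbox


def mbox_to_csv(mbox_path, csv_path):
    """Write one CSV row per message; return the number of messages written."""
    box = mailbox.mbox(mbox_path)
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "sender", "subject"])
        for message in box:
            # Missing headers come back as None; store them as empty strings
            writer.writerow([
                str(message["Date"] or ""),
                str(message["From"] or ""),
                str(message["Subject"] or ""),
            ])
            rows += 1
    box.close()
    return rows


# Hypothetical usage (the mbox file name is a placeholder):
# mbox_to_csv("my_mail.mbox", "my_37000_emails_of_shame.csv")
```

Header values are written as they appear in the file, so encoded subjects and odd date formats pass through untouched. Cleaning those up is a separate step.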

So, are you ready to turn an embarrassing personal failure into a data science victory? Are you prepared to learn, laugh, and possibly cry as we navigate through 37,000 ignored responsibilities?

If your answer is yes, then stay tuned. Our next article will dive into the nitty-gritty of setting up our Python environment and loading our first batch of emails.

Remember, in the world of data science, one person’s overwhelming inbox is another person’s treasure trove of learning opportunities. Let’s make the most of this digital disaster!

Until next time, may your inboxes be manageable and your data be clean. (But if they’re not, hey, that’s what we’re doing here!)

Next in Series:

>> Defining the Problem <<

Christopher Tavolazzi

Creative Director, Writer, Musician - Follow me for more Poetry, Science, Spirituality, Self-Development and Art