
What Is Data Science?

Data science is not just machine learning. It is not just statistics. It is not just programming.

Data science is the discipline of extracting useful knowledge from data. Most of that work is not glamorous — it is cleaning messy spreadsheets, asking the right questions, and communicating results to people who do not care about your code.

Data Science Pipeline

The typical data science pipeline looks like this:

1. Collect — gather raw data from databases, APIs, CSVs, or web scraping

2. Clean — handle missing values, fix types, remove duplicates

3. Explore — visualize distributions, find patterns, ask questions

4. Engineer — create new features that help models learn

5. Model — train algorithms, evaluate performance, iterate

6. Communicate — present findings to stakeholders who make decisions

If you have ever used Excel pivot tables, conditional formatting, or VLOOKUP, you have already done steps 1-3. This lesson bridges that experience to the Python-based workflow used in industry.

Warm-Up

Your Data Experience

Everyone has worked with data in some form — a budget spreadsheet, a grade tracker, a fitness app, even a playlist with play counts.

Describe a time you worked with data in a spreadsheet or app. What were you trying to figure out, and did the data give you the answer?

Garbage In, Garbage Out

Why Cleaning Matters

Data scientists spend 60-80% of their time cleaning data. This is not an exaggeration — it is a consistent finding across industry surveys.

The reason is simple: garbage in, garbage out. If your data has errors, missing values, or inconsistent formats, every analysis built on top of it will be wrong. A perfect model trained on dirty data produces confidently wrong answers.


Common Data Problems

- Missing values — cells are blank. Was the data not collected, or is the value actually zero? These are different situations that require different handling.

- Wrong data types — a column of numbers stored as text, dates in inconsistent formats (01/02/2024 — is that January 2nd or February 1st?)

- Outliers — a salary column has one entry of $1,000,000,000. Is that real, or a typo? Either way, it will skew your averages.

- Duplicates — the same record appears twice because two systems merged imperfectly

- Categorical encoding — a column says 'Yes', 'yes', 'Y', 'TRUE', and '1'. These all mean the same thing, but your computer does not know that.


In pandas (the standard Python data library), you handle these with methods like `dropna()`, `fillna()`, `astype()`, and `drop_duplicates()`. But the hard part is not the code — it is deciding what to do with each problem.
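
The methods above can be sketched on a tiny, made-up customer table (the column names and values here are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical messy data: ages stored as text, inconsistent Yes/No labels, one duplicate row
df = pd.DataFrame({
    "age": ["34", "29", None, "34"],
    "signed_up": ["01/02/2024", "2024-02-15", "2024-03-01", "01/02/2024"],
    "active": ["Yes", "yes", "Y", "Yes"],
})

df = df.drop_duplicates()                         # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"])              # text column of numbers -> floats (blanks become NaN)
df["age"] = df["age"].fillna(df["age"].median())  # one possible choice for missing values
# Collapse 'Yes'/'yes'/'Y' into a single boolean column
df["active"] = df["active"].str.strip().str.lower().isin(["yes", "y", "true", "1"])
```

Notice that every line after the setup encodes a *decision* (drop vs. fill, which labels count as true), which is exactly the hard part the paragraph above describes.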

Cleaning Decisions

Deciding What To Do

Here is a real scenario. You have a dataset of 10,000 customer records. The 'age' column has 500 missing values.

Your options:

- Drop the rows — remove all 500 records. Simple, but you lose 5% of your data. If those 500 customers share a trait (maybe they skipped the age field because they are privacy-conscious), dropping them introduces bias.

- Fill with the mean — replace blanks with the average age. Quick, but it artificially reduces the variance of your age column.

- Fill with the median — better than mean if the age distribution is skewed (a few very old or very young customers pulling the average).

- Use a flag — create a new column called 'age_missing' (1 or 0) and fill the original with the median. Now your model can learn whether missingness itself is informative.

There is no universal right answer. The choice depends on why the data is missing and what you plan to do with it.
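
The flag-plus-median option is a two-liner in pandas. A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd

# Hypothetical customer ages with two missing values
df = pd.DataFrame({"age": [25, 31, np.nan, 47, np.nan, 38]})

# Record which rows were missing, then fill the blanks with the median
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
```

The order matters: compute the flag before filling, or every flag will be 0.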

You have a dataset of employee salaries. 200 out of 5,000 records have missing salary values. You notice that most of the missing values are from executives. Would you drop those rows, fill with the mean, or do something else? Explain your reasoning.

Asking the Right Questions

Exploratory Data Analysis (EDA)

Before you build any model, you need to understand your data. EDA is the process of summarizing, visualizing, and questioning a dataset to find patterns, anomalies, and relationships.


Key Tools

- Histograms — show the distribution of a single variable. Is it bell-shaped? Skewed? Bimodal (two peaks)? A histogram of income is typically right-skewed because a few people earn vastly more than the majority.

- Scatter plots — show the relationship between two variables. Do taller people weigh more? Does more study time correlate with higher grades? The pattern (or lack of pattern) tells you whether a relationship exists.

- Correlation — a number between -1 and +1 that measures linear association. +1 means perfect positive relationship, -1 means perfect negative, 0 means no linear relationship. But correlation does not imply causation — ice cream sales and drowning deaths are correlated because both increase in summer.

- Summary statistics — mean, median, standard deviation, min, max. In pandas: `df.describe()` gives you all of these in one line.
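
A first EDA pass often fits in a few lines. A sketch using invented study-time data (the values are made up to show the mechanics):

```python
import pandas as pd

# Illustrative dataset: hours studied vs. exam grade
df = pd.DataFrame({
    "study_hours": [2, 5, 1, 8, 4, 7],
    "grade":       [55, 70, 50, 92, 66, 85],
})

print(df.describe())                          # count, mean, std, min, quartiles, max per column
print(df["study_hours"].corr(df["grade"]))    # Pearson correlation, between -1 and +1
# df["grade"].hist() would draw a histogram (needs matplotlib installed)
```

In this toy data the correlation comes out strongly positive, which is the pattern a scatter plot of the two columns would show visually.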


The Real Skill

The tools are easy to learn. The hard part is asking the right questions. Bad question: 'What does the data say?' Good question: 'Do customers who contact support within the first week have higher retention rates?'

The Correlation Trap

Correlation vs. Causation

This is the single most important concept in data literacy. Two variables can be strongly correlated without one causing the other.

Classic examples:

- Cities with more firefighters have more fires. (Larger cities have both.)

- Students who eat breakfast get better grades. (Maybe wealthier families are more likely to provide breakfast AND academic support.)

- Countries that consume more chocolate win more Nobel Prizes. (Both correlate with national wealth.)

The hidden factor is called a confounding variable — a third variable that drives both of the ones you are looking at.

A company finds that employees who use the office gym have 30% fewer sick days. The CEO wants to require all employees to use the gym. What is wrong with this reasoning? What confounding variables might explain the correlation?

Creating Useful Variables

What Is Feature Engineering?

A feature is an input variable that a model uses to make predictions. Feature engineering is the art of creating new features from raw data to help models learn patterns they could not see otherwise.

Raw data rarely comes in the form models need. Consider a dataset with a 'date of birth' column. A model cannot do much with raw dates. But if you create an 'age' feature from it, suddenly the model can learn age-based patterns.


Common Techniques

- Normalization — scaling numbers to a common range (0 to 1, or mean=0 and standard deviation=1). Without this, a feature measured in thousands (salary) will dominate a feature measured in single digits (years of experience).

- One-hot encoding — converting categorical variables into binary columns. A 'color' column with values [red, blue, green] becomes three columns: 'color_red', 'color_blue', 'color_green', each with 0 or 1.

- Binning — turning a continuous variable into categories. Age 0-17 becomes 'minor', 18-64 becomes 'adult', 65+ becomes 'senior'. This helps when the relationship is not linear.

- Interaction features — multiplying two features together. 'Square footage times number of bathrooms' might predict house price better than either alone.

- Domain knowledge — the most powerful technique. A doctor creating features for a medical model knows which lab values matter. A marketer knows that 'days since last purchase' is more useful than 'purchase date'. No algorithm can replace this.
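
Three of these techniques have direct pandas equivalents. A sketch on a hypothetical table (column names and values invented for illustration):

```python
import pandas as pd

# Hypothetical raw features
df = pd.DataFrame({
    "color":  ["red", "blue", "green", "red"],
    "salary": [40_000, 55_000, 90_000, 62_000],
    "age":    [16, 34, 70, 45],
})

# One-hot encoding: 'color' becomes color_red / color_blue / color_green binary columns
df = pd.get_dummies(df, columns=["color"])

# Normalization: min-max scaling of salary into the 0-1 range
df["salary_scaled"] = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())

# Binning: continuous age -> the minor/adult/senior categories from above
df["age_group"] = pd.cut(df["age"], bins=[0, 17, 64, 120], labels=["minor", "adult", "senior"])
```

Interaction features are just arithmetic on columns (e.g. `df["a"] * df["b"]`), and domain knowledge decides which of these transformations is worth doing at all.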

Feature Practice

Applying Feature Engineering

Imagine you are building a model to predict which customers will cancel their streaming subscription next month. Your raw data includes:

- Account creation date

- Last login date

- Number of shows watched last month

- Monthly payment amount

- Customer support tickets filed

- Country

From the raw data listed above, propose at least three new features you would engineer. For each one, explain what it captures and why it might help predict cancellation.

Train/Test Split

Why You Split Your Data

The most important rule in modeling: never evaluate a model on the same data you trained it on.

If you do, the model can just memorize the answers. It will score perfectly on the training data but fail on new, unseen data. This is called overfitting — the model learned the noise in your specific dataset instead of the real patterns.

The standard practice is to split your data:

- Training set (typically 70-80%) — the model learns from this

- Test set (typically 20-30%) — held back, used only to evaluate the final model

In scikit-learn: `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)`
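
Putting the split into a full train-then-evaluate loop looks like this. The data here is synthetic (generated by `make_regression` as a stand-in for a real dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for real features and a numeric target
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

# Hold back 20% of rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on unseen data -- the honest measure
```

Evaluating only on `X_test` is what makes the score an estimate of performance on genuinely new data.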


Common Algorithms

- Linear regression — draws the best-fit straight line through data. Simple, interpretable, works when the relationship is roughly linear. Predicts a number (price, temperature, score).

- Decision trees — a flowchart of yes/no questions. Easy to understand and explain. Prone to overfitting unless pruned or limited in depth.

- Random forests — many decision trees that vote together. More accurate than a single tree, less prone to overfitting, but harder to explain.


Overfitting vs. Underfitting

- Overfitting — model is too complex. It memorizes training data, including noise. High accuracy on training data, low accuracy on test data.

- Underfitting — model is too simple. It cannot capture the real patterns. Low accuracy on both training and test data.

The goal is the sweet spot in between.
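
You can watch overfitting happen with a decision tree. A sketch on synthetic classification data (generated purely to demonstrate the train/test gap):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree can memorize the training set
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(deep.score(X_train, y_train))  # typically a perfect 1.0 on data it has seen
print(deep.score(X_test, y_test))    # noticeably lower on unseen data

# Limiting depth trades some training accuracy for better generalization
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(shallow.score(X_test, y_test))
```

The gap between the deep tree's training and test scores is overfitting made visible; an overly shallow tree would instead score poorly on both (underfitting).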

Evaluation Metrics

How Do You Know If Your Model Is Good?

Accuracy alone can be misleading. If 95% of emails are not spam, a model that always says 'not spam' is 95% accurate — but completely useless.

Key metrics:

- Accuracy — percentage of correct predictions. Useful when classes are balanced.

- Precision — of all the things the model flagged as positive, how many actually were? High precision means few false alarms.

- Recall — of all the actual positives, how many did the model catch? High recall means few missed cases.

- F1 score — the harmonic mean of precision and recall. Useful when you need to balance both.

- RMSE (Root Mean Squared Error) — for regression (predicting numbers). Roughly, how far off predictions are on average, with large errors penalized more heavily than small ones.

Which metric matters most depends on the problem. For cancer detection, recall matters more — you do not want to miss a case. For spam filtering, precision matters more — you do not want to delete a real email.
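
The spam example above can be checked directly with scikit-learn's metric functions, using made-up labels (95 legitimate emails, 5 spam):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Imbalanced toy labels: 0 = not spam (95), 1 = spam (5)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts not-spam

print(accuracy_score(y_true, y_pred))                    # 0.95 -- looks great
print(recall_score(y_true, y_pred))                      # 0.0 -- it caught zero spam
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 -- it never flagged anything
```

The 95% accuracy and 0% recall side by side make the paragraph's point concrete: with imbalanced classes, accuracy alone hides total failure on the class you care about.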

You are building a model to detect fraudulent credit card transactions. Only 0.1% of transactions are actually fraudulent. If your model predicts every transaction as legitimate, what is its accuracy? Why is accuracy a bad metric here, and what metric would you use instead?

Data Analyst vs. Data Scientist vs. ML Engineer

Three Distinct Roles

The data field has three main career tracks, and they require different skills.


Data Analyst

- Focus: answering business questions with existing data

- Tools: SQL, Excel, Tableau, basic Python or R

- Day-to-day: dashboards, reports, A/B test analysis, stakeholder presentations

- Entry path: often the most accessible. Many analysts start without a CS degree.


Data Scientist

- Focus: building predictive models and finding patterns in complex data

- Tools: Python (pandas, scikit-learn, matplotlib), statistics, SQL, Jupyter notebooks

- Day-to-day: EDA, feature engineering, model building, experimentation

- Entry path: typically requires statistics or quantitative background. Bootcamps and self-study are viable.


Machine Learning Engineer

- Focus: deploying and scaling models in production systems

- Tools: Python, TensorFlow/PyTorch, Docker, cloud platforms (AWS/GCP), APIs

- Day-to-day: model optimization, pipeline infrastructure, monitoring production models

- Entry path: usually requires strong software engineering skills plus ML knowledge.


Building a Portfolio

Hiring managers care about what you can do, not just what you studied. A portfolio of 3-5 solid projects on GitHub matters more than certifications. Good projects use real (not toy) datasets, include clear documentation, and show the full pipeline — from messy data to actionable insight.

Your Next Steps

Where to Go From Here

The tools of the trade are free and accessible:

- pandas — the standard Python library for data manipulation

- matplotlib / seaborn — visualization libraries

- scikit-learn — the workhorse for classical machine learning

- Jupyter notebooks — interactive coding environments where you can mix code, output, and notes

- Kaggle — free datasets, competitions, and a community of practitioners

Start with one real dataset that interests you. Download it, clean it, explore it, and try to answer a question. That single project will teach you more than any course.

Based on what you learned in this lesson, which of the three roles (data analyst, data scientist, or ML engineer) interests you most? What is one concrete step you could take this week to start building skills for that role?