How Netflix’s Content Strategy Shifted Over Time (Full EDA Guide)

A data analyst’s first project often starts with clean, pre-processed datasets—isolated exercises designed to teach one skill at a time. But real-world analysis rarely follows that script. Instead, it demands curiosity, adaptability, and the ability to weave multiple techniques into a cohesive investigation. That’s exactly what this end-to-end walkthrough delivers.

We begin with a raw dataset, pose a meaningful question, and navigate the messy, iterative process of cleaning, exploring, and visualizing to reveal genuine insights. No shortcuts. No hand-holding. Just a data professional’s journey from dataset to discovery.

Meet the Dataset: Netflix’s Global Content Library

This analysis uses the Netflix Movies and TV Shows dataset from Kaggle, a publicly available collection of 8,807 titles. It includes key fields like release year, country of origin, genre, duration, and the date content was added to the platform.

To follow along, start with the imports and global settings used throughout the analysis:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from collections import Counter
import warnings

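# Silence library warnings to keep the output readable (this also hides deprecation notices)
warnings.filterwarnings("ignore")
sns.set_theme(style="darkgrid", palette="husl")
# Fix the NumPy seed so any random sampling is reproducible
np.random.seed(42)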

The dataset is raw in the best sense—unfiltered, unpolished, and ready for real analysis. That means missing values, inconsistent formats, and overlapping categories, all of which mirror the challenges analysts face daily.

Defining the Core Question: From Data to Insight

Before writing a single line of code, it’s essential to frame a question that guides the entire exploration. A good exploratory data analysis (EDA) question should be specific enough to stay focused yet broad enough to allow unexpected discoveries.

The central question here is:

"How has Netflix’s content strategy shifted over time? Specifically, are they producing more original movies or TV shows? Which countries contribute the most content, and which genres dominate the platform?"

This question doesn’t just ask for trends—it invites investigation into the why behind them. That balance between focus and openness is what turns a technical exercise into a genuine analytical challenge.

Step 1: Loading and Initial Exploration

The first step is always the same: load the data and take a quick look.

df = pd.read_csv("netflix_titles.csv")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nMissing values:\n{df.isnull().sum().sort_values(ascending=False)}")

The output reveals a dataset of 8,807 rows and 12 columns, with significant gaps in fields like director (2,634 missing entries) and cast (825 missing). Two fields also pack several values into a single string: country holds comma-separated values for co-productions, and listed_in lists multiple genres.

This structure highlights a key lesson: real data is messy. Missing values aren’t flaws to ignore; they’re signals to interpret. Dropping every row with a missing director would discard 2,634 of 8,807 rows, roughly 30% of the data, which is far too costly for meaningful analysis. Instead, we’ll handle these gaps strategically.
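One way to quantify the cost of dropping rows is to express the missing counts as percentages. The sketch below is a minimal illustration on a toy frame; the helper name missing_share and the toy values are assumptions made for this example, not part of the original analysis.

```python
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Toy frame standing in for netflix_titles.csv (values are illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "cast": ["P, Q", "R", None, "S"],
})

shares = missing_share(toy)
print(shares)
```

Calling missing_share(df) on the full dataset should give the same picture as the raw counts above, just on a scale that makes the trade-off easier to judge.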

Step 2: Cleaning and Transforming the Data

Cleaning isn’t just about fixing errors—it’s about making the data usable without distorting its meaning. Here’s how we transformed the raw data into a structured format.

First, we corrected the date_added column, stripping whitespace before parsing it as a datetime:

df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), errors="coerce")
df["year_added"] = df["date_added"].dt.year
df["month_added"] = df["date_added"].dt.month
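To see what the strip-and-coerce step does, here is a small sketch on hand-written strings. The sample values are made up for illustration; the point is that stray whitespace is removed before parsing and that unparseable entries become NaT instead of raising an error.

```python
import pandas as pd

# Illustrative values: leading/trailing spaces, one bad entry, one missing entry
raw = pd.Series([" September 25, 2021", "August 4, 2017 ", "not a date", None])

parsed = pd.to_datetime(raw.str.strip(), errors="coerce")
years = parsed.dt.year

# Count how many entries could not be parsed (including originally missing ones)
n_unparsed = int(parsed.isna().sum())
print(parsed)
print(f"Unparsed entries: {n_unparsed}")
```

Checking the number of NaT values after parsing is a cheap way to confirm that errors="coerce" is not silently discarding a large share of the column.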

Next, we addressed missing values in categorical fields by replacing them with placeholders:

director, cast, and country → filled with "Unknown"
For country, we extracted the primary production country by splitting the string and taking the first value

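# Fill missing categorical fields first so the split below never sees NaN
for col in ["director", "cast", "country"]:
    df[col] = df[col].fillna("Unknown")

# Keep the first listed country as the primary production country
df["country_primary"] = df["country"].str.split(",").str[0].str.strip()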

The duration column presented a unique challenge—it combines metrics for both movies (e.g., "90 min") and TV shows (e.g., "2 Seasons"). To analyze this effectively, we split it into two columns:

df["duration_value"] = df["duration"].str.extract(r"(\d+)").astype(float)
df["duration_type"] = df["duration"].str.extract(r"([a-zA-Z]+)")
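Once the value and unit are separated, movies and TV shows can be summarized on their own scales. The following is a minimal sketch on toy rows (the data is invented for illustration) showing how the two new columns might be used.

```python
import pandas as pd

# Toy rows mimicking the mixed "duration" column
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "duration": ["90 min", "2 Seasons", "125 min", "1 Season"],
})

# expand=False returns a Series rather than a one-column DataFrame
toy["duration_value"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)
toy["duration_type"] = toy["duration"].str.extract(r"([a-zA-Z]+)", expand=False)

# Minutes only make sense for movies, season counts only for shows
movie_minutes = toy.loc[toy["duration_type"] == "min", "duration_value"]
show_seasons = toy.loc[toy["duration_type"].str.startswith("Season"), "duration_value"]

print(f"Mean movie length: {movie_minutes.mean()} min")
print(f"Season counts: {show_seasons.tolist()}")
```

Note that the unit appears as both "Season" and "Seasons", so matching on the prefix avoids splitting shows into two groups.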

Finally, we filtered out titles released before 1950 to avoid skewing trends with historical anomalies.
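The filter itself is not shown in the original walkthrough. A minimal sketch, assuming it is applied to the dataset's release_year column, could look like this (the constant name and toy rows are illustrative):

```python
import pandas as pd

MIN_RELEASE_YEAR = 1950  # cutoff stated in the text

# Toy rows; release_year is assumed to be the column the filter applies to
toy = pd.DataFrame({
    "title": ["Old Reel", "Mid Century", "Modern Hit"],
    "release_year": [1942, 1950, 2019],
})

# Keep titles released in 1950 or later; .copy() avoids chained-assignment warnings later
filtered = toy[toy["release_year"] >= MIN_RELEASE_YEAR].copy()
n_dropped = len(toy) - len(filtered)
print(f"Dropped {n_dropped} title(s) released before {MIN_RELEASE_YEAR}")
```

Printing the number of dropped rows makes it easy to confirm that the cutoff removes only a handful of titles.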

The result? A cleaned dataset ready for deeper analysis—one that preserves the integrity of the original data while making it tractable.

Step 3: Tracking Netflix’s Content Strategy Over Time

Now comes the moment of truth: visualizing how Netflix’s output evolved. We grouped the data by year and content type to see patterns in growth and composition.

yearly_type = df.groupby(["year_added", "type"]).size().reset_index(name="count")
yearly_type = yearly_type[yearly_type["year_added"].between(2010, 2021)]

Two key visualizations emerged from this data:

Annual content additions – A line chart showing total titles added per year
TV show proportion – A bar chart tracking the share of TV shows in Netflix’s yearly additions
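The two charts can be derived from the same year-by-type counts. The sketch below is one possible way to do it, not necessarily how the original figures were produced: the function names are hypothetical, and the toy counts are invented so the arithmetic is easy to check.

```python
import pandas as pd


def tv_share_by_year(yearly_type: pd.DataFrame) -> pd.DataFrame:
    """Pivot year/type counts wide and add each year's total and TV share."""
    wide = yearly_type.pivot(index="year_added", columns="type", values="count").fillna(0)
    wide["total"] = wide.sum(axis=1)
    wide["tv_share"] = wide["TV Show"] / wide["total"]
    return wide


def plot_content_trends(wide: pd.DataFrame):
    """Draw total additions as a line chart and TV share as a bar chart."""
    import matplotlib.pyplot as plt  # imported lazily so the aggregation runs without it

    fig, (ax_total, ax_share) = plt.subplots(1, 2, figsize=(12, 4))
    ax_total.plot(wide.index, wide["total"], marker="o")
    ax_total.set(title="Titles added per year", xlabel="Year added", ylabel="Titles")
    ax_share.bar(wide.index, wide["tv_share"] * 100)
    ax_share.set(title="TV show share of additions", xlabel="Year added", ylabel="Share (%)")
    fig.tight_layout()
    return fig


# Toy counts in the same shape as yearly_type (values are illustrative only)
toy = pd.DataFrame({
    "year_added": [2016, 2016, 2017, 2017],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "count": [30, 10, 60, 40],
})
wide = tv_share_by_year(toy)
print(wide)
```

With the real data, plot_content_trends(tv_share_by_year(yearly_type)) would return a figure with both panels side by side.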

The first chart reveals a dramatic surge in content from 2015 to 2019, followed by a noticeable dip in 2020—likely due to pandemic-related production disruptions. The second chart uncovers a clear strategic shift: Netflix has steadily increased its TV show output, moving from a movie-dominated platform toward a more balanced (and increasingly TV-focused) library.

This isn’t just a statistical observation—it’s a window into Netflix’s evolving business model, driven by original productions and global expansion.

Step 4: Identifying Top Producing Countries and Dominant Genres

Next, we turned our attention to geography and genre. Which countries contribute the most content to Netflix? And which genres define the platform’s identity?

By aggregating the primary production country and splitting the listed_in genres, we uncovered:

Top 5 countries by content volume: United States, India, UK, Canada, and France
Dominant genres across movies and TV shows: International Movies, Dramas, Comedies, and Documentaries
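The aggregation behind these rankings can be sketched as follows. This is a minimal illustration on toy rows, assuming the country_primary column from the cleaning step and the comma-separated listed_in column; the exact code used for the original figures is not shown in the article.

```python
from collections import Counter

import pandas as pd

# Toy rows; real values come from the cleaned dataset
toy = pd.DataFrame({
    "country_primary": ["United States", "India", "United States", "Unknown"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "Documentaries",
        "Dramas",
    ],
})

# Rank countries, leaving out the placeholder used for missing values
top_countries = (
    toy.loc[toy["country_primary"] != "Unknown", "country_primary"]
    .value_counts()
    .head(5)
)

# A title can carry several genres, so count each genre separately
genre_counts = Counter(
    genre.strip()
    for genres in toy["listed_in"].dropna()
    for genre in genres.split(",")
)
top_genres = genre_counts.most_common(3)

print(top_countries)
print(top_genres)
```

Because each title contributes to every genre it lists, the genre totals add up to more than the number of titles, which is worth keeping in mind when reading the chart.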

A horizontal bar chart of top countries showed the United States producing nearly 2,500 titles—far ahead of India, the next largest contributor. Meanwhile, genre distribution highlighted Netflix’s emphasis on diverse storytelling, with drama and comedy forming the backbone of its catalog.

What the Analysis Reveals About Netflix’s Strategy

Taken together, these findings paint a picture of a company in transition. Netflix didn’t just grow—it evolved. The shift from predominantly movie-based content to a more balanced mix reflects a strategic pivot toward sustainable engagement, leveraging serialized storytelling to retain subscribers globally.

The geographic data underscores Netflix’s global ambitions, with heavy reliance on U.S. productions but growing investments in international markets like India and the UK.

And while genre preferences vary by region, the dominance of drama and comedy suggests a universal appeal—content that resonates across cultures and languages.

Final Thoughts: From Exercise to Expertise

This analysis demonstrates why real-world data work isn’t about isolated skills—it’s about integration. Loading, cleaning, exploring, and visualizing aren’t separate steps; they’re interconnected stages in a single discovery process.

For aspiring data professionals, the takeaway is clear: the best way to learn isn’t by mastering one tool at a time, but by tackling messy, real datasets with purpose and persistence. The insights aren’t just in the charts—they’re in the questions you ask before you ever write a line of code.
