Handling Categorical Data with Pandas

pandas
dataframe
categorical-data
data-transformation
Comprehensive guide to handling categorical data in Pandas, including encoding techniques, grouping operations, and data reshaping methods like melt and pivot.
Author

Mohammed Adil Siraju

Published

September 21, 2025

This notebook covers essential techniques for working with categorical data in Pandas, including:
- Encoding Methods: Converting categorical variables to numerical formats
- Grouping Operations: Analyzing category distributions and aggregations
- Data Transformation: Reshaping data with melt and pivot operations

Categorical data transformation is crucial for machine learning models that require numerical inputs.

1. Setting Up Sample Data

Let’s start by creating a sample DataFrame with categorical data to work with.

import pandas as pd

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)
df
Category
0 A
1 B
2 C
3 C
4 B
5 A
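
Before encoding anything, it can help to see how pandas itself stores categories. The sketch below is an addition to the original notebook: it converts the same column to the `category` dtype and inspects the categories and their integer codes. The variable name `cat_col` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# Convert the object column to pandas' dedicated categorical dtype
cat_col = df['Category'].astype('category')

print(cat_col.dtype)                 # category
print(list(cat_col.cat.categories))  # ['A', 'B', 'C']
print(cat_col.cat.codes.tolist())    # [0, 1, 2, 2, 1, 0]
```

The codes are positions in the sorted list of categories, which is also the mapping most label-style encoders end up producing for this column.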

2. Encoding Categorical Data

Machine learning algorithms typically require numerical inputs. Categorical encoding converts text categories into numbers. Here are the most common techniques:

One-Hot Encoding

One-hot encoding creates binary columns for each category. It’s ideal for nominal (unordered) categories.

Pros: No ordinal assumptions, works well with most algorithms
Cons: Can create many columns (curse of dimensionality)

# Keep only the A and B indicator columns; a row that is False in both is category C
pd.get_dummies(df['Category'])[['A','B']]
A B
0 True False
1 False True
2 False False
3 False False
4 False True
5 True False
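
The cell above drops one indicator column by selecting `A` and `B` by hand. As a hedged alternative, `pd.get_dummies` can encode a DataFrame column directly, return 0/1 integers through `dtype=int`, and drop a reference column through `drop_first=True`. The names `full` and `reduced` are illustrative. Note that `drop_first` removes the first category (`A`), not `C` as in the manual selection.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A']})

# Full one-hot encoding: one 0/1 column per category, prefixed with the column name
full = pd.get_dummies(df, columns=['Category'], dtype=int)
print(full.columns.tolist())     # ['Category_A', 'Category_B', 'Category_C']

# drop_first=True removes the first category (A); a row of all zeros then means A
reduced = pd.get_dummies(df, columns=['Category'], drop_first=True, dtype=int)
print(reduced.columns.tolist())  # ['Category_B', 'Category_C']
```

Dropping one column avoids perfectly collinear features, which matters mainly for linear models; tree-based models are usually fine with the full set.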

Label Encoding

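Label encoding assigns an integer to each category. It suits ordinal data only when the integers follow the categories' natural order. scikit-learn's LabelEncoder assigns codes in sorted label order (alphabetical for strings) and is documented as a tool for encoding target labels, so check that the resulting order is the one you intend.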

Pros: Memory efficient, preserves single column
Cons: Implies ordinal relationship even when none exists

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# fit_transform learns the sorted unique labels and maps each to an integer (A=0, B=1, C=2)
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])

df
Category Category_LabelEncoded
0 A 0
1 B 1
2 C 2
3 C 2
4 B 1
5 A 0
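
When the categories do have a natural order, one way to make the codes respect it is to declare that order explicitly with `pd.Categorical`. This is a minimal pandas-only sketch, not the notebook's original method; the `sizes` data and the order small < medium < large are assumptions made for the example.

```python
import pandas as pd

# Hypothetical ordinal data: the order small < medium < large is assumed for this example
sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the order explicitly so the integer codes follow it
ordered = pd.Categorical(sizes, categories=['small', 'medium', 'large'], ordered=True)
print(ordered.codes.tolist())       # [1, 0, 2, 0]

# Without an explicit order, categories are sorted alphabetically: large, medium, small
alphabetical = pd.Categorical(sizes)
print(alphabetical.codes.tolist())  # [1, 2, 0, 2]
```

The second result shows why alphabetical codes can mislead a model: `large` receives the smallest code.
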
# Rebuild the original DataFrame so the label-encoded column is gone before grouping
import pandas as pd

data = {
    'Category': ['A','B','C','C','B','A']
}

df = pd.DataFrame(data)

df
Category
0 A
1 B
2 C
3 C
4 B
5 A

3. Analyzing Categorical Data with Grouping

Grouping operations help you understand the distribution and patterns in your categorical data. This is essential for exploratory data analysis.

Counting Category Frequencies

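Use groupby().size() or groupby().count() to see how many times each category appears. size() returns the number of rows in each group, while count() returns the number of non-null values per column, so the two differ only when the data contains missing values.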

df.groupby('Category').size()
Category
A    2
B    2
C    2
dtype: int64
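
The difference between `size()` and `count()` only shows up with missing values. The following sketch uses a hypothetical `Score` column with one missing entry to illustrate it, and also shows `value_counts()` as a shortcut for frequencies; the names `df_scores` and `shares` are illustrative.

```python
import pandas as pd

# Hypothetical data with one missing score, to show where size() and count() differ
df_scores = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Score': [1.0, None, 3.0, 4.0, 5.0, 6.0],
})

grouped = df_scores.groupby('Category')
print(grouped.size().tolist())            # [2, 2, 2]  rows per group
print(grouped['Score'].count().tolist())  # [2, 1, 2]  non-null scores per group

# value_counts() is a shortcut for frequencies; normalize=True returns proportions
shares = df_scores['Category'].value_counts(normalize=True)
print(shares)
```
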
df.groupby('Category').agg({'Category':'count'})
Category
Category
A 2
B 2
C 2
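
Counting is the simplest aggregation; the same grouping can also summarize a numeric column per category. This is a small sketch under the assumption that the data has a numeric `Sales` column, which is hypothetical and not part of the original sample.

```python
import pandas as pd

# Hypothetical numeric column added for illustration
df_sales = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'C', 'B', 'A'],
    'Sales': [10, 20, 30, 40, 50, 70],
})

# Named aggregation: each keyword becomes an output column
summary = df_sales.groupby('Category').agg(
    total=('Sales', 'sum'),
    average=('Sales', 'mean'),
)
print(summary)
```

Named aggregation keeps the output column names explicit, which tends to be easier to read than a dict of lists when several statistics are computed at once.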

4. Data Transformation: Reshaping with Melt and Pivot

Data reshaping is crucial for transforming your data between “wide” and “long” formats. This is particularly useful when working with categorical data across multiple variables.

Wide to Long Format (melt)

pd.melt() unpivots a DataFrame from wide format to long format. This is useful for:
- Converting multiple categorical columns into a single column
- Preparing data for visualization libraries
- Making data more database-friendly

# Sample wide-format data: one row per student, one column per subject
data = {
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95]
}

df = pd.DataFrame(data)
df
Name Math Science
0 John 90 92
1 Emily 85 80
2 Kate 88 95
df_melted = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
df_melted
Name Subject Score
0 John Math 90
1 Emily Math 85
2 Kate Math 88
3 John Science 92
4 Emily Science 80
5 Kate Science 95
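
Two follow-ups to the melt above, sketched with the same sample scores: `value_vars` limits which columns are unpivoted, and once the data is long, per-subject statistics reduce to a single groupby. The names `math_only`, `long_df` and `mean_by_subject` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Emily', 'Kate'],
    'Math': [90, 85, 88],
    'Science': [92, 80, 95],
})

# value_vars restricts the melt to selected columns; here only Math is unpivoted
math_only = pd.melt(df, id_vars='Name', value_vars=['Math'],
                    var_name='Subject', value_name='Score')

# In long format, per-subject statistics become a single groupby
long_df = pd.melt(df, id_vars='Name', var_name='Subject', value_name='Score')
mean_by_subject = long_df.groupby('Subject')['Score'].mean()
print(mean_by_subject)
```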

Long to Wide Format (pivot)

df.pivot() does the opposite of melt: it converts long format back to wide format. Each index/column pair must be unique, otherwise pivot() raises an error. This is useful for:
- Creating summary tables
- Preparing data for certain types of analysis
- Making data more human-readable

df_melted.pivot(index='Name', columns='Subject', values='Score')
Subject Math Science
Name
Emily 85 80
John 90 92
Kate 88 95
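
`pivot()` needs each index/column pair to appear only once. When the long data contains repeats, `pivot_table()` can aggregate them instead. The sketch below uses hypothetical data in which John has two Math scores; the names `pivot_failed` and `wide` are illustrative.

```python
import pandas as pd

# Hypothetical long data in which John has two Math scores
long_df = pd.DataFrame({
    'Name': ['John', 'John', 'Emily', 'John', 'Emily'],
    'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
    'Score': [90, 70, 85, 92, 80],
})

# pivot() raises ValueError here because (John, Math) appears twice
try:
    long_df.pivot(index='Name', columns='Subject', values='Score')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean is the default aggfunc)
wide = long_df.pivot_table(index='Name', columns='Subject', values='Score', aggfunc='mean')
print(wide)
```

Whether averaging is the right way to combine duplicates depends on the data; `aggfunc` also accepts other reductions such as `'max'` or `'sum'`.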

Summary

In this notebook, you learned essential data transformation techniques for categorical data:

  1. Encoding: Convert text categories to numbers
    • One-hot encoding for nominal data
    • Label encoding for ordinal data
  2. Grouping: Analyze category distributions
    • Count frequencies with groupby().size()
    • Aggregate data by categories
  3. Reshaping: Transform data structure
    • melt(): Wide to long format
    • pivot(): Long to wide format

These techniques form the foundation of data preprocessing for machine learning and analysis workflows. Choose the right method based on your data characteristics and modeling requirements!

Next Steps: Practice with real datasets and explore advanced encoding techniques like target encoding or frequency encoding.
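
As a starting point for those next steps, frequency encoding can be sketched in two lines of pandas. This is a minimal illustration, not a full recipe: the sample column has one extra `A` added so the frequencies differ, and the column name `Category_FreqEncoded` is an assumption. Target encoding is not shown because it needs a target column and care to avoid leakage.

```python
import pandas as pd

# Sample column with one extra 'A' (a change from the notebook's data) so frequencies differ
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'C', 'B', 'A', 'A']})

# Relative frequency of each category
freq = df['Category'].value_counts(normalize=True)

# Replace each category with its relative frequency
df['Category_FreqEncoded'] = df['Category'].map(freq)
print(df)
```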