
Removing Duplicate Records

December 8, 2025 · 3 min read

Tags: pandas, data-cleaning, duplicates, preprocessing

Welcome back to the Data Science Series.

Real-world datasets often contain repeated rows due to data collection issues, system errors, or multiple data sources being merged together. If left untreated, duplicate records can bias analysis, inflate counts, and lead to misleading model performance.

For accurate and fair evaluation, it is essential to identify and remove these duplicates during the preprocessing stage.

Pandas provides a convenient method called drop_duplicates() to eliminate repeated rows from a DataFrame, either by examining all columns or by focusing on specific ones.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters Explained

subset

Specifies which columns should be used to detect duplicates. If not provided, Pandas checks all columns in the DataFrame.

keep

Determines which duplicate record should be preserved:

'first' (default): keeps the first occurrence and removes the rest.

'last': keeps the last occurrence and removes earlier ones.

False: removes all duplicate occurrences entirely.

inplace

True: modifies the original DataFrame directly.

False (default): returns a new DataFrame without altering the original.

Return Value

The method returns a new DataFrame with the duplicate rows removed. If inplace=True is specified, it instead modifies the DataFrame in place and returns None.

Examples

Sample DataFrame

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30]
}

df = pd.DataFrame(data)
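Before removing anything, it is often useful to see which rows pandas considers duplicates. The companion method duplicated() returns a boolean mask and follows the same keep semantics as drop_duplicates():

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
})

# duplicated() marks every row whose values repeat an earlier row.
print(df.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool
```

Here rows 2 and 4 repeat rows 0 and 1, so they are the ones flagged.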

1. Removing Duplicates (Default Behavior)

Keeps the first occurrence of each duplicate row.

df_no_duplicates = df.drop_duplicates()

Result: Duplicate rows appearing later in the DataFrame are removed.
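With the sample DataFrame above, the later copies of the Alice and Bob rows (indices 2 and 4) are dropped while the original index labels are preserved:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
})

# Default behavior: keep the first occurrence of each duplicate row.
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   40
```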

2. Keeping the Last Occurrence

Removes earlier duplicates and retains the last matching row.

df_keep_last = df.drop_duplicates(keep='last')

Use case: Helpful when newer records should replace older ones.
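On the same sample data, keep='last' discards the earlier Alice and Bob rows in favor of their later copies. Note that the surviving rows keep their original index labels and order:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
})

# keep='last' drops rows 0 and 1 and retains their later copies.
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
#     Name  Age
# 2  Alice   25
# 3  David   40
# 4    Bob   30
```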

3. Removing All Duplicate Records

Eliminates every occurrence of duplicated rows.

df_remove_all = df.drop_duplicates(keep=False)

Use case: Useful when duplicate entries indicate unreliable or corrupted data.
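With keep=False, both copies of each repeated row are discarded, so only the row that was never duplicated survives:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
})

# keep=False removes every row that has a duplicate anywhere in the frame.
df_remove_all = df.drop_duplicates(keep=False)
print(df_remove_all)
#     Name  Age
# 3  David   40
```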

4. Removing Duplicates Based on Specific Columns

Checks duplicates using only selected columns.

df_subset = df.drop_duplicates(subset=["Name"])

Result: Only the first occurrence of each unique name is retained, regardless of age.
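In the original sample every repeated name also repeats the age, so subset=["Name"] gives the same result as the default. The sketch below varies one age (the second Alice row is changed to 28, purely as an illustrative assumption) to make the difference visible:

```python
import pandas as pd

# Variant of the article's sample: the second "Alice" row has a
# different Age (28), so it is NOT a full-row duplicate.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 28, 40, 30],
})

# All columns considered: the (Alice, 28) row survives.
print(len(df.drop_duplicates()))  # 4

# Only "Name" considered: (Alice, 28) is dropped as a repeat of "Alice".
df_subset = df.drop_duplicates(subset=["Name"])
print(df_subset)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   40
```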

5. Modifying the Original DataFrame

Applies changes directly to the existing DataFrame.

df.drop_duplicates(inplace=True)

Note: This operation cannot be undone unless a copy of the original data was saved.
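A simple safeguard, sketched here, is to keep a copy before modifying in place. Note also that with inplace=True the method returns None, so the result should not be assigned back:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
})

df_backup = df.copy()  # keep an untouched copy, since inplace edits are irreversible
result = df.drop_duplicates(inplace=True)

print(result)   # None -- inplace=True returns nothing
print(len(df))  # 3 rows remain in df; df_backup still holds all 5
```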

Key Takeaways

Duplicate records can distort statistical analysis and machine learning results.

drop_duplicates() offers flexible control over how duplicates are detected and removed.

Choosing the correct keep strategy depends on the nature of your data and analysis goals.

Always inspect your data after removing duplicates to ensure no important information was lost.