Removing Duplicate Records
Welcome back to the Data Science Series.
Real-world datasets often contain repeated rows due to data collection issues, system errors, or multiple data sources being merged together. If left untreated, duplicate records can bias analysis, inflate counts, and lead to misleading model performance.
For accurate and fair evaluation, it is essential to identify and remove these duplicates during the preprocessing stage.
Pandas provides a convenient method called drop_duplicates() to eliminate repeated rows from a DataFrame, either by examining all columns or by focusing on specific ones.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters Explained
subset
Specifies which columns should be used to detect duplicates. If not provided, Pandas checks all columns in the DataFrame.
keep
Determines which duplicate record should be preserved:
'first' (default): keeps the first occurrence and removes the rest.
'last': keeps the last occurrence and removes earlier ones.
False: removes all duplicate occurrences entirely.
inplace
True: modifies the original DataFrame directly.
False (default): returns a new DataFrame without altering the original.
Return Value
The method returns a new DataFrame with duplicate rows removed. If inplace=True is specified, it instead modifies the DataFrame in place and returns None.
Examples
Sample DataFrame
import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30]
}
df = pd.DataFrame(data)
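Printing the DataFrame makes the repetition easy to see: rows 2 and 4 are exact copies of rows 0 and 1.
print(df)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 2  Alice   25
# 3  David   40
# 4    Bob   30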
1. Removing Duplicates (Default Behavior)
Keeps the first occurrence of each duplicate row.
df_no_duplicates = df.drop_duplicates()
Result: Duplicate rows appearing later in the DataFrame are removed.
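With the sample DataFrame above, rows 0, 1, and 3 survive, and the copies at rows 2 and 4 are dropped:
print(df_no_duplicates)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   40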
2. Keeping the Last Occurrence
Removes earlier duplicates and retains the last matching row.
df_keep_last = df.drop_duplicates(keep='last')
Use case: Helpful when newer records should replace older ones.
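On the sample data the later copies are the ones kept, so the surviving index labels are 2, 3, and 4:
print(df_keep_last)
#     Name  Age
# 2  Alice   25
# 3  David   40
# 4    Bob   30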
3. Removing All Duplicate Records
Removes every row that appears more than once, keeping none of the copies.
df_remove_all = df.drop_duplicates(keep=False)
Use case: Useful when duplicate entries indicate unreliable or corrupted data.
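For the sample data, only the one row that was never repeated remains:
print(df_remove_all)
#     Name  Age
# 3  David   40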
4. Removing Duplicates Based on Specific Columns
Checks for duplicates using only the selected columns.
df_subset = df.drop_duplicates(subset=["Name"])
Result: Only the first occurrence of each unique name is retained, regardless of age.
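With the sample data the output happens to match example 1, because each repeated name also has a matching age. A small hypothetical variation (df_variant is an illustrative name, not part of the sample above) shows where subset makes a difference: the same name with a different age is still treated as a duplicate, whereas the default whole-row check keeps it.
df_variant = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 26]})
print(df_variant.drop_duplicates(subset=["Name"]))
#     Name  Age
# 0  Alice   25
# 1    Bob   30
print(df_variant.drop_duplicates())
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 2  Alice   26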
5. Modifying the Original DataFrame
Applies changes directly to the existing DataFrame.
df.drop_duplicates(inplace=True)
Note: This operation cannot be undone unless a copy of the original data was saved.
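A common safeguard, sketched here with an illustrative variable name df_backup, is to keep a copy before modifying in place:
df_backup = df.copy()             # preserve the original rows
df.drop_duplicates(inplace=True)  # df now holds only unique rows
print(df.shape)                   # (3, 2) for the sample data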
Key Takeaways
Duplicate records can distort statistical analysis and machine learning results.
drop_duplicates() offers flexible control over how duplicates are detected and removed.
Choosing the correct keep strategy depends on the nature of your data and analysis goals.
Always inspect your data after removing duplicates to ensure no important information was lost.
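As a quick way to perform that inspection, the duplicated() method can count repeated rows before and after cleaning. A minimal sketch, assuming the sample DataFrame from this article:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30]
})

print(df.duplicated().sum())         # 2 repeated rows before cleaning
cleaned = df.drop_duplicates()
print(cleaned.duplicated().sum())    # 0 repeated rows after cleaning
print(len(df), len(cleaned))         # 5 3 -- confirms how many rows were dropped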