Removing Duplicate Records
Welcome back to the Data Science Series.
Real-world datasets often contain repeated rows due to data collection issues, system errors, or multiple data sources being merged together. If left untreated, duplicate records can bias analysis, inflate counts, and lead to misleading model performance.
For accurate and fair evaluation, it is essential to identify and remove these duplicates during the preprocessing stage.
Pandas provides a convenient method called drop_duplicates() to eliminate repeated rows from a DataFrame, either by examining all columns or by focusing on specific ones.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters Explained
subset
Specifies which columns should be used to detect duplicates. If not provided, Pandas checks all columns in the DataFrame.
keep
Determines which duplicate record should be preserved:
'first' (default): keeps the first occurrence and removes the rest.
'last': keeps the last occurrence and removes earlier ones.
False: removes all duplicate occurrences entirely.
inplace
True: modifies the original DataFrame directly.
False (default): returns a new DataFrame without altering the original.
Return Value
The method returns a new DataFrame with duplicate rows removed. If inplace=True is specified, it instead modifies the DataFrame in place and returns None.
Examples
Sample DataFrame
import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30]
}
df = pd.DataFrame(data)
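Printing the DataFrame makes the repetition easy to see: rows 2 and 4 are exact copies of rows 0 and 1.
print(df)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 2  Alice   25
# 3  David   40
# 4    Bob   30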
1. Removing Duplicates (Default Behavior)
Keeps the first occurrence of each duplicate row.
df_no_duplicates = df.drop_duplicates()
Result: Duplicate rows appearing later in the DataFrame are removed.
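With the sample DataFrame above, rows 0, 1, and 3 survive, and the copies at rows 2 and 4 are dropped:
print(df_no_duplicates)
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 3  David   40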
2. Keeping the Last Occurrence
Removes earlier duplicates and retains the last matching row.
df_keep_last = df.drop_duplicates(keep='last')
Use case: Helpful when newer records should replace older ones.
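On the sample data the later copies are the ones kept, so the surviving index labels are 2, 3, and 4:
print(df_keep_last)
#     Name  Age
# 2  Alice   25
# 3  David   40
# 4    Bob   30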
3. Removing All Duplicate Records
Removes every row that appears more than once, keeping none of the copies.
df_remove_all = df.drop_duplicates(keep=False)
Use case: Useful when duplicate entries indicate unreliable or corrupted data.
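For the sample data, only the one row that was never repeated remains:
print(df_remove_all)
#     Name  Age
# 3  David   40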
4. Removing Duplicates Based on Specific Columns
Checks for duplicates using only the selected columns.
df_subset = df.drop_duplicates(subset=["Name"])
Result: Only the first occurrence of each unique name is retained, regardless of age.
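With the sample data the output happens to match example 1, because each repeated name also has a matching age. A small hypothetical variation (df_variant is an illustrative name, not part of the sample above) shows where subset makes a difference: the same name with a different age is still treated as a duplicate, whereas the default whole-row check keeps it.
df_variant = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"], "Age": [25, 30, 26]})
print(df_variant.drop_duplicates(subset=["Name"]))
#     Name  Age
# 0  Alice   25
# 1    Bob   30
print(df_variant.drop_duplicates())
#     Name  Age
# 0  Alice   25
# 1    Bob   30
# 2  Alice   26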
5. Modifying the Original DataFrame
Applies changes directly to the existing DataFrame.
df.drop_duplicates(inplace=True)
Note: This operation cannot be undone unless a copy of the original data was saved.
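A common safeguard, sketched here with an illustrative variable name df_backup, is to keep a copy before modifying in place:
df_backup = df.copy()             # preserve the original rows
df.drop_duplicates(inplace=True)  # df now holds only unique rows
print(df.shape)                   # (3, 2) for the sample data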
Key Takeaways
Duplicate records can distort statistical analysis and machine learning results.
drop_duplicates() offers flexible control over how duplicates are detected and removed.
Choosing the correct keep strategy depends on the nature of your data and analysis goals.
Always inspect your data after removing duplicates to ensure no important information was lost.
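As a quick way to perform that inspection, the duplicated() method can count repeated rows before and after cleaning. A minimal sketch, assuming the sample DataFrame from this article:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30]
})

print(df.duplicated().sum())         # 2 repeated rows before cleaning
cleaned = df.drop_duplicates()
print(cleaned.duplicated().sum())    # 0 repeated rows after cleaning
print(len(df), len(cleaned))         # 5 3 -- confirms how many rows were dropped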