[practice] Data Cleaning
Duplicates & Reset Index
# theory
finding duplicates
# Check for duplicate rows
df.duplicated() # Boolean Series: True for each row that repeats an earlier row
df.duplicated().sum() # Number of repeated rows (first occurrences are not counted)
df[df.duplicated()] # View the repeated rows
# Check specific columns
df.duplicated(subset=["name"]) # Duplicate names only
df.duplicated(subset=["name", "email"]) # Both must match
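To make the difference between whole-row and `subset` checks concrete, here is a minimal sketch. The sample names and emails are made up for illustration and are not from the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical; row 3 shares only the name.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Ana"],
    "email": ["ana@x.com", "ben@x.com", "ana@x.com", "ana@y.com"],
})

full_mask = df.duplicated()                 # compares every column
name_mask = df.duplicated(subset=["name"])  # compares only "name"

print(full_mask.tolist())     # [False, False, True, False]
print(name_mask.tolist())     # [False, False, True, True]
print(int(full_mask.sum()))   # 1
```

The first occurrence of a row is never flagged by default, which is why index 0 stays `False` in both masks.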
removing duplicates
# Remove duplicate rows
df.drop_duplicates() # Returns a new DataFrame; df itself is unchanged
# Keep first or last occurrence
df.drop_duplicates(keep="first") # Default
df.drop_duplicates(keep="last")
df.drop_duplicates(keep=False) # Drop every row that has a duplicate; no copy is kept
# Based on specific columns
df.drop_duplicates(subset=["email"])
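The effect of each `keep` option is easiest to see by looking at which index labels survive. A small sketch, assuming a made-up frame with one repeated email:

```python
import pandas as pd

# Hypothetical sample data: the email "a@x.com" appears at index 0 and 2.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "visits": [1, 5, 3],
})

first = df.drop_duplicates(subset=["email"], keep="first")
last = df.drop_duplicates(subset=["email"], keep="last")
none = df.drop_duplicates(subset=["email"], keep=False)

print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
print(len(df))               # 3, the original frame is unchanged
```

Note that the surviving rows keep their original labels, so the index can end up with gaps.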
reset_index
After filtering, the index can have gaps; after sorting, its labels are out of order. Reset it to get a clean 0, 1, 2, ... index:
df.reset_index() # Old index becomes a column
df.reset_index(drop=True) # Discard old index
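The two calls differ only in what happens to the old labels. A minimal sketch, using a made-up `score` column and a sort to scramble the index:

```python
import pandas as pd

# Hypothetical data, sorted by score so the original labels end up out of order.
df = pd.DataFrame({"score": [40, 85, 55]})
ranked = df.sort_values("score", ascending=False)
print(ranked.index.tolist())             # [1, 2, 0]

kept = ranked.reset_index()              # old labels move into an "index" column
dropped = ranked.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())      # ['index', 'score']
print(kept["index"].tolist())     # [1, 2, 0]
print(dropped.index.tolist())     # [0, 1, 2]
print(dropped["score"].tolist())  # [85, 55, 40]
```

Keeping the old index is useful when the original row positions still matter; otherwise `drop=True` avoids an extra column.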
set_index
df.set_index("id") # Use 'id' column as index
df.set_index("id", drop=True) # Remove from columns (default)
df.set_index(["year", "month"]) # Multi-level index
# examples [3]
# example 01 · finding and removing duplicates
Identify and remove duplicate rows
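The original code for this example is not available here, so the following is only a sketch of the workflow it describes: count the duplicates, inspect them, then drop them. The `orders` data and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical order data; the last row repeats the first one exactly.
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 101],
    "customer": ["Ana", "Ben", "Cy", "Ana"],
    "total": [25.0, 40.0, 15.5, 25.0],
})

# Step 1: count fully duplicated rows.
n_dupes = int(orders.duplicated().sum())
print(f"Duplicate rows: {n_dupes}")   # Duplicate rows: 1

# Step 2: look at the repeated rows before deleting anything.
print(orders[orders.duplicated()])

# Step 3: drop them and confirm none remain.
clean = orders.drop_duplicates()
print(clean.shape)                    # (3, 3)
print(int(clean.duplicated().sum()))  # 0
```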
# example 02 · reset index after filtering
Clean up index after removing rows
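The original code for this example is not available here. As a sketch of the idea, assuming a made-up `products` frame, filtering leaves gaps in the index and `reset_index(drop=True)` renumbers the rows:

```python
import pandas as pd

# Hypothetical product data.
products = pd.DataFrame({
    "item": ["pen", "lamp", "desk", "mug", "chair"],
    "price": [2, 30, 150, 8, 85],
})

# Keep only items priced at 20 or more; surviving rows keep their old labels.
pricey = products[products["price"] >= 20]
print(pricey.index.tolist())   # [1, 2, 4]

# Renumber from 0 and discard the old labels.
pricey = pricey.reset_index(drop=True)
print(pricey.index.tolist())   # [0, 1, 2]
print(pricey.loc[0, "item"])   # lamp
```

After the reset, label-based lookups such as `.loc[0]` line up with row positions again.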
# example 03 · set a column as index
Use a meaningful column as the row identifier
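The original code for this example is not available here. A minimal sketch, assuming a hypothetical `staff` frame with a unique `id` column, shows how `set_index` enables lookups by that identifier:

```python
import pandas as pd

# Hypothetical employee data with a unique "id" column.
staff = pd.DataFrame({
    "id": [7, 3, 9],
    "name": ["Ana", "Ben", "Cy"],
    "dept": ["ops", "dev", "dev"],
})

# "id" leaves the columns and becomes the row labels.
by_id = staff.set_index("id")
print(by_id.columns.tolist())     # ['name', 'dept']
print(by_id.loc[3, "name"])       # Ben

# reset_index() reverses the operation, restoring "id" as a regular column.
restored = by_id.reset_index()
print(restored.columns.tolist())  # ['id', 'name', 'dept']
```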
# challenges [2]
# challenge 01/02 · todo
Check how many duplicate rows exist in the students DataFrame and print the count.
# challenge 02/02 · todo
Filter students with grade 'A', reset the index (dropping the old one), and print the result.
# project
# project-challenge
thread: Survey Insights Report · reward: 50 xp
# brief
Before finalizing your report, verify there are no duplicate survey submissions. Check for duplicate RespondentIDs and reset the index after any filtering operations.
# task
Check for Duplicate Respondents
# your code