python-mastery

# theory

finding duplicates

# Check for duplicate rows
df.duplicated()                    # True for each row that repeats an earlier row
df.duplicated().sum()              # Count of rows flagged as duplicates
df[df.duplicated()]                # View the flagged rows

# Check specific columns
df.duplicated(subset=["name"])     # Duplicate names only
df.duplicated(subset=["name", "email"])  # Both must match
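
A minimal runnable sketch of the calls above, using sample rows adapted from this lesson's own data. The `keep=False` argument is an addition not shown in the snippets above: it flags every copy in a duplicate group rather than only the later ones.

```python
import pandas as pd

# Sample data: row 2 repeats row 0 exactly; row 4 shares only the name with row 1.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Carol", "Bob"],
    "email": ["alice@test.com", "bob@test.com", "alice@test.com",
              "carol@test.com", "bob2@test.com"],
})

full_row = df.duplicated()                 # compares every column
by_name = df.duplicated(subset=["name"])   # compares only the "name" column
all_copies = df.duplicated(keep=False)     # flags every member of a duplicate group

print(full_row.tolist())    # [False, False, True, False, False]
print(by_name.tolist())     # [False, False, True, False, True]
print(all_copies.tolist())  # [True, False, True, False, False]
print(full_row.sum())       # 1
```

Note that the first occurrence is not flagged by default, so `df.duplicated().sum()` counts the extra copies, not the total number of rows involved.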

removing duplicates

# Remove duplicate rows
df.drop_duplicates()               # Returns a new DataFrame; df itself is unchanged

# Keep first or last occurrence
df.drop_duplicates(keep="first")   # Default
df.drop_duplicates(keep="last")
df.drop_duplicates(keep=False)     # Drop every copy, including the first occurrence

# Based on specific columns
df.drop_duplicates(subset=["email"])
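
To see how the `keep` options differ, here is a small sketch on made-up data (the emails and scores are illustrative only). It prints which original row labels survive each call.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share an email but have different scores.
df = pd.DataFrame({
    "email": ["a@test.com", "b@test.com", "a@test.com", "c@test.com"],
    "score": [95, 82, 97, 91],
})

first = df.drop_duplicates(subset=["email"])               # keeps row 0, drops row 2
last = df.drop_duplicates(subset=["email"], keep="last")   # keeps row 2, drops row 0
none = df.drop_duplicates(subset=["email"], keep=False)    # drops both rows 0 and 2

print(first.index.tolist())  # [0, 1, 3]
print(last.index.tolist())   # [1, 2, 3]
print(none.index.tolist())   # [1, 3]
print(len(df))               # 4 (the original DataFrame is not modified)
```

The surviving rows keep their original index labels, which is why the index has gaps afterwards.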

reset_index

After filtering, the index has gaps; after sorting, its labels are out of order. Reset it to get a clean 0, 1, 2, ... sequence:

df.reset_index()                   # Old index becomes a column
df.reset_index(drop=True)          # Discard old index
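
A short sketch of both forms on a filtered DataFrame. The names and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical data: Bob's score is below the cutoff used in the filter.
df = pd.DataFrame({"name": ["Alice", "Bob", "Carol", "Dan"],
                   "score": [95, 60, 91, 88]})

high = df[df["score"] > 80]            # filtering keeps the original labels
print(high.index.tolist())             # [0, 2, 3]

kept = high.reset_index()              # old labels move into a column named "index"
print(kept.columns.tolist())           # ['index', 'name', 'score']
print(kept["index"].tolist())          # [0, 2, 3]

clean = high.reset_index(drop=True)    # old labels are discarded
print(clean.index.tolist())            # [0, 1, 2]
print(clean.columns.tolist())          # ['name', 'score']
```

Use `drop=True` when the old labels carry no meaning, and the plain form when you still need to trace rows back to their original positions.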

set_index

df.set_index("id")                 # Use 'id' column as index
df.set_index("id", drop=True)      # Remove from columns (default)
df.set_index(["year", "month"])    # Multi-level index
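
The following sketch shows what a column-based index makes possible: label lookups with `.loc`. The `sales` column and all values are hypothetical; only `id`, `year`, and `month` come from the snippets above.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "year": [2023, 2023, 2024],
    "month": [11, 12, 1],
    "sales": [250, 310, 190],
})

by_id = df.set_index("id")                    # "id" leaves the columns and becomes the index
print(by_id.columns.tolist())                 # ['year', 'month', 'sales']
print(by_id.loc[102, "sales"])                # 310

by_period = df.set_index(["year", "month"])   # two-level index
print(list(by_period.index.names))            # ['year', 'month']
print(by_period.loc[(2024, 1), "sales"])      # 190

restored = by_id.reset_index()                # reset_index undoes set_index
print(restored.columns.tolist())              # ['id', 'year', 'month', 'sales']
```

As with the other methods in this lesson, `set_index` returns a new DataFrame and leaves `df` unchanged.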

# examples [3]

# example 01 · finding and removing duplicates

Identify and remove duplicate rows


# example 02 · reset index after filtering

Clean up index after removing rows


# example 03 · set a column as index

Use a meaningful column as the row identifier


# challenges [2]

# challenge 01/02 · todo

Check how many duplicate rows exist in the students DataFrame and print the count.


# challenge 02/02 · todo

Filter students with grade 'A', reset the index (dropping the old one), and print the result.


import pandas as pd
import io

csv_data = """name,email,score
Alice,alice@test.com,95
Bob,bob@test.com,82
Alice,alice@test.com,95
Carol,carol@test.com,91
Bob,bob2@test.com,85"""

df = pd.read_csv(io.StringIO(csv_data))
print("Original data:")
print(df)
print("\nDuplicate rows:", df.duplicated().sum())


# Filter students with grade 'A', reset the index (dropping the old one), and print the result.
# Your code here:


# project

# project-challenge

thread: Survey Insights Report · reward: 50 xp

# brief

Before finalizing your report, verify there are no duplicate survey submissions. Check for duplicate RespondentIDs and reset the index after any filtering operations.

# task

Check for Duplicate Respondents

# your code


import pandas as pd
import io

survey_csv = """RespondentID,Country,Age,YearsExperience,LanguageUsed,Salary,RemoteWork,Education,JobTitle
1001,USA,28,5,Python,95000,Yes,Bachelor's,Data Scientist
1002,India,24,2,Python,28000,Yes,Master's,Data Analyst
1003,USA,35,12,Python,145000,No,Master's,Senior Data Engineer
1004,Canada,29,6,R,82000,Yes,PhD,Research Scientist
1005,UK,31,8,Python,78000,Hybrid,Master's,Machine Learning Engineer
1006,Germany,27,4,SQL,65000,No,Bachelor's,Data Analyst
1007,USA,42,18,Python,175000,Yes,PhD,Principal Data Scientist
1008,India,26,3,Python,32000,Yes,Bachelor's,Data Analyst
1009,USA,33,9,Java,125000,No,Master's,Data Engineer
1010,Canada,38,14,Python,115000,Hybrid,Bachelor's,Senior Data Scientist
1011,UK,25,2,R,45000,Yes,Master's,Junior Data Scientist
1012,India,30,7,Python,48000,Yes,Master's,Data Scientist
1013,USA,29,5,Python,98000,Hybrid,Bachelor's,Data Scientist
1014,Australia,34,10,Python,105000,Yes,Master's,Machine Learning Engineer
1015,Germany,28,4,SQL,58000,No,Bachelor's,Business Analyst"""

survey = pd.read_csv(io.StringIO(survey_csv))

# Check for duplicate RespondentIDs, remove any duplicates, and reset the index


Duplicates & Reset Index
