
[practice] Data Cleaning

Type Conversion

# theory

why types matter

When loading data, pandas may store numbers as strings or dates as plain text. Wrong types mean:

  • Math operations fail or work on text instead of numbers
  • Sorting is alphabetical instead of numeric
  • Filters compare text, so they return unexpected rows
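
The effects listed above can be reproduced in a few lines. This is a minimal sketch; the `prices` series is made-up data standing in for a column that was loaded as text.

```python
import pandas as pd

# Hypothetical data: prices that were loaded as strings
prices = pd.Series(["10", "9", "100"])
is_numeric_before = pd.api.types.is_numeric_dtype(prices)

# Sorting text is alphabetical: "100" comes before "9"
sorted_as_text = prices.sort_values().tolist()

# Filtering compares text character by character, so only "9" passes
text_filter = prices[prices > "5"]

# After conversion, sorting and filtering behave numerically
numeric = pd.to_numeric(prices)
sorted_as_numbers = numeric.sort_values().tolist()
numeric_filter = numeric[numeric > 5]

print(is_numeric_before)
print(sorted_as_text)
print(text_filter.tolist())
print(sorted_as_numbers)
print(numeric_filter.tolist())
```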

checking

df.dtypes         # Type of each column
df["col"].dtype   # Type of specific column
df.info()         # Types + non-null counts

converting

# To numeric
df["price"] = pd.to_numeric(df["price"])
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # NaN for invalid

# To datetime
df["date"] = pd.to_datetime(df["date"])
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

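# Using astype
df["age"] = df["age"].astype(int)         # Raises on missing or non-numeric values
df["active"] = df["active"].astype(bool)  # Caution: with object-dtype strings, any non-empty value (even "False") becomes True
df["id"] = df["id"].astype(str)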

the errors parameter

pd.to_numeric(series, errors="raise")   # Default: raise an error on invalid values
pd.to_numeric(series, errors="coerce")  # Convert invalid values to NaN
pd.to_numeric(series, errors="ignore")  # Return the input unchanged on error (deprecated since pandas 2.2)
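
A short sketch of the difference between the default and `errors="coerce"`. The `raw` series and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data: one value cannot be parsed as a number
raw = pd.Series(["19.99", "5", "unknown"])

# Default behaviour (errors="raise"): the bad value raises ValueError
raised = False
try:
    pd.to_numeric(raw)
except ValueError as exc:
    raised = True
    print("raise:", exc)

# errors="coerce": the bad value becomes NaN and the rest is converted
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)
print("invalid values:", coerced.isna().sum())
```

Counting the NaN values after coercion is a quick way to see how many entries were invalid before deciding whether to drop or fix them.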

date formats

Format               Example
%Y-%m-%d             2024-03-15
%m/%d/%Y             03/15/2024
%d-%b-%Y             15-Mar-2024
%Y-%m-%d %H:%M:%S    2024-03-15 14:30:00
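
To check that each format string matches its example, the pairs from the table can be passed to `pd.to_datetime` one at a time. A minimal sketch; the `samples` dict is only a helper for this illustration, and `%b` assumes English month abbreviations.

```python
import pandas as pd

# Format/example pairs taken from the table above
samples = {
    "%Y-%m-%d": "2024-03-15",
    "%m/%d/%Y": "03/15/2024",
    "%d-%b-%Y": "15-Mar-2024",
    "%Y-%m-%d %H:%M:%S": "2024-03-15 14:30:00",
}

# Parse each example with its matching format
parsed = {fmt: pd.to_datetime(text, format=fmt) for fmt, text in samples.items()}

for fmt, ts in parsed.items():
    print(f"{fmt:<20} -> {ts}")
```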

dates

df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

# examples [3]

# example 01 · converting to numeric

Handle strings that should be numbers
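
One possible version of this example, as a sketch: the DataFrame and its values are made up, and stripping `$` and `,` before conversion is an assumption about what the messy strings look like.

```python
import pandas as pd

# Hypothetical data: prices stored as text, with symbols and one invalid entry
df = pd.DataFrame({"price": ["$1,200", "350", " 42.5 ", "free"]})

# Remove whitespace, currency symbols, and thousands separators
cleaned = (
    df["price"]
    .str.strip()
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Convert; anything still not numeric ("free") becomes NaN
df["price_num"] = pd.to_numeric(cleaned, errors="coerce")
total = df["price_num"].sum()  # NaN is skipped by default

print(df)
print("total:", total)
```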

# example 02 · converting dates

Parse date strings into datetime objects
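
One possible version of this example, as a sketch with made-up data. It parses a text column with an explicit format, turns unparseable entries into NaT, and then reads components through `.dt`.

```python
import pandas as pd

# Hypothetical data: two valid dates and one invalid entry
df = pd.DataFrame({"date": ["2024-03-15", "2024-12-01", "not a date"]})

# Parse with an explicit format; invalid entries become NaT
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Extract components with the .dt accessor
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

print(df.dtypes)
print(df)
```

Rows that failed to parse have missing values in every derived column, so they are easy to find with `df["date"].isna()`.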

# example 03 · using astype

Direct type conversion
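
One possible version of this example, as a sketch with made-up data. The `active` column is created with `dtype=object` on purpose so that the `astype(bool)` pitfall shows up, and the `map` call is just one way to convert "True"/"False" text correctly.

```python
import pandas as pd

# Hypothetical data: every column has the wrong type
df = pd.DataFrame({
    "age": ["25", "31", "47"],   # numbers stored as text
    "id": [101, 102, 103],       # identifiers stored as integers
    "active": pd.Series(["True", "False", "True"], dtype=object),
})

df["age"] = df["age"].astype(int)  # text -> integer
df["id"] = df["id"].astype(str)    # integer -> text

# Pitfall: astype(bool) on object strings tests for non-empty text,
# so the string "False" also becomes True
naive_bool = df["active"].astype(bool)

# Map the text values explicitly instead
df["active"] = df["active"].map({"True": True, "False": False})

print(naive_bool.tolist())
print(df)
print(df["age"].sum())
```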


# challenges [2]

# challenge 01/02 · todo
Convert the 'date' column in the sales DataFrame to datetime and print the data types.
# challenge 02/02 · todo
Convert students['score'] to float type and print the mean.

# project

# project-challenge

thread: Survey Insights Report · reward: 50 xp

# brief

The survey data was imported with some columns as generic types. You need to ensure Salary is an integer and Age is properly typed for accurate statistical analysis in your recruiter report.

# task

Fix Salary and Age Data Types
