[practice] Data Cleaning
Type Conversion
# theory
why types matter
Pandas infers column types when it loads data, and it can get them wrong: numbers may be read as strings and dates as plain text. Wrong types mean:
- Math operations fail
- Sorting is alphabetical instead of numeric
- Filtering doesn't work as expected
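To make the sorting problem concrete, here is a minimal sketch. The `price` column and its values are hypothetical, chosen only to show how text and numbers order differently.

```python
import pandas as pd

# Hypothetical data: prices that arrived as strings (e.g. after a CSV import)
df = pd.DataFrame({"price": ["100", "25", "9"]})

# As text, values are compared character by character: "100" < "25" < "9"
sorted_as_text = df["price"].sort_values().tolist()
text_max = df["price"].max()

# After conversion, sorting and max() follow numeric order
numeric = pd.to_numeric(df["price"])
sorted_as_numbers = numeric.sort_values().tolist()
numeric_max = numeric.max()

print(sorted_as_text)     # ['100', '25', '9']
print(text_max)           # 9
print(sorted_as_numbers)  # [9, 25, 100]
print(numeric_max)        # 100
```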
checking
df.dtypes # Type of each column
df["col"].dtype # Type of specific column
df.info() # Types + non-null counts
converting
# To numeric
df["price"] = pd.to_numeric(df["price"])
df["price"] = pd.to_numeric(df["price"], errors="coerce") # NaN for invalid
# To datetime
df["date"] = pd.to_datetime(df["date"])
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
# Using astype
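df["age"] = df["age"].astype(int) # Raises if the column has NaN or non-numeric text
df["active"] = df["active"].astype(bool) # Best suited to 0/1 or numeric flags
df["id"] = df["id"].astype(str) # Keep identifiers as text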
the errors parameter
pd.to_numeric(series, errors="raise") # Default: raise error on invalid
pd.to_numeric(series, errors="coerce") # Convert invalid to NaN
pd.to_numeric(series, errors="ignore") # Return original on error (deprecated since pandas 2.2; prefer try/except)
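The difference between `"raise"` and `"coerce"` is easiest to see side by side. The sketch below uses a hypothetical `raw` Series with two unparseable entries. It also shows one way to find which original values were turned into NaN.

```python
import pandas as pd

# Hypothetical input: two valid numbers and two entries that cannot be parsed
raw = pd.Series(["19.99", "5", "unknown", "twelve"])

# errors="raise" (the default) stops at the first invalid value
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors="coerce" keeps going and replaces invalid values with NaN
coerced = pd.to_numeric(raw, errors="coerce")

# Inspect what was lost before deciding how to handle it
n_invalid = int(coerced.isna().sum())
bad_values = raw[coerced.isna()].tolist()

print(raised)      # True
print(n_invalid)   # 2
print(bad_values)  # ['unknown', 'twelve']
```

Checking `bad_values` before dropping or filling the NaN entries helps catch cases where a value was recoverable, such as a number with a stray unit suffix.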
date formats
| Format | Example |
|---|---|
| %Y-%m-%d | 2024-03-15 |
| %m/%d/%Y | 03/15/2024 |
| %d-%b-%Y | 15-Mar-2024 |
| %Y-%m-%d %H:%M:%S | 2024-03-15 14:30:00 |
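As a quick check of the table above, the sketch below parses each example string with its matching format. The `samples` dictionary is just the table restated. Note that `%b` matches abbreviated month names, which this sketch assumes are in English.

```python
import pandas as pd

# Format string -> example text, copied from the table above
samples = {
    "%Y-%m-%d": "2024-03-15",
    "%m/%d/%Y": "03/15/2024",
    "%d-%b-%Y": "15-Mar-2024",
    "%Y-%m-%d %H:%M:%S": "2024-03-15 14:30:00",
}

# Parse every example with its own explicit format
parsed = {fmt: pd.to_datetime(text, format=fmt) for fmt, text in samples.items()}

for fmt, ts in parsed.items():
    print(f"{fmt:<20} -> {ts}")
```

All four strings describe the same calendar day. Only the last one carries a time of day.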
dates
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
# examples [3]
# example 01 · converting to numeric
Handle strings that should be numbers
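The original editor content for this example is not available, so the following is one possible sketch rather than the lesson's own code. It assumes a hypothetical `price` column containing currency symbols, thousands separators, stray whitespace, and one value that is not a number.

```python
import pandas as pd

# Hypothetical messy input
df = pd.DataFrame({"price": ["$1,200", "350", " 49.5 ", "free"]})

# Strip characters that would prevent numeric parsing
cleaned = (
    df["price"]
    .str.strip()
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

# Anything still unparseable ("free") becomes NaN instead of raising
df["price"] = pd.to_numeric(cleaned, errors="coerce")

print(df["price"].dtype)     # float64
print(df["price"].tolist())  # [1200.0, 350.0, 49.5, nan]
print(df["price"].sum())     # 1599.5 (NaN is skipped by default)
```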
# example 02 · converting dates
Parse date strings into datetime objects
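The original editor content for this example is not available. A minimal sketch under assumed data might look like this: a hypothetical `date` column with one invalid entry, parsed with an explicit format and then broken into parts with the `.dt` accessor.

```python
import pandas as pd

# Hypothetical input with one entry that is not a date
df = pd.DataFrame({"date": ["2024-03-15", "2024-04-01", "not a date"]})

# Invalid entries become NaT (the datetime equivalent of NaN)
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# Once the column is datetime, the .dt accessor exposes date parts
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

print(df["date"].dtype)
print(df)
```

Rows where parsing failed hold NaT in `date` and missing values in the derived columns, so they are easy to filter out with `df["date"].notna()`.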
# example 03 · using astype
Direct type conversion
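The original editor content for this example is not available, so this is an illustrative sketch with made-up columns. It shows `astype` on clean data, plus the nullable `"Int64"` dtype as one option when an integer column contains missing values, since plain `astype(int)` raises in that case.

```python
import pandas as pd

# Hypothetical clean data: digits stored as text, 0/1 flags, numeric ids
df = pd.DataFrame({
    "age": ["31", "45", "27"],
    "active": [1, 0, 1],
    "id": [101, 102, 103],
})

df["age"] = df["age"].astype(int)         # text digits -> integers
df["active"] = df["active"].astype(bool)  # 1/0 -> True/False
df["id"] = df["id"].astype(str)           # ids are labels, not quantities

# With a missing value, astype(int) fails; the nullable "Int64" dtype does not
scores = pd.Series([88.0, None, 92.0])
try:
    scores.astype(int)
    int_cast_failed = False
except ValueError:
    int_cast_failed = True
nullable = scores.astype("Int64")

print(df.dtypes)
print(int_cast_failed)  # True
print(nullable)
```

One caution: `astype(bool)` follows Python truthiness, so with the classic object dtype a text value such as `"False"` would convert to `True`. Mapping text flags explicitly, for example with `.map({"True": True, "False": False})`, avoids that.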
# challenges [2]
# challenge 01/02
Convert the 'date' column in sales DataFrame to datetime and print the data types.
# challenge 02/02
Convert students['score'] to float type and print the mean.
# project
# project-challenge
thread: Survey Insights Report · reward: 50 xp
# brief
The survey data was imported with some columns as generic (object) types. Make sure Salary is an integer and Age has a proper numeric type so the statistics in your recruiter report are accurate.
# task
Fix Salary and Age Data Types
# your code