pyodide: loading…

[concept]Data Cleaning

Datetime Handling

# theory

parsing dates

Most CSVs give you dates as strings. pd.to_datetime parses them. It's smart about common formats: 2024-01-15, 01/15/2024, Jan 15 2024 all work.

df["date"] = pd.to_datetime(df["date"])

If parsing fails on a row, you'll get an error. Use errors="coerce" to turn bad rows into NaT (missing) instead:

df["date"] = pd.to_datetime(df["date"], errors="coerce")

If the format is weird and pandas can't guess, tell it:

df["date"] = pd.to_datetime(df["date"], format="%d-%b-%Y")

the .dt accessor

Once a column is datetime, .dt gives you parts:

df["date"].dt.year         # 2024
df["date"].dt.month        # 1
df["date"].dt.day          # 15
df["date"].dt.day_name()   # "Monday"
df["date"].dt.weekday      # 0 (Monday)
df["date"].dt.is_month_end # True/False

date arithmetic

Subtracting two datetimes gives you a Timedelta:

df["days_since"] = (pd.Timestamp.today() - df["date"]).dt.days

Adding offsets:

df["next_week"] = df["date"] + pd.Timedelta(days=7)
df["next_month"] = df["date"] + pd.DateOffset(months=1)

filtering by date

Once parsed, comparisons just work:

recent = df[df["date"] >= "2024-01-01"]
last_30_days = df[df["date"] >= pd.Timestamp.today() - pd.Timedelta(days=30)]

formatting

strftime formats a datetime as a string:

df["date_str"] = df["date"].dt.strftime("%Y-%m-%d")
df["pretty"] = df["date"].dt.strftime("%B %d, %Y")  # "January 15, 2024"

Common format codes: %Y year, %m month, %d day, %H hour (24h), %M minute, %B month name, %A day name.

# examples [3]

# example 01 · parse a column with mixed-quality dates

errors='coerce' converts unparseable values to NaT instead of raising

1
2
3
4
5
6
7
8
🐍
Loading PythonSetting up pandas & numpy...
# example 02 · extract parts with .dt

Day name, month, weekday flag, all from the same column

1
2
3
4
5
6
7
8
9
🐍
Loading PythonSetting up pandas & numpy...
# example 03 · filter to the last n days

Today minus a Timedelta is the cutoff

1
2
3
4
5
6
7
8
9
🐍
Loading PythonSetting up pandas & numpy...

# challenges [2]

# challenge 01/02todo
Parse a column of date strings to datetime and print the day_name() for each row. Use df['dates'] = ['2024-03-04', '2024-07-21', '2024-11-15'] as your data.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
🐍
Loading PythonSetting up pandas & numpy...
# challenge 02/02todo
Parse a column with some bad/unparseable dates ('2024-01-15', 'not a date', '2024-02-30', '2024-04-10') using errors='coerce' and print exactly 'bad rows: N' where N is the count of NaT rows.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
🐍
Loading PythonSetting up pandas & numpy...