pyodide: loading…

[concept]Data Manipulation (WCTC)

Working with Numbers & Types

# theory

checking dtypes

First thing you wanna do with any new dataset is check what types you're working with.

df.dtypes          # see all column types
df["col"].dtype    # just one column
df.info()          # types plus null counts

Common types you'll see:

  • int64; integers
  • float64; decimals
  • object; usually strings (the label is misleading)
  • bool; True/False
  • datetime64; dates

astype

When pandas reads a CSV, it guesses types. Sometimes it guesses wrong.

# Convert to different types
df["price"] = df["price"].astype(float)
df["quantity"] = df["quantity"].astype(int)
df["active"] = df["active"].astype(bool)

If the conversion isn't possible (like turning "hello" into an int), it'll throw an error. Which is actually helpful.

pd.to_numeric for messy data

Here's the thing; astype is strict. If you have even one bad value, it fails. pd.to_numeric is more forgiving.

# errors='coerce' turns bad values into NaN instead of crashing
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# errors='ignore' just leaves bad values as-is
df["price"] = pd.to_numeric(df["price"], errors="ignore")

I use errors="coerce" way more than I expected. Real data is messy.

rounding

df["price"].round(2)     # round to 2 decimal places
df["price"].round(0)     # round to whole number
df["price"].round(-1)    # round to nearest 10

You can also use numpy functions:

import numpy as np
np.floor(df["price"])    # always round down
np.ceil(df["price"])     # always round up

formatting via apply

For display formatting, apply is your friend.

# Format as currency
df["price"].apply(lambda x: f"${x:,.2f}")

# Format as percentage
df["rate"].apply(lambda x: f"{x:.1%}")

# Add commas to large numbers
df["population"].apply(lambda x: f"{x:,}")

mixed-type columns

Sometimes a column has both numbers and text. This happens more than you'd think.

# Check if values are numeric
pd.to_numeric(df["col"], errors="coerce").notna()

# Find the non-numeric rows
mask = pd.to_numeric(df["col"], errors="coerce").isna()
print(df[mask])  # shows you the problem rows

This is super useful for debugging data quality issues.

# examples [3]

# example 01 · converting and cleaning number columns

Handle a column that has some non-numeric values mixed in

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
🐍
Loading PythonSetting up pandas & numpy...
# example 02 · rounding and formatting

Make numbers look nice for display

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
🐍
Loading PythonSetting up pandas & numpy...
# example 03 · type conversion chain

Sometimes you need multiple conversions to get data right

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🐍
Loading PythonSetting up pandas & numpy...

# challenges [2]

# challenge 01/02todo
Calculate the average score from the students DataFrame, round it to 1 decimal place, and print it.
1
2
3
4
5
6
7
8
9
10
11
12
🐍
Loading PythonSetting up pandas & numpy...
# challenge 02/02todo
Create a new column 'score_display' that formats each score with a % sign (like '95%') and print the first 3 rows.
1
2
3
4
5
6
7
8
9
10
11
12
🐍
Loading PythonSetting up pandas & numpy...