[concept]Data Manipulation (WCTC)

String Methods in Pandas

# theory

the .str accessor

Pandas exposes a .str accessor that lets you call string methods on every value in a column at once. It's easy to overlook, and it replaces most hand-written loops over text columns.

df["name"].str.upper()   # uppercase everything
df["name"].str.lower()   # lowercase everything
df["name"].str.strip()   # remove leading/trailing whitespace

Without .str you'd have to loop through the rows (or use apply) and handle missing values yourself. With it, the method runs on the whole column and missing values simply stay missing.
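To make the comparison concrete, here is a small sketch that does the same cleanup both ways. The DataFrame and its values are made up for illustration; only the "name" column mirrors the snippets above.

```python
import pandas as pd

# Hypothetical sample data, including one missing value.
df = pd.DataFrame({"name": ["  Alice ", "BOB", None, "carol"]})

# Loop version: each value has to be checked for missing data by hand.
looped = [v.strip().lower() if isinstance(v, str) else v for v in df["name"]]

# Vectorized version: .str applies the method to every value
# and leaves the missing one as missing.
vectorized = df["name"].str.strip().str.lower()

print(looped)
print(vectorized.tolist())
```

Both versions give the same cleaned strings, but the .str version needs no type check.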

common methods

Changing case:

df["text"].str.upper()      # ALL CAPS
df["text"].str.lower()      # all lowercase
df["text"].str.title()      # Title Case
df["text"].str.capitalize() # First character upper, the rest lower
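The difference between title and capitalize is easiest to see side by side. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up sample strings with mixed case.
text = pd.Series(["hello WORLD", "pandas str methods"])

upper = text.str.upper()
title = text.str.title()
capitalized = text.str.capitalize()

print(upper.tolist())        # ['HELLO WORLD', 'PANDAS STR METHODS']
print(title.tolist())        # ['Hello World', 'Pandas Str Methods']
print(capitalized.tolist())  # ['Hello world', 'Pandas str methods']
```

Note that capitalize also lowercases everything after the first character.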

Cleaning up whitespace:

df["text"].str.strip()   # both ends
df["text"].str.lstrip()  # left side only
df["text"].str.rstrip()  # right side only
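Whitespace is invisible when printed, so one way to check what each method removed is to compare string lengths. A sketch on made-up values (tabs and newlines count as whitespace too):

```python
import pandas as pd

# Made-up strings with spaces, a tab and a newline around the text.
text = pd.Series(["  padded  ", "\ttabbed\n", "clean"])

# Length of each string before and after each kind of strip.
lengths = pd.DataFrame({
    "original": text.str.len(),
    "strip": text.str.strip().str.len(),
    "lstrip": text.str.lstrip().str.len(),
    "rstrip": text.str.rstrip().str.len(),
})
print(lengths)
```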

Replacing text:

df["text"].str.replace("old", "new")            # literal match by default in pandas 2.0+
df["text"].str.replace(r"\d+", "", regex=True)  # remove all digits
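Here is a short sketch of literal versus regex replacement on made-up strings. It passes regex explicitly in both calls, since the default for str.replace changed from True to False in pandas 2.0 and being explicit avoids depending on the installed version.

```python
import pandas as pd

# Made-up strings containing regex-special characters and digits.
text = pd.Series(["order 66 (old)", "v1.0 old", "no digits"])

# regex=False: the parentheses are matched literally.
literal = text.str.replace("(old)", "[new]", regex=False)

# regex=True: \d+ matches each run of digits.
no_digits = text.str.replace(r"\d+", "", regex=True)

print(literal.tolist())    # ['order 66 [new]', 'v1.0 old', 'no digits']
print(no_digits.tolist())  # ['order  (old)', 'v. old', 'no digits']
```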

searching with contains

This one's super handy for filtering. It returns True/False for each row, which you can use as a boolean mask to select rows.

# Find rows where name contains "son"
df[df["name"].str.contains("son")]

# Case-insensitive
df[df["name"].str.contains("bob", case=False)]

# Regex (contains treats the pattern as a regex by default, so regex=True is optional)
df[df["email"].str.contains(r"@gmail\.com$", regex=True)]

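Watch out though: on an ordinary object-dtype column, contains returns a missing value (not False) for rows that are NaN, and filtering with that result raises a ValueError. Use na=False to count missing values as no match: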

df[df["name"].str.contains("son", na=False)]
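A minimal sketch of the missing-value behavior, assuming an object-dtype column (newer pandas string dtypes may behave differently). The names are made up.

```python
import pandas as pd

# Made-up names with one missing value, stored as plain Python objects.
names = pd.Series(["Johnson", "Bob", None, "Anderson"], dtype=object)
df = pd.DataFrame({"name": names})

# Without na=..., the missing row stays missing in the result.
raw_mask = df["name"].str.contains("son")

# With na=False, the missing row counts as "no match".
safe_mask = df["name"].str.contains("son", na=False)

print(raw_mask.tolist())
print(safe_mask.tolist())              # [True, False, False, True]
print(df[safe_mask]["name"].tolist())  # ['Johnson', 'Anderson']
```

Only safe_mask is purely True/False, which is what row filtering needs.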

extract

If you need to pull out specific parts of strings, extract uses regex capture groups. It returns the first match in each row, and NaN where the pattern doesn't match.

# Extract the domain from emails
df["email"].str.extract(r"@(.+)")

# Extract area code from phone numbers
df["phone"].str.extract(r"\((\d{3})\)")

The unescaped parentheses in the regex define what gets captured; the escaped \( and \) in the phone pattern match literal parentheses in the text. This part tripped me up at first but it makes sense once you see it.
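The two patterns above can be tried on a few made-up rows, including ones that don't match. This sketch also shows expand=False, which returns a Series instead of a one-column DataFrame when there is a single group.

```python
import pandas as pd

# Made-up contact data; the last row matches neither pattern.
df = pd.DataFrame({
    "email": ["ann@example.com", "bob@gmail.com", "not-an-email"],
    "phone": ["(555) 123-4567", "(212) 555-0100", "555-0199"],
})

# Default (expand=True): a DataFrame with one column per group, here column 0.
domain = df["email"].str.extract(r"@(.+)")

# expand=False with a single group: a Series.
area = df["phone"].str.extract(r"\((\d{3})\)", expand=False)

print(domain)
print(area)
```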

splitting

# Split into a list
df["name"].str.split(" ")

# Split into separate columns
df["name"].str.split(" ", expand=True)

# Get just the first part
df["name"].str.split(" ").str[0]

That last one chains the str accessor twice: the first call splits each string into a list, and the second indexes into each list. A bit weird looking but it works.
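To see what each form of split returns, here is a sketch on made-up names with different numbers of parts:

```python
import pandas as pd

# Made-up names with two, three and one part.
names = pd.Series(["Ada Lovelace", "Grace Brewster Hopper", "Plato"])

parts = names.str.split(" ")                 # Series of Python lists
columns = names.str.split(" ", expand=True)  # DataFrame; short rows are padded with missing values
first = names.str.split(" ").str[0]          # first element of each list
last = names.str.split(" ").str[-1]          # last element of each list

print(parts.tolist())
print(columns)
print(first.tolist())  # ['Ada', 'Grace', 'Plato']
print(last.tolist())   # ['Lovelace', 'Hopper', 'Plato']
```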

# examples [3]

# example 01 · cleaning messy text

Strip whitespace and standardize case. This comes up all the time with real data.
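A minimal sketch of this kind of cleanup, not necessarily the lesson's own code. The "city" column and its values are assumptions made up for illustration.

```python
import pandas as pd

# Made-up data: two cities typed five different ways.
df = pd.DataFrame({"city": ["  Madison", "madison ", "MADISON", " Waukesha ", "waukesha"]})

# Before cleaning, every spelling counts as its own value.
raw_unique = df["city"].nunique()

# Strip whitespace first, then standardize the case.
df["city_clean"] = df["city"].str.strip().str.title()
counts = df["city_clean"].value_counts()

print(raw_unique)  # 5
print(counts)
```

After cleaning, the five spellings collapse into two cities, which is what makes grouping and counting reliable.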

# example 02 · using contains to filter

Find all rows where a column contains certain text
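A sketch of filtering with contains, not necessarily the lesson's own code. The DataFrame is made up, and the pattern "son|sen" relies on contains treating the pattern as a regex, where | means "either".

```python
import pandas as pd

# Made-up staff table with one missing name.
df = pd.DataFrame({
    "name": ["Ann Johnson", "Bob Larsen", "Cy Olsen", None, "Dee JOHNSON"],
    "dept": ["Sales", "IT", "Sales", "IT", "HR"],
})

# Match either ending, ignore case, and treat the missing name as no match.
mask = df["name"].str.contains("son|sen", case=False, na=False)

matches = df[mask]   # rows that match
others = df[~mask]   # ~ inverts the mask: rows that don't match

print(matches["name"].tolist())
print(len(others))  # 1
```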

# example 03 · splitting and extracting

Pull apart strings when you need specific pieces
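A sketch that combines both techniques, not necessarily the lesson's own code. The column names and values are made up. It uses n=1 to split on the first space only, and named groups so the extracted columns get readable names.

```python
import pandas as pd

# Made-up data with full names and email addresses.
df = pd.DataFrame({
    "full_name": ["Ada King Lovelace", "Alan Turing"],
    "email": ["ada@maths.example.org", "alan@bletchley.example.org"],
})

# n=1 limits the split to the first space, so there are always two pieces.
df[["first", "rest"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Named groups (?P<name>...) become the column names of the result.
parts = df["email"].str.extract(r"(?P<user>[^@]+)@(?P<domain>.+)")
df = df.join(parts)

print(df[["first", "rest", "user", "domain"]])
```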


# challenges [2]

# challenge 01/02
Convert all student names to lowercase and print the result.
# challenge 02/02
Filter the students DataFrame to show only students whose subject contains 'Sci' (use a case-insensitive search) and print their names.