pyodide: loading…

[concept]Data Manipulation (WCTC)

Combining DataFrames

# theory

combining DataFrames

Real data is almost never in one table. You've got customers in one place, orders in another, products somewhere else. Combining them is like half the job.

merge

merge is the main one. It's basically SQL joins.

# Join on a common column
result = pd.merge(orders, customers, on="customer_id")

# Different column names in each table
result = pd.merge(orders, customers, left_on="cust_id", right_on="customer_id")

join types

# Inner join (default); only matching rows
pd.merge(df1, df2, on="id", how="inner")

# Left join; all rows from left, matching from right
pd.merge(df1, df2, on="id", how="left")

# Right join; all rows from right, matching from left
pd.merge(df1, df2, on="id", how="right")

# Outer join; all rows from both
pd.merge(df1, df2, on="id", how="outer")

Inner is the default and what you want most of the time. Left join is second most common, the "give me all my orders, and whatever customer info exists" pattern.

duplicate column names

If both DataFrames have a column with the same name (besides the join key), you get suffixes:

# By default you get _x and _y
pd.merge(df1, df2, on="id")
# Creates name_x and name_y

# Custom suffixes
pd.merge(df1, df2, on="id", suffixes=("_left", "_right"))

concat

Use concat when you want to stack DataFrames on top of each other (or side by side).

# Stack vertically (add more rows)
combined = pd.concat([df1, df2])

# Stack horizontally (add more columns)
combined = pd.concat([df1, df2], axis=1)

Watch the index though; by default it keeps the original indexes. Use ignore_index=True to reset:

combined = pd.concat([df1, df2], ignore_index=True)

join (index-based)

join is basically merge but uses the index by default.

# Join on index
df1.join(df2)

# Join df1's index to df2's column
df1.join(df2.set_index("key"), on="key")

merge gets used more often than join because it's more explicit about what's happening.

# examples [3]

# example 01 · inner vs left join

See the difference; inner drops unmatched, left keeps all from left side

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
🐍
Loading PythonSetting up pandas & numpy...
# example 02 · concatenating dataframes

Stack data from multiple sources together

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🐍
Loading PythonSetting up pandas & numpy...
# example 03 · multiple key merge

Sometimes you need to match on more than one column

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
🐍
Loading PythonSetting up pandas & numpy...

# challenges [2]

# challenge 01/02todo
Merge the orders and customers DataFrames using an inner join on customer_id and print the result.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
🐍
Loading PythonSetting up pandas & numpy...
# challenge 02/02todo
Use a left join to keep all orders and add customer names. Print the result to see which orders have missing customer info (NaN).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
🐍
Loading PythonSetting up pandas & numpy...