[concept] NumPy Foundations
Arrays and dtypes
# theory
why NumPy when pandas exists
By default, a pandas DataFrame or Series stores its data in NumPy arrays, and most of the numeric work happens in NumPy. So when pandas gets slow in a hot loop, the fix is often to drop down to NumPy.
NumPy arrays have:
- One dtype per array. A whole array is int64 or float64 or bool. No mixed columns. That's what makes them fast.
- A fixed shape. You declare a shape; NumPy lays the bytes out contiguously. Reshape later if needed.
- No labels. Just positions. Labels are pandas's job.
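A small sketch of these properties, using only attributes and functions NumPy exposes on every array. The variable names `x` and `grid` are illustrative, and the dtype is set explicitly so the result does not depend on the platform.

```python
import numpy as np

x = np.arange(6, dtype=np.int64)   # one dtype for the whole array
print(x.dtype)                     # int64
print(x.shape)                     # (6,)
print(x.flags["C_CONTIGUOUS"])     # True: bytes laid out contiguously

grid = x.reshape(2, 3)             # same bytes, viewed with a new shape
print(grid.shape)                  # (2, 3)
print(np.shares_memory(x, grid))   # True

x[0] = 2.9                         # the array stays int64, so 2.9 is truncated to 2
print(x[0])                        # 2
```

The last two lines show the cost of the single-dtype rule: a value that does not fit the dtype is cast on the way in rather than changing the array's type.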
creating arrays
import numpy as np
a = np.array([1, 2, 3, 4]) # 1D, dtype inferred (int64 on most platforms)
b = np.array([[1, 2], [3, 4]]) # 2D, shape (2, 2)
c = np.zeros((3, 4)) # 3x4 of float zeros
d = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
e = np.linspace(0, 1, 5) # [0., 0.25, 0.5, 0.75, 1.]
dtype matters
Mixing types triggers automatic promotion:
np.array([1, 2, 3]).dtype # int64 (the platform's default integer)
np.array([1, 2, 3.0]).dtype # float64 (one float promoted everything)
np.array([1, "a"]).dtype # <U21 (string), all numbers got stringified
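This is the failure mode you'll hit again and again. A single bad value in a CSV turns the whole column into strings, and numeric operations then either raise errors or silently do the wrong thing (for example, sorting as text instead of as numbers).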
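One way to catch this early is to check the dtype right after loading and convert explicitly. The sketch below assumes the column arrived as an array of strings. Both `raw` and the helper `to_float_or_nan` are hypothetical names for illustration, not part of NumPy.

```python
import numpy as np

# Hypothetical CSV column where one value is not a number.
raw = np.array(["1.5", "2.0", "oops", "4.0"])
print(raw.dtype.kind)  # 'U' means unicode strings, not numbers


def to_float_or_nan(values):
    """Convert each value to float; values that cannot be parsed become NaN."""
    out = np.empty(len(values), dtype=np.float64)
    for i, v in enumerate(values):
        try:
            out[i] = float(v)
        except ValueError:
            out[i] = np.nan
    return out


clean = to_float_or_nan(raw)
print(clean.dtype)                  # float64
print(int(np.isnan(clean).sum()))   # 1 bad value found
print(np.nanmean(clean))            # 2.5, the mean of the valid values
```

Marking bad values as NaN keeps the array numeric, so the problem shows up as a count you can inspect rather than as a string column.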
shape & indexing
m = np.array([[1, 2, 3], [4, 5, 6]])
m.shape # (2, 3)
m.size # 6 total elements
m[0] # first row: [1, 2, 3]
m[0, 1] # element at row 0, col 1: 2
m[:, 1] # all rows, second column: [2, 5]
m[1, :] # second row, all columns: [4, 5, 6]
Basic slicing returns a view into the same memory, not a copy. Modifying the slice modifies the original. Useful, easy to forget. (Indexing with a list or a boolean mask returns a copy instead.)
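A minimal sketch of the view behavior, reusing the same `m` as above. The names `col` and `safe` are illustrative, and `np.shares_memory` is used only to confirm which arrays share memory.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

col = m[:, 1]          # basic slice: a view into m
col[0] = 99            # writes through to the original
print(m[0, 1])         # 99

safe = m[:, 1].copy()  # independent copy of the same column
safe[1] = -1           # does not touch m
print(m[1, 1])         # 5

print(np.shares_memory(m, col))   # True
print(np.shares_memory(m, safe))  # False
```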
# examples [3]
zeros, ones, arange, linspace cover most starter shapes.
One stray string promotes the whole array. This is the source of most 'why is my math returning weird stuff' bugs.
Editing a slice mutates the original array. Use .copy() when you want a real new array.
# challenges [2]