[challenge]Web & APIs
Building a Data Pipeline
# theory
the pipeline
Three steps, three functions, one source of truth: a real JSON API. We'll use the /posts and /users endpoints from jsonplaceholder.
- Extract with pyfetch. Hit each endpoint, check status, get JSON.
- Transform with pandas. Convert lists of dicts to DataFrames, merge on userId, classify.
- Load with print/return. Real pipelines write to CSV, a database, or a dashboard; in the browser we render.
Wrapping each phase in a function is what turns a pile of fetch calls into a pipeline. It also makes the pieces testable in isolation.
skeleton
import pandas as pd
from pyodide.http import pyfetch

async def extract():
    # Fetch both endpoints and fail fast if either is unavailable.
    posts_resp = await pyfetch("https://jsonplaceholder.typicode.com/posts")
    users_resp = await pyfetch("https://jsonplaceholder.typicode.com/users")
    if posts_resp.status != 200 or users_resp.status != 200:
        raise RuntimeError("upstream API not available")
    return await posts_resp.json(), await users_resp.json()

def transform(posts_raw, users_raw):
    # Lists of dicts become DataFrames; keep only the user columns we need.
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    # Both frames have an "id" column, so the suffixes keep them apart.
    return posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))

def load(df):
    # Posts per author, most prolific first.
    counts = df.groupby("name").size().sort_values(ascending=False)
    print(counts)
    return counts
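Because each phase is its own function, transform can be exercised without touching the network. The sketch below repeats transform from the skeleton and feeds it hand-written fixture rows; the fixture values are made up for illustration and only mimic the shape of the /posts and /users payloads.

```python
import pandas as pd

def transform(posts_raw, users_raw):
    # Same logic as the skeleton: merge posts with their authors.
    posts = pd.DataFrame(posts_raw)
    users = pd.DataFrame(users_raw)[["id", "name"]]
    return posts.merge(users, left_on="userId", right_on="id", suffixes=("_post", "_user"))

# Hypothetical fixture data, shaped like the real payloads.
posts_fixture = [
    {"userId": 1, "id": 1, "title": "a", "body": "x"},
    {"userId": 1, "id": 2, "title": "b", "body": "y"},
    {"userId": 2, "id": 3, "title": "c", "body": "z"},
]
users_fixture = [
    {"id": 1, "name": "Ann", "email": "ann@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

merged = transform(posts_fixture, users_fixture)
print(merged[["id_post", "title", "name"]])
```

Every post row should come back with its author's name attached, and the two id columns should appear as id_post and id_user.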
logging
The browser console can swallow errors mid-pipeline. A tiny log helper makes it obvious where a run stopped.
from datetime import datetime

def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")
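One way to put the helper to work is a small wrapper that logs the start, success, or failure of a phase. This is a sketch, not part of the lesson's skeleton: run_step is a hypothetical name, and it only covers synchronous phases such as transform and load (an async extract would need an awaited variant).

```python
from datetime import datetime

def log(step, msg):
    print(f"[{datetime.now():%H:%M:%S}] [{step}] {msg}")

def run_step(step, func, *args):
    """Hypothetical wrapper: run one phase and log where it started or stopped."""
    log(step, "start")
    try:
        result = func(*args)
    except Exception as exc:
        # Log the failure, then re-raise so the run still stops visibly.
        log(step, f"failed: {exc!r}")
        raise
    log(step, "done")
    return result

doubled = run_step("transform", lambda rows: [r * 2 for r in rows], [1, 2, 3])
```

If a phase raises, the last line printed names the step that failed, which is the information the console tends to hide.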
idempotence
A pipeline you can re-run safely is worth ten times one you can't. That usually means:
- Extract is side-effect-free (just fetching, not mutating server state)
- Transform takes raw data and returns new data, never mutates inputs
- Load either overwrites the destination or uses an upsert key
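The upsert idea in the last bullet can be sketched with a plain dict standing in for the destination. The function name upsert, the key name "id", and the sample rows are assumptions for illustration; a real destination would be a file or a database table.

```python
def upsert(destination, rows, key="id"):
    """Insert or replace each row by its key, so re-running changes nothing."""
    for row in rows:
        # Copy the row so the destination never shares state with the input.
        destination[row[key]] = dict(row)
    return destination

store = {}
rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]

upsert(store, rows)
after_first_run = {k: dict(v) for k, v in store.items()}
upsert(store, rows)  # second run with the same input
print(store == after_first_run)
```

An append-only load would have produced four rows after the second run; keying on id keeps it at two.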
# examples [3]
Real pipelines often separate extract from transform so the network step can be re-run independently. Cache the raw payload.
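A minimal sketch of that caching idea follows. The names extract_cached, _raw_cache, and fake_fetch_json are hypothetical; the fake fetcher stands in for a pyfetch-based one (check the status, then return the parsed JSON) so the example runs without a network.

```python
import asyncio

_raw_cache = {}  # hypothetical in-memory cache: url -> raw parsed payload

async def extract_cached(url, fetch_json):
    """Fetch url once; later calls reuse the cached raw payload."""
    if url not in _raw_cache:
        _raw_cache[url] = await fetch_json(url)
    return _raw_cache[url]

calls = []  # records every real fetch, to show the cache working

async def fake_fetch_json(url):
    # Stand-in for a pyfetch wrapper; returns made-up data.
    calls.append(url)
    return [{"id": 1, "title": "hello"}]

async def demo():
    first = await extract_cached("/posts", fake_fetch_json)
    second = await extract_cached("/posts", fake_fetch_json)
    return first, second
```

In Pyodide, `await demo()` works at the top level; in ordinary Python, use `asyncio.run(demo())`. Either way the fetcher should be called once, and transform can then be re-run against the cached payload as often as needed.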
Pull comments, group by the post they belong to, flag posts that attract long comments.
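A possible transform for this example is sketched below. It assumes the raw rows carry postId and body fields, as the jsonplaceholder /comments payload does; the function name, the 150-character threshold, and the sample rows are illustrative choices, not values from the lesson.

```python
import pandas as pd

def summarize_comments(comments_raw, long_threshold=150):
    """Per-post comment count and average length, plus a long-comment flag."""
    comments = pd.DataFrame(comments_raw)
    comments["body_len"] = comments["body"].str.len()
    summary = (
        comments.groupby("postId")["body_len"]
        .agg(n_comments="count", avg_len="mean")
        .reset_index()
    )
    # Flag posts whose average comment length exceeds the assumed threshold.
    summary["long_comments"] = summary["avg_len"] > long_threshold
    return summary

# Made-up sample rows shaped like /comments entries.
sample = [
    {"postId": 1, "id": 1, "body": "x" * 10},
    {"postId": 1, "id": 2, "body": "x" * 20},
    {"postId": 2, "id": 3, "body": "x" * 200},
]
summary = summarize_comments(sample)
print(summary)
```

Only post 2 should be flagged here, since its single comment is 200 characters long while post 1 averages 15.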
In a browser we can't write a real file, but we can build the same CSV that load() would write to disk.
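One way to do that, sketched under the assumption that the merged frame has a name column as in the skeleton: calling to_csv without a path returns the CSV text instead of writing a file. The function name load_csv_text and the post_count column name are hypothetical.

```python
import pandas as pd

def load_csv_text(df):
    """Build the posts-per-author report as CSV text instead of a file."""
    counts = (
        df.groupby("name").size()
        .sort_values(ascending=False)
        .reset_index(name="post_count")
    )
    # With no path argument, to_csv returns the CSV as a string.
    return counts.to_csv(index=False)

# Made-up merged rows: two posts by Ann, one by Bob.
merged = pd.DataFrame({"name": ["Ann", "Ann", "Bob"], "title": ["a", "b", "c"]})
csv_text = load_csv_text(merged)
print(csv_text)
```

The resulting string can be rendered on the page or offered as a download; outside the browser the same call with a path would write the file.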
# challenges [2]
# project
# project-challenge
thread: Sales Performance Dashboard · reward: 50 xp
# brief
Create a complete ETL pipeline for the sales dashboard. Extract data from the source, transform it with revenue calculations and categorization, then load a summary report.
# task