15 changes: 12 additions & 3 deletions 02_activities/assignments/DC_Cohort/Assignment2.md
@@ -56,7 +56,8 @@ The store wants to keep customer addresses. Propose two architectures for the CU
**HINT:** search type 1 vs type 2 slowly changing dimensions.

```
Architecture A - SCD Type 1: keeps only the current address. When a customer updates their address, the old value is overwritten in place, so history is lost.
Architecture B - SCD Type 2: keeps every version of the customer’s address by inserting a new row each time the address changes, typically with effective-date columns or a current-row flag so the active version can be identified.
```
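The two architectures above can be sketched as table definitions. This is a minimal sketch in SQLite-style DDL; the table and column names (`customer_address_type1`, `valid_from`, `is_current`, and so on) are illustrative assumptions, not part of the assignment schema.

```sql
-- Type 1: one row per customer; an address change is an UPDATE in place.
CREATE TABLE customer_address_type1 (
    customer_id INTEGER PRIMARY KEY,
    address     TEXT,
    updated_at  TIMESTAMP          -- when the current value was last written
);

-- Type 2: one row per address version; an address change is an INSERT,
-- plus closing out the previous row's validity window.
CREATE TABLE customer_address_type2 (
    customer_id INTEGER,
    address     TEXT,
    valid_from  DATE,
    valid_to    DATE,              -- NULL while the row is current
    is_current  INTEGER DEFAULT 1, -- 1 = active version, 0 = historical
    PRIMARY KEY (customer_id, valid_from)
);
```

With Type 2, fetching the current address means filtering on `is_current = 1` (or `valid_to IS NULL`), while a point-in-time query filters on the `valid_from`/`valid_to` window.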

***
@@ -189,7 +190,15 @@ Read: Boykis, V. (2019, October 16). _Neural nets are just people all the way do

Consider, for example, concepts of labour, bias, LLM proliferation, content moderation, the intersection of technology and society, etc.


```
Vicki Boykis’s 2019 essay “Neural nets are just people all the way down” delivers a deceptively simple but profound reminder that behind every impressive neural network lies layer upon layer of invisible human labor, choices, and biases. What appears to be pure machine intelligence is actually a vast, often precarious assembly of people. The ethical stakes are high and remarkably relevant to today’s AI landscape.
The article’s central metaphor (clothes are still stitched by human hands because robots cannot handle fabric’s variability) extends directly to AI training. ImageNet, the foundational dataset that powered the deep-learning revolution in 2012, required millions of images to be manually labeled. Fei-Fei Li first paid Princeton students $10 an hour; when that proved too slow and expensive, the project turned to Amazon Mechanical Turk, where low-paid crowdworkers performed the repetitive, cognitively demanding task of classifying thousands of images daily.
This pattern has only intensified. Modern large language models (LLMs) and multimodal systems rely on even larger armies of data labelers, content moderators, and “AI trainers” who annotate toxic text, rate model outputs, or generate preference data for reinforcement learning. The ethical problem is twofold: exploitation and invisibility. Workers endure repetitive strain, psychological harm from viewing disturbing content, and precarious gig-economy contracts with no benefits or job security, yet their labour is erased in marketing narratives that celebrate “artificial” intelligence. As Boykis shows, even the field’s linguistic backbone rests on uncredited graduate-student and clerical work from decades earlier. This is not a one-off historical quirk; it is structural. AI’s economic model externalizes human costs while privatizing the gains.
Every taxonomy is political. Boykis traces how ImageNet drew on WordNet’s synsets, which are human-created groupings of concepts that inevitably reflect the cultural assumptions, blind spots, and prejudices of their creators. When those taxonomies are used to label millions of images, biases become baked in at scale. The essay highlights ImageNet Roulette, an experiment that exposed wildly offensive or absurd labels for people. The ImageNet team later acknowledged that 1,593 synsets in the “person” subtree were problematic or unsafe and began a manual cleanup process.
This is an example of representational harm. Classification systems are never neutral; they encode power. Who decides what counts as a “normal” family photo, a “professional” hairstyle, or an “angry” facial expression? When these datasets train today’s LLMs and vision models, the same biases propagate, disproportionately harming marginalized groups in hiring algorithms, facial recognition, content moderation, and medical AI.
Boykis notes that Mechanical Turk workers were monitored with control images to catch cheating. In 2025, content moderators for major platforms and AI companies still perform the same invisible work, flagging hate speech, violence, and exploitation so that recommendation engines and safety filters can function. The ethical tension is acute. Moderation is essential for safer AI, yet it is chronically underpaid, under-supported, and psychologically damaging. The very systems that claim to “learn” from human feedback depend on this hidden workforce while simultaneously automating away the jobs of the people who built them.
Boykis wrote before the ChatGPT era, but her argument is even more urgent now. The explosive proliferation of LLMs has multiplied the demand for human-labeled data by orders of magnitude. Reinforcement learning from human feedback (RLHF), synthetic data generation, and red-teaming all require massive human input. Yet the dominant narrative remains “bigger models = better intelligence.” This obscures the reality that scaling is only possible because of an ever-larger, often exploited underclass of data workers.
At its heart, the essay challenges the myth of technological neutrality. Neural nets do not float above society; they are society, compressed, encoded, and amplified. The choices made in 1964 (Brown Corpus), 2007 (ImageNet labeling), and today (preference datasets for LLMs) shape what AI “knows” and how it behaves. When those choices reflect historical inequalities, the resulting systems entrench them. This is not an engineering problem that can be solved with better algorithms alone; it is a sociotechnical one requiring transparency, accountability, fair compensation for data workers, and democratic oversight of foundational datasets.
Boykis leaves us with a powerful image: every neural net is “just people all the way down.” Recognizing this does not diminish the technical achievements of AI. It demands that we treat the humans behind it with dignity, confront the politics embedded in our data infrastructures, and design systems that are honest about their origins rather than pretending to be magically autonomous.
In an era of trillion-parameter models and breathless claims of artificial general intelligence, this reminder is essential. The ethical path forward requires more than technical fixes. It calls for labour rights for data workers, rigorous bias audits, public accountability for training data, and a cultural shift that values the human infrastructure of AI as much as the silicon. Until we acknowledge that neural nets are people all the way down, we risk building ever more powerful systems on foundations that are ethically precarious.
```
186 changes: 139 additions & 47 deletions 02_activities/assignments/DC_Cohort/assignment2.sql
@@ -22,10 +22,11 @@ The `||` values concatenate the columns into strings.
Edit the appropriate columns -- you're making two edits -- and the NULL rows will be fixed.
All the other rows will remain the same. */
--QUERY 1




SELECT
COALESCE(product_name, '') || ', ' ||
COALESCE(product_size, '') || ' (' ||
COALESCE(product_qty_type, 'unit') || ')' AS product_summary
FROM product;
--END QUERY


@@ -40,10 +41,15 @@ each new market date for each customer, or select only the unique market dates p
HINT: One of these approaches uses ROW_NUMBER() and one uses DENSE_RANK().
Filter the visits to dates before April 29, 2022. */
--QUERY 2




SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY market_date
) AS visit_number
FROM customer_purchases
WHERE market_date < '2022-04-29'
ORDER BY customer_id, visit_number;
--END QUERY


@@ -52,10 +58,18 @@ then write another query that uses this one as a subquery (or temp table) and fi
only the customer’s most recent visit.
HINT: Do not use the previous visit dates filter. */
--QUERY 3




SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY market_date DESC
) AS visit_number
FROM customer_purchases
) AS numbered_visits
WHERE visit_number = 1
ORDER BY customer_id;
--END QUERY


@@ -65,10 +79,15 @@ customer_purchases table that indicates how many different times that customer h
You can make this a running count by including an ORDER BY within the PARTITION BY if desired.
Filter the visits to dates before April 29, 2022. */
--QUERY 4




SELECT
*,
COUNT(*) OVER (
PARTITION BY customer_id, product_id
ORDER BY market_date
) AS purchase_count
FROM customer_purchases
WHERE market_date < '2022-04-29'
ORDER BY customer_id, product_id, market_date;
--END QUERY


@@ -84,19 +103,23 @@ Remove any trailing or leading whitespaces. Don't just use a case statement for

Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR will help split the column. */
--QUERY 5




SELECT
product_name,
CASE
WHEN INSTR(product_name, '-') > 0 THEN
TRIM(SUBSTR(product_name, INSTR(product_name, '-') + 1))
ELSE NULL
END AS product_description
FROM product;
--END QUERY


/* 2. Filter the query to show any product_size value that contain a number with REGEXP. */
--QUERY 6




SELECT *
FROM product
WHERE product_size REGEXP '[0-9]'
ORDER BY product_name;
--END QUERY


@@ -110,10 +133,38 @@ HINT: There are possibly a few ways to do this query, but if you're struggling
3) Query the second temp table twice, once for the best day, once for the worst day,
with a UNION binding them. */
--QUERY 7




-- Calculate total sales per market_date
WITH daily_sales AS (
SELECT
market_date,
SUM(quantity * cost_to_customer_per_qty) AS total_sales
FROM customer_purchases
GROUP BY market_date
),
-- Rank the days in both directions (sales_rank_high = 1 is the best day, sales_rank_low = 1 is the worst day)
ranked_sales AS (
SELECT
market_date,
total_sales,
RANK() OVER (ORDER BY total_sales DESC) AS sales_rank_high, -- 1 = best day
RANK() OVER (ORDER BY total_sales ASC) AS sales_rank_low -- 1 = worst day
FROM daily_sales
)
-- Get the best day and worst day using UNION
SELECT
market_date,
total_sales,
'best day' AS day_type
FROM ranked_sales
WHERE sales_rank_high = 1
UNION
SELECT
market_date,
total_sales,
'worst day' AS day_type
FROM ranked_sales
WHERE sales_rank_low = 1
ORDER BY total_sales DESC;
--END QUERY


@@ -131,10 +182,32 @@ Think a bit about the row counts: how many distinct vendors, product names are t
How many customers are there (y).
Before your final group by you should have the product of those two queries (x*y). */
--QUERY 8




-- Get unique vendor + product combinations with their price
WITH vendor_products AS (
SELECT
v.vendor_name,
p.product_name,
vi.original_price
FROM vendor_inventory vi
JOIN vendor v ON vi.vendor_id = v.vendor_id
JOIN product p ON vi.product_id = p.product_id
GROUP BY v.vendor_name, p.product_name, vi.original_price
),
-- Count total customers
customer_count AS (
SELECT COUNT(*) AS num_customers
FROM customer
)
SELECT
vp.vendor_name,
vp.product_name,
vp.original_price,
cc.num_customers,
5 * cc.num_customers AS quantity_sold,
ROUND(5 * cc.num_customers * vp.original_price, 2) AS total_revenue_per_product
FROM vendor_products vp
CROSS JOIN customer_count cc
ORDER BY total_revenue_per_product DESC, vp.vendor_name, vp.product_name;
--END QUERY


@@ -144,20 +217,34 @@ This table will contain only products where the `product_qty_type = 'unit'`.
It should use all of the columns from the product table, as well as a new column for the `CURRENT_TIMESTAMP`.
Name the timestamp column `snapshot_timestamp`. */
--QUERY 9




CREATE TABLE product_units AS
SELECT
*,
CURRENT_TIMESTAMP AS snapshot_timestamp
FROM product
WHERE product_qty_type = 'unit';
--END QUERY


/*2. Using `INSERT`, add a new row to the product_units table (with an updated timestamp).
This can be any product you desire (e.g. add another record for Apple Pie). */
--QUERY 10




INSERT INTO product_units
(product_id,
product_name,
product_size,
product_category_id,
product_qty_type,
pepper_flag,
snapshot_timestamp)
VALUES
(999,
'Matcha Cake',
'3 lbs',
999,
'unit',
0,
CURRENT_TIMESTAMP);
--END QUERY


@@ -166,10 +253,8 @@

HINT: If you don't specify a WHERE clause, you are going to have a bad time.*/
--QUERY 11




DELETE FROM product_units
WHERE product_name = 'Matcha Cake';
--END QUERY


@@ -190,10 +275,17 @@ Finally, make sure you have a WHERE statement to update the right row,
you'll need to use product_units.product_id to refer to the correct row within the product_units table.
When you have all of these components, you can run the update statement. */
--QUERY 12
ALTER TABLE product_units
ADD current_quantity INT;




UPDATE product_units
SET current_quantity = COALESCE((
SELECT vi.quantity
FROM vendor_inventory vi
WHERE vi.product_id = product_units.product_id
ORDER BY vi.market_date DESC
LIMIT 1
), 0);
--END QUERY

