diff --git a/02_activities/assignments/DC_Cohort/Assignment2.md b/02_activities/assignments/DC_Cohort/Assignment2.md index 01f991d02..dfe2fe520 100644 --- a/02_activities/assignments/DC_Cohort/Assignment2.md +++ b/02_activities/assignments/DC_Cohort/Assignment2.md @@ -56,7 +56,8 @@ The store wants to keep customer addresses. Propose two architectures for the CU **HINT:** search type 1 vs type 2 slowly changing dimensions. ``` -Your answer... +Architecture A - SCD Type 1: Keeps only the current address. When a customer updates their address, the old one is overwritten and lost forever. +Architecture B - SCD Type 2: Keeps every version of the customer’s address by creating a new row each time the address changes. ``` *** @@ -189,7 +190,15 @@ Read: Boykis, V. (2019, October 16). _Neural nets are just people all the way do Consider, for example, concepts of labour, bias, LLM proliferation, moderating content, intersection of technology and society, ect. - ``` -Your thoughts... +Vicki Boykis’s 2019 essay “Neural nets are just people all the way down” delivers a deceptively simple but profound reminder that behind every impressive neural network lies layer upon layer of invisible human labour, choices, and biases. What appears to be pure machine intelligence is actually a vast, often precarious assembly of people. The ethical stakes are high and remarkably relevant to today’s AI landscape. +The article’s central metaphor, that clothes are still stitched by human hands because robots can’t handle fabric’s variability, extends directly to AI training. ImageNet, the foundational dataset that powered the deep-learning revolution in 2012, required millions of images to be manually labeled. Stanford’s Fei-Fei Li first paid Princeton students $10 an hour; when that proved too slow and expensive, the project turned to Amazon Mechanical Turk. Low-paid crowdworkers performed the repetitive, cognitively demanding task of classifying thousands of images daily. 
+This pattern has only intensified. Modern large language models (LLMs) and multimodal systems rely on even larger armies of data labelers, content moderators, and “AI trainers” who annotate toxic text, rate model outputs, or generate preference data for reinforcement learning. The ethical problem is twofold: exploitation and invisibility. Workers endure repetitive strain, psychological harm from viewing disturbing content, and precarious gig-economy contracts with no benefits or job security. Yet their labour is erased in marketing narratives that celebrate “artificial” intelligence. As Boykis shows, even the field’s linguistic backbone rests on uncredited graduate student and clerical work from decades earlier. This is not a one-off historical quirk. It is structural. AI’s economic model externalizes human costs while privatizing the gains. +Every taxonomy is political. Boykis traces how ImageNet drew on WordNet’s synsets, which are human-created groupings of concepts that inevitably reflect the cultural assumptions, blind spots, and prejudices of their creators. When those taxonomies are used to label millions of images, biases become baked in at scale. The essay highlights ImageNet Roulette, an experiment that exposed wildly offensive or absurd labels for people. The ImageNet team later acknowledged that 1,593 synsets in the “person” subtree were problematic or unsafe and began a manual cleanup process. +This is an example of representational harm. Classification systems are never neutral; they encode power. Who decides what counts as a “normal” family photo, a “professional” hairstyle, or an “angry” facial expression? When these datasets train today’s LLMs and vision models, the same biases propagate, disproportionately harming marginalized groups in hiring algorithms, facial recognition, content moderation, and medical AI. +Boykis notes that Mechanical Turk workers were monitored with control images to catch cheating. 
In 2025, content moderators for major platforms and AI companies still perform the same invisible work, flagging hate speech, violence, and exploitation so that recommendation engines and safety filters can function. The ethical tension is acute. Moderation is essential for safer AI, yet it is chronically underpaid, under-supported, and psychologically damaging. The very systems that claim to “learn” from human feedback depend on this hidden workforce while simultaneously automating away the jobs of the people who built them. +Boykis wrote before the ChatGPT era, but her argument is even more urgent now. The explosive proliferation of LLMs has multiplied the demand for human-labeled data by orders of magnitude. Reinforcement learning from human feedback (RLHF), synthetic data generation, and red-teaming all require massive human input. Yet the dominant narrative remains “bigger models = better intelligence.” This obscures the reality that scaling is only possible because of an ever-larger, often exploited underclass of data workers. +At its heart, the essay challenges the myth of technological neutrality. Neural nets do not float above society. They are society, compressed, encoded, and amplified. The choices made in 1964 (Brown Corpus), 2007 (ImageNet labeling), and today (preference datasets for LLMs) shape what AI “knows” and how it behaves. When those choices reflect historical inequalities, the resulting systems entrench them. This is not an engineering problem that can be solved with better algorithms alone. It is a sociotechnical one requiring transparency, accountability, fair compensation for data workers, and democratic oversight of foundational datasets. +Boykis leaves us with a powerful image: every neural net is “just people all the way down.” Recognizing this does not diminish the technical achievements of AI. 
It demands we treat the humans behind it with dignity, confront the politics embedded in our data infrastructures, and design systems that are honest about their origins rather than pretending to be magically autonomous. +In an era of trillion-parameter models and breathless claims of artificial general intelligence, this reminder is essential. The ethical path forward requires more than technical fixes. It calls for labour rights for data workers, rigorous bias audits, public accountability for training data, and a cultural shift that values the human infrastructure of AI as much as the silicon. Until we acknowledge that neural nets are people all the way down, we risk building ever more powerful systems on foundations that are ethically precarious. ``` diff --git a/02_activities/assignments/DC_Cohort/assignment2.sql b/02_activities/assignments/DC_Cohort/assignment2.sql index f7515f625..4ded1adc5 100644 --- a/02_activities/assignments/DC_Cohort/assignment2.sql +++ b/02_activities/assignments/DC_Cohort/assignment2.sql @@ -22,10 +22,11 @@ The `||` values concatenate the columns into strings. Edit the appropriate columns -- you're making two edits -- and the NULL rows will be fixed. All the other rows will remain the same. */ --QUERY 1 - - - - +SELECT + COALESCE(product_name, '') || ', ' || + COALESCE(product_size, '') || ' (' || + COALESCE(product_qty_type, 'unit') || ')' +FROM product; --END QUERY @@ -40,10 +41,15 @@ each new market date for each customer, or select only the unique market dates p HINT: One of these approaches uses ROW_NUMBER() and one uses DENSE_RANK(). Filter the visits to dates before April 29, 2022. 
*/ --QUERY 2 - - - - +SELECT + *, + ROW_NUMBER() OVER ( + PARTITION BY customer_id + ORDER BY market_date + ) AS visit_number +FROM customer_purchases +WHERE market_date < '2022-04-29' +ORDER BY customer_id, visit_number; --END QUERY @@ -52,10 +58,18 @@ then write another query that uses this one as a subquery (or temp table) and fi only the customer’s most recent visit. HINT: Do not use the previous visit dates filter. */ --QUERY 3 - - - - +SELECT * +FROM ( + SELECT + *, + ROW_NUMBER() OVER ( + PARTITION BY customer_id + ORDER BY market_date DESC + ) AS visit_number + FROM customer_purchases +) AS numbered_visits +WHERE visit_number = 1 +ORDER BY customer_id; --END QUERY @@ -65,10 +79,15 @@ customer_purchases table that indicates how many different times that customer h You can make this a running count by including an ORDER BY within the PARTITION BY if desired. Filter the visits to dates before April 29, 2022. */ --QUERY 4 - - - - +SELECT + *, + COUNT(*) OVER ( + PARTITION BY customer_id, product_id + ORDER BY market_date + ) AS purchase_count +FROM customer_purchases +WHERE market_date < '2022-04-29' +ORDER BY customer_id, product_id, market_date; --END QUERY @@ -84,19 +103,23 @@ Remove any trailing or leading whitespaces. Don't just use a case statement for Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR will help split the column. */ --QUERY 5 - - - - +SELECT + product_name, + CASE + WHEN INSTR(product_name, '-') > 0 THEN + TRIM(SUBSTR(product_name, INSTR(product_name, '-') + 1)) + ELSE NULL + END AS product_description +FROM product; --END QUERY /* 2. Filter the query to show any product_size value that contain a number with REGEXP. 
*/ --QUERY 6 - - - - +SELECT * +FROM product +WHERE product_size REGEXP '[0-9]' +ORDER BY product_name; --END QUERY @@ -110,10 +133,38 @@ HINT: There are a possibly a few ways to do this query, but if you're struggling 3) Query the second temp table twice, once for the best day, once for the worst day, with a UNION binding them. */ --QUERY 7 - - - - +-- Calculate total sales per market_date +WITH daily_sales AS ( + SELECT + market_date, + SUM(quantity * cost_to_customer_per_qty) AS total_sales + FROM customer_purchases + GROUP BY market_date +), +-- Rank the days (1 = highest sales, 1 = lowest sales) +ranked_sales AS ( + SELECT + market_date, + total_sales, + RANK() OVER (ORDER BY total_sales DESC) AS sales_rank_high, -- 1 = best day + RANK() OVER (ORDER BY total_sales ASC) AS sales_rank_low -- 1 = worst day + FROM daily_sales +) +-- Get the best day and worst day using UNION +SELECT + market_date, + total_sales, + 'best day' AS day_type +FROM ranked_sales +WHERE sales_rank_high = 1 +UNION +SELECT + market_date, + total_sales, + 'worst day' AS day_type +FROM ranked_sales +WHERE sales_rank_low = 1 +ORDER BY total_sales DESC; --END QUERY @@ -131,10 +182,32 @@ Think a bit about the row counts: how many distinct vendors, product names are t How many customers are there (y). Before your final group by you should have the product of those two queries (x*y). 
*/ --QUERY 8 - - - - +-- Get unique vendor + product combinations with their price +WITH vendor_products AS ( + SELECT + v.vendor_name, + p.product_name, + vi.original_price + FROM vendor_inventory vi + JOIN vendor v ON vi.vendor_id = v.vendor_id + JOIN product p ON vi.product_id = p.product_id + GROUP BY v.vendor_name, p.product_name, vi.original_price +), +-- Count total customers +customer_count AS ( + SELECT COUNT(*) AS num_customers + FROM customer +) +SELECT + vp.vendor_name, + vp.product_name, + vp.original_price, + cc.num_customers, + 5 * cc.num_customers AS quantity_sold, + ROUND(5 * cc.num_customers * vp.original_price, 2) AS total_revenue_per_product +FROM vendor_products vp +CROSS JOIN customer_count cc +ORDER BY total_revenue_per_product DESC, vp.vendor_name, vp.product_name; --END QUERY @@ -144,20 +217,34 @@ This table will contain only products where the `product_qty_type = 'unit'`. It should use all of the columns from the product table, as well as a new column for the `CURRENT_TIMESTAMP`. Name the timestamp column `snapshot_timestamp`. */ --QUERY 9 - - - - +CREATE TABLE product_units AS +SELECT + *, + CURRENT_TIMESTAMP AS snapshot_timestamp +FROM product +WHERE product_qty_type = 'unit'; --END QUERY /*2. Using `INSERT`, add a new row to the product_units table (with an updated timestamp). This can be any product you desire (e.g. add another record for Apple Pie). */ --QUERY 10 - - - - +INSERT INTO product_units + (product_id, + product_name, + product_size, + product_category_id, + product_qty_type, + pepper_flag, + snapshot_timestamp) +VALUES + (999, + 'Matcha Cake', + '3 lbs', + 999, + 'unit', + 0, + CURRENT_TIMESTAMP); --END QUERY @@ -166,10 +253,8 @@ This can be any product you desire (e.g. add another record for Apple Pie). 
*/ HINT: If you don't specify a WHERE clause, you are going to have a bad time.*/ --QUERY 11 - - - - +DELETE FROM product_units +WHERE product_name = 'Matcha Cake'; --END QUERY @@ -190,10 +275,17 @@ Finally, make sure you have a WHERE statement to update the right row, you'll need to use product_units.product_id to refer to the correct row within the product_units table. When you have all of these components, you can run the update statement. */ --QUERY 12 +ALTER TABLE product_units +ADD current_quantity INT; - - - +UPDATE product_units +SET current_quantity = COALESCE(( + SELECT vi.quantity + FROM vendor_inventory vi + WHERE vi.product_id = product_units.product_id + ORDER BY vi.market_date DESC + LIMIT 1 +), 0); --END QUERY diff --git a/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt1.png b/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt1.png new file mode 100644 index 000000000..df5726cf4 Binary files /dev/null and b/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt1.png differ diff --git a/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt2.png b/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt2.png new file mode 100644 index 000000000..666db91cf Binary files /dev/null and b/02_activities/assignments/DC_Cohort/assignment2_section1_logicalmodel_prompt2.png differ
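The behavioral difference between the two CUSTOMER-table architectures proposed in Section 1 (SCD Type 1 overwrite vs. SCD Type 2 history rows) can be sketched concretely. A minimal demo using Python's built-in `sqlite3`; table and column names here are illustrative inventions, not part of the assignment's farmers market schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SCD Type 1: one row per customer; an address change overwrites history.
cur.execute("CREATE TABLE customer_t1 (customer_id INTEGER PRIMARY KEY, address TEXT)")
cur.execute("INSERT INTO customer_t1 VALUES (1, '12 Oak St')")
cur.execute("UPDATE customer_t1 SET address = '99 Pine Ave' WHERE customer_id = 1")

# SCD Type 2: a new row per change, with validity dates and a current-row flag.
cur.execute("""
    CREATE TABLE customer_t2 (
        customer_id INTEGER,
        address     TEXT,
        valid_from  TEXT,
        valid_to    TEXT,      -- NULL while the row is current
        is_current  INTEGER
    )
""")
cur.execute("INSERT INTO customer_t2 VALUES (1, '12 Oak St', '2022-01-01', '2022-06-01', 0)")
cur.execute("INSERT INTO customer_t2 VALUES (1, '99 Pine Ave', '2022-06-01', NULL, 1)")

# Type 1 retains only the latest address; Type 2 retains the full history.
t1_rows = cur.execute("SELECT COUNT(*) FROM customer_t1 WHERE customer_id = 1").fetchone()[0]
t2_rows = cur.execute("SELECT COUNT(*) FROM customer_t2 WHERE customer_id = 1").fetchone()[0]
print(t1_rows, t2_rows)  # 1 2
```

The `is_current` flag is one common Type 2 convention; validity-date ranges alone would also work, at the cost of a slightly more involved "current address" query.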