SQL Deep Dive: Joins, Windows, and Subqueries
Real analytics means joining, ranking, and drilling into multi-table data. In this guide you will build a reporting pipeline for an e-commerce company, combining customers, products, and orders into the kind of dashboard leadership actually wants.
Time: About 10 minutes. What you will learn: INNER and LEFT joins, GROUP BY with HAVING, IN subqueries, CTEs for multi-step logic, window functions (ROW_NUMBER, RANK, DENSE_RANK, running totals), and how to combine everything into a single executive dashboard query.
The Problem
You are a data engineer at a growing e-commerce company. Your data lives in three normalized tables: customers, products, and orders. Leadership wants a dashboard that answers questions like:
- What is the lifetime value of each customer?
- Which product categories drive the most revenue?
- How do regions compare?
- Who are the top customers, and how do they rank against each other?
None of those answers live in a single table. You need joins to connect the data, aggregation to summarize it, subqueries to filter it, and window functions to rank it. Let's build it up piece by piece.
Building the Dataset
Fire up the Teide REPL and create the three tables. This is a simplified star schema: orders is the fact table, customers and products are dimensions.
CREATE TABLE customers (
id INTEGER, name VARCHAR, email VARCHAR, region VARCHAR, joined_date DATE
);
INSERT INTO customers VALUES
(1, 'Alice Chen', 'alice@example.com', 'US-West', '2022-01-15'),
(2, 'Bob Martinez', 'bob@example.com', 'US-East', '2022-03-22'),
(3, 'Carol White', 'carol@example.com', 'EU-West', '2022-06-10'),
(4, 'Dave Kim', 'dave@example.com', 'US-West', '2023-01-05'),
(5, 'Eve Johnson', 'eve@example.com', 'US-East', '2023-04-18'),
(6, 'Frank Brown', 'frank@example.com', 'EU-West', '2023-08-30'),
(7, 'Grace Liu', 'grace@example.com', 'APAC', '2024-02-14'),
(8, 'Heidi Park', 'heidi@example.com', 'APAC', '2024-06-01');
CREATE TABLE products (
id INTEGER, name VARCHAR, category VARCHAR, price REAL
);
INSERT INTO products VALUES
(1, 'Widget Pro', 'hardware', 49.99),
(2, 'Widget Basic', 'hardware', 29.99),
(3, 'CloudSync', 'software', 199.99),
(4, 'DataVault', 'software', 149.99),
(5, 'Support Plan', 'services', 99.99),
(6, 'Training Pack', 'services', 299.99);
CREATE TABLE orders (
id INTEGER, customer_id INTEGER, product_id INTEGER,
quantity INTEGER, order_date DATE, status VARCHAR
);
INSERT INTO orders VALUES
(1, 1, 1, 3, '2022-02-10', 'completed'),
(2, 1, 3, 1, '2022-05-15', 'completed'),
(3, 2, 2, 5, '2022-04-01', 'completed'),
(4, 2, 4, 1, '2022-09-20', 'completed'),
(5, 3, 1, 2, '2022-07-12', 'completed'),
(6, 3, 5, 1, '2023-01-08', 'completed'),
(7, 4, 3, 1, '2023-02-14', 'completed'),
(8, 4, 6, 1, '2023-06-30', 'completed'),
(9, 5, 2, 10, '2023-05-22', 'completed'),
(10, 5, 1, 4, '2023-08-15', 'completed'),
(11, 6, 4, 2, '2023-10-01', 'completed'),
(12, 6, 5, 1, '2024-01-15', 'completed'),
(13, 7, 3, 1, '2024-03-10', 'completed'),
(14, 7, 1, 6, '2024-05-20', 'completed'),
(15, 8, 6, 1, '2024-07-01', 'completed'),
(16, 1, 4, 1, '2023-03-10', 'completed'),
(17, 2, 3, 1, '2023-07-22', 'completed'),
(18, 3, 6, 1, '2023-11-05', 'refunded'),
(19, 5, 3, 1, '2024-01-30', 'completed'),
(20, 4, 1, 2, '2024-04-11', 'completed');
Quick sanity check — verify the row counts:
SELECT COUNT(*) FROM customers;
SELECT COUNT(*) FROM products;
SELECT COUNT(*) FROM orders;
Joining Tables: Who Bought What?
The most common question in any e-commerce dataset is simply: who bought what, and when? That information is spread across all three tables. An INNER JOIN connects them through the foreign keys.
SELECT
c.name AS customer,
p.name AS product,
p.category,
o.quantity,
o.quantity * p.price AS line_total,
o.order_date
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
ORDER BY o.order_date
LIMIT 10;
Notice the pattern: each row in orders gets enriched with the customer name and product details. We also computed line_total on the fly. The WHERE o.status = 'completed' filter drops the single refunded order so our revenue numbers stay accurate.
LEFT JOIN: The Complete Picture
An INNER JOIN only returns rows that match on both sides. But sometimes you need the opposite: show me everything on the left, even if there is no match on the right. That is what LEFT JOIN is for.
Let's find customers who have not placed any orders in 2024. Since the engine only supports equi-join conditions in ON clauses, we first filter orders to 2024 in a CTE, then LEFT JOIN against it:
WITH orders_2024 AS (
SELECT id, customer_id, order_date
FROM orders
WHERE order_date >= CAST('2024-01-01' AS DATE)
)
SELECT
c.name AS customer,
c.region,
o.id AS order_id,
o.order_date
FROM customers c
LEFT JOIN orders_2024 o
ON c.id = o.customer_id
ORDER BY o.order_date, c.name;
The NULL values in order_id and order_date tell you exactly who has gone quiet. Alice, Bob, and Carol — your earliest customers — have not ordered in 2024. By pre-filtering the orders in a CTE, the LEFT JOIN naturally produces NULLs for customers with no 2024 orders. That is the kind of insight a retention team can act on immediately.
Aggregation Across Joins
Now let's answer the questions leadership actually asks: how much revenue does each category generate? How do regions compare? Which products move the most units?
Revenue per Category
SELECT
p.category,
SUM(o.quantity * p.price) AS total_revenue,
SUM(o.quantity) AS units_sold,
COUNT(*) AS order_count
FROM orders o
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY p.category
ORDER BY total_revenue DESC;
Software drives the most revenue despite fewer units — high-value items pulling their weight.
Revenue per Region
SELECT
c.region,
SUM(o.quantity * p.price) AS total_revenue,
COUNT(*) AS order_count
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.region
ORDER BY total_revenue DESC;
HAVING: Filtering Groups
Suppose you only want product categories that have generated more than $800 in revenue. WHERE filters rows before aggregation; HAVING filters groups after. We compute the joined data in a CTE first, then apply GROUP BY and HAVING on the flat result:
WITH order_lines AS (
SELECT p.category, o.quantity * p.price AS line_total
FROM orders o
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
)
SELECT
category,
SUM(line_total) AS total_revenue
FROM order_lines
GROUP BY category
HAVING SUM(line_total) > 800.0
ORDER BY total_revenue DESC;
Subqueries: Asking Questions Within Questions
A subquery is a SELECT inside another SELECT. It lets you compute a value (or a set of values) and use it as a filter condition.
Above-Average Spenders (CTE Approach)
Who has spent more than the average customer? We use two CTEs: one to compute each customer's total spend, and another to compute the average. Then we join them to filter.
WITH customer_totals AS (
SELECT
c.name AS customer_name,
SUM(o.quantity * p.price) AS total_spend
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.name
),
avg_spend AS (
SELECT AVG(total_spend) AS avg_total FROM customer_totals
)
SELECT
ct.customer_name,
ct.total_spend
FROM customer_totals ct, avg_spend a
WHERE ct.total_spend > a.avg_total
ORDER BY ct.total_spend DESC;
IN Subquery: Customers Who Bought Software
Sometimes you want to filter by membership in a dynamically computed set. The IN subquery handles this cleanly.
SELECT DISTINCT c.name, c.region
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
WHERE o.product_id IN (
SELECT id FROM products WHERE category = 'software'
)
ORDER BY c.name;
Try it yourself: Write a query that finds all customers who have never purchased a product in the "services" category. Hint: use NOT IN with a subquery on orders joined to products.
CTEs for Multi-Step Logic
Common Table Expressions (CTEs) let you name intermediate result sets and reference them later. Think of them as temporary views that exist only for the duration of one query. They make complex logic readable and debuggable.
Customer Lifetime Value with Ranking
Let's build this in steps: first compute each customer's total spend, then rank them.
WITH customer_spend AS (
SELECT
c.id,
c.name AS customer_name,
c.region,
c.joined_date,
SUM(o.quantity * p.price) AS lifetime_value,
COUNT(*) AS order_count
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.id, c.name, c.region, c.joined_date
)
SELECT
customer_name,
region,
lifetime_value,
order_count,
lifetime_value / order_count AS avg_order_value
FROM customer_spend
ORDER BY lifetime_value DESC;
Chained CTEs: Cohort Analysis
CTEs can reference earlier CTEs, letting you build up logic step by step. Let's group customers by the year they joined and see how each cohort performs.
WITH customer_spend AS (
SELECT
c.id,
c.joined_date,
SUM(o.quantity * p.price) AS lifetime_value
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.id, c.joined_date
),
cohorts AS (
SELECT
EXTRACT(YEAR FROM joined_date) AS cohort_year,
COUNT(*) AS cohort_size,
SUM(lifetime_value) AS cohort_revenue,
AVG(lifetime_value) AS avg_clv
FROM customer_spend
GROUP BY EXTRACT(YEAR FROM joined_date)
)
SELECT
cohort_year,
cohort_size,
cohort_revenue,
avg_clv
FROM cohorts
ORDER BY cohort_year;
The 2023 cohort has the highest average CLV. That could reflect more time to accumulate purchases, or it could signal a particularly engaged group. Either way, it is the kind of signal worth investigating.
Window Functions
Window functions compute a value across a set of rows related to the current row, without collapsing the result into a single group. They are what you reach for when you need rankings, running totals, or comparisons against an aggregate — while keeping every individual row visible.
Running Total per Customer
Let's track how each customer's spend accumulates over time:
SELECT
c.name AS customer_name,
o.order_date,
o.quantity * p.price AS line_total,
SUM(o.quantity * p.price) OVER (
PARTITION BY c.id ORDER BY o.order_date
) AS running_total
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
AND c.name IN ('Alice Chen', 'Eve Johnson')
ORDER BY c.name, o.order_date;
RANK and DENSE_RANK: Ranking Customers by Total Spend
RANK() assigns the same rank to ties but leaves gaps. DENSE_RANK() assigns the same rank to ties without gaps. ROW_NUMBER() always gives a unique number.
WITH customer_spend AS (
SELECT
c.name AS customer_name,
c.region,
SUM(o.quantity * p.price) AS total_spend
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.name, c.region
)
SELECT
customer_name,
region,
total_spend,
RANK() OVER (ORDER BY total_spend DESC) AS rank,
DENSE_RANK() OVER (ORDER BY total_spend DESC) AS dense_rank,
ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS row_num
FROM customer_spend
ORDER BY rank;
Notice how Bob and Grace both have a total spend near 499.93. With RANK, they both get rank 4 and rank 6 is skipped. With DENSE_RANK, there are no gaps. ROW_NUMBER breaks the tie arbitrarily.
Try it yourself: Add a window function column that shows each customer's spend as a percentage of the total across all customers. Hint: use SUM(total_spend) OVER () (empty OVER clause = entire result set) as the denominator, and multiply by 100.
The Executive Dashboard
Now let's bring everything together. This is the kind of query you would schedule nightly and pipe into a BI tool. It combines joins, aggregation, a CTE, and a window function into a single result set.
WITH customer_metrics AS (
SELECT
c.id AS customer_id,
c.name AS customer_name,
c.region,
EXTRACT(YEAR FROM c.joined_date) AS join_year,
COUNT(*) AS order_count,
SUM(o.quantity * p.price) AS lifetime_value,
SUM(o.quantity) AS total_units
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
INNER JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY c.id, c.name, c.region, c.joined_date
)
SELECT
customer_name,
region,
join_year,
order_count,
total_units,
lifetime_value,
lifetime_value / order_count AS avg_order_value,
RANK() OVER (ORDER BY lifetime_value DESC) AS spend_rank,
SUM(lifetime_value) OVER (
ORDER BY lifetime_value DESC
) AS cumulative_revenue
FROM customer_metrics
ORDER BY spend_rank;
In one query you can see that Eve Johnson is your highest-value customer, the top four customers account for more than 60% of total revenue (cumulative hits $2299.70 of $3699.56), and the 2023 cohort holds the top two spots. That is a dashboard slide, not a SQL exercise.
What's Next
You now have the core SQL toolkit for building real reporting queries. Here are some directions to explore next:
Working with Time
DATE, TIME, and TIMESTAMP columns. Temporal filtering, extraction, and time-series patterns.
Joins
Full reference for INNER, LEFT, RIGHT, FULL, and CROSS joins.
Window Functions
ROW_NUMBER, RANK, DENSE_RANK, NTILE, LAG, LEAD, and frame specifications.
Subqueries & CTEs
Scalar, IN, EXISTS, and correlated subqueries. WITH clause and chained CTEs.