Graph Queries with SQL/PGQ: Finding Patterns in Connected Data
Following the money is harder than it looks. SQL/PGQ makes it possible.
You are a fraud analyst at a fintech company. A compliance alert has flagged a cluster of accounts with unusual transfer patterns — small amounts bouncing between accounts, money leaving and mysteriously returning, a single source splitting payments to multiple destinations just below the reporting threshold. Your job is to trace the flows, identify the patterns, and produce evidence for the investigation team.
The catch: the evidence lives in connections, not in rows. A single transfer looks innocent. It is only when you follow the chain — account to account to account — that the pattern emerges. Traditional SQL makes this painful. Graph queries make it natural.
In this guide you will learn how to:
- Model financial data as a property graph with
CREATE PROPERTY GRAPH - Write
MATCHpatterns to trace direct and multi-hop transfers - Use variable-length paths and
ANY SHORTESTto follow the money - Detect circular flows (a classic money laundering signal)
- Spot smurfing patterns using graph queries composed with SQL aggregation
- Run PageRank, community detection, and connected components on financial networks
- Compose graph results with CTEs and GROUP BY for a full investigation
Estimated time: about 15 minutes.
The Problem: Why Self-Joins Fall Apart
Before we reach for graph queries, let's feel the pain of doing this with standard SQL. Suppose you want to trace a 3-hop transfer chain starting from account 101. With regular joins, you would write something like this:
-- The painful way: trace 3 hops with self-joins
SELECT t1.src_account AS hop0,
t1.dst_account AS hop1,
t2.dst_account AS hop2,
t3.dst_account AS hop3,
t1.amount AS amt1,
t2.amount AS amt2,
t3.amount AS amt3
FROM transfers t1
JOIN transfers t2 ON t1.dst_account = t2.src_account
JOIN transfers t3 ON t2.dst_account = t3.src_account
WHERE t1.src_account = 101;
That is three joins for three hops. Want four hops? Add another join. Want variable-length
paths? You cannot — not without recursive CTEs that quickly become unreadable. Want to
check if the money comes back to where it started? Add a WHERE t3.dst_account = 101
and hope you guessed the right chain length.
Now imagine explaining this query to a colleague. Or extending it to include account owners, filter by amount thresholds, and detect cycles of unknown length. The SQL becomes a wall of aliases and join conditions that obscures the very pattern you are trying to find.
There has to be a better way. There is.
Building the Dataset
Our investigation involves four tables: persons (account holders), accounts, an ownership
mapping, and a transfer log. Fire up the TeideDB REPL (cargo run --features cli)
and create the tables:
CREATE TABLE persons (
id INTEGER, name VARCHAR, risk_score REAL, country VARCHAR
);
INSERT INTO persons VALUES
(1, 'Alice Yang', 0.1, 'US'),
(2, 'Bob Petrov', 0.3, 'UK'),
(3, 'Carol Santos', 0.8, 'BR'),
(4, 'Dave Mueller', 0.2, 'DE'),
(5, 'Eve Nakamura', 0.9, 'JP'),
(6, 'Frank Osei', 0.4, 'GH'),
(7, 'Grace Park', 0.1, 'KR'),
(8, 'Hector Ruiz', 0.7, 'MX');
CREATE TABLE accounts (
id INTEGER, owner_id INTEGER, account_type VARCHAR,
balance REAL, opened_date DATE
);
INSERT INTO accounts VALUES
(101, 1, 'checking', 50000.0, '2020-01-15'),
(102, 1, 'savings', 120000.0, '2020-01-15'),
(103, 2, 'checking', 35000.0, '2019-06-01'),
(104, 3, 'checking', 8000.0, '2022-11-10'),
(105, 3, 'business', 250000.0, '2022-11-10'),
(106, 4, 'checking', 42000.0, '2021-03-20'),
(107, 5, 'business', 15000.0, '2023-05-01'),
(108, 5, 'checking', 3000.0, '2023-05-01'),
(109, 6, 'checking', 28000.0, '2020-09-15'),
(110, 7, 'savings', 95000.0, '2018-12-01'),
(111, 8, 'checking', 11000.0, '2023-08-20');
CREATE TABLE owns (
person_id INTEGER, account_id INTEGER
);
INSERT INTO owns VALUES
(1, 101), (1, 102), (2, 103), (3, 104), (3, 105),
(4, 106), (5, 107), (5, 108), (6, 109), (7, 110), (8, 111);
CREATE TABLE transfers (
id INTEGER, src_account INTEGER, dst_account INTEGER,
amount INTEGER, transfer_date DATE, memo VARCHAR
);
INSERT INTO transfers VALUES
(1, 101, 103, 15000, '2024-01-10', 'consulting'),
(2, 103, 104, 9500, '2024-01-12', 'payment'),
(3, 104, 107, 9000, '2024-01-14', 'services'),
(4, 107, 101, 8500, '2024-01-16', 'refund'),
(5, 102, 106, 25000, '2024-02-01', 'investment'),
(6, 106, 109, 12000, '2024-02-05', 'transfer'),
(7, 109, 110, 5000, '2024-02-10', 'savings'),
(8, 105, 108, 4900, '2024-03-01', 'split-1'),
(9, 105, 109, 4800, '2024-03-01', 'split-2'),
(10, 105, 111, 4700, '2024-03-01', 'split-3'),
(11, 110, 102, 30000, '2024-03-15', 'quarterly'),
(12, 103, 106, 7000, '2024-04-01', 'payment'),
(13, 106, 104, 3000, '2024-04-05', 'fee'),
(14, 108, 103, 2500, '2024-04-10', 'return'),
(15, 111, 104, 4500, '2024-04-15', 'payment');
Let's do a quick sanity check:
SELECT COUNT(*) AS n_persons FROM persons;
SELECT COUNT(*) AS n_accounts FROM accounts;
SELECT COUNT(*) AS n_transfers FROM transfers;
Eight people, eleven accounts, fifteen transfers. Buried in those fifteen transfers are a circular flow, a smurfing pattern, and a web of connections that would take pages of self-joins to untangle. Let's use a graph instead.
Creating a Property Graph
A property graph overlays structure on your existing tables. You tell TeideDB which tables are vertices (nodes) and which are edges (connections), and how they reference each other. The data stays in the tables — the graph is a lens for querying it.
CREATE PROPERTY GRAPH financial
VERTEX TABLES (
persons KEY (id),
accounts KEY (id)
)
EDGE TABLES (
owns SOURCE KEY (person_id) REFERENCES persons (id)
DESTINATION KEY (account_id) REFERENCES accounts (id),
transfers SOURCE KEY (src_account) REFERENCES accounts (id)
DESTINATION KEY (dst_account) REFERENCES accounts (id)
);
Let's break down what this does:
- VERTEX TABLES —
personsandaccountsbecome node types.KEY (id)tells TeideDB which column uniquely identifies each node. These keys can be non-sequential — TeideDB handles the internal mapping. - EDGE TABLES —
ownsconnects persons to accounts.transfersconnects accounts to accounts.SOURCE KEYandDESTINATION KEYspecify which columns hold the foreign keys, andREFERENCESlinks them to the vertex tables.
Behind the scenes, TeideDB builds a compressed sparse row (CSR) index for fast graph
traversal. But from your perspective, you now have a named graph called
financial that you can query with MATCH patterns.
First MATCH: Direct Transfers
The simplest graph query matches a single edge — one hop. Let's see all transfers between accounts:
SELECT src, dst, amount
FROM GRAPH_TABLE (financial
MATCH (a:accounts)-[t:transfers]->(b:accounts)
COLUMNS (a.id AS src, b.id AS dst, t.amount AS amount)
);
All fifteen transfers, expressed as a pattern: "account a sent money through
transfer t to account b." The arrow -> indicates
direction — money flows from source to destination.
Now let's filter for large transfers only:
SELECT src, dst, amount, memo
FROM GRAPH_TABLE (financial
MATCH (a:accounts)-[t:transfers WHERE t.amount > 10000]->(b:accounts)
COLUMNS (a.id AS src, b.id AS dst, t.amount AS amount, t.memo AS memo)
);
Four transfers above $10,000. The WHERE clause inside the MATCH pattern
filters at the edge level — only transfers matching the condition are traversed.
Who Owns What?
Fraud investigation is not just about money flows. You need to know who is behind
each account. The owns edge connects persons to accounts:
SELECT person_name, acct, acct_type
FROM GRAPH_TABLE (financial
MATCH (p:persons)-[o:owns]->(a:accounts)
COLUMNS (p.name AS person_name, a.id AS acct, a.account_type AS acct_type)
);
Now we can see the full picture: Alice owns accounts 101 and 102, Carol owns 104 and 105, Eve owns 107 and 108. These multi-account holders are particularly interesting for investigation — they can move money between their own accounts to obscure trails.
Following the Money: Multi-Hop Paths
Here is where graph queries truly shine. Instead of writing one join per hop, you specify a variable-length path and let TeideDB do the traversal.
Let's trace all paths of 1 to 3 hops starting from account 101:
SELECT src, dst, hops
FROM GRAPH_TABLE (financial
MATCH (a:accounts WHERE a.id = 101)-[t:transfers]->{1,3}(b:accounts)
COLUMNS (a.id AS src, b.id AS dst, path_length(t) AS hops)
);
The results show all accounts reachable from account 101 within 1 to 3 hops, along with the number of hops required to reach each one. Some accounts may appear multiple times if there are multiple distinct paths of different lengths reaching them. For example, account 103 is reachable in one hop (directly), while accounts further along the chain require two or three hops.
What if you only care about the shortest route between two specific accounts? Use
ANY SHORTEST:
SELECT DISTINCT src, dst, hops
FROM GRAPH_TABLE (financial
MATCH ANY SHORTEST (a:accounts WHERE a.id = 101)-[t:transfers]->{1,5}(b:accounts WHERE b.id = 107)
COLUMNS (a.id AS src, b.id AS dst, path_length(t) AS hops)
);
Three hops: 101 → 103 → 104 → 107. That is the shortest chain of transfers
connecting Alice's checking account to Eve's business account. The DISTINCT
collapses the per-step rows that ANY SHORTEST returns into a single summary
row. In the self-join world, you would have needed to try every possible chain length and
pick the shortest. Here, TeideDB handles it in one query.
Detecting Circular Flows
This is the moment the investigation gets interesting. A circular flow — money that leaves an account and eventually returns to it — is one of the oldest money laundering signals. The idea is simple: send money on a tour through several accounts to disguise its origin, then bring it back home.
To detect a cycle, use the same variable name for both the start and end node in the MATCH pattern. This tells TeideDB that the path must begin and end at the same node:
SELECT acct, hops
FROM GRAPH_TABLE (financial
MATCH (a:accounts WHERE a.id = 101)-[t:transfers]->{2,5}(a)
COLUMNS (a.id AS acct, path_length(t) AS hops)
);
There it is. A 4-hop cycle: account 101 sent money that traveled through a chain of
accounts and returned to 101. The key is the repeated variable a —
(a:accounts WHERE a.id = 101)-[t:transfers]->{2,5}(a) means "start at
a and end at the same node a."
Now let's use CHEAPEST with COST to find the lowest-cost
path between two accounts. This uses Dijkstra's algorithm under the hood:
SELECT src, dst, total_cost
FROM GRAPH_TABLE (financial
MATCH CHEAPEST (a:accounts WHERE a.id = 101)-[t:transfers]->{1,5}(b:accounts WHERE b.id = 110)
COST t.amount
COLUMNS (a.id AS src, b.id AS dst, COST() AS total_cost)
);
This finds the cheapest (lowest total transfer amount) path from account 101 to account
110, summing the amount on each edge along the way. The result tells you both
the route cost and how many hops were needed. This is useful for understanding the minimum
financial exposure along any path between two accounts.
Try it yourself: Check whether any other accounts have circular flows.
Try accounts 103, 106, and 110. Which ones have cycles, and how long are they? Hint:
widen the path range to {2,6} to catch longer cycles. Remember to use the
same variable for start and end: (a:accounts WHERE a.id = 103)-[t:transfers]->{2,6}(a).
Smurfing Detection
Smurfing — also called "structuring" — is when a sender splits a large payment into multiple smaller ones to stay below reporting thresholds. Each individual transfer looks unremarkable, but the pattern is unmistakable when you see the full picture.
Let's look for accounts that sent money to three or more distinct destinations in a single day:
SELECT src, transfer_dt, n_destinations, total_amount
FROM (
SELECT src, transfer_dt, COUNT(*) AS n_destinations, SUM(amount) AS total_amount
FROM GRAPH_TABLE (financial
MATCH (a:accounts)-[t:transfers]->(b:accounts)
COLUMNS (a.id AS src, b.id AS dst, t.amount AS amount, t.transfer_date AS transfer_dt)
)
GROUP BY src, transfer_dt
)
WHERE n_destinations >= 3;
Account 105 sent three transfers on the same day totaling $14,400. Let's see the details:
SELECT src, dst, amount, memo
FROM GRAPH_TABLE (financial
MATCH (a:accounts WHERE a.id = 105)-[t:transfers]->(b:accounts)
COLUMNS (a.id AS src, b.id AS dst, t.amount AS amount, t.memo AS memo)
);
Three transfers of $4,900, $4,800, and $4,700 — all on the same day, all just below $5,000, all going to different accounts. The memos even say "split-1", "split-2", "split-3". In the real world, memos would not be this obvious, but the transfer pattern would look exactly the same. Account 105 belongs to Carol Santos (risk score: 0.8). This warrants a closer look.
Graph Algorithms: The Big Picture
Individual patterns are useful, but sometimes you need to step back and look at the structure of the entire network. Which accounts are the most central? Do the accounts cluster into distinct communities? Are there isolated subnetworks?
TeideDB provides graph algorithm functions that you can use directly in GRAPH_TABLE
COLUMNS clauses. However, algorithm functions (PAGERANK, COMMUNITY, COMPONENT) require
a graph with a single edge table. Our financial graph has
two edge tables (owns and transfers), so we need to create a
separate graph for the transfer network:
CREATE PROPERTY GRAPH transfer_net
VERTEX TABLES (
accounts KEY (id)
)
EDGE TABLES (
transfers SOURCE KEY (src_account) REFERENCES accounts (id)
DESTINATION KEY (dst_account) REFERENCES accounts (id)
);
This transfer_net graph has only accounts as vertices and transfers as
edges — exactly what we need for network analysis.
Let's start with PageRank — a measure of how "important" or central each account is in the transfer network:
SELECT acct, rank
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (a.id AS acct, PAGERANK(transfer_net, a) AS rank)
)
ORDER BY rank DESC;
The results rank all eleven accounts by their PageRank score. Accounts that receive transfers from many sources (like account 104, which receives from accounts 103, 106, and 111) will have the highest scores. Accounts that only send but never receive (like account 105) will have the lowest scores. PageRank values are nondeterministic due to floating-point iteration order, so exact scores may vary between runs — but the relative ranking is stable.
Now let's look at community detection using the Louvain algorithm. This groups accounts that transact heavily with each other:
SELECT acct, community
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (a.id AS acct, COMMUNITY(transfer_net, a) AS community)
)
ORDER BY community, acct;
The Louvain algorithm partitions the accounts into communities of tightly-connected nodes. You will typically see the circular-flow accounts (101, 103, 104, 107) grouped together, the chain from Alice's savings through Dave and Frank to Grace in another community, and Carol's business account with the smurfing destinations in a third. The exact community IDs may vary between runs, but the groupings are consistent.
Finally, connected components tells us whether there are isolated subnetworks:
SELECT acct, comp
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (a.id AS acct, COMPONENT(transfer_net, a) AS comp)
)
ORDER BY comp, acct;
All accounts belong to a single connected component. Every account in our dataset can reach every other account through some chain of transfers. There are no isolated nodes — this is a fully connected network, which makes containment more difficult if any of these accounts are compromised.
Composing with SQL
Graph queries do not live in isolation. You can wrap GRAPH_TABLE in CTEs, join the results with other tables, and use the full power of SQL aggregation. This is where the investigation comes together.
Let's find the average risk score per community by joining graph results with the
persons and ownership data. Note that we use the transfer_net graph (single
edge table) for the algorithm query and the financial graph for the
ownership query:
WITH communities AS (
SELECT acct, community
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (a.id AS acct, COMMUNITY(transfer_net, a) AS community)
)
),
ownership AS (
SELECT person_name, acct, risk
FROM GRAPH_TABLE (financial
MATCH (p:persons)-[o:owns]->(a:accounts)
COLUMNS (p.name AS person_name, a.id AS acct, p.risk_score AS risk)
)
)
SELECT c.community,
COUNT(*) AS n_accounts,
AVG(o.risk) AS avg_risk,
MAX(o.risk) AS max_risk
FROM communities c
JOIN ownership o ON c.acct = o.acct
GROUP BY c.community
ORDER BY avg_risk DESC;
The results show each community alongside the number of accounts it contains and the average and maximum risk scores of its members. The community containing the smurfing accounts (Carol Santos, risk 0.8, and Eve Nakamura, risk 0.9) will have the highest average risk, while the chain community (Alice, Dave, Frank, Grace) will have the lowest. The suspicious structural patterns correlate with the risk scores.
Try it yourself: Write a query that uses CLUSTERING_COEFFICIENT
to find accounts with a clustering coefficient above 0.3. These are accounts whose neighbors
also transact with each other — a sign of a tightly-knit group. Which accounts are they,
and do they overlap with the suspicious communities? Remember to use the
transfer_net graph for algorithm functions.
The Full Investigation
Let's bring everything together in a single compound query. We want to identify the
highest-PageRank accounts, determine which community they belong to, and look up their
owners. We use the transfer_net graph for algorithms and
financial for ownership.
WITH ranked AS (
SELECT acct, rank
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (
a.id AS acct,
PAGERANK(transfer_net, a) AS rank
)
)
),
comm AS (
SELECT acct, community
FROM GRAPH_TABLE (transfer_net
MATCH (a:accounts)
COLUMNS (
a.id AS acct,
COMMUNITY(transfer_net, a) AS community
)
)
),
ownership AS (
SELECT person_name, acct, risk, country
FROM GRAPH_TABLE (financial
MATCH (p:persons)-[o:owns]->(a:accounts)
COLUMNS (p.name AS person_name, a.id AS acct,
p.risk_score AS risk, p.country AS country)
)
)
SELECT r.acct,
o.person_name,
o.country,
o.risk,
r.rank,
c.community
FROM ranked r
JOIN comm c ON r.acct = c.acct
JOIN ownership o ON r.acct = o.acct
ORDER BY r.rank DESC;
The results combine three dimensions of analysis into a single view. Each row shows an account alongside its owner, country, risk score, PageRank centrality, and community membership. The investigation picture becomes clear:
- Account 104 (Carol Santos, Brazil, risk 0.8) is the most central node in the network and sits in the circular-flow community. Top priority.
- Account 103 (Bob Petrov, UK, risk 0.3) is the second most central and also in the circular-flow community. It is the first hop from Alice's account.
- Account 105 (Carol Santos, Brazil, risk 0.8) has the lowest PageRank but is the source of the smurfing pattern. Carol controls both the most central receiving account (104) and the distributing account (105) — a classic setup.
- Eve Nakamura (risk 0.9) appears in multiple communities (via account 107, part of the circular flow, and via account 108, a smurfing destination). She bridges two suspicious clusters.
This is the power of graph queries composed with SQL: you start with structure (MATCH), add analytics (PageRank, community), and finish with relational operations (JOIN, ORDER BY) to produce actionable intelligence. Each layer builds on the last, and the final result is something no amount of self-joins could produce this cleanly.
What's Next
You have seen how SQL/PGQ turns relationship-heavy investigations from painful self-join exercises into expressive pattern queries. But graph structure is only one dimension of your data. If your accounts have text descriptions, transaction narratives, or behavioral embeddings, you can combine graph queries with vector similarity search.
- Vector Search — store embeddings, compute cosine similarity, and build HNSW indexes for fast nearest-neighbor queries.
- SQL/PGQ Reference — complete syntax documentation for CREATE PROPERTY GRAPH, MATCH patterns, path modes, and all algorithm functions.
- Graph Algorithms — detailed reference for PageRank, connected components, Louvain community detection, and clustering coefficient.