Vector Search: Embeddings, Similarity, and Nearest Neighbors
Your users don't search with SQL keywords. They search with meaning.
You maintain a developer documentation site with hundreds of articles. Your search bar uses keyword matching, and users keep complaining. Someone types "debugging rust borrow checker" and gets zero results because no article contains that exact phrase. Meanwhile, "Advanced Rust Lifetimes" — the article that actually answers their question — sits undiscovered. You need a search system that understands meaning, not just words.
In this guide, you will build a "related articles" feature backed by vector embeddings and similarity search. By the end, you will know how to store embeddings in TeideDB, query them with cosine similarity and euclidean distance, use the automatic KNN optimization, and scale with HNSW indexes. Estimated time: 12 minutes.
The Problem with Keyword Search
Keyword search works by matching tokens. It is fast, well-understood, and completely blind to synonyms, paraphrases, and conceptual relationships. Consider these queries against a knowledge base:
- "debugging rust borrow checker" — should match "Advanced Rust Lifetimes", but shares zero words
- "how to store data in tables" — should match "Introduction to SQL", but the word "tables" alone could match articles about HTML tables
- "ML model training" — should match "Neural Networks from Scratch", but the article never uses the abbreviation "ML"
The fix is to represent each article as a point in a high-dimensional space where proximity reflects semantic similarity. These points are called embeddings.
What Are Embeddings?
An embedding is a list of numbers (a vector) that captures the meaning of a piece of text. Embedding models — trained neural networks like OpenAI's text-embedding-3 or open-source alternatives like sentence-transformers — read text and output a fixed-length vector. The key property: texts with similar meaning produce vectors that point in similar directions.
Think of it geometrically. Each vector is an arrow from the origin. Two articles about Rust programming produce arrows pointing roughly the same way. An article about machine learning points in a different direction. The angle between two arrows tells you how related the articles are. A small angle means high similarity; a large angle means low similarity. This angle-based measure is called cosine similarity: a value of 1.0 means the vectors point in exactly the same direction (identical meaning), and 0.0 means they are perpendicular (unrelated).
In practice, embedding models produce vectors with hundreds or thousands of dimensions — OpenAI's text-embedding-3-small uses 1536 dimensions, and sentence-transformers models commonly use 384 or 768. For this guide, we use 4-dimensional vectors so you can see the actual numbers and build intuition. The principles are identical at any scale: similarity is computed the same way whether your vectors have 4 components or 4,000.
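The arithmetic behind cosine similarity fits in a few lines of plain Rust. This is an illustrative sketch, not TeideDB's SIMD kernel — similarity is the dot product divided by the product of the vector lengths:

```rust
// Cosine similarity computed by hand on small vectors:
// sim(a, b) = dot(a, b) / (|a| * |b|)
// Identical direction -> 1.0, perpendicular -> 0.0.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn norm(a: &[f32]) -> f32 {
    dot(a, a).sqrt()
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    dot(a, b) / (norm(a) * norm(b))
}

fn main() {
    let rust_article = [0.9, 0.8, 0.1, 0.05];
    let ml_article = [0.1, 0.1, 0.2, 0.9];
    // A vector compared with itself is perfectly aligned: similarity ~ 1.0
    println!("self:  {:.3}", cosine_similarity(&rust_article, &rust_article));
    // Different topics point in different directions: low similarity
    println!("cross: {:.3}", cosine_similarity(&rust_article, &ml_article));
}
```

Note that scaling a vector does not change its direction, which is why cosine similarity ignores magnitude entirely.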
Setting Up the Data
Start by creating the articles table with standard SQL:
CREATE TABLE articles (
id INTEGER, title VARCHAR, category VARCHAR, published_date DATE
);
INSERT INTO articles VALUES
(1, 'Getting Started with Rust', 'tutorial', '2024-06-15'),
(2, 'Advanced Rust Lifetimes', 'tutorial', '2024-07-01'),
(3, 'Building REST APIs in Rust', 'tutorial', '2024-07-20'),
(4, 'Introduction to SQL', 'database', '2024-08-01'),
(5, 'Query Optimization Techniques', 'database', '2024-08-15'),
(6, 'Graph Databases Explained', 'database', '2024-09-01'),
(7, 'Machine Learning Fundamentals', 'ml', '2024-09-15'),
(8, 'Neural Networks from Scratch', 'ml', '2024-10-01'),
(9, 'Embeddings and Vector Search', 'ml', '2024-10-15'),
(10, 'Full-Text Search vs Vectors', 'search', '2024-11-01');
Now add the embedding column. Embedding columns use a specialized flat N×D memory layout (TD_F32 type) for efficient SIMD computation, so they are created via the Rust API rather than SQL. Once the column exists, everything else is pure SQL.
use teide::Session;
let mut session = Session::new()?;
// Execute the CREATE TABLE and INSERT above (omitted for brevity)
// 4-dimensional embeddings: one 4-element vector per row
// Designed so articles in the same category cluster together
let embeddings: Vec<f32> = vec![
0.9, 0.8, 0.1, 0.05, // 1: Getting Started with Rust
0.85, 0.9, 0.1, 0.1, // 2: Advanced Rust Lifetimes
0.8, 0.7, 0.3, 0.1, // 3: Building REST APIs in Rust
0.1, 0.2, 0.9, 0.1, // 4: Introduction to SQL
0.15, 0.1, 0.85, 0.2, // 5: Query Optimization Techniques
0.2, 0.15, 0.8, 0.3, // 6: Graph Databases Explained
0.1, 0.1, 0.2, 0.9, // 7: Machine Learning Fundamentals
0.05, 0.15, 0.15, 0.95, // 8: Neural Networks from Scratch
0.1, 0.1, 0.4, 0.85, // 9: Embeddings and Vector Search
0.2, 0.1, 0.6, 0.6, // 10: Full-Text Search vs Vectors
];
session.add_embedding_column("articles", "embedding", 4, &embeddings)?;
Look at the vectors. The Rust tutorial articles (rows 1–3) have high values in dimensions 0 and 1. The database articles (rows 4–6) are heavy in dimension 2. The ML articles (rows 7–9) concentrate in dimension 3. Article 10 ("Full-Text Search vs Vectors") sits between database and ML, which makes sense — it spans both topics.
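You can check the "spans both topics" claim for article 10 numerically. The sketch below (plain Rust, independent of TeideDB) compares its vector against one article from each cluster:

```rust
// Article 10 should be moderately similar to both a database article and an
// ML article, and clearly dissimilar to a Rust tutorial.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let full_text_search = [0.2, 0.1, 0.6, 0.6]; // 10: Full-Text Search vs Vectors
    let intro_sql = [0.1, 0.2, 0.9, 0.1];        //  4: Introduction to SQL
    let ml_fundamentals = [0.1, 0.1, 0.2, 0.9];  //  7: Machine Learning Fundamentals
    let rust_start = [0.9, 0.8, 0.1, 0.05];      //  1: Getting Started with Rust
    println!("vs database: {:.2}", cosine(&full_text_search, &intro_sql));      // high
    println!("vs ml:       {:.2}", cosine(&full_text_search, &ml_fundamentals)); // high
    println!("vs rust:     {:.2}", cosine(&full_text_search, &rust_start));      // low
}
```

Article 10 lands close to both the database and ML clusters and far from the Rust cluster, exactly as the raw numbers suggest.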
After calling add_embedding_column, the articles table has five columns: the four you created with SQL plus the embedding column. The embedding column is invisible in normal SELECT output (it contains raw float arrays), but similarity functions can reference it by name.
Verify the table exists:
SELECT id, title, category FROM articles;
Finding Similar Articles with Cosine Similarity
A user is reading a Rust tutorial and wants related content. In a real application, you would run the user's search query through an embedding model to get a query vector. Here, we construct one by hand: [0.88, 0.82, 0.1, 0.08] — heavy in the "Rust" dimensions (0 and 1), light in the database and ML dimensions (2 and 3). Let us find the five most similar articles:
SELECT
id,
title,
category,
COSINE_SIMILARITY(embedding, ARRAY[0.88, 0.82, 0.1, 0.08]) AS sim
FROM articles
ORDER BY sim DESC
LIMIT 5;
The results tell a clear story. The three Rust tutorials score above 0.98 — almost perfect alignment with the query vector. "Graph Databases Explained" and "Full-Text Search vs Vectors" fill out the top 5, but their scores drop sharply to below 0.4. That gap is your signal: the top 3 are genuinely related, and the rest are noise. In a real application, you would apply a similarity threshold (say, 0.7) to cut off irrelevant results.
Notice that COSINE_SIMILARITY returns a value between 0 and 1 (for non-negative vectors like ours). A score of 1.0 means the vectors point in exactly the same direction. The query vector [0.88, 0.82, 0.1, 0.08] is nearly identical to row 1's [0.9, 0.8, 0.1, 0.05], which is why that score is so close to 1.
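Before the KNN optimization described later, this query is conceptually "score every row, sort, keep 5". The same logic can be replayed in plain Rust against the ten embeddings — an illustration of what the SQL asks for, not TeideDB's execution plan:

```rust
// Score every row against the query vector, sort descending, take the top k.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn top_similar(rows: &[(u32, [f32; 4])], query: &[f32; 4], k: usize) -> Vec<(u32, f32)> {
    let mut scored: Vec<(u32, f32)> =
        rows.iter().map(|(id, v)| (*id, cosine(v, query))).collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap()); // highest similarity first
    scored.truncate(k);
    scored
}

fn main() {
    let rows: [(u32, [f32; 4]); 10] = [
        (1, [0.9, 0.8, 0.1, 0.05]),
        (2, [0.85, 0.9, 0.1, 0.1]),
        (3, [0.8, 0.7, 0.3, 0.1]),
        (4, [0.1, 0.2, 0.9, 0.1]),
        (5, [0.15, 0.1, 0.85, 0.2]),
        (6, [0.2, 0.15, 0.8, 0.3]),
        (7, [0.1, 0.1, 0.2, 0.9]),
        (8, [0.05, 0.15, 0.15, 0.95]),
        (9, [0.1, 0.1, 0.4, 0.85]),
        (10, [0.2, 0.1, 0.6, 0.6]),
    ];
    let query = [0.88, 0.82, 0.1, 0.08];
    for (id, sim) in top_similar(&rows, &query, 5) {
        println!("id {:2}  sim {:.3}", id, sim);
    }
}
```

Running this reproduces the ranking: the three Rust tutorials lead by a wide margin, followed by articles 6 and 10 with much lower scores.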
Euclidean Distance
Cosine similarity measures the angle between vectors. Euclidean distance measures the straight-line distance between their endpoints. Let us run the same query using euclidean distance, where smaller values mean more similar:
SELECT
id,
title,
category,
EUCLIDEAN_DISTANCE(embedding, ARRAY[0.88, 0.82, 0.1, 0.08]) AS dist
FROM articles
ORDER BY dist ASC
LIMIT 5;
The top three are the same Rust tutorials, and the ordering within them is identical. On this tiny dataset, positions 4 and 5 also match the cosine ranking ("Graph Databases Explained", then "Full-Text Search vs Vectors"), but the two metrics do not agree in general: euclidean distance considers vector magnitude (length) as well as direction, so vectors that point the same way but differ in length can rank differently under each measure.
When to use which: Cosine similarity is the standard choice for text embeddings because embedding models typically produce normalized vectors (all roughly the same length). When vectors have varying magnitudes — for example, TF-IDF vectors where document length affects the scale — euclidean distance may be more appropriate. When in doubt, start with cosine similarity.
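The magnitude difference is easy to demonstrate. Below, two vectors point in exactly the same direction but one is 4.5 times longer: cosine similarity against the query cannot tell them apart, while euclidean distance strongly prefers the longer one (a plain-Rust illustration, not TeideDB internals):

```rust
// Direction vs magnitude: same angle, very different lengths.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    dot / (a.iter().map(|x| x * x).sum::<f32>().sqrt()
        * b.iter().map(|x| x * x).sum::<f32>().sqrt())
}

fn euclidean(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

fn main() {
    let query = [0.88, 0.82, 0.1, 0.08];
    let short = [0.2, 0.2, 0.0, 0.0];
    let long = [0.9, 0.9, 0.0, 0.0]; // same direction as `short`, 4.5x the length
    // Cosine: identical scores. Euclidean: `long` is much closer to the query.
    println!("cosine    short {:.3}  long {:.3}", cosine(&query, &short), cosine(&query, &long));
    println!("euclidean short {:.3}  long {:.3}", euclidean(&query, &short), euclidean(&query, &long));
}
```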
The KNN Optimization
Both queries above follow the same pattern: compute a similarity or distance score for every row, sort, and take the top K. For a table with 10 rows, that is fine. For a table with 10 million rows, computing all 10 million similarities just to return 5 results is wasteful.
TeideDB detects this pattern automatically. When the query planner sees:
SELECT ... FROM t
ORDER BY COSINE_SIMILARITY(emb_col, ARRAY[...]) DESC
LIMIT k;
it replaces the full scan + sort + limit pipeline with a single KNN (K-Nearest Neighbors) kernel that avoids materializing all scores. You do not need to call a special function or use non-standard syntax. You write natural SQL, and the optimizer does the rest.
This is a deliberate design choice. Other databases require vendor-specific operators like <-> or special function calls like vector_search(). In TeideDB, vector search is just SQL. If you know how to write ORDER BY ... LIMIT, you already know how to do KNN.
The same optimization applies to EUCLIDEAN_DISTANCE with ORDER BY ... ASC LIMIT k. If an HNSW index exists on the column, the KNN kernel will use it transparently — you still write the same SQL.
Scaling with HNSW Indexes
The KNN kernel is exact: it always returns the true K nearest neighbors. For tables with millions of rows, even an optimized linear scan can be slow. HNSW (Hierarchical Navigable Small World) indexes trade a small amount of accuracy for dramatically faster lookups.
Think of HNSW as a network of shortcuts through your data. At the top level, a few widely-spaced nodes provide long-range jumps. Each level down adds more nodes with shorter-range connections. A search starts at the top, takes a few long jumps to get close to the target, then descends through finer levels for precision. It is like navigating a city: first take the highway to the right neighborhood, then local streets to the right block, then walk to the door.
Create an HNSW index on the embedding column:
CREATE VECTOR INDEX idx_articles_emb
ON articles(embedding)
USING HNSW(M = 16, ef_construction = 200);
The two parameters control the index structure:
- M (default: 16) — How many connections each node maintains. Higher M means more shortcuts and better recall, but larger memory footprint and slower index builds. Values between 12 and 48 work well for most datasets.
- ef_construction (default: 200) — How hard the builder searches for good connections during index construction. Higher values produce a better-quality index but take longer to build. This only affects build time, not query time.
Once the index exists, your queries do not change at all:
-- Exact same query as before; HNSW index used transparently
SELECT
id,
title,
COSINE_SIMILARITY(embedding, ARRAY[0.88, 0.82, 0.1, 0.08]) AS sim
FROM articles
ORDER BY sim DESC
LIMIT 5;
The planner detects the HNSW index and uses it automatically. For small tables like ours, you will not notice a difference. For tables with hundreds of thousands of rows or more, the speedup is significant — HNSW searches are typically O(log N) rather than O(N).
For our 10-row table, the index is overkill. But consider a production knowledge base with 500,000 articles, each with a 768-dimensional embedding. Without an index, every similarity query scans all 500,000 vectors. With HNSW, the same query touches a few hundred nodes and returns in milliseconds. The trade-off is a small chance of missing the true nearest neighbor — but in practice, HNSW recall is above 95% with reasonable parameter choices.
To remove an index:
DROP VECTOR INDEX idx_articles_emb;
-- Or, to avoid an error if the index does not exist:
DROP VECTOR INDEX IF EXISTS idx_articles_emb;
Limitations You Should Know
TeideDB's vector search has real constraints that you should understand before designing your application. These are not bugs — they are consequences of how the C engine's columnar kernels interact with the flat N×D memory layout of embedding columns.
Restricted operations on tables with embedding columns:
- SELECT ... WHERE ... is not supported (except via the KNN pattern described above). The filter kernel operates element-wise on flat F32 arrays and is not dimension-aware.
- SELECT ... ORDER BY ... LIMIT ... on non-similarity expressions is not supported, for the same reason.
- UPDATE is not supported.
- DELETE ... WHERE ... is not supported. DELETE FROM articles (without WHERE) works — it truncates the table while preserving embedding column metadata.
- INSERT INTO ... VALUES ... is not supported for embedding data. Use INSERT INTO ... SELECT ... instead.
- GROUP BY and DISTINCT are not supported on embedding tables.
DML operations invalidate vector indexes. If you INSERT into or DELETE from a table that has an HNSW index, the index is automatically dropped because the underlying column data is reallocated. You must recreate the index after bulk modifications:
-- After inserting new articles and their embeddings:
CREATE VECTOR INDEX idx_articles_emb
ON articles(embedding)
USING HNSW(M = 16, ef_construction = 200);
Practical workarounds:
- For filtering, use the KNN pattern to retrieve candidates, then filter in your application layer.
- For bulk updates, truncate the table, re-insert all data, add the embedding column fresh, and rebuild the index.
- Keep metadata columns (id, title, category) in a separate table that you JOIN against after the vector search, if you need full SQL flexibility on those columns.
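The first workaround can be sketched in plain Rust. Assume the KNN query has already returned an over-fetched candidate list (say LIMIT 20 when you need 5) together with each row's metadata; the category predicate then runs in the application. The Candidate struct and helper below are illustrative, not part of TeideDB's API:

```rust
// Application-layer filtering: over-fetch KNN candidates, then apply the
// predicate that a WHERE clause would normally handle.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: u32,
    category: String,
    sim: f32,
}

fn filter_by_category(candidates: Vec<Candidate>, wanted: &str, limit: usize) -> Vec<Candidate> {
    candidates
        .into_iter()
        .filter(|c| c.category == wanted) // the predicate WHERE cannot express here
        .take(limit)
        .collect()
}

fn main() {
    // Pretend these came back from an over-fetched KNN query, ordered by sim.
    let candidates = vec![
        Candidate { id: 1, category: "tutorial".into(), sim: 0.99 },
        Candidate { id: 2, category: "tutorial".into(), sim: 0.98 },
        Candidate { id: 6, category: "database".into(), sim: 0.37 },
        Candidate { id: 10, category: "search".into(), sim: 0.34 },
    ];
    for c in filter_by_category(candidates, "tutorial", 5) {
        println!("{:?}", c);
    }
}
```

Over-fetch by a comfortable margin (several times your target result count), since filtering discards rows after the similarity ranking has already been cut to k.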
Challenges
Compute the centroid (the element-wise average) of the three database articles' embeddings, then use it as the query vector in a COSINE_SIMILARITY search with LIMIT 5. Do all three database articles appear at the top? Which non-database article is closest to the centroid, and does that make intuitive sense given its topic? This technique — searching by the centroid of a cluster — is how "more like these" features work in production.
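A cluster centroid is just the element-wise average of its vectors. A plain-Rust sketch for the three database articles (compute it here, then paste the result into your ARRAY[...] query vector):

```rust
// Element-wise average of a set of 4-d embeddings: the cluster centroid.
fn centroid(vectors: &[[f32; 4]]) -> [f32; 4] {
    let mut c = [0.0f32; 4];
    for v in vectors {
        for (ci, vi) in c.iter_mut().zip(v.iter()) {
            *ci += *vi;
        }
    }
    for ci in c.iter_mut() {
        *ci /= vectors.len() as f32;
    }
    c
}

fn main() {
    let database_articles = [
        [0.1, 0.2, 0.9, 0.1],   // 4: Introduction to SQL
        [0.15, 0.1, 0.85, 0.2], // 5: Query Optimization Techniques
        [0.2, 0.15, 0.8, 0.3],  // 6: Graph Databases Explained
    ];
    println!("{:?}", centroid(&database_articles));
}
```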
What's Next
You now know how to store embeddings in TeideDB, query them with cosine similarity and euclidean distance, rely on the automatic KNN optimization for efficient top-K retrieval, and scale to large datasets with HNSW indexes. You also know the current limitations of embedding tables and practical strategies for working around them.
In a real application, the embedding vectors come from a model — not hand-crafted arrays. The Rust Integration guide shows how to programmatically generate embeddings, manage the Session lifecycle, and build end-to-end search pipelines in Rust. For the full function reference, see the Vector Search SQL Reference.