Embedding TeideDB in Rust Applications
TeideDB is a library, not just a CLI. Embed a columnar analytics engine in your Rust application with zero external dependencies and zero setup.
You are building an IoT monitoring service in Rust. Sensors push temperature, humidity, and pressure readings every few seconds. Your API needs to answer queries like "average temperature over the last hour, grouped by sensor" -- fast, without a network round-trip to a separate database process. You need the database inside your binary.
This guide covers the full integration path. By the end, you will know how to:
- Add TeideDB as a Rust dependency and configure feature flags
- Build computation graphs with the low-level Graph API
- Execute SQL via the Session API and read typed results
- Store and search vector embeddings with HNSW indexes
- Expose your embedded database over the PostgreSQL wire protocol
- Handle the !Send + !Sync constraint in async code
Estimated time: about 12 minutes.
The Problem: Analytics Inside Your Process
SQLite solves the embedded case for row-oriented workloads, but columnar aggregations
over millions of rows are not its strength. DuckDB is columnar but exposes a C++ API --
the FFI boundary is wide and the build system is heavy. TeideDB takes a different approach:
the core engine is C17 (~15,000 lines), vendored directly into the Rust crate, and compiled
from source by build.rs. No system library, no dynamic linking, no separate
process. cargo build gives you a statically linked columnar engine with a
morsel-driven executor, a cost-based optimizer, and SQL/PGQ graph queries -- all callable
from safe Rust.
Adding the Dependency
TeideDB is not yet on crates.io. Add it as a path or git dependency:
[dependencies]
# From a local checkout:
teide = { path = "../teide-rs" }
# Or from git:
teide = { git = "https://github.com/TeideDB/teide-rs.git" }
The crate exposes three feature flags:
| Feature | What it adds | Default |
|---|---|---|
| (default) | Core engine + SQL session API | Yes |
| cli | Interactive REPL binary | No |
| server | PgWire server binary | No |
For embedding in your own application, the default features are all you need. Leave
cli and server off to avoid pulling in their dependencies
(rustyline, pgwire, tokio).
You need a C17-capable compiler (gcc >= 7, clang >= 5, or MSVC 2019+). The
build.rs script compiles the vendored C source tree at
vendor/teide/ using the cc crate. On most systems this
just works. If it does not, check that cc can find your C compiler.
The Graph API: Low-Level Power
At the lowest level, TeideDB exposes a lazy DAG of operations. You build a computation
graph, then call execute(). Nothing touches data until that final call --
the optimizer (constant folding, predicate pushdown, CSE, fusion, DCE) runs first, then
the morsel-driven executor processes the optimized plan.
use teide::{Context, Table};
fn main() -> Result<(), teide::Error> {
let ctx = Context::new()?;
let table = Table::from_vecs(
&ctx,
&["sensor_id", "temperature"],
&[vec![1i64, 2, 3, 1, 2, 3]], // i64 columns
&[vec![22.1, 23.5, 19.8, 22.4, 24.1, 20.0]], // f64 columns
)?;
let mut g = ctx.graph(&table)?;
let sensor = g.scan("sensor_id")?;
let temp = g.scan("temperature")?;
// Celsius to Fahrenheit: temp * 1.8 + 32
let fahrenheit = g.add(g.mul(temp, g.const_f64(1.8)?)?, g.const_f64(32.0)?)?;
// Filter: only sensor_id == 1
let mask = g.eq(sensor, g.const_i64(1)?)?;
let filtered = g.filter(fahrenheit, mask)?;
let result = g.execute(filtered)?; // optimizer + executor run here
for row in 0..result.nrows() {
println!("row {}: {:.1} F", row, result.read_f64(0, row));
}
Ok(())
}
Key points about the Graph API:
- Column is Copy -- a non-owning handle (*mut td_op_t) into the DAG. Pass it to multiple operations freely.
- Operations return new nodes. g.mul() does not mutate its inputs -- it creates a new DAG node. This enables CSE in the optimizer.
- execute() takes the root node and traces dependencies backward, pruning unreachable nodes before execution.
The Session API: SQL Strings
For most applications, you do not need the Graph API at all. The Session
type wraps a Context, maintains a table registry, and accepts SQL strings.
It parses, plans, optimizes, and executes in a single call.
use teide::sql::{Session, ExecResult};
fn main() -> Result<(), teide::SqlError> {
let mut session = Session::new()?;
session.execute("CREATE TABLE sensors (id INTEGER, location VARCHAR, installed DATE)")?;
session.execute(
"INSERT INTO sensors VALUES
(1, 'roof', '2024-01-15'), (2, 'basement', '2024-03-22'),
(3, 'roof', '2024-06-01'), (4, 'garage', '2024-07-10')")?;
match session.execute(
"SELECT location, COUNT(*) AS cnt FROM sensors GROUP BY location ORDER BY cnt DESC")?
{
ExecResult::Query(result) => {
for row in 0..result.nrows {
let location = result.table.read_str(0, row);
let count = result.table.read_i64(1, row);
println!("{} => {}", location, count);
}
}
ExecResult::Ddl(msg) => println!("{}", msg),
}
Ok(())
}
ExecResult::Query(SqlResult) gives you the result Table,
column names, row count, and embedding metadata. ExecResult::Ddl(String)
returns a status message for DDL/DML. Tables persist in the session's registry until
dropped or the session itself is dropped.
Reading Results
The Table type provides typed accessors for each column type and format
helpers for temporal values. Use col_type() to dispatch on the type code:
fn print_result(result: &teide::sql::SqlResult) {
for row in 0..result.nrows {
for col in 0..result.columns.len() {
let cell = match result.table.col_type(col) {
6 => format!("{}", result.table.read_i64(col, row)), // TD_I64
7 => format!("{:.2}", result.table.read_f64(col, row)), // TD_F64
9 => { // TD_DATE
let days = result.table.read_i64(col, row);
teide::Table::format_date(days as i32)
}
10 => { // TD_TIME
let ms = result.table.read_i64(col, row);
teide::Table::format_time(ms)
}
11 => { // TD_TIMESTAMP
let us = result.table.read_i64(col, row);
teide::Table::format_timestamp(us)
}
20 => result.table.read_str(col, row).to_string(), // TD_SYM
_ => "?".to_string(),
};
print!("{:>15}", cell);
}
println!();
}
}
Key type codes: TD_I64 = 6, TD_F64 = 7,
TD_F32 = 8, TD_DATE = 9, TD_TIME = 10,
TD_TIMESTAMP = 11, TD_SYM = 20 (string/symbol).
Use table.nrows() and table.ncols() to get dimensions.
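When dispatch sites multiply, it helps to name the codes in one place. The helper below is a hypothetical convenience (type_name is not part of the teide API); the codes themselves come from the list above:

```rust
// Hypothetical helper: map TeideDB column type codes to readable names.
// The numeric codes are those listed above; type_name() itself is an
// illustration, not part of the teide crate.
fn type_name(code: i32) -> &'static str {
    match code {
        6 => "TD_I64",
        7 => "TD_F64",
        8 => "TD_F32",
        9 => "TD_DATE",
        10 => "TD_TIME",
        11 => "TD_TIMESTAMP",
        20 => "TD_SYM",
        _ => "unknown",
    }
}

fn main() {
    // Handy in error messages when a dispatch arm is missing:
    println!("unhandled column type: {}", type_name(8));
}
```

Centralizing the mapping keeps match arms like the `print_result` example above self-documenting.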
Working with Embeddings
TeideDB stores vector embeddings as flat TD_F32 columns -- N rows of
D-dimensional vectors packed contiguously. This gives you columnar compression benefits
and zero-copy access from the C engine. On top of that, you can build HNSW indexes for
approximate nearest-neighbor search.
use teide::sql::{Session, ExecResult};
fn main() -> Result<(), teide::SqlError> {
let mut session = Session::new()?;
session.execute("CREATE TABLE docs (id INTEGER, title VARCHAR)")?;
session.execute(
"INSERT INTO docs VALUES (1, 'Rust ownership'), (2, 'Graph databases'),
(3, 'Vector search'), (4, 'Columnar storage')")?;
// Add a 4-dimensional embedding column (flat f32 array: 4 rows * 4 dims)
let embeddings: Vec<f32> = vec![
0.9, 0.1, 0.0, 0.2, // doc 1
0.1, 0.8, 0.3, 0.1, // doc 2
0.2, 0.3, 0.9, 0.1, // doc 3
0.7, 0.1, 0.1, 0.8, // doc 4
];
session.add_embedding_column("docs", "embedding", 4, &embeddings)?;
// Query with cosine similarity
match session.execute(
"SELECT title, COSINE_SIMILARITY(embedding, ARRAY[0.85, 0.15, 0.05, 0.25]) AS sim
FROM docs")?
{
ExecResult::Query(r) => {
for row in 0..r.nrows {
println!("{:<20} {:.4}", r.table.read_str(0, row), r.table.read_f64(1, row));
}
}
_ => {}
}
Ok(())
}
For large collections, linear scan is too slow. Build an HNSW index for approximate nearest-neighbor search in logarithmic time:
use teide::HnswIndex;
// Build an HNSW index on the embedding column
// Parameters: table, column_index, dimension, M (neighbors), ef_construction
let stored = session.get_table("docs").unwrap();
let index = HnswIndex::build(&stored.table, 2, 4, 16, 200)?;
// Search: find 2 nearest neighbors
let query = vec![0.85f32, 0.15, 0.05, 0.25];
let results = index.search(&query, 2, 50)?; // k=2, ef_search=50
for (row_id, distance) in &results {
println!("row {} distance {:.4}", row_id, distance);
}
// Persist the index to disk
index.save("docs_embedding.hnsw")?;
// Later, reload it
let loaded = HnswIndex::load("docs_embedding.hnsw")?;
The HNSW index is also available via SQL DDL:
CREATE VECTOR INDEX docs_emb_idx ON docs(embedding) USING HNSW(M=16, ef_construction=200);
-- Later:
DROP VECTOR INDEX docs_emb_idx;
The PgWire Server
Sometimes you need external tools -- psql, DBeaver, Python -- to query data your Rust service manages. TeideDB includes a PostgreSQL wire protocol server.
# Start the server on port 5433
cargo run --features server -- --port 5433
# Connect from another terminal
psql -h 127.0.0.1 -p 5433
The !Send constraint dictates the thread model: each connection gets its own
OS thread with a dedicated Session, bridged to the async pgwire handler via
channels:
// Simplified view of the server thread model:
//
// tokio runtime (async) OS thread (sync)
// +---------------------+ +-------------------+
// | pgwire connection | ----> | Session |
// | handler | <---- | (owns Context, |
// | | chan | table registry) |
// +---------------------+ +-------------------+
//
// Each connection = one OS thread = one Session.
// The Context never crosses thread boundaries.
Each connection has its own isolated table namespace. Tables created in one connection are not visible to others.
Critical Constraints
Context is !Send + !Sync
Context and Session cannot cross thread boundaries. The type
system enforces this via PhantomData<*mut ()> -- trying to move a
Context into tokio::spawn is a compile error.
The correct pattern for async applications: spawn a dedicated OS thread that owns the
Session, and communicate via channels.
use std::sync::mpsc;
use std::thread;
use teide::sql::{Session, ExecResult};
fn spawn_db_thread() -> mpsc::Sender<(String, mpsc::Sender<String>)> {
let (tx, rx) = mpsc::channel();
thread::spawn(move || {
let mut session = Session::new().expect("engine init");
for (sql, reply) in rx { // (String, mpsc::Sender<String>) pairs
let msg = match session.execute(&sql) {
Ok(ExecResult::Query(r)) => format!("{} rows", r.nrows),
Ok(ExecResult::Ddl(msg)) => msg,
Err(e) => format!("error: {e}"),
};
let _ = reply.send(msg);
}
});
tx
}
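The request/reply shape of this bridge can be exercised without the engine at all. In this std-only sketch, a mock worker stands in for the Session (mock_execute is a placeholder, not a teide API), so the channel plumbing can be tested in isolation:

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder for Session::execute -- just echoes the "SQL" back.
fn mock_execute(sql: &str) -> String {
    format!("ok: {}", sql.to_uppercase())
}

// Same shape as spawn_db_thread above, minus the engine:
// the worker thread owns the (mock) session and serves requests.
fn spawn_worker() -> mpsc::Sender<(String, mpsc::Sender<String>)> {
    let (tx, rx) = mpsc::channel::<(String, mpsc::Sender<String>)>();
    thread::spawn(move || {
        for (sql, reply) in rx {
            let _ = reply.send(mock_execute(&sql));
        }
    });
    tx
}

fn main() {
    let tx = spawn_worker();
    // Each request carries its own reply channel.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(("select 1".to_string(), reply_tx)).unwrap();
    let answer = reply_rx.recv().unwrap();
    println!("{answer}");
}
```

The per-request reply channel is what makes this pattern work from async code: the tokio side can wrap `recv()` in `spawn_blocking` (or use a oneshot channel) without the Session ever leaving its thread.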
ENGINE_LOCK in Tests
Rust's test harness runs tests in parallel. Since the C engine's global state cannot be initialized/destroyed concurrently, you must serialize access with a mutex:
use std::sync::Mutex;
use teide::Context;
static ENGINE_LOCK: Mutex<()> = Mutex::new(());
#[test]
fn test_sensor_query() {
let _guard = ENGINE_LOCK.lock().unwrap();
let ctx = Context::new().unwrap();
// ... your test code ...
// Context drops here, but the engine singleton persists
// until all Arc references are gone.
}
#[test]
fn test_another_query() {
let _guard = ENGINE_LOCK.lock().unwrap();
let ctx = Context::new().unwrap();
// Safe: ENGINE_LOCK ensures this doesn't race with the test above.
}
Under the hood, the engine is managed via OnceLock<Mutex<Weak<EngineGuard>>>.
Multiple Context handles share one Arc<EngineGuard>; the engine
tears down only when the last Arc drops. These constraints are not bugs --
they are the price of zero-copy access to the C engine's thread-local memory arenas.
In production, you typically have one Session per thread and no contention.
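The lifecycle described above can be sketched in plain std. EngineGuard here is a stand-in (the real guard would call into the C engine in `new()` and `drop()`); the point is the Weak-in-OnceLock shape, which lets handles share one live engine while allowing full teardown and re-initialization once the last handle is gone:

```rust
use std::sync::{Arc, Mutex, OnceLock, Weak};

// Stand-in for the real engine guard: the actual implementation would
// initialize the C engine in new() and destroy it in drop().
struct EngineGuard;
impl EngineGuard {
    fn new() -> Self { EngineGuard }
}
impl Drop for EngineGuard {
    fn drop(&mut self) { /* engine teardown would run here */ }
}

// Weak, not Arc: the static never keeps the engine alive by itself.
static ENGINE: OnceLock<Mutex<Weak<EngineGuard>>> = OnceLock::new();

fn acquire_engine() -> Arc<EngineGuard> {
    let slot = ENGINE.get_or_init(|| Mutex::new(Weak::new()));
    let mut weak = slot.lock().unwrap();
    if let Some(live) = weak.upgrade() {
        live // engine already running: share it
    } else {
        let fresh = Arc::new(EngineGuard::new());
        *weak = Arc::downgrade(&fresh);
        fresh
    }
}

fn main() {
    let a = acquire_engine();
    let b = acquire_engine();
    assert!(Arc::ptr_eq(&a, &b)); // both handles share one engine
    drop(a);
    drop(b); // last Arc gone: Drop runs, next acquire re-initializes
    let c = acquire_engine();
    assert_eq!(Arc::strong_count(&c), 1);
}
```

Because the static holds only a Weak, dropping every handle actually tears the engine down; a later `acquire_engine()` builds a fresh one rather than resurrecting stale state.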
Putting It All Together
Here is a complete example: an IoT monitoring module that wraps a Session
in a domain-specific struct, ingests sensor readings, and exposes typed query methods.
use teide::sql::{Session, ExecResult};
struct SensorMonitor { session: Session }
impl SensorMonitor {
fn new() -> Result<Self, teide::SqlError> {
let mut session = Session::new()?;
session.execute(
"CREATE TABLE readings (
sensor_id INTEGER, temperature DOUBLE,
humidity DOUBLE, ts TIMESTAMP)"
)?;
Ok(SensorMonitor { session })
}
fn ingest(&mut self, id: i64, temp: f64, hum: f64, ts: &str)
-> Result<(), teide::SqlError>
{
self.session.execute(&format!(
"INSERT INTO readings VALUES ({id}, {temp}, {hum}, '{ts}')"))?;
Ok(())
}
fn avg_by_sensor(&mut self) -> Result<Vec<(i64, f64, f64)>, teide::SqlError> {
match self.session.execute(
"SELECT sensor_id, AVG(temperature), AVG(humidity)
FROM readings GROUP BY sensor_id ORDER BY sensor_id")?
{
ExecResult::Query(r) => Ok((0..r.nrows).map(|i| (
r.table.read_i64(0, i), r.table.read_f64(1, i), r.table.read_f64(2, i)
)).collect()),
_ => Ok(vec![]),
}
}
}
fn main() -> Result<(), teide::SqlError> {
let mut m = SensorMonitor::new()?;
m.ingest(1, 22.5, 45.0, "2024-08-01 10:00:00")?;
m.ingest(1, 23.1, 44.2, "2024-08-01 10:05:00")?;
m.ingest(2, 19.8, 62.1, "2024-08-01 10:00:00")?;
m.ingest(2, 20.1, 61.5, "2024-08-01 10:05:00")?;
for (sensor, temp, hum) in m.avg_by_sensor()? {
println!("Sensor {sensor}: avg temp {temp:.1}, avg humidity {hum:.1}");
}
Ok(())
}
Challenges
- Batched ingestion: have several producer threads send sensor readings over an mpsc channel to a single consumer thread that owns a Session. The consumer should batch inserts (collect N readings, then execute a single multi-row INSERT). Add a query thread that periodically requests aggregated results via a separate channel. Measure throughput: how many readings per second can you sustain with batch sizes of 1, 10, 100, and 1000?
- Vector search over HTTP: build a small web service (using axum or actix-web) that accepts a JSON body with a query vector and returns the top-K nearest neighbors from a TeideDB table with an HNSW index. The handler must not own the Session directly (it runs on tokio). Use the channel pattern from the "Critical Constraints" section to bridge async and sync worlds. Add an endpoint that inserts new documents with embeddings, and handle the index rebuild that insertion triggers.
What's Next
- SQL Reference -- complete syntax documentation for every statement and function TeideDB supports.
- SQL/PGQ Reference -- property graph DDL, MATCH patterns, and graph algorithm functions.
- Vector Search Reference -- HNSW index parameters, similarity functions, and DML restrictions.
- Graph Queries Guide -- if your embedded use case involves relationship data, this guide covers SQL/PGQ from the ground up.
- Vector Search Guide -- deeper coverage of embedding workflows, including RAG patterns and hybrid search.