Storage & Symbols
Teide's on-disk formats: how splayed tables, partitioned tables, and the symbol system work. Essential reading if you're coming from kdb+/q or working with persistent columnar data.
What Are Symbols?
Symbols are interned strings. Instead of storing the full text of every string value, Teide maintains a global lookup table that maps each unique string to a small integer ID. Column data then stores these compact IDs instead of the raw text.
This is the same concept as kdb+'s symbol type, or dictionary encoding in columnar databases.
Why symbols?
| Benefit | Explanation |
|---|---|
| Memory efficiency | A million rows with 50 unique strings store 50 strings + 1M small integers, not 1M full strings |
| Fast comparison | Comparing two symbols is a single integer comparison, not a byte-by-byte string compare |
| Fast GROUP BY / JOIN | Hash tables on integer keys are much faster than on variable-length strings |
| Cache friendly | Fixed-width integer columns pack tightly in CPU cache lines |
How it works
When Teide encounters the string 'AAPL' for the first time, it assigns it an ID (say, 0). The next unique string 'GOOG' gets 1, and so on. If 'AAPL' appears again, it reuses ID 0.
Global symbol table:
0 → "AAPL"
1 → "GOOG"
2 → "MSFT"
Column data (symbol encoded):
[0, 1, 0, 2, 0, 1, 2, 0]
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
AAPL GOOG AAPL MSFT AAPL GOOG MSFT AAPL
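The assignment rule above can be sketched with a hash map plus a vector. This is a minimal illustration, not Teide's implementation: `SymbolTable`, `intern`, and `resolve` are hypothetical names, and a real global table would also need to be thread-safe.

```rust
use std::collections::HashMap;

/// Hypothetical interner for illustration only.
#[derive(Default)]
pub struct SymbolTable {
    ids: HashMap<String, i64>, // string -> ID
    strings: Vec<String>,      // ID -> string (the position is the ID)
}

impl SymbolTable {
    /// Return the existing ID for `s`, or assign the next free one.
    pub fn intern(&mut self, s: &str) -> i64 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.strings.len() as i64;
        self.strings.push(s.to_owned());
        self.ids.insert(s.to_owned(), id);
        id
    }

    /// Resolve an ID back to its text, if that ID has been assigned.
    pub fn resolve(&self, id: i64) -> Option<&str> {
        if id < 0 {
            return None;
        }
        self.strings.get(id as usize).map(String::as_str)
    }
}
```

Interning the eight values from the diagram in order yields the ID column shown there, and `resolve` maps each ID back to its ticker.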
In SQL, symbols behave exactly like VARCHAR — you query them with string literals and they display as text. The encoding is transparent:
-- Symbols look and act like strings in SQL
SELECT * FROM trades WHERE symbol = 'AAPL';
SELECT DISTINCT symbol FROM trades ORDER BY symbol;
Adaptive width encoding
Symbol IDs in column files use the smallest integer width that can represent all IDs in the symbol table:
| Width | Max symbols | Bytes per value |
|---|---|---|
| W8 | 256 | 1 |
| W16 | 65,536 | 2 |
| W32 | ~4 billion | 4 |
| W64 | unlimited | 8 |
Most real-world datasets have far fewer than 256 unique strings per column, so symbols typically use just 1 byte per value.
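As a rough sketch of the width rule, the function below picks the narrowest width for a given symbol count using the thresholds in the table. `SymWidth` and its methods are hypothetical names chosen to mirror the table; Teide's actual selection logic may differ.

```rust
/// Hypothetical width tags mirroring the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymWidth {
    W8,
    W16,
    W32,
    W64,
}

impl SymWidth {
    /// Smallest width that can hold every ID in 0..count.
    pub fn for_symbol_count(count: u64) -> SymWidth {
        if count <= (1u64 << 8) {
            SymWidth::W8
        } else if count <= (1u64 << 16) {
            SymWidth::W16
        } else if count <= (1u64 << 32) {
            SymWidth::W32
        } else {
            SymWidth::W64
        }
    }

    /// Storage cost of one encoded value, in bytes.
    pub fn bytes_per_value(self) -> usize {
        match self {
            SymWidth::W8 => 1,
            SymWidth::W16 => 2,
            SymWidth::W32 => 4,
            SymWidth::W64 => 8,
        }
    }
}
```

With 50 distinct symbols this selects W8, so a million-row column needs about 1 MB for its IDs.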
The sym File
The symbol table can be persisted to a binary file. This is required for splayed and partitioned tables so that symbol IDs in column files can be resolved back to strings.
Binary format
Bytes 0-3: magic (0x4D595354 = "TSYM" little-endian)
Bytes 4-7: count (uint32, number of symbols)
For each symbol i = 0..count-1:
[4 bytes] length of string i (uint32)
[N bytes] UTF-8 string data
Symbol IDs correspond directly to their position in the file: the first string is ID 0, the second is ID 1, etc.
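To make the layout concrete, here is a hedged sketch of an encoder and decoder for this format. It assumes the count and length fields are little-endian like the magic number, which the layout above does not state explicitly. `encode_sym` and `decode_sym` are illustrative names, not part of the Teide API.

```rust
const MAGIC: u32 = 0x4D59_5354; // the bytes "TSYM" when written little-endian

/// Serialize symbols in ID order (assumption: all integers little-endian).
pub fn encode_sym(symbols: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC.to_le_bytes());
    out.extend_from_slice(&(symbols.len() as u32).to_le_bytes());
    for s in symbols {
        out.extend_from_slice(&(s.len() as u32).to_le_bytes());
        out.extend_from_slice(s.as_bytes());
    }
    out
}

/// Read a little-endian u32 at `pos`, or None if the buffer is too short.
fn read_u32(buf: &[u8], pos: usize) -> Option<u32> {
    let b = buf.get(pos..pos.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Decode a sym buffer; the index in the returned Vec is the symbol ID.
/// Returns None on a bad magic number, truncation, or invalid UTF-8.
pub fn decode_sym(buf: &[u8]) -> Option<Vec<String>> {
    if read_u32(buf, 0)? != MAGIC {
        return None;
    }
    let count = read_u32(buf, 4)? as usize;
    let mut pos = 8;
    let mut symbols = Vec::new();
    for _ in 0..count {
        let len = read_u32(buf, pos)? as usize;
        pos += 4;
        let bytes = buf.get(pos..pos.checked_add(len)?)?;
        symbols.push(String::from_utf8(bytes.to_vec()).ok()?);
        pos += len;
    }
    Some(symbols)
}
```

Because IDs are positional, appending new symbols to the end of the file keeps every previously written ID valid.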
Rust API
// Intern a string (returns stable i64 ID)
let id = teide::sym_intern("AAPL")?;
// Intern the same string again — returns the same ID
let id2 = teide::sym_intern("AAPL")?;
assert_eq!(id, id2);
Splayed Tables
A splayed table stores each column as a separate binary file within a directory. This is the fundamental on-disk format in Teide, borrowed from kdb+.
Directory layout
/data/tables/trades/
.d -- schema file (vector of column name symbol IDs)
ts -- column: timestamp vector
symbol -- column: symbol vector
price -- column: f64 vector
qty -- column: i64 vector
Schema file (.d)
The .d file is a binary vector of i64 symbol IDs, one per column. Each ID maps to a column name in the symbol table. The order defines the column order of the table.
Column file format
Each column file is a raw binary vector with a 32-byte block header:
Bytes 0-15: nullmap (inline bitmap for null tracking)
Byte 16: mmod (memory mode: 0=heap, 1=mmap)
Byte 17: order (block size class)
Byte 18: type (type tag: 6=i64, 7=f64, 20=sym, ...)
Byte 19: attrs (flags: has-nulls, external nullmap, ...)
Bytes 20-23: rc (reference count, 0 on disk)
Bytes 24-31: len (number of elements, uint64)
Bytes 32+: data (len * element_size bytes)
[optional]: external nullmap ((len+7)/8 bytes, if flagged)
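The header layout can be read with plain byte slicing. The sketch below is an illustration under two assumptions that the layout does not spell out: the `rc` and `len` fields are little-endian, and fixed-width element sizes are supplied by the caller. `ColumnHeader`, `parse_header`, and the helper functions are hypothetical names.

```rust
pub const HEADER_SIZE: usize = 32;

/// Field names follow the layout above; the struct itself is hypothetical.
#[derive(Debug, PartialEq)]
pub struct ColumnHeader {
    pub nullmap: [u8; 16], // inline null bitmap
    pub mmod: u8,          // memory mode
    pub order: u8,         // block size class
    pub type_tag: u8,      // e.g. 6 = i64, 7 = f64, 20 = sym
    pub attrs: u8,         // flags
    pub rc: u32,           // reference count, 0 on disk
    pub len: u64,          // number of elements
}

/// Parse the first 32 bytes of a column file (assumes little-endian integers).
pub fn parse_header(buf: &[u8]) -> Option<ColumnHeader> {
    if buf.len() < HEADER_SIZE {
        return None;
    }
    let mut nullmap = [0u8; 16];
    nullmap.copy_from_slice(&buf[0..16]);
    let mut rc = [0u8; 4];
    rc.copy_from_slice(&buf[20..24]);
    let mut len = [0u8; 8];
    len.copy_from_slice(&buf[24..32]);
    Some(ColumnHeader {
        nullmap,
        mmod: buf[16],
        order: buf[17],
        type_tag: buf[18],
        attrs: buf[19],
        rc: u32::from_le_bytes(rc),
        len: u64::from_le_bytes(len),
    })
}

/// Size of the data region: len * element_size, or None on overflow.
pub fn data_bytes(len: u64, element_size: u64) -> Option<u64> {
    len.checked_mul(element_size)
}

/// Size of an external nullmap: one bit per element, rounded up to whole bytes.
pub fn external_nullmap_bytes(len: u64) -> u64 {
    len / 8 + (len % 8 != 0) as u64
}
```

For a 1,000-element f64 column this gives an 8,000-byte data region starting at byte 32, plus 125 bytes of external nullmap if that flag is set.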
Loading in SQL
-- Load a splayed table (auto-discovers column names from .d)
SELECT * FROM read_splayed('/data/tables/trades');
-- With an explicit symbol file path
SELECT * FROM read_splayed('/data/tables/trades', '/data/sym');
-- Persist as an in-memory table
CREATE TABLE trades AS SELECT * FROM read_splayed('/data/tables/trades');
Loading in Rust
let ctx = Context::new()?;
// Without shared symbol file
let table = ctx.read_splayed("/data/tables/trades", None)?;
// With shared symbol file
let table = ctx.read_splayed("/data/tables/trades", Some("/data/sym"))?;
println!("{} rows, {} cols", table.nrows(), table.ncols());
Zero-copy: Column files are memory-mapped. The OS pages data in on demand, so you can query datasets larger than available RAM without loading everything upfront.
Partitioned Tables
A partitioned table splits data across multiple directories, each representing a time period (typically a date). This is the standard layout for time-series data in kdb+ systems.
Directory layout
/data/marketdb/
sym -- shared symbol table (required)
2024.01.15/
trades/
.d -- schema
symbol -- column: sym
price -- column: f64
qty -- column: i64
2024.01.16/
trades/
.d
symbol
price
qty
2024.01.17/
trades/
.d
symbol
price
qty
Partition naming: Directory names must be dates in YYYY.MM.DD format (e.g., 2024.01.15) or integer keys. Teide auto-discovers all valid partition directories and loads them in sorted order.
How partitions are discovered
- Scan db_root for subdirectories whose names consist of digits and dots
- Validate date format: 10 characters, dots at positions 4 and 7, valid month (01-12) and day (01-31)
- Sort directories lexicographically for deterministic ordering
- Load each db_root/<partition>/<table_name> as a splayed table
- Load the shared symbol table from db_root/sym
- Concatenate all partition segments into a single logical table
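The validation and sorting steps can be sketched as pure functions over directory names. This covers only the date branch described in the list (integer keys are not handled), and `is_date_partition` and `select_partitions` are hypothetical names rather than Teide functions.

```rust
/// True if `name` matches YYYY.MM.DD: 10 characters, dots at byte
/// positions 4 and 7, digits elsewhere, month 01-12, day 01-31.
pub fn is_date_partition(name: &str) -> bool {
    let b = name.as_bytes();
    if b.len() != 10 || b[4] != b'.' || b[7] != b'.' {
        return false;
    }
    let digits_ok = b
        .iter()
        .enumerate()
        .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit());
    if !digits_ok {
        return false;
    }
    let month = (b[5] - b'0') * 10 + (b[6] - b'0');
    let day = (b[8] - b'0') * 10 + (b[9] - b'0');
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Keep valid date partitions and return them in lexicographic order,
/// which for zero-padded YYYY.MM.DD names is also chronological order.
pub fn select_partitions(names: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = names
        .iter()
        .copied()
        .filter(|n| is_date_partition(n))
        .map(str::to_owned)
        .collect();
    out.sort();
    out
}
```

Given the entries of /data/marketdb from the layout above, the `sym` file is filtered out and the three date directories come back in order.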
The virtual partition column (MAPCOMMON)
When you open a partitioned table, Teide automatically creates a virtual column containing the partition key (the date). This column doesn't exist on disk — it's synthesized at query time.
-- The partition date is available as a queryable column
SELECT date, COUNT(*) AS trades
FROM read_parted('/data/marketdb', 'trades')
GROUP BY date
ORDER BY date;
The virtual column's type depends on the partition directory names:
| Directory names | Virtual column type | Example |
|---|---|---|
| YYYY.MM.DD dates | DATE | 2024.01.15, 2024.01.16 |
| Integer keys | BIGINT | 1, 2, 3 |
| Other strings | VARCHAR (symbol) | us, eu, asia |
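One plausible way to synthesize such a column without storing it is to keep a single (key, row count) pair per partition. The sketch below assumes that representation purely for illustration; `VirtualDateColumn` and its methods are hypothetical and say nothing about how MAPCOMMON is actually implemented.

```rust
/// Hypothetical virtual column: one (partition key, row count) run per partition.
pub struct VirtualDateColumn {
    pub runs: Vec<(String, usize)>,
}

impl VirtualDateColumn {
    /// Partition key for a global row index, or None if out of range.
    pub fn value_at(&self, row: usize) -> Option<&str> {
        let mut remaining = row;
        for (key, count) in &self.runs {
            if remaining < *count {
                return Some(key.as_str());
            }
            remaining -= *count;
        }
        None
    }

    /// Rows per partition key, read directly from the runs.
    pub fn group_counts(&self) -> Vec<(&str, usize)> {
        self.runs.iter().map(|(k, c)| (k.as_str(), *c)).collect()
    }
}
```

Under this representation, a query that only groups by the partition key and counts rows could be answered from the runs alone, without reading any column file.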
Parted column types
Internally, each column in a partitioned table is stored as a parted vector — a list of per-partition segments. When you access row N, Teide resolves which partition contains that row and reads from the correct segment.
This is transparent to SQL queries. You read and filter parted columns exactly like regular columns.
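Row resolution over per-partition segments can be sketched with cumulative start offsets and a binary search. `PartedVec` is a hypothetical type over `i64` segments, used here only to show the lookup; Teide's internal structure may be organized differently.

```rust
/// Hypothetical parted vector: a list of per-partition segments.
pub struct PartedVec {
    segments: Vec<Vec<i64>>,
    /// starts[i] is the global index of the first row of segment i;
    /// the final entry is the total row count.
    starts: Vec<usize>,
}

impl PartedVec {
    pub fn new(segments: Vec<Vec<i64>>) -> Self {
        let mut starts = Vec::with_capacity(segments.len() + 1);
        let mut total = 0;
        starts.push(0);
        for seg in &segments {
            total += seg.len();
            starts.push(total);
        }
        PartedVec { segments, starts }
    }

    /// Total number of rows across all segments.
    pub fn len(&self) -> usize {
        self.starts[self.starts.len() - 1]
    }

    /// Map a global row index to (segment index, offset within segment).
    pub fn locate(&self, row: usize) -> Option<(usize, usize)> {
        if row >= self.len() {
            return None;
        }
        // The owning segment is the last one whose start is <= row.
        let part = self.starts.partition_point(|&s| s <= row) - 1;
        Some((part, row - self.starts[part]))
    }

    /// Read the value at a global row index.
    pub fn get(&self, row: usize) -> Option<i64> {
        let (part, offset) = self.locate(row)?;
        Some(self.segments[part][offset])
    }
}
```

The binary search keeps lookups logarithmic in the number of partitions, and empty partitions are skipped naturally because their start offset equals the next segment's.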
Loading in SQL
-- Open a partitioned table
CREATE TABLE trades AS
SELECT * FROM read_parted('/data/marketdb', 'trades');
-- Query across all partitions
SELECT symbol, SUM(qty) AS total_qty, AVG(price) AS avg_price
FROM trades
WHERE date BETWEEN '2024-01-15' AND '2024-01-16'
GROUP BY symbol
ORDER BY total_qty DESC
LIMIT 10;
Loading in Rust
let ctx = Context::new()?;
// Open partitioned table (auto-loads sym, discovers partitions)
let trades = ctx.read_parted("/data/marketdb", "trades")?;
// Second call returns cached result instantly
let trades2 = ctx.read_parted("/data/marketdb", "trades")?;
println!("{} rows, {} cols", trades.nrows(), trades.ncols());
Storage Format Comparison
| Format | Access | Best for | SQL function |
|---|---|---|---|
| CSV | Full copy into memory | Import/export, ad-hoc analysis, small datasets | read_csv(path) |
| Splayed | Zero-copy mmap | Single tables, fast column scans, large datasets | read_splayed(dir [, sym]) |
| Partitioned | Zero-copy mmap | Time-series data, date-sharded tables, multi-day queries | read_parted(root, name) |
For kdb+/q Users
Teide's storage format is inspired by and compatible with the kdb+ on-disk layout:
| kdb+ concept | Teide equivalent |
|---|---|
| `sym type | VARCHAR / SYM (symbol-encoded internally) |
| Splayed table (`:/path/table/) | read_splayed('/path/table') |
| Partitioned DB (`:/path/db/) | read_parted('/path/db', 'table') |
| .d schema file | Same: binary vector of symbol IDs |
| sym file in HDB root | Same: loaded from db_root/sym |
| Date partitions (2024.01.15/) | Same directory naming convention |
| Virtual date column | MAPCOMMON virtual column (auto-created) |
The key difference: in Teide, you query these tables with standard SQL instead of q expressions.