Storage & Symbols

Teide's on-disk formats: how splayed tables, partitioned tables, and the symbol system work. Essential reading if you're coming from kdb+/q or working with persistent columnar data.

What Are Symbols?

Symbols are interned strings. Instead of storing the full text of every string value, Teide maintains a global lookup table that maps each unique string to a small integer ID. Column data then stores these compact IDs instead of the raw text.

This is the same concept as kdb+'s symbol type, or dictionary encoding in columnar databases.

Why symbols?

Benefit	Explanation
Memory efficiency	A million rows with 50 unique strings store 50 strings + 1M small integers, not 1M full strings
Fast comparison	Comparing two symbols is a single integer comparison, not a byte-by-byte string compare
Fast GROUP BY / JOIN	Hash tables on integer keys are much faster than on variable-length strings
Cache friendly	Fixed-width integer columns pack tightly in CPU cache lines

How it works

When Teide encounters the string 'AAPL' for the first time, it assigns it an ID (say, 0). The next unique string 'GOOG' gets 1, and so on. If 'AAPL' appears again, it reuses ID 0.

Global symbol table:
  0 → "AAPL"
  1 → "GOOG"
  2 → "MSFT"

Column data (symbol encoded):
  [0, 1, 0, 2, 0, 1, 2, 0]
   ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
  AAPL GOOG AAPL MSFT AAPL GOOG MSFT AAPL
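
The interning scheme above can be sketched in a few lines of Rust. This is a toy illustration only: `SymbolTable`, `intern`, `resolve`, and `encode` are hypothetical names and do not describe Teide's internal data structures, which are not shown on this page.

```rust
use std::collections::HashMap;

/// Toy symbol table (hypothetical, for illustration only):
/// maps each unique string to a dense integer ID.
#[derive(Default)]
pub struct SymbolTable {
    ids: HashMap<String, i64>, // string -> ID
    names: Vec<String>,        // ID -> string
}

impl SymbolTable {
    /// Return the existing ID for `s`, or assign the next free one.
    pub fn intern(&mut self, s: &str) -> i64 {
        if let Some(&id) = self.ids.get(s) {
            return id;
        }
        let id = self.names.len() as i64;
        self.ids.insert(s.to_string(), id);
        self.names.push(s.to_string());
        id
    }

    /// Resolve an ID back to its string, if the ID has been assigned.
    pub fn resolve(&self, id: i64) -> Option<&str> {
        if id < 0 {
            return None;
        }
        self.names.get(id as usize).map(|s| s.as_str())
    }

    /// Encode a column of strings as a column of symbol IDs.
    pub fn encode(&mut self, column: &[&str]) -> Vec<i64> {
        column.iter().map(|s| self.intern(s)).collect()
    }
}
```

Encoding the eight-row column from the diagram with a fresh table yields the ID vector shown there, and resolving ID 2 gives back "MSFT".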

In SQL, symbols behave exactly like VARCHAR — you query them with string literals and they display as text. The encoding is transparent:

-- Symbols look and act like strings in SQL
SELECT * FROM trades WHERE symbol = 'AAPL';
SELECT DISTINCT symbol FROM trades ORDER BY symbol;

Adaptive width encoding

Symbol IDs in column files use the smallest integer width that can represent all IDs in the symbol table:

Width	Max symbols	Bytes per value
W8	256	1
W16	65,536	2
W32	4,294,967,296	4
W64	effectively unlimited	8

Most real-world datasets have far fewer than 256 unique strings per column, so symbols typically use just 1 byte per value.
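
As a rough sketch, the width selection in the table above comes down to comparing the number of symbols against each width's capacity. The `SymWidth` enum and its methods are hypothetical names, and the code assumes IDs are stored as unsigned integers starting at 0.

```rust
/// Storage widths from the table above (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymWidth {
    W8,
    W16,
    W32,
    W64,
}

impl SymWidth {
    /// Smallest width whose range covers IDs 0..symbol_count-1.
    /// Assumes IDs are stored unsigned.
    pub fn for_symbol_count(symbol_count: u64) -> SymWidth {
        if symbol_count <= 1u64 << 8 {
            SymWidth::W8
        } else if symbol_count <= 1u64 << 16 {
            SymWidth::W16
        } else if symbol_count <= 1u64 << 32 {
            SymWidth::W32
        } else {
            SymWidth::W64
        }
    }

    /// Bytes used per stored symbol ID.
    pub fn bytes_per_value(self) -> usize {
        match self {
            SymWidth::W8 => 1,
            SymWidth::W16 => 2,
            SymWidth::W32 => 4,
            SymWidth::W64 => 8,
        }
    }
}
```

Under these assumptions, a table with 50 symbols fits in W8, while adding a 257th symbol would require W16.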

The sym File

The symbol table can be persisted to a binary file. This is required for splayed and partitioned tables so that symbol IDs in column files can be resolved back to strings.

Binary format

Bytes 0-3:    magic  (0x4D595354 = "TSYM" little-endian)
Bytes 4-7:    count  (uint32, number of symbols)

For each symbol i = 0..count-1:
  [4 bytes]   length of string i (uint32)
  [N bytes]   UTF-8 string data

Symbol IDs correspond directly to their position in the file: the first string is ID 0, the second is ID 1, etc.
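
To make the layout concrete, here is a minimal reader and writer for this format using only the standard library. The function names `write_sym` and `read_sym` are hypothetical, and the sketch assumes the `count` and per-string `length` fields are little-endian like the magic number; the format description above does not state their byte order.

```rust
use std::io::{self, Read, Write};

/// Magic number from the format description; its little-endian bytes spell "TSYM".
const SYM_MAGIC: u32 = 0x4D59_5354;

/// Write symbols in ID order (hypothetical helper).
/// Assumes count and length fields are little-endian.
pub fn write_sym<W: Write>(mut out: W, symbols: &[&str]) -> io::Result<()> {
    out.write_all(&SYM_MAGIC.to_le_bytes())?;
    out.write_all(&(symbols.len() as u32).to_le_bytes())?;
    for s in symbols {
        out.write_all(&(s.len() as u32).to_le_bytes())?;
        out.write_all(s.as_bytes())?;
    }
    Ok(())
}

/// Read a sym file; the index in the returned vector is the symbol ID.
pub fn read_sym<R: Read>(mut input: R) -> io::Result<Vec<String>> {
    if read_u32(&mut input)? != SYM_MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad sym magic"));
    }
    let count = read_u32(&mut input)?;
    let mut symbols = Vec::new();
    for _ in 0..count {
        let len = read_u32(&mut input)? as usize;
        let mut buf = vec![0u8; len];
        input.read_exact(&mut buf)?;
        let s = String::from_utf8(buf)
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        symbols.push(s);
    }
    Ok(symbols)
}

/// Read one little-endian u32 from the stream.
fn read_u32<R: Read>(input: &mut R) -> io::Result<u32> {
    let mut b = [0u8; 4];
    input.read_exact(&mut b)?;
    Ok(u32::from_le_bytes(b))
}
```

Both functions work against any `Write` or `Read` implementation, so the same code can target an in-memory `Vec<u8>` for testing or a `std::fs::File` on disk.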

Rust API

// Intern a string (returns stable i64 ID)
let id = teide::sym_intern("AAPL")?;

// Intern the same string again — returns the same ID
let id2 = teide::sym_intern("AAPL")?;
assert_eq!(id, id2);

Splayed Tables

A splayed table stores each column as a separate binary file within a directory. This is the fundamental on-disk format in Teide, borrowed from kdb+.

Directory layout

/data/tables/trades/
  .d           -- schema file (vector of column name symbol IDs)
  ts           -- column: timestamp vector
  symbol       -- column: symbol vector
  price        -- column: f64 vector
  qty          -- column: i64 vector

Schema file (.d)

The .d file is a binary vector of i64 symbol IDs, one per column. Each ID maps to a column name in the symbol table. The order defines the column order of the table.

Column file format

Each column file is a raw binary vector with a 32-byte block header:

Bytes 0-15:   nullmap (inline bitmap for null tracking)
Byte  16:     mmod    (memory mode: 0=heap, 1=mmap)
Byte  17:     order   (block size class)
Byte  18:     type    (type tag: 6=i64, 7=f64, 20=sym, ...)
Byte  19:     attrs   (flags: has-nulls, external nullmap, ...)
Bytes 20-23:  rc      (reference count, 0 on disk)
Bytes 24-31:  len     (number of elements, uint64)
Bytes 32+:    data    (len * element_size bytes)
[optional]:   external nullmap ((len+7)/8 bytes, if flagged)
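
The header layout above can be decoded with plain byte slicing. The sketch below is an illustration under stated assumptions: `ColumnHeader`, `parse_header`, and `external_nullmap_bytes` are hypothetical names, and the multi-byte fields (`rc`, `len`) are assumed to be little-endian, which the layout does not specify.

```rust
/// Size of the fixed block header, per the layout above.
pub const HEADER_SIZE: usize = 32;

/// Decoded block header (hypothetical struct, for illustration).
#[derive(Debug, PartialEq)]
pub struct ColumnHeader {
    pub nullmap: [u8; 16], // inline null bitmap
    pub mmod: u8,          // memory mode: 0=heap, 1=mmap
    pub order: u8,         // block size class
    pub type_tag: u8,      // e.g. 6=i64, 7=f64, 20=sym
    pub attrs: u8,         // flags
    pub rc: u32,           // reference count, 0 on disk
    pub len: u64,          // number of elements
}

/// Parse the first 32 bytes of a column file.
/// Assumes little-endian encoding for `rc` and `len`.
pub fn parse_header(bytes: &[u8]) -> Option<ColumnHeader> {
    if bytes.len() < HEADER_SIZE {
        return None;
    }
    let mut nullmap = [0u8; 16];
    nullmap.copy_from_slice(&bytes[0..16]);
    let mut rc = [0u8; 4];
    rc.copy_from_slice(&bytes[20..24]);
    let mut len = [0u8; 8];
    len.copy_from_slice(&bytes[24..32]);
    Some(ColumnHeader {
        nullmap,
        mmod: bytes[16],
        order: bytes[17],
        type_tag: bytes[18],
        attrs: bytes[19],
        rc: u32::from_le_bytes(rc),
        len: u64::from_le_bytes(len),
    })
}

/// Size of the optional external nullmap: one bit per element, rounded up to whole bytes.
pub fn external_nullmap_bytes(len: u64) -> u64 {
    (len + 7) / 8
}
```

For a column of 9 elements, the external nullmap would occupy 2 bytes, since 9 bits do not fit in one.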

Loading in SQL

-- Load a splayed table (auto-discovers column names from .d)
SELECT * FROM read_splayed('/data/tables/trades');

-- With an explicit symbol file path
SELECT * FROM read_splayed('/data/tables/trades', '/data/sym');

-- Persist as an in-memory table
CREATE TABLE trades AS SELECT * FROM read_splayed('/data/tables/trades');

Loading in Rust

let ctx = Context::new()?;

// Without a shared symbol file (pass None for the optional sym path)
let table = ctx.read_splayed("/data/tables/trades", None)?;

// With a shared symbol file
let table = ctx.read_splayed("/data/tables/trades", Some("/data/sym"))?;

println!("{} rows, {} cols", table.nrows(), table.ncols());

Zero-copy: Column files are memory-mapped. The OS pages data in on demand, so you can query datasets larger than available RAM without loading everything upfront.

Partitioned Tables

A partitioned table splits data across multiple directories, each representing a time period (typically a date). This is the standard layout for time-series data in kdb+ systems.

Directory layout

/data/marketdb/
  sym                          -- shared symbol table (required)
  2024.01.15/
    trades/
      .d                       -- schema
      symbol                   -- column: sym
      price                    -- column: f64
      qty                      -- column: i64
  2024.01.16/
    trades/
      .d
      symbol
      price
      qty
  2024.01.17/
    trades/
      .d
      symbol
      price
      qty

Partition naming: Directory names must be dates in YYYY.MM.DD format (e.g., 2024.01.15) or integer keys. Teide auto-discovers all valid partition directories and loads them in sorted order.

How partitions are discovered

1. Scan db_root for subdirectories whose names consist of digits and dots
2. Validate date format: 10 characters, dots at positions 4 and 7, valid month (01-12) and day (01-31)
3. Sort directories lexicographically for deterministic ordering
4. Load each db_root/<partition>/<table_name> as a splayed table
5. Load the shared symbol table from db_root/sym
6. Concatenate all partition segments into a single logical table
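
The name validation and sorting steps above can be sketched as follows. The function names are hypothetical, the sketch covers only date-named partitions (not integer keys), and it takes directory names as plain strings so it stays independent of the filesystem; in practice they could come from `std::fs::read_dir`.

```rust
/// True if `name` is a well-formed YYYY.MM.DD partition name:
/// 10 characters, dots at positions 4 and 7, month 01-12, day 01-31.
pub fn is_date_partition(name: &str) -> bool {
    let b = name.as_bytes();
    if b.len() != 10 || b[4] != b'.' || b[7] != b'.' {
        return false;
    }
    // Every other position must be an ASCII digit.
    let digits_ok = b
        .iter()
        .enumerate()
        .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit());
    if !digits_ok {
        return false;
    }
    let month: u32 = name[5..7].parse().unwrap_or(0);
    let day: u32 = name[8..10].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Keep valid date partitions and return them in lexicographic order,
/// which for YYYY.MM.DD names is also chronological order.
pub fn discover_partitions(dir_names: &[&str]) -> Vec<String> {
    let mut parts: Vec<String> = dir_names
        .iter()
        .filter(|n| is_date_partition(n))
        .map(|n| n.to_string())
        .collect();
    parts.sort();
    parts
}
```

Zero-padded date names are what make the lexicographic sort safe: a name such as `2024.1.5` would fail the length check rather than sort out of order.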

The virtual partition column (MAPCOMMON)

When you open a partitioned table, Teide automatically creates a virtual column containing the partition key (the date). This column doesn't exist on disk — it's synthesized at query time.

-- The partition date is available as a queryable column
SELECT date, COUNT(*) AS trades
FROM read_parted('/data/marketdb', 'trades')
GROUP BY date
ORDER BY date;

date        trades
2024.01.15  12847
2024.01.16  15203
2024.01.17  11592

The virtual column's type depends on the partition directory names:

Directory names	Virtual column type	Example
`YYYY.MM.DD` dates	DATE	`2024.01.15`, `2024.01.16`
Integer keys	BIGINT	`1`, `2`, `3`
Other strings	VARCHAR (symbol)	`us`, `eu`, `asia`

Parted column types

Internally, each column in a partitioned table is stored as a parted vector — a list of per-partition segments. When you access row N, Teide resolves which partition contains that row and reads from the correct segment.

This is transparent to SQL queries. You read and filter parted columns exactly like regular columns.
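
One common way to implement this kind of row resolution is to keep cumulative row offsets per segment and binary-search them. The sketch below shows that idea; `Parted` is a hypothetical type, and nothing here is claimed to match Teide's actual parted vector implementation.

```rust
/// Hypothetical parted vector: per-partition segments plus cumulative row offsets.
pub struct Parted<T> {
    segments: Vec<Vec<T>>,
    /// starts[i] is the global index of the first row in segment i;
    /// the final entry is the total row count.
    starts: Vec<usize>,
}

impl<T> Parted<T> {
    pub fn new(segments: Vec<Vec<T>>) -> Self {
        let mut starts = Vec::with_capacity(segments.len() + 1);
        let mut total = 0;
        for seg in &segments {
            starts.push(total);
            total += seg.len();
        }
        starts.push(total);
        Parted { segments, starts }
    }

    /// Total number of rows across all segments.
    pub fn len(&self) -> usize {
        self.starts[self.starts.len() - 1]
    }

    /// Resolve global row `n` to its segment and local offset, then read it.
    pub fn get(&self, n: usize) -> Option<&T> {
        if n >= self.len() {
            return None;
        }
        // Count of segment starts <= n; the last such segment owns row n.
        let seg = self.starts.partition_point(|&s| s <= n) - 1;
        self.segments[seg].get(n - self.starts[seg])
    }
}
```

With this layout a lookup costs O(log P) for P partitions, and empty partitions are skipped naturally because they share a start offset with the next non-empty segment.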

Loading in SQL

-- Open a partitioned table
CREATE TABLE trades AS
SELECT * FROM read_parted('/data/marketdb', 'trades');

-- Query across all partitions
SELECT symbol, SUM(qty) AS total_qty, AVG(price) AS avg_price
FROM trades
WHERE date BETWEEN '2024-01-15' AND '2024-01-16'
GROUP BY symbol
ORDER BY total_qty DESC
LIMIT 10;

Loading in Rust

let ctx = Context::new()?;

// Open partitioned table (auto-loads sym, discovers partitions)
let trades = ctx.read_parted("/data/marketdb", "trades")?;

// A second call with the same arguments returns the cached result
let trades2 = ctx.read_parted("/data/marketdb", "trades")?;

println!("{} rows, {} cols", trades.nrows(), trades.ncols());

Storage Format Comparison

Format	Access	Best for	SQL function
CSV	Full copy into memory	Import/export, ad-hoc analysis, small datasets	`read_csv(path)`
Splayed	Zero-copy mmap	Single tables, fast column scans, large datasets	`read_splayed(dir [, sym])`
Partitioned	Zero-copy mmap	Time-series data, date-sharded tables, multi-day queries	`read_parted(root, name)`

For kdb+/q Users

Teide's storage format is inspired by and compatible with the kdb+ on-disk layout:

kdb+ concept	Teide equivalent
`sym type	`VARCHAR` / `SYM` (symbol-encoded internally)
Splayed table (`:/path/table/)	`read_splayed('/path/table')`
Partitioned DB (`:/path/db/)	`read_parted('/path/db', 'table')`
`.d` schema file	Same — binary vector of symbol IDs
`sym` file in HDB root	Same — loaded from `db_root/sym`
Date partitions (`2024.01.15/`)	Same directory naming convention
Virtual `date` column	MAPCOMMON virtual column (auto-created)

The key difference: in Teide, you query these tables with standard SQL instead of q expressions.