Binary Format Specification
Detailed binary format specification for Matchy databases.
Matchy databases use the MaxMind DB (MMDB) format with optional extensions for string and pattern matching.
Overview
The format has three main components:
- MMDB Section: Standard MaxMind DB format for IP address lookups
- PARAGLOB Section: Optional extension for glob pattern matching
- String Literals Hash Section: Optional extension for exact string matching
All components coexist in a single .mxy file.
File Structure
Note: The MMDB format is unusual - it has no header or magic bytes at the start. The file begins directly with the IP search tree, and all metadata is stored at the end of the file.
┌─────────────────────────────────────────────────────────┐
│ IP Search Tree (Binary Trie) │ Starts at byte 0
├─────────────────────────────────────────────────────────┤
│ 16-byte separator │
├─────────────────────────────────────────────────────────┤
│ Data Section (Shared) │ MMDB data values
├─────────────────────────────────────────────────────────┤
│ MMDB_PATTERN separator (optional) │ "MMDB_PATTERN\x00\x00\x00\x00"
├─────────────────────────────────────────────────────────┤
│ PARAGLOB SECTION (optional) │ Glob pattern matching
├─────────────────────────────────────────────────────────┤
│ MMDB_LITERAL separator (optional) │ "MMDB_LITERAL\x00\x00\x00\x00"
├─────────────────────────────────────────────────────────┤
│ STRING LITERALS HASH SECTION (optional) │ O(1) exact string lookups
├─────────────────────────────────────────────────────────┤
│ Metadata Marker │ "\xAB\xCD\xEFMaxMind.com"
├─────────────────────────────────────────────────────────┤
│ MMDB Metadata (within last 128KB) │ node_count, record_size, etc.
└─────────────────────────────────────────────────────────┘
Section Descriptions
IP Search Tree: Binary trie for IP address lookups. This is the first data in the file (offset 0). The tree structure depends on metadata fields that are only available after parsing the metadata at the end of the file.
Data Section: Shared MMDB-encoded data values referenced by all query types (IP, pattern, and literal lookups).
PARAGLOB Section: Optional section for glob pattern matching. Only present if the database contains patterns with wildcards (e.g., *.example.com).
String Literals Hash Section: Optional hash table for O(1) exact string matching. Only present if the database contains literal strings (non-wildcard patterns).
MMDB Metadata: Contains essential database information:
node_count: Number of nodes in the IP search treerecord_size: Size of tree records (24, 28, or 32 bits)ip_version: IPv4 (4) or IPv6 (6)pattern_section_offset: Offset to PARAGLOB section (0 if absent)literal_section_offset: Offset to literal hash section (0 if absent)- Build timestamp, database type, description, etc.
The metadata marker (\xAB\xCD\xEFMaxMind.com) is located within the last 128KB of the file. Parsers search backwards from the end to find it.
MMDB Section
The file follows the standard MaxMind DB format:
- See MaxMind DB Spec
Key characteristics:
- No header at start of file
- File begins with IP search tree data at offset 0
- Metadata stored at end of file for fast tail access
- Memory-mappable with zero-copy access
Metadata
Standard MMDB metadata map at the end of the file (after metadata marker):
{
"binary_format_major_version": 2,
"binary_format_minor_version": 0,
"build_epoch": 1234567890,
"database_type": "Matchy",
"description": {
"en": "Matchy unified database"
},
"ip_version": 6,
"node_count": 12345,
"record_size": 28
}
Search Tree
Binary trie for IP address lookups:
- Node size: 7 bytes (28-bit pointers × 2)
- Record size: 28 bits per record
- Addressing: Supports up to 256M nodes
Each node contains two 28-bit pointers (left/right):
Node (7 bytes):
├─ Left pointer (28 bits) → next node or data
└─ Right pointer (28 bits) → next node or data
Data Section
MMDB-format data types:
| Type | Code | Size | Notes |
|---|---|---|---|
| Pointer | 1 | Variable | Offset into data section |
| String | 2 | Variable | UTF-8 text |
| Double | 3 | 8 bytes | IEEE 754 |
| Bytes | 4 | Variable | Binary data |
| Uint16 | 5 | 2 bytes | Unsigned integer |
| Uint32 | 6 | 4 bytes | Unsigned integer |
| Map | 7 | Variable | Key-value pairs |
| Int32 | 8 | 4 bytes | Signed integer |
| Uint64 | 9 | 8 bytes | Unsigned integer |
| Boolean | 14 | 0 bytes | Value in type byte |
| Float | 15 | 4 bytes | IEEE 754 |
| Array | 11 | Variable | Ordered list |
| Timestamp | 128 | 8 bytes | Matchy extension (Unix epoch seconds) |
See MaxMind DB Format for encoding details.
Matchy Extended Types
Matchy extends the MMDB format with additional types using codes 128+:
| Type | Code | Size | Notes |
|---|---|---|---|
| Timestamp | 128 | 8 bytes | Unix epoch seconds (signed i64) |
These types are stored using the MMDB extended type mechanism (raw byte = code - 7). Timestamp values are serialized to JSON as ISO 8601 strings (e.g., 2025-10-02T18:44:31Z) for human readability while stored compactly as 8 bytes instead of 27-byte strings.
PARAGLOB Section Format
When glob patterns are present, the PARAGLOB section contains:
#![allow(unused)]
fn main() {
#[repr(C)]
struct ParaglobHeader {
magic: [u8; 8], // "PARAGLOB"
version: u32, // Format version (currently 5)
match_mode: u32, // 0=CaseSensitive, 1=CaseInsensitive
ac_node_count: u32, // Number of AC automaton nodes
ac_nodes_offset: u32, // Offset to node array
// ... additional fields for pattern data
}
}
Followed by:
- Aho-Corasick automaton nodes and edges
- Pattern metadata entries
- Glob segment data
- Pattern-to-data mappings
See matchy-format/src/offset_format.rs for the complete ParaglobHeader structure (112 bytes in v5).
String Literals Hash Section Format (Version 2)
When literal strings are present, a hash table section provides O(1) lookups using 96-bit truncated XXH3 hashes:
#![allow(unused)]
fn main() {
#[repr(C)]
struct LiteralHashHeader {
magic: [u8; 4], // "LHSH"
version: u32, // 2
entry_count: u32, // Number of patterns
table_size: u32, // Hash table capacity
reserved1: u32, // Reserved (was strings_offset in v1)
reserved2: u32, // Reserved (was strings_size in v1)
num_shards: u32, // Number of shards (power of 2)
shard_bits: u32, // Bits used for sharding
}
#[repr(C)]
struct HashEntry {
hash: [u8; 12], // 96-bit truncated XXH3_128
pattern_id: u32, // Pattern ID for data lookup
}
}
Key characteristics:
- Hash-only storage: Original strings are not stored (privacy-preserving)
- 96-bit hashes: Negligible collision probability (< 10⁻²⁴ per query)
- Sharded construction: Parallel building for large datasets
- 16-byte entries: Same size as v1, but ~50% smaller total (no string pool)
See matchy-literal-hash crate for implementation details.
Data Alignment
All structures are aligned:
- Header: 8-byte alignment
- Nodes: 8-byte alignment
- Edges: 4-byte alignment
- Hash buckets: 4-byte alignment
Padding bytes are zeros.
Offset Encoding
All offsets are relative to the start of the PARAGLOB section:
File offset = PARAGLOB_SECTION_START + relative_offset
Special values:
0x00000000= NULL pointer0xFFFFFFFF= Invalid/end marker
Version History
Version 5 (Current)
- Serialized glob segments for zero-copy loading
- Optimized memory layout with ACNodeHot (16 bytes)
- Support for patterns, exact strings, and IP addresses
- Aho-Corasick automaton for pattern matching
- Separate hash table for exact literal matches
- Embedded MMDB data format
Previous Versions
- v4: ACNodeHot (20-byte) for 50% memory reduction
- v3: AC literal mapping for O(1) zero-copy loading
- v2: Data section support for pattern-associated data
- v1: Original format, patterns only
Format Validation
Matchy validates these invariants on load:
- Magic bytes match: “\xAB\xCD\xEFMaxMind.com” at end, “PARAGLOB” if pattern section present
- Version supported: PARAGLOB version 5 currently
- Offsets in bounds: All offsets point within file
- Alignment correct: Structures properly aligned
- Section offsets: Metadata contains correct
pattern_section_offsetandliteral_section_offset - File size: Must be at least large enough for tree + metadata
Validation errors result in format errors. See matchy validate command for detailed validation.
Memory Mapping
The format is designed for memory mapping:
- No pointer fixups: All offsets are file-relative
- No relocations: Position-independent
- Aligned access: Natural alignment for all types
- Bounds checkable: All sizes/offsets in header
Example:
#![allow(unused)]
fn main() {
let file = File::open("database.mxy")?;
let mmap = unsafe { Mmap::map(&file)? };
// Direct access to structures
let header = read_paraglob_header(&mmap)?;
let nodes = get_node_array(&mmap, header.nodes_offset)?;
}
Cross-Platform Compatibility
Format is platform-independent:
- Endianness: Native byte order (little-endian on x86/ARM). Marker stored for future big-endian support if needed.
- Alignment: Conservative alignment for all platforms
- Sizes: Fixed-size types (
u32, notsize_t) - ABI:
#[repr(C)]structures
A database built on Linux/x86-64 works on macOS/ARM64 (both little-endian).
Future Extensions
Reserved fields for future versions:
- Pattern compilation flags (case sensitivity, etc.)
- Compressed string tables
- Alternative hash functions
- Additional data formats
Version changes will be backward-compatible when possible.