AgenticCodebase
File Format Specification
The .acb binary format stores a complete code concept graph in a single file. This document describes the on-disk layout, section structure, and design rationale.
Design Goals
- O(1) random access. Look up any code unit by ID without scanning the file.
- Compact. LZ4 compression for variable-length strings. Fixed-size records for units and edges.
- Memory-mappable. The format supports mmap() for zero-copy access to unit and edge tables.
- Forward-compatible. New fields are appended to the header. Older readers skip unknown sections.
File Layout
Offset 0x00 ┌──────────────────────────────┐
            │ Header (128 B)               │
            ├──────────────────────────────┤
            │ Unit Table (96N bytes)       │ N = unit_count
            ├──────────────────────────────┤
            │ Edge Table (40M bytes)       │ M = edge_count
            ├──────────────────────────────┤
            │ String Pool (LZ4 compressed) │ Variable size
            ├──────────────────────────────┤
            │ Feature Vectors (f32 array)  │ N * dim * 4 bytes
            └──────────────────────────────┘
Header (128 bytes)
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0x00 | 4 | [u8; 4] | magic | Magic bytes: ACB\0 |
| 0x04 | 4 | u32 | version | Format version (currently 1) |
| 0x08 | 8 | u64 | unit_count | Number of code units |
| 0x10 | 8 | u64 | edge_count | Number of edges |
| 0x18 | 8 | u64 | string_pool_offset | Byte offset of string pool section |
| 0x20 | 8 | u64 | string_pool_size | Compressed size of string pool |
| 0x28 | 8 | u64 | feature_offset | Byte offset of feature vector section |
| 0x30 | 4 | u32 | dimension | Feature vector dimensionality |
| 0x34 | 8 | u64 | timestamp | Compilation timestamp (Unix epoch) |
| 0x3C | 68 | [u8; 68] | reserved | Reserved for future fields |
Total: 128 bytes (fixed).
Unit Table
Starts immediately after the header at offset 128. Each unit record is 96 bytes.
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0x00 | 4 | u32 | id | Unique unit identifier |
| 0x04 | 4 | u32 | name_offset | Offset into decompressed string pool |
| 0x08 | 4 | u32 | name_length | Length of name string |
| 0x0C | 4 | u32 | qname_offset | Qualified name offset |
| 0x10 | 4 | u32 | qname_length | Qualified name length |
| 0x14 | 1 | u8 | unit_type | UnitType enum discriminant |
| 0x15 | 1 | u8 | language | Language enum discriminant |
| 0x16 | 1 | u8 | visibility | Visibility enum discriminant |
| 0x17 | 1 | u8 | flags | Bit flags (is_async, is_generator, etc.) |
| 0x18 | 4 | u32 | file_offset | File path offset in string pool |
| 0x1C | 4 | u32 | file_length | File path length |
| 0x20 | 4 | u32 | start_line | Span start line |
| 0x24 | 4 | u32 | start_col | Span start column |
| 0x28 | 4 | u32 | end_line | Span end line |
| 0x2C | 4 | u32 | end_col | Span end column |
| 0x30 | 4 | u32 | complexity | Cyclomatic complexity |
| 0x34 | 4 | f32 | stability | Stability score (0.0 - 1.0) |
| 0x38 | 4 | u32 | sig_offset | Signature string offset (0 if none) |
| 0x3C | 4 | u32 | sig_length | Signature string length |
| 0x40 | 4 | u32 | doc_offset | Doc summary offset (0 if none) |
| 0x44 | 4 | u32 | doc_length | Doc summary length |
| 0x48 | 24 | [u8; 24] | reserved | Reserved for future fields |
Total: 96 bytes per unit.
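Fixed-size records are what make constant-time lookup possible: the record for the unit at table position i sits at 128 + 96 * i. The sketch below illustrates that arithmetic and reads two groups of fields at the offsets listed in the table. It assumes little-endian byte order and that units are addressed by their zero-based position in the table; the function names are hypothetical.

```python
import struct

HEADER_SIZE = 128
UNIT_RECORD_SIZE = 96


def unit_record_offset(index: int, unit_count: int) -> int:
    """Byte offset of the unit record at a zero-based table position."""
    if not 0 <= index < unit_count:
        raise IndexError(f"unit index {index} out of range")
    return HEADER_SIZE + index * UNIT_RECORD_SIZE


def read_unit_span(buf, index: int, unit_count: int):
    """Return (start_line, start_col, end_line, end_col) from 0x20..0x2F."""
    base = unit_record_offset(index, unit_count)
    return struct.unpack_from("<4I", buf, base + 0x20)


def read_unit_metrics(buf, index: int, unit_count: int):
    """Return (complexity, stability): u32 at 0x30 and f32 at 0x34."""
    base = unit_record_offset(index, unit_count)
    return struct.unpack_from("<If", buf, base + 0x30)
```

struct.unpack_from accepts any object that supports the buffer protocol, so buf can be a bytes object or an mmap.mmap over the file, in which case no records other than the requested one are touched.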
Edge Table
Starts immediately after the unit table, at offset 128 + 96 * unit_count. Each edge record is 40 bytes.
| Offset | Size | Type | Field | Description |
|---|---|---|---|---|
| 0x00 | 8 | u64 | source_id | Source unit ID |
| 0x08 | 8 | u64 | target_id | Target unit ID |
| 0x10 | 1 | u8 | edge_type | EdgeType enum discriminant |
| 0x11 | 7 | [u8; 7] | padding | Alignment padding |
| 0x18 | 8 | f64 | weight | Edge weight (0.0 - 1.0) |
| 0x20 | 8 | [u8; 8] | reserved | Reserved |
Total: 40 bytes per edge.
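The edge layout maps directly onto a struct format string, with the padding and reserved bytes expressed as skipped pad bytes. The following is a sketch under the assumption of little-endian byte order; EDGE_RECORD, edge_table_offset, and iter_edges are illustrative names.

```python
import struct
from typing import Iterator, Tuple

HEADER_SIZE = 128
UNIT_RECORD_SIZE = 96
# source_id (u64), target_id (u64), edge_type (u8), 7 padding bytes,
# weight (f64), 8 reserved bytes: 40 bytes in total.
EDGE_RECORD = struct.Struct("<QQB7xd8x")


def edge_table_offset(unit_count: int) -> int:
    """The edge table follows the header and the unit table."""
    return HEADER_SIZE + unit_count * UNIT_RECORD_SIZE


def iter_edges(buf, unit_count: int, edge_count: int) -> Iterator[Tuple[int, int, int, float]]:
    """Yield (source_id, target_id, edge_type, weight) for every edge."""
    start = edge_table_offset(unit_count)
    for i in range(edge_count):
        yield EDGE_RECORD.unpack_from(buf, start + i * EDGE_RECORD.size)
```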
String Pool
The string pool contains all variable-length text: unit names, qualified names, file paths, signatures, and documentation summaries. It is stored as a single contiguous buffer, LZ4-compressed.
On read, the entire pool is decompressed into memory. String references in unit records use (offset, length) pairs into this decompressed buffer.
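Resolving an (offset, length) pair is then a plain slice of the decompressed buffer. The sketch below assumes the pool has already been decompressed (the LZ4 step needs a third-party binding and is left out) and that strings are UTF-8 encoded, which the specification does not state. pool_string is a hypothetical helper.

```python
def pool_string(pool: bytes, offset: int, length: int) -> str:
    """Resolve an (offset, length) reference into the decompressed pool."""
    end = offset + length
    if offset < 0 or length < 0 or end > len(pool):
        raise ValueError(f"string reference ({offset}, {length}) is out of bounds")
    # UTF-8 is an assumption; the format description does not name an encoding.
    return pool[offset:end].decode("utf-8")
```

For example, with a pool of b"parsesrc/lib.rs", the reference (0, 5) resolves to "parse" and (5, 10) resolves to "src/lib.rs".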
Compression
LZ4 block compression is used for the string pool. Typical compression ratios on source code metadata:
- English identifiers: ~2.5x compression
- File paths with common prefixes: ~3-4x compression
- Documentation text: ~2-3x compression
LZ4 decompression runs at 3-5 GB/s on modern hardware, making the decompression cost negligible.
Feature Vectors
Feature vectors are stored as a flat array of f32 values, one vector per unit. The vector for the unit at zero-based table index i starts at offset feature_offset + i * dimension * 4.
Default dimension is 64, configurable at compile time. Vectors are not compressed since f32 values compress poorly.
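The offset formula can be sketched as follows, assuming little-endian f32 values and a zero-based unit index. feature_vector is an illustrative name; feature_offset and dimension come from the header.

```python
import struct
from typing import Tuple


def feature_vector(buf, feature_offset: int, dimension: int, index: int) -> Tuple[float, ...]:
    """Read one unit's feature vector from the flat f32 array."""
    if index < 0:
        raise IndexError("unit index must be non-negative")
    # Each vector occupies dimension * 4 bytes, laid out back to back.
    start = feature_offset + index * dimension * 4
    return struct.unpack_from(f"<{dimension}f", buf, start)
```

With the default dimension of 64, each vector occupies 256 bytes, so the vector at index 10 starts 2560 bytes past feature_offset.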
Versioning
The version field in the header enables forward compatibility:
- Version 1 (current): Base format as described in this document.
- Future versions will maintain backward compatibility by appending new sections after existing ones and using reserved header fields.
Readers should check the version field and reject files with unsupported versions rather than attempting to parse unknown formats.
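A reader following this rule might look like the sketch below, which checks the magic bytes and version before any other parsing. Little-endian byte order is an assumption, and SUPPORTED_VERSIONS, UnsupportedVersionError, and check_version are hypothetical names.

```python
import struct

SUPPORTED_VERSIONS = {1}


class UnsupportedVersionError(Exception):
    """Raised when a file declares a format version this reader cannot parse."""


def check_version(header_bytes: bytes) -> int:
    """Validate the magic bytes and version at the start of the header."""
    magic, version = struct.unpack_from("<4sI", header_bytes, 0)
    if magic != b"ACB\x00":
        raise ValueError(f"not an .acb file: magic is {magic!r}")
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedVersionError(f"unsupported format version {version}")
    return version
```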
Checksum
The current format does not include checksums. File integrity can be verified using external tools (e.g., blake3sum). A checksum field may be added in a future version using reserved header space.
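Where a command-line tool is not available, the same kind of external check can be sketched in Python. BLAKE3 is not in the Python standard library, so this example substitutes hashlib.blake2b; its digests are not interchangeable with blake3sum output. file_digest is a hypothetical helper.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large .acb files need not fit in memory."""
    digest = hashlib.blake2b()  # stand-in for BLAKE3, which is not in the stdlib
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting hex string can be stored alongside the file and compared after transfer.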