    Vector Databases

    When to Use This Skill

    Use this skill when:

    • Choosing between vector database options
    • Designing semantic/similarity search systems
    • Optimizing vector search performance
    • Understanding ANN algorithm trade-offs
    • Scaling vector search infrastructure
    • Implementing hybrid search (vectors + filters)

    Keywords: vector database, embeddings, vector search, similarity search, ANN, approximate nearest neighbor, HNSW, IVF, FAISS, Pinecone, Weaviate, Milvus, Qdrant, Chroma, pgvector, cosine similarity, semantic search

    Vector Database Comparison

    Managed Services

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Pinecone | Fully managed, easy scaling, enterprise features | Vendor lock-in, cost at scale | Enterprise production |
| Weaviate Cloud | GraphQL, hybrid search, modules | Complexity | Knowledge graphs |
| Zilliz Cloud | Milvus-based, high performance | Learning curve | High-scale production |
| MongoDB Atlas Vector | Integrates with existing MongoDB data | Newer feature | MongoDB shops |
| Elastic Vector | Integrates with existing Elastic stack | Resource heavy | Search platforms |

    Self-Hosted Options

| Database | Strengths | Limitations | Best For |
|---|---|---|---|
| Milvus | Feature-rich, scalable, GPU support | Operational complexity | Large-scale production |
| Qdrant | Rust performance, strong filtering, easy to run | Smaller ecosystem | Performance-focused |
| Weaviate | Modules, semantic features, hybrid search | Memory usage | Knowledge applications |
| Chroma | Simple, Python-native | Limited scale | Development, prototyping |
| pgvector | PostgreSQL extension | Performance limits | Postgres shops |
| FAISS | Library (not a DB), fastest | No persistence, no filtering | Research, embedded |

    Selection Decision Tree

    Need managed, don't want operations?
    ├── Yes → Pinecone (simplest) or Weaviate Cloud
    └── No (self-hosted)
        └── Already using PostgreSQL?
            ├── Yes, <1M vectors → pgvector
            └── No
                └── Need maximum performance at scale?
                    ├── Yes → Milvus or Qdrant
                    └── No
                        └── Prototyping/development?
                            ├── Yes → Chroma
                            └── No → Qdrant (balanced choice)
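
For the prototyping branch above, a minimal Chroma sketch looks like this (assumes `pip install chromadb`; the collection name and documents are illustrative, and Chroma's bundled default model computes the embeddings):

```python
import chromadb

client = chromadb.Client()  # in-memory instance, good for development

collection = client.create_collection("docs")
collection.add(
    ids=["1", "2", "3"],
    documents=["wireless earbuds", "bluetooth speaker", "laptop stand"],
    metadatas=[{"category": "audio"}, {"category": "audio"}, {"category": "office"}],
)

# Embeds the query text and returns the nearest stored documents.
results = collection.query(query_texts=["headphones"], n_results=2)
print(results["ids"], results["distances"])
```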
    

    ANN Algorithms

    Algorithm Overview

    Exact KNN:
    • Search ALL vectors
    • O(n) time complexity
    • Perfect accuracy
    • Impractical at scale
    
    Approximate Nearest Neighbor (ANN):
    • Search a SUBSET of vectors
    • Sublinear complexity, typically O(log n)
    • Near-perfect accuracy (tunable recall)
    • Practical at any scale
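
As a concrete baseline, exact KNN is just one dot product per stored vector. A minimal numpy sketch (sizes illustrative) shows why it is fine at 100K vectors but scales linearly beyond:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 384)).astype("float32")
db /= np.linalg.norm(db, axis=1, keepdims=True)  # unit vectors: cosine == dot

query = db[42]                        # stand-in for an embedded query

scores = db @ query                   # O(n): one dot product per vector
top10 = np.argsort(-scores)[:10]      # exact top-10, highest score first
print(top10)
```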
    

    HNSW (Hierarchical Navigable Small World)

    Layer 3: ○───────────────────────○  (sparse, long connections)
              │                       │
    Layer 2: ○───○───────○───────○───○  (medium density)
              │   │       │       │   │
    Layer 1: ○─○─○─○─○─○─○─○─○─○─○─○─○  (denser)
              │││││││││││││││││││││││
    Layer 0: ○○○○○○○○○○○○○○○○○○○○○○○○○  (all vectors)
    
    Search: Start at top layer, greedily descend
    • Fast: O(log n) search time
    • High recall: >95% typically
    • Memory: Extra graph storage
    

    HNSW Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| M | Connections per node | Memory vs. recall |
| ef_construction | Build-time search width | Build time vs. recall |
| ef_search | Query-time search width | Latency vs. recall |
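
A sketch wiring up exactly these three parameters in FAISS (assumes `pip install faiss-cpu`; the data and values are illustrative, not tuned recommendations):

```python
import faiss
import numpy as np

d = 384
xb = np.random.rand(100_000, d).astype("float32")

M = 32                                  # connections per node
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200         # build-time search width
index.add(xb)                           # graph is built as vectors are added

index.hnsw.efSearch = 64                # query-time search width (recall knob)
distances, ids = index.search(xb[:5], 10)
print(ids.shape)                        # (5, 10)
```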

    IVF (Inverted File Index)

    Clustering Phase:
    ┌─────────────────────────────────────────┐
    │     Cluster vectors into K centroids    │
    │                                         │
    │    ●         ●         ●         ●     │
    │   /│\       /│\       /│\       /│\    │
    │  ○○○○○     ○○○○○     ○○○○○     ○○○○○   │
    │ Cluster 1  Cluster 2 Cluster 3 Cluster 4│
    └─────────────────────────────────────────┘
    
    Search Phase:
    1. Find nprobe nearest centroids
    2. Search only those clusters
    3. Much faster than exhaustive search
    

    IVF Parameters:

| Parameter | Description | Trade-off |
|---|---|---|
| nlist | Number of clusters | Build time vs. search quality |
| nprobe | Clusters to search | Latency vs. recall |
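
The same two knobs in a FAISS IVF sketch (again assuming `faiss-cpu`; sizes are illustrative). Note the explicit train step, which is the clustering phase from the diagram:

```python
import faiss
import numpy as np

d = 384
xb = np.random.rand(200_000, d).astype("float32")

nlist = 1024                            # number of clusters (centroids)
quantizer = faiss.IndexFlatL2(d)        # assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(xb)                         # clustering phase
index.add(xb)

index.nprobe = 16                       # clusters searched per query
distances, ids = index.search(xb[:5], 10)
```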

    IVF-PQ (Product Quantization)

    Original Vector (128 dim):
    [0.1, 0.2, ..., 0.9]  (128 × 4 bytes = 512 bytes)
    
    PQ Compressed (8 subvectors, 8-bit codes):
    [23, 45, 12, 89, 56, 34, 78, 90]  (8 bytes)
    
    Memory reduction: 64x
    Accuracy trade-off: ~5% recall drop
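
In FAISS this corresponds to IndexIVFPQ; with m=8 subquantizers at 8 bits each, every vector is stored as the 8 one-byte codes shown above (a sketch with illustrative sizes):

```python
import faiss
import numpy as np

d = 128
xb = np.random.rand(200_000, d).astype("float32")

nlist, m, nbits = 256, 8, 8             # 8 subvectors, 2^8 codes each
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                         # learns centroids + PQ codebooks
index.add(xb)                           # stores 8-byte codes, not raw floats
index.nprobe = 16
distances, ids = index.search(xb[:5], 10)
```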
    

    Algorithm Comparison

| Algorithm | Search Speed | Memory | Build Time | Recall |
|---|---|---|---|---|
| Flat (brute force) | Slow, O(n) | Low | None | 100% |
| IVF | Fast | Low | Medium | 90-95% |
| IVF-PQ | Very fast | Very low | Medium | 85-92% |
| HNSW | Very fast | High | Slow | 95-99% |
| HNSW+PQ | Very fast | Medium | Slow | 90-95% |

    When to Use Which

    < 100K vectors:
    └── Flat index (exact search is fast enough)
    
    100K - 1M vectors:
    └── HNSW (best recall/speed trade-off)
    
    1M - 100M vectors:
    ├── Memory available → HNSW
    └── Memory constrained → IVF-PQ or HNSW+PQ
    
    > 100M vectors:
    └── Sharded IVF-PQ or distributed HNSW
    

    Distance Metrics

    Common Metrics

| Metric | Formula | Range | Best For |
|---|---|---|---|
| Cosine similarity | A·B / (‖A‖ ‖B‖) | [-1, 1] | Normalized embeddings |
| Dot product | A·B | (-∞, ∞) | When magnitude matters |
| Euclidean (L2) | √Σ(Aᵢ − Bᵢ)² | [0, ∞) | Absolute distances |
| Manhattan (L1) | Σ\|Aᵢ − Bᵢ\| | [0, ∞) | High-dimensional sparse data |

    Metric Selection

    Embeddings pre-normalized (unit vectors)?
    ├── Yes → Cosine = Dot Product (use Dot, faster)
    └── No
        └── Magnitude meaningful?
            ├── Yes → Dot Product
            └── No → Cosine Similarity
    
    Note: Most embedding models output normalized vectors
          → Dot product is usually the best choice
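
A quick numpy check of the note above: on unit vectors, cosine equals dot product, and squared L2 distance is a monotone function of it (‖A−B‖² = 2 − 2A·B), so all three metrics produce the same ranking:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=384), rng.normal(size=384)      # unnormalized

cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = a_hat @ b_hat                                     # dot on unit vectors
l2_squared = np.sum((a_hat - b_hat) ** 2)

assert np.isclose(cosine, dot)                          # cosine == dot
assert np.isclose(l2_squared, 2 - 2 * dot)              # same ranking under L2
```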
    

    Filtering and Hybrid Search

    Pre-filtering vs Post-filtering

    Pre-filtering (Filter → Search):
    ┌─────────────────────────────────────────┐
    │ 1. Apply metadata filter               │
    │    (category = "electronics")           │
    │    Result: 10K of 1M vectors           │
    │                                         │
    │ 2. Vector search on 10K vectors        │
    │    Much faster, guaranteed filter match │
    └─────────────────────────────────────────┘
    
    Post-filtering (Search → Filter):
    ┌─────────────────────────────────────────┐
    │ 1. Vector search on 1M vectors         │
    │    Return top-1000                      │
    │                                         │
    │ 2. Apply metadata filter               │
    │    May return < K results!             │
    └─────────────────────────────────────────┘
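
As one concrete example of the pre-filtering style, Qdrant applies the filter during index traversal, so the condition is part of the query itself (a sketch using `qdrant-client`; the URL, collection name, and payload field are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="products",
    query_vector=[0.1] * 384,           # embedding of the query text
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="electronics"))]
    ),
    limit=10,                           # every hit satisfies the filter
)
```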
    

    Hybrid Search Architecture

    Query: "wireless headphones under $100"
               │
         ┌─────┴─────┐
         ▼           ▼
     ┌───────┐  ┌───────┐
     │Vector │  │Filter │
     │Search │  │ Build │
     │"wire- │  │price  │
     │less   │  │< 100  │
     │head-  │  │       │
     │phones"│  │       │
     └───────┘  └───────┘
         │           │
         └─────┬─────┘
               ▼
        ┌───────────┐
        │  Combine  │
        │  Results  │
        └───────────┘
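
One common implementation of the Combine step is reciprocal rank fusion (RRF), which merges ranked lists without needing comparable scores. A minimal sketch (the k=60 constant and the IDs are illustrative):

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked ID lists into one ranking, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)  # higher rank, bigger share
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["p7", "p2", "p9"]        # from the vector-search branch
filter_hits = ["p2", "p7", "p4"]        # from the keyword/filter branch
print(rrf([vector_hits, filter_hits]))  # fused ranking, best first
```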
    

    Metadata Index Design

| Metadata Type | Index Strategy | Query Example |
|---|---|---|
| Categorical | Bitmap/hash index | category = "books" |
| Numeric range | B-tree | price BETWEEN 10 AND 50 |
| Keyword search | Inverted index | tags CONTAINS "sale" |
| Geospatial | R-tree/geohash | location NEAR (lat, lng) |

    Scaling Strategies

    Sharding Approaches

    Naive Sharding (by ID):
    ┌─────────┐ ┌─────────┐ ┌─────────┐
    │ Shard 1 │ │ Shard 2 │ │ Shard 3 │
    │ IDs 0-N │ │IDs N-2N │ │IDs 2N-3N│
    └─────────┘ └─────────┘ └─────────┘
    Query → Search ALL shards → Merge results
    
    Semantic Sharding (by cluster):
    ┌─────────┐ ┌─────────┐ ┌─────────┐
    │ Shard 1 │ │ Shard 2 │ │ Shard 3 │
    │ Tech    │ │ Health  │ │ Finance │
    │ docs    │ │ docs    │ │ docs    │
    └─────────┘ └─────────┘ └─────────┘
    Query → Route to relevant shard(s) → Faster!
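
Naive sharding implies scatter-gather at query time: every shard answers, and a coordinator merges the per-shard top-k lists. A toy sketch (the in-memory Shard class is a stand-in for real shard nodes):

```python
import heapq
import numpy as np

class Shard:
    """Toy in-memory shard: exact search over its slice of the corpus."""
    def __init__(self, ids, vectors):
        self.ids, self.vectors = ids, vectors

    def search(self, query, k):
        scores = self.vectors @ query
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), self.ids[i]) for i in top]

def search_all_shards(shards, query, k=10):
    """Scatter-gather: query every shard, keep the global top-k."""
    candidates = []
    for shard in shards:                  # in production: fan out in parallel
        candidates.extend(shard.search(query, k))
    return heapq.nlargest(k, candidates)  # merge by score, best first
```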
    

    Replication

    ┌─────────────────────────────────────────┐
    │              Load Balancer              │
    └─────────────────────────────────────────┘
             │           │           │
             ▼           ▼           ▼
        ┌─────────┐ ┌─────────┐ ┌─────────┐
        │Replica 1│ │Replica 2│ │Replica 3│
        │  (Read) │ │  (Read) │ │  (Read) │
        └─────────┘ └─────────┘ └─────────┘
             │           │           │
             └───────────┼───────────┘
                         │
                    ┌─────────┐
                    │ Primary │
                    │ (Write) │
                    └─────────┘
    

    Scaling Decision Matrix

| Scale (vectors) | Architecture | Replication |
|---|---|---|
| < 1M | Single node | Optional |
| 1-10M | Single node, more RAM | For HA |
| 10-100M | Sharded, few nodes | Required |
| 100M-1B | Sharded, many nodes | Required |
| > 1B | Sharded + tiered | Required |

    Performance Optimization

    Index Build Optimization

| Optimization | Description | Impact |
|---|---|---|
| Batch insertion | Insert in batches of 1K-10K | 10x faster |
| Parallel build | Multi-threaded index construction | 2-4x faster |
| Incremental index | Add to existing index | Avoids rebuild |
| GPU acceleration | Use GPU for training (IVF) | 10-100x faster |

    Query Optimization

| Optimization | Description | Impact |
|---|---|---|
| Warm cache | Keep index in memory | 10x latency reduction |
| Query batching | Batch similar queries | Higher throughput |
| Reduce dimensions | PCA, random projection | 2-4x faster |
| Early termination | Stop when "good enough" | Lower latency |

    Memory Optimization

    Memory per vector:
    ┌────────────────────────────────────────┐
    │ 1536 dims × 4 bytes = 6KB per vector   │
    │                                        │
    │ 1M vectors:                            │
    │   Raw: 6GB                             │
    │   + HNSW graph: +2-4GB (M-dependent)   │
    │   = 8-10GB total                       │
    │                                        │
    │ With PQ (64 subquantizers):            │
    │   1M vectors: ~64MB                    │
    │   = 100x reduction                     │
    └────────────────────────────────────────┘
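
The same arithmetic as a reusable estimator. The HNSW overhead default below is read off the box above (2-4 GB per 1M vectors, i.e. a few KB per vector) rather than derived; actual graph overhead depends on M and the engine:

```python
def index_memory_gb(n_vectors, dims, bytes_per_dim=4,
                    graph_bytes_per_vector=0, pq_bytes=None):
    """Rough serving-memory estimate, mirroring the box above."""
    if pq_bytes is not None:              # PQ stores only short codes
        return n_vectors * pq_bytes / 1e9
    per_vector = dims * bytes_per_dim + graph_bytes_per_vector
    return n_vectors * per_vector / 1e9

print(index_memory_gb(1_000_000, 1536))                               # ~6.1 GB raw
print(index_memory_gb(1_000_000, 1536, graph_bytes_per_vector=3000))  # ~9.1 GB with HNSW
print(index_memory_gb(1_000_000, 1536, pq_bytes=64))                  # ~0.06 GB with PQ
```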
    

    Operational Considerations

    Backup and Recovery

| Strategy | Description | RPO/RTO |
|---|---|---|
| Snapshots | Periodic full backup | Hours |
| WAL replication | Write-ahead log streaming | Minutes |
| Real-time sync | Synchronous replication | Seconds |

    Monitoring Metrics

| Metric | Description | Alert Threshold |
|---|---|---|
| Query latency (p99) | 99th percentile latency | > 100ms |
| Recall | Search accuracy | < 90% |
| QPS | Queries per second | Capacity dependent |
| Memory usage | Index memory | > 80% |
| Index freshness | Time since last update | Domain dependent |

    Index Maintenance

    ┌─────────────────────────────────────────┐
    │        Index Maintenance Tasks          │
    ├─────────────────────────────────────────┤
    │ • Compaction: Merge small segments      │
    │ • Reindex: Rebuild degraded index       │
    │ • Vacuum: Remove deleted vectors        │
    │ • Optimize: Tune parameters             │
    │                                         │
    │ Schedule during low-traffic periods     │
    └─────────────────────────────────────────┘
    

    Common Patterns

    Multi-Tenant Vector Search

    Option 1: Namespace/Collection per tenant
    ┌─────────────────────────────────────────┐
    │ tenant_1_collection                     │
    │ tenant_2_collection                     │
    │ tenant_3_collection                     │
    └─────────────────────────────────────────┘
    Pro: Complete isolation
    Con: Many indexes, operational overhead
    
    Option 2: Single collection + tenant filter
    ┌─────────────────────────────────────────┐
    │ shared_collection                       │
    │   metadata: { tenant_id: "..." }        │
    │   Pre-filter by tenant_id               │
    └─────────────────────────────────────────┘
    Pro: Simpler operations
    Con: Requires efficient filtering
    

    Real-Time Updates

    Write Path:
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │   Write     │    │   Buffer    │    │   Merge     │
    │   Request   │───▶│   (Memory)  │───▶│   to Index  │
    └─────────────┘    └─────────────┘    └─────────────┘
    
    Strategy:
    1. Buffer writes in memory
    2. Periodically merge to main index
    3. Search: main index + buffer
    4. Compact periodically
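
A minimal sketch of steps 1-3 (the `main_index` object is a hypothetical stand-in exposing `add_batch` and `search`; the buffer is scanned exactly, which is affordable because it stays small):

```python
import numpy as np

def exact_scan(buffer, query, k):
    """Brute-force dot-product scan over the small write buffer."""
    scored = [(float(np.dot(vec, query)), doc_id) for doc_id, vec in buffer]
    return sorted(scored, reverse=True)[:k]

class BufferedIndex:
    def __init__(self, main_index, flush_at=10_000):
        self.main = main_index             # hypothetical ANN index
        self.buffer = []                   # (doc_id, vector) pairs
        self.flush_at = flush_at

    def add(self, doc_id, vector):         # step 1: buffer writes in memory
        self.buffer.append((doc_id, vector))
        if len(self.buffer) >= self.flush_at:
            self.main.add_batch(self.buffer)   # step 2: merge to main index
            self.buffer.clear()

    def search(self, query, k=10):         # step 3: main index + buffer
        hits = list(self.main.search(query, k))
        hits += exact_scan(self.buffer, query, k)
        return sorted(hits, reverse=True)[:k]  # (score, doc_id), best first
```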
    

    Embedding Versioning

    Version 1 embeddings ──┐
                           │
    Version 2 embeddings ──┼──▶ Parallel indexes during migration
                           │
                           │    ┌─────────────────────┐
                           └───▶│ Gradual reindexing  │
                                │ Blue-green switch   │
                                └─────────────────────┘
    

    Cost Estimation

    Storage Costs

    Monthly cost = (vectors × dims × bytes_per_dim × replicas) ÷ 10⁹ bytes/GB × $/GB/month
    
    Example:
    10M vectors × 1536 dims × 4 bytes × 3 replicas ≈ 184 GB
    At $0.10/GB/month ≈ $18.40/month for storage
    
    Note: Memory (for serving) costs more than storage
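
The formula as a one-liner for quick what-ifs (same illustrative numbers as above):

```python
def storage_cost_per_month(vectors, dims, replicas, price_per_gb,
                           bytes_per_dim=4):
    gb = vectors * dims * bytes_per_dim * replicas / 1e9
    return gb * price_per_gb

print(storage_cost_per_month(10_000_000, 1536, 3, 0.10))  # ~18.4 ($/month)
```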
    

    Compute Costs

    Factors:
    • QPS (queries per second)
    • Latency requirements
    • Index type (HNSW needs more RAM)
    • Filtering complexity
    
    Rule of thumb:
    • 1M vectors, HNSW, <50ms latency: 16GB RAM
    • 10M vectors, HNSW, <50ms latency: 64-128GB RAM
    • 100M vectors: Distributed system required
    

    Related Skills

    • rag-architecture - Using vector databases in RAG systems
    • llm-serving-patterns - LLM inference with vector retrieval
    • ml-system-design - End-to-end ML pipeline design
    • estimation-techniques - Capacity planning for vector systems

    Version History

    • v1.0.0 (2025-12-26): Initial release - Vector database patterns for systems design

    Last Updated

    Date: 2025-12-26
