Optimize full text search with BM25

Tiger Cloud: Performance, Scale, Enterprise, Free

Self-hosted products

MST

Postgres full-text search at scale consistently hits a wall where performance degrades catastrophically. Tiger Data's pg_textsearch brings modern BM25-based full-text search directly into Postgres, with a memtable architecture for efficient indexing and ranking. pg_textsearch integrates seamlessly with SQL and provides better search quality and performance than the Postgres built-in full-text search.

BM25 scores in pg_textsearch are returned as negative values, where lower (more negative) numbers indicate better matches. pg_textsearch implements the following:

Corpus-aware ranking: BM25 uses inverse document frequency to weight rare terms higher
Term frequency saturation: prevents documents with excessive term repetition from dominating results
Length normalization: adjusts scores based on document length relative to corpus average
Relative ranking: focuses on rank order rather than absolute score values

This page shows you how to install pg_textsearch, configure BM25 indexes, and optimize your search capabilities using the following best practice:

Memory planning: size your index_memory_limit based on corpus vocabulary and document count
Language configuration: choose appropriate text search configurations for your data language
Hybrid search: combine with pgvector or pgvectorscale for applications requiring both semantic and keyword search
Query optimization: use score thresholds to filter low-relevance results
Index monitoring: regularly check index usage and memory consumption

Early access: October 2025 this preview release is designed for development and staging environments. It is not recommended for use with hypertables.

Prerequisites

To follow the steps on this page:

Create a target Tiger Cloud service with the Real-time analytics capability.
You need your connection details. This procedure also works for self-hosted TimescaleDB.

Install pg_textsearch

To install this Postgres extension:

Connect to your Tiger Cloud service
In Tiger Cloud Console open an SQL editor. You can also connect to your service using psql.
Enable the extension on your Tiger Cloud service
- For new services, simply enable the extension:
```
CREATE EXTENSION pg_textsearch;
```
- For existing services, update your instance, then enable the extension:
  The extension may not be available until after your next scheduled maintenance window. To pick up the update immediately, manually pause and restart your service.

Verify the installation


SELECT * FROM pg_extension WHERE extname = 'pg_textsearch';

You have installed pg_textsearch on Tiger Cloud.

Create BM25 indexes on your data

BM25 indexes provide modern relevance ranking that outperforms Postgres's built-in ts_rank functions by using corpus statistics and better algorithmic design.

To create a BM25 index with pg_textsearch:

Create a table with text content


CREATE TABLE products (
    id serial PRIMARY KEY,
    name text,
    description text,
    category text,
    price numeric
);

Insert sample data


INSERT INTO products (name, description, category, price) VALUES
('Mechanical Keyboard', 'Durable mechanical switches with RGB backlighting for gaming and productivity', 'Electronics', 149.99),
('Ergonomic Mouse', 'Wireless mouse with ergonomic design to reduce wrist strain during long work sessions', 'Electronics', 79.99),
('Standing Desk', 'Adjustable height desk for better posture and productivity throughout the workday', 'Furniture', 599.99);

Create a BM25 index


CREATE INDEX products_search_idx ON products
USING bm25(description)
WITH (text_config='english');

BM25 supports single-column indexes only.

You have created a BM25 index for full-text search.

Optimize search queries for performance

Use efficient query patterns to leverage BM25 ranking and optimize search performance.

Perform ranked searches using the distance operator


SELECT name, description,
       description <@> to_bm25query('ergonomic work', 'products_search_idx') as score
FROM products
ORDER BY description <@> to_bm25query('ergonomic work', 'products_search_idx')
LIMIT 3;

Filter results by score threshold


SELECT name,
       description <@> to_bm25query('wireless', 'products_search_idx') as score
FROM products
WHERE description <@> to_bm25query('wireless', 'products_search_idx') < -2.0;

Combine with standard SQL operations


SELECT category, name,
       description <@> to_bm25query('ergonomic', 'products_search_idx') as score
FROM products
WHERE price < 500
  AND description <@> to_bm25query('ergonomic', 'products_search_idx') < -1.0
ORDER BY description <@> to_bm25query('ergonomic', 'products_search_idx')
LIMIT 5;

Verify index usage with EXPLAIN


EXPLAIN SELECT * FROM products
ORDER BY description <@> to_bm25query('wireless keyboard', 'products_search_idx')
LIMIT 5;

You have optimized your search queries for BM25 ranking.

Build hybrid search with semantic and keyword search

Combine pg_textsearch with pgvector or pgvectorscale to build powerful hybrid search systems that use both semantic vector search and keyword BM25 search.

Enable the vectorscale extension on your Tiger Cloud service
```
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
```

Create a table with both text content and vector embeddings


CREATE TABLE articles (
    id serial PRIMARY KEY,
    title text,
    content text,
    embedding vector(1536)  -- OpenAI ada-002 embedding dimension
);

Create indexes for both search types


-- Vector index for semantic search
CREATE INDEX articles_embedding_idx ON articles
USING hnsw (embedding vector_cosine_ops);

-- Keyword index for BM25 search
CREATE INDEX articles_content_idx ON articles
USING bm25(content)
WITH (text_config='english');

Perform hybrid search using reciprocal rank fusion


WITH vector_search AS (
  SELECT id,
         ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector) AS rank
  FROM articles
  ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
  LIMIT 20
),
keyword_search AS (
  SELECT id,
         ROW_NUMBER() OVER (ORDER BY content <@> to_bm25query('query performance', 'articles_content_idx')) AS rank
  FROM articles
  ORDER BY content <@> to_bm25query('query performance', 'articles_content_idx')
  LIMIT 20
)
SELECT a.id,
       a.title,
       COALESCE(1.0 / (60 + v.rank), 0.0) + COALESCE(1.0 / (60 + k.rank), 0.0) AS combined_score
FROM articles a
LEFT JOIN vector_search v ON a.id = v.id
LEFT JOIN keyword_search k ON a.id = k.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT 10;

Adjust relative weights for different search types


WITH vector_search AS (
  SELECT id,
         ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector) AS rank
  FROM articles
  ORDER BY embedding <=> '[0.1, 0.2, 0.3]'::vector
  LIMIT 20
),
keyword_search AS (
  SELECT id,
         ROW_NUMBER() OVER (ORDER BY content <@> to_bm25query('query performance', 'articles_content_idx')) AS rank
  FROM articles
  ORDER BY content <@> to_bm25query('query performance', 'articles_content_idx')
  LIMIT 20
)
SELECT
    a.id,
    a.title,
    0.7 * COALESCE(1.0 / (60 + v.rank), 0.0) +  -- 70% weight to vectors
    0.3 * COALESCE(1.0 / (60 + k.rank), 0.0)    -- 30% weight to keywords
AS combined_score
FROM articles a
LEFT JOIN vector_search v ON a.id = v.id
LEFT JOIN keyword_search k ON a.id = k.id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY combined_score DESC
LIMIT 10;

You have implemented hybrid search combining semantic and keyword search.

Configuration options

Customize pg_textsearch behavior for your specific use case and data characteristics.

Configure the memory limit
The size of the memtable depends primarily on the number of distinct terms in your corpus. A corpus with longer documents or more varied vocabulary requires more memory per document.
```
-- Set memory limit per index (default 64MB)
SET pg_textsearch.index_memory_limit = '128MB';
```

Configure language-specific text processing


-- French language configuration
CREATE INDEX products_fr_idx ON products_fr
USING pg_textsearch(description)
WITH (text_config='french');

-- Simple tokenization without stemming
CREATE INDEX products_simple_idx ON products
USING pg_textsearch(description)
WITH (text_config='simple');

Tune BM25 parameters


-- Adjust term frequency saturation (k1) and length normalization (b)
CREATE INDEX products_custom_idx ON products
USING bm25(description)
WITH (text_config='english', k1=1.5, b=0.8);

Monitor index usage and memory consumption

Check index usage statistics


SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE indexrelid::regclass::text ~ 'bm25';

View detailed index information


SELECT bm25_debug_dump_index('products_search_idx');

You have configured pg_textsearch for optimal performance. For production applications, consider implementing result caching and pagination to improve user experience with large result sets.

Current limitations

This preview release focuses on core BM25 functionality. It has the following limitations:

Memory-only storage: indexes are limited by pg_textsearch.index_memory_limit (default 64MB)
No phrase queries: cannot search for exact multi-word phrases yet

These limitations will be addressed in upcoming releases with disk-based segments and expanded query capabilities.

Related Content