Back

Keyword Search: The Cornerstone of Information Retrieval

Keywords:
  1. Keyword Search, at its core, is an information retrieval technique that locates relevant data based on specific words or phrases (i.e., keywords) entered by the user. It stands as the most foundational and widely used method across search engines, databases, and various digital information systems globally.

    I. Core Principles and Mechanism

    The fundamental process of keyword search relies on matching the user's input with the content within documents. While complex optimizations exist, the basic workflow includes:

    1. Indexing: Building the Inverted Index

    Before searching, all documents must be processed to build an Inverted Index. This index maps words to the documents in which they appear, ensuring rapid lookups.

    • Example: Word A -> Doc 1, Doc 5, Doc 10
    1. Text Preprocessing

    Both the documents and the user query undergo normalization to ensure consistency and accuracy:

    • Tokenization: Breaking text into individual word units.
    • Normalization: Unifying case (e.g., lowercase).
    • Stop Word Removal: Eliminating common, low-value words (e.g., "the," "is," "a").
    • Stemming/Lemmatization: Reducing words to their base or root form (e.g., running -> run).
    1. Retrieval and Ranking

    After matching keywords against the index, documents are scored to determine relevance. The classic ranking algorithm is TF-IDF (Term Frequency-Inverse Document Frequency):

    • Term Frequency (TF): How often the keyword appears in the current document.
    • Inverse Document Frequency (IDF): The rarity of the keyword across all documents. Rarer words receive higher weight.

    Score=tQTF(t,d)×IDF(t)Score = \sum_{t \in Q} TF(t, d) \times IDF(t)

    II. Advantages and Limitations

    FeatureAdvantages (Pros)Limitations (Cons)
    SpeedExtremely fast retrieval due to the inverted index structure, ideal for massive datasets.Lacks Semantic Understanding
    SimplicityEasy to implement and use; requires no complex query language from the user.Synonym Problem
    VersatilityApplicable across text, code, and structured data.Ambiguity (Polysemy)

    III. Evolution: Modern Search and Hybrid Retrieval

    To overcome the semantic limitations of traditional keyword matching, modern search systems have adopted advanced techniques:

    1. Query Expansion: Automatically adding synonyms, related terms, and spelling variations to the user's query to increase recall.
    2. Vector Search (Semantic Search): Transforming both queries and documents into high-dimensional numerical embeddings (vectors) that capture their meaning. Relevance is determined by the similarity (e.g., Cosine Similarity) between these vectors, allowing the system to understand intent rather than just literal word matches.
    3. Hybrid Search****: Combining the speed of traditional Keyword Search (exact matching) with the accuracy of Vector Search (semantic matching) to achieve the best of both worlds.

    VeloDB****, a specialized database for search, supports Hybrid Retrieval, seamlessly integrating both keyword-based and vector-based search methods to deliver more relevant and comprehensive results.

    VeloDB Keyword Search

    Principle

    Inverted Index: Maps terms to document ID lists, enabling fast keyword queries.

    "Machine Learning" → Tokenization → ["Machine", "Learning"] → Inverted Index → [doc1, doc2, ...]
    

    Two Modes:

    • Tokenized Index: Text tokenized before indexing, supports fuzzy matching + BM25 scoring
    • Untokenized Index: Whole-value indexing, exact matching

    glossary_information_retrieval_bc81ecf75d

    Usage

    -- 1. Create Index
    CREATE TABLE docs (
        id INT,
        content TEXT,
        status VARCHAR(20),
        INDEX idx_content (content) USING INVERTED PROPERTIES("parser"="standard"), -- Tokenized
        INDEX idx_status (status) USING INVERTED  -- Untokenized (default)
    ) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 3;
    
    -- 2. Exact Query (Untokenized Index)
    SELECT * FROM docs WHERE status = 'ACTIVE';
    
    -- 3. Full-Text Search (Tokenized Index)
    SELECT * FROM docs WHERE content MATCH_ALL 'machine learning';
    
    -- 4. BM25 Scoring & Ranking (Tokenized Index Only)
    SELECT id, content, BM25() as score 
    FROM docs 
    WHERE content MATCH_ALL 'deep learning'
    ORDER BY score DESC;
    

    Comparison

    Untokenized IndexTokenized Index
    ParserNonestandard, english, chinese
    Querystatus = 'ACTIVE'content MATCH 'learning'
    BM25NoYes
    Use CaseStatus, Tags, IDsArticles, Comments

    Summary: VeloDB Inverted Index supports both tokenized and untokenized modes with BM25 relevance scoring.