Glossary

Merge on Write is a primary key model implementation in VeloDB. It generates a Delete Bitmap at write time that marks superseded row versions, so queries can skip them without extra merging, delivering sub-second query latency for real-time analytics, reporting, and Web3 data analysis.
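A minimal Python sketch of the idea (the class and field names are invented for illustration): marking older versions of a key at write time lets a scan simply skip bitmap-marked rows, with no merge work at query time.

```python
# Simplified sketch of Merge-on-Write: each write of an existing key marks
# the previous version of that row in a delete bitmap; reads skip marked rows.

class MowTable:
    def __init__(self):
        self.rows = []              # append-only storage: (key, value)
        self.latest = {}            # key -> index of the newest row
        self.delete_bitmap = set()  # indexes of superseded rows

    def write(self, key, value):
        if key in self.latest:
            # Mark the previous version as deleted at write time.
            self.delete_bitmap.add(self.latest[key])
        self.rows.append((key, value))
        self.latest[key] = len(self.rows) - 1

    def scan(self):
        # Queries skip bitmap-marked rows; no merge-on-read is needed.
        return [(k, v) for i, (k, v) in enumerate(self.rows)
                if i not in self.delete_bitmap]

t = MowTable()
t.write("user1", 10)
t.write("user2", 20)
t.write("user1", 30)   # supersedes the first user1 row
print(t.scan())        # [('user2', 20), ('user1', 30)]
```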
In the rapidly growing Web3 ecosystem, data has become the most critical asset. From on-chain transaction analysis, DeFi protocol monitoring, and NFT marketplace insights to off-chain user behavior analytics, observability, and A/B testing, Web3 companies face ever-increasing demand for real-time, sub-second-latency, cost-efficient data infrastructure.
The Model Context Protocol (MCP) is an open standard that enables large language models (LLMs) to dynamically interact with external tools, databases, and APIs through a standardized interface. Introduced by Anthropic in November 2024 and later adopted by OpenAI, MCP provides a universal, open standard for connecting AI systems with data sources, replacing fragmented integrations with a single protocol that enables seamless, secure, and scalable AI workflows.
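As a rough illustration, MCP messages are JSON-RPC 2.0; a tool invocation might take a shape like the following (the tool name and arguments here are invented for the example):

```python
import json

# Illustrative shape of an MCP tool invocation. MCP transports carry
# JSON-RPC 2.0 messages; "tools/call" is the method a client uses to
# invoke a server-provided tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",          # hypothetical tool name
        "arguments": {"sql": "SELECT 1"},  # hypothetical tool input
    },
}

# Serialize for sending over an MCP transport (e.g. stdio or HTTP).
wire_message = json.dumps(request)
```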
LLM Observability is the comprehensive practice of monitoring, tracking, and analyzing the behavior, performance, and outputs of Large Language Models (LLMs) throughout their entire lifecycle from development to production. It provides real-time visibility into every layer of LLM-based systems, enabling organizations to understand not just what is happening with their AI models, but why specific behaviors occur, ensuring reliable, safe, and cost-effective AI operations.
Logstash is an open-source server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to your favorite "stash." As a core component of the ELK Stack (Elasticsearch, Logstash, and Kibana), Logstash serves as the data collection and log-parsing engine that helps organizations centralize, transform, and route data for analysis and storage.
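An illustrative Logstash pipeline configuration showing the ingest-transform-route flow (the file path, grok pattern, and output hosts are placeholders):

```conf
# Minimal Logstash pipeline: read a log file, parse each line, ship to Elasticsearch.
input {
  file {
    path => "/var/log/app/app.log"
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```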
Kibana is a powerful open-source data visualization and exploration platform designed to work seamlessly with Elasticsearch as part of the Elastic Stack. Originally developed by Elasticsearch (now Elastic) in 2013, Kibana has evolved into the leading solution for creating interactive dashboards, real-time data visualization, and comprehensive log analysis. As organizations generate massive volumes of structured and unstructured data across distributed systems, cloud platforms, and applications, Kibana serves as the visual interface that transforms raw Elasticsearch data into actionable insights. Its interactive dashboards, advanced visualizations, and real-time monitoring capabilities enable data-driven decision making across security, operations, business intelligence, and application performance monitoring use cases.
Filebeat is a lightweight log shipper designed to efficiently forward and centralize log data as part of the Elastic Stack ecosystem. Originally developed by Elastic, Filebeat belongs to the Beats family of data shippers and serves as a crucial component in modern log management pipelines. As organizations increasingly deploy distributed systems, microservices, and cloud-native applications that generate massive volumes of log data across multiple servers and containers, Filebeat provides a reliable, resource-efficient solution for collecting, processing, and forwarding log files to centralized destinations like Elasticsearch, Logstash, or other data processing systems. Unlike heavyweight log collection tools, Filebeat is specifically designed to consume minimal system resources while maintaining high reliability and performance in production environments.
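A minimal, illustrative Filebeat configuration that tails log files and forwards them to Elasticsearch (the paths and hosts are placeholders):

```yaml
# filebeat.yml sketch: watch application logs, ship to a local Elasticsearch.
filebeat.inputs:
  - type: filestream
    id: app-logs
    paths:
      - /var/log/app/*.log

output.elasticsearch:
  hosts: ["localhost:9200"]
```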
The ELK Stack is a powerful collection of three open-source tools—Elasticsearch, Logstash, and Kibana—designed to provide comprehensive log management, search, analysis, and visualization capabilities. Originally developed by Elastic, this stack has become the de facto standard for centralized logging, observability, and security information and event management (SIEM) across modern IT infrastructures. As organizations increasingly adopt microservices architectures, cloud-native deployments, and distributed systems, the ELK Stack provides essential capabilities for aggregating, processing, and analyzing the massive volumes of log data generated by applications, servers, and network devices to maintain operational visibility and troubleshoot complex issues.
OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are two fundamental paradigms in database systems that serve distinctly different purposes in modern data architecture. OLTP systems are designed to handle high-frequency, real-time transactional operations with an emphasis on data consistency, speed, and concurrent user access for operational business processes. In contrast, OLAP systems are optimized for complex analytical queries, data aggregation, and business intelligence operations that process large volumes of historical data to generate insights. Understanding the differences between these systems is crucial for designing effective data architectures that support both operational efficiency and strategic decision-making in today's data-driven organizations.
Real-time analytics is the ability to process, analyze, and derive insights from data immediately as it arrives, allowing organizations to make instantaneous decisions based on current information. Unlike traditional batch processing, which analyzes historical data hours or days after collection, real-time analytics provides sub-second to sub-minute insights from streaming data sources such as user interactions, IoT sensors, financial transactions, and application logs. As businesses increasingly require immediate responses to changing conditions—from fraud detection and dynamic pricing to personalized recommendations and operational monitoring—real-time analytics has become a strategic necessity for maintaining a competitive edge in today's fast-paced digital economy.
An inverted index is a fundamental data structure used in information retrieval systems and search engines to enable fast full-text search capabilities. Unlike a regular index that maps document IDs to their content, an inverted index reverses this relationship by mapping each unique word or term to a list of documents containing that term. This "inversion" allows search engines to quickly identify which documents contain specific search terms without scanning through entire document collections. Inverted indexes are the backbone of modern search technologies, powering everything from web search engines like Google to database full-text search capabilities in systems like Apache Doris, ClickHouse, and Elasticsearch.
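A toy Python sketch of the structure: each term maps to the set of document IDs containing it, and a multi-term query intersects those posting sets instead of scanning every document.

```python
from collections import defaultdict

# Build a toy inverted index: term -> set of document IDs containing it.
docs = {
    1: "real time analytics with apache doris",
    2: "full text search uses an inverted index",
    3: "apache doris supports inverted index search",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

def search(*terms):
    # AND semantics: intersect the posting sets of all query terms.
    sets = [inverted.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

print(search("inverted", "index"))   # [2, 3]
print(search("apache", "search"))    # [3]
```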
A Cost-Based Optimizer (CBO) represents a sophisticated query optimization framework designed to maximize database performance by systematically evaluating multiple potential execution plans and selecting the one with the lowest estimated computational cost. In contrast to traditional rule-based optimizers, which depend on fixed heuristic rules, the CBO leverages comprehensive statistical metadata—including data distribution, table cardinality, and index availability—to make context-aware, data-driven optimization decisions.
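A simplified cost-model sketch in Python (the cost constants are invented) showing how an optimizer might compare a full scan against an index lookup using estimated selectivity:

```python
# Toy cost model: pick between a full scan and an index lookup based on
# estimated selectivity, the way a cost-based optimizer compares plans.
def plan_costs(table_rows, selectivity, index_available,
               row_cost=1.0, index_probe_cost=3.0):
    plans = {"full_scan": table_rows * row_cost}
    if index_available:
        # An index is cheap for selective predicates, expensive otherwise.
        plans["index_lookup"] = table_rows * selectivity * index_probe_cost
    return plans

def choose_plan(table_rows, selectivity, index_available):
    costs = plan_costs(table_rows, selectivity, index_available)
    return min(costs, key=costs.get)

print(choose_plan(1_000_000, 0.001, True))   # index_lookup
print(choose_plan(1_000_000, 0.8, True))     # full_scan
```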
RAG (Retrieval-Augmented Generation) is an AI framework that enhances large language models (LLMs) by combining them with external knowledge retrieval systems. This architecture allows LLMs to access up-to-date, domain-specific information from external databases, documents, or knowledge bases during the generation process, significantly improving the accuracy, relevance, and factuality of AI-generated responses.
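A toy Python sketch of the flow (word-overlap retrieval stands in for embedding search, and the LLM call itself is omitted): retrieve relevant passages, then assemble them into a grounded prompt.

```python
# Toy RAG pipeline: retrieve the most relevant passages by word overlap,
# then build a prompt that grounds the model's answer in that context.
knowledge_base = [
    "Apache Doris is an MPP-based real-time data warehouse.",
    "MCP is an open standard for connecting LLMs to external tools.",
    "Delta Lake adds ACID transactions on top of Parquet files.",
]

def tokens(text):
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(question, k=2):
    q = tokens(question)
    scored = sorted(knowledge_base,
                    key=lambda p: len(q & tokens(p)), reverse=True)
    return [p for p in scored[:k] if q & tokens(p)]

def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is Apache Doris?")
```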
Hybrid search is a powerful search approach that combines multiple search methodologies, primarily keyword-based (lexical) search and vector-based (semantic) search, to deliver more comprehensive and accurate search results. By leveraging the strengths of both exact term matching and semantic understanding, hybrid search provides users with relevant results that capture both literal matches and contextual meaning, significantly improving search precision and user satisfaction.
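A toy Python sketch of score fusion (the embeddings are hand-made; real systems would use BM25-style lexical scores and learned vectors): a weighted sum blends the lexical and semantic signals.

```python
import math

# Hybrid search sketch: fuse a keyword (lexical) score with a vector
# (semantic) score via a weighted sum.
docs = {
    "d1": {"text": "cheap red running shoes", "vec": [0.9, 0.1]},
    "d2": {"text": "affordable scarlet sneakers", "vec": [0.85, 0.15]},
    "d3": {"text": "laptop stand aluminium", "vec": [0.1, 0.9]},
}

def keyword_score(query, text):
    q, t = set(query.split()), set(text.split())
    return len(q & t) / len(q)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def hybrid_search(query, query_vec, alpha=0.5):
    # alpha balances lexical vs. semantic relevance.
    results = []
    for doc_id, d in docs.items():
        score = (alpha * keyword_score(query, d["text"])
                 + (1 - alpha) * cosine(query_vec, d["vec"]))
        results.append((doc_id, round(score, 3)))
    return sorted(results, key=lambda r: -r[1])

# "scarlet sneakers" has no lexical overlap with the query but still
# ranks above the unrelated doc thanks to its semantic score.
ranking = hybrid_search("red shoes", [0.88, 0.12])
```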
Vector search is a modern search technique that enables finding similar items by converting data into high-dimensional numerical representations called vectors or embeddings. Unlike traditional keyword-based search that matches exact terms, vector search understands semantic meaning and context, allowing users to find relevant content even when exact keywords don't match. This technology powers recommendation systems, similarity search, and AI applications by measuring mathematical distances between vectors in multi-dimensional space.
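A brute-force Python sketch over hand-made embeddings (real engines index millions of high-dimensional vectors with structures like HNSW or IVF): rank items by cosine similarity to the query vector.

```python
import math

# Brute-force nearest-neighbour search over toy 3-dimensional embeddings.
embeddings = {
    "apple pie recipe":    [0.9, 0.2, 0.1],
    "apple tart tutorial": [0.85, 0.25, 0.1],
    "car engine repair":   [0.1, 0.1, 0.95],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def vector_search(query_vec, k=2):
    # Score every item, keep the k most similar.
    scored = [(cosine_sim(query_vec, v), key) for key, v in embeddings.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]

nearest = vector_search([0.88, 0.22, 0.12])
```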
Semi-structured data is a form of data that sits between structured and unstructured data, containing some organizational properties without conforming to a rigid schema like traditional relational databases. This data format maintains partial organization through tags, metadata, and hierarchical structures while retaining flexibility for varied content representation. As organizations increasingly handle diverse data sources including web content, IoT device outputs, social media feeds, and API responses, semi-structured data has become fundamental to modern data management strategies. Unlike structured data that fits neatly into rows and columns, or unstructured data that lacks any organizational framework, semi-structured data provides a balance of flexibility and organization that enables efficient storage, processing, and analysis across distributed systems and cloud-native architectures.
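A small Python example of working with semi-structured JSON records, where fields are self-describing, nested, and optionally present:

```python
import json

# JSON events are semi-structured: self-describing keys, nested fields,
# and optional attributes that vary between records.
events = [
    '{"user": "alice", "action": "click", "meta": {"device": "mobile"}}',
    '{"user": "bob", "action": "view"}',  # "meta" is absent here
]

parsed = [json.loads(e) for e in events]

# Fields can be accessed when present and defaulted when not,
# without a rigid schema declared up front.
devices = [e.get("meta", {}).get("device", "unknown") for e in parsed]
print(devices)  # ['mobile', 'unknown']
```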
OpenTelemetry is a 100% free and open-source observability framework designed to provide comprehensive telemetry data collection, processing, and export capabilities for modern distributed systems. Born as a merger of OpenTracing and OpenCensus projects in 2019, OpenTelemetry has become the industry standard for observability instrumentation under the Cloud Native Computing Foundation (CNCF). As organizations increasingly adopt microservices, containerized applications, and cloud-native architectures, OpenTelemetry addresses the critical need for unified observability across complex distributed systems by providing standardized APIs, SDKs, and tools for generating, collecting, and exporting traces, metrics, and logs without vendor lock-in.
Grafana is an open-source analytics and monitoring platform that provides comprehensive data visualization, dashboards, and alerting capabilities for observability across modern IT infrastructure. Originally developed by Torkel Ödegaard in 2014, Grafana has evolved into the leading solution for creating interactive dashboards that unify metrics, logs, traces, and other data sources into coherent visual narratives.
A columnar database is a database management system that stores data organized by columns rather than by rows, fundamentally changing how information is physically stored and accessed on disk. Unlike traditional row-oriented databases where each record's data is stored together, columnar databases group all values for each column together, creating a storage structure optimized for analytical queries and data compression. This approach has become the cornerstone of modern data warehousing and business intelligence platforms, powering cloud analytics services like Amazon Redshift, Google BigQuery, and Snowflake.
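A toy Python contrast of the two layouts: the columnar form lets an aggregate read only the columns it needs, which is the core of the analytical advantage.

```python
# The same records in row-oriented and column-oriented layouts.
rows = [
    {"id": 1, "region": "EU", "sales": 100},
    {"id": 2, "region": "US", "sales": 250},
    {"id": 3, "region": "EU", "sales": 175},
]

# Columnar layout: one contiguous array per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate touches only the columns it needs ("region" and "sales"),
# never the unrelated "id" values.
eu_sales = sum(s for region, s in zip(columns["region"], columns["sales"])
               if region == "EU")
print(eu_sales)  # 275
```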
Apache Parquet is an open-source columnar storage format optimized for large-scale data processing and analytics. It's widely adopted across big data ecosystems, including Apache Hive, Spark, Doris, Trino, Presto, and many others.
Apache Paimon is a high-performance streaming-batch unified table format and data lake storage system designed specifically for real-time data processing. It supports transactions, consistent views, incremental read/write operations, and schema evolution, providing essential capabilities required by modern data lake architectures.
Apache ORC (Optimized Row Columnar) is an open-source columnar storage format optimized for large-scale data storage and analytics. Developed by Hortonworks in 2013, it has become an Apache top-level project and is widely used in big data ecosystems including Apache Hive, Spark, Presto, Trino, and more.
Apache Iceberg is an open-source large-scale analytical table format initiated by Netflix and donated to Apache, designed to address the limitations of traditional Hive table formats in consistency, performance, and metadata management. In today's lakehouse architectures with multi-engine concurrent access and frequent schema evolution, Iceberg provides ACID transactions, hidden partitioning, and time travel capabilities, making it highly sought after.
Apache Hudi (Hadoop Upserts Deletes Incrementals) is an open-source data lake platform originally developed by Uber; it became an Apache top-level project in 2019. By providing transactional capabilities, incremental processing, and consistency control for data lakes, it transforms traditional data lakes into modern lakehouses. As users increasingly focus on real-time capabilities, update functionality, and cost efficiency in massive data scenarios, Hudi emerges as the solution to address these pain points.
Apache Hive is a distributed, fault-tolerant data warehouse system built on Hadoop that supports reading, writing, and managing massive datasets (typically at petabyte scale) using HiveQL, an SQL-like language. As big data scales continue to grow exponentially, enterprises increasingly demand familiar SQL interfaces for processing enormous datasets. Hive emerged precisely to address this need, delivering tremendous productivity value.
Delta Lake is an open-source storage format that combines Apache Parquet files with a powerful metadata transaction log. It brings ACID transactions, consistency guarantees, and data versioning capabilities to data lakes. As large-scale data lakes have been widely deployed in enterprises, Parquet alone cannot address challenges such as performance bottlenecks, data inconsistency, and weak governance. Delta Lake emerged to address these challenges and has become an essential foundation for building modern lakehouse architectures.
Lakehouse (Data Lake + Data Warehouse) is a unified architecture that aims to provide data warehouse-level transactional capabilities, management capabilities, and query performance on top of a data lake foundation. It not only retains the low cost and flexibility of data lakes but also provides the consistency and high-performance analytical capabilities of data warehouses.
Apache Doris is an MPP-based real-time data warehouse known for its high query speed. For queries on large datasets, it returns results in under a second. It supports both high-concurrency point queries and high-throughput complex analysis. It can be used for report analysis, ad-hoc queries, unified data warehousing, and data lake query acceleration. Based on Apache Doris, users can build applications for user behavior analysis, A/B testing platforms, log analysis, user profiling, and e-commerce order analysis.
An analytics database is a specialized database management system optimized for Online Analytical Processing (OLAP), designed to handle complex queries, aggregations, and analytical workloads across large datasets. Unlike traditional transactional databases that focus on operational efficiency and data consistency, analytics databases prioritize query performance, data compression, and support for multidimensional analysis. Modern analytics databases leverage columnar storage, massively parallel processing (MPP) architectures, and vectorized execution engines to deliver sub-second response times on petabyte-scale datasets, making them essential for business intelligence, data science, and real-time decision-making applications.
A data warehouse is a centralized repository designed to store, integrate, and analyze large volumes of structured data from multiple sources within an organization. Unlike traditional databases optimized for transactional processing, data warehouses are specifically architected for analytical processing (OLAP), enabling complex queries and historical data analysis. In the modern data-driven landscape, data warehouses serve as the foundation for business intelligence, reporting, and decision-making processes across enterprises of all sizes.