
Scaling Observability: Why TiDB Moved from Prometheus to VictoriaMetrics

Engineering

Jack Ma

Professional Services Engineer


From the outset, Prometheus has served as a go-to tool for real-time performance metric collection, storage, querying, and observability in TiDB. Additionally, TiDB supports offline diagnostics using TiDB Clinic, which allows for replaying collected metrics to investigate historical issues.

Figure 1. How TiDB Clinic provides offline diagnostics.

However, as deployments scaled, so did the challenges of using Prometheus. This blog explores those growing pains and why we ultimately transitioned to VictoriaMetrics, a high-performance, open-source time series database and monitoring solution.

TiDB Observability: The Limitations of Prometheus at Scale

Pinterest is one of PingCAP’s largest enterprise customers, operating a TiDB cluster with over 700 nodes handling 700K+ QPS. At this scale, the team began to encounter monitoring issues: during diagnostic sessions with TiDB Clinic, Prometheus consistently crashed, adding operational burden and delaying incident resolution.

Scalability Issues

While working with the Pinterest team, PingCAP observed frequent out-of-memory (OOM) crashes in Prometheus, even on a high-end i4i.24xlarge instance (96 cores, 768 GB RAM). Query failures and long restart times significantly impacted the team’s ability to diagnose issues effectively.

Figure 2. A diagram representing Pinterest’s OOM crash using Prometheus.

OOM (Out of Memory) Problem

When executing large queries, Prometheus frequently ran out of memory and crashed. The restart process posed additional challenges, since Prometheus must replay its write-ahead log (WAL) before it can serve queries again.

The logs below show the crash-recovery duration in a test environment with 400 nodes:

WAL replay started at 22:52:07
WAL replay finished at 23:34:32 (about 42 minutes)

These limitations highlighted the need for a more scalable and resilient monitoring solution.

Query Performance

For large queries, we had to limit the time range to 15 minutes; otherwise, the query would either become slow or fail entirely.

A similar issue occurred with Clinic data collection: when attempting to retrieve one hour of metrics, the query would run for 40 minutes before eventually failing due to OOM.
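For reference, a time-boxed query of this kind issued through the Prometheus HTTP API looks roughly like the sketch below. The host, metric name, and timestamps are placeholders for illustration, not the exact queries we ran:

# Illustrative 15-minute window; host, metric, and timestamps are placeholders.
curl -sG 'http://{PROMETHEUS_URL}:9090/api/v1/query_range' \
    --data-urlencode 'query=sum(rate(tidb_server_query_total[1m]))' \
    --data-urlencode 'start=2025-01-01T00:00:00Z' \
    --data-urlencode 'end=2025-01-01T00:15:00Z' \
    --data-urlencode 'step=15s'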

Total Cost of Ownership (TCO)

With Prometheus, we had to allocate a large monitoring instance (i4i.24xlarge, 96 cores, 768GB RAM), yet it still struggled with stability and performance.

Why TiDB Switched to VictoriaMetrics for Enhanced Observability

To meet the evolving needs of our internal teams and cloud customers, we evaluated alternative time-series backends and ultimately migrated to VictoriaMetrics. Below are key reasons behind the switch and the concrete improvements that followed.

1. Better Resource Utilization

After migrating to VictoriaMetrics, we observed a significant reduction in resource consumption.

2. Improved Query Performance

3. Lower Resource Consumption and Improved TCO

After switching to VictoriaMetrics, Pinterest significantly reduced its resource consumption while improving stability. Additionally, better storage efficiency helped lower disk usage, making monitoring more cost-effective.

Overall, VictoriaMetrics provided greater stability, efficiency, and scalability, making it a more reliable solution for monitoring TiDB.

After validating the improvements, the Pinterest team successfully migrated to VictoriaMetrics.

TiDB Observability: Performance Test Results

To evaluate VictoriaMetrics’ impact, we conducted tests on Pinterest’s cluster using different configurations. The results showed that VictoriaMetrics significantly reduced resource usage and improved query performance; the key takeaways are summarized below.

Key Takeaways

  1. Prometheus consistently failed on KV requests, while VictoriaMetrics showed significant improvements, especially in the Release configuration (success in 3 mins).
  2. Prometheus struggled with gRPC request duration, failing even for a 30-minute or 1-hour query.
  3. VictoriaMetrics significantly improved gRPC query performance, reducing execution time from 8.4s (Default) to 6.5s (Release).
  4. Different VictoriaMetrics configurations (Default, Tuned, Release) adjusted parameters like maxQueryDuration and maxSeries, impacting performance and success rates.

A Smooth Migration Strategy

Given that the data retention period was 10 days, we took a gradual migration approach to minimize risks and ensure a smooth transition from Prometheus to VictoriaMetrics.

Step 1: Parallel Deployment for Observation

Step 2: Validation & Monitoring

Step 3: Final Cutover

This progressive migration strategy allowed us to ensure a stable transition while avoiding the risks of an abrupt switch.
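As part of the validation step, one simple sanity check is to run the same instant query against both backends’ Prometheus-compatible endpoints and compare the results. The sketch below uses placeholder hosts and is an illustration rather than the exact checks we performed:

# Illustrative parity check; hosts are placeholders.
# Existing Prometheus
curl -sG 'http://{PROMETHEUS_URL}:9090/api/v1/query' --data-urlencode 'query=count(up == 1)'
# New VictoriaMetrics
curl -sG 'http://{VICTORIA_URL}:8428/api/v1/query' --data-urlencode 'query=count(up == 1)'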

Configuration and Integration Considerations

Migrating from Prometheus to VictoriaMetrics required adjustments across several key components, including scrape configurations, discovery files, startup scripts, Grafana dashboards, and clinic integration.

Scrape Configuration & Discovery Files
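VictoriaMetrics reads a Prometheus-compatible scrape configuration through the -promscrape.config flag used in the startup commands below, so existing scrape jobs and file-based service discovery (file_sd_configs) carry over with little change. The snippet below is a minimal sketch of what such a vm.config might look like; the job names and discovery-file paths are illustrative, not our exact production layout:

# Illustrative vm.config; job names and discovery-file paths are placeholders.
cat > {PATH}/victoria-metrics-data/vm.config <<'EOF'
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "tidb"
    file_sd_configs:
      - files: ["/victoria-metrics-data/targets/tidb-*.json"]
  - job_name: "tikv"
    file_sd_configs:
      - files: ["/victoria-metrics-data/targets/tikv-*.json"]
  - job_name: "pd"
    file_sd_configs:
      - files: ["/victoria-metrics-data/targets/pd-*.json"]
EOF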

Startup Script Testing

We tested three different VictoriaMetrics startup configurations to balance query performance, resource limits, and stability. We ultimately chose the tuned configuration.

1.  Default Configuration (Baseline Setup)

docker run -it -v {PATH}/victoria-metrics-data:/victoria-metrics-data \
    --network host -p 8428:8428 victoriametrics/victoria-metrics:v1.106.1 \
    -retentionPeriod=10d \
    -promscrape.config=/victoria-metrics-data/vm.config \
    -promscrape.maxScrapeSize=400MB

2.  Tuned Configuration (Slight Limits Increase, Final Choice ✅)

docker run -it -v {PATH}/victoria-metrics-data:/victoria-metrics-data \
    --network host -p 8428:8428 victoriametrics/victoria-metrics:v1.106.1 \
    -search.maxSeries=5000000 \
    -search.maxLabelsAPISeries=5000000 \
    -search.maxQueryDuration=1m \
    -promscrape.config=/victoria-metrics-data/vm.config \
    -promscrape.maxScrapeSize=400MB \
    -search.maxSamplesPerQuery=1000000000 \
    -search.logSlowQueryDuration=30s \
    -retentionPeriod=10d

3.  Release Configuration (Aggressive Limits, Not Chosen)

docker run -it -v /mnt/docker/overlay2/victoria-metrics-data:/victoria-metrics-data \
    --network host -v /var/lib/normandie:/var/lib/normandie:ro,rslave \
    -p 8428:8428 victoriametrics/victoria-metrics:v1.106.1 \
    -search.maxSeries=50000000 \
    -search.maxLabelsAPISeries=50000000 \
    -search.maxQueryDuration=10m \
    -promscrape.config=/victoria-metrics-data/vm.config \
    -promscrape.maxScrapeSize=400MB \
    -search.maxSamplesPerQuery=10000000000 \
    -search.logSlowQueryDuration=30s \
    -retentionPeriod=10d

Clinic Command for Diagnostics

To collect TiDB Clinic diagnostics, we adjusted the command to point at VictoriaMetrics in place of Prometheus:

tiup diag util metricdump --name {cluster_name} --pd={PD_URL}:{PD_PORT} \
    --prometheus="{VICTORIA_URL}:8428" --from "-1h" --to "-0h"

Grafana Dashboard Adjustments
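Because VictoriaMetrics exposes a Prometheus-compatible query API on port 8428, the existing TiDB Grafana dashboards largely keep working once their Prometheus data source points at the new endpoint. A minimal provisioning sketch is shown below; the data source name and file path are assumptions for illustration:

# Illustrative Grafana data source provisioning; name and path are placeholders.
cat > /etc/grafana/provisioning/datasources/victoriametrics.yaml <<'EOF'
apiVersion: 1
datasources:
  - name: tidb-cluster
    type: prometheus
    access: proxy
    url: http://{VICTORIA_URL}:8428
    isDefault: true
EOF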

Final Thoughts: The Future of TiDB Observability

Observability remains a crucial pillar of maintaining a healthy, high-performance TiDB cluster, and the move to VictoriaMetrics has greatly improved its scalability, reliability, and efficiency.

Looking ahead, there is room for further enhancements to this setup.

VictoriaMetrics has proven to be a powerful foundation for TiDB observability at scale.

If you have any questions about TiDB’s approach to monitoring and observability, please feel free to connect with us on Twitter, LinkedIn, or through our Slack channel.

Tags: Database Monitoring, Distributed SQL, Observability, TiDB, VictoriaMetrics
