Request Tracing and Metrics

Tracing and Metrics provide fine-grained insight into how an application is performing, and helps to diagnose when it is not.

Health indicators can tell you a lot about the performance of an application. Monitoring them is vital both during its development and production lifecycle.

For a database, performance is best encapsulated via per-request performance. In addition to having insight down to the request level, it is also important to understand how the application behaves as a whole. For both the micro and macro-view, tracing and metrics are powerful tools in the toolbox.

Request Tracing allows you to analyze the latency characteristics of individual requests and answer questions about where the overall wall-clock time is spent. If a request takes 500 milliseconds to execute, it is possible to understand what fraction was spent on the server and network, and how much time was spent in encoding for example. Couchbase’s built-in tracer will perform the aggregations and log the top slowest operations at a configurable interval to provide efficient ways to start debugging and troubleshooting.

Metrics, by contrast, provide insight into how the application (or SDK for that matter) is performing as a whole. Individual requests are aggregated, and our default meter implementation will log aggregated percentiles on a per operation-type basis. This makes it possible to quickly pinpoint slow operation types (like a get operation,) and then further pinpoint the causes through tracing.

In addition to built-in turnkey logging solutions, support for external systems is also provided. These depend on your platform of choice, but OpenTelemetry is one of the more popular ones. As well as further discussion below, refer to the the howto guide for specific information for your target platform of choice.

Request Tracing

Tracing provides granular timing information down the individual phases of a request/operation. For a simple overview, consider the following output for an individual request — which is logged from the default tracer that ships with the SDK, and is enabled by default:

{
  "total_duration_us": 1200,
  "encode_duration_us": 100,
  "last_dispatch_duration_us": 40,
  "total_dispatch_duration_us": 40,
  "last_server_duration_us": 2,
  "total_server_duration_us": 2,
  "operation_name": "upsert",
  "last_local_id": "66388CF5BFCF7522/18CC8791579B567C",
  "operation_id": "0x23",
  "last_local_socket": "10.211.55.3:52450",
  "last_remote_socket": "10.112.180.101:11210",
  "timeout_ms": 2500,
}

It is possible to identify how long the operation took to encode, how much time it spent on the server, to which node it got sent to and much more. Only requests over a configurable threshold are logged and reported, so it is possible to fine tune these settings depending on the performance characteristics of each environment.

Configuration options are avialable to not only tune the thresholds, but also the logging interval and sample size.

Metrics

Request metrics provide insight into the sum of all individual requests flowing through an SDK. The overall timings (which can be compared to the total_duration_us value from the tracing section above) are aggregated and dumped with percentile values on each configurable interval. Here is an example:

{
  "meta": {
    "emit_interval_s": 600,
  },
  "operations": {
    "kv": {
        "upsert": {
        "total_count": 120,
        "percentiles_us": {
            "50.0": 5,
            "90.0": 10,
            "99.0": 33,
            "99.9": 55,
            "100.0": 101,
        }
        }
    }
  }
}

The JSON above shows the maximum latency as well as lower percentiles for individual operations. Note that they are aggregated across all nodes and include potential retries, so this really shows end-to-end latency from a users/SDK perspective.

OpenTelemetry Integration

In addition to the built-in turnkey logging solutions for both tracing and metrics — depending on platform availability — we are providing support for OpenTelemetry and other platform-specific integrations. This allows tight integration into your monitoring infrastructure, making monitoring and troubleshooting performance even more efficient.

The howto section provides more information on the platform specifics, and general information about OpenTelemetry can be found on the https://opentelemetry.io/ website.

Additional Information

A practical look at working with metrics is available in the Slow Operations Logging howto docs.