deco Studio

Monitoring

How to persist and query monitoring data in self-hosted deployments

How Monitoring Works

deco Studio includes an OpenTelemetry exporter that writes NDJSON files to the DATA_DIR directory. Data is organized into three subdirectories:

  • /metrics — system and application metrics
  • /logs — structured log entries for every tool invocation
  • /traces — distributed traces across MCP operations

Files are org-sharded with the following path structure:

 {DATA_DIR}/{type}/{org_id}/YYYY/MM/DD/HH/{uuid}.ndjson 
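As a rough illustration of that layout, the helper below assembles the directory that would hold one hour of files for a given type and org. The function name monitoring_hour_dir and the sample org ID are made up for this sketch, and the page does not say which time zone the hour uses, so treat the date parts as an assumption.

```shell
# Hypothetical helper: print the directory for one hour of monitoring files.
# Usage: monitoring_hour_dir DATA_DIR TYPE ORG_ID YYYY MM DD HH
monitoring_hour_dir() {
  printf '%s/%s/%s/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example (not run here): list the log files one org wrote during one hour.
# ls "$(monitoring_hour_dir "$DATA_DIR" logs org_123 2025 01 15 09)"/*.ndjson
```

Each file in that directory is named {uuid}.ndjson, so a glob on *.ndjson picks up everything written during the hour.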

A 30-day retention policy with automatic cleanup keeps disk usage in check.

Persisting Monitoring Data (S3 Sidecar)

By default, monitoring data lives on the local filesystem. To survive container restarts and enable centralized storage, set up a sidecar process that periodically syncs DATA_DIR to S3:

 # Example: sync monitoring data to S3 every 5 minutes
while true; do
  aws s3 sync "$DATA_DIR/metrics" s3://your-bucket/metrics
  aws s3 sync "$DATA_DIR/logs" s3://your-bucket/logs
  aws s3 sync "$DATA_DIR/traces" s3://your-bucket/traces
  sleep 300   # 5 minutes between sync passes
done

In Kubernetes, run this as a sidecar container sharing a volume with the main deco Studio container.

Option A: ClickHouse

For production deployments, ClickHouse provides scalable, fast aggregations over monitoring data.

  1. Set the CLICKHOUSE_URL environment variable to your ClickHouse HTTP endpoint.
  2. Create tables named monitoring_logs and monitoring_metrics in your ClickHouse instance.
  3. When configured, the monitoring UI queries ClickHouse directly via HTTP.
 CLICKHOUSE_URL=https://your-clickhouse-instance:8123 

ClickHouse is the best choice for production: it handles large volumes efficiently, supports fast aggregations, and has native S3 integration for loading NDJSON files.
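Before pointing Studio at ClickHouse, it can help to confirm that the HTTP endpoint is reachable from the Studio host. ClickHouse's HTTP interface answers a GET on /ping with "Ok." when the server is up. The helper name clickhouse_ping_url is hypothetical and only builds the URL; the curl call is left as a comment because it needs network access.

```shell
# Hypothetical helper: derive the ClickHouse health-check URL from CLICKHOUSE_URL.
# A single trailing slash on the base URL is dropped so the result has one slash.
clickhouse_ping_url() {
  printf '%s/ping\n' "${1%/}"
}

# Example (not run here): a healthy server replies with "Ok."
# curl -fsS "$(clickhouse_ping_url "$CLICKHOUSE_URL")"
```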

Option B: DuckDB + S3

For smaller deployments that want to avoid running a separate database, you can use DuckDB with S3-mounted storage.

  1. Mount your S3 bucket as a local filesystem using a tool like s3fs, goofys, or mountpoint-s3.
  2. Set DATA_DIR to the mounted path.
  3. deco Studio writes NDJSON files directly to the mount, and the embedded DuckDB engine reads from the same path.
 # Mount S3 bucket
mountpoint-s3 your-bucket /mnt/monitoring

# Point DATA_DIR to the mount
DATA_DIR=/mnt/monitoring 

No CLICKHOUSE_URL is needed — DuckDB queries the NDJSON files on disk.
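To inspect the same files by hand, one option is the standalone DuckDB CLI. The sketch below only builds a SQL statement; count_logs_sql is a made-up helper name, and the query is an illustration for manual debugging, not the query Studio runs internally. It relies on DuckDB's read_ndjson_auto function and its ** glob, which matches any number of directory levels.

```shell
# Hypothetical helper: build a DuckDB statement that counts rows across
# every log file under DATA_DIR (all orgs, all dates).
count_logs_sql() {
  printf "SELECT count(*) FROM read_ndjson_auto('%s/logs/**/*.ndjson');\n" "$1"
}

# Example (not run here; requires the duckdb CLI to be installed):
# duckdb -c "$(count_logs_sql "$DATA_DIR")"
```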

Option C: Google Cloud Storage (OTLP via collector)

For self-hosted deployments on GCP that want neither ClickHouse nor a local disk with a sync sidecar, Studio can read its monitoring data directly from a GCS bucket. Studio already emits monitoring data as standard OTLP logs over the network. Point those at an OpenTelemetry Collector that writes them to GCS with the google_cloud_storage exporter, and the embedded DuckDB engine reads them back via GCS’s S3-compatible endpoint.

 Studio ──OTLP logs──▶ OTel Collector ──google_cloud_storage exporter──▶ gs://bucket/<prefix>/...
                              (OTLP JSON, native GCS client)                  ▲
                                            embedded DuckDB reads + flattens at query time 

The collector writes with a Google service account (native GCS client, no S3 signing). Studio reads via DuckDB’s httpfs extension, which speaks GCS’s S3-compatible API, so it needs an HMAC key. Both can use the same service account.

Do not use the awss3 exporter for GCS. The AWS SDK v2’s default request checksums are rejected by GCS (SignatureDoesNotMatch), and the env workaround does not take effect inside that exporter. Use google_cloud_storage.

Starting from scratch (no bucket yet)? The steps below use the gcloud CLI — set PROJECT_ID and a globally-unique BUCKET first.

1. Create the bucket. (Pre-create it — the exporter reuses it, see step 5.)

 gcloud storage buckets create "gs://${BUCKET}" \
  --project="${PROJECT_ID}" --location=us --uniform-bucket-level-access 

2. Create a service account and grant bucket access. Bucket-scoped storage.admin covers what the exporter needs (object writes and storage.buckets.get, which reuse_if_exists calls). No project-level buckets.create is required.

 gcloud iam service-accounts create studio-monitoring --project="${PROJECT_ID}"
SA="studio-monitoring@${PROJECT_ID}.iam.gserviceaccount.com"

gcloud storage buckets add-iam-policy-binding "gs://${BUCKET}" \
  --member="serviceAccount:${SA}" --role="roles/storage.admin" 

On GKE, bind this SA to the collector’s Kubernetes SA via Workload Identity (no key file needed). Off-GKE, create a key (gcloud iam service-accounts keys create key.json --iam-account="${SA}") and mount it with GOOGLE_APPLICATION_CREDENTIALS.

3. Create an HMAC key for Studio’s read. DuckDB reads via the S3-compatible API, which needs an HMAC key on the same SA:

 gcloud storage hmac keys create "${SA}" --project="${PROJECT_ID}"
#  → accessId  = MONITORING_S3_ACCESS_KEY_ID  (GOOG1E...)
#  → secret    = MONITORING_S3_SECRET_ACCESS_KEY   (shown only once) 

4. Send Studio’s monitoring logs to your collector. Set MONITORING_OTLP_ENDPOINT (or enable the in-cluster collector) so Studio exports OTLP logs to it.

5. Configure the collector to write OTLP-JSON to GCS. Add the google_cloud_storage exporter to the collector’s logs pipeline. It marshals to OTLP JSON by default — exactly what the dashboard reads.

 processors:
  batch:
    # One object is written per flush. Keep batches bounded so each file stays
    # well under the reader's 32 MiB per-file limit, and to limit how many
    # objects each dashboard query scans. (send_batch_max_size must be >=
    # send_batch_size.)
    send_batch_size: 2048
    send_batch_max_size: 2048
    timeout: 60s
exporters:
  google_cloud_storage:
    bucket:
      name: your-bucket
      project_id: your-project   # auto-detected on GKE; required off-GCP
      region: us
      reuse_if_exists: true      # use the pre-created bucket; required for restart-safety
      file_prefix: logs
      partition:
        prefix: logs             # the read prefix (must match MONITORING_S3_PREFIX)
        format: "year=%Y/month=%m/day=%d/hour=%H"
service:
  pipelines:
    logs:
      processors: [batch]
      exporters: [google_cloud_storage] 

reuse_if_exists: true is required. With the default (false) the exporter tries to create the bucket on every startup and fails with a 409 Conflict once it exists, so the collector won’t restart.

The reader caps a single file at 32 MiB; larger files are skipped. The batch settings above keep each flushed object well under that limit; if your tool inputs are large, lower send_batch_size and send_batch_max_size together. (Fewer objects, each of bounded size, also make each dashboard query cheaper.)
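One way to spot objects that exceed the cap is to filter a long listing by size. The sketch below assumes that gcloud storage ls --long prints one object per line as size, timestamp, URL, followed by a TOTAL summary line; verify that against your gcloud version. The function name oversized_objects is hypothetical.

```shell
# Hypothetical helper: read "SIZE  TIMESTAMP  URL" lines on stdin and print
# the URL of every object larger than 32 MiB. Lines whose first field is not
# a plain number (such as a TOTAL summary) are ignored.
oversized_objects() {
  awk -v limit="$((32 * 1024 * 1024))" '$1 ~ /^[0-9]+$/ && $1 > limit { print $3 }'
}

# Example (not run here):
# gcloud storage ls --long --recursive "gs://${BUCKET}/logs/" | oversized_objects
```

If this prints anything, those objects are the ones the reader would skip, and smaller batch settings are worth trying.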

6. Point Studio’s reader at the same bucket:

 MONITORING_S3_BUCKET=your-bucket
MONITORING_S3_PREFIX=logs       # matches the collector's partition.prefix
MONITORING_S3_ENDPOINT=https://storage.googleapis.com
MONITORING_S3_ACCESS_KEY_ID=<hmac-key>
MONITORING_S3_SECRET_ACCESS_KEY=<hmac-secret> 

When MONITORING_S3_BUCKET is set (and CLICKHOUSE_URL is not), the dashboard reads the OTLP-JSON log files from the bucket. Metrics (calls, errors, latency percentiles) are derived from those same log rows, so there is no separate metrics store. The httpfs extension DuckDB needs is baked into the official image, so this works with strict outbound network policies.
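A small pre-flight check can catch a missing setting before Studio starts. The sketch below is an assumption-laden illustration: the function name check_monitoring_s3_env is made up, and the set of required variables is inferred from this page (the bucket plus the HMAC key and secret, with the S3_* fallbacks listed under Environment Variables), not from Studio's source.

```shell
# Hypothetical pre-flight check for the bucket reader settings.
# Returns non-zero and names the missing variables on stderr.
check_monitoring_s3_env() {
  missing=""
  [ -n "${MONITORING_S3_BUCKET:-}" ] ||
    missing="$missing MONITORING_S3_BUCKET"
  [ -n "${MONITORING_S3_ACCESS_KEY_ID:-${S3_ACCESS_KEY_ID:-}}" ] ||
    missing="$missing MONITORING_S3_ACCESS_KEY_ID"
  [ -n "${MONITORING_S3_SECRET_ACCESS_KEY:-${S3_SECRET_ACCESS_KEY:-}}" ] ||
    missing="$missing MONITORING_S3_SECRET_ACCESS_KEY"
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  # The bucket reader is only used when ClickHouse is not configured.
  if [ -n "${CLICKHOUSE_URL:-}" ]; then
    echo "note: CLICKHOUSE_URL is set, so the dashboard will use ClickHouse" >&2
  fi
  return 0
}

# Example: check_monitoring_s3_env && echo "reader settings look complete"
```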

7. Verify. Make a few tool calls through Studio, then confirm files land (the collector flushes on its batch timeout) and the dashboard populates:

 gcloud storage ls --recursive "gs://${BUCKET}/logs/" | head 
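To check a single day rather than the whole prefix, you can build the partition path from the format configured in step 5. This is a sketch: day_partition_prefix is a hypothetical helper, and it assumes objects land under <prefix>/year=…/month=…/day=…/ exactly as that format string suggests.

```shell
# Hypothetical helper: print the object prefix for one day partition.
# Usage: day_partition_prefix PREFIX YYYY MM DD
day_partition_prefix() {
  printf '%s/year=%s/month=%s/day=%s\n' "$1" "$2" "$3" "$4"
}

# Example (not run here): list only the objects written on 2025-01-15.
# gcloud storage ls --recursive "gs://${BUCKET}/$(day_partition_prefix logs 2025 01 15)/"
```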

Retention (recommended). When the bucket uses the year=/month=/day= partition layout above, each dashboard query reads only the day partitions covered by the selected date range instead of flattening the whole prefix. Schema detection still lists the objects under the prefix, so a bucket lifecycle rule remains the practical way to bound that listing cost as history accumulates. Studio applies no retention to the bucket itself, so add a rule to auto-delete objects, e.g. after 30 days:

 echo '{"rule":[{"action":{"type":"Delete"},"condition":{"age":30}}]}' > /tmp/lifecycle.json
gcloud storage buckets update "gs://${BUCKET}" --lifecycle-file=/tmp/lifecycle.json 

The OTLP export caps each tool call’s output payload at 8 KB (matching the hosted ClickHouse path). Tool inputs and all analytics (counts, error rate, latency) are unaffected; only very large response bodies are clipped in the call inspector.

Troubleshooting

  • Collector won’t start / 409 Conflict on the bucket: set bucket.reuse_if_exists: true and pre-create the bucket (step 1).
  • 403 storage.buckets.get denied on startup: the collector’s service account needs bucket-level storage.admin (or at least storage.buckets.get plus object write) — see step 2.
  • Dashboard empty / “Monitoring stats unavailable”: confirm objects exist with gcloud storage ls "gs://${BUCKET}/logs/", and confirm MONITORING_S3_PREFIX exactly matches the collector’s partition.prefix.
  • Studio fails to start with a config error: when MONITORING_S3_BUCKET is set, the HMAC access key and secret are required (the DuckDB extension directory is baked into the official image).
  • Malformed JSON in file …: a truncated/corrupt object under the year=/… partitions fails the whole read, because DuckDB can’t skip parse errors for single-object JSON. Delete the object named in the error (a partial write from a crashed collector is the usual cause); a retention rule keeps these from accumulating. Leftover objects outside the year=/month=/day= layout (e.g. legacy dumps at the prefix root, or an older partition scheme) are skipped automatically, since the query only reads year=*/… .
  • Out of Memory Error from queryMetricTimeseries / queryLlmUsageStats: the OOM comes from flattening the whole prefix in one query. With the year=/month=/day= partition layout above, queries prune the scan to the day partitions in the dashboard’s date range, which avoids it. First confirm the collector writes that layout (gcloud storage ls should show year=…/month=…/day=…/ paths) and narrow the date range. If it still OOMs on a small container, reduce parallelism with DUCKDB_THREADS (e.g. 2); fewer threads lowers peak memory. DUCKDB_MEMORY_LIMIT defaults to ~80% of the container’s RAM and cannot exceed physical memory, so raising it past what the container has doesn’t help, and lowering it just trips the limit sooner; give the container more memory instead. Also bound growth with a bucket retention/lifecycle rule.

Environment Variables

Variable | Default | Description
DATA_DIR | ~/deco | Base directory for monitoring NDJSON files
CLICKHOUSE_URL | (not set) | ClickHouse HTTP endpoint. When set, the monitoring UI uses ClickHouse instead of DuckDB
MONITORING_OTLP_ENDPOINT | falls back to OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint Studio exports monitoring logs to (your collector)
MONITORING_S3_BUCKET | (not set) | GCS bucket holding OTLP-JSON logs. When set (and CLICKHOUSE_URL is not), the dashboard reads from this bucket via DuckDB
MONITORING_S3_PREFIX | (none) | Key prefix within the bucket (matches the collector’s partition.prefix)
MONITORING_S3_ENDPOINT | falls back to S3_ENDPOINT | S3-compatible endpoint, e.g. https://storage.googleapis.com
MONITORING_S3_REGION | falls back to S3_REGION | Region for SigV4 signing (auto for GCS)
MONITORING_S3_ACCESS_KEY_ID | falls back to S3_ACCESS_KEY_ID | GCS HMAC key
MONITORING_S3_SECRET_ACCESS_KEY | falls back to S3_SECRET_ACCESS_KEY | GCS HMAC secret
DUCKDB_MEMORY_LIMIT | (80% of RAM) | Memory cap for the embedded DuckDB monitoring engine, e.g. 2GB. Lower it on a memory-constrained container
DUCKDB_THREADS | (all CPUs) | Thread count for the embedded DuckDB engine. Fewer threads lowers peak memory
