Monitoring
How to persist and query monitoring data in self-hosted deployments
How Monitoring Works
deco Studio includes an OpenTelemetry exporter that writes NDJSON files to the DATA_DIR directory. Data is organized into three subdirectories:
- `/metrics`: system and application metrics
- `/logs`: structured log entries for every tool invocation
- `/traces`: distributed traces across MCP operations
Files are org-sharded with the following path structure:
{DATA_DIR}/{type}/{org_id}/YYYY/MM/DD/HH/{uuid}.ndjson
A 30-day retention policy with automatic cleanup keeps disk usage in check.
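The sharded layout above can be sketched as a small shell helper that prints the directory holding one org's files for a given hour. `shard_dir` and the org ID `org_123` are hypothetical names used only for illustration, and the source does not state which time zone the exporter shards by.

```shell
# Hypothetical helper: print the directory that holds one org's files for one hour.
# Arguments: data_dir type org_id year month day hour
shard_dir() {
  printf '%s/%s/%s/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7"
}

# Example (placeholder org ID; assumes the hour you pass matches the exporter's clock):
# ls "$(shard_dir "$DATA_DIR" logs org_123 2025 01 15 09)"
```

Each file inside that directory is a `{uuid}.ndjson` file, so a plain `ls` on the printed path is enough to see what was written for that hour.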
Persisting Monitoring Data (S3 Sidecar)
By default, monitoring data lives on the local filesystem. To survive container restarts and enable centralized storage, set up a sidecar process that periodically syncs DATA_DIR to S3:
# Example: sync monitoring data to S3 (schedule this to run every 5 minutes)
aws s3 sync "$DATA_DIR/metrics" s3://your-bucket/metrics
aws s3 sync "$DATA_DIR/logs" s3://your-bucket/logs
aws s3 sync "$DATA_DIR/traces" s3://your-bucket/traces
In Kubernetes, run this as a sidecar container sharing a volume with the main deco Studio container.
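A sidecar entrypoint could look roughly like the sketch below. It is a minimal illustration, not a shipped script: `SYNC_CMD`, `BUCKET`, and `SYNC_INTERVAL` are hypothetical variable names introduced here, and only `DATA_DIR` comes from deco Studio.

```shell
# Sync the three monitoring subdirectories once.
# SYNC_CMD (hypothetical) defaults to the AWS CLI; set it to "echo" for a dry run.
sync_once() {
  for kind in metrics logs traces; do
    ${SYNC_CMD:-aws s3 sync} "$DATA_DIR/$kind" "s3://$BUCKET/$kind" || return 1
  done
}

# Repeat forever, sleeping between passes (default: 300 seconds = 5 minutes).
# A failed pass is logged and retried on the next iteration.
sync_loop() {
  while true; do
    sync_once || echo "sync failed, will retry" >&2
    sleep "${SYNC_INTERVAL:-300}"
  done
}
```

The container command would then source this file and call `sync_loop`. Running with `SYNC_CMD=echo` prints the source and destination pairs without touching S3, which is a cheap way to check the paths first.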
Option A: ClickHouse (Recommended)
For production deployments, ClickHouse provides scalable, fast aggregations over monitoring data.
- Set the `CLICKHOUSE_URL` environment variable to your ClickHouse HTTP endpoint.
- Create tables named `monitoring_logs` and `monitoring_metrics` in your ClickHouse instance.
- When configured, the monitoring UI queries ClickHouse directly via HTTP.
CLICKHOUSE_URL=https://your-clickhouse-instance:8123
ClickHouse is the best choice for production: it handles large volumes efficiently, supports fast aggregations, and has native S3 integration for loading NDJSON files.
Option B: DuckDB + S3
For smaller deployments that want to avoid running a separate database, you can use DuckDB with S3-mounted storage.
- Mount your S3 bucket as a local filesystem using a tool like s3fs, goofys, or mountpoint-s3.
- Set `DATA_DIR` to the mounted path.
- deco Studio writes NDJSON files directly to the mount, and the embedded DuckDB engine reads from the same path.
# Mount the S3 bucket (Mountpoint for Amazon S3 installs the mount-s3 binary)
mount-s3 your-bucket /mnt/monitoring
# Point DATA_DIR to the mount
DATA_DIR=/mnt/monitoring
No CLICKHOUSE_URL is needed — DuckDB queries the NDJSON files on disk.
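Because DuckDB only sees what is on the mount, it can help to confirm that NDJSON files are actually landing there. The helper below is a minimal sketch using standard tools; `count_records` is a hypothetical name, and it assumes one record per line with a trailing newline.

```shell
# Hypothetical helper: count NDJSON records (one per line) under a directory.
# Prints 0 when no .ndjson files are found.
count_records() {
  find "$1" -type f -name '*.ndjson' -exec cat {} + | wc -l | tr -d ' '
}

# Example: count log records currently visible on the mount.
# count_records "$DATA_DIR/logs"
```

If the count stays at zero after a few tool calls, check the mount and the `DATA_DIR` value before looking at the dashboard.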
Option C: Google Cloud Storage (OTLP via collector)
For self-hosted deployments on GCP that want no ClickHouse and no disk/sidecar, Studio can read its monitoring data directly from a GCS bucket. Studio already emits monitoring data as standard OTLP logs over the network; you point those at an OpenTelemetry Collector that writes them to GCS with the google_cloud_storage exporter, and the embedded DuckDB engine reads them back via GCS’s S3-compatible endpoint.
Studio ──OTLP logs──▶ OTel Collector ──google_cloud_storage exporter──▶ gs://bucket/<prefix>/...
(OTLP JSON, native GCS client) ▲
embedded DuckDB reads + flattens at query time
The collector writes with a Google service account (native GCS client, no S3 signing). Studio reads via DuckDB’s `httpfs`, which speaks GCS’s S3-compatible API, so it needs an HMAC key. Both can use the same service account.
Do not use the `awss3` exporter for GCS. The AWS SDK v2’s default request checksums are rejected by GCS (`SignatureDoesNotMatch`), and the env workaround does not take effect inside that exporter. Use `google_cloud_storage`.
Starting from scratch (no bucket yet)? The steps below use the gcloud CLI — set PROJECT_ID and a globally-unique BUCKET first.
1. Create the bucket. (Pre-create it — the exporter reuses it, see step 5.)
gcloud storage buckets create "gs://${BUCKET}" \
--project="${PROJECT_ID}" --location=us --uniform-bucket-level-access
2. Create a service account and grant bucket access. Bucket-scoped `storage.admin` covers what the exporter needs (object writes and `storage.buckets.get`, which `reuse_if_exists` calls). No project-level `buckets.create` is required.
gcloud iam service-accounts create studio-monitoring --project="${PROJECT_ID}"
SA="studio-monitoring@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud storage buckets add-iam-policy-binding "gs://${BUCKET}" \
--member="serviceAccount:${SA}" --role="roles/storage.admin"
On GKE, bind this SA to the collector’s Kubernetes SA via Workload Identity (no key file needed). Off-GKE, create a key (`gcloud iam service-accounts keys create key.json --iam-account="${SA}"`) and mount it with `GOOGLE_APPLICATION_CREDENTIALS`.
3. Create an HMAC key for Studio’s read. DuckDB reads via the S3-compatible API, which needs an HMAC key on the same SA:
gcloud storage hmac keys create "${SA}" --project="${PROJECT_ID}"
# → accessId = MONITORING_S3_ACCESS_KEY_ID (GOOG1E...)
# → secret = MONITORING_S3_SECRET_ACCESS_KEY (shown only once)
4. Send Studio’s monitoring logs to your collector. Set MONITORING_OTLP_ENDPOINT (or enable the in-cluster collector) so Studio exports OTLP logs to it.
5. Configure the collector to write OTLP-JSON to GCS. Add the google_cloud_storage exporter to the collector’s logs pipeline. It marshals to OTLP JSON by default — exactly what the dashboard reads.
processors:
  batch:
    # One object is written per flush. Keep batches bounded so each file stays
    # well under the reader's 32 MiB per-file limit, and to limit how many
    # objects each dashboard query scans. (send_batch_max_size must be >=
    # send_batch_size.)
    send_batch_size: 2048
    send_batch_max_size: 2048
    timeout: 60s
exporters:
  google_cloud_storage:
    bucket:
      name: your-bucket
      project_id: your-project # auto-detected on GKE; required off-GCP
      region: us
      reuse_if_exists: true # use the pre-created bucket; required for restart-safety
      file_prefix: logs
      partition:
        prefix: logs # the read prefix (must match MONITORING_S3_PREFIX)
        format: "year=%Y/month=%m/day=%d/hour=%H"
service:
  pipelines:
    logs:
      processors: [batch]
      exporters: [google_cloud_storage]
`reuse_if_exists: true` is required. With the default (`false`) the exporter tries to create the bucket on every startup and fails with a `409 Conflict` once it exists, so the collector won’t restart.
The reader caps a single file at 32 MiB; larger files are skipped. The batch settings above keep each flushed object well under that; lower `send_batch_max_size` if your tool inputs are large. (Smaller, fewer files also make each dashboard query cheaper.)
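To spot objects that the reader would skip, a small filter over a long-format bucket listing is enough. This is a sketch under an assumption: that each object line printed by `gcloud storage ls --long` has the form `<size-in-bytes> <timestamp> <gs://url>`. `oversized_objects` is a hypothetical helper name.

```shell
# Hypothetical helper: read a long-format listing on stdin and print the URLs of
# objects larger than the reader's 32 MiB limit (32 * 1024 * 1024 = 33554432 bytes).
# Lines whose first field is not a number (such as a summary line) are ignored.
oversized_objects() {
  awk -v limit=33554432 '$1 ~ /^[0-9]+$/ && $1 + 0 > limit { print $NF }'
}

# Usage (requires gcloud and read access to the bucket):
# gcloud storage ls --long --recursive "gs://${BUCKET}/logs/" | oversized_objects
```

Any URL it prints points at a batch that was flushed too large, which is the signal to reduce the batch size in the collector.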
6. Point Studio’s reader at the same bucket:
MONITORING_S3_BUCKET=your-bucket
MONITORING_S3_PREFIX=logs # matches the collector's partition.prefix
MONITORING_S3_ENDPOINT=https://storage.googleapis.com
MONITORING_S3_ACCESS_KEY_ID=<hmac-key>
MONITORING_S3_SECRET_ACCESS_KEY=<hmac-secret>
When MONITORING_S3_BUCKET is set (and CLICKHOUSE_URL is not), the dashboard reads the OTLP-JSON log files from the bucket. Metrics (calls, errors, latency percentiles) are derived from those same log rows, so there is no separate metrics store. The httpfs extension DuckDB needs is baked into the official image, so this works with strict outbound network policies.
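A preflight check before starting Studio can catch a missing HMAC key or secret early. The function below is a hypothetical sketch, not part of Studio: it assumes only what this page states, namely that the key and secret are required once `MONITORING_S3_BUCKET` is set and that each `MONITORING_S3_*` variable falls back to its `S3_*` counterpart.

```shell
# Hypothetical preflight: when MONITORING_S3_BUCKET is set, verify that the HMAC
# key and secret resolve to non-empty values. Returns non-zero if either is missing.
check_monitoring_env() {
  # Nothing to check unless the bucket reader is enabled.
  [ -n "${MONITORING_S3_BUCKET:-}" ] || return 0
  missing=""
  for suffix in ACCESS_KEY_ID SECRET_ACCESS_KEY; do
    # Resolve MONITORING_S3_<suffix>, falling back to S3_<suffix>.
    eval "value=\${MONITORING_S3_${suffix}:-\${S3_${suffix}:-}}"
    [ -n "$value" ] || missing="$missing MONITORING_S3_${suffix}"
  done
  if [ -n "$missing" ]; then
    echo "missing HMAC settings:$missing" >&2
    return 1
  fi
}
```

Calling `check_monitoring_env || exit 1` at the top of a deployment script stops the rollout before Studio hits its own config error.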
7. Verify. Make a few tool calls through Studio, then confirm files land (the collector flushes on its batch timeout) and the dashboard populates:
gcloud storage ls --recursive "gs://${BUCKET}/logs/" | head
Retention (recommended). When the bucket uses the year=/month=/day= partition layout above, each dashboard query prunes the read to only the day partitions covered by the selected date range — it no longer flattens the whole prefix. Schema detection still lists the objects under the prefix, so a bucket lifecycle rule remains the practical way to bound that listing cost as history accumulates. Studio applies no retention itself — add a rule to auto-delete objects, e.g. after 30 days:
echo '{"rule":[{"action":{"type":"Delete"},"condition":{"age":30}}]}' > /tmp/lifecycle.json
gcloud storage buckets update "gs://${BUCKET}" --lifecycle-file=/tmp/lifecycle.json
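To see which day partitions a given date range touches under the `year=/month=/day=` layout, the range can be expanded with a short loop. This is an illustrative sketch that mirrors the layout only; it is not Studio's pruning code. `day_partitions` is a hypothetical name, and the `-d` option used here requires GNU `date`.

```shell
# Hypothetical helper: print the day partitions covering an inclusive date range.
# Arguments: start and end dates as YYYY-MM-DD. Requires GNU date.
day_partitions() {
  d=$(date -u -d "$1" +%F) || return 1
  stop=$(date -u -d "$2 + 1 day" +%F) || return 1
  # Refuse reversed ranges so the loop below always terminates.
  [ "$(date -u -d "$1" +%s)" -le "$(date -u -d "$2" +%s)" ] || return 1
  while [ "$d" != "$stop" ]; do
    date -u -d "$d" +'year=%Y/month=%m/day=%d'
    d=$(date -u -d "$d + 1 day" +%F)
  done
}

# Example: list the objects for each day in a range (requires gcloud).
# for p in $(day_partitions 2025-01-30 2025-02-01); do
#   gcloud storage ls "gs://${BUCKET}/logs/${p}/"
# done
```

Listing per partition like this also shows quickly whether older days are still present after a lifecycle rule should have removed them.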
The OTLP export caps each tool call’s output payload at 8 KB (matching the hosted ClickHouse path). Tool inputs and all analytics (counts, error rate, latency) are unaffected; only very large response bodies are clipped in the call inspector.
Troubleshooting
- Collector won’t start / `409 Conflict` on the bucket: set `bucket.reuse_if_exists: true` and pre-create the bucket (step 1).
- `403 storage.buckets.get denied` on startup: the collector’s service account needs bucket-level `storage.admin` (or at least `storage.buckets.get` plus object write); see step 2.
- Dashboard empty / “Monitoring stats unavailable”: confirm objects exist with `gcloud storage ls "gs://${BUCKET}/logs/"`, and confirm `MONITORING_S3_PREFIX` exactly matches the collector’s `partition.prefix`.
- Studio fails to start with a config error: when `MONITORING_S3_BUCKET` is set, the HMAC access key and secret are required (the DuckDB extension directory is baked into the official image).
- `Malformed JSON in file …`: a truncated/corrupt object under the `year=/…` partitions fails the whole read, because DuckDB can’t skip parse errors for single-object JSON. Delete the object named in the error (a partial write from a crashed collector is the usual cause); a retention rule keeps these from accumulating. Note that leftover objects outside the `year=/month=/day=` layout (e.g. legacy dumps at the prefix root, or an older partition scheme) are skipped automatically, since the query only reads `year=*/…`.
- `Out of Memory Error` from `queryMetricTimeseries` / `queryLlmUsageStats`: the OOM comes from flattening the whole prefix in one query. With the `year=/month=/day=` partition layout above, queries prune the data scan to the day partitions in the dashboard’s date range, which avoids it. First confirm the collector writes that layout (`gcloud storage ls` should show `year=…/month=…/day=…/` paths) and narrow the date range. If it still OOMs on a small container, reduce parallelism with `DUCKDB_THREADS` (e.g. `2`); fewer threads lowers peak memory. Note that `DUCKDB_MEMORY_LIMIT` defaults to ~80% of the container’s RAM and cannot exceed physical memory, so raising it past what the container has doesn’t help (and lowering it just trips the limit sooner); give the container more memory instead. Also bound growth with a bucket retention/lifecycle rule.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `DATA_DIR` | `~/deco` | Base directory for monitoring NDJSON files |
| `CLICKHOUSE_URL` | (not set) | ClickHouse HTTP endpoint. When set, the monitoring UI uses ClickHouse instead of DuckDB |
| `MONITORING_OTLP_ENDPOINT` | (falls back to `OTEL_EXPORTER_OTLP_ENDPOINT`) | OTLP endpoint Studio exports monitoring logs to (your collector) |
| `MONITORING_S3_BUCKET` | (not set) | GCS bucket holding OTLP-JSON logs. When set (and `CLICKHOUSE_URL` is not), the dashboard reads from this bucket via DuckDB |
| `MONITORING_S3_PREFIX` | (none) | Key prefix within the bucket (matches the collector’s `partition.prefix`) |
| `MONITORING_S3_ENDPOINT` | falls back to `S3_ENDPOINT` | S3-compatible endpoint, e.g. `https://storage.googleapis.com` |
| `MONITORING_S3_REGION` | falls back to `S3_REGION` | Region for SigV4 signing (`auto` for GCS) |
| `MONITORING_S3_ACCESS_KEY_ID` | falls back to `S3_ACCESS_KEY_ID` | GCS HMAC key |
| `MONITORING_S3_SECRET_ACCESS_KEY` | falls back to `S3_SECRET_ACCESS_KEY` | GCS HMAC secret |
| `DUCKDB_MEMORY_LIMIT` | (80% of RAM) | Memory cap for the embedded DuckDB monitoring engine, e.g. `2GB`. Lower it on a memory-constrained container |
| `DUCKDB_THREADS` | (all CPUs) | Thread count for the embedded DuckDB engine. Fewer threads lowers peak memory |