Monitoring & Observability

Monitor ForgeX services with Google Cloud's integrated observability platform and custom dashboards.

Current Production Configuration

This section documents what's actually deployed in GCP today. The rest of this page is a how-to reference for adding more — but read this first to understand the baseline.

Deployed Alert Policies

Both policies live in Cloud Monitoring under project forge-475221. Each policy has an embedded runbook in its documentation.content field — open the policy in the Console to see investigation commands and likely-cause checklists.

Policy	Trigger	Backing Metric	Notifies
Cloud SQL — Real ERROR rate (forge-postgres-prod)	`> 3` filtered errors per 5min, auto-close 30min	`logging.googleapis.com/user/forge_cloudsql_real_errors`	Email + GCP mobile push
Cloud Run — Real ERROR rate (forge-bids-backend, forge-supertokens)	`> 5` filtered errors per 5min, auto-close 30min	`logging.googleapis.com/user/forge_cloudrun_real_errors`	Email + GCP mobile push

Log-Based Metrics (the "real errors" filters)

Both metrics deliberately exclude known-benign log entries so the alerts only fire on actual incidents:

Metric	Excludes (benign noise)
`forge_cloudsql_real_errors`	`tenants_pkey` duplicates, `could not serialize access` (both are SuperTokens bootstrap chatter that fires once per cold start of the SuperTokens core)
`forge_cloudrun_real_errors`	`cloudaudit.googleapis.com` metadata entries (admin activity, not application errors)

If you ever see one of these benign messages start firing alerts again, do not add it as a new filter without understanding why — these filters are the result of root-cause investigation, not pattern-matching. The SuperTokens errors specifically only fire on cold starts, which is why we also keep both services warm with --min-instances=1 (see below).
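The exclusion logic can be sketched as a helper that builds a Cloud Logging query from a base clause plus the benign substrings, with a local predicate that mirrors it for testing candidate messages. This is a hypothetical reconstruction, not the deployed filter. It assumes the Cloud SQL messages arrive in `textPayload` and uses an assumed base clause, so compare it against the real definition (`gcloud logging metrics describe forge_cloudsql_real_errors`) before relying on it.

```javascript
// Hypothetical sketch: rebuild a "real errors" filter from a base clause plus
// benign substrings. Assumes the benign text appears in textPayload.

/** Quote a value as a Logging query string literal (escape \ and "). */
function quote(value) {
  return '"' + value.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
}

/** Build: baseClause AND NOT textPayload:"s1" AND NOT textPayload:"s2" ... */
function buildRealErrorFilter(baseClause, excludedSubstrings) {
  const exclusions = excludedSubstrings.map((s) => `NOT textPayload:${quote(s)}`);
  return [baseClause, ...exclusions].join(" AND ");
}

const AT_LEAST_ERROR = new Set(["ERROR", "CRITICAL", "ALERT", "EMERGENCY"]);

/** Local mirror of the filter. The ':' operator is case-insensitive, so this is too. */
function isRealError(entry, excludedSubstrings) {
  if (!AT_LEAST_ERROR.has(entry.severity)) return false; // below ERROR
  const text = (entry.textPayload || "").toLowerCase();
  return !excludedSubstrings.some((s) => text.includes(s.toLowerCase()));
}

// Benign SuperTokens bootstrap messages listed in the table above.
const CLOUDSQL_BENIGN = ["tenants_pkey", "could not serialize access"];

const cloudSqlFilter = buildRealErrorFilter(
  'resource.type="cloudsql_database" AND severity>=ERROR', // assumed base clause
  CLOUDSQL_BENIGN
);
```

Keeping the benign list as data makes the root-cause rule above easier to enforce: each new entry is a reviewed one-line change rather than an edit buried in a query string.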

Cold-Start Prevention

forge-bids-backend and forge-supertokens are both configured with --min-instances=1. This is intentional and load-bearing:

The problem it solves: without warm instances, the auth chain (bids-backend → supertokens core → Cloud SQL) produces HTTP 500s on /api/auth/session/refresh whenever a user is the first to hit auth after an idle period. Empirically, this fired ~20 times in 14 days.
Cost: approximately $5–10/month per service. See COMPLETE_DEPLOYMENT_GUIDE.md → "Scale-to-zero policy" for the full math.
Verify the setting is present:

gcloud run services describe forge-supertokens --region=us-south1 \
  --format=yaml | grep minScale
gcloud run services describe forge-bids-backend --region=us-south1 \
  --format=yaml | grep minScale

Both should return autoscaling.knative.dev/minScale: '1'.
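For a scripted check (for example in CI), the same annotation can be read from the JSON form of `gcloud run services describe`. This is a minimal sketch. It assumes the JSON was produced with `--format=json` and that min instances is set on the revision template, which is where the `grep` above finds it. The function names are illustrative.

```javascript
// Sketch: read minScale from `gcloud run services describe SERVICE
// --region=us-south1 --format=json` output (after JSON.parse).
const MIN_SCALE_KEY = "autoscaling.knative.dev/minScale";

/** Return the configured min instances as a number, or 0 if unset. */
function minInstances(serviceJson) {
  const annotations = serviceJson?.spec?.template?.metadata?.annotations ?? {};
  const raw = annotations[MIN_SCALE_KEY];
  return raw === undefined ? 0 : Number.parseInt(raw, 10);
}

/** Throw if the service can scale to zero (and hit the cold-start 500s). */
function assertWarm(serviceJson) {
  const name = serviceJson?.metadata?.name ?? "(unknown service)";
  const min = minInstances(serviceJson);
  if (min < 1) {
    throw new Error(`${name}: minScale is ${min}, expected at least 1`);
  }
  return min;
}
```

Example usage: write the output to `svc.json`, then call `assertWarm(JSON.parse(require("node:fs").readFileSync("svc.json", "utf8")))`.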

Sentry Integration State

Sentry is enabled for forgex-portal-frontend, forgex-bids-frontend, and forgex-bids-backend. Post-deploy verification happens via Sentry Logs, not Sentry Issues — each SDK emits a logger.info startup entry that you can find in Sentry → Explore → Logs filtered by service:portal-frontend (or the equivalent for the other services). The previous captureMessage ping pattern was removed because it created self-regressing Sentry Issues on every cold start.

Quick Health Checks


# Bids API
curl https://bids.precisionsiteservices.com/api/health

# Expected: {"status":"ok","timestamp":"...","service":"bids"}

# Projects API (Phase 2)
curl https://projects.precisionsiteservices.com/api/health

# Field API (Phase 3)
curl https://field.precisionsiteservices.com/api/health

# SuperTokens Core
curl https://forge-supertokens-45561947981.us-south1.run.app/

# Expected: Hello

# API Version
curl https://forge-supertokens-45561947981.us-south1.run.app/apiversion

# Expected: {"versions":["2.0","3.0",...]}

# Portal
curl -I https://forge.precisionsiteservices.com

# Bids
curl -I https://bids.precisionsiteservices.com

# Expected: HTTP/2 200
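The same checks can be scripted. This is a minimal sketch that uses the global `fetch` in Node.js 18+ and covers a subset of the endpoints above. The expected bodies come from the examples, and `CHECKS` and `checkAll` are illustrative names.

```javascript
// Sketch of the manual health checks above. Requires Node.js 18+ (global fetch).
const CHECKS = [
  {
    name: "bids-api",
    url: "https://bids.precisionsiteservices.com/api/health",
    // Expected body: {"status":"ok",...,"service":"bids"}; a parse error counts as a failure.
    validate: (status, body) => status === 200 && JSON.parse(body).status === "ok",
  },
  {
    name: "supertokens-core",
    url: "https://forge-supertokens-45561947981.us-south1.run.app/",
    // Expected body: Hello (trim guards against a trailing newline).
    validate: (status, body) => status === 200 && body.trim() === "Hello",
  },
  {
    name: "portal",
    url: "https://forge.precisionsiteservices.com",
    validate: (status) => status === 200,
  },
];

/** Run every check in parallel; never throws, returns one result per check. */
async function checkAll(checks = CHECKS, fetchImpl = globalThis.fetch) {
  return Promise.all(
    checks.map(async ({ name, url, validate }) => {
      try {
        const res = await fetchImpl(url, { signal: AbortSignal.timeout(10_000) });
        const body = await res.text();
        return { name, ok: validate(res.status, body), status: res.status };
      } catch (err) {
        return { name, ok: false, error: String(err) };
      }
    })
  );
}
```

Injecting `fetchImpl` keeps the checks testable without network access. The 10-second timeout stops a hung endpoint from stalling the whole run.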

Cloud Run Logs

View Real-Time Logs

# Bids backend (last 50 lines)
gcloud run services logs read forge-bids-backend --region us-south1 --limit 50

# Follow logs (tail -f style; this subcommand is in the beta track)
gcloud beta run services logs tail forge-bids-backend --region us-south1

# Filter by severity
gcloud run services logs read forge-bids-backend --region us-south1 \
  --log-filter="severity>=ERROR"

Structured Logging

ForgeX uses structured JSON logging for easy parsing:

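// Backend logging format: one JSON object per line on stdout.
// Cloud Logging promotes the `severity` and `message` keys onto the log entry;
// a `level` key would stay inside jsonPayload, and `severity>=ERROR` filters would miss it.
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  severity: 'INFO',
  service: 'bids-backend',
  message: 'User logged in',
  userId: user.id,
  email: user.email,
  ip: req.ip
}));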


Cloud Logging automatically parses JSON logs and indexes fields for searching.
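A small helper keeps every log line in this shape. This is a sketch, and `logStructured` is a hypothetical name rather than an existing ForgeX function. It puts `severity` and `message` at the top level and writes one JSON object per line. It also rejects severities that Cloud Logging does not define.

```javascript
// Sketch of a structured logger for Cloud Run (one JSON object per stdout line).
const SEVERITIES = new Set([
  "DEFAULT", "DEBUG", "INFO", "NOTICE", "WARNING",
  "ERROR", "CRITICAL", "ALERT", "EMERGENCY",
]);

function logStructured(
  severity,
  message,
  fields = {},
  write = (line) => process.stdout.write(line + "\n")
) {
  if (!SEVERITIES.has(severity)) {
    throw new Error(`Unknown severity: ${severity}`);
  }
  const line = JSON.stringify({
    ...fields, // caller fields go first so they cannot override the keys below
    severity,
    message,
    service: "bids-backend",
    timestamp: new Date().toISOString(),
  });
  write(line);
  return line;
}
```

Example usage: `logStructured("ERROR", "Bid save failed", { bidId })`. The entry then matches both `severity>=ERROR` and `jsonPayload.bidId="..."`.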

Search Logs in Console

Go to Cloud Logging

Filter by resource:

resource.type="cloud_run_revision"
resource.labels.service_name="forge-bids-backend"

Search by severity:
```
severity>=ERROR
```
Search by custom fields:
```
jsonPayload.userId="user-123"
```

Metrics & Dashboards

Cloud Run Metrics

Key metrics available in Cloud Monitoring:

Metric	Metric type	What it shows
Request Count	`run.googleapis.com/request_count`	Total requests per service
Request Latency	`run.googleapis.com/request_latencies`	Latency distribution (P50, P95, P99)
Instance Count	`run.googleapis.com/container/instance_count`	Active container instances
CPU Utilization	`run.googleapis.com/container/cpu/utilizations`	CPU usage per instance
Memory Utilization	`run.googleapis.com/container/memory/utilizations`	Memory usage per instance
Billable Time	`run.googleapis.com/container/billable_instance_time`	Billable instance time, for cost tracking
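These metrics can also be pulled programmatically through the Cloud Monitoring API. This is a sketch that assumes the `@google-cloud/monitoring` Node.js client is installed and Application Default Credentials are available. The request builder is kept separate so it can be inspected without network access, and both helper names are illustrative.

```javascript
// Sketch: query a Cloud Run metric through the Monitoring API.
const PROJECT_ID = "forge-475221";

/** Build a ListTimeSeries request for one service over the last `minutes`. */
function buildTimeSeriesRequest(metricType, serviceName, minutes = 60, nowMs = Date.now()) {
  const endSeconds = Math.floor(nowMs / 1000);
  return {
    name: `projects/${PROJECT_ID}`,
    filter:
      `metric.type="${metricType}" AND resource.type="cloud_run_revision" ` +
      `AND resource.labels.service_name="${serviceName}"`,
    interval: {
      startTime: { seconds: endSeconds - minutes * 60 },
      endTime: { seconds: endSeconds },
    },
    // ALIGN_RATE gives per-second rates per 60 s window (suits request_count;
    // use e.g. ALIGN_PERCENTILE_95 for request_latencies).
    aggregation: { alignmentPeriod: { seconds: 60 }, perSeriesAligner: "ALIGN_RATE" },
    view: "FULL",
  };
}

/** Fetch the series. The client is required lazily so the builder runs anywhere. */
async function fetchTimeSeries(request) {
  const monitoring = require("@google-cloud/monitoring");
  const client = new monitoring.MetricServiceClient();
  const [series] = await client.listTimeSeries(request);
  return series;
}
```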

Custom Dashboard

Create a unified dashboard for all services:

Open Cloud Monitoring

Navigate to Monitoring → Dashboards

Create Dashboard

Click "Create Dashboard" → Name it "ForgeX Production"

Add Charts

Add charts for each metric:

Request Rate:

resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_count"

Error Rate:

resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_count"
metric.label.response_code_class="5xx"

Latency P95:

resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_latencies"
aggregation: 95th percentile

Save Dashboard

Save and pin to your GCP Console home

Example Dashboard JSON

dashboard-forgex.json

{
  "displayName": "ForgeX Production",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Rate (all services)",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              }
            }]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Error Rate (5xx)",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              }
            }]
          }
        }
      }
    ]
  }
}
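Instead of hand-editing the JSON, the tiles can be generated so that each one gets a non-overlapping position in the 12-column grid. This is a sketch. `makeTile` and `makeDashboard` are hypothetical helpers that emit the same structure as the example above. The output can be written to `dashboard-forgex.json` and loaded with `gcloud monitoring dashboards create --config-from-file=dashboard-forgex.json`.

```javascript
// Sketch: generate a mosaicLayout dashboard with tiles placed two per row.
const COLUMNS = 12;
const TILE_WIDTH = 6;
const TILE_HEIGHT = 4;

function makeTile(title, filter, index) {
  const perRow = COLUMNS / TILE_WIDTH; // 2 tiles per row
  return {
    xPos: (index % perRow) * TILE_WIDTH,
    yPos: Math.floor(index / perRow) * TILE_HEIGHT,
    width: TILE_WIDTH,
    height: TILE_HEIGHT,
    widget: {
      title,
      xyChart: {
        dataSets: [{
          timeSeriesQuery: {
            timeSeriesFilter: {
              filter,
              aggregation: { alignmentPeriod: "60s", perSeriesAligner: "ALIGN_RATE" },
            },
          },
        }],
      },
    },
  };
}

function makeDashboard(charts) {
  return {
    displayName: "ForgeX Production",
    mosaicLayout: {
      columns: COLUMNS,
      tiles: charts.map(({ title, filter }, i) => makeTile(title, filter, i)),
    },
  };
}

const REQUESTS = 'resource.type="cloud_run_revision" metric.type="run.googleapis.com/request_count"';
const dashboard = makeDashboard([
  { title: "Request Rate (all services)", filter: REQUESTS },
  { title: "Error Rate (5xx)", filter: `${REQUESTS} metric.label.response_code_class="5xx"` },
]);
// Write out: require("node:fs").writeFileSync("dashboard-forgex.json", JSON.stringify(dashboard, null, 2));
```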

Alerts

Create Alert Policies

Set up alerts for critical conditions:


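Alert when the bids backend's 5xx rate stays above 0.05 errors/second (about 3 per minute) for 5 minutes. This condition compares an absolute rate. Alerting on a percentage of all requests needs a ratio condition: a `denominatorFilter` in a policy file passed with `--policy-from-file`.

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High Error Rate - Bids Backend" \
  --condition-display-name="5xx rate > 0.05/s" \
  --condition-filter='resource.type="cloud_run_revision" AND
    resource.label.service_name="forge-bids-backend" AND
    metric.type="run.googleapis.com/request_count" AND
    metric.label.response_code_class="5xx"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE", "crossSeriesReducer": "REDUCE_SUM"}' \
  --if="> 0.05" \
  --duration=300s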
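A ratio policy file can be generated with a short script. This sketch builds an `AlertPolicy` whose condition divides the 5xx request rate by the total request rate through `denominatorFilter`. It is intended for `gcloud alpha monitoring policies create --policy-from-file=error-ratio-policy.json`. `buildErrorRatioPolicy` and the file name are hypothetical, and `CHANNEL_ID` stays a placeholder.

```javascript
// Sketch: AlertPolicy whose condition is (5xx request rate) / (all request rate).
function buildErrorRatioPolicy(service, thresholdPercent, channelNames) {
  const base =
    'resource.type="cloud_run_revision" AND ' +
    `resource.labels.service_name="${service}" AND ` +
    'metric.type="run.googleapis.com/request_count"';
  // Sum rates across revisions and response codes so the ratio is per service.
  const aggregations = [{
    alignmentPeriod: "60s",
    perSeriesAligner: "ALIGN_RATE",
    crossSeriesReducer: "REDUCE_SUM",
    groupByFields: ["resource.labels.service_name"],
  }];
  return {
    displayName: `High Error Rate - ${service}`,
    combiner: "OR",
    conditions: [{
      displayName: `5xx ratio > ${thresholdPercent}%`,
      conditionThreshold: {
        filter: `${base} AND metric.labels.response_code_class="5xx"`, // numerator
        aggregations,
        denominatorFilter: base, // all requests
        denominatorAggregations: aggregations,
        comparison: "COMPARISON_GT",
        thresholdValue: thresholdPercent / 100,
        duration: "300s",
        trigger: { count: 1 },
      },
    }],
    notificationChannels: channelNames,
  };
}

const policy = buildErrorRatioPolicy("forge-bids-backend", 5, [
  "projects/forge-475221/notificationChannels/CHANNEL_ID", // placeholder
]);
// Write out: require("node:fs").writeFileSync("error-ratio-policy.json", JSON.stringify(policy, null, 2));
```

Using the same aggregation for numerator and denominator keeps the two series aligned, so the ratio compares like with like.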

Alert when P95 latency exceeds 2 seconds. `request_latencies` is reported in milliseconds, so the threshold is 2000:

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High Latency - Bids Backend" \
  --condition-display-name="P95 latency > 2s" \
  --condition-filter='resource.type="cloud_run_revision" AND
    resource.label.service_name="forge-bids-backend" AND
    metric.type="run.googleapis.com/request_latencies"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_PERCENTILE_95"}' \
  --if="> 2000" \
  --duration=300s

Alert when the service receives no requests for 5 minutes. Idle periods also produce zero requests, so the uptime checks below are usually a more reliable "service down" signal:

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="No Requests - Bids Backend" \
  --condition-display-name="Zero requests for 5 min" \
  --condition-filter='resource.type="cloud_run_revision" AND
    resource.label.service_name="forge-bids-backend" AND
    metric.type="run.googleapis.com/request_count"' \
  --if=absent \
  --duration=300s

Alert when Cloud SQL backend connections approach `max_connections`. The threshold is an absolute count: 80 equals 80% only when `max_connections` is 100, so adjust it to your instance's limit:

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High Database Connections" \
  --condition-display-name="Connections > 80" \
  --condition-filter='resource.type="cloudsql_database" AND
    resource.label.database_id="forge-475221:forge-postgres-prod" AND
    metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN", "crossSeriesReducer": "REDUCE_SUM"}' \
  --if="> 80" \
  --duration=300s
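To derive the threshold from the instance's actual limit (for example, the value returned by `SHOW max_connections;` in psql), a small helper is enough. This is a sketch, and `connectionThreshold` is a hypothetical name. It uses integer arithmetic to avoid floating-point surprises.

```javascript
// Sketch: turn max_connections into the num_backends alert threshold.
function connectionThreshold(maxConnections, percent = 80) {
  if (!Number.isInteger(maxConnections) || maxConnections <= 0) {
    throw new Error("maxConnections must be a positive integer");
  }
  if (!Number.isInteger(percent) || percent <= 0 || percent > 100) {
    throw new Error("percent must be an integer in 1..100");
  }
  // Round down so the alert fires at or before the requested percentage.
  return Math.floor((maxConnections * percent) / 100);
}
```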

Notification Channels

Set up notification channels for alerts:

Email Notifications

gcloud alpha monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=ops@precisionsiteservices.com

Slack Notifications

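Cloud Monitoring's Slack channel type authorizes through a Slack app, not an incoming webhook. Connect it from the Console: Monitoring → Alerting → Edit notification channels → Slack → Add new, then sign in to the workspace and choose the channel. A Slack incoming-webhook URL is not a drop-in substitute. Cloud Monitoring's webhook channels post their own JSON payload, which Slack's webhook endpoint does not accept as a message.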

SMS Notifications

gcloud alpha monitoring channels create \
  --display-name="On-Call Phone" \
  --type=sms \
  --channel-labels=number=+12819391377
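The alert commands above take a `CHANNEL_ID` placeholder. This sketch resolves display names to full channel resource names from the output of `gcloud alpha monitoring channels list --format=json`. `channelNames` is a hypothetical helper.

```javascript
// Sketch: map channel display names to resource names
// (projects/PROJECT/notificationChannels/ID).
// Input: parsed JSON array from `gcloud alpha monitoring channels list --format=json`.
function channelNames(channels, wantedDisplayNames) {
  return wantedDisplayNames.map((display) => {
    const matches = channels.filter((ch) => ch.displayName === display);
    if (matches.length !== 1) {
      // Display names are not unique in Cloud Monitoring, so refuse to guess.
      throw new Error(`Expected exactly one channel named "${display}", found ${matches.length}`);
    }
    return matches[0].name;
  });
}
```

Example usage: `channelNames(channels, ["Ops Team Email", "On-Call Phone"]).join(",")` produces a comma-separated value for `--notification-channels`.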

Uptime Monitoring

Create Uptime Checks

Monitor endpoint availability:

# Portal uptime check (--period is in minutes)
gcloud monitoring uptime create forge-portal-uptime \
  --resource-type=uptime-url \
  --resource-labels=host=forge.precisionsiteservices.com,project_id=forge-475221 \
  --protocol=https \
  --path=/ \
  --period=1

# Bids API uptime check
gcloud monitoring uptime create forge-bids-api-uptime \
  --resource-type=uptime-url \
  --resource-labels=host=bids.precisionsiteservices.com,project_id=forge-475221 \
  --protocol=https \
  --path=/api/health \
  --period=1

Uptime Check Alerts

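Alert when an uptime check fails. Replace CHECK_ID with the ID Cloud Monitoring assigned to the check (the last segment of its resource name in `gcloud monitoring uptime list-configs`), which can differ from the display name. `--if="> 1"` requires more than one checking region to report a failure, which filters out single-region blips:

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Portal Down" \
  --condition-display-name="Uptime check failed" \
  --condition-filter='metric.type="monitoring.googleapis.com/uptime_check/check_passed" AND
    resource.type="uptime_url" AND
    metric.label.check_id="CHECK_ID"' \
  --aggregation='{"alignmentPeriod": "1200s", "perSeriesAligner": "ALIGN_NEXT_OLDER", "crossSeriesReducer": "REDUCE_COUNT_FALSE", "groupByFields": ["resource.label.*"]}' \
  --if="> 1" \
  --duration=60s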
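An uptime check's `check_id` label is the last segment of the check's resource name, and Cloud Monitoring assigns it. This sketch looks the ID up by display name from `gcloud monitoring uptime list-configs --format=json` and builds the matching condition filter. `uptimeCheckId` and `uptimeFailureFilter` are hypothetical helpers.

```javascript
// Sketch: resolve an uptime check's check_id and build its alert filter.
// Input: parsed JSON from `gcloud monitoring uptime list-configs --format=json`.
function uptimeCheckId(configs, displayName) {
  const matches = configs.filter((c) => c.displayName === displayName);
  if (matches.length !== 1) {
    throw new Error(`Expected one uptime check named "${displayName}", found ${matches.length}`);
  }
  // name looks like projects/PROJECT/uptimeCheckConfigs/CHECK_ID
  const parts = matches[0].name.split("/");
  return parts[parts.length - 1];
}

function uptimeFailureFilter(checkId) {
  return [
    'metric.type="monitoring.googleapis.com/uptime_check/check_passed"',
    'resource.type="uptime_url"',
    `metric.label.check_id="${checkId}"`,
  ].join(" AND ");
}
```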

Application Performance Monitoring (APM)

Error Tracking

Cloud Error Reporting automatically groups errors:

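# Recent errors for the service (gcloud has no command for listing
# Error Reporting groups, so read the underlying error logs instead)
gcloud logging read 'resource.type="cloud_run_revision" AND
  resource.labels.service_name="forge-bids-backend" AND
  severity>=ERROR' --freshness=1d --limit=50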

In GCP Console:

Go to Error Reporting
Filter by service
View stack traces and occurrence counts

Trace Analysis

Cloud Trace shows request flow across services:

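// Add custom spans with the Cloud Trace agent. start() returns the tracer and
// must run before any other module is required so the agent can patch them.
// Note: @google-cloud/trace-agent is deprecated in favor of OpenTelemetry.
const tracer = require('@google-cloud/trace-agent').start();
const express = require('express');

const app = express();
// `db` is the Prisma client, defined elsewhere.

app.get('/api/bids/:id', async (req, res) => {
  const span = tracer.createChildSpan({ name: 'getBid' });
  try {
    const bid = await db.bid.findUnique({ where: { id: req.params.id } });
    if (!bid) return res.status(404).json({ error: 'Bid not found' });
    res.json(bid);
  } finally {
    span.endSpan(); // always close the span, even on errors or early return
  }
});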

View traces in Cloud Trace Console.

Cost Monitoring

Set Budget Alerts

Create Budget

Go to Billing → Budgets

Set Threshold

Budget Amount: $100/month (adjust as needed)
Alerts at: 50%, 75%, 90%, 100%

Add Recipients

Email: billing@precisionsiteservices.com

Cost Breakdown

Track costs by service:

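# Show the billing account linked to the project
gcloud billing projects describe forge-475221

# List the billing accounts you can access
gcloud billing accounts list

gcloud does not return cost figures. For per-service trends, use the Console reports below, or enable billing export to BigQuery (Billing → Billing export) and query the exported table.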

In GCP Console:

Go to Billing → Reports
Group by: Service
Filter by: Cloud Run, Cloud SQL, Cloud Storage
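If billing export to BigQuery is enabled, per-service totals can be computed from the exported rows. This sketch assumes rows shaped like the standard export schema: `service.description`, `cost`, and an optional `credits` array of `{ amount }` entries, where credits are negative. Fetching the rows is out of scope, and `costByService` is a hypothetical helper.

```javascript
// Sketch: net cost per service from billing-export rows (already fetched).
function costByService(rows) {
  const totals = new Map();
  for (const row of rows) {
    const service = row.service?.description ?? "(unknown)";
    // Credits are stored as negative amounts, so adding them nets them out.
    const credits = (row.credits ?? []).reduce((sum, c) => sum + c.amount, 0);
    totals.set(service, (totals.get(service) ?? 0) + row.cost + credits);
  }
  // Largest spend first; round to cents for display only.
  return [...totals.entries()]
    .map(([service, net]) => ({ service, net: Math.round(net * 100) / 100 }))
    .sort((a, b) => b.net - a.net);
}
```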

Security Monitoring

Audit Logs

Cloud Audit Logs track admin and data access:

# View admin activity
gcloud logging read "logName:activity" --limit=50

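# View data access (Data Access audit logs are disabled by default for most
# services; enable them under IAM & Admin → Audit Logs)
gcloud logging read "logName:data_access" --limit=50

# Filter by user
gcloud logging read 'protoPayload.authenticationInfo.principalEmail="admin@precisionsiteservices.com"' --limit=50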

Security Command Center

Enable Security Command Center for:

Vulnerability scanning
Anomaly detection
Security health analytics
Web Security Scanner

Performance Optimization

Identify Slow Endpoints

# Find requests > 2 seconds
gcloud logging read '
  resource.type="cloud_run_revision"
  resource.labels.service_name="forge-bids-backend"
  httpRequest.latency>"2s"
' --limit=20 --format=json
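The JSON output can be summarized to show which paths are slow. This sketch assumes each entry carries `httpRequest.latency` as a duration string such as `"2.345s"` and `httpRequest.requestUrl`, as Cloud Run request logs do. `slowestPaths` is a hypothetical helper.

```javascript
// Sketch: rank request paths by worst latency from `gcloud logging read ... --format=json`.
function parseLatencySeconds(latency) {
  // Duration strings look like "2.345678s".
  const match = /^(\d+(?:\.\d+)?)s$/.exec(latency ?? "");
  return match ? Number(match[1]) : NaN;
}

function slowestPaths(entries, limit = 5) {
  const worst = new Map();
  for (const entry of entries) {
    const req = entry.httpRequest;
    if (!req?.requestUrl) continue; // not a request log
    const seconds = parseLatencySeconds(req.latency);
    if (Number.isNaN(seconds)) continue;
    // Group by path only, so query strings do not split the buckets.
    const path = new URL(req.requestUrl).pathname;
    worst.set(path, Math.max(worst.get(path) ?? 0, seconds));
  }
  return [...worst.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([path, seconds]) => ({ path, seconds }));
}
```

Paths that embed IDs (such as `/api/bids/:id`) still produce one bucket per ID. Normalizing those segments is left to the reader.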

Database Query Analysis

Enable Cloud SQL Query Insights:

gcloud sql instances patch forge-postgres-prod \
  --insights-config-query-insights-enabled \
  --insights-config-query-string-length=1024 \
  --insights-config-record-application-tags

View slow queries in Cloud SQL → Query Insights.

Status Page

Create Public Status Page

Use a service like status.io or Statuspage to show system status:

Components:

Portal (forge.precisionsiteservices.com)
Bids Service
Projects Service (Phase 2)
Field Service (Phase 3)
Authentication (SuperTokens)
Database (Cloud SQL)

Incidents:

Automated via Cloud Monitoring webhooks
Manual incident creation
Scheduled maintenance windows
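Automated incidents usually need a small receiver that turns Cloud Monitoring webhook notifications into status-page updates. This sketch assumes the webhook payload's `incident.policy_name`, `incident.state` (`"open"` or `"closed"`), and `incident.summary` fields. The policy-to-component mapping and `toStatusUpdate` are hypothetical, and posting the result to the status provider's API is left out.

```javascript
// Sketch: translate a Cloud Monitoring webhook payload into a component status.
// Hypothetical mapping from alert policy display names to status components.
const COMPONENT_BY_POLICY = {
  "Portal Down": "Portal (forge.precisionsiteservices.com)",
  "High Error Rate - Bids Backend": "Bids Service",
  "High Database Connections": "Database (Cloud SQL)",
};

function toStatusUpdate(payload) {
  const incident = payload?.incident;
  if (!incident) return null; // not an incident notification
  const component = COMPONENT_BY_POLICY[incident.policy_name];
  if (!component) return null; // policy not tied to a public component
  return {
    component,
    status: incident.state === "open" ? "degraded" : "operational",
    summary: incident.summary ?? "",
  };
}
```

Returning `null` for unmapped policies keeps internal-only alerts (for example, the log-based error policies) off the public page.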

Troubleshooting Dashboard

Quick links for common issues:

Cloud Run Logs
Error Reporting
Cloud SQL
Cloud Trace
Load Balancer
Cloud Monitoring

Next Steps

Database Backups
GCP Setup
Development Guide
Environment Variables