
Monitoring & Observability

Monitor ForgeX services with Google Cloud's integrated observability platform and custom dashboards.

Current Production Configuration

This section documents what's actually deployed in GCP today. The rest of this page is a how-to reference for adding more — but read this first to understand the baseline.

Deployed Alert Policies

Both policies live in Cloud Monitoring under project forge-475221. Each policy has an embedded runbook in its documentation.content field — open the policy in the Console to see investigation commands and likely-cause checklists.

Policy | Trigger | Backing Metric | Notifies
Cloud SQL — Real ERROR rate (forge-postgres-prod) | > 3 filtered errors per 5 min, auto-close 30 min | logging.googleapis.com/user/forge_cloudsql_real_errors | Email + GCP mobile push
Cloud Run — Real ERROR rate (forge-bids-backend, forge-supertokens) | > 5 filtered errors per 5 min, auto-close 30 min | logging.googleapis.com/user/forge_cloudrun_real_errors | Email + GCP mobile push

Log-Based Metrics (the "real errors" filters)

Both metrics deliberately exclude known-benign log entries so the alerts only fire on actual incidents:

Metric | Excludes (benign noise)
forge_cloudsql_real_errors | tenants_pkey duplicates, could not serialize access (both are SuperTokens bootstrap chatter that fires once per cold start of the SuperTokens core)
forge_cloudrun_real_errors | cloudaudit.googleapis.com metadata entries (admin activity, not application errors)

If you ever see one of these benign messages start firing alerts again, do not add it as a new filter without understanding why — these filters are the result of root-cause investigation, not pattern-matching. The SuperTokens errors specifically only fire on cold starts, which is why we also keep both services warm with --min-instances=1 (see below).
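To make the intent of the exclusions concrete, here is a minimal sketch of the same decision expressed as plain JavaScript. It is not the deployed metric filter: the function names, the entry shape (severity, textPayload, logName), and the choice to match on severity ERROR exactly are assumptions for illustration. Only the excluded patterns come from the list above.

```javascript
// Hypothetical mirror of the "real errors" exclusions, for local reasoning only.
// The patterns come from this page; the log entry shape is an assumption.
const BENIGN_CLOUDSQL_PATTERNS = [/tenants_pkey/, /could not serialize access/];

function isRealCloudSqlError(entry) {
  if (entry.severity !== 'ERROR') return false;
  const text = entry.textPayload || '';
  // A Cloud SQL error counts only if it matches none of the benign patterns.
  return !BENIGN_CLOUDSQL_PATTERNS.some((pattern) => pattern.test(text));
}

function isRealCloudRunError(entry) {
  if (entry.severity !== 'ERROR') return false;
  // Assumption: audit entries are recognized by their cloudaudit.googleapis.com log name.
  return !(entry.logName || '').includes('cloudaudit.googleapis.com');
}
```

Running candidate log lines through a check like this before touching the metric filter helps confirm whether a new message is really benign or only resembles an excluded one.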

Cold-Start Prevention

forge-bids-backend and forge-supertokens are both configured with --min-instances=1. This is intentional and load-bearing:

  • The problem it solves: without warm instances, the auth chain (bids-backend → supertokens core → Cloud SQL) produces HTTP 500s on /api/auth/session/refresh whenever a user is the first to hit auth after an idle period. Empirically, this fired ~20 times in 14 days.
  • Cost: approximately $5–10/month per service. See COMPLETE_DEPLOYMENT_GUIDE.md → "Scale-to-zero policy" for the full math.
  • Verify the setting is present:
gcloud run services describe forge-supertokens --region=us-south1 \
--format=yaml | grep minScale
gcloud run services describe forge-bids-backend --region=us-south1 \
--format=yaml | grep minScale

Both should return autoscaling.knative.dev/minScale: '1'.
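If you would rather check this in a script than by eye, the sketch below reads the same annotation from the JSON form of the describe output (--format=json instead of --format=yaml). The function names are hypothetical, and treating a missing annotation as 0 assumes the Cloud Run default of scale-to-zero.

```javascript
// Sketch: inspect the object printed by
//   gcloud run services describe SERVICE --region=us-south1 --format=json
// getMinScale and isKeptWarm are hypothetical helper names.
function getMinScale(service) {
  const annotations = service?.spec?.template?.metadata?.annotations ?? {};
  const raw = annotations['autoscaling.knative.dev/minScale'];
  // Assumption: no annotation means the service can scale to zero.
  return raw === undefined ? 0 : Number(raw);
}

function isKeptWarm(service) {
  return getMinScale(service) >= 1;
}
```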

Sentry Integration State

Sentry is enabled for forgex-portal-frontend, forgex-bids-frontend, and forgex-bids-backend. Post-deploy verification happens via Sentry Logs, not Sentry Issues — each SDK emits a logger.info startup entry that you can find in Sentry → Explore → Logs filtered by service:portal-frontend (or the equivalent for the other services). The previous captureMessage ping pattern was removed because it created self-regressing Sentry Issues on every cold start.

Quick Health Checks

# Bids API
curl https://bids.precisionsiteservices.com/api/health

# Expected: {"status":"ok","timestamp":"...","service":"bids"}

# Projects API (Phase 2)
curl https://projects.precisionsiteservices.com/api/health

# Field API (Phase 3)
curl https://field.precisionsiteservices.com/api/health
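The same checks can be scripted. The sketch below assumes Node.js 18+ (for the global fetch) and assumes that the Projects and Field endpoints return the same {"status":"ok"} shape shown for Bids, which this page does not confirm. evaluateHealth and checkHealth are hypothetical names.

```javascript
// Sketch: decide whether a health response counts as healthy.
function evaluateHealth(httpStatus, body) {
  if (httpStatus < 200 || httpStatus >= 300) {
    return { healthy: false, reason: `HTTP ${httpStatus}` };
  }
  if (!body || body.status !== 'ok') {
    return { healthy: false, reason: `status=${body ? body.status : 'missing'}` };
  }
  return { healthy: true, reason: null };
}

// Fetch one endpoint and evaluate it; network failures count as unhealthy.
async function checkHealth(url, fetchImpl = fetch) {
  try {
    const res = await fetchImpl(url);
    const body = await res.json().catch(() => null);
    return { url, ...evaluateHealth(res.status, body) };
  } catch (err) {
    return { url, healthy: false, reason: err.message };
  }
}

// Usage:
// checkHealth('https://bids.precisionsiteservices.com/api/health').then(console.log);
```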

Cloud Run Logs

View Real-Time Logs

# Bids backend (last 50 lines)
gcloud run services logs read forge-bids-backend --region us-south1 --limit 50

# Follow logs (tail -f style)
gcloud run services logs tail forge-bids-backend --region us-south1

# Filter by severity
gcloud run services logs read forge-bids-backend --region us-south1 \
--log-filter="severity>=ERROR"

Structured Logging

ForgeX uses structured JSON logging for easy parsing:

// Backend logging format
console.log(JSON.stringify({
  timestamp: new Date().toISOString(),
  level: 'INFO',
  service: 'bids-backend',
  message: 'User logged in',
  userId: user.id,
  email: user.email,
  ip: req.ip
}));

Cloud Logging automatically parses JSON logs and indexes fields for searching.
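One detail worth knowing: for JSON written to stdout on Cloud Run, Cloud Logging takes the entry's severity from a top-level severity field, while other keys (such as level) are kept as ordinary jsonPayload fields. The sketch below is a hypothetical helper, not the existing ForgeX logger, that emits severity so that filters like severity>=ERROR match application logs. The names buildLogEntry and log are assumptions.

```javascript
// Sketch: build a structured entry whose `severity` field Cloud Logging
// maps onto the log entry's severity.
function buildLogEntry(severity, message, fields = {}) {
  return {
    timestamp: new Date().toISOString(),
    severity, // e.g. 'INFO', 'WARNING', 'ERROR'
    service: 'bids-backend',
    message,
    ...fields, // extra fields end up under jsonPayload
  };
}

function log(severity, message, fields) {
  console.log(JSON.stringify(buildLogEntry(severity, message, fields)));
}

// Usage
log('ERROR', 'Session refresh failed', { userId: 'user-123' });
```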

Search Logs in Console

  1. Go to Cloud Logging
  2. Filter by resource:
    resource.type="cloud_run_revision"
    resource.labels.service_name="forge-bids-backend"
  3. Search by severity:
    severity>=ERROR
  4. Search by custom fields:
    jsonPayload.userId="user-123"

Metrics & Dashboards

Cloud Run Metrics

Key metrics available in Cloud Monitoring:

Metric | Metric type | What it shows
Request Count | run.googleapis.com/request_count | Total requests per service
Request Latency | run.googleapis.com/request_latencies | P50, P95, P99 latencies
Instance Count | run.googleapis.com/container/instance_count | Active container instances
CPU Utilization | run.googleapis.com/container/cpu/utilizations | CPU usage per instance
Memory Utilization | run.googleapis.com/container/memory/utilizations | Memory usage per instance
Billable Time | run.googleapis.com/container/billable_instance_time | Cost tracking

Custom Dashboard

Create a unified dashboard for all services:

  1. Open Cloud Monitoring
  2. Create Dashboard: click "Create Dashboard" and name it "ForgeX Production"
  3. Add Charts: add a chart for each metric

    Request Rate:
    resource.type="cloud_run_revision"
    metric.type="run.googleapis.com/request_count"

    Error Rate:
    resource.type="cloud_run_revision"
    metric.type="run.googleapis.com/request_count"
    metric.label.response_code_class="5xx"

    Latency P95:
    resource.type="cloud_run_revision"
    metric.type="run.googleapis.com/request_latencies"
    aggregation: 95th percentile

  4. Save Dashboard: save and pin to your GCP Console home

Example Dashboard JSON

dashboard-forgex.json
{
  "displayName": "ForgeX Production",
  "mosaicLayout": {
    "columns": 12,
    "tiles": [
      {
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Request Rate (all services)",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              }
            }]
          }
        }
      },
      {
        "xPos": 6,
        "width": 6,
        "height": 4,
        "widget": {
          "title": "Error Rate (5xx)",
          "xyChart": {
            "dataSets": [{
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE"
                  }
                }
              }
            }]
          }
        }
      }
    ]
  }
}
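The example JSON covers request rate and 5xx rate but not the Latency P95 chart listed in the steps. As a sketch, a third tile could be generated like this and appended to the tiles array. buildLatencyTile is a hypothetical helper, and the tile position and title are assumptions; ALIGN_PERCENTILE_95 is the aligner Cloud Monitoring provides for distribution metrics such as request_latencies.

```javascript
// Sketch: build a mosaic tile that charts P95 request latency.
function buildLatencyTile({ xPos = 0, yPos = 4, title = 'Latency P95 (all services)' } = {}) {
  return {
    xPos,
    yPos, // placed on the row below the two 4-high tiles (assumed layout)
    width: 6,
    height: 4,
    widget: {
      title,
      xyChart: {
        dataSets: [{
          timeSeriesQuery: {
            timeSeriesFilter: {
              filter: 'resource.type="cloud_run_revision" metric.type="run.googleapis.com/request_latencies"',
              aggregation: {
                alignmentPeriod: '60s',
                perSeriesAligner: 'ALIGN_PERCENTILE_95',
              },
            },
          },
        }],
      },
    },
  };
}

// Print the tile so it can be pasted into dashboard-forgex.json.
console.log(JSON.stringify(buildLatencyTile(), null, 2));
```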

Alerts

Create Alert Policies

Set up alerts for critical conditions:

Alert on sustained 5xx responses from the bids backend. The filter here selects the 5xx request count only; to alert on 5xx as a percentage of all requests (the 5% in the condition name), the condition also needs a denominator filter, which is set in a policy file rather than through these flags:

gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High Error Rate - Bids Backend" \
  --condition-display-name="Error rate > 5%" \
  --if="> 0.05" \
  --duration=300s \
  --condition-filter='resource.type="cloud_run_revision" AND
    resource.label.service_name="forge-bids-backend" AND
    metric.type="run.googleapis.com/request_count" AND
    metric.label.response_code_class="5xx"'
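Because the filter in this command matches only 5xx requests, a true "5% of requests" alert needs a ratio condition: 5xx rate divided by total request rate. The sketch below generates a policy file with a denominatorFilter for use with gcloud alpha monitoring policies create --policy-from-file=policy.json. buildErrorRatioPolicy is a hypothetical helper, and the 300s alignment period and grouping by service name are assumptions, not the deployed configuration.

```javascript
// Sketch: build an alert policy whose condition is (5xx rate) / (total rate).
function buildErrorRatioPolicy({ serviceName, threshold = 0.05, notificationChannels = [] }) {
  const base =
    'resource.type="cloud_run_revision" AND ' +
    `resource.label.service_name="${serviceName}" AND ` +
    'metric.type="run.googleapis.com/request_count"';
  // Numerator and denominator use the same aggregation so the ratio is well defined.
  const aggregation = {
    alignmentPeriod: '300s',
    perSeriesAligner: 'ALIGN_RATE',
    crossSeriesReducer: 'REDUCE_SUM',
    groupByFields: ['resource.label.service_name'],
  };
  return {
    displayName: `High Error Rate - ${serviceName}`,
    combiner: 'OR',
    notificationChannels,
    conditions: [{
      displayName: `5xx ratio > ${Math.round(threshold * 100)}%`,
      conditionThreshold: {
        filter: `${base} AND metric.label.response_code_class="5xx"`,
        aggregations: [aggregation],
        denominatorFilter: base,
        denominatorAggregations: [aggregation],
        comparison: 'COMPARISON_GT',
        thresholdValue: threshold,
        duration: '300s',
      },
    }],
  };
}

// Print the policy; redirect to policy.json to use it with --policy-from-file.
console.log(JSON.stringify(buildErrorRatioPolicy({ serviceName: 'forge-bids-backend' }), null, 2));
```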

Notification Channels

Set up notification channels for alerts:

  1. Email Notifications

gcloud alpha monitoring channels create \
  --display-name="Ops Team Email" \
  --type=email \
  --channel-labels=email_address=ops@precisionsiteservices.com

  2. Slack Notifications

    1. Create Slack webhook: https://api.slack.com/messaging/webhooks
    2. Add webhook to Cloud Monitoring:

gcloud alpha monitoring channels create \
  --display-name="Ops Slack" \
  --type=slack \
  --channel-labels=url=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

  3. SMS Notifications

gcloud alpha monitoring channels create \
  --display-name="On-Call Phone" \
  --type=sms \
  --channel-labels=number=+12819391377

Uptime Monitoring

Create Uptime Checks

Monitor endpoint availability:

# Portal uptime check (period is in minutes)
gcloud monitoring uptime create forge-portal-uptime \
  --resource-type=uptime-url \
  --resource-labels=host=forge.precisionsiteservices.com,project_id=forge-475221 \
  --protocol=https \
  --path=/ \
  --period=1

# Bids API uptime check
gcloud monitoring uptime create forge-bids-api-uptime \
  --resource-type=uptime-url \
  --resource-labels=host=bids.precisionsiteservices.com,project_id=forge-475221 \
  --protocol=https \
  --path=/api/health \
  --period=1

Uptime Check Alerts

Automatically alert on uptime check failures:

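# check_id must match the ID printed when the uptime check was created;
# it may carry a generated suffix after the display name.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="Portal Down" \
  --condition-display-name="Uptime check failed" \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_FRACTION_TRUE"}' \
  --if="< 1" \
  --duration=60s \
  --condition-filter='metric.type="monitoring.googleapis.com/uptime_check/check_passed" AND
    metric.label.check_id="forge-portal-uptime" AND
    resource.type="uptime_url"'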

Application Performance Monitoring (APM)

Error Tracking

Cloud Error Reporting automatically groups errors:

# View errors
gcloud error-reporting events list --service=forge-bids-backend

# View error details
gcloud error-reporting events list --service=forge-bids-backend --time-range=1d

In GCP Console:

  1. Go to Error Reporting
  2. Filter by service
  3. View stack traces and occurrence counts

Trace Analysis

Cloud Trace shows request flow across services:

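// Start the trace agent before loading other modules so it can instrument them.
// start() returns the tracer object itself.
const tracer = require('@google-cloud/trace-agent').start();

// `app` and `db` are the application's existing HTTP app and database client.
app.get('/api/bids/:id', async (req, res) => {
  // Record the database lookup as a child span of the current request trace
  const span = tracer.createChildSpan({ name: 'getBid' });
  try {
    const bid = await db.bid.findUnique({ where: { id: req.params.id } });
    res.json(bid);
  } finally {
    // Always close the span, even if the lookup throws
    span.endSpan();
  }
});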

View traces in Cloud Trace Console.

Cost Monitoring

Set Budget Alerts

  1. Create Budget
  2. Set Threshold
    • Budget Amount: $100/month (adjust as needed)
    • Alerts at: 50%, 75%, 90%, 100%
  3. Add Recipients
    Email: billing@precisionsiteservices.com
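For reference, the thresholds above translate into dollar amounts as in the short sketch below. The helper names are hypothetical; the $100 budget and the percentages come from the steps above.

```javascript
// Sketch: convert budget alert percentages into dollar amounts.
const BUDGET_USD = 100;
const ALERT_PERCENTS = [50, 75, 90, 100];

function alertAmounts(budget = BUDGET_USD, percents = ALERT_PERCENTS) {
  return percents.map((percent) => ({ percent, amountUsd: (budget * percent) / 100 }));
}

// Which alert thresholds has a given month-to-date spend already crossed?
function crossedThresholds(spendUsd, budget = BUDGET_USD, percents = ALERT_PERCENTS) {
  return percents.filter((percent) => spendUsd >= (budget * percent) / 100);
}

console.log(alertAmounts());
console.log(crossedThresholds(80));
```

With a $100 budget, the alerts fire at $50, $75, $90, and $100 of spend, so a month-to-date spend of $80 has crossed the 50% and 75% thresholds.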

Cost Breakdown

Track costs by service:

# Export billing data
gcloud beta billing export describe

# View cost trends
gcloud beta billing accounts describe BILLING_ACCOUNT_ID --format=json

In GCP Console:

  1. Go to Billing → Reports
  2. Group by: Service
  3. Filter by: Cloud Run, Cloud SQL, Cloud Storage

Security Monitoring

Audit Logs

Cloud Audit Logs track admin and data access:

# View admin activity
gcloud logging read "logName:activity" --limit=50

# View data access
gcloud logging read "logName:data_access" --limit=50

# Filter by user
gcloud logging read 'protoPayload.authenticationInfo.principalEmail="admin@precisionsiteservices.com"'

Security Command Center

Enable Security Command Center for:

  • Vulnerability scanning
  • Anomaly detection
  • Security health analytics
  • Web Security Scanner

Performance Optimization

Identify Slow Endpoints

# Find requests > 2 seconds
gcloud logging read '
resource.type="cloud_run_revision"
resource.labels.service_name="forge-bids-backend"
httpRequest.latency>"2s"
' --limit=20 --format=json
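The JSON output of that query can be summarized per endpoint. The sketch below assumes each entry carries httpRequest.requestUrl and an httpRequest.latency string such as "2.345s", which is how Cloud Run request logs normally appear. parseLatencySeconds and slowestEndpoints are hypothetical helpers.

```javascript
// Sketch: parse a duration string such as "2.345s" into seconds.
function parseLatencySeconds(latency) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(latency || '');
  return match ? Number(match[1]) : null;
}

// Rank request paths by their worst observed latency.
function slowestEndpoints(entries, top = 5) {
  const worstByPath = new Map();
  for (const entry of entries) {
    const req = entry.httpRequest;
    if (!req || !req.requestUrl) continue;
    const seconds = parseLatencySeconds(req.latency);
    if (seconds === null) continue;
    // The base URL only matters if requestUrl happens to be relative.
    const path = new URL(req.requestUrl, 'http://placeholder.invalid').pathname;
    worstByPath.set(path, Math.max(seconds, worstByPath.get(path) ?? 0));
  }
  return [...worstByPath.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, top)
    .map(([path, seconds]) => ({ path, seconds }));
}

// Usage (after saving the gcloud output to slow.json):
// console.log(slowestEndpoints(JSON.parse(require('node:fs').readFileSync('slow.json', 'utf8'))));
```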

Database Query Analysis

Enable Cloud SQL Insights:

gcloud sql instances patch forge-postgres-prod \
--insights-config-query-insights-enabled \
--insights-config-query-string-length=1024 \
--insights-config-record-application-tags

View slow queries in Cloud SQL → Query Insights.

Status Page

Create Public Status Page

Use a service like status.io or Statuspage to show system status:

Components:

  • Portal (forge.precisionsiteservices.com)
  • Bids Service
  • Projects Service (Phase 2)
  • Field Service (Phase 3)
  • Authentication (SuperTokens)
  • Database (Cloud SQL)

Incidents:

  • Automated via Cloud Monitoring webhooks
  • Manual incident creation
  • Scheduled maintenance windows
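As a sketch of the automated path, a small webhook receiver could translate Cloud Monitoring incident notifications into status page updates. The mapping below is an assumption: it keys on the alert policy display names used on this page and assumes the webhook payload exposes incident.policy_name, incident.state ("open" or "closed"), and incident.summary. The status values are placeholders for whatever the chosen status page service expects.

```javascript
// Assumed mapping from alert policy display names to status page components.
const COMPONENT_BY_POLICY = {
  'Portal Down': 'Portal',
  'High Error Rate - Bids Backend': 'Bids Service',
};

// Sketch: turn a Cloud Monitoring webhook payload into a status update,
// or return null when the incident does not map to a known component.
function toStatusUpdate(payload) {
  const incident = payload?.incident;
  if (!incident) return null;
  const component = COMPONENT_BY_POLICY[incident.policy_name];
  if (!component) return null;
  return {
    component,
    status: incident.state === 'closed' ? 'operational' : 'degraded',
    summary: incident.summary ?? '',
  };
}
```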

Troubleshooting Dashboard

Quick links for common issues:

Next Steps