Monitoring & Observability
Monitor ForgeX services with Google Cloud's integrated observability platform and custom dashboards.
Current Production Configuration
This section documents what's actually deployed in GCP today. The rest of this page is a how-to reference for adding more — but read this first to understand the baseline.
Deployed Alert Policies
Both policies live in Cloud Monitoring under project forge-475221. Each policy
has an embedded runbook in its documentation.content field — open the policy
in the Console to see investigation commands and likely-cause checklists.
| Policy | Trigger | Backing Metric | Notifies |
|---|---|---|---|
| Cloud SQL — Real ERROR rate (forge-postgres-prod) | > 3 filtered errors per 5min, auto-close 30min | logging.googleapis.com/user/forge_cloudsql_real_errors | Email + GCP mobile push |
| Cloud Run — Real ERROR rate (forge-bids-backend, forge-supertokens) | > 5 filtered errors per 5min, auto-close 30min | logging.googleapis.com/user/forge_cloudrun_real_errors | Email + GCP mobile push |
Log-Based Metrics (the "real errors" filters)
Both metrics deliberately exclude known-benign log entries so the alerts only fire on actual incidents:
| Metric | Excludes (benign noise) |
|---|---|
| forge_cloudsql_real_errors | tenants_pkey duplicates, could not serialize access (both are SuperTokens bootstrap chatter that fires once per cold start of the SuperTokens core) |
| forge_cloudrun_real_errors | cloudaudit.googleapis.com metadata entries (admin activity, not application errors) |
If you ever see one of these benign messages start firing alerts again, do
not add it as a new filter without understanding why — these filters are
the result of root-cause investigation, not pattern-matching. The SuperTokens
errors specifically only fire on cold starts, which is why we also keep both
services warm with --min-instances=1 (see below).
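The exclusion idea can be sketched as a small predicate. This is only an illustration: the deployed metrics are Cloud Logging filters, and the entry shape (`severity`, `textPayload`) and the name `isRealCloudSqlError` are assumptions made for this example.

```javascript
// Substrings documented above as benign SuperTokens bootstrap chatter.
const BENIGN_CLOUDSQL_PATTERNS = ['tenants_pkey', 'could not serialize access'];

// Hypothetical helper: true when an entry should count toward the
// "real errors" metric. The entry shape is an assumption for this sketch.
function isRealCloudSqlError(entry) {
  if (entry.severity !== 'ERROR') return false;
  const text = entry.textPayload || '';
  return !BENIGN_CLOUDSQL_PATTERNS.some((pattern) => text.includes(pattern));
}

const sampleEntries = [
  { severity: 'ERROR', textPayload: 'duplicate key value violates unique constraint "tenants_pkey"' },
  { severity: 'ERROR', textPayload: 'could not serialize access due to concurrent update' },
  { severity: 'ERROR', textPayload: 'connection refused' },
  { severity: 'INFO', textPayload: 'checkpoint complete' },
];

// Only the "connection refused" entry survives the exclusions.
const realErrors = sampleEntries.filter(isRealCloudSqlError);
console.log(realErrors.length); // 1
```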
Cold-Start Prevention
forge-bids-backend and forge-supertokens are both configured with
--min-instances=1. This is intentional and load-bearing:
- The problem it solves: without warm instances, the auth chain
(bids-backend → supertokens core → Cloud SQL) produces HTTP 500s on
/api/auth/session/refresh whenever a user is the first to hit auth after an idle period. Empirically, this fired ~20 times in 14 days.
- Cost: approximately $5–10/month per service. See COMPLETE_DEPLOYMENT_GUIDE.md → "Scale-to-zero policy" for the full math.
- Verify the setting is present:
gcloud run services describe forge-supertokens --region=us-south1 \
--format=yaml | grep minScale
gcloud run services describe forge-bids-backend --region=us-south1 \
--format=yaml | grep minScale
Both should return autoscaling.knative.dev/minScale: '1'.
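If the check needs to run in a script rather than by eye, the same annotation can be read from the JSON form of the describe output. A minimal sketch, assuming the output of `--format=json` has been parsed into an object and that the annotation sits on the revision template; `getMinScale` is a hypothetical helper name.

```javascript
// Hypothetical helper: reads minScale from a parsed
// `gcloud run services describe SERVICE --format=json` object.
// A missing annotation is treated as 0, i.e. scale-to-zero.
function getMinScale(service) {
  const annotations = service?.spec?.template?.metadata?.annotations ?? {};
  const raw = annotations['autoscaling.knative.dev/minScale'];
  return raw === undefined ? 0 : Number(raw);
}

// Trimmed-down example of the structure this sketch expects.
const described = {
  spec: {
    template: {
      metadata: { annotations: { 'autoscaling.knative.dev/minScale': '1' } },
    },
  },
};

console.log(getMinScale(described) >= 1 ? 'warm' : 'scale-to-zero'); // warm
```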
Sentry Integration State
Sentry is enabled for forgex-portal-frontend, forgex-bids-frontend, and
forgex-bids-backend. Post-deploy verification happens via Sentry Logs,
not Sentry Issues — each SDK emits a logger.info startup entry that you
can find in Sentry → Explore → Logs filtered by service:portal-frontend
(or the equivalent for the other services). The previous captureMessage
ping pattern was removed because it created self-regressing Sentry Issues
on every cold start.
Quick Health Checks
Backend Health
# Bids API
curl https://bids.precisionsiteservices.com/api/health
# Expected: {"status":"ok","timestamp":"...","service":"bids"}
# Projects API (Phase 2)
curl https://projects.precisionsiteservices.com/api/health
# Field API (Phase 3)
curl https://field.precisionsiteservices.com/api/health
SuperTokens Health
# SuperTokens Core
curl https://forge-supertokens-45561947981.us-south1.run.app/
# Expected: Hello
# API Version
curl https://forge-supertokens-45561947981.us-south1.run.app/apiversion
# Expected: {"versions":["2.0","3.0",...]}
Frontend Status
# Portal
curl -I https://forge.precisionsiteservices.com
# Bids
curl -I https://bids.precisionsiteservices.com
# Expected: HTTP/2 200
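For scripted checks, the documented health response shape can be validated in code. This is a sketch under assumptions: `isHealthy` and `checkHealth` are hypothetical names, the timestamp is only checked to be a string because its exact format is not documented here, and `checkHealth` relies on the global `fetch` available in Node 18+.

```javascript
// Hypothetical helper: checks a parsed /api/health body against the
// documented shape {"status":"ok","timestamp":"...","service":"bids"}.
function isHealthy(body, expectedService) {
  return (
    body !== null &&
    typeof body === 'object' &&
    body.status === 'ok' &&
    body.service === expectedService &&
    typeof body.timestamp === 'string'
  );
}

// Fetches a health endpoint and validates the body (Node 18+ global fetch).
async function checkHealth(url, expectedService) {
  const res = await fetch(url);
  if (!res.ok) return false;
  return isHealthy(await res.json(), expectedService);
}
```

A call such as `checkHealth('https://bids.precisionsiteservices.com/api/health', 'bids')` resolves to `true` or `false`, which makes it easy to wire into a cron job or a post-deploy step.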
Cloud Run Logs
View Real-Time Logs
# Bids backend (last 50 lines)
gcloud run services logs read forge-bids-backend --region us-south1 --limit 50
# Follow logs (tail -f style)
gcloud run services logs tail forge-bids-backend --region us-south1
# Filter by severity
gcloud run services logs read forge-bids-backend --region us-south1 \
--log-filter="severity>=ERROR"
Structured Logging
ForgeX uses structured JSON logging for easy parsing:
// Backend logging format
console.log(JSON.stringify({
timestamp: new Date().toISOString(),
level: 'INFO',
service: 'bids-backend',
message: 'User logged in',
userId: user.id,
email: user.email,
ip: req.ip
}));
Cloud Logging automatically parses JSON logs and indexes fields for searching.
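One detail worth checking: Cloud Logging takes an entry's severity from a `severity` field in the JSON line, so a `level` key like the one above is stored as an ordinary `jsonPayload.level` field and does not affect `severity>=ERROR` filters. A minimal sketch of a helper that emits `severity` instead; `formatLogEntry` is a hypothetical name, not an existing ForgeX function.

```javascript
// Hypothetical helper: builds one structured log line for stdout.
// `severity` is read by Cloud Logging; other keys land under jsonPayload.
function formatLogEntry(severity, message, fields = {}) {
  return JSON.stringify({
    severity,
    message,
    timestamp: new Date().toISOString(),
    service: 'bids-backend',
    ...fields,
  });
}

console.log(formatLogEntry('INFO', 'User logged in', { userId: 'user-123' }));
console.log(formatLogEntry('ERROR', 'Session refresh failed', { route: '/api/auth/session/refresh' }));
```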
Search Logs in Console
- Go to Cloud Logging
- Filter by resource:
resource.type="cloud_run_revision"
resource.labels.service_name="forge-bids-backend"
- Search by severity:
severity>=ERROR
- Search by custom fields:
jsonPayload.userId="user-123"
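The three kinds of clauses above combine into a single query, since the Logging query language treats each line as an additional AND condition. A small sketch of assembling such a query in code; `buildLogFilter` and its option names are hypothetical.

```javascript
// Hypothetical helper: joins resource, severity, and custom-field clauses
// into one Cloud Logging query, one clause per line (implicitly ANDed).
function buildLogFilter({ service, minSeverity, fields = {} }) {
  const clauses = [
    'resource.type="cloud_run_revision"',
    `resource.labels.service_name="${service}"`,
  ];
  if (minSeverity) clauses.push(`severity>=${minSeverity}`);
  for (const [key, value] of Object.entries(fields)) {
    clauses.push(`jsonPayload.${key}="${value}"`);
  }
  return clauses.join('\n');
}

console.log(buildLogFilter({
  service: 'forge-bids-backend',
  minSeverity: 'ERROR',
  fields: { userId: 'user-123' },
}));
```

The resulting string can be pasted into the Logs Explorer query box or passed to `gcloud logging read`.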
Metrics & Dashboards
Cloud Run Metrics
Key metrics available in Cloud Monitoring:
| Metric | Metric Type | What It Shows |
|---|---|---|
| Request Count | run.googleapis.com/request_count | Total requests per service |
| Request Latency | run.googleapis.com/request_latencies | P50, P95, P99 latencies |
| Instance Count | run.googleapis.com/container/instance_count | Active container instances |
| CPU Utilization | run.googleapis.com/container/cpu/utilizations | CPU usage per instance |
| Memory Utilization | run.googleapis.com/container/memory/utilizations | Memory usage per instance |
| Billable Time | run.googleapis.com/container/billable_instance_time | Cost tracking |
Custom Dashboard
Create a unified dashboard for all services:
- Navigate to Monitoring → Dashboards
- Click "Create Dashboard" → Name it "ForgeX Production"
- Add charts for each metric:
Request Rate:
resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_count"
Error Rate:
resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_count"
metric.label.response_code_class="5xx"
Latency P95:
resource.type="cloud_run_revision"
metric.type="run.googleapis.com/request_latencies"
aggregation: 95th percentile
- Save and pin to your GCP Console home
Example Dashboard JSON
dashboard-forgex.json
{
"displayName": "ForgeX Production",
"mosaicLayout": {
"columns": 12,
"tiles": [
{
"width": 6,
"height": 4,
"widget": {
"title": "Request Rate (all services)",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
}
},
{
"xPos": 6,
"width": 6,
"height": 4,
"widget": {
"title": "Error Rate (5xx)",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"timeSeriesFilter": {
"filter": "resource.type=\"cloud_run_revision\" metric.type=\"run.googleapis.com/request_count\" metric.label.response_code_class=\"5xx\"",
"aggregation": {
"alignmentPeriod": "60s",
"perSeriesAligner": "ALIGN_RATE"
}
}
}
}]
}
}
}
]
}
}
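Because each tile repeats the same nesting, the JSON can also be generated rather than hand-edited. A sketch that mirrors the structure of dashboard-forgex.json; `makeRateTile` is a hypothetical helper, and the explicit `xPos` values are an assumption that the two tiles should sit side by side on the 12-column grid.

```javascript
// Hypothetical helper: builds one rate-chart tile in the mosaic layout.
function makeRateTile(title, filter, xPos = 0) {
  return {
    xPos,
    width: 6,
    height: 4,
    widget: {
      title,
      xyChart: {
        dataSets: [{
          timeSeriesQuery: {
            timeSeriesFilter: {
              filter,
              aggregation: { alignmentPeriod: '60s', perSeriesAligner: 'ALIGN_RATE' },
            },
          },
        }],
      },
    },
  };
}

const baseFilter =
  'resource.type="cloud_run_revision" metric.type="run.googleapis.com/request_count"';

const dashboard = {
  displayName: 'ForgeX Production',
  mosaicLayout: {
    columns: 12,
    tiles: [
      makeRateTile('Request Rate (all services)', baseFilter, 0),
      makeRateTile('Error Rate (5xx)', `${baseFilter} metric.label.response_code_class="5xx"`, 6),
    ],
  },
};

// Write this output to dashboard-forgex.json.
console.log(JSON.stringify(dashboard, null, 2));
```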
Alerts
Create Alert Policies
Set up alerts for critical conditions:
- High Error Rate
- High Latency
- Service Down
- Database Connections
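Alert when the 5xx request rate stays above 0.05 requests/second for 5 minutes. (A true "5% of requests" ratio needs a denominator filter on total requests, which has to be supplied through --policy-from-file rather than these flags.)
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="High Error Rate - Bids Backend" \
--combiner=OR \
--condition-display-name="5xx rate > 0.05/s" \
--if='> 0.05' \
--duration=300s \
--aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}' \
--condition-filter='resource.type="cloud_run_revision" AND
resource.label.service_name="forge-bids-backend" AND
metric.type="run.googleapis.com/request_count" AND
metric.label.response_code_class="5xx"'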
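A 5% error rate is a ratio of 5xx responses to all responses over the same window, which is different from a raw 5xx count. As a worked example of that arithmetic (the function names are hypothetical and the request counts are made up for illustration):

```javascript
// Fraction of requests in a window that returned 5xx.
function errorRatio(totalRequests, serverErrors) {
  if (totalRequests === 0) return 0; // no traffic, nothing to alert on
  return serverErrors / totalRequests;
}

// True when the ratio is strictly above the threshold (default 5%).
function breachesThreshold(totalRequests, serverErrors, threshold = 0.05) {
  return errorRatio(totalRequests, serverErrors) > threshold;
}

console.log(breachesThreshold(1000, 60)); // true: 60 / 1000 = 0.06
console.log(breachesThreshold(1000, 50)); // false: 0.05 is not above 0.05
console.log(breachesThreshold(0, 0));     // false: idle service
```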
Alert when P95 latency exceeds 2 seconds (the metric is reported in milliseconds, hence 2000):
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="High Latency - Bids Backend" \
--combiner=OR \
--condition-display-name="P95 latency > 2s" \
--if='> 2000' \
--duration=300s \
--aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_PERCENTILE_95"}' \
--condition-filter='resource.type="cloud_run_revision" AND
resource.label.service_name="forge-bids-backend" AND
metric.type="run.googleapis.com/request_latencies"'
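To make the P95 figure concrete, here is a nearest-rank percentile over raw samples. Cloud Monitoring estimates percentiles from distribution buckets, so its numbers will differ slightly; this sketch only illustrates the definition, and the latency samples are invented.

```javascript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Invented request latencies in milliseconds.
const latenciesMs = [120, 95, 310, 2400, 180, 150, 130, 2100, 160, 140];

console.log(percentile(latenciesMs, 50)); // 150
console.log(percentile(latenciesMs, 95)); // 2400, above the 2000 ms threshold
```

The median looks healthy while P95 does not, which is why the alert is keyed to the tail rather than the average.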
Alert when no request data arrives for 5 minutes:
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="No Requests - Bids Backend" \
--combiner=OR \
--condition-display-name="Zero requests for 5 min" \
--if='absent' \
--duration=300s \
--condition-filter='resource.type="cloud_run_revision" AND
resource.label.service_name="forge-bids-backend" AND
metric.type="run.googleapis.com/request_count"'
Alert when Cloud SQL connections stay above 80 for 5 minutes. The threshold is an absolute backend count, not a percentage: 80 equals 80% only when max_connections is 100, so adjust it to match the instance.
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="High Database Connections" \
--combiner=OR \
--condition-display-name="Connections > 80" \
--if='> 80' \
--duration=300s \
--condition-filter='resource.type="cloudsql_database" AND
resource.label.database_id="forge-475221:forge-postgres-prod" AND
metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"'
Notification Channels
Set up notification channels for alerts:
Email
gcloud alpha monitoring channels create \
--display-name="Ops Team Email" \
--type=email \
--channel-labels=email_address=ops@precisionsiteservices.com
Slack
- Create Slack webhook: https://api.slack.com/messaging/webhooks
- Add webhook to Cloud Monitoring:
gcloud alpha monitoring channels create \
--display-name="Ops Slack" \
--type=slack \
--channel-labels=url=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
SMS
gcloud alpha monitoring channels create \
--display-name="On-Call Phone" \
--type=sms \
--channel-labels=number=+12819391377
Uptime Monitoring
Create Uptime Checks
Monitor endpoint availability:
# Portal uptime check (period is in minutes)
gcloud monitoring uptime create forge-portal-uptime \
--resource-type=uptime-url \
--resource-labels=host=forge.precisionsiteservices.com,project_id=forge-475221 \
--protocol=https \
--path=/ \
--period=1
# Bids API uptime check
gcloud monitoring uptime create forge-bids-api-uptime \
--resource-type=uptime-url \
--resource-labels=host=bids.precisionsiteservices.com,project_id=forge-475221 \
--protocol=https \
--path=/api/health \
--period=1
Uptime Check Alerts
Automatically alert on uptime check failures:
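# check_id must match the ID shown by `gcloud monitoring uptime list-configs`,
# which may differ from the display name used at creation time.
# The condition fires when more than one checker location reports a failure.
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="Portal Down" \
--combiner=OR \
--condition-display-name="Uptime check failed" \
--if='> 1' \
--duration=60s \
--aggregation='{"alignmentPeriod": "1200s", "perSeriesAligner": "ALIGN_NEXT_OLDER", "crossSeriesReducer": "REDUCE_COUNT_FALSE", "groupByFields": ["resource.label.*"]}' \
--condition-filter='resource.type="uptime_url" AND
metric.type="monitoring.googleapis.com/uptime_check/check_passed" AND
metric.label.check_id="forge-portal-uptime"'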
Application Performance Monitoring (APM)
Error Tracking
Cloud Error Reporting automatically groups errors:
# View errors
gcloud error-reporting events list --service=forge-bids-backend
# View error details
gcloud error-reporting events list --service=forge-bids-backend --time-range=1d
In GCP Console:
- Go to Error Reporting
- Filter by service
- View stack traces and occurrence counts
Trace Analysis
Cloud Trace shows request flow across services:
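// Start the agent before requiring other modules so it can instrument them.
// start() returns the tracer object itself.
const tracer = require('@google-cloud/trace-agent').start();
const express = require('express');
const app = express();
app.get('/api/bids/:id', async (req, res) => {
  // Custom span nested under the request's automatically created root span
  const span = tracer.createChildSpan({ name: 'getBid' });
  try {
    // db is the application's existing database client
    const bid = await db.bid.findUnique({ where: { id: req.params.id } });
    res.json(bid);
  } finally {
    span.endSpan();
  }
});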
View traces in Cloud Trace Console.
Cost Monitoring
Set Budget Alerts
Go to Billing → Budgets
- Budget Amount: $100/month (adjust as needed)
- Alerts at: 50%, 75%, 90%, 100%
Email: billing@precisionsiteservices.com
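For reference, the dollar amounts behind those percentages can be computed directly. A small sketch; `budgetThresholds` is a hypothetical helper, and the $100 figure is the example budget from the list above.

```javascript
// Converts percentage alert thresholds into dollar amounts.
// Percentages are integers so the division stays exact for round budgets.
function budgetThresholds(monthlyBudget, percents) {
  return percents.map((percent) => ({
    percent,
    amount: (monthlyBudget * percent) / 100,
  }));
}

for (const { percent, amount } of budgetThresholds(100, [50, 75, 90, 100])) {
  console.log(`${percent}% -> $${amount}`);
}
// 50% -> $50
// 75% -> $75
// 90% -> $90
// 100% -> $100
```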
Cost Breakdown
Track costs by service:
# Export billing data
gcloud beta billing export describe
# View cost trends
gcloud beta billing accounts describe BILLING_ACCOUNT_ID --format=json
In GCP Console:
- Go to Billing → Reports
- Group by: Service
- Filter by: Cloud Run, Cloud SQL, Cloud Storage
Security Monitoring
Audit Logs
Cloud Audit Logs track admin and data access:
# View admin activity
gcloud logging read "logName:activity" --limit=50
# View data access
gcloud logging read "logName:data_access" --limit=50
# Filter by user
gcloud logging read 'protoPayload.authenticationInfo.principalEmail="admin@precisionsiteservices.com"'
Security Command Center
Enable Security Command Center for:
- Vulnerability scanning
- Anomaly detection
- Security health analytics
- Web Security Scanner
Performance Optimization
Identify Slow Endpoints
# Find requests > 2 seconds
gcloud logging read '
resource.type="cloud_run_revision"
resource.labels.service_name="forge-bids-backend"
httpRequest.latency>"2s"
' --limit=20 --format=json
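The JSON output of that command can be ranked by latency with a few lines of code. A sketch under assumptions: it expects entries shaped like Cloud Logging's `httpRequest` object, where `latency` is a string such as "2.345s"; the helper names and sample entries are hypothetical.

```javascript
// Parses a Cloud Logging duration string such as "2.345s" into seconds.
function parseLatencySeconds(latency) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(latency);
  if (!match) throw new Error(`unexpected latency format: ${latency}`);
  return Number(match[1]);
}

// Returns the slowest requests first, skipping entries without latency.
function slowestEndpoints(entries, limit = 5) {
  return entries
    .filter((entry) => entry.httpRequest && entry.httpRequest.latency)
    .map((entry) => ({
      url: entry.httpRequest.requestUrl,
      seconds: parseLatencySeconds(entry.httpRequest.latency),
    }))
    .sort((a, b) => b.seconds - a.seconds)
    .slice(0, limit);
}

// Invented sample entries standing in for parsed `--format=json` output.
const logEntries = [
  { httpRequest: { requestUrl: '/api/bids', latency: '2.4s' } },
  { httpRequest: { requestUrl: '/api/bids/42', latency: '3.125s' } },
  { textPayload: 'entry without an httpRequest' },
];

console.log(slowestEndpoints(logEntries));
```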
Database Query Analysis
Enable Cloud SQL Insights:
gcloud sql instances patch forge-postgres-prod \
--insights-config-query-insights-enabled \
--insights-config-query-string-length=1024 \
--insights-config-record-application-tags
View slow queries in Cloud SQL → Query Insights.
Status Page
Create Public Status Page
Use a service like status.io or Statuspage to show system status:
Components:
- Portal (forge.precisionsiteservices.com)
- Bids Service
- Projects Service (Phase 2)
- Field Service (Phase 3)
- Authentication (SuperTokens)
- Database (Cloud SQL)
Incidents:
- Automated via Cloud Monitoring webhooks
- Manual incident creation
- Scheduled maintenance windows
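The webhook automation could look roughly like the mapping below. This is a sketch, not an integration with any particular status-page vendor: it assumes the Cloud Monitoring webhook body carries an `incident` object with `state`, `policy_name`, and `summary`, and the output status names (`degraded`, `operational`) and the function name are placeholders.

```javascript
// Hypothetical mapper: turns a Cloud Monitoring webhook payload into a
// generic status update. Output field names are placeholders to adapt to
// whichever status-page API is chosen.
function toStatusUpdate(payload) {
  const incident = (payload && payload.incident) || {};
  return {
    component: incident.policy_name || 'unknown',
    status: incident.state === 'open' ? 'degraded' : 'operational',
    summary: incident.summary || '',
  };
}

// Invented payload for illustration.
const samplePayload = {
  incident: {
    state: 'open',
    policy_name: 'Cloud Run — Real ERROR rate',
    summary: 'Error rate above threshold',
  },
};

console.log(toStatusUpdate(samplePayload));
```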
Troubleshooting Dashboard
Quick links for common issues:
- Cloud Run Logs: View service logs and metrics
- Error Reporting: Group and track errors
- Cloud SQL: Database performance and connections
- Cloud Trace: Request traces and latency
- Load Balancer: Traffic distribution and health
- Cloud Monitoring: Metrics, alerts, and dashboards