Deploying OpenDSO on GCP with Helm
This guide walks through deploying OpenDSO to Google Kubernetes Engine (GKE) from scratch using Helm. It covers every step: provisioning GCP infrastructure, configuring DNS and TLS, preparing Kubernetes secrets, and deploying the Helm chart.
For other deployment options, see:
- GCP Marketplace — Customer Pre-Deployment Checklist - Preparing a cluster for a Marketplace install
- GCP Marketplace — Package Reference - Technical reference for the Marketplace package
1. Prerequisites
1.1 Required Tools
Install all of the following tools before proceeding.
gcloud CLI
The Google Cloud SDK — required to manage GCP resources.
# Install (Linux/macOS)
curl https://sdk.cloud.google.com | bash
exec -l $SHELL
gcloud init
See the official install guide for Windows or alternative methods.
kubectl
The Kubernetes CLI. The simplest way to install it for use with GKE is as a gcloud component:
gcloud components install kubectl
Or download directly:
# Linux
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
Helm (v3.8+)
The Kubernetes package manager.
# Linux/macOS (script install)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# macOS (Homebrew)
brew install helm
# Windows (Chocolatey)
choco install kubernetes-helm
Verify the version:
helm version
# Should show v3.8.0 or later
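The version check can also be scripted. The following is a minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils and recent macOS); `version_ge` is a hypothetical helper name, not part of Helm.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), which is an assumption about your platform.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage sketch (requires helm on PATH):
#   helm_ver=$(helm version --template '{{.Version}}' | sed 's/^v//')
#   version_ge "$helm_ver" 3.8.0 || echo "Helm $helm_ver is older than 3.8.0" >&2
```

Pre-release suffixes such as `-rc.1` are not handled specially by this sketch.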
nk — NATS NKey Tool
Required to generate NATS authentication keys.
# Download the latest release for your OS from:
# https://github.com/nats-io/nkeys/releases
# Example for Linux amd64:
curl -LO https://github.com/nats-io/nkeys/releases/latest/download/nk-linux-amd64.zip
unzip nk-linux-amd64.zip
chmod +x nk && sudo mv nk /usr/local/bin/
jq and openssl
Used in helper scripts.
# Ubuntu/Debian
sudo apt-get install -y jq openssl
# macOS
brew install jq openssl
1.2 Tool Version Reference
| Tool | Minimum Version |
|---|---|
| gcloud CLI | any current |
| kubectl | any current |
| Helm | 3.8+ |
| nk (NATS NKey) | any current |
| cert-manager | 1.12+ |
| Kubernetes (GKE) | 1.24+ |
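Before continuing, it can help to confirm that every local tool from section 1.1 is on the PATH. This is a small sketch; `missing_tools` is a hypothetical helper name, and it checks presence only, not versions. cert-manager and Kubernetes run in the cluster, so they are not checked here.

```shell
# Print the name of each listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage:
#   missing=$(missing_tools gcloud kubectl helm nk jq openssl)
#   [ -z "$missing" ] || echo "Install these first: $missing" >&2
```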
2. GCP Project Setup
2.1 Create or Select a GCP Project
# Create a new project
gcloud projects create YOUR_PROJECT_ID --name="OpenDSO"
# Or select an existing project
gcloud config set project YOUR_PROJECT_ID
export GCP_PROJECT=YOUR_PROJECT_ID
2.2 Enable Required APIs
gcloud services enable \
container.googleapis.com \
dns.googleapis.com \
compute.googleapis.com \
artifactregistry.googleapis.com \
--project=$GCP_PROJECT
2.3 Configure Billing
Ensure billing is enabled for the project. GKE clusters require an active billing account.
# List billing accounts
gcloud billing accounts list
# Link billing account to project
gcloud billing projects link $GCP_PROJECT \
--billing-account=YOUR_BILLING_ACCOUNT_ID
2.4 Configure Image Pull Access
OpenDSO images are served from GCP Artifact Registry. The GKE node service account must have read access.
# Get the GKE node service account email.
# Run this after cluster creation (section 3); $CLUSTER_NAME and $GCP_ZONE are set in section 3.1.
NODE_SA=$(gcloud container clusters describe $CLUSTER_NAME \
--zone=$GCP_ZONE --project=$GCP_PROJECT \
--format='value(nodeConfig.serviceAccount)')
# If the value is "default", the nodes use the Compute Engine default SA. Set NODE_SA to:
# <project-number>-compute@developer.gserviceaccount.com
# Grant Artifact Registry reader role
gcloud projects add-iam-policy-binding $GCP_PROJECT \
--member="serviceAccount:${NODE_SA}" \
--role="roles/artifactregistry.reader"
If OES provides the registry URL and access, they will supply the exact registry path and grant access for your project.
3. Create the GKE Cluster
3.1 Set Variables
export GCP_PROJECT=your-project-id
export GCP_ZONE=us-east1-b
export CLUSTER_NAME=opendso-cluster-1
export NAMESPACE=opendso
export RELEASE_NAME=opendso
export DOMAIN=opendso.example.com
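These variables are reused in every later section, so a new terminal session without them will produce confusing errors. The guard below is a sketch that reports any variable that is unset or empty; `require_vars` is a hypothetical helper name.

```shell
# Report every named variable that is unset or empty; fail if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   require_vars GCP_PROJECT GCP_ZONE CLUSTER_NAME NAMESPACE RELEASE_NAME DOMAIN || exit 1
```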
3.2 Create the Cluster
The following command creates a production-ready GKE cluster with autoscaling and all required addons:
gcloud container clusters create $CLUSTER_NAME \
--project=$GCP_PROJECT \
--zone=$GCP_ZONE \
--cluster-version=latest \
--machine-type=e2-standard-4 \
--num-nodes=3 \
--disk-size=50 \
--enable-autoscaling \
--min-nodes=1 \
--max-nodes=10 \
--enable-autorepair \
--enable-autoupgrade \
--enable-ip-alias \
--network=default \
--subnetwork=default \
--no-enable-basic-auth \
--no-issue-client-certificate \
--logging=SYSTEM,WORKLOAD \
--monitoring=SYSTEM \
--addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver
Node sizing guidance: Use --enable-autoscaling --min-nodes=1 --max-nodes=10 for production to handle variable load.
3.3 Authenticate kubectl
gcloud container clusters get-credentials $CLUSTER_NAME \
--zone=$GCP_ZONE \
--project=$GCP_PROJECT
# Verify
kubectl cluster-info
kubectl get nodes
3.4 Create the Application Namespace
kubectl create namespace $NAMESPACE
4. Install the nginx Ingress Controller
OpenDSO uses nginx as its ingress controller. All external traffic — HTTP, HTTPS, and NATS WebSocket — routes through it.
4.1 Install via Helm
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--create-namespace \
--set controller.service.type=LoadBalancer \
--set controller.service.externalTrafficPolicy=Local \
--wait --timeout 5m
4.2 Get the LoadBalancer IP
The LoadBalancer IP is needed for DNS configuration. Wait until it is assigned:
kubectl get svc -n ingress-nginx ingress-nginx-controller --watch
Once the EXTERNAL-IP column shows an IP (not <pending>):
export LB_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
-o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "LoadBalancer IP: $LB_IP"
Important: This IP is stable for the lifetime of the LoadBalancer service. Do not delete and recreate the service unless necessary.
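In scripts, polling is often more convenient than `--watch`. The helper below is a sketch that retries a command until it prints something; `wait_for_output` is a hypothetical name, and the attempt count and delay in the usage line are arbitrary.

```shell
# Retry a command until it prints non-empty output or the attempts run out.
# $1 = maximum attempts, $2 = seconds between attempts, rest = command to run.
wait_for_output() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out=$("$@" 2>/dev/null) || true
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (polls every 5 seconds for up to 5 minutes):
#   export LB_IP=$(wait_for_output 60 5 kubectl get svc -n ingress-nginx \
#     ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
```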
5. Configure DNS with Cloud DNS
OpenDSO uses a single base domain with multiple subdomains (one per service). All subdomains point to the nginx LoadBalancer IP.
5.1 Create a Cloud DNS Managed Zone
export DNS_ZONE_NAME=opendso-zone
gcloud dns managed-zones create $DNS_ZONE_NAME \
--dns-name="${DOMAIN}." \
--description="OpenDSO DNS zone" \
--project=$GCP_PROJECT
5.2 Retrieve the Assigned Nameservers
gcloud dns managed-zones describe $DNS_ZONE_NAME \
--project=$GCP_PROJECT \
--format='value(nameServers)'
You will see 4 nameservers such as:
ns-cloud-a1.googledomains.com.
ns-cloud-a2.googledomains.com.
ns-cloud-a3.googledomains.com.
ns-cloud-a4.googledomains.com.
5.3 Delegate the Domain at Your Registrar
In your domain registrar's control panel, add NS records pointing to the 4 Google nameservers. This step delegates the subdomain to Cloud DNS.
Allow 15–30 minutes for NS record propagation. cert-manager DNS-01 challenges will fail if NS records have not propagated yet.
5.4 Create DNS A Records
# Start a DNS transaction
gcloud dns record-sets transaction start \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT
# Apex record
gcloud dns record-sets transaction add $LB_IP \
--name="${DOMAIN}." \
--ttl=300 \
--type=A \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT
# Wildcard record (covers all subdomains)
gcloud dns record-sets transaction add $LB_IP \
--name="*.${DOMAIN}." \
--ttl=300 \
--type=A \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT
# Commit the transaction
gcloud dns record-sets transaction execute \
--zone=$DNS_ZONE_NAME \
--project=$GCP_PROJECT
5.5 Verify DNS Propagation
# Should return the LoadBalancer IP
dig +short $DOMAIN
dig +short keycloak.$DOMAIN
dig +short api.$DOMAIN
# Or using nslookup
nslookup $DOMAIN
Do not proceed to cert-manager setup until DNS resolves correctly. Let's Encrypt DNS-01 challenges require that your nameservers respond with the correct TXT records.
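The manual `dig` checks can be turned into a pass/fail gate. This is a sketch only: `dns_matches` and the `RESOLVE_CMD` override are hypothetical names, and it compares just the last line of the resolver output.

```shell
# Resolver command; override RESOLVE_CMD to use a different tool.
RESOLVE_CMD=${RESOLVE_CMD:-"dig +short"}

# Check that each host resolves to the expected IP.
# $1 = expected IP, remaining arguments = host names.
dns_matches() {
  expected=$1; shift
  rc=0
  for host in "$@"; do
    actual=$($RESOLVE_CMD "$host" | tail -n1)
    if [ "$actual" = "$expected" ]; then
      echo "ok   $host -> $actual"
    else
      echo "FAIL $host -> ${actual:-no answer} (expected $expected)"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   dns_matches "$LB_IP" "$DOMAIN" "keycloak.$DOMAIN" "api.$DOMAIN" || echo "DNS not ready" >&2
```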
6. Install cert-manager and Configure TLS
OpenDSO requires a wildcard TLS certificate (*.yourdomain.com) to serve all subdomains over HTTPS. cert-manager automates certificate issuance from Let's Encrypt using DNS-01 validation.
6.1 Install cert-manager
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--version v1.14.0 \
--set installCRDs=true \
--wait --timeout 5m
Verify all cert-manager pods are running:
kubectl get pods -n cert-manager
6.2 Apply the GKE cert-manager Fix (Required)
GKE restricts access to the kube-system namespace, which breaks cert-manager's default leader election configuration. Apply the fix before creating any issuers:
# Patch cert-manager to use its own namespace for leader election
kubectl patch deployment cert-manager \
-n cert-manager \
--type=json \
-p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--leader-election-namespace=cert-manager"}]'
# Create lease RBAC in cert-manager namespace
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cert-manager-leaderelection
  namespace: cert-manager
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get","list","watch","create","update","patch","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cert-manager-leaderelection
  namespace: cert-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cert-manager-leaderelection
subjects:
  - kind: ServiceAccount
    name: cert-manager
    namespace: cert-manager
EOF
# Restart cert-manager to pick up the change
kubectl rollout restart deployment/cert-manager -n cert-manager
kubectl rollout status deployment/cert-manager -n cert-manager
6.3 Create a Service Account for DNS-01 Challenges
cert-manager needs permission to create DNS TXT records in Cloud DNS to prove domain ownership.
export CERT_MANAGER_SA=cert-manager-dns
# Create the service account
gcloud iam service-accounts create $CERT_MANAGER_SA \
--display-name="cert-manager DNS-01 solver" \
--project=$GCP_PROJECT
# Grant DNS admin role
gcloud projects add-iam-policy-binding $GCP_PROJECT \
--member="serviceAccount:${CERT_MANAGER_SA}@${GCP_PROJECT}.iam.gserviceaccount.com" \
--role="roles/dns.admin"
# Create and download a JSON key
gcloud iam service-accounts keys create cert-manager-dns-key.json \
--iam-account="${CERT_MANAGER_SA}@${GCP_PROJECT}.iam.gserviceaccount.com"
# Store the key as a Kubernetes secret in the cert-manager namespace
kubectl create secret generic clouddns-dns01-solver-svc-acct \
--from-file=key.json=cert-manager-dns-key.json \
-n cert-manager
# Clean up the local key file
rm cert-manager-dns-key.json
6.4 Create the Let's Encrypt ClusterIssuer
export LETSENCRYPT_EMAIL=admin@example.com
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ${LETSENCRYPT_EMAIL}
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - dns01:
          cloudDNS:
            project: ${GCP_PROJECT}
            hostedZoneName: ${DNS_ZONE_NAME}
            serviceAccountSecretRef:
              name: clouddns-dns01-solver-svc-acct
              key: key.json
EOF
Tip: For testing, replace letsencrypt-prod with letsencrypt-staging and point the ACME server to https://acme-staging-v02.api.letsencrypt.org/directory. Staging has much higher rate limits. Let's Encrypt production is limited to 50 certificates per registered domain per week.
6.5 Request the Wildcard Certificate
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: opendso-tls
  namespace: ${NAMESPACE}
spec:
  secretName: ${RELEASE_NAME}-tls-secret
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "${DOMAIN}"
    - "*.${DOMAIN}"
EOF
Monitor until the certificate is READY=True (typically 2–10 minutes):
kubectl get certificate -n $NAMESPACE --watch
kubectl describe certificate opendso-tls -n $NAMESPACE
If it is taking longer than 15 minutes, check the challenge status:
kubectl get challenges,orders -n $NAMESPACE
kubectl describe challenge -n $NAMESPACE
7. Prepare Kubernetes Secrets
All secrets must exist in the cluster before running helm install. These are not managed by Helm so they persist across upgrades and uninstalls.
7.1 Extract TLS Credentials
Once the Certificate is READY, extract the certificate and key for use in derived secrets:
kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE \
-o jsonpath='{.data.tls\.crt}' | base64 -d > /tmp/tls.crt
kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE \
-o jsonpath='{.data.tls\.key}' | base64 -d > /tmp/tls.key
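Before deriving other secrets from the extracted certificate, it is worth confirming that it lists both the apex and the wildcard name. The filter below is a sketch; `san_lists` is a hypothetical helper, and the usage line assumes OpenSSL 1.1.1 or later for the `-ext` option.

```shell
# Read subjectAltName text on stdin and succeed if the DNS name in $1 is listed.
# Expects the comma-separated "DNS:name" format printed by openssl.
san_lists() {
  tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -qxF "DNS:$1"
}

# Usage:
#   openssl x509 -in /tmp/tls.crt -noout -ext subjectAltName | san_lists "*.${DOMAIN}" \
#     || echo "certificate does not cover *.${DOMAIN}" >&2
```

The match is an exact string comparison, so `*.example.com` must appear literally in the certificate.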
7.2 Create Service-Specific TLS Secrets
Several services reference TLS material via their own named secrets:
# Root CA (used by services for trust)
kubectl create secret generic root-ca \
--from-file=ca.pem=/tmp/tls.crt \
-n $NAMESPACE
# server-cert and server-key (legacy aliases used by Keycloak and others)
kubectl create secret generic server-cert \
--from-file=server-cert.pem=/tmp/tls.crt \
-n $NAMESPACE
kubectl create secret generic server-key \
--from-file=server-key.pem=/tmp/tls.key \
-n $NAMESPACE
# MongoDB TLS secret (combined PEM format)
cat /tmp/tls.crt /tmp/tls.key > /tmp/mongodb.pem
kubectl create secret generic ${RELEASE_NAME}-mongodb-tls \
--from-file=mongodb.pem=/tmp/mongodb.pem \
--from-file=ca.crt=/tmp/tls.crt \
-n $NAMESPACE
# Clean up temp files
rm /tmp/tls.crt /tmp/tls.key /tmp/mongodb.pem
7.3 Generate NATS Authentication Keys
NATS uses NKey-based authentication. Generate fresh keys — these must be generated once and stored; they cannot change without reconfiguring NATS.
# Generate seeds
ACCOUNT_SEED=$(nk -gen account)
USER_SEED=$(nk -gen user)
XKEY_SEED=$(nk -gen curve)
# Derive public keys
ACCOUNT_PUB=$(echo "$ACCOUNT_SEED" | nk -inkey /dev/stdin -pubout)
USER_PUB=$(echo "$USER_SEED" | nk -inkey /dev/stdin -pubout)
XKEY_PUB=$(echo "$XKEY_SEED" | nk -inkey /dev/stdin -pubout)
echo "ACCOUNT_PUB: $ACCOUNT_PUB"
echo "USER_PUB: $USER_PUB"
echo "XKEY_PUB: $XKEY_PUB"
# Store private seeds in Kubernetes
kubectl create secret generic ${RELEASE_NAME}-nats-auth-keys \
--from-literal=account.nk="$ACCOUNT_SEED" \
--from-literal=user.nk="$USER_SEED" \
--from-literal=xkey.xk="$XKEY_SEED" \
-n $NAMESPACE
# Save public keys to a values override file for Helm
cat > /tmp/values-nats-auth-generated.yaml <<EOF
nats:
  authCallout:
    enabled: true
    issuer: "${ACCOUNT_PUB}"
    authUser: "${USER_PUB}"
    xkey: "${XKEY_PUB}"
nats-auth-svc:
  natsKeysSecret: "${RELEASE_NAME}-nats-auth-keys"
EOF
Important: Save values-nats-auth-generated.yaml alongside your other values files. The public keys must be passed to helm install at every upgrade.
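NKey strings encode their type in the leading characters: seeds start with `S` followed by a type letter (`A` account, `U` user, `X` curve), and public keys start with the type letter alone. The sketch below uses that to catch a seed pasted where a public key belongs; `nkey_kind` is a hypothetical helper that inspects the prefix only and does not validate the key.

```shell
# Classify an NKey string by its prefix.
nkey_kind() {
  case "$1" in
    SA*) echo "account-seed" ;;
    SU*) echo "user-seed" ;;
    SX*) echo "curve-seed" ;;
    A*)  echo "account-public" ;;
    U*)  echo "user-public" ;;
    X*)  echo "curve-public" ;;
    *)   echo "unknown" ;;
  esac
}

# Usage:
#   [ "$(nkey_kind "$ACCOUNT_PUB")" = "account-public" ] || echo "ACCOUNT_PUB looks wrong" >&2
#   [ "$(nkey_kind "$USER_PUB")" = "user-public" ] || echo "USER_PUB looks wrong" >&2
#   [ "$(nkey_kind "$XKEY_PUB")" = "curve-public" ] || echo "XKEY_PUB looks wrong" >&2
```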
7.4 Create Grafana Credentials Secret
GRAFANA_ADMIN_PASSWORD=$(openssl rand -base64 16)
CITUS_PASSWORD=$(openssl rand -base64 16)
MONGODB_PASSWORD=$(openssl rand -base64 16)
OPENDSO_APPS_DB_PASSWORD=$(openssl rand -base64 16)
kubectl create secret generic ${RELEASE_NAME}-grafana-credentials \
--from-literal=admin-user="admin" \
--from-literal=admin-password="$GRAFANA_ADMIN_PASSWORD" \
--from-literal=citus-password="$CITUS_PASSWORD" \
--from-literal=mongodb-password="$MONGODB_PASSWORD" \
--from-literal=opendso-apps-db-password="$OPENDSO_APPS_DB_PASSWORD" \
-n $NAMESPACE
echo "Grafana admin password: $GRAFANA_ADMIN_PASSWORD"
echo "Save the above password in a secure location."
7.5 Create Database Secrets
# OpenDSO Apps DB (TimescaleDB)
APPS_DB_PASSWORD=$(openssl rand -base64 16)
kubectl create secret generic ${RELEASE_NAME}-opendso-apps-db-secret \
--from-literal=password="$APPS_DB_PASSWORD" \
-n $NAMESPACE
# Citus DB
CITUS_DB_PASSWORD=$(openssl rand -base64 16)
kubectl create secret generic ${RELEASE_NAME}-citus-db-secret \
--from-literal=password="$CITUS_DB_PASSWORD" \
-n $NAMESPACE
echo "Save these passwords securely before proceeding."
7.6 Create License Secret
Contact Open Energy Solutions for your license credentials, then create the secret:
kubectl create secret generic ${RELEASE_NAME}-opendso-license \
--from-literal=LICENSE_KEY="your-license-key" \
--from-literal=LICENSE_INSTALLATION_KEY="your-installation-key" \
--from-literal=LICENSE_ENVIRONMENT_NAME="$(kubectl get namespace kube-system -o jsonpath='{.metadata.uid}')" \
--from-literal=LICENSE_API_URL="https://license.oesinc.com" \
-n $NAMESPACE
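Because every secret must exist before the chart is installed, a quick existence check can save a failed rollout. This is a sketch; `missing_secrets` is a hypothetical helper, and the names in the usage lines are the ones created in sections 6.5 and 7.2 to 7.6.

```shell
# Print each named secret that does not exist in the given namespace.
# $1 = namespace, remaining arguments = secret names.
missing_secrets() {
  ns=$1; shift
  for name in "$@"; do
    kubectl get secret "$name" -n "$ns" >/dev/null 2>&1 || printf '%s\n' "$name"
  done
}

# Usage:
#   missing_secrets "$NAMESPACE" \
#     "${RELEASE_NAME}-tls-secret" "${RELEASE_NAME}-mongodb-tls" \
#     "${RELEASE_NAME}-nats-auth-keys" "${RELEASE_NAME}-grafana-credentials" \
#     "${RELEASE_NAME}-opendso-apps-db-secret" "${RELEASE_NAME}-citus-db-secret" \
#     "${RELEASE_NAME}-opendso-license" root-ca server-cert server-key
```

An empty result means all listed secrets were found. The check does not inspect their contents.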
8. Deploy with Helm
8.1 Add the Helm Repository
# If using the chart from a Helm repo (contact OES for the repo URL):
helm repo add opendso <OES_HELM_REPO_URL>
helm repo update
# Or if working from a local chart directory:
cd /path/to/opendso-helm-charts
8.2 Download Chart Dependencies
cd opendso   # run from the opendso-helm-charts directory
helm dependency update
cd ..
This downloads the Grafana subchart (grafana-10.5.14.tgz) into opendso/charts/.
8.3 Pre-create the Topology ConfigMap
The topology site configuration (cim.xml) is 367 KB — larger than the 262 KB Kubernetes annotation limit for client-side apply. It must be pre-created using server-side apply:
kubectl create configmap ${RELEASE_NAME}-topology-genesis-site-config \
--from-file=cim.xml=opendso/configs/ieee13/topology-genesis/cim.xml \
--from-file=cimex.config=opendso/configs/ieee13/topology-genesis/cimex.config \
--namespace $NAMESPACE \
--dry-run=client -o yaml \
| kubectl apply --server-side --field-manager=helm-deployer -f -
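The size reasoning above can be expressed as a small check. This sketch compares a byte count against two limits: 262144 bytes for an object's annotations (client-side apply stores the whole object in one) and 1 MiB for ConfigMap data. `apply_mode_for_size` is a hypothetical helper, and the result is approximate because the serialized object is somewhat larger than the file itself.

```shell
# Classify a payload size in bytes against two Kubernetes limits:
#   262144  = maximum total size of an object's annotations
#   1048576 = maximum size of a ConfigMap's data (1 MiB)
apply_mode_for_size() {
  if [ "$1" -lt 262144 ]; then
    echo "client-side"
  elif [ "$1" -le 1048576 ]; then
    echo "server-side"
  else
    echo "too-large"
  fi
}

# Usage (left unquoted so any padding from wc is dropped):
#   apply_mode_for_size $(wc -c < opendso/configs/ieee13/topology-genesis/cim.xml)
```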
8.4 Install the Chart
helm upgrade --install $RELEASE_NAME ./opendso \
--namespace $NAMESPACE \
--create-namespace \
-f opendso/values-gcp.yaml \
-f /tmp/values-nats-auth-generated.yaml \
--set global.domain="$DOMAIN" \
--set global.environment.apiUrl="https://api.${DOMAIN}" \
--set global.keycloak.url="https://keycloak.${DOMAIN}" \
--set global.keycloak.internalUrl="http://${RELEASE_NAME}-keycloak-svc:8080" \
--set keycloak.config.hostname="keycloak.${DOMAIN}" \
--set global.resourceProfile="production" \
--set global.tls.existingSecret="${RELEASE_NAME}-tls-secret" \
--set ingress.tls.secretName="${RELEASE_NAME}-tls-secret" \
--set nats.tls.secretName="${RELEASE_NAME}-tls-secret" \
--set mongodb.tls.existingSecret="${RELEASE_NAME}-mongodb-tls" \
--set grafana.admin.existingSecret="${RELEASE_NAME}-grafana-credentials" \
--set "grafana.envValueFrom.CITUS_PASSWORD.secretKeyRef.name=${RELEASE_NAME}-grafana-credentials" \
--set "grafana.envValueFrom.OPENDSO_APPS_DB_PASSWORD.secretKeyRef.name=${RELEASE_NAME}-grafana-credentials" \
--set "global.topology-genesis.externalConfigMap=true" \
--wait \
--timeout 15m
Resource profile options: minimal (dev/test), default, production.
If the cluster is slow or nodes are still provisioning, increase the --timeout to 20m or 30m.
8.5 Monitor the Rollout
In a separate terminal, watch pods coming up:
kubectl get pods -n $NAMESPACE --watch
Typical startup order: databases → Keycloak → NATS → API services → frontend apps. Expect 5–10 minutes for all pods to reach Running state.
9. Verify the Deployment
9.1 Check Pod Status
All pods should be 1/1 Running (or 2/2, 3/3 for multi-container pods):
kubectl get pods -n $NAMESPACE
Common init containers (mongodb-init, citus-init, opendso-apps-db-init) will be in Completed state — this is expected.
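With many pods, scanning the list by eye is error-prone. The filter below is a sketch that reads `kubectl get pods --no-headers` output and prints only pods that are neither Completed nor fully ready; `unhealthy_pods` is a hypothetical name, and it assumes the default column order (NAME, READY, STATUS).

```shell
# Print NAME, READY and STATUS for pods that are not healthy.
# Skips Completed pods; flags anything not Running or not fully ready.
unhealthy_pods() {
  awk '{
    if ($3 == "Completed") next
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
#   kubectl get pods -n "$NAMESPACE" --no-headers | unhealthy_pods
```

No output means every pod is either Running with all containers ready or Completed.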
9.2 Check Ingress
kubectl get ingress -n $NAMESPACE
All ingress rules should show the LoadBalancer IP in the ADDRESS column.
9.3 Test Key Endpoints
# Keycloak OIDC discovery
curl -k https://keycloak.${DOMAIN}/realms/oes/.well-known/openid-configuration | jq .issuer
# GMS API health (expects 401 — auth-protected, which means the API is up)
curl -o /dev/null -w "%{http_code}\n" https://api.${DOMAIN}/api/health
# NATS WebSocket upgrade (expects 101 Switching Protocols)
curl -o /dev/null -w "nats-ws: %{http_code}\n" \
--http1.1 \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "Sec-WebSocket-Key: SGVsbG8sIFdvcmxkIQ==" \
-H "Sec-WebSocket-Version: 13" \
https://nats.${DOMAIN}
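The status-code checks can be wrapped in a helper that reports pass or fail. This is a sketch; `expect_code` is a hypothetical name, and the 200 expected from the Keycloak discovery endpoint is an assumption, since the guide only states the 401 for the API.

```shell
# Request a URL and compare the HTTP status code with the expected one.
# $1 = URL, $2 = expected status code.
expect_code() {
  got=$(curl -s -o /dev/null -w '%{http_code}' "$1") || true
  if [ "$got" = "$2" ]; then
    echo "ok   $1 ($got)"
  else
    echo "FAIL $1 (got ${got:-none}, expected $2)"
    return 1
  fi
}

# Usage:
#   expect_code "https://api.${DOMAIN}/api/health" 401
#   expect_code "https://keycloak.${DOMAIN}/realms/oes/.well-known/openid-configuration" 200
```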
9.4 Access the UI
Open https://<domain> in a browser. You should be redirected to the Keycloak login page.
10. Service Endpoints
All services are accessed via HTTPS subdomains of the base domain.
| URL | Service | Notes |
|---|---|---|
| https://<domain> | Main UI (genesis-node-app) | Primary entry point |
| https://keycloak.<domain> | Keycloak (identity provider) | Login portal |
| https://api.<domain> | GMS REST API | Returns 401 without auth token |
| wss://nats.<domain> | NATS WebSocket | Used by browser clients |
| https://grafana.<domain> | Grafana dashboards | Use Grafana credentials secret |
| https://gis.<domain> | GIS map viewer | |
| https://oneline.<domain> | One-line diagram | |
| https://dataviewer.<domain> | Data viewer | |
| https://eventviewer.<domain> | Event viewer | |
| https://inventory.<domain> | Inventory manager | |
| https://historian.<domain> | Historian app | |
| https://device.<domain> | ESS manager | |
| https://esstesting.<domain> | ESS tester | |
| https://derdispatch.<domain> | DER dispatch | |
| https://scheduledispatch.<domain> | Schedule dispatch | |
| https://openfmb.<domain> | OpenFMB inspector | |
| https://openfmbeventcreator.<domain> | OpenFMB event creator | |
| https://docs.<domain> | Documentation | |
11. Troubleshooting
11.1 Useful Diagnostic Commands
# Pod status and recent events
kubectl get pods -n $NAMESPACE
kubectl describe pod <pod-name> -n $NAMESPACE
# Pod logs (last 100 lines)
kubectl logs <pod-name> -n $NAMESPACE --tail=100
# Follow logs in real time
kubectl logs -f <pod-name> -n $NAMESPACE
# Events across the namespace (sorted by time)
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp'
# Certificate status
kubectl get certificate,certificaterequest,order,challenge -n $NAMESPACE
# Ingress details
kubectl describe ingress -n $NAMESPACE
# Check a service is resolving inside the cluster
kubectl run -it --rm debug --image=busybox --restart=Never -- \
nslookup ${RELEASE_NAME}-gms-api.$NAMESPACE.svc.cluster.local
11.2 GMS API Not Starting
Symptom: gms-api pod shows Running but the /api/health endpoint is unreachable (connection refused on port 8000).
Root cause: The API silently fails to start if the MongoDB authsettings document with environmentId='default' does not exist. The HTTP server never binds.
Fix: Verify the mongodb-init job completed:
kubectl get jobs -n $NAMESPACE
kubectl logs job/${RELEASE_NAME}-mongodb-init -n $NAMESPACE
If the job failed or the document is missing, create it manually:
# Get the MongoDB root password
MONGO_PASSWORD=$(kubectl get secret ${RELEASE_NAME}-mongodb \
-n $NAMESPACE -o jsonpath='{.data.mongodb-root-password}' | base64 -d)
# Connect to MongoDB
kubectl exec -it ${RELEASE_NAME}-mongodb-0 -n $NAMESPACE -- mongosh \
-u root -p "$MONGO_PASSWORD" --authenticationDatabase admin
# Inside the shell (replace <release-name> with your release name; shell variables are not expanded inside mongosh):
use settings_api
db.authsettings.insertOne({
environmentId: 'default',
keycloakUrl: 'https://<release-name>-keycloak-svc:8443',
keycloakRealm: 'oes',
keycloakClientId: 'gms',
createdAt: new Date(),
updatedAt: new Date()
})
exit
Then restart the API:
kubectl rollout restart deployment/${RELEASE_NAME}-gms-api -n $NAMESPACE
11.3 NATS Pod CrashLoopBackOff
Symptom:
unable to load certificates
Fix: The TLS secret is missing or named incorrectly.
# Verify the secret exists
kubectl get secret ${RELEASE_NAME}-tls-secret -n $NAMESPACE
# If missing, recreate it (cert-manager should have created it)
kubectl get certificate opendso-tls -n $NAMESPACE
kubectl describe certificate opendso-tls -n $NAMESPACE
# After fixing the secret, restart NATS
kubectl rollout restart deployment/${RELEASE_NAME}-nats -n $NAMESPACE
11.4 cert-manager Stuck in kube-system Error
Symptom:
leases.coordination.k8s.io is forbidden: User 'system:serviceaccount:cert-manager:cert-manager'
cannot create resource 'leases' in the namespace 'kube-system'
Fix: Re-apply the GKE cert-manager leader election fix from section 6.2.
11.5 Certificate Stuck in Issuing State
Diagnosis steps:
# Check challenge status
kubectl get challenges -n $NAMESPACE
kubectl describe challenge -n $NAMESPACE | grep -A 20 "Status:"
# Check cert-manager controller logs
kubectl logs -n cert-manager deployment/cert-manager --tail=100
# Verify DNS resolves from outside the cluster
dig +short @8.8.8.8 $DOMAIN
Common causes:
- DNS not propagated: check that dig +short $DOMAIN returns the LoadBalancer IP from outside the cluster. Wait and retry.
- NS records not set at registrar: verify the registrar's NS records point to the Cloud DNS nameservers.
- cert-manager DNS-01 IAM issue: verify the service account has roles/dns.admin and the secret clouddns-dns01-solver-svc-acct exists in the cert-manager namespace.
- Rate limited by Let's Encrypt: check cert-manager logs for too many certificates errors. Use the staging issuer while testing.
11.6 MongoDB Permission Denied on /data/db
Symptom: MongoDB pod fails with:
chown: /data/db: Operation not permitted
Fix: The pod security context must set fsGroup: 999. This should be set by the chart. If it is missing, override in values:
mongodb:
  podSecurityContext:
    runAsUser: 999
    runAsGroup: 999
    fsGroup: 999
Then redeploy and delete the stuck pod to let it reschedule with the correct context.
11.7 Database Init Scripts Did Not Run
PostgreSQL (Citus DB, OpenDSO Apps DB) only runs init scripts when the data directory is empty (first initialization). If the PVC was pre-existing and the init scripts were skipped:
# Run init scripts manually on Citus DB
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-citus-db-0 -- \
psql -U citususer -f /docker-entrypoint-initdb.d/init.sql
# Run init scripts manually on OpenDSO Apps DB
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
psql -U essuser -d ess_tester -f /docker-entrypoint-initdb.d/10_ess_tester.sql
To force re-initialization (destructive — all data lost):
kubectl delete pvc data-${RELEASE_NAME}-opendso-apps-db-0 -n $NAMESPACE
kubectl delete pod ${RELEASE_NAME}-opendso-apps-db-0 -n $NAMESPACE
11.8 Topology-Genesis ConfigMap Too Large
Symptom:
The ConfigMap "...-topology-genesis-site-config" is invalid: metadata.annotations: Too long
Fix: The cim.xml file (367 KB) exceeds the client-side apply annotation limit of 262 KB. Pre-create the ConfigMap with server-side apply as described in section 8.3, and ensure global.topology-genesis.externalConfigMap: true is set in your values.
11.9 LoadBalancer IP Stuck in <pending>
On standard GKE this resolves within 1–2 minutes. If it stays pending:
# Check for quota issues or provisioning errors
kubectl describe svc ingress-nginx-controller -n ingress-nginx
# Check GCP quotas
gcloud compute project-info describe --project=$GCP_PROJECT | grep -A5 quota
11.10 Image Pull Errors (ImagePullBackOff)
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 10 "Failed"
Causes:
- GKE node service account lacks roles/artifactregistry.reader: see section 2.4.
- Wrong global.imageRegistry value in Helm values: verify the registry URL with OES.
11.11 Grafana Pods Pending
kubectl describe pod ${RELEASE_NAME}-grafana-xxx -n $NAMESPACE | grep -A5 "Events:"
If the cause is Insufficient memory or Insufficient cpu, the cluster nodes are resource-constrained. Either:
- Scale up the node pool: gcloud container clusters resize $CLUSTER_NAME --num-nodes=4 ...
- Use a smaller resource profile: --set global.resourceProfile=default
11.12 Keycloak Login Redirect Loop
Symptom: Logging in at https://keycloak.<domain> redirects in a loop or shows a blank page.
Check:
kubectl logs deployment/${RELEASE_NAME}-keycloak -n $NAMESPACE --tail=100 | grep -i error
Common causes:
- global.keycloak.url does not match the actual hostname: ensure it is https://keycloak.<domain> (no trailing slash).
- TLS certificate CN does not cover keycloak.<domain>: verify the wildcard cert covers *.<domain>.
- Keycloak realm was not imported: check for realm import in logs. If the database was initialized but the realm was not imported, delete the Keycloak PVC and redeploy.
12. Backup and Restore
12.1 Database Backups
Run these from your local machine after authenticating kubectl. The commands write to ./backups, so create that directory first (mkdir -p ./backups).
MongoDB
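# Get the MongoDB root password (mongodump cannot prompt for it without a TTY)
MONGO_PASSWORD=$(kubectl get secret ${RELEASE_NAME}-mongodb \
-n $NAMESPACE -o jsonpath='{.data.mongodb-root-password}' | base64 -d)
# Dump
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-mongodb-0 -- \
mongodump --username root \
--password "$MONGO_PASSWORD" \
--authenticationDatabase admin \
--out /tmp/mongodump
kubectl cp $NAMESPACE/${RELEASE_NAME}-mongodb-0:/tmp/mongodump ./backups/mongodump-$(date +%Y%m%d)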
Citus DB (Historian)
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-citus-db-0 -- \
pg_dump -U citususer ofmb_db > ./backups/citus-$(date +%Y%m%d).sql
OpenDSO Apps DB (ESS / Assets)
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
pg_dump -U essuser ess_tester > ./backups/ess-tester-$(date +%Y%m%d).sql
kubectl exec -n $NAMESPACE ${RELEASE_NAME}-opendso-apps-db-0 -- \
pg_dump -U essuser assets > ./backups/assets-$(date +%Y%m%d).sql
Redis
kubectl exec ${RELEASE_NAME}-ess-manager-redis-0 -n $NAMESPACE -- redis-cli SAVE
kubectl cp $NAMESPACE/${RELEASE_NAME}-ess-manager-redis-0:/data/dump.rdb \
./backups/redis-$(date +%Y%m%d).rdb
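Shell redirection creates the output file even when `kubectl exec` or the dump command fails, so an empty file can look like a finished backup. The check below is a sketch that lists zero-byte files; `empty_backups` is a hypothetical helper name.

```shell
# List zero-byte files under a backup directory.
# $1 = directory to scan.
empty_backups() {
  find "$1" -type f -size 0
}

# Usage:
#   if [ -n "$(empty_backups ./backups)" ]; then
#     echo "Some backup files are empty; re-run the failed dumps" >&2
#   fi
```

A non-empty file is not proof of a complete dump, so a periodic test restore is still advisable.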
12.2 Secrets Backup
These secrets are not managed by Helm and persist across helm uninstall. Back them up to a secure vault:
for secret in \
${RELEASE_NAME}-nats-auth-keys \
${RELEASE_NAME}-opendso-license \
${RELEASE_NAME}-tls-secret \
${RELEASE_NAME}-mongodb-tls \
${RELEASE_NAME}-grafana-credentials \
${RELEASE_NAME}-opendso-apps-db-secret \
${RELEASE_NAME}-citus-db-secret \
root-ca server-cert server-key; do
kubectl get secret $secret -n $NAMESPACE -o yaml \
> ./backups/secret-${secret}-$(date +%Y%m%d).yaml
echo "Backed up: $secret"
done
Warning: These YAML files contain base64-encoded secrets. Store them in an encrypted location (e.g., GCP Secret Manager, Vault).
12.3 Restore Order
When restoring to a new cluster:
- Create the GKE cluster and configure kubectl
- Install ingress-nginx and cert-manager (sections 4–6)
- Re-create the namespace and all secrets from backups
- helm upgrade --install with the same values
- Restore database contents after pods are running
- Verify the deployment (section 9)
13. Teardown
Remove the Helm Release
helm uninstall $RELEASE_NAME -n $NAMESPACE
Persistent Volume Claims are not deleted by helm uninstall. Delete them explicitly if you want to free storage:
kubectl delete pvc -n $NAMESPACE --all
Delete Secrets
kubectl delete secret -n $NAMESPACE \
${RELEASE_NAME}-tls-secret \
${RELEASE_NAME}-mongodb-tls \
${RELEASE_NAME}-nats-auth-keys \
${RELEASE_NAME}-grafana-credentials \
${RELEASE_NAME}-opendso-apps-db-secret \
${RELEASE_NAME}-citus-db-secret \
root-ca server-cert server-key
Delete the Namespace
kubectl delete namespace $NAMESPACE
Remove DNS Records
gcloud dns record-sets transaction start --zone=$DNS_ZONE_NAME --project=$GCP_PROJECT
gcloud dns record-sets transaction remove $LB_IP \
--name="${DOMAIN}." --ttl=300 --type=A \
--zone=$DNS_ZONE_NAME --project=$GCP_PROJECT
gcloud dns record-sets transaction remove $LB_IP \
--name="*.${DOMAIN}." --ttl=300 --type=A \
--zone=$DNS_ZONE_NAME --project=$GCP_PROJECT
gcloud dns record-sets transaction execute --zone=$DNS_ZONE_NAME --project=$GCP_PROJECT
gcloud dns managed-zones delete $DNS_ZONE_NAME --project=$GCP_PROJECT
Delete the GKE Cluster
gcloud container clusters delete $CLUSTER_NAME \
--zone=$GCP_ZONE \
--project=$GCP_PROJECT