Deployment Process

YeboLearn's deployment process ensures zero-downtime releases, rapid rollback capability, and comprehensive monitoring. We deploy multiple times daily to dev, weekly to staging, and bi-weekly to production.

Deployment Architecture

Infrastructure Overview

Google Cloud Platform
├── Cloud Run (Container Platform)
│   ├── Production Service (api.yebolearn.app)
│   ├── Staging Service (staging.yebolearn.app)
│   └── Dev Service (dev-api.yebolearn.app)
├── Cloud SQL (PostgreSQL 15)
│   ├── Production Database
│   ├── Staging Database
│   └── Dev Database
├── Artifact Registry (Docker Images)
├── Cloud Storage (Static Assets, Backups)
├── Cloud Load Balancer
└── Cloud Logging & Monitoring

Container Strategy

Docker Multi-Stage Build:

dockerfile
# Build stage: install all dependencies (dev deps are needed to run the build)
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage: runtime dependencies only
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist

ENV NODE_ENV=production
EXPOSE 8080

CMD ["node", "dist/server.js"]

Image Optimization:

  • Multi-stage builds (reduces size by 60%)
  • Alpine base image (smaller footprint)
  • Layer caching (faster builds)
  • Security scanning before deployment

Environment Strategy

Environment Configuration

| Environment | URL | Database | Purpose | Deploy Trigger |
|---|---|---|---|---|
| Development | dev-api.yebolearn.app | Dev DB (small instance) | Active development, testing | Auto on merge to dev |
| Staging | staging.yebolearn.app | Staging DB (production replica) | Pre-prod validation, QA | Weekly from dev |
| Production | api.yebolearn.app | Production DB (high availability) | Live users | Bi-weekly release |

Environment Variables

Managed via Google Secret Manager:

bash
# Development
NODE_ENV=development
DATABASE_URL=postgresql://dev_db_connection
GEMINI_API_KEY=dev_key_with_limits
MPESA_CONSUMER_KEY=test_key
LOG_LEVEL=debug

# Staging
NODE_ENV=staging
DATABASE_URL=postgresql://staging_db_connection
GEMINI_API_KEY=staging_key_production_like
MPESA_CONSUMER_KEY=sandbox_key
LOG_LEVEL=info

# Production
NODE_ENV=production
DATABASE_URL=postgresql://prod_db_connection
GEMINI_API_KEY=production_key
MPESA_CONSUMER_KEY=production_key
LOG_LEVEL=warn

Environment Isolation

Development:

  • Relaxed rate limits
  • Debug logging enabled
  • Test payment credentials
  • Mock external services (when needed)
  • Sample data in database

Staging:

  • Production-like configuration
  • Real integrations in test mode
  • Anonymized production data copy
  • Performance monitoring
  • QA and stakeholder access

Production:

  • Optimized for performance
  • Strict rate limits
  • Minimal logging (errors/warnings)
  • Real payment processing
  • High availability configuration

Deployment Workflows

Development Deployment

Trigger: Merge to dev branch

Process:

yaml
# .github/workflows/deploy-dev.yml
name: Deploy to Development
on:
  push:
    branches: [dev]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Tests
        run: npm test

      - name: Build Docker Image
        run: |
          docker build -t gcr.io/yebolearn/api:dev-${{ github.sha }} .
          docker tag gcr.io/yebolearn/api:dev-${{ github.sha }} gcr.io/yebolearn/api:dev-latest

      - name: Push to Artifact Registry
        run: |
          docker push gcr.io/yebolearn/api:dev-${{ github.sha }}
          docker push gcr.io/yebolearn/api:dev-latest

      - name: Deploy to Cloud Run
        run: |
          gcloud run deploy yebolearn-dev \
            --image gcr.io/yebolearn/api:dev-${{ github.sha }} \
            --platform managed \
            --region africa-south1 \
            --allow-unauthenticated

      - name: Run Smoke Tests
        run: npm run test:smoke -- --env=dev

      - name: Notify Team
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text": "Dev deployment successful: ${{ github.sha }}"}'

Timeline:

  • Tests: 3 minutes
  • Build: 2 minutes
  • Deploy: 1 minute
  • Smoke tests: 30 seconds
  • Total: ~7 minutes

Staging Deployment

Trigger: Manual or weekly schedule

Process:

bash
# Manual staging deployment
git checkout staging
git merge dev
git push origin staging

# CI/CD takes over
# 1. Run full test suite
# 2. Build and tag image
# 3. Deploy to staging environment
# 4. Run integration tests
# 5. Notify QA team

Pre-Deployment Checklist:

  • [ ] All dev tests passing
  • [ ] Features validated in dev
  • [ ] Database migrations prepared
  • [ ] QA team notified
  • [ ] Stakeholder demo scheduled

Timeline:

  • Tests: 5 minutes
  • Build: 2 minutes
  • Database migration: 1-5 minutes
  • Deploy: 2 minutes
  • Integration tests: 3 minutes
  • Total: ~15 minutes

Production Deployment

Trigger: Bi-weekly release (Thursday 10 AM)

Process:

1. Pre-Deployment (Tuesday-Wednesday)

bash
# Create release branch
git checkout -b release/v2.5.0 staging

# Final testing
npm run test:all
npm run test:e2e

# Generate changelog
npm run changelog

# Update version
npm version minor -m "Release v2.5.0: AI Essay Grading"

2. Deployment Day (Thursday 10 AM)

bash
# Backup production database
gcloud sql backups create \
  --instance=yebolearn-prod-db \
  --description="Pre-deployment backup v2.5.0"

# Tag release
git tag -a v2.5.0 -m "Release v2.5.0"
git push origin v2.5.0

# Merge to main
git checkout main
git merge release/v2.5.0
git push origin main

# GitHub Actions triggered automatically

3. Blue-Green Deployment

yaml
# Automatic via CI/CD
# Cloud Run splits traffic between revisions of a single service, so the
# new version is deployed as a zero-traffic revision of yebolearn-api
# tagged "green", then promoted.
steps:
  - name: Deploy New Revision (Green)
    run: |
      gcloud run deploy yebolearn-api \
        --image gcr.io/yebolearn/api:v2.5.0 \
        --no-traffic \
        --tag green

  - name: Health Check Green
    run: |
      curl https://green.yebolearn.app/health
      npm run test:smoke -- --env=green

  - name: Run Database Migrations
    run: |
      npm run migrate:prod

  - name: Switch Traffic to Green
    run: |
      gcloud run services update-traffic yebolearn-api \
        --to-tags green=100

  - name: Monitor for 10 Minutes
    run: |
      sleep 600
      # Check error rates, response times, etc.

  - name: Retire Blue (if successful)
    run: |
      # Remove the tag; the previous (blue) revision keeps zero traffic
      # and can be deleted once the release is confirmed stable.
      gcloud run services update-traffic yebolearn-api --to-latest

4. Post-Deployment

bash
# Monitor critical metrics
# - Error rate
# - Response time
# - Database performance
# - Payment success rate

# Verify key user flows
npm run test:smoke:critical

# Update status page
# Notify team of successful deployment

Timeline:

  • Backup: 5 minutes
  • Build & test: 8 minutes
  • Deploy green: 3 minutes
  • Migrations: 2-10 minutes
  • Traffic switch: 1 minute
  • Monitoring period: 10 minutes
  • Total: ~30 minutes

Database Migrations

Migration Strategy

Development:

bash
# Create migration
npm run migrate:create add_ai_essay_grading

# Apply migration
npm run migrate:dev

# Test rollback
npm run migrate:rollback:dev

Production:

bash
# Migrations run automatically during deployment
# But tested thoroughly in staging first

# Zero-downtime patterns:
# 1. Add new column (nullable)
# 2. Deploy code that writes to both old and new
# 3. Backfill data
# 4. Deploy code that reads from new
# 5. Remove old column (next release)
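Step 2 of the pattern above (write to both old and new) can be sketched in TypeScript. The `score` and `grading` column names here are purely hypothetical, chosen for illustration:

```typescript
// Hypothetical transition: an integer `score` column being replaced by a
// richer `grading` JSON column. While old and new releases are both live,
// the write path fills both so either code path reads consistent data.
type Submission = {
  score: number | null;              // old column (read by the previous release)
  grading: { score: number } | null; // new column (read after the follow-up deploy)
};

function recordScore(row: Submission, score: number): Submission {
  return { ...row, score, grading: { score } };
}
```

Once step 4 ships (readers use `grading` only), the dual write and the old column can both be dropped in the next release.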

Migration Best Practices:

  • Always reversible (down migration)
  • Test on staging first
  • Backup before running
  • Monitor performance impact
  • Use indexes for large tables
  • Avoid blocking operations in production

Example Migration

typescript
// migrations/20251122_add_essay_grading.ts
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.createTable('essay_submissions', (table) => {
    table.uuid('id').primary().defaultTo(knex.raw('gen_random_uuid()'));
    table.uuid('student_id').notNullable().references('id').inTable('students');
    table.text('content').notNullable();
    table.jsonb('ai_feedback').nullable();
    table.integer('score').nullable();
    table.timestamps(true, true);

    table.index('student_id');
    table.index('created_at');
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('essay_submissions');
}

Rollback Procedures

Automatic Rollback

Triggers:

  • Error rate >5% for 2 minutes
  • Response time >2s (p95) for 5 minutes
  • Health check failures
  • Critical API endpoints down

Process:

yaml
# Automatic rollback in CI/CD
- name: Monitor Deployment
  run: |
    # Check error rate every 30 seconds for 10 minutes
    # (the metrics endpoint is assumed to return an integer percentage)
    for i in {1..20}; do
      error_rate=$(curl -s https://api.yebolearn.app/metrics/errors)
      if [ "$error_rate" -gt 5 ]; then
        echo "Error rate too high, rolling back"
        gcloud run services update-traffic yebolearn-api \
          --to-revisions=yebolearn-api-blue=100
        exit 1
      fi
      sleep 30
    done

Manual Rollback

Quick Rollback (Revert Traffic):

bash
# List recent revisions
gcloud run revisions list --service=yebolearn-api

# Switch traffic back to previous version
gcloud run services update-traffic yebolearn-api \
  --to-revisions=yebolearn-api-v2.4.9=100

# Verify rollback
curl https://api.yebolearn.app/health
npm run test:smoke:critical

Timeline: 2-3 minutes

Database Rollback (If Needed):

bash
# Only if migration is problematic
# Use with extreme caution

# Restore from backup (this overwrites the target instance)
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db

# Or run down migration
npm run migrate:rollback:prod

# Redeploy previous version

Timeline: 10-30 minutes

Rollback Decision Tree

Is production broken?
├─ Yes: Critical issue (payments, data loss, security)
│   └─> Immediate rollback (2 minutes)
├─ Partial: Some users affected, workaround exists
│   └─> Evaluate fix time vs rollback
│       ├─ Fix <30 min → Hotfix
│       └─ Fix >30 min → Rollback
└─ No: Minor issue, low impact
    └─> Schedule fix for next release
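The tree above can be encoded as a small helper for runbooks or automation. This is a sketch: the 30-minute cut-off comes from the tree, the type names are illustrative:

```typescript
type Impact = 'critical' | 'partial' | 'minor';
type Action = 'rollback' | 'hotfix' | 'next-release';

// Mirrors the decision tree: critical issues roll back immediately,
// partial outages get a hotfix only if it fits inside 30 minutes,
// minor issues wait for the next release.
function decideResponse(impact: Impact, estFixMinutes: number): Action {
  if (impact === 'critical') return 'rollback';
  if (impact === 'partial') return estFixMinutes < 30 ? 'hotfix' : 'rollback';
  return 'next-release';
}
```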

Monitoring and Alerts

Health Checks

Endpoint: /health

typescript
export async function healthCheck(): Promise<HealthStatus> {
  const checks = await Promise.all([
    checkDatabase(),
    checkRedisCache(),
    checkGeminiAPI(),
    checkPaymentGateway(),
  ]);

  const healthy = checks.every(c => c.status === 'healthy');

  return {
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date(),
    checks: {
      database: checks[0],
      cache: checks[1],
      gemini: checks[2],
      payments: checks[3],
    },
    version: process.env.APP_VERSION,
  };
}

Response:

json
{
  "status": "healthy",
  "timestamp": "2025-11-22T10:30:00Z",
  "checks": {
    "database": { "status": "healthy", "latency": "12ms" },
    "cache": { "status": "healthy", "latency": "2ms" },
    "gemini": { "status": "healthy", "latency": "145ms" },
    "payments": { "status": "healthy", "latency": "234ms" }
  },
  "version": "v2.5.0"
}
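A possible refinement is to distinguish hard dependencies from optional ones when aggregating check results. The split below (database is a hard dependency, everything else only degrades the service) is an assumption for illustration, not the current behaviour:

```typescript
type CheckResult = {
  name: 'database' | 'cache' | 'gemini' | 'payments';
  status: 'healthy' | 'unhealthy';
  latencyMs: number;
};

// Database failure takes the service down; a failing cache, AI,
// or payment probe only degrades it.
function overallStatus(checks: CheckResult[]): 'healthy' | 'degraded' | 'down' {
  if (checks.some(c => c.name === 'database' && c.status === 'unhealthy')) {
    return 'down';
  }
  return checks.every(c => c.status === 'healthy') ? 'healthy' : 'degraded';
}
```

This lets load balancers treat `degraded` as still routable while paging on `down`.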

Metrics Tracking

Key Metrics:

Application Performance:

  • Request rate (requests/second)
  • Response time (p50, p95, p99)
  • Error rate (%)
  • Active users (concurrent)

Business Metrics:

  • Quiz completions/hour
  • AI features usage
  • Payment success rate
  • Course enrollments

Infrastructure:

  • CPU utilization (%)
  • Memory usage (%)
  • Database connections
  • Container restarts
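The p50/p95/p99 figures are percentiles over a window of response-time samples; a minimal nearest-rank sketch:

```typescript
// Nearest-rank percentile: percentile(samples, 95) returns the value at
// the 95th-percentile rank of the sorted samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```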

Monitoring Stack:

yaml
Metrics Collection: Prometheus
Visualization: Grafana
Logging: Google Cloud Logging
Tracing: Google Cloud Trace
Error Tracking: Sentry
Uptime Monitoring: UptimeRobot
Alerting: PagerDuty

Alert Configuration

Critical Alerts (Page On-Call):

yaml
- name: API Down
  condition: health_check_success_rate < 99% for 2 minutes
  severity: critical
  notify: pagerduty

- name: High Error Rate
  condition: error_rate > 5% for 3 minutes
  severity: critical
  notify: pagerduty

- name: Payment Failures
  condition: payment_failure_rate > 10% for 5 minutes
  severity: critical
  notify: pagerduty

- name: Database Connection Pool Exhausted
  condition: db_connections > 90% for 2 minutes
  severity: critical
  notify: pagerduty

Warning Alerts (Slack):

yaml
- name: Elevated Response Time
  condition: p95_response_time > 1s for 10 minutes
  severity: warning
  notify: slack

- name: Increased Error Rate
  condition: error_rate > 2% for 10 minutes
  severity: warning
  notify: slack

- name: High Memory Usage
  condition: memory_usage > 80% for 15 minutes
  severity: warning
  notify: slack
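Conditions like `error_rate > 2% for 10 minutes` mean the threshold must be breached continuously across the window, not just once. A sketch of that evaluation (field names are illustrative):

```typescript
type Sample = { atSec: number; value: number };

// Fires only when every sample inside the trailing window breaches the
// threshold; a single healthy scrape resets the alert.
function isFiring(
  samples: Sample[],
  threshold: number,
  windowSec: number,
  nowSec: number,
): boolean {
  const windowed = samples.filter(s => nowSec - s.atSec <= windowSec);
  return windowed.length > 0 && windowed.every(s => s.value > threshold);
}
```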

Logging Strategy

Log Levels:

typescript
// Production: WARN and ERROR only
logger.error('Payment processing failed', {
  userId,
  transactionId,
  error: err.message
});

logger.warn('Gemini API rate limit approaching', {
  currentUsage: 850,
  limit: 1000
});

// Development/Staging: Include INFO and DEBUG
logger.info('Quiz generated successfully', {
  quizId,
  questionCount,
  generationTime
});

logger.debug('Database query executed', {
  query,
  duration,
  rowCount
});

Structured Logging:

typescript
import { logger } from './logger';

// Good: Structured with context
logger.error('Payment failed', {
  event: 'payment_failure',
  userId: 'user-123',
  amount: 500,
  provider: 'mpesa',
  errorCode: 'TIMEOUT',
  transactionId: 'txn-456',
  timestamp: new Date(),
});

// Bad: Unstructured string
logger.error('Payment failed for user-123 amount 500');

Performance Optimization

CDN and Caching

Static Assets:

  • Served from Google Cloud CDN
  • Cache-Control headers configured
  • Versioned filenames for cache busting
  • Compressed (gzip/brotli)
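Cache-busting filenames can be derived from a content hash, so an asset's URL changes exactly when its bytes do. A sketch using Node's crypto module:

```typescript
import { createHash } from 'node:crypto';

// app.js -> app.<8-char sha256 prefix>.js: unchanged contents keep the
// same URL (and stay cached), while any edit produces a new URL.
function versionedName(filename: string, contents: string): string {
  const hash = createHash('sha256').update(contents).digest('hex').slice(0, 8);
  const dot = filename.lastIndexOf('.');
  return dot === -1
    ? `${filename}.${hash}`
    : `${filename.slice(0, dot)}.${hash}${filename.slice(dot)}`;
}
```

Hashed assets can then be served with `Cache-Control: max-age=31536000, immutable`.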

API Caching:

typescript
// Redis for frequently accessed data
import { redis } from './cache';

export async function getQuiz(quizId: string) {
  // Check cache first
  const cached = await redis.get(`quiz:${quizId}`);
  if (cached) return JSON.parse(cached);

  // Fetch from database
  const quiz = await db.quiz.findUnique({ where: { id: quizId } });

  // Cache for 1 hour
  await redis.set(`quiz:${quizId}`, JSON.stringify(quiz), 'EX', 3600);

  return quiz;
}

Database Optimization

Connection Pooling:

typescript
// Prisma configuration
// Prisma reads the pool size from the connection string rather than the
// client constructor, e.g.:
//   DATABASE_URL=postgresql://...?connection_limit=10
// Kept conservative for Cloud Run, where many small instances share the DB.
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
});

Query Optimization:

  • Indexes on frequently queried columns
  • Avoid N+1 queries (use includes/joins)
  • Pagination for large result sets
  • Database query monitoring
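For pagination, keyset (cursor) queries scale better than OFFSET on large tables because the database seeks directly past the cursor instead of scanning skipped rows. The in-memory model below shows the shape of the query, which in SQL is roughly `WHERE id > $cursor ORDER BY id LIMIT $n`:

```typescript
type Row = { id: number };

// Return up to `limit` rows strictly after the cursor, in id order.
// With OFFSET, page N costs O(N * pageSize); with a cursor and an index
// on id it stays roughly O(pageSize + log n).
function pageAfter(rows: Row[], cursor: number | null, limit: number): Row[] {
  return [...rows]
    .sort((a, b) => a.id - b.id)
    .filter(r => cursor === null || r.id > cursor)
    .slice(0, limit);
}
```

The last `id` of each page becomes the cursor for the next request.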

Container Optimization

Resource Limits:

yaml
# Cloud Run service configuration (Knative-style YAML)
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"    # Always one warm instance
        autoscaling.knative.dev/maxScale: "100"  # Scale up to handle load
    spec:
      containers:
        - resources:
            limits:
              cpu: "2"     # Per-instance CPU limit
              memory: 1Gi  # Per-instance memory limit

# Note: Cloud Run scales on request concurrency and CPU utilization;
# Kubernetes-style requests/targetCPU fields do not apply here.

Deployment Checklist

Pre-Deployment

  • [ ] All tests passing (unit, integration, E2E)
  • [ ] Code reviewed and approved
  • [ ] Database migrations tested in staging
  • [ ] Feature flags configured
  • [ ] Monitoring dashboards prepared
  • [ ] Rollback plan documented
  • [ ] On-call engineer identified
  • [ ] Stakeholders notified

During Deployment

  • [ ] Backup database
  • [ ] Deploy to green environment
  • [ ] Run health checks
  • [ ] Execute migrations
  • [ ] Switch traffic gradually
  • [ ] Monitor error rates
  • [ ] Verify critical flows
  • [ ] Check business metrics

Post-Deployment

  • [ ] Monitor for 30 minutes
  • [ ] Run smoke tests
  • [ ] Check logs for errors
  • [ ] Verify integrations working
  • [ ] Update status page
  • [ ] Document any issues
  • [ ] Notify team of completion
  • [ ] Schedule retrospective (if issues)

Disaster Recovery

Backup Strategy

Database Backups:

  • Automated daily backups (retained 30 days)
  • Pre-deployment backups (retained 7 days)
  • Weekly full backups (retained 90 days)
  • Point-in-time recovery (7 days)
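The retention policy above can be expressed as a lookup, handy for backup-pruning scripts (a sketch of the stated policy, nothing more):

```typescript
type BackupKind = 'daily' | 'pre-deployment' | 'weekly-full';

// Retention windows in days, matching the stated policy.
const retentionDays: Record<BackupKind, number> = {
  daily: 30,
  'pre-deployment': 7,
  'weekly-full': 90,
};

// True while a backup is still inside its retention window.
function isRetained(kind: BackupKind, ageDays: number): boolean {
  return ageDays <= retentionDays[kind];
}
```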

Restore Process:

bash
# List available backups
gcloud sql backups list --instance=yebolearn-prod-db

# Restore from backup (this overwrites the target instance)
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db

# Verify data integrity
npm run db:verify

Application State:

  • Docker images retained indefinitely
  • Git tags for all releases
  • Configuration in version control
  • Secrets in Secret Manager (versioned)

Incident Response

Severity Levels:

P0 (Critical): Complete service outage, data loss risk

  • Response time: Immediate
  • Escalation: Page on-call + management
  • Communication: Status page + email users

P1 (High): Major feature broken, payment issues

  • Response time: 15 minutes
  • Escalation: On-call engineer
  • Communication: Status page update

P2 (Medium): Minor feature degraded

  • Response time: 2 hours
  • Escalation: Team Slack
  • Communication: Internal only

P3 (Low): Cosmetic issues, minor bugs

  • Response time: Next business day
  • Escalation: Linear ticket
  • Communication: None required
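The severity matrix maps directly to response-time targets, which makes breaches checkable in tooling. In the sketch below, P0's "immediate" is modelled as 0 minutes and P3's "next business day" as 24 hours; both are simplifying assumptions:

```typescript
type Severity = 'P0' | 'P1' | 'P2' | 'P3';

// Response-time targets from the matrix above, in minutes.
const respondWithinMin: Record<Severity, number> = {
  P0: 0,        // immediate
  P1: 15,
  P2: 120,
  P3: 24 * 60,  // next business day, approximated as 24h
};

// True when an unacknowledged incident has outlived its target.
function isBreaching(severity: Severity, minutesSinceReport: number): boolean {
  return minutesSinceReport > respondWithinMin[severity];
}
```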

YeboLearn - Empowering African Education