Deployment Process

YeboLearn's deployment process ensures zero-downtime releases, rapid rollback capability, and comprehensive monitoring. We deploy multiple times daily to dev, weekly to staging, and bi-weekly to production.

Deployment Architecture

Infrastructure Overview

Google Cloud Platform
├── Cloud Run (Container Platform)
│   ├── Production Service (api.yebolearn.app)
│   ├── Staging Service (staging.yebolearn.app)
│   └── Dev Service (dev-api.yebolearn.app)
├── Cloud SQL (PostgreSQL 15)
│   ├── Production Database
│   ├── Staging Database
│   └── Dev Database
├── Artifact Registry (Docker Images)
├── Cloud Storage (Static Assets, Backups)
├── Cloud Load Balancer
└── Cloud Logging & Monitoring

Container Strategy

Docker Multi-Stage Build:

dockerfile
# Build stage: install all dependencies (dev deps are needed to run the build)
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage: runtime dependencies only
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist

ENV NODE_ENV=production
EXPOSE 8080

CMD ["node", "dist/server.js"]

Image Optimization:

  • Multi-stage builds (reduces size by 60%)
  • Alpine base image (smaller footprint)
  • Layer caching (faster builds)
  • Security scanning before deployment

Environment Strategy

Environment Configuration

| Environment | URL | Database | Purpose | Deploy Trigger |
|---|---|---|---|---|
| Development | dev-api.yebolearn.app | Dev DB (small instance) | Active development, testing | Auto on merge to dev |
| Staging | staging.yebolearn.app | Staging DB (production replica) | Pre-prod validation, QA | Weekly from dev |
| Production | api.yebolearn.app | Production DB (high availability) | Live users | Bi-weekly release |

Environment Variables

Managed via Google Secret Manager:

bash
# Development
NODE_ENV=development
DATABASE_URL=postgresql://dev_db_connection
GEMINI_API_KEY=dev_key_with_limits
MPESA_CONSUMER_KEY=test_key
LOG_LEVEL=debug

# Staging
NODE_ENV=staging
DATABASE_URL=postgresql://staging_db_connection
GEMINI_API_KEY=staging_key_production_like
MPESA_CONSUMER_KEY=sandbox_key
LOG_LEVEL=info

# Production
NODE_ENV=production
DATABASE_URL=postgresql://prod_db_connection
GEMINI_API_KEY=production_key
MPESA_CONSUMER_KEY=production_key
LOG_LEVEL=warn

Environment Isolation

Development:

  • Relaxed rate limits
  • Debug logging enabled
  • Test payment credentials
  • Mock external services (when needed)
  • Sample data in database

Staging:

  • Production-like configuration
  • Real integrations in test mode
  • Anonymized production data copy
  • Performance monitoring
  • QA and stakeholder access

Production:

  • Optimized for performance
  • Strict rate limits
  • Minimal logging (errors/warnings)
  • Real payment processing
  • High availability configuration

Deployment Workflows

Development Deployment

Trigger: Merge to dev branch

Process:

yaml
# .github/workflows/deploy-dev.yml
name: Deploy to Development
on:
  push:
    branches: [dev]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Tests
        run: npm test

      - name: Build Docker Image
        run: |
          docker build -t gcr.io/yebolearn/api:dev-${{ github.sha }} .
          docker tag gcr.io/yebolearn/api:dev-${{ github.sha }} gcr.io/yebolearn/api:dev-latest

      - name: Push to Artifact Registry
        run: |
          docker push gcr.io/yebolearn/api:dev-${{ github.sha }}
          docker push gcr.io/yebolearn/api:dev-latest

      - name: Deploy to Cloud Run
        run: |
          gcloud run deploy yebolearn-dev \
            --image gcr.io/yebolearn/api:dev-${{ github.sha }} \
            --platform managed \
            --region africa-south1 \
            --allow-unauthenticated

      - name: Run Smoke Tests
        run: npm run test:smoke -- --env=dev

      - name: Notify Team
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -H 'Content-Type: application/json' \
            -d '{"text": "Dev deployment successful: ${{ github.sha }}"}'

Timeline:

  • Tests: 3 minutes
  • Build: 2 minutes
  • Deploy: 1 minute
  • Smoke tests: 30 seconds
  • Total: ~7 minutes

Staging Deployment

Trigger: Manual or weekly schedule

Process:

bash
# Manual staging deployment
git checkout staging
git merge dev
git push origin staging

# CI/CD takes over
# 1. Run full test suite
# 2. Build and tag image
# 3. Deploy to staging environment
# 4. Run integration tests
# 5. Notify QA team

Pre-Deployment Checklist:

  • [ ] All dev tests passing
  • [ ] Features validated in dev
  • [ ] Database migrations prepared
  • [ ] QA team notified
  • [ ] Stakeholder demo scheduled

Timeline:

  • Tests: 5 minutes
  • Build: 2 minutes
  • Database migration: 1-5 minutes
  • Deploy: 2 minutes
  • Integration tests: 3 minutes
  • Total: ~15 minutes

Production Deployment

Trigger: Bi-weekly release (Thursday 10 AM)

Process:

1. Pre-Deployment (Tuesday-Wednesday)

bash
# Create release branch
git checkout -b release/v2.5.0 staging

# Final testing
npm run test:all
npm run test:e2e

# Generate changelog
npm run changelog

# Update version
npm version minor -m "Release v2.5.0: AI Essay Grading"

2. Deployment Day (Thursday 10 AM)

bash
# Backup production database
gcloud sql backups create \
  --instance=yebolearn-prod-db \
  --description="Pre-deployment backup v2.5.0"

# Tag release
git tag -a v2.5.0 -m "Release v2.5.0"
git push origin v2.5.0

# Merge to main
git checkout main
git merge release/v2.5.0
git push origin main

# GitHub Actions triggered automatically

3. Blue-Green Deployment

yaml
# Automatic via CI/CD
# Cloud Run splits traffic between revisions of a single service, so the
# new version is deployed as a zero-traffic revision of yebolearn-api
# tagged "green", then promoted.
steps:
  - name: Deploy New Revision (Green)
    run: |
      gcloud run deploy yebolearn-api \
        --image gcr.io/yebolearn/api:v2.5.0 \
        --no-traffic \
        --tag green

  - name: Health Check Green
    run: |
      curl https://green.yebolearn.app/health
      npm run test:smoke -- --env=green

  - name: Run Database Migrations
    run: |
      npm run migrate:prod

  - name: Switch Traffic to Green
    run: |
      gcloud run services update-traffic yebolearn-api \
        --to-tags green=100

  - name: Monitor for 10 Minutes
    run: |
      sleep 600
      # Check error rates, response times, etc.

  - name: Retire Blue (if successful)
    run: |
      # Remove the tag; the previous (blue) revision keeps zero traffic
      # and can be deleted once the release is confirmed stable.
      gcloud run services update-traffic yebolearn-api --to-latest

4. Post-Deployment

bash
# Monitor critical metrics
# - Error rate
# - Response time
# - Database performance
# - Payment success rate

# Verify key user flows
npm run test:smoke:critical

# Update status page
# Notify team of successful deployment

Timeline:

  • Backup: 5 minutes
  • Build & test: 8 minutes
  • Deploy green: 3 minutes
  • Migrations: 2-10 minutes
  • Traffic switch: 1 minute
  • Monitoring period: 10 minutes
  • Total: ~30 minutes

Database Migrations

Migration Strategy

Development:

bash
# Create migration
npm run migrate:create add_ai_essay_grading

# Apply migration
npm run migrate:dev

# Test rollback
npm run migrate:rollback:dev

Production:

bash
# Migrations run automatically during deployment
# But tested thoroughly in staging first

# Zero-downtime patterns:
# 1. Add new column (nullable)
# 2. Deploy code that writes to both old and new
# 3. Backfill data
# 4. Deploy code that reads from new
# 5. Remove old column (next release)
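Step 2 of the pattern above (write to both old and new) can be sketched in TypeScript. The `score` and `grading` column names here are purely hypothetical, chosen for illustration:

```typescript
// Hypothetical transition: an integer `score` column being replaced by a
// richer `grading` JSON column. While old and new releases are both live,
// the write path fills both so either code path reads consistent data.
type Submission = {
  score: number | null;              // old column (read by the previous release)
  grading: { score: number } | null; // new column (read after the follow-up deploy)
};

function recordScore(row: Submission, score: number): Submission {
  return { ...row, score, grading: { score } };
}
```

Once step 4 ships (readers use `grading` only), the dual write and the old column can both be dropped in the next release.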

Migration Best Practices:

  • Always reversible (down migration)
  • Test on staging first
  • Backup before running
  • Monitor performance impact
  • Use indexes for large tables
  • Avoid blocking operations in production

Example Migration

typescript
// migrations/20251122_add_essay_grading.ts
import { Knex } from 'knex';

export async function up(knex: Knex): Promise<void> {
  await knex.schema.createTable('essay_submissions', (table) => {
    table.uuid('id').primary().defaultTo(knex.raw('gen_random_uuid()'));
    table.uuid('student_id').notNullable().references('id').inTable('students');
    table.text('content').notNullable();
    table.jsonb('ai_feedback').nullable();
    table.integer('score').nullable();
    table.timestamps(true, true);

    table.index('student_id');
    table.index('created_at');
  });
}

export async function down(knex: Knex): Promise<void> {
  await knex.schema.dropTable('essay_submissions');
}

Rollback Procedures

Automatic Rollback

Triggers:

  • Error rate >5% for 2 minutes
  • Response time >2s (p95) for 5 minutes
  • Health check failures
  • Critical API endpoints down

Process:

yaml
# Automatic rollback in CI/CD
- name: Monitor Deployment
  run: |
    # Check error rate every 30 seconds for 10 minutes
    # (the metrics endpoint is assumed to return an integer percentage)
    for i in {1..20}; do
      error_rate=$(curl -s https://api.yebolearn.app/metrics/errors)
      if [ "$error_rate" -gt 5 ]; then
        echo "Error rate too high, rolling back"
        gcloud run services update-traffic yebolearn-api \
          --to-revisions=yebolearn-api-blue=100
        exit 1
      fi
      sleep 30
    done

Manual Rollback

Quick Rollback (Revert Traffic):

bash
# List recent revisions
gcloud run revisions list --service=yebolearn-api

# Switch traffic back to previous version
gcloud run services update-traffic yebolearn-api \
  --to-revisions=yebolearn-api-v2.4.9=100

# Verify rollback
curl https://api.yebolearn.app/health
npm run test:smoke:critical

Timeline: 2-3 minutes

Database Rollback (If Needed):

bash
# Only if migration is problematic
# Use with extreme caution

# Restore from backup (this overwrites the target instance)
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db

# Or run down migration
npm run migrate:rollback:prod

# Redeploy previous version

Timeline: 10-30 minutes

Rollback Decision Tree

Is production broken?
├─ Yes: Critical issue (payments, data loss, security)
│   └─> Immediate rollback (2 minutes)
├─ Partial: Some users affected, workaround exists
│   └─> Evaluate fix time vs rollback
│       ├─ Fix <30 min → Hotfix
│       └─ Fix >30 min → Rollback
└─ No: Minor issue, low impact
    └─> Schedule fix for next release
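The tree above can be encoded as a small helper for runbooks or automation. This is a sketch: the 30-minute cut-off comes from the tree, the type names are illustrative:

```typescript
type Impact = 'critical' | 'partial' | 'minor';
type Action = 'rollback' | 'hotfix' | 'next-release';

// Mirrors the decision tree: critical issues roll back immediately,
// partial outages get a hotfix only if it fits inside 30 minutes,
// minor issues wait for the next release.
function decideResponse(impact: Impact, estFixMinutes: number): Action {
  if (impact === 'critical') return 'rollback';
  if (impact === 'partial') return estFixMinutes < 30 ? 'hotfix' : 'rollback';
  return 'next-release';
}
```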

Monitoring and Alerts

Health Checks

Endpoint: /health

typescript
export async function healthCheck(): Promise<HealthStatus> {
  const checks = await Promise.all([
    checkDatabase(),
    checkRedisCache(),
    checkGeminiAPI(),
    checkPaymentGateway(),
  ]);

  const healthy = checks.every(c => c.status === 'healthy');

  return {
    status: healthy ? 'healthy' : 'degraded',
    timestamp: new Date(),
    checks: {
      database: checks[0],
      cache: checks[1],
      gemini: checks[2],
      payments: checks[3],
    },
    version: process.env.APP_VERSION,
  };
}

Response:

json
{
  "status": "healthy",
  "timestamp": "2025-11-22T10:30:00Z",
  "checks": {
    "database": { "status": "healthy", "latency": "12ms" },
    "cache": { "status": "healthy", "latency": "2ms" },
    "gemini": { "status": "healthy", "latency": "145ms" },
    "payments": { "status": "healthy", "latency": "234ms" }
  },
  "version": "v2.5.0"
}
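A possible refinement is to distinguish hard dependencies from optional ones when aggregating check results. The split below (database is a hard dependency, everything else only degrades the service) is an assumption for illustration, not the current behaviour:

```typescript
type CheckResult = {
  name: 'database' | 'cache' | 'gemini' | 'payments';
  status: 'healthy' | 'unhealthy';
  latencyMs: number;
};

// Database failure takes the service down; a failing cache, AI,
// or payment probe only degrades it.
function overallStatus(checks: CheckResult[]): 'healthy' | 'degraded' | 'down' {
  if (checks.some(c => c.name === 'database' && c.status === 'unhealthy')) {
    return 'down';
  }
  return checks.every(c => c.status === 'healthy') ? 'healthy' : 'degraded';
}
```

This lets load balancers treat `degraded` as still routable while paging on `down`.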

Metrics Tracking

Key Metrics:

Application Performance:

  • Request rate (requests/second)
  • Response time (p50, p95, p99)
  • Error rate (%)
  • Active users (concurrent)

Business Metrics:

  • Quiz completions/hour
  • AI features usage
  • Payment success rate
  • Course enrollments

Infrastructure:

  • CPU utilization (%)
  • Memory usage (%)
  • Database connections
  • Container restarts
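The p50/p95/p99 figures are percentiles over a window of response-time samples; a minimal nearest-rank sketch:

```typescript
// Nearest-rank percentile: percentile(samples, 95) returns the value at
// the 95th-percentile rank of the sorted samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}
```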

Monitoring Stack:

yaml
Metrics Collection: Prometheus
Visualization: Grafana
Logging: Google Cloud Logging
Tracing: Google Cloud Trace
Error Tracking: Sentry
Uptime Monitoring: UptimeRobot
Alerting: PagerDuty

Alert Configuration

Critical Alerts (Page On-Call):

yaml
- name: API Down
  condition: health_check_success_rate < 99% for 2 minutes
  severity: critical
  notify: pagerduty

- name: High Error Rate
  condition: error_rate > 5% for 3 minutes
  severity: critical
  notify: pagerduty

- name: Payment Failures
  condition: payment_failure_rate > 10% for 5 minutes
  severity: critical
  notify: pagerduty

- name: Database Connection Pool Exhausted
  condition: db_connections > 90% for 2 minutes
  severity: critical
  notify: pagerduty

Warning Alerts (Slack):

yaml
- name: Elevated Response Time
  condition: p95_response_time > 1s for 10 minutes
  severity: warning
  notify: slack

- name: Increased Error Rate
  condition: error_rate > 2% for 10 minutes
  severity: warning
  notify: slack

- name: High Memory Usage
  condition: memory_usage > 80% for 15 minutes
  severity: warning
  notify: slack
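Conditions like `error_rate > 2% for 10 minutes` mean the threshold must be breached continuously across the window, not just once. A sketch of that evaluation (field names are illustrative):

```typescript
type Sample = { atSec: number; value: number };

// Fires only when every sample inside the trailing window breaches the
// threshold; a single healthy scrape resets the alert.
function isFiring(
  samples: Sample[],
  threshold: number,
  windowSec: number,
  nowSec: number,
): boolean {
  const windowed = samples.filter(s => nowSec - s.atSec <= windowSec);
  return windowed.length > 0 && windowed.every(s => s.value > threshold);
}
```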

Logging Strategy

Log Levels:

typescript
// Production: WARN and ERROR only
logger.error('Payment processing failed', {
  userId,
  transactionId,
  error: err.message
});

logger.warn('Gemini API rate limit approaching', {
  currentUsage: 850,
  limit: 1000
});

// Development/Staging: Include INFO and DEBUG
logger.info('Quiz generated successfully', {
  quizId,
  questionCount,
  generationTime
});

logger.debug('Database query executed', {
  query,
  duration,
  rowCount
});

Structured Logging:

typescript
import { logger } from './logger';

// Good: Structured with context
logger.error('Payment failed', {
  event: 'payment_failure',
  userId: 'user-123',
  amount: 500,
  provider: 'mpesa',
  errorCode: 'TIMEOUT',
  transactionId: 'txn-456',
  timestamp: new Date(),
});

// Bad: Unstructured string
logger.error('Payment failed for user-123 amount 500');

Performance Optimization

CDN and Caching

Static Assets:

  • Served from Google Cloud CDN
  • Cache-Control headers configured
  • Versioned filenames for cache busting
  • Compressed (gzip/brotli)
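Cache-busting filenames can be derived from a content hash, so an asset's URL changes exactly when its bytes do. A sketch using Node's crypto module:

```typescript
import { createHash } from 'node:crypto';

// app.js -> app.<8-char sha256 prefix>.js: unchanged contents keep the
// same URL (and stay cached), while any edit produces a new URL.
function versionedName(filename: string, contents: string): string {
  const hash = createHash('sha256').update(contents).digest('hex').slice(0, 8);
  const dot = filename.lastIndexOf('.');
  return dot === -1
    ? `${filename}.${hash}`
    : `${filename.slice(0, dot)}.${hash}${filename.slice(dot)}`;
}
```

Hashed assets can then be served with `Cache-Control: max-age=31536000, immutable`.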

API Caching:

typescript
// Redis for frequently accessed data
import { redis } from './cache';

export async function getQuiz(quizId: string) {
  // Check cache first
  const cached = await redis.get(`quiz:${quizId}`);
  if (cached) return JSON.parse(cached);

  // Fetch from database
  const quiz = await db.quiz.findUnique({ where: { id: quizId } });

  // Cache for 1 hour
  await redis.set(`quiz:${quizId}`, JSON.stringify(quiz), 'EX', 3600);

  return quiz;
}

Database Optimization

Connection Pooling:

typescript
// Prisma configuration
// Prisma reads the pool size from the connection string rather than the
// client constructor, e.g.:
//   DATABASE_URL=postgresql://...?connection_limit=10
// Kept conservative for Cloud Run, where many small instances share the DB.
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
});

Query Optimization:

  • Indexes on frequently queried columns
  • Avoid N+1 queries (use includes/joins)
  • Pagination for large result sets
  • Database query monitoring
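For pagination, keyset (cursor) queries scale better than OFFSET on large tables because the database seeks directly past the cursor instead of scanning skipped rows. The in-memory model below shows the shape of the query, which in SQL is roughly `WHERE id > $cursor ORDER BY id LIMIT $n`:

```typescript
type Row = { id: number };

// Return up to `limit` rows strictly after the cursor, in id order.
// With OFFSET, page N costs O(N * pageSize); with a cursor and an index
// on id it stays roughly O(pageSize + log n).
function pageAfter(rows: Row[], cursor: number | null, limit: number): Row[] {
  return [...rows]
    .sort((a, b) => a.id - b.id)
    .filter(r => cursor === null || r.id > cursor)
    .slice(0, limit);
}
```

The last `id` of each page becomes the cursor for the next request.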

Container Optimization

Resource Limits:

yaml
# Cloud Run service configuration (Knative-style YAML)
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "1"    # Always one warm instance
        autoscaling.knative.dev/maxScale: "100"  # Scale up to handle load
    spec:
      containers:
        - resources:
            limits:
              cpu: "2"     # Per-instance CPU limit
              memory: 1Gi  # Per-instance memory limit

# Note: Cloud Run scales on request concurrency and CPU utilization;
# Kubernetes-style requests/targetCPU fields do not apply here.

Deployment Checklist

Pre-Deployment

  • [ ] All tests passing (unit, integration, E2E)
  • [ ] Code reviewed and approved
  • [ ] Database migrations tested in staging
  • [ ] Feature flags configured
  • [ ] Monitoring dashboards prepared
  • [ ] Rollback plan documented
  • [ ] On-call engineer identified
  • [ ] Stakeholders notified

During Deployment

  • [ ] Backup database
  • [ ] Deploy to green environment
  • [ ] Run health checks
  • [ ] Execute migrations
  • [ ] Switch traffic gradually
  • [ ] Monitor error rates
  • [ ] Verify critical flows
  • [ ] Check business metrics

Post-Deployment

  • [ ] Monitor for 30 minutes
  • [ ] Run smoke tests
  • [ ] Check logs for errors
  • [ ] Verify integrations working
  • [ ] Update status page
  • [ ] Document any issues
  • [ ] Notify team of completion
  • [ ] Schedule retrospective (if issues)

Disaster Recovery

Backup Strategy

Database Backups:

  • Automated daily backups (retained 30 days)
  • Pre-deployment backups (retained 7 days)
  • Weekly full backups (retained 90 days)
  • Point-in-time recovery (7 days)
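The retention policy above can be expressed as a lookup, handy for backup-pruning scripts (a sketch of the stated policy, nothing more):

```typescript
type BackupKind = 'daily' | 'pre-deployment' | 'weekly-full';

// Retention windows in days, matching the stated policy.
const retentionDays: Record<BackupKind, number> = {
  daily: 30,
  'pre-deployment': 7,
  'weekly-full': 90,
};

// True while a backup is still inside its retention window.
function isRetained(kind: BackupKind, ageDays: number): boolean {
  return ageDays <= retentionDays[kind];
}
```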

Restore Process:

bash
# List available backups
gcloud sql backups list --instance=yebolearn-prod-db

# Restore from backup (this overwrites the target instance)
gcloud sql backups restore <backup-id> \
  --restore-instance=yebolearn-prod-db

# Verify data integrity
npm run db:verify

Application State:

  • Docker images retained indefinitely
  • Git tags for all releases
  • Configuration in version control
  • Secrets in Secret Manager (versioned)

Incident Response

Severity Levels:

P0 (Critical): Complete service outage, data loss risk

  • Response time: Immediate
  • Escalation: Page on-call + management
  • Communication: Status page + email users

P1 (High): Major feature broken, payment issues

  • Response time: 15 minutes
  • Escalation: On-call engineer
  • Communication: Status page update

P2 (Medium): Minor feature degraded

  • Response time: 2 hours
  • Escalation: Team Slack
  • Communication: Internal only

P3 (Low): Cosmetic issues, minor bugs

  • Response time: Next business day
  • Escalation: Linear ticket
  • Communication: None required
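The severity matrix maps directly to response-time targets, which makes breaches checkable in tooling. In the sketch below, P0's "immediate" is modelled as 0 minutes and P3's "next business day" as 24 hours; both are simplifying assumptions:

```typescript
type Severity = 'P0' | 'P1' | 'P2' | 'P3';

// Response-time targets from the matrix above, in minutes.
const respondWithinMin: Record<Severity, number> = {
  P0: 0,        // immediate
  P1: 15,
  P2: 120,
  P3: 24 * 60,  // next business day, approximated as 24h
};

// True when an unacknowledged incident has outlived its target.
function isBreaching(severity: Severity, minutesSinceReport: number): boolean {
  return minutesSinceReport > respondWithinMin[severity];
}
```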

YeboLearn - Empowering African Education