| 1 | +# Codex Prompt: Generate Synthetic SaaS Health Metrics Data |
| 2 | + |
| 3 | +## Context |
| 4 | +Generate synthetic data for a SaaS company's health metrics tracking system (Simetryk SaaS - Murilo's use case). This data will be used to calculate MRR movements, churn rates, cohort retention, and customer health scores for dashboard visualization. |
| 5 | + |
| 6 | +## Requirements |
| 7 | + |
| 8 | +### 1. Generate Three Related Tables |
| 9 | + |
| 10 | +#### Table 1: `day06_customers` |
| 11 | +Create **500 unique customers** with: |
| 12 | +- `customer_id` (TEXT, PRIMARY KEY): Format "cus_" + 16-char random string (Stripe-style: "cus_1234567890abcdef") |
| 13 | +- `email` (TEXT): Realistic email addresses (e.g., "john.smith@techcorp.com") |
| 14 | +- `signup_date` (DATE): Distributed across **24 months** (2023-01-01 to 2024-12-31) |
| 15 | + - **Cohort distribution**: More signups in early months, declining over time (realistic growth curve) |
| 16 | +- `plan_tier` (TEXT): One of "Starter", "Pro", or "Enterprise" |
| 17 | + - Distribution: 50% Starter, 35% Pro, 15% Enterprise |
| 18 | +- `mrr_current` (DECIMAL): Monthly Recurring Revenue based on plan tier: |
| 19 | + - Starter: $29-$99 (random variation) |
| 20 | + - Pro: $199-$499 |
| 21 | + - Enterprise: $999-$2,999 |
| 22 | +- `status` (TEXT): Either "active" or "churned" |
| 23 | + - **Churn rate**: 5-8% monthly (realistic SaaS churn) |
| 24 | +  - Older cohorts should have higher cumulative churn (they have had more months in which to churn) |
| 25 | + |
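The tier split and price bands above can be sampled directly. The following is a minimal sketch, not the required implementation; the names `PLAN_WEIGHTS`, `PLAN_PRICING`, and `sample_plan_and_mrr` are hypothetical, and whole-dollar MRR values are an assumption.

```python
import random
from typing import List, Tuple

# Assumed weights and price bands, copied from the Table 1 spec above.
PLAN_WEIGHTS = {"Starter": 0.50, "Pro": 0.35, "Enterprise": 0.15}
PLAN_PRICING = {"Starter": (29, 99), "Pro": (199, 499), "Enterprise": (999, 2999)}


def sample_plan_and_mrr(rng: random.Random) -> Tuple[str, int]:
    """Pick a tier by weight, then a whole-dollar MRR inside that tier's band."""
    tiers = list(PLAN_WEIGHTS)
    tier = rng.choices(tiers, weights=[PLAN_WEIGHTS[t] for t in tiers], k=1)[0]
    low, high = PLAN_PRICING[tier]
    return tier, rng.randint(low, high)  # randint is inclusive on both ends


rng = random.Random(42)  # local RNG so the sample is reproducible
sample: List[Tuple[str, int]] = [sample_plan_and_mrr(rng) for _ in range(500)]
```

Weighted sampling only approximates the 50/35/15 split. If exact counts (250/175/75) are wanted, building the tier list up front and shuffling it is one alternative.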
| 26 | +#### Table 2: `day06_subscriptions` |
| 27 | +Create **subscription history** for all 500 customers (multiple subscriptions per customer if they upgraded/downgraded): |
| 28 | +- `subscription_id` (TEXT, PRIMARY KEY): Format "sub_" + 16-char random string (Stripe-style) |
| 29 | +- `customer_id` (TEXT, FOREIGN KEY): References day06_customers.customer_id |
| 30 | +- `start_date` (DATE): Subscription start date |
| 31 | + - First subscription: Same as customer signup_date |
| 32 | + - Subsequent subscriptions: After previous end_date (upgrades/downgrades) |
| 33 | +- `end_date` (DATE): NULL if currently active, or churn/change date |
| 34 | +- `mrr` (DECIMAL): MRR for this subscription period |
| 35 | +- `plan_tier` (TEXT): "Starter", "Pro", or "Enterprise" |
| 36 | + |
| 37 | +**Business Logic for Subscriptions:** |
| 38 | +- **New MRR**: Customer's first subscription (start_date = signup_date) |
| 39 | +- **Expansion MRR**: Upgrade from lower to higher tier (Starter → Pro → Enterprise) |
| 40 | + - 15-20% of customers should have at least one upgrade |
| 41 | +- **Contraction MRR**: Downgrade from higher to lower tier |
| 42 | + - 5-10% of customers should have at least one downgrade |
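| 43 | +- **Churn MRR**: Subscription ends with no follow-on subscription (end_date IS NOT NULL and the customer is "churned"); an end_date followed by a new subscription is an upgrade or downgrade, not churn |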
| 44 | + - 5-8% monthly churn rate |
| 45 | +- **Retention**: Active customers maintain same plan tier |
| 46 | + - Most customers (70-75%) should remain in original tier |
| 47 | + |
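The movement categories above reduce to a comparison between two consecutive subscription periods. A minimal sketch, assuming tiers are strictly ordered Starter < Pro < Enterprise; `TIER_RANK` and `classify_movement` are hypothetical names, not part of the spec.

```python
from typing import Optional

# Assumed ordering of tiers, lowest to highest.
TIER_RANK = {"Starter": 0, "Pro": 1, "Enterprise": 2}


def classify_movement(prev_tier: Optional[str], next_tier: Optional[str]) -> str:
    """Label the transition between two consecutive subscription periods.

    prev_tier is None for a customer's first subscription.
    next_tier is None when a subscription ends with no successor.
    """
    if prev_tier is None and next_tier is None:
        raise ValueError("at least one tier is required")
    if prev_tier is None:
        return "new"
    if next_tier is None:
        return "churn"
    if TIER_RANK[next_tier] > TIER_RANK[prev_tier]:
        return "expansion"
    if TIER_RANK[next_tier] < TIER_RANK[prev_tier]:
        return "contraction"
    return "retained"
```

For example, `classify_movement("Starter", "Pro")` returns `"expansion"` and `classify_movement("Pro", None)` returns `"churn"`.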
| 48 | +#### Table 3: `day06_mrr_movements` |
| 49 | +Create **pre-aggregated monthly MRR movements** for 24 months (2023-01 to 2024-12): |
| 50 | +- `month` (DATE, PRIMARY KEY): First day of each month (e.g., "2023-01-01") |
| 51 | +- `new_mrr` (DECIMAL): Sum of MRR from new customers that month |
| 52 | +- `expansion_mrr` (DECIMAL): Sum of MRR increases from upgrades |
| 53 | +- `contraction_mrr` (DECIMAL): Sum of MRR decreases from downgrades |
| 54 | +- `churn_mrr` (DECIMAL): Sum of MRR lost from cancellations |
| 55 | +- `net_mrr` (DECIMAL): Calculated as: new_mrr + expansion_mrr - contraction_mrr - churn_mrr |
| 56 | + |
| 57 | +**Formula**: `net_mrr = new_mrr + expansion_mrr - contraction_mrr - churn_mrr` |
| 58 | + |
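As a quick arithmetic check of the formula, the sketch below applies it to the 24-month totals shown in the sample output of this prompt. The function name `net_mrr` is illustrative only.

```python
def net_mrr(new_mrr: float, expansion_mrr: float,
            contraction_mrr: float, churn_mrr: float) -> float:
    """Net change in MRR for one period: additions minus losses."""
    return new_mrr + expansion_mrr - contraction_mrr - churn_mrr


# 24-month totals taken from the sample output section of this prompt.
total = net_mrr(1_245_678, 234_567, 67_890, 456_789)
print(total)  # 955566
```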
| 59 | +### 2. Data Relationships and Business Logic |
| 60 | + |
| 61 | +**Ensure Realistic SaaS Behavior:** |
| 62 | + |
| 63 | +1. **MRR Growth Pattern**: |
| 64 | + - Start with ~$50K MRR in month 1 |
| 65 | + - Grow to ~$200K MRR by month 24 (4x growth) |
| 66 | + - Growth should be mostly linear with some monthly variance |
| 67 | + |
| 68 | +2. **Cohort Retention Curves**: |
| 69 | + - Month 0 (signup): 100% retention |
| 70 | + - Month 1: 90-95% retention |
| 71 | + - Month 3: 80-85% retention |
| 72 | + - Month 6: 70-75% retention |
| 73 | + - Month 12: 60-65% retention |
| 74 | + - Month 24: 50-55% retention |
| 75 | + |
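The retention checkpoints above imply a monthly churn rate for each segment of the curve. The sketch below derives it, using the midpoint of each range as an assumption; `CHECKPOINTS`, `segment_monthly_churn`, and `retention_after` are hypothetical names.

```python
from typing import List, Tuple

# (months since signup, retained fraction); midpoints of the ranges above.
CHECKPOINTS = [(0, 1.0), (1, 0.925), (3, 0.825), (6, 0.725), (12, 0.625), (24, 0.525)]


def segment_monthly_churn(
    checkpoints: List[Tuple[int, float]]
) -> List[Tuple[int, int, float]]:
    """Constant monthly churn that carries retention from one checkpoint to the next."""
    rates = []
    for (m0, r0), (m1, r1) in zip(checkpoints, checkpoints[1:]):
        monthly_survival = (r1 / r0) ** (1 / (m1 - m0))
        rates.append((m0, m1, 1 - monthly_survival))
    return rates


def retention_after(monthly_churn: float, months: int) -> float:
    """Retained fraction after compounding a constant monthly churn."""
    return (1 - monthly_churn) ** months


rates = segment_monthly_churn(CHECKPOINTS)
for start, end, churn in rates:
    print(f"months {start:>2}-{end:<2}: {churn:.1%} monthly churn")
```

Under these midpoints the implied monthly churn falls as customers age, from 7.5% in the first month to under 2% in the second year. A constant 6% monthly churn would compound to roughly 48% retention at month 12, below the 60-65% checkpoint, so a generator that has to satisfy both targets probably needs a churn probability that decays with customer age.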
| 76 | +3. **Plan Tier Distribution**: |
| 77 | + - Starter: 50% of customers (entry point) |
| 78 | + - Pro: 35% of customers (sweet spot) |
| 79 | + - Enterprise: 15% of customers (high value) |
| 80 | + |
| 81 | +4. **Customer Lifecycle Examples**: |
| 82 | + - **Happy Path**: Starter → Pro → Enterprise (stays active) |
| 83 | + - **Expansion**: Starter → Pro (stays active) |
| 84 | + - **Contraction**: Pro → Starter (stays active but downgraded) |
| 85 | + - **Churn**: Any tier → churned (end_date set) |
| 86 | + - **Stable**: Same tier throughout (most common) |
| 87 | + |
| 88 | +5. **Edge Cases to Include**: |
| 89 | + - Customers who upgrade twice (Starter → Pro → Enterprise) |
| 90 | +   - Customers who downgrade after upgrading (Starter → Pro → Starter) |
| 91 | + - Customers who churn immediately (month 1) |
| 92 | +   - Customers who stay active for the full 24 months |
| 93 | + - Enterprise customers with high MRR ($2K+/month) |
| 94 | + |
| 95 | +### 3. Output Format |
| 96 | + |
| 97 | +Generate Python code using the standard-library `sqlite3` module that: |
| 98 | +1. Creates the three tables with proper schema and foreign keys |
| 99 | +2. Inserts synthetic data using realistic SaaS patterns |
| 100 | +3. Saves to `data/day06_saas_metrics.db` |
| 101 | +4. Prints comprehensive summary statistics after generation |
| 102 | +5. Validates data integrity (foreign keys, date ranges, MRR calculations) |
| 103 | + |
| 104 | +### 4. Code Structure |
| 105 | + |
| 106 | +```python |
| 107 | +#!/usr/bin/env python3 |
| 108 | +""" |
| 109 | +Synthetic Data Generator for Day 06: SaaS Health Metrics Foundation |
| 110 | + |
| 111 | +This script generates realistic SaaS subscription data for Murilo's dashboard: |
| 112 | +- 500 customers across 24 months |
| 113 | +- Subscription history with upgrades/downgrades/churn |
| 114 | +- Pre-aggregated MRR movements for waterfall analysis |
| 115 | + |
| 116 | +Stakeholder: Murilo (Simetryk SaaS) |
| 117 | +Use Case: MRR tracking, churn analysis, cohort retention, customer health scoring |
| 118 | + |
| 119 | +Usage: |
| 120 | + python day06_DATA_synthetic_saas.py |
| 121 | +""" |
| 122 | + |
| 123 | +import sqlite3 |
| 124 | +import random |
| 125 | +import hashlib |
| 126 | +from datetime import datetime, timedelta |
| 127 | +from pathlib import Path |
| 128 | +from typing import List, Dict, Tuple |
| 129 | + |
| 130 | +# Configuration |
| 131 | +DAY06_DB_PATH = "data/day06_saas_metrics.db" |
| 132 | +DAY06_NUM_CUSTOMERS = 500 |
| 133 | +DAY06_NUM_MONTHS = 24 |
| 134 | +DAY06_START_DATE = datetime(2023, 1, 1) |
| 135 | + |
| 136 | +# Plan tier pricing |
| 137 | +DAY06_PLAN_PRICING = { |
| 138 | + 'Starter': (29, 99), # Min, Max MRR |
| 139 | + 'Pro': (199, 499), |
| 140 | + 'Enterprise': (999, 2999) |
| 141 | +} |
| 142 | + |
| 143 | +# SaaS metrics targets |
| 144 | +DAY06_MONTHLY_CHURN_RATE = 0.06 # 6% monthly churn |
| 145 | +DAY06_UPGRADE_PROBABILITY = 0.18 # 18% upgrade rate |
| 146 | +DAY06_DOWNGRADE_PROBABILITY = 0.08 # 8% downgrade rate |
| 147 | + |
| 148 | +# Rest of the code here... |
| 149 | +``` |
| 150 | + |
| 151 | +### 5. Key Requirements |
| 152 | + |
| 153 | +- **Naming Convention**: Use `day06_` prefix for all tables, `DAY06_` for constants |
| 154 | +- **Documentation**: Include docstrings explaining SaaS business logic |
| 155 | +- **Data Validation**: |
| 156 | + - No negative MRR values |
| 157 | + - Dates in valid ranges (2023-01-01 to 2024-12-31) |
| 158 | + - end_date > start_date (if not NULL) |
| 159 | + - Foreign keys validated |
| 160 | +- **Indexes**: Create indexes on: |
| 161 | + - `day06_subscriptions.customer_id` |
| 162 | + - `day06_subscriptions.start_date` |
| 163 | + - `day06_customers.signup_date` |
| 164 | + - `day06_customers.status` |
| 165 | +- **Realistic IDs**: Use Stripe-style IDs (e.g., "cus_1234567890abcdef") |
| 166 | + |
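One way to express the validation rules and indexes above is to push them into the schema itself. This is a sketch under assumptions: the index names and the `create_schema` helper are hypothetical, and dates are assumed to be stored as ISO-8601 text so that string comparison orders them correctly.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS day06_customers (
    customer_id TEXT PRIMARY KEY,
    email       TEXT NOT NULL,
    signup_date DATE NOT NULL,
    plan_tier   TEXT NOT NULL CHECK (plan_tier IN ('Starter', 'Pro', 'Enterprise')),
    mrr_current DECIMAL NOT NULL CHECK (mrr_current >= 0),
    status      TEXT NOT NULL CHECK (status IN ('active', 'churned'))
);
CREATE TABLE IF NOT EXISTS day06_subscriptions (
    subscription_id TEXT PRIMARY KEY,
    customer_id     TEXT NOT NULL REFERENCES day06_customers(customer_id),
    start_date      DATE NOT NULL,
    end_date        DATE,
    mrr             DECIMAL NOT NULL CHECK (mrr >= 0),
    plan_tier       TEXT NOT NULL CHECK (plan_tier IN ('Starter', 'Pro', 'Enterprise')),
    CHECK (end_date IS NULL OR end_date > start_date)
);
CREATE TABLE IF NOT EXISTS day06_mrr_movements (
    month           DATE PRIMARY KEY,
    new_mrr         DECIMAL NOT NULL,
    expansion_mrr   DECIMAL NOT NULL,
    contraction_mrr DECIMAL NOT NULL,
    churn_mrr       DECIMAL NOT NULL,
    net_mrr         DECIMAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_day06_subscriptions_customer_id
    ON day06_subscriptions(customer_id);
CREATE INDEX IF NOT EXISTS idx_day06_subscriptions_start_date
    ON day06_subscriptions(start_date);
CREATE INDEX IF NOT EXISTS idx_day06_customers_signup_date
    ON day06_customers(signup_date);
CREATE INDEX IF NOT EXISTS idx_day06_customers_status
    ON day06_customers(status);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create tables and indexes; foreign keys are off by default in SQLite."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")  # the real script would use the .db path
create_schema(conn)
```

With `PRAGMA foreign_keys = ON`, inserting a subscription for an unknown `customer_id` raises `sqlite3.IntegrityError`, so orphaned rows are rejected at write time rather than found afterwards.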
| 167 | +### 6. Sample Output Expected |
| 168 | + |
| 169 | +``` |
| 170 | +Generating synthetic SaaS metrics data... |
| 171 | +============================================================ |
| 172 | + |
| 173 | +Created 500 customers across 24 months (2023-01 to 2024-12) |
| 174 | + - Starter tier: 250 customers (50%) |
| 175 | + - Pro tier: 175 customers (35%) |
| 176 | + - Enterprise tier: 75 customers (15%) |
| 177 | + - Active customers: 312 (62%) |
| 178 | + - Churned customers: 188 (38%) |
| 179 | + |
| 180 | +Generated 687 subscription records |
| 181 | + - New subscriptions: 500 |
| 182 | + - Upgrades (Expansion): 98 (19.6%) |
| 183 | + - Downgrades (Contraction): 42 (8.4%) |
| 184 | + - Churned subscriptions: 188 (37.6%) |
| 185 | + |
| 186 | +MRR Movements Summary (24 months): |
| 187 | + - Total New MRR: $1,245,678 |
| 188 | + - Total Expansion MRR: $234,567 |
| 189 | + - Total Contraction MRR: $67,890 |
| 190 | + - Total Churn MRR: $456,789 |
| 191 | + - Net MRR Growth: $955,566 |
| 192 | + |
| 193 | +Current MRR (Month 24): $203,456 |
| 194 | + - Starting MRR (Month 1): $52,345 |
| 195 | +  - MRR Growth: 289% over 24 months |
| 196 | + |
| 197 | +Cohort Retention Analysis: |
| 198 | + - 2023-01 cohort (24 months old): 54% retained |
| 199 | + - 2023-06 cohort (18 months old): 62% retained |
| 200 | + - 2024-01 cohort (12 months old): 67% retained |
| 201 | + - 2024-06 cohort (6 months old): 78% retained |
| 202 | + |
| 203 | +Data Integrity Checks: |
| 204 | + ✓ All foreign keys valid |
| 205 | + ✓ All dates within valid range |
| 206 | + ✓ All MRR values positive |
| 207 | + ✓ MRR movements balance correctly |
| 208 | + ✓ No orphaned subscriptions |
| 209 | + |
| 210 | +Database saved to: data/day06_saas_metrics.db |
| 211 | +File size: 248 KB |
| 212 | +============================================================ |
| 213 | +``` |
| 214 | + |
| 215 | +### 7. Testing Requirements |
| 216 | + |
| 217 | +After generation, the script should validate: |
| 218 | +- All `customer_id` in `day06_subscriptions` exist in `day06_customers` |
| 219 | +- All subscription dates are within customer lifetime |
| 220 | +- MRR movements sum correctly for each month |
| 221 | +- At least 50% of customers have only 1 subscription (stable customers) |
| 222 | +- At least 15% of customers have 2+ subscriptions (lifecycle changes) |
| 223 | +- Current MRR matches sum of active subscriptions |
| 224 | +- Churn rate is within expected range (5-8% monthly) |
| 225 | + |
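Several of the checks above can be written as SQL queries that count violating rows, where zero means the check passes. A minimal sketch, assuming the three tables already exist; `CHECKS` and `run_checks` are hypothetical names, and the half-cent tolerance on `net_mrr` is an assumption to absorb rounding.

```python
import sqlite3
from typing import Dict

# Each query returns the number of rows that violate the rule.
CHECKS = {
    "orphaned_subscriptions": """
        SELECT COUNT(*) FROM day06_subscriptions s
        LEFT JOIN day06_customers c ON c.customer_id = s.customer_id
        WHERE c.customer_id IS NULL""",
    "bad_date_order": """
        SELECT COUNT(*) FROM day06_subscriptions
        WHERE end_date IS NOT NULL AND end_date <= start_date""",
    "mrr_current_mismatch": """
        SELECT COUNT(*) FROM day06_customers c
        WHERE c.status = 'active' AND c.mrr_current <> (
            SELECT COALESCE(SUM(s.mrr), 0) FROM day06_subscriptions s
            WHERE s.customer_id = c.customer_id AND s.end_date IS NULL)""",
    "net_mrr_mismatch": """
        SELECT COUNT(*) FROM day06_mrr_movements
        WHERE ABS(net_mrr - (new_mrr + expansion_mrr
                             - contraction_mrr - churn_mrr)) > 0.005""",
}


def run_checks(conn: sqlite3.Connection) -> Dict[str, int]:
    """Run every check and return {check name: number of violating rows}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
```

A caller could print a `✓` for each name whose count is zero and fail the run otherwise. The share-based checks (single-subscription customers, monthly churn range) would need separate aggregate queries.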
| 226 | +### 8. Advanced Requirements |
| 227 | + |
| 228 | +**Cohort-Based Data Generation:** |
| 229 | +```python |
| 230 | +def day06_generate_cohort_customers(cohort_month: datetime, num_customers: int) -> List[Dict]: |
| 231 | + """ |
| 232 | + Generate customers for a specific signup cohort. |
| 233 | + |
| 234 | + Apply cohort-specific retention curves: |
| 235 | + - Older cohorts have higher cumulative churn |
| 236 | + - Newer cohorts have higher retention |
| 237 | + """ |
| 238 | + # Implementation details... |
| 239 | +``` |
| 240 | + |
| 241 | +**MRR Movement Calculation:** |
| 242 | +```python |
| 243 | +def day06_calculate_mrr_movements(month: datetime, subscriptions: List[Dict]) -> Dict: |
| 244 | + """ |
| 245 | + Calculate New, Expansion, Contraction, and Churn MRR for a given month. |
| 246 | + |
| 247 | + Returns: |
| 248 | + { |
| 249 | + 'new_mrr': float, |
| 250 | + 'expansion_mrr': float, |
| 251 | + 'contraction_mrr': float, |
| 252 | + 'churn_mrr': float, |
| 253 | + 'net_mrr': float |
| 254 | + } |
| 255 | + """ |
| 256 | + # Implementation details... |
| 257 | +``` |
| 258 | + |
| 259 | +**Customer ID Generation (Stripe-style):** |
| 260 | +```python |
| 261 | +def day06_generate_customer_id() -> str: |
| 262 | + """Generate Stripe-style customer ID: cus_[16-char random string]""" |
| 263 | + random_str = hashlib.md5(str(random.random()).encode()).hexdigest()[:16] |
| 264 | + return f"cus_{random_str}" |
| 265 | + |
| 266 | +def day06_generate_subscription_id() -> str: |
| 267 | + """Generate Stripe-style subscription ID: sub_[16-char random string]""" |
| 268 | + random_str = hashlib.md5(str(random.random()).encode()).hexdigest()[:16] |
| 269 | + return f"sub_{random_str}" |
| 270 | +``` |
| 271 | + |
| 272 | +## Constraints |
| 273 | + |
| 274 | +- **Execution Time**: < 10 seconds |
| 275 | +- **Database Size**: < 500 KB |
| 276 | +- **Dependencies**: Python standard library only (`sqlite3` is part of it) |
| 277 | +- **Reproducibility**: Use `random.seed(42)` for consistent results |
| 278 | +- **No External Calls**: No API calls or external data sources |
| 279 | + |
| 280 | +## Deliverable |
| 281 | + |
| 282 | +Complete Python script named `day06_DATA_synthetic_saas.py` that: |
| 283 | +1. Generates 500 customers with realistic SaaS behavior |
| 284 | +2. Creates subscription history with upgrades/downgrades/churn |
| 285 | +3. Pre-calculates MRR movements for 24 months |
| 286 | +4. Validates data integrity |
| 287 | +5. Prints comprehensive summary statistics |
| 288 | +6. Saves to `data/day06_saas_metrics.db` |
| 289 | + |
| 290 | +This data will feed into 4 SQL views: |
| 291 | +- `day06_mrr_summary` (MRR waterfall) |
| 292 | +- `day06_churn_by_cohort` (churn analysis) |
| 293 | +- `day06_retention_curves` (retention over time) |
| 294 | +- `day06_customer_health` (LTV/CAC scoring) |
| 295 | + |
| 296 | +--- |
| 297 | + |
| 298 | +**IMPORTANT**: This data should look realistic enough to demonstrate to Murilo how SaaS health metrics work in practice. Focus on creating meaningful patterns that will result in interesting dashboard visualizations on Day 19. |