Commit eb518b9

Refactors config for SaaS metrics use case
Transitions configuration from financial consulting to SaaS health metrics focus, introducing customer, plan, and MRR-related parameters, retention and churn logic, health scoring, and updated helper functions. Improves clarity and alignment with SaaS analytic requirements.
1 parent e374a05 commit eb518b9

4 files changed: +1213 -178 lines changed

Lines changed: 298 additions & 0 deletions
@@ -0,0 +1,298 @@
# Codex Prompt: Generate Synthetic SaaS Health Metrics Data

## Context
Generate synthetic data for a SaaS company's health metrics tracking system (Simetryk SaaS - Murilo's use case). This data will be used to calculate MRR movements, churn rates, cohort retention, and customer health scores for dashboard visualization.

## Requirements

### 1. Generate Three Related Tables

#### Table 1: `day06_customers`
Create **500 unique customers** with:
- `customer_id` (TEXT, PRIMARY KEY): Format "cus_" + 16-char random string (Stripe-style: "cus_1234567890abcdef")
- `email` (TEXT): Realistic email addresses (e.g., "john.smith@techcorp.com")
- `signup_date` (DATE): Distributed across **24 months** (2023-01-01 to 2024-12-31)
  - **Cohort distribution**: More signups in early months, declining over time (realistic growth curve)
- `plan_tier` (TEXT): One of "Starter", "Pro", or "Enterprise"
  - Distribution: 50% Starter, 35% Pro, 15% Enterprise
- `mrr_current` (DECIMAL): Monthly Recurring Revenue based on plan tier:
  - Starter: $29-$99 (random variation)
  - Pro: $199-$499
  - Enterprise: $999-$2,999
- `status` (TEXT): Either "active" or "churned"
  - **Churn rate**: 5-8% monthly (realistic SaaS churn)
  - Older cohorts should have higher cumulative churn (they have had more months in which to churn)
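
The column rules for `day06_customers` could be sampled roughly as follows. This is a minimal sketch, not the required implementation: the function and constant names are hypothetical, and the triangular distribution is only one assumed way to get "more signups in early months". The tiers, weights, price ranges, and date window come from the spec above.

```python
import random
from datetime import date, timedelta

# Hypothetical names for illustration; values are taken from the Table 1 spec.
PLAN_WEIGHTS = {"Starter": 0.50, "Pro": 0.35, "Enterprise": 0.15}
PLAN_PRICING = {"Starter": (29, 99), "Pro": (199, 499), "Enterprise": (999, 2999)}
START = date(2023, 1, 1)
END = date(2024, 12, 31)


def sketch_sample_customer(rng: random.Random) -> dict:
    """Sample signup_date, plan_tier, and mrr_current for one customer."""
    tier = rng.choices(
        list(PLAN_WEIGHTS), weights=list(PLAN_WEIGHTS.values()), k=1
    )[0]
    low, high = PLAN_PRICING[tier]
    # Assumption: a triangular distribution with its mode at day 0 stands in
    # for "more signups in early months, declining over time".
    span_days = (END - START).days
    offset = int(rng.triangular(0, span_days, 0))
    return {
        "signup_date": START + timedelta(days=offset),
        "plan_tier": tier,
        "mrr_current": float(rng.randint(low, high)),
    }
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching global state.
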

#### Table 2: `day06_subscriptions`
Create **subscription history** for all 500 customers (multiple subscriptions per customer if they upgraded/downgraded):
- `subscription_id` (TEXT, PRIMARY KEY): Format "sub_" + 16-char random string (Stripe-style)
- `customer_id` (TEXT, FOREIGN KEY): References day06_customers.customer_id
- `start_date` (DATE): Subscription start date
  - First subscription: Same as customer signup_date
  - Subsequent subscriptions: After previous end_date (upgrades/downgrades)
- `end_date` (DATE): NULL if currently active, or churn/change date
- `mrr` (DECIMAL): MRR for this subscription period
- `plan_tier` (TEXT): "Starter", "Pro", or "Enterprise"

**Business Logic for Subscriptions:**
- **New MRR**: Customer's first subscription (start_date = signup_date)
- **Expansion MRR**: Upgrade from lower to higher tier (Starter → Pro → Enterprise)
  - 15-20% of customers should have at least one upgrade
- **Contraction MRR**: Downgrade from higher to lower tier
  - 5-10% of customers should have at least one downgrade
- **Churn MRR**: Subscription ends with no follow-on subscription (end_date IS NOT NULL and the customer has no later subscription; an end_date set by an upgrade or downgrade is not churn)
  - 5-8% monthly churn rate
- **Retention**: Active customers maintain same plan tier
  - Most customers (70-75%) should remain in original tier
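
The four movement types above can be told apart by comparing a customer's previous and next subscription. The sketch below is one possible reading, with a hypothetical function name. It classifies by MRR delta rather than by tier name, which is an assumption; it holds for this spec only because the three tiers' price ranges do not overlap.

```python
from typing import Optional, Tuple


def sketch_classify_movement(
    prev_mrr: Optional[float], next_mrr: Optional[float]
) -> Tuple[str, float]:
    """Classify one customer transition as (movement_type, amount).

    prev_mrr is None for a customer's first subscription.
    next_mrr is None when a subscription ends with no follow-on subscription.
    """
    if prev_mrr is None and next_mrr is None:
        raise ValueError("at least one of prev_mrr / next_mrr is required")
    if prev_mrr is None:
        return ("new", next_mrr)
    if next_mrr is None:
        return ("churn", prev_mrr)
    delta = next_mrr - prev_mrr
    if delta > 0:
        return ("expansion", delta)
    if delta < 0:
        return ("contraction", -delta)
    return ("retained", 0.0)
```

Amounts are returned as positive numbers so they can be summed directly into the `expansion_mrr`, `contraction_mrr`, and `churn_mrr` columns.
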

#### Table 3: `day06_mrr_movements`
Create **pre-aggregated monthly MRR movements** for 24 months (2023-01 to 2024-12):
- `month` (DATE, PRIMARY KEY): First day of each month (e.g., "2023-01-01")
- `new_mrr` (DECIMAL): Sum of MRR from new customers that month
- `expansion_mrr` (DECIMAL): Sum of MRR increases from upgrades
- `contraction_mrr` (DECIMAL): Sum of MRR decreases from downgrades
- `churn_mrr` (DECIMAL): Sum of MRR lost from cancellations
- `net_mrr` (DECIMAL): Calculated as: new_mrr + expansion_mrr - contraction_mrr - churn_mrr

**Formula**: `net_mrr = new_mrr + expansion_mrr - contraction_mrr - churn_mrr`
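
As a quick worked check of the formula, the sketch below plugs in the 24-month totals shown in the sample output later in this prompt. The function name is hypothetical; contraction and churn are passed as positive magnitudes.

```python
def sketch_net_mrr(
    new_mrr: float, expansion_mrr: float, contraction_mrr: float, churn_mrr: float
) -> float:
    """net_mrr = new_mrr + expansion_mrr - contraction_mrr - churn_mrr"""
    return new_mrr + expansion_mrr - contraction_mrr - churn_mrr


# Totals from the sample output: New, Expansion, Contraction, Churn.
print(sketch_net_mrr(1_245_678, 234_567, 67_890, 456_789))  # 955566
```
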

### 2. Data Relationships and Business Logic

**Ensure Realistic SaaS Behavior:**

1. **MRR Growth Pattern**:
   - Start with ~$50K MRR in month 1
   - Grow to ~$200K MRR by month 24 (4x growth)
   - Growth should be mostly linear with some monthly variance

2. **Cohort Retention Curves**:
   - Month 0 (signup): 100% retention
   - Month 1: 90-95% retention
   - Month 3: 80-85% retention
   - Month 6: 70-75% retention
   - Month 12: 60-65% retention
   - Month 24: 50-55% retention

3. **Plan Tier Distribution**:
   - Starter: 50% of customers (entry point)
   - Pro: 35% of customers (sweet spot)
   - Enterprise: 15% of customers (high value)

4. **Customer Lifecycle Examples**:
   - **Happy Path**: Starter → Pro → Enterprise (stays active)
   - **Expansion**: Starter → Pro (stays active)
   - **Contraction**: Pro → Starter (stays active but downgraded)
   - **Churn**: Any tier → churned (end_date set)
   - **Stable**: Same tier throughout (most common)

5. **Edge Cases to Include**:
   - Customers who upgrade twice (Starter → Pro → Enterprise)
   - Customers who downgrade after upgrading (Pro → Starter)
   - Customers who churn immediately (month 1)
   - Customers who stay active for 24+ months
   - Enterprise customers with high MRR ($2K+/month)

### 3. Output Format

Generate Python code using SQLite3 that:
1. Creates the three tables with proper schema and foreign keys
2. Inserts synthetic data using realistic SaaS patterns
3. Saves to `data/day06_saas_metrics.db`
4. Prints comprehensive summary statistics after generation
5. Validates data integrity (foreign keys, date ranges, MRR calculations)
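
Step 1 might look like the sketch below. Table and column names follow the three table specs above; the `CHECK` constraints, the index names, and the `sketch_create_schema` helper are assumptions added for illustration. SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on the connection.

```python
import sqlite3

# Assumed DDL: column names come from the spec, constraints are illustrative.
SKETCH_SCHEMA = """
CREATE TABLE IF NOT EXISTS day06_customers (
    customer_id TEXT PRIMARY KEY,
    email       TEXT NOT NULL,
    signup_date DATE NOT NULL,
    plan_tier   TEXT NOT NULL CHECK (plan_tier IN ('Starter', 'Pro', 'Enterprise')),
    mrr_current DECIMAL NOT NULL CHECK (mrr_current >= 0),
    status      TEXT NOT NULL CHECK (status IN ('active', 'churned'))
);
CREATE TABLE IF NOT EXISTS day06_subscriptions (
    subscription_id TEXT PRIMARY KEY,
    customer_id     TEXT NOT NULL REFERENCES day06_customers(customer_id),
    start_date      DATE NOT NULL,
    end_date        DATE,
    mrr             DECIMAL NOT NULL CHECK (mrr >= 0),
    plan_tier       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS day06_mrr_movements (
    month           DATE PRIMARY KEY,
    new_mrr         DECIMAL NOT NULL,
    expansion_mrr   DECIMAL NOT NULL,
    contraction_mrr DECIMAL NOT NULL,
    churn_mrr       DECIMAL NOT NULL,
    net_mrr         DECIMAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_day06_sub_customer ON day06_subscriptions(customer_id);
CREATE INDEX IF NOT EXISTS idx_day06_sub_start ON day06_subscriptions(start_date);
CREATE INDEX IF NOT EXISTS idx_day06_cust_signup ON day06_customers(signup_date);
CREATE INDEX IF NOT EXISTS idx_day06_cust_status ON day06_customers(status);
"""


def sketch_create_schema(conn: sqlite3.Connection) -> None:
    """Enable foreign-key enforcement and create the three tables."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SKETCH_SCHEMA)
```

For a quick experiment, `sketch_create_schema(sqlite3.connect(":memory:"))` builds the schema without writing a file.
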

### 4. Code Structure

```python
#!/usr/bin/env python3
"""
Synthetic Data Generator for Day 06: SaaS Health Metrics Foundation

This script generates realistic SaaS subscription data for Murilo's dashboard:
- 500 customers across 24 months
- Subscription history with upgrades/downgrades/churn
- Pre-aggregated MRR movements for waterfall analysis

Stakeholder: Murilo (Simetryk SaaS)
Use Case: MRR tracking, churn analysis, cohort retention, customer health scoring

Usage:
    python day06_DATA_synthetic_saas.py
"""

import sqlite3
import random
import hashlib
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Dict, Tuple

# Configuration
DAY06_DB_PATH = "data/day06_saas_metrics.db"
DAY06_NUM_CUSTOMERS = 500
DAY06_NUM_MONTHS = 24
DAY06_START_DATE = datetime(2023, 1, 1)

# Plan tier pricing
DAY06_PLAN_PRICING = {
    'Starter': (29, 99),  # Min, Max MRR
    'Pro': (199, 499),
    'Enterprise': (999, 2999)
}

# SaaS metrics targets
DAY06_MONTHLY_CHURN_RATE = 0.06  # 6% monthly churn
DAY06_UPGRADE_PROBABILITY = 0.18  # 18% upgrade rate
DAY06_DOWNGRADE_PROBABILITY = 0.08  # 8% downgrade rate

# Rest of the code here...
```

### 5. Key Requirements

- **Naming Convention**: Use `day06_` prefix for all tables, `DAY06_` for constants
- **Documentation**: Include docstrings explaining SaaS business logic
- **Data Validation**:
  - No negative MRR values
  - Dates in valid ranges (2023-01-01 to 2024-12-31)
  - end_date > start_date (if not NULL)
  - Foreign keys validated
- **Indexes**: Create indexes on:
  - `day06_subscriptions.customer_id`
  - `day06_subscriptions.start_date`
  - `day06_customers.signup_date`
  - `day06_customers.status`
- **Realistic IDs**: Use Stripe-style IDs (e.g., "cus_AbCdEf1234567890")

### 6. Sample Output Expected

```
Generating synthetic SaaS metrics data...
============================================================

Created 500 customers across 24 months (2023-01 to 2024-12)
- Starter tier: 250 customers (50%)
- Pro tier: 175 customers (35%)
- Enterprise tier: 75 customers (15%)
- Active customers: 312 (62%)
- Churned customers: 188 (38%)

Generated 687 subscription records
- New subscriptions: 500
- Upgrades (Expansion): 98 (19.6%)
- Downgrades (Contraction): 42 (8.4%)
- Churned subscriptions: 188 (37.6%)

MRR Movements Summary (24 months):
- Total New MRR: $1,245,678
- Total Expansion MRR: $234,567
- Total Contraction MRR: $67,890
- Total Churn MRR: $456,789
- Net MRR Growth: $955,566

Current MRR (Month 24): $203,456
- Starting MRR (Month 1): $52,345
- MRR Growth: 288% over 24 months

Cohort Retention Analysis:
- 2023-01 cohort (24 months old): 54% retained
- 2023-06 cohort (18 months old): 62% retained
- 2024-01 cohort (12 months old): 67% retained
- 2024-06 cohort (6 months old): 78% retained

Data Integrity Checks:
✓ All foreign keys valid
✓ All dates within valid range
✓ All MRR values positive
✓ MRR movements balance correctly
✓ No orphaned subscriptions

Database saved to: data/day06_saas_metrics.db
File size: 248 KB
============================================================
```

### 7. Testing Requirements

After generation, the script should validate:
- All `customer_id` in `day06_subscriptions` exist in `day06_customers`
- All subscription dates are within customer lifetime
- MRR movements sum correctly for each month
- At least 50% of customers have only 1 subscription (stable customers)
- At least 15% of customers have 2+ subscriptions (lifecycle changes)
- Current MRR matches sum of active subscriptions
- Churn rate is within expected range (5-8% monthly)
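
Several of these checks reduce to single SQL queries. The sketch below covers a subset (orphaned subscriptions, date ordering, negative MRR, single-subscription share, and current MRR versus active subscriptions). The `sketch_validate` helper and its result keys are hypothetical, and it assumes dates are stored as ISO-8601 text so that string comparison matches date order.

```python
import sqlite3


def sketch_validate(conn: sqlite3.Connection) -> dict:
    """Run a handful of integrity checks and return the raw numbers."""

    def scalar(sql: str):
        return conn.execute(sql).fetchone()[0]

    total_customers = scalar("SELECT COUNT(*) FROM day06_customers")
    single_sub = scalar(
        "SELECT COUNT(*) FROM (SELECT customer_id FROM day06_subscriptions "
        "GROUP BY customer_id HAVING COUNT(*) = 1)"
    )
    return {
        # Subscriptions whose customer_id has no matching customer row.
        "orphaned_subscriptions": scalar(
            "SELECT COUNT(*) FROM day06_subscriptions s "
            "LEFT JOIN day06_customers c ON c.customer_id = s.customer_id "
            "WHERE c.customer_id IS NULL"
        ),
        # Closed subscriptions that do not end after they start.
        "bad_date_ranges": scalar(
            "SELECT COUNT(*) FROM day06_subscriptions "
            "WHERE end_date IS NOT NULL AND end_date <= start_date"
        ),
        "negative_mrr": scalar(
            "SELECT COUNT(*) FROM day06_subscriptions WHERE mrr < 0"
        ),
        "single_subscription_share": (
            single_sub / total_customers if total_customers else 0.0
        ),
        "active_subscription_mrr": scalar(
            "SELECT COALESCE(SUM(mrr), 0) FROM day06_subscriptions "
            "WHERE end_date IS NULL"
        ),
        "active_customer_mrr": scalar(
            "SELECT COALESCE(SUM(mrr_current), 0) FROM day06_customers "
            "WHERE status = 'active'"
        ),
    }
```

Returning the raw numbers instead of booleans lets the caller print them in the summary and apply the thresholds (for example, a single-subscription share of at least 0.5) separately.
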

### 8. Advanced Requirements

**Cohort-Based Data Generation:**
```python
def day06_generate_cohort_customers(cohort_month: datetime, num_customers: int) -> List[Dict]:
    """
    Generate customers for a specific signup cohort.

    Apply cohort-specific retention curves:
    - Older cohorts have higher cumulative churn
    - Newer cohorts have higher retention
    """
    # Implementation details...
```

**MRR Movement Calculation:**
```python
def day06_calculate_mrr_movements(month: datetime, subscriptions: List[Dict]) -> Dict:
    """
    Calculate New, Expansion, Contraction, and Churn MRR for a given month.

    Returns:
        {
            'new_mrr': float,
            'expansion_mrr': float,
            'contraction_mrr': float,
            'churn_mrr': float,
            'net_mrr': float
        }
    """
    # Implementation details...
```

**Customer ID Generation (Stripe-style):**
```python
def day06_generate_customer_id() -> str:
    """Generate Stripe-style customer ID: cus_[16-char random string]"""
    random_str = hashlib.md5(str(random.random()).encode()).hexdigest()[:16]
    return f"cus_{random_str}"

def day06_generate_subscription_id() -> str:
    """Generate Stripe-style subscription ID: sub_[16-char random string]"""
    random_str = hashlib.md5(str(random.random()).encode()).hexdigest()[:16]
    return f"sub_{random_str}"
```

## Constraints

- **Execution Time**: < 10 seconds
- **Database Size**: < 500 KB
- **Dependencies**: Python standard library only (`sqlite3` is part of it)
- **Reproducibility**: Use `random.seed(42)` for consistent results
- **No External Calls**: No API calls or external data sources
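
The reproducibility constraint can be illustrated with the ID scheme used in this prompt. The sketch below swaps the global `random.seed(42)` for a dedicated `random.Random(seed)` instance, which is a variant rather than what the spec prescribes; the helper name is hypothetical. Either way, the same seed yields the same IDs on every run.

```python
import hashlib
import random
from typing import List


def sketch_generate_ids(seed: int, count: int, prefix: str = "cus_") -> List[str]:
    """Generate `count` Stripe-style IDs deterministically from `seed`."""
    rng = random.Random(seed)  # isolated generator, independent of global state
    return [
        prefix + hashlib.md5(str(rng.random()).encode()).hexdigest()[:16]
        for _ in range(count)
    ]


# Same seed, same IDs.
assert sketch_generate_ids(42, 3) == sketch_generate_ids(42, 3)
```
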

## Deliverable

Complete Python script named `day06_DATA_synthetic_saas.py` that:
1. Generates 500 customers with realistic SaaS behavior
2. Creates subscription history with upgrades/downgrades/churn
3. Pre-calculates MRR movements for 24 months
4. Validates data integrity
5. Prints comprehensive summary statistics
6. Saves to `data/day06_saas_metrics.db`

This data will feed into 4 SQL views:
- `day06_mrr_summary` (MRR waterfall)
- `day06_churn_by_cohort` (churn analysis)
- `day06_retention_curves` (retention over time)
- `day06_customer_health` (LTV/CAC scoring)

---

**IMPORTANT**: This data should look realistic enough to demonstrate to Murilo how SaaS health metrics work in practice. Focus on creating meaningful patterns that will result in interesting dashboard visualizations on Day 19.

day06/data/day06_saas_metrics.db

228 KB
Binary file not shown.
