Curated, versioned datasets for CPU power, GPU power, grid carbon intensity, cloud provider emissions, and embodied carbon — used by CodeCarbon to estimate computing CO2 emissions.
Record counts reflect the most recent collection run. They will change as collectors are enabled or upstream sources update.
| File | Records | Sources | Description |
|---|---|---|---|
| `data/hardware_gpu.csv` | ~2,825 | TechPowerUp (dbgpu) | GPU TDP, memory, clocks, architecture |
| `data/hardware_cpu.csv` | ~2,177 | Intel ARK, AMD specs, ENERGY STAR | CPU TDP, cores, frequencies, cache |
| `data/grid_emissions.csv` | ~352 | Electricity Maps | gCO2/kWh per zone (country & sub-national) |
| `data/cloud_emissions.csv` | ~184 | CCF, AWS, Azure, GCP, Electricity Maps | PUE & carbon intensity per cloud region |
| `data/embodied_carbon.csv` | ~288 | Boavizta API | Embodied CO2 (kgCO2eq) for servers & components |
| `data/region_to_em_zone.csv` | static | Hand-curated | Maps US/CA/AU sub-national region names to Electricity Maps zone keys |
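
For a quick look at the data, any of these CSVs can be loaded directly. The sketch below uses pandas; the column names (`name`, `tdp_w`) are only illustrative, so check `datapackage.json` for the real field names.

```python
import pandas as pd

# Illustrative only: the column names (name, tdp_w) are assumptions;
# see datapackage.json for the actual schema of each CSV.
gpus = pd.read_csv("data/hardware_gpu.csv")
match = gpus[gpus["name"].str.contains("RTX 3090", case=False, na=False)]
print(match[["name", "tdp_w"]])
```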
Each output CSV is described as a Frictionless Data
Package resource in datapackage.json,
with field types, primary keys, and required-field constraints.
Requires Python 3.13+ and uv.
```bash
# Install dependencies
uv sync

# Some collectors need API tokens — see "Source credentials" below.
export ELECTRICITY_MAPS_TOKEN=...

# Collect all enabled datasets
uv run codecarbon-data collect all

# Validate output CSVs against the schema
uv run codecarbon-data validate all
```

```
codecarbon-data collect [cpu|gpu|grid|cloud|embodied|all]    # Run data collection pipeline
codecarbon-data validate [cpu|gpu|grid|cloud|embodied|all]   # Validate CSVs against datapackage.json
codecarbon-data sources                                      # List registered data sources and licenses
codecarbon-data publish                                      # Publish to HuggingFace (not yet implemented)
```
Use `--log-level DEBUG` on `collect` or `validate` for verbose output.
`config.yaml` controls which collectors run. Each source has an `enabled` flag
plus per-source settings (URLs, rate limits, tokens). A few collectors are
disabled by default because they are slow, require credentials, or duplicate
data already provided by an enabled source — flip `enabled: true` and re-run.
Any value in `config.yaml` can be overridden by an environment variable. The
prefix is `CODECARBON_HD_`, and double underscores map to nested keys:
```bash
CODECARBON_HD_GPU__DBGPU__ENABLED=false uv run codecarbon-data collect gpu
```
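
As a rough sketch of how that mapping works (illustrative only, not the actual config loader), stripping the `CODECARBON_HD_` prefix and splitting on `__` yields a nested key path:

```python
import os

PREFIX = "CODECARBON_HD_"

def overrides_from_env(environ=os.environ) -> dict[tuple[str, ...], str]:
    """Illustrative sketch, not the real loader: map
    CODECARBON_HD_GPU__DBGPU__ENABLED=false to ('gpu', 'dbgpu', 'enabled')."""
    overrides = {}
    for key, value in environ.items():
        if key.startswith(PREFIX):
            path = tuple(part.lower() for part in key[len(PREFIX):].split("__"))
            overrides[path] = value
    return overrides
```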
| Source | Credential | How to obtain |
|---|---|---|
| Electricity Maps | `ELECTRICITY_MAPS_TOKEN` | Free tier at portal.electricitymaps.com |
| ENERGY STAR | `CODECARBON_HD_CPU__ENERGY_STAR__API_KEY` | Optional. See data.energystar.gov |
| ENTSO-E | `CODECARBON_HD_GRID__ENTSOE__API_KEY` | Register at transparency.entsoe.eu, then email transparency@entsoe.eu with subject "Restful API access". The key appears under "My Account Settings" once approved (~3 business days). |
| Source | Domain | License | Access method |
|---|---|---|---|
| dbgpu / TechPowerUp | GPU | MIT | Local package |
| Intel ARK | CPU | Proprietary (factual extraction) | HTTP API |
| AMD product specs | CPU | Proprietary (factual extraction) | Browser scrape (Playwright) |
| ENERGY STAR | CPU | US Public Domain | HTTP API |
| Electricity Maps | Grid | Proprietary (factual extraction) | HTTP API |
| Our World In Data — Energy | Grid | CC BY 4.0 | HTTP download |
| Ember Climate | Grid | CC BY 4.0 | HTTP API |
| EPA eGRID | Grid | US Public Domain | HTTP download |
| ENTSO-E | Grid | Open (ENTSO-E terms) | HTTP API |
| Cloud Carbon Footprint | Cloud | Apache 2.0 | HTTP download |
| Google Cloud region-carbon-info | Cloud | Apache 2.0 | HTTP download |
| AWS Sustainability | Cloud | Proprietary (factual extraction) | HTTP API + grid lookup |
| Microsoft Azure Sustainability | Cloud | Proprietary (factual extraction) | HTTP scrape + grid lookup |
| Boavizta API | Embodied | AGPL-3.0 | HTTP API |
`sources.yaml` is the machine-readable registry. `ATTRIBUTION.md` is
auto-generated at every `collect` run from the sources that produced data.
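
To read the registry programmatically instead of running `codecarbon-data sources`, a minimal sketch might look like the following; the top-level `sources` key and the `name`/`license` fields are assumptions here, so check `sources.yaml` for the actual layout.

```python
import yaml

# Assumed layout: a top-level "sources" list with "name" and "license" keys.
# Check sources.yaml itself for the real structure.
with open("sources.yaml") as f:
    registry = yaml.safe_load(f)

for entry in registry.get("sources", []):
    print(f"{entry.get('name')}: {entry.get('license')}")
```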
Cloud carbon intensity is collected from multiple overlapping sources and
reconciled in `src/collectors/cloud/merge.py`:
- Cloud Carbon Footprint (CCF) provides per-region PUE and carbon intensity for AWS, GCP, and Azure from their open-source TypeScript constants.
- GCP publishes a yearly CSV at `GoogleCloudPlatform/region-carbon-info` with grid carbon intensity per region.
- AWS regions are fetched from the AWS infrastructure JSON endpoint, then mapped to country codes and looked up against `data/grid_emissions.csv`.
- Azure regions use a static mapping plus PUE values scraped from Microsoft's datacenter efficiency page, with carbon intensity from `data/grid_emissions.csv`.
- Electricity Maps enrichment maps each known cloud region to its Electricity Maps zone (`CLOUD_REGION_TO_EM_ZONE` in `src/collectors/cloud/electricity_maps_cloud.py`) and joins on `data/grid_emissions.csv` to populate `em_carbon_intensity_gco2_kwh` and `em_zone_key`. This requires the grid pipeline to have run first.
The merge policy treats provider-specific sources (AWS, GCP, Azure) as authoritative for their regions; CCF and EM enrichment fill in missing fields (PUE, country code, EM zone).
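
A simplified sketch of that precedence rule (field names and record shapes are illustrative; the real logic lives in `src/collectors/cloud/merge.py`):

```python
# Simplified sketch of the merge policy, not the actual merge.py implementation.
PROVIDER_SOURCES = {"aws", "gcp", "azure"}   # authoritative for their own regions
FILL_SOURCES = ["ccf", "electricity_maps"]   # only fill fields that are still empty

def merge_region(records: list[dict]) -> dict:
    """records: one dict per source describing the same (provider, region)."""
    by_source = {r["source"]: r for r in records}
    # Start from the provider-specific record when one exists.
    merged = next((dict(r) for s, r in by_source.items() if s in PROVIDER_SOURCES), {})
    # CCF and Electricity Maps enrichment fill in anything still missing
    # (PUE, country code, EM zone).
    for source in FILL_SOURCES:
        for key, value in by_source.get(source, {}).items():
            if merged.get(key) in (None, ""):
                merged[key] = value
    return merged
```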
Output CSVs follow the Frictionless Data Package
specification. See datapackage.json for field definitions, types, and
primary keys.
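
The `validate` subcommand covers this already, but if you want to run the check directly in Python, the `frictionless` package (assuming it is installed) can validate the whole data package:

```python
from frictionless import validate

# Validate every resource declared in datapackage.json against its schema.
report = validate("datapackage.json")
print("valid" if report.valid else report)
```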
See CONTRIBUTING.md for development setup and ADDING_SOURCES.md for the step-by-step guide to adding a new data source.
- AMD CPU specs — the collector at `src/collectors/cpu/amd_specs.py` uses Playwright to scrape AMD's site. The site occasionally returns HTTP/2 protocol errors under automated access. Contributions for a non-browser alternative are welcome.
The aggregated database is released under
ODbL-1.0. Individual
upstream source licenses are listed in ATTRIBUTION.md and sources.yaml.
- CodeCarbon — the Python package that uses these datasets to estimate computing CO2 emissions.