
codecarbon-data

Curated, versioned datasets for CPU power, GPU power, grid carbon intensity, cloud provider emissions, and embodied carbon — used by CodeCarbon to estimate computing CO2 emissions.

Datasets

Record counts reflect the most recent collection run. They will change as collectors are enabled or upstream sources update.

| File | Records | Sources | Description |
|---|---|---|---|
| data/hardware_gpu.csv | ~2,825 | TechPowerUp (dbgpu) | GPU TDP, memory, clocks, architecture |
| data/hardware_cpu.csv | ~2,177 | Intel ARK, AMD specs, ENERGY STAR | CPU TDP, cores, frequencies, cache |
| data/grid_emissions.csv | ~352 | Electricity Maps | gCO2/kWh per zone (country & sub-national) |
| data/cloud_emissions.csv | ~184 | CCF, AWS, Azure, GCP, Electricity Maps | PUE & carbon intensity per cloud region |
| data/embodied_carbon.csv | ~288 | Boavizta API | Embodied CO2 (kgCO2eq) for servers & components |
| data/region_to_em_zone.csv | static | Hand-curated | Maps US/CA/AU sub-national region names to Electricity Maps zone keys |

Each output CSV is described as a Frictionless Data Package resource in datapackage.json, with field types, primary keys, and required-field constraints.
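For programmatic access, the frictionless Python package can load the same descriptor that the validate command checks against. A minimal sketch (resource names and fields shown here are illustrative; datapackage.json defines the real ones):

# Sketch: inspect and validate the data package with frictionless-py.
from frictionless import Package

package = Package("datapackage.json")

# List declared resources, their CSV paths, and primary keys
for resource in package.resources:
    print(resource.name, resource.path, resource.schema.primary_key)

# Validate every CSV against its declared schema
# (the same idea as `codecarbon-data validate all`)
report = package.validate()
print("valid:", report.valid)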

Quick start

Requires Python 3.13+ and uv.

# Install dependencies
uv sync

# Some collectors need API tokens — see "Source credentials" below.
export ELECTRICITY_MAPS_TOKEN=...

# Collect all enabled datasets
uv run codecarbon-data collect all

# Validate output CSVs against the schema
uv run codecarbon-data validate all

CLI commands

codecarbon-data collect [cpu|gpu|grid|cloud|embodied|all]   # Run data collection pipeline
codecarbon-data validate [cpu|gpu|grid|cloud|embodied|all]  # Validate CSVs against datapackage.json
codecarbon-data sources                                     # List registered data sources and licenses
codecarbon-data publish                                     # Publish to HuggingFace (not yet implemented)

Use --log-level DEBUG on collect or validate for verbose output.

Configuration

config.yaml controls which collectors run. Each source has an enabled flag plus per-source settings (URLs, rate limits, tokens). A few collectors are disabled by default because they are slow, require credentials, or duplicate data already provided by an enabled source; set enabled: true and re-run to include them.

Any value in config.yaml can be overridden by an environment variable. The prefix is CODECARBON_HD_, and double underscores map to nested keys:

CODECARBON_HD_GPU__DBGPU__ENABLED=false uv run codecarbon-data collect gpu
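The mapping from an environment variable to a nested config key works roughly as sketched below; this is an illustration of the rule, not the project's actual config loader.

# Sketch of the override rule: strip the CODECARBON_HD_ prefix,
# split on "__", lowercase, and walk the nested config dict.
import os

def apply_env_overrides(config: dict, prefix: str = "CODECARBON_HD_") -> dict:
    for name, value in os.environ.items():
        if not name.startswith(prefix):
            continue
        # GPU__DBGPU__ENABLED -> ["gpu", "dbgpu", "enabled"]
        path = name[len(prefix):].lower().split("__")
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value  # a real loader would also coerce types ("false" -> False)
    return config

config = {"gpu": {"dbgpu": {"enabled": True}}}
os.environ["CODECARBON_HD_GPU__DBGPU__ENABLED"] = "false"
print(apply_env_overrides(config))  # {'gpu': {'dbgpu': {'enabled': 'false'}}}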

Source credentials

| Source | Credential | How to obtain |
|---|---|---|
| Electricity Maps | ELECTRICITY_MAPS_TOKEN | Free tier at portal.electricitymaps.com |
| ENERGY STAR | CODECARBON_HD_CPU__ENERGY_STAR__API_KEY | Optional. See data.energystar.gov |
| ENTSO-E | CODECARBON_HD_GRID__ENTSOE__API_KEY | Register at transparency.entsoe.eu, then email transparency@entsoe.eu with subject "Restful API access". The key appears under "My Account Settings" once approved (~3 business days). |

Data sources

| Source | Domain | License | Access method |
|---|---|---|---|
| dbgpu / TechPowerUp | GPU | MIT | Local package |
| Intel ARK | CPU | Proprietary (factual extraction) | HTTP API |
| AMD product specs | CPU | Proprietary (factual extraction) | Browser scrape (Playwright) |
| ENERGY STAR | CPU | US Public Domain | HTTP API |
| Electricity Maps | Grid | Proprietary (factual extraction) | HTTP API |
| Our World In Data — Energy | Grid | CC BY 4.0 | HTTP download |
| Ember Climate | Grid | CC BY 4.0 | HTTP API |
| EPA eGRID | Grid | US Public Domain | HTTP download |
| ENTSO-E | Grid | Open (ENTSO-E terms) | HTTP API |
| Cloud Carbon Footprint | Cloud | Apache 2.0 | HTTP download |
| Google Cloud region-carbon-info | Cloud | Apache 2.0 | HTTP download |
| AWS Sustainability | Cloud | Proprietary (factual extraction) | HTTP API + grid lookup |
| Microsoft Azure Sustainability | Cloud | Proprietary (factual extraction) | HTTP scrape + grid lookup |
| Boavizta API | Embodied | AGPL-3.0 | HTTP API |

sources.yaml is the machine-readable registry of all data sources. ATTRIBUTION.md is regenerated on every collect run from the sources that actually produced data.
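For illustration, reading the registry might look like the snippet below; the field names (name, license, url) are assumptions, and sources.yaml defines the real schema.

# Hypothetical read of the source registry; field names are assumptions.
import yaml

with open("sources.yaml") as f:
    registry = yaml.safe_load(f)

for source in registry.get("sources", []):
    print(f"{source.get('name')}: {source.get('license')} ({source.get('url')})")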

Cloud emissions strategy

Cloud carbon intensity is collected from multiple overlapping sources and reconciled in src/collectors/cloud/merge.py:

  • Cloud Carbon Footprint (CCF) provides per-region PUE and carbon intensity for AWS, GCP, and Azure from their open-source TypeScript constants.
  • GCP publishes a yearly CSV at GoogleCloudPlatform/region-carbon-info with grid carbon intensity per region.
  • AWS regions are fetched from the AWS infrastructure JSON endpoint, then mapped to country codes and looked up against data/grid_emissions.csv.
  • Azure regions use a static mapping plus PUE values scraped from Microsoft's datacenter efficiency page, with carbon intensity from data/grid_emissions.csv.
  • Electricity Maps enrichment maps each known cloud region to its Electricity Maps zone (CLOUD_REGION_TO_EM_ZONE in src/collectors/cloud/electricity_maps_cloud.py) and joins on data/grid_emissions.csv to populate em_carbon_intensity_gco2_kwh and em_zone_key. This requires the grid pipeline to have run first.

The merge policy treats provider-specific sources (AWS, GCP, Azure) as authoritative for their regions; CCF and EM enrichment fill in missing fields (PUE, country code, EM zone).
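The gap-filling behaviour can be pictured as a small precedence merge. The snippet below is a sketch of that idea, not the actual merge.py; field names other than em_zone_key and em_carbon_intensity_gco2_kwh are assumptions.

# Sketch of the reconciliation idea: provider rows win, CCF and
# Electricity Maps enrichment only fill fields that are still empty.
def merge_region(provider_row: dict, ccf_row: dict, em_row: dict) -> dict:
    merged = dict(provider_row)  # authoritative baseline for this region
    for fallback in (ccf_row, em_row):
        for key, value in fallback.items():
            if merged.get(key) in (None, ""):  # fill gaps only, never overwrite
                merged[key] = value
    return merged

aws = {"region": "eu-west-1", "country_code": "IE", "pue": None}
ccf = {"region": "eu-west-1", "pue": 1.135}
em = {"region": "eu-west-1", "em_zone_key": "IE",
      "em_carbon_intensity_gco2_kwh": 290.0}
print(merge_region(aws, ccf, em))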

Schema

Output CSVs follow the Frictionless Data Package specification. See datapackage.json for field definitions, types, and primary keys.

Contributing

See CONTRIBUTING.md for development setup and ADDING_SOURCES.md for the step-by-step guide to adding a new data source.

Known gaps

  • AMD CPU specs — the collector at src/collectors/cpu/amd_specs.py uses Playwright to scrape AMD's site. The site occasionally returns HTTP/2 protocol errors under automated access. Contributions for a non-browser alternative are welcome.

License

The aggregated database is released under ODbL-1.0. Individual upstream source licenses are listed in ATTRIBUTION.md and sources.yaml.

Related

  • CodeCarbon — the Python package that uses these datasets to estimate computing CO2 emissions.
