
codecarbon-data

Curated, versioned datasets for CPU power, GPU power, grid carbon intensity, cloud provider emissions, and embodied carbon — used by CodeCarbon to estimate computing CO2 emissions.

Datasets

Record counts reflect the most recent collection run. They will change as collectors are enabled or upstream sources update.

| File | Records | Sources | Description |
|---|---|---|---|
| data/hardware_gpu.csv | ~2,825 | TechPowerUp (dbgpu) | GPU TDP, memory, clocks, architecture |
| data/hardware_cpu.csv | ~2,177 | Intel ARK, AMD specs, ENERGY STAR | CPU TDP, cores, frequencies, cache |
| data/grid_emissions.csv | ~352 | Electricity Maps | gCO2/kWh per zone (country & sub-national) |
| data/cloud_emissions.csv | ~184 | CCF, AWS, Azure, GCP, Electricity Maps | PUE & carbon intensity per cloud region |
| data/embodied_carbon.csv | ~288 | Boavizta API | Embodied CO2 (kgCO2eq) for servers & components |
| data/region_to_em_zone.csv | static | Hand-curated | Maps US/CA/AU sub-national region names to Electricity Maps zone keys |

Each output CSV is described as a Frictionless Data Package resource in datapackage.json, with field types, primary keys, and required-field constraints.
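For programmatic access, the frictionless Python package can load the same descriptor that the validate command checks against. A minimal sketch (resource names and fields shown here are illustrative; datapackage.json defines the real ones):

# Sketch: inspect and validate the data package with frictionless-py.
from frictionless import Package

package = Package("datapackage.json")

# List declared resources, their CSV paths, and primary keys
for resource in package.resources:
    print(resource.name, resource.path, resource.schema.primary_key)

# Validate every CSV against its declared schema
# (the same idea as `codecarbon-data validate all`)
report = package.validate()
print("valid:", report.valid)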

Quick start

Requires Python 3.13+ and uv.

# Install dependencies
uv sync

# Some collectors need API tokens — see "Source credentials" below.
export ELECTRICITY_MAPS_TOKEN=...

# Collect all enabled datasets
uv run codecarbon-data collect all

# Validate output CSVs against the schema
uv run codecarbon-data validate all

CLI commands

codecarbon-data collect [cpu|gpu|grid|cloud|embodied|all]   # Run data collection pipeline
codecarbon-data validate [cpu|gpu|grid|cloud|embodied|all]  # Validate CSVs against datapackage.json
codecarbon-data sources                                     # List registered data sources and licenses
codecarbon-data publish                                     # Publish to HuggingFace (not yet implemented)

Use --log-level DEBUG on collect or validate for verbose output.

Configuration

config.yaml controls which collectors run. Each source has an enabled flag plus per-source settings (URLs, rate limits, tokens). A few collectors are disabled by default because they are slow, require credentials, or duplicate data already provided by an enabled source; set enabled: true and re-run to include them.

Any value in config.yaml can be overridden by an environment variable. The prefix is CODECARBON_HD_, and double underscores map to nested keys:

CODECARBON_HD_GPU__DBGPU__ENABLED=false uv run codecarbon-data collect gpu
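The mapping from an environment variable to a nested config key works roughly as sketched below; this is an illustration of the rule, not the project's actual config loader.

# Sketch of the override rule: strip the CODECARBON_HD_ prefix,
# split on "__", lowercase, and walk the nested config dict.
import os

def apply_env_overrides(config: dict, prefix: str = "CODECARBON_HD_") -> dict:
    for name, value in os.environ.items():
        if not name.startswith(prefix):
            continue
        # GPU__DBGPU__ENABLED -> ["gpu", "dbgpu", "enabled"]
        path = name[len(prefix):].lower().split("__")
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value  # a real loader would also coerce types ("false" -> False)
    return config

config = {"gpu": {"dbgpu": {"enabled": True}}}
os.environ["CODECARBON_HD_GPU__DBGPU__ENABLED"] = "false"
print(apply_env_overrides(config))  # {'gpu': {'dbgpu': {'enabled': 'false'}}}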

Source credentials

| Source | Credential | How to obtain |
|---|---|---|
| Electricity Maps | ELECTRICITY_MAPS_TOKEN | Free tier at portal.electricitymaps.com |
| ENERGY STAR | CODECARBON_HD_CPU__ENERGY_STAR__API_KEY | Optional. See data.energystar.gov |
| ENTSO-E | CODECARBON_HD_GRID__ENTSOE__API_KEY | Register at transparency.entsoe.eu, then email transparency@entsoe.eu with subject "Restful API access". The key appears under "My Account Settings" once approved (~3 business days). |

Data sources

| Source | Domain | License | Access method |
|---|---|---|---|
| dbgpu / TechPowerUp | GPU | MIT | Local package |
| Intel ARK | CPU | Proprietary (factual extraction) | HTTP API |
| AMD product specs | CPU | Proprietary (factual extraction) | Browser scrape (Playwright) |
| ENERGY STAR | CPU | US Public Domain | HTTP API |
| Electricity Maps | Grid | Proprietary (factual extraction) | HTTP API |
| Our World In Data — Energy | Grid | CC BY 4.0 | HTTP download |
| Ember Climate | Grid | CC BY 4.0 | HTTP API |
| EPA eGRID | Grid | US Public Domain | HTTP download |
| ENTSO-E | Grid | Open (ENTSO-E terms) | HTTP API |
| Cloud Carbon Footprint | Cloud | Apache 2.0 | HTTP download |
| Google Cloud region-carbon-info | Cloud | Apache 2.0 | HTTP download |
| AWS Sustainability | Cloud | Proprietary (factual extraction) | HTTP API + grid lookup |
| Microsoft Azure Sustainability | Cloud | Proprietary (factual extraction) | HTTP scrape + grid lookup |
| Boavizta API | Embodied | AGPL-3.0 | HTTP API |

sources.yaml is the machine-readable registry of all data sources. ATTRIBUTION.md is regenerated on every collect run from the sources that actually produced data.
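For illustration, reading the registry might look like the snippet below; the field names (name, license, url) are assumptions, and sources.yaml defines the real schema.

# Hypothetical read of the source registry; field names are assumptions.
import yaml

with open("sources.yaml") as f:
    registry = yaml.safe_load(f)

for source in registry.get("sources", []):
    print(f"{source.get('name')}: {source.get('license')} ({source.get('url')})")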

Cloud emissions strategy

Cloud carbon intensity is collected from multiple overlapping sources and reconciled in src/collectors/cloud/merge.py:

  • Cloud Carbon Footprint (CCF) provides per-region PUE and carbon intensity for AWS, GCP, and Azure from their open-source TypeScript constants.
  • GCP publishes a yearly CSV at GoogleCloudPlatform/region-carbon-info with grid carbon intensity per region.
  • AWS regions are fetched from the AWS infrastructure JSON endpoint, then mapped to country codes and looked up against data/grid_emissions.csv.
  • Azure regions use a static mapping plus PUE values scraped from Microsoft's datacenter efficiency page, with carbon intensity from data/grid_emissions.csv.
  • Electricity Maps enrichment maps each known cloud region to its Electricity Maps zone (CLOUD_REGION_TO_EM_ZONE in src/collectors/cloud/electricity_maps_cloud.py) and joins on data/grid_emissions.csv to populate em_carbon_intensity_gco2_kwh and em_zone_key. This requires the grid pipeline to have run first.

The merge policy treats provider-specific sources (AWS, GCP, Azure) as authoritative for their regions; CCF and EM enrichment fill in missing fields (PUE, country code, EM zone).
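The gap-filling behaviour can be pictured as a small precedence merge. The snippet below is a sketch of that idea, not the actual merge.py; field names other than em_zone_key and em_carbon_intensity_gco2_kwh are assumptions.

# Sketch of the reconciliation idea: provider rows win, CCF and
# Electricity Maps enrichment only fill fields that are still empty.
def merge_region(provider_row: dict, ccf_row: dict, em_row: dict) -> dict:
    merged = dict(provider_row)  # authoritative baseline for this region
    for fallback in (ccf_row, em_row):
        for key, value in fallback.items():
            if merged.get(key) in (None, ""):  # fill gaps only, never overwrite
                merged[key] = value
    return merged

aws = {"region": "eu-west-1", "country_code": "IE", "pue": None}
ccf = {"region": "eu-west-1", "pue": 1.135}
em = {"region": "eu-west-1", "em_zone_key": "IE",
      "em_carbon_intensity_gco2_kwh": 290.0}
print(merge_region(aws, ccf, em))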

Schema

Output CSVs follow the Frictionless Data Package specification. See datapackage.json for field definitions, types, and primary keys.

Contributing

See CONTRIBUTING.md for development setup and ADDING_SOURCES.md for the step-by-step guide to adding a new data source.

Known gaps

  • AMD CPU specs — the collector at src/collectors/cpu/amd_specs.py uses Playwright to scrape AMD's site. The site occasionally returns HTTP/2 protocol errors under automated access. Contributions for a non-browser alternative are welcome.

License

The aggregated database is released under ODbL-1.0. Individual upstream source licenses are listed in ATTRIBUTION.md and sources.yaml.

Related

  • CodeCarbon — the Python package that uses these datasets to estimate computing CO2 emissions.
