
[Feature Request] Add vLLM CPU inference image for SageMaker #5809

@timelfrink

Description

What

Requesting a CPU-only vLLM inference container for SageMaker, similar to the existing GPU-based vLLM images.

Why

This would enable running vLLM on CPU instances for cost-effective workloads that don't require GPU acceleration, such as:

  • Reranking
  • Scoring
  • Embeddings
  • Small generative models

Reference Implementation

I previously submitted PR #5670, which contains a working implementation (Dockerfile + buildspec) validated on EC2 (c5.4xlarge):

  • Successful image build (~3.5 GB)
  • Health endpoint returning 200
  • /v1/completions working with facebook/opt-125m
  • Reranker endpoint working with Alibaba-NLP model
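As a rough sketch, the validation steps above could be reproduced against a locally running container with requests like the following. The image name, host port, and endpoint paths are assumptions based on vLLM's OpenAI-compatible server, not details taken from the PR; the reranker model ID is likewise omitted since the issue doesn't name it in full.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test mirroring the validation described above.
# Assumes a CPU image tagged as in this issue is running locally on port 8080.
set -euo pipefail

ENDPOINT="http://localhost:8080"

# Health endpoint should return HTTP 200 (curl -f fails on non-2xx)
curl -sf "${ENDPOINT}/health" > /dev/null && echo "health: OK"

# Completions against the small test model from the validation run
curl -sf "${ENDPOINT}/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m",
       "prompt": "Hello, my name is",
       "max_tokens": 16}'
```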

Key Implementation Details from PR

  • vLLM v0.15.1 with CPU target build
  • Python 3.12, Ubuntu 22.04
  • tcmalloc + Intel OpenMP for performance
  • Reuses existing SageMaker entrypoint script
  • Tag format: 0.15.1-cpu-py312-ubuntu22.04-sagemaker
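A minimal sketch of what a Dockerfile along these lines might look like, assembled from the bullet points above. This is illustrative only, not the contents of PR #5670: the entrypoint path, package selection, and build commands are assumptions, though `VLLM_TARGET_DEVICE=cpu` is vLLM's documented mechanism for CPU-targeted builds.

```dockerfile
# Illustrative sketch only; not the actual Dockerfile from PR #5670.
FROM ubuntu:22.04

# Python 3.12 plus build tooling for a from-source CPU build of vLLM
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.12 python3.12-venv python3-pip git build-essential \
        libtcmalloc-minimal4 \
    && rm -rf /var/lib/apt/lists/*

# tcmalloc + Intel OpenMP for CPU performance, as noted above
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4
RUN pip install intel-openmp

# Build vLLM v0.15.1 against the CPU target
ENV VLLM_TARGET_DEVICE=cpu
RUN git clone --branch v0.15.1 https://github.com/vllm-project/vllm.git /opt/vllm \
    && pip install -v /opt/vllm

# Reuse the existing SageMaker entrypoint script (path is an assumption)
COPY sagemaker-entrypoint.sh /usr/local/bin/
ENTRYPOINT ["/usr/local/bin/sagemaker-entrypoint.sh"]
```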

I understand that external contributions are not currently accepted, so I'm filing this as a feature request for the team to consider. The PR can serve as a reference for the implementation.
