[Feature Request] Add vLLM CPU inference image for SageMaker #5809
What
Requesting a CPU-only vLLM inference container for SageMaker, similar to the existing GPU-based vLLM images.
Why
This would enable running vLLM on CPU instances for cost-effective workloads that don't require GPU acceleration, such as:
- Reranking
- Scoring
- Embeddings
- Small generative models
Reference Implementation
I previously submitted PR #5670, which contains a working implementation (Dockerfile + buildspec) validated on EC2 (c5.4xlarge):
- Successful image build (~3.5 GB)
- Health endpoint returning 200
- /v1/completions working with facebook/opt-125m
- Reranker endpoint working with an Alibaba-NLP model
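For reference, the validation above can be reproduced with a smoke test like the following sketch. It assumes the container is already running and listening on port 8080 (the SageMaker serving convention) and that the health endpoint is /ping (the SageMaker contract); the completion endpoint follows vLLM's OpenAI-compatible API.

```shell
# Health check: expect HTTP 200 from the serving container (assumed port 8080).
curl -sf http://localhost:8080/ping

# Completion request against the model used in the PR's validation.
# Payload fields follow vLLM's OpenAI-compatible /v1/completions API.
curl -s http://localhost:8080/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 16}'
```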
Key Implementation Details from PR
- vLLM v0.15.1 with CPU target build
- Python 3.12, Ubuntu 22.04
- tcmalloc + Intel OpenMP for performance
- Reuses existing SageMaker entrypoint script
- Tag format: 0.15.1-cpu-py312-ubuntu22.04-sagemaker
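A minimal sketch of the build-time settings the list above implies. The env var names below are vLLM's documented CPU-build switch and a standard tcmalloc preload; the exact library path and package are assumptions here, not taken verbatim from PR #5670.

```shell
# Select vLLM's CPU backend at build time (documented vLLM build env var).
export VLLM_TARGET_DEVICE=cpu

# Preload tcmalloc for allocator performance; path assumes Ubuntu's
# libgoogle-perftools package layout (an assumption, adjust to the image).
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4

# Intel OpenMP is typically pulled in via pip; thread pinning is commonly
# tuned with OMP_NUM_THREADS on CPU instances (illustrative value).
export OMP_NUM_THREADS=16
```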
I understand external contributions are not accepted, so I'm filing this as a feature request for the team to consider. The PR can serve as a reference implementation.