
TechCircle Insight: How LCM Makes Stable Diffusion Viable on AWS Lambda


Artificial intelligence is evolving rapidly, not just in model design but in how models are deployed and scaled. As GPU access remains limited, engineers are exploring creative, cost-efficient ways to run AI workloads. In this piece, TechCircle speaks with Pranav Prabhakar, Engineering Manager at ManyPets, about deploying a Latent Consistency Model (LCM)-tuned Stable Diffusion on AWS Lambda without GPUs. Pranav shares how he optimized inference, managed a multi-gigabyte model within Lambda's limits, and balanced latency, scalability, and cost, offering practical lessons for builders pushing the limits of serverless ML.

1. Introduction

Imagine generating your own icons from text quickly, cheaply, and without expensive GPUs. This challenge reflects a growing reality for developers exploring generative AI on limited infrastructure. Pranav shows how Stable Diffusion with Latent Consistency Models can run entirely on AWS Lambda. He explains how he built a lightweight, on-demand API as a proof of concept for low-frequency use, and highlights how careful optimization of memory, cold starts, and execution limits pushes the limits of serverless, GPU-free inference.


2. Why LCM (Latent Consistency Models)?

The cost of standard Stable Diffusion

Stable Diffusion is a latent diffusion model. It works by iteratively denoising a latent noise vector into a coherent image. This usually requires 20 to 50 forward passes through a large UNet model, and each pass is computationally expensive. On a GPU, these steps are fast and manageable. However, in CPU-only environments, such as AWS Lambda, they become a bottleneck.


To give a sense of scale: even generating a 256x256 image with just 15 steps can take over a minute on Lambda with six vCPUs. Multiply that by higher resolutions or more steps, and the process quickly becomes infeasible.

What LCM changes

Latent Consistency Models (LCMs) address the problem of step count: they are trained to predict the final denoised latent in just a few steps instead of the usual 50.


LCMs learn to approximate the final result directly, rather than simulating the complete step-by-step diffusion process. They can be viewed as distillations of the original diffusion model, optimised for speed over perfect fidelity.

Why does it make a difference for Lambda?

The key is that with only 2-4 steps, LCMs significantly reduce inference time while maintaining high image quality. The approach enables running Stable Diffusion on a CPU in under a minute. In practice, using LCM reduced Pranav's inference time to ~40 seconds for a 256x256 image, without requiring a GPU.
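To make this concrete, here is a minimal sketch of a low-step LCM run on CPU with the Hugging Face diffusers library. The checkpoint named below is one publicly available LCM-distilled Stable Diffusion 1.5 model, used purely for illustration; the article does not specify the exact model Pranav used.

```python
# Minimal sketch: 4-step LCM inference on CPU with diffusers.
# The checkpoint is a public LCM-distilled SD 1.5 model, shown for illustration only.
import torch
from diffusers import DiffusionPipeline, LCMScheduler

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    torch_dtype=torch.float32,   # CPU inference runs in float32
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.to("cpu")

image = pipe(
    prompt="a flat minimalist icon of a rocket ship",
    num_inference_steps=4,       # 2-4 steps instead of the usual 20-50
    guidance_scale=1.0,          # LCMs need little classifier-free guidance; tune as required
    height=256,
    width=256,
).images[0]
image.save("icon.png")
```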


In a serverless world, this number is close enough to feel interactive. LCM changes a GPU-requiring model into one that is usable on commodity CPUs. This experiment shows that certain generative AI workloads, like icon generation with Stable Diffusion, can be made practical on serverless CPU infrastructure.

3. Optimising Stable Diffusion for Lambda

Selecting the right model was key. Pranav suggests using Stable Diffusion 1.5 for its balance of quality, size, and community support. He further explains how he used an LCM-distilled version, designed for low-step inference, which cut runtime without sacrificing output quality. While newer models like SDXL offer higher fidelity, their larger size and heavier resource demands make them unsuitable for Lambda’s memory and CPU limits.


Quantisation and weight optimisation (optional) 

Pranav did not apply additional compression in this setup, but he notes that techniques like int8 quantization, model pruning, or using OpenVINO could further reduce inference time. Tools such as torch.compile() in PyTorch 2.x can also optimize execution graphs for CPU-bound workloads. If inference needs to drop below 30 seconds or memory limits become an issue, these strategies are worth exploring.
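None of these optimisations were used in the setup described here, but as a rough, hedged sketch of where one might start, applied to an already-loaded diffusers pipeline (the function name and choices are illustrative, and the quality impact would need validation):

```python
# Sketch only: optional CPU-side optimisations for a loaded Stable Diffusion pipeline.
import torch
from torch.ao.quantization import quantize_dynamic

def optimise_for_cpu(pipe):
    # PyTorch 2.x: compile the UNet's execution graph for CPU-bound inference.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
    # Dynamic int8 quantisation of Linear layers in the text encoder is a low-risk
    # starting point; quantising the UNet itself needs careful quality checks.
    pipe.text_encoder = quantize_dynamic(
        pipe.text_encoder, {torch.nn.Linear}, dtype=torch.qint8
    )
    return pipe
```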

Reducing cold start time


Cold starts can be challenging, especially with a 7–8 GB container image. He mitigated them with the following steps:

Minimal dependencies: Installed only what was necessary (torch, diffusers, transformers).
Pre-loaded model in the global scope, so it initialises only once per container lifecycle (see the sketch below).
Avoided EFS or S3 at runtime: The model was bundled directly into the container image under /opt.
Lean Docker layers: Used multi-stage builds and cleaned pip cache to minimise image size.

Despite these efforts, cold starts still take ~60–90 seconds. For critical use cases, provisioned concurrency could help keep warm containers alive.
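A bare-bones version of the global-scope preload pattern might look like the sketch below. The /opt/sd_model path comes from the article; the module, handler, and default prompt are illustrative.

```python
# lambda_handler.py (sketch): preload the model once per container, at module scope.
import torch
from diffusers import DiffusionPipeline

# Module scope: runs during the container's init phase, not on every invocation,
# so warm requests reuse the already-loaded pipeline.
PIPE = DiffusionPipeline.from_pretrained("/opt/sd_model", torch_dtype=torch.float32)
PIPE.to("cpu")

def handler(event, context):
    # Warm path: inference only, no model load (full handler sketch in section 5).
    prompt = event.get("prompt", "a flat minimalist icon")
    image = PIPE(prompt, num_inference_steps=4, height=256, width=256).images[0]
    return {"statusCode": 200}
```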

Managing model size under Lambda limits

Lambda has a 10 GB limit for container images, and Pranav pushed it to the edge. To stay within it, he:

Bundled the entire model inside the container (no external fetch required).
Avoided EFS to keep things simple and reduce latency.
Used --strip and caching cleanup in Docker builds to reduce final image size.

If more space were needed, Pranav notes that heavy dependencies like PyTorch or Transformers could be moved into Lambda Layers, which also allow reuse across multiple functions.

This careful packaging ensured the model loaded quickly after cold starts and stayed memory-resident during warm executions.

4. Infrastructure setup

Why AWS Lambda?

Choosing the proper infrastructure for inference workloads is a trade-off between cost, scalability, latency, and complexity. He chose AWS Lambda for the following reasons:

Zero idle cost: Lambda charges only for actual compute time, making it ideal for sporadic or low-frequency image generation workloads.
Auto-scaling: Lambda automatically handles concurrent invocations without requiring infrastructure management.
Simplicity: There is no server provisioning, patching, or scaling logic to worry about. It was easy to deploy, iterate, and monitor.

He considered alternatives:

EC2: Offers GPUs and full control, but incurs idle costs and operational overhead.
Fargate: More flexible than Lambda and suitable for containerised workloads, but not ideal for bursty, short-lived tasks.
Modal: An excellent option for serverless GPU inference, but in this case, Pranav wanted to explore how far CPU-based inference could be pushed within a traditional cloud provider setup.

Container-based Lambda

He used a custom container image (based on Amazon Linux + Python) with:

Pre-installed dependencies (torch, diffusers)
The full LCM model
Inference code with minimal I/O

Loading models from EFS or caching strategies

Pranav skipped EFS, opting to package the model directly in the container to cut complexity and latency. This approach offered:

No network-mounted storage latency
Simplified permission setup
Guaranteed availability at cold start

The model was located under /opt/sd_model in the image and was loaded into memory during the container's init phase. This process ensured warm invocations did not incur load times.

If the model were larger than 10 GB or needed to be shared across multiple Lambda functions, EFS would be a viable fallback.

Model access

Instead of EFS or S3, the model is embedded directly in the image under /opt. The approach simplifies access and avoids I/O overhead during startup.

Dealing with 15-minute timeout and 10 GB memory constraints

Lambda’s max limits - 10 GB memory and 15-minute timeout - set hard constraints that required careful design:

Memory: At 10 GB, Lambda allocates roughly six vCPUs, which is crucial for running PyTorch inference efficiently on CPU.
Timeout: His longest warm invocation took ~40 seconds, leaving a substantial buffer.
Image size: The Docker container image (including model and dependencies) stayed under the 10 GB limit, avoiding the need for EFS.

The key was to preload the model during container startup and minimise runtime dependencies to reduce the need for fetches or dynamic installs during execution. If these limits were ever exceeded, he would have explored model quantisation, split functions, or migration to GPU-backed services.

Runtime limits

The function uses 10GB of memory (the maximum Lambda offers), unlocking six vCPUs. The timeout is set to 15 minutes, although inference takes ~40 seconds.
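For reference, this is roughly how those limits could be set when creating a container-based function with boto3. The function name, image URI, and IAM role below are placeholders, not values from the article.

```python
# Illustrative sketch: creating the container-based Lambda function with maximum
# memory and a generous timeout. All identifiers are placeholders.
import boto3

client = boto3.client("lambda")
client.create_function(
    FunctionName="sd-lcm-icon-generator",
    PackageType="Image",
    Code={"ImageUri": "<account>.dkr.ecr.<region>.amazonaws.com/sd-lcm:latest"},
    Role="arn:aws:iam::<account>:role/sd-lcm-lambda-role",
    MemorySize=10240,  # 10 GB, the maximum, which also unlocks roughly six vCPUs
    Timeout=900,       # 15 minutes, far above the ~40 s warm inference time
)
```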

5. The inference pipeline

Processing flow

The image generation process follows a classic diffusion pipeline with a few strategic tweaks to accommodate LCM and serverless constraints:

1. Text prompt input
The user sends a text prompt to an API Gateway or Lambda Function URL. The request body includes the prompt, resolution (typically 256x256), number of steps (defaulting to 4), and optionally a guidance scale.

2. Tokenization
The prompt is tokenised using the CLIP tokeniser, which transforms the natural language into a vector embedding suitable for conditioning the UNet.

3. Latent initialization
A random latent tensor is initialised in the latent space of the model. This tensor represents the starting “noise” which will be denoised into an image.

4. LCM-Guided Denoising (2-4 Steps)
Instead of 20-50 steps as in traditional diffusion, the LCM-tuned pipeline denoises the latent in just 2-4 steps using a modified scheduler (for example, LCMScheduler). This is the primary driver of the performance improvement.

5. Decoding the Latent to RGB
The final latent is passed through the model’s decoder (typically a VAE) to produce an RGB image in PIL format.

6. Encoding and return
The image is converted to PNG, base64-encoded, and returned directly in the Lambda response body. This approach avoids the need for external storage and simplifies integration.
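Put together, a handler implementing this flow could look roughly like the sketch below. It assumes a module-level PIPE pipeline preloaded as in the earlier cold-start sketch, and the request field names are illustrative rather than taken from the article.

```python
# End-to-end handler sketch: parse the request, run 2-4 LCM steps, return base64 PNG.
import base64
import io
import json

def handler(event, context):
    # API Gateway / Function URL delivers the JSON request as a string in "body".
    body = json.loads(event.get("body") or "{}")
    prompt = body.get("prompt", "a flat minimalist icon")
    steps = int(body.get("steps", 4))                 # 2-4 LCM steps
    size = int(body.get("size", 256))
    guidance = float(body.get("guidance_scale", 1.0))

    # Steps 2-5 (tokenization, latent init, LCM denoising, VAE decoding) all happen
    # inside the diffusers pipeline call; PIPE is the preloaded module-level pipeline.
    image = PIPE(prompt, num_inference_steps=steps, guidance_scale=guidance,
                 height=size, width=size).images[0]

    # Step 6: encode the PIL image as PNG, base64 it, and return it inline.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"image_base64": base64.b64encode(buf.getvalue()).decode()}),
    }
```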

Using 2-4 LCM steps

The LCM model was trained to approximate the output of a 50-step diffusion, allowing inference with as few as two steps; however, he found that four steps struck the optimal balance between visual quality and speed.

Each denoising step is still a forward pass through a large UNet, so minimising steps significantly reduces CPU load and total runtime.

Time benchmarks

Here are the approximate timing benchmarks he observed during testing:

Cold Start (first invocation):

  ◦ Container pull (if not cached): ~30–60 seconds
  ◦ Model load and init: ~20–30 seconds
  ◦ Total cold start time: 60–90 seconds

Warm Invocation (subsequent calls):

  ◦ Text → image (end-to-end): ~40 seconds for 256x256 with 4 LCM steps
  ◦ Higher resolutions or more steps linearly increase time (e.g., ~1.5–2x for 512x512)

These times include each step, from request parsing to response construction, but exclude any upstream API Gateway latency.
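One simple way to collect numbers like these is to time the pipeline call and emit a structured log line that CloudWatch Logs Insights can filter on. The helper below is an illustrative sketch, not the instrumentation used in the project.

```python
# Sketch: time a generation and log a structured metric line to CloudWatch (via stdout).
import json
import time

def timed_generate(pipe, prompt, steps=4, size=256):
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=steps, height=size, width=size).images[0]
    elapsed = time.perf_counter() - start
    print(json.dumps({"metric": "inference_seconds", "value": round(elapsed, 2),
                      "steps": steps, "size": size}))
    return image
```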

Cost Comparison

Based on the results, he compiled a table of per-image generation costs:

Memory allocated: 10 GB (max Lambda limit)
Inference duration: ~40 seconds (warm)
Lambda cost per request: ~$0.007
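The per-request figure is consistent with a back-of-the-envelope calculation using Lambda's published x86 pricing (roughly $0.0000166667 per GB-second plus $0.20 per million requests; exact rates vary by region):

```python
# Rough cost check for one warm 40-second invocation at 10 GB.
memory_gb = 10
duration_s = 40
price_per_gb_s = 0.0000166667   # Lambda x86 compute price (region-dependent)
request_price = 0.0000002       # $0.20 per 1M requests

cost = memory_gb * duration_s * price_per_gb_s + request_price
print(round(cost, 4))           # ~0.0067, i.e. roughly the ~$0.007 per image above
```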

Compared to other options:

EC2 (g4dn.xlarge with T4 GPU): ~$0.526/hour, faster (1–2s/image), but costs persist even when idle.
Fargate (CPU only): More configurable, generally slower and still incurs baseline costs.
Modal or Replicate (GPU-based serverless): Suitable for real-time use, but pricing per second/image can add up quickly.

For low or sporadic workloads, Lambda is cost-effective, scaling to zero and charging only per invocation.

6. Lessons learnt

Trade-offs between speed, quality, and cost

Running Stable Diffusion with LCM on Lambda revealed the inherent trade-offs:

Speed: 40 seconds is sufficient for many asynchronous or low-QPS scenarios. LCM’s reduced denoising steps made this feasible.
Quality: Images with 2–4 LCM steps were impressive for icon-style outputs. High-detail or photorealistic results still require standard models and GPUs.
Cost: Lambda is economical for sporadic workloads. You pay per use with no idle costs, making it cheaper than always-on GPU instances for low-volume traffic.

When Lambda is a good choice

Lambda works well when:

You have low to moderate image generation volume.
Latency in the 30–60 second range is acceptable.
You want infrastructure simplicity and auto-scaling.
Your workload is intermittent or spiky, not constant.

It is less suitable when:

You need real-time or sub-second inference.
The model size exceeds Lambda limits or requires GPU acceleration.
You need high throughput with consistent concurrency. In this case, cold starts can become a bottleneck.

Tips for deploying ML inference on Lambda

Preload your model globally so it loads only once per container.
Bundle models inside the image (under /opt) when possible to avoid EFS latency.
Strip unnecessary packages and use multi-stage builds to keep the image lean.
Log inference time and memory usage in CloudWatch to monitor performance and cost.
Test warm vs. cold starts to understand user-facing latency and plan accordingly.
Consider provisioned concurrency for critical workloads where cold starts are unacceptable.

7. Final thoughts

This project proved that Stable Diffusion can function effectively in a fully serverless, CPU-only Lambda environment. Leveraging Latent Consistency Models reduced inference steps from 50 to 2–4, making execution feasible within Lambda’s constraints. With a streamlined model, containerized setup, and optimized resource handling, Pranav built a lightweight API that:

Runs with no GPU dependency
Achieves ~40s inference on CPU
Costs less than a penny per image
Scales automatically with usage

What is next

There is plenty of room to evolve this further:

Experiment with torch.compile() or quantisation for faster inference.
Add asynchronous queuing and user notifications to mitigate latency.
Migrate to GPU-backed services, such as Modal or EC2, for higher throughput needs.
Introduce caching or reuse of common prompts to reduce redundant computation.

This article is curated by Pranav Prabhakar.

