Hugging Face provides thousands of pre-trained models for natural language processing, computer vision, audio processing, and more. You can integrate these models into your applications to deploy AI capabilities without training models from scratch. This guide shows you how to load and use Hugging Face models in your Serverless handlers, using sentiment analysis as an example that you can adapt for other model types.
Use cached models for production: The approach shown in this guide downloads models when workers start, which increases cold start times and costs. For production, use cached models instead. Cached models reduce cold starts to just a few seconds and eliminate charges for model download time. See the cached model tutorial for a complete example.

Install dependencies

Your handler needs the transformers library to load Hugging Face models, and torch to run inference. Install both in your development environment:
pip install torch transformers
When deploying to Runpod, you’ll need to include these dependencies in your Dockerfile or requirements file.
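As a rough sketch, a Dockerfile for this worker could look like the following. The base image and package set are illustrative assumptions, not a prescribed configuration; pick an image that matches your CUDA and PyTorch requirements.

```dockerfile
# Illustrative base image; choose one matching your CUDA/PyTorch needs
FROM python:3.11-slim

WORKDIR /app

# Install the Runpod SDK and inference dependencies
RUN pip install --no-cache-dir runpod torch transformers

COPY handler.py .

CMD ["python", "-u", "handler.py"]
```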

Create your handler

Create a file named handler.py and follow these steps to build a handler that performs sentiment analysis using a Hugging Face model.
Step 1: Import libraries

Start by importing the necessary libraries:
handler.py
import runpod
from transformers import pipeline
The pipeline function from the transformers library provides a simple interface for using pre-trained models. It handles tokenization, model inference, and post-processing automatically.
The pipeline approach shown in this guide is convenient for local testing and development. For production endpoints, you should use cached models instead, which dramatically reduce cold start times and eliminate charges for model download time.
Step 2: Load the model efficiently

Load your model outside the handler function to avoid reloading it on every request. This significantly improves performance by initializing the model only once when the worker starts:
handler.py
# Load model once when worker starts
model = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
The pipeline function takes two arguments: the task type (like "sentiment-analysis", "text-generation", or "image-classification") and the specific model identifier from the Hugging Face model hub.
Step 3: Define the handler function

Create a handler function that extracts input text from the request, validates it, runs inference, and returns results:
handler.py
def handler(job):
    # Extract input from the job
    job_input = job["input"]
    text = job_input.get("text")

    # Validate input
    if not text:
        return {"error": "No text provided for analysis."}

    # Run inference
    result = model(text)[0]

    # Return formatted results
    return {
        "sentiment": result["label"],
        "score": float(result["score"])
    }
The handler follows Runpod’s standard pattern: extract input, validate it, process it, and return results. The model returns a list of predictions, so we take the first result with [0] and extract the label and confidence score.
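Because the handler is a plain Python function, you can sanity-check its control flow without downloading any model by swapping in a stub that mimics the pipeline's list-of-dicts return shape. The stub below is purely illustrative and is not part of the real handler:

```python
# Stub standing in for the sentiment-analysis pipeline: returns the same
# list-of-dicts shape as the real model would.
def model(text):
    return [{"label": "POSITIVE", "score": 0.999}]

def handler(job):
    # Extract and validate input, then run "inference" on the stub
    job_input = job["input"]
    text = job_input.get("text")
    if not text:
        return {"error": "No text provided for analysis."}
    result = model(text)[0]
    return {"sentiment": result["label"], "score": float(result["score"])}

print(handler({"input": {"text": "Great service!"}}))  # normal path
print(handler({"input": {}}))                          # validation error path
```

This exercises both the success path and the validation path before you ever start the worker.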
Step 4: Start the Serverless worker

Add this line at the end of your file to register the handler and start the worker:
handler.py
runpod.serverless.start({"handler": handler})

Complete implementation

Here’s the complete code:
handler.py
import runpod
from transformers import pipeline

# Load model once when worker starts
model = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

def handler(job):
    # Extract input from the job
    job_input = job["input"]
    text = job_input.get("text")

    # Validate input
    if not text:
        return {"error": "No text provided for analysis."}

    # Run inference
    result = model(text)[0]

    # Return formatted results
    return {
        "sentiment": result["label"],
        "score": float(result["score"])
    }

runpod.serverless.start({"handler": handler})

Test locally

Create a test input file to verify your handler works correctly:
test_input.json
{
  "input": {
    "text": "This is absolutely wonderful and amazing!"
  }
}
Run your handler locally using the Runpod SDK:
python handler.py --rp_server_api
You should see output indicating successful sentiment analysis:
--- Starting Serverless Worker |  Version 1.6.2 ---
INFO   | Using test_input.json as job input.
DEBUG  | Retrieved local job: {'input': {'text': 'This is absolutely wonderful and amazing!'}, 'id': 'local_test'}
INFO   | local_test | Started.
DEBUG  | local_test | Handler output: {'sentiment': 'POSITIVE', 'score': 0.999880313873291}
INFO   | Job local_test completed successfully.
The first time you run this, Hugging Face will download the model files. Subsequent runs will use the cached model.

Adapt for other models

This pattern works for any Hugging Face model. To use a different model:
  1. Choose your model: Browse the Hugging Face model hub to find a model for your task.
  2. Update the pipeline: Change the task type and model identifier:
    # Text generation example
    model = pipeline("text-generation", model="gpt2")
    
    # Image classification example
    model = pipeline("image-classification", model="google/vit-base-patch16-224")
    
    # Translation example
    model = pipeline("translation_en_to_fr", model="t5-base")
    
  3. Adjust input/output handling: Different models expect different input formats and return different output structures. Check the model’s documentation on Hugging Face to understand its API.
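For example, a text-generation pipeline returns dicts with a `generated_text` key instead of `label` and `score`, so the handler's output handling changes accordingly. The sketch below uses a stub in place of `pipeline("text-generation", model="gpt2")` so the shape difference is visible without downloading anything; the `prompt` input key is an assumption for this example:

```python
# Stub mimicking a text-generation pipeline's output shape
def model(prompt, max_length=50):
    return [{"generated_text": prompt + " and then some generated text"}]

def handler(job):
    prompt = job["input"].get("prompt")
    if not prompt:
        return {"error": "No prompt provided."}
    # Text generation returns "generated_text", not "label"/"score"
    result = model(prompt, max_length=50)[0]
    return {"generated_text": result["generated_text"]}
```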

Production deployment

When deploying Hugging Face models to production endpoints, follow these best practices:
  • Use cached models: Downloading models at worker startup increases cold start times and costs. Cached models reduce cold starts to a few seconds and eliminate charges for model download time; see the cached model tutorial for a complete example.
  • Model size: Larger models require more VRAM and take longer to load. Choose the smallest model that meets your accuracy requirements.
  • GPU utilization: Most Hugging Face models run faster on GPUs. Ensure your endpoint uses GPU workers for optimal performance.
  • Batch processing: If your model supports batching, process multiple inputs together to improve throughput.
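On the batching point: Transformers pipelines accept a list of inputs and return one result per input, so a batched handler can be sketched as below. A stub stands in for the real pipeline, and the `texts` input key is an assumption for this example:

```python
# Stub: like a real pipeline, returns one result dict per input text
def model(texts):
    return [{"label": "POSITIVE", "score": 0.99} for _ in texts]

def handler(job):
    texts = job["input"].get("texts")
    if not texts:
        return {"error": "No texts provided for analysis."}
    # One pipeline call for the whole batch improves throughput
    results = model(texts)
    return {
        "results": [
            {"sentiment": r["label"], "score": float(r["score"])}
            for r in results
        ]
    }
```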

Next steps