Install dependencies
Your handler needs the transformers library to load Hugging Face models and torch to run inference. Install both in your development environment:
Create your handler
Create a file named handler.py and follow these steps to build a handler that performs sentiment analysis using a Hugging Face model.
Import libraries
Start by importing the necessary libraries:
handler.py
The pipeline function from the transformers library provides a simple interface for using pre-trained models. It handles tokenization, model inference, and post-processing automatically.

Load the model efficiently
Load your model outside the handler function to avoid reloading it on every request. This significantly improves performance by initializing the model only once, when the worker starts:
handler.py
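For example, with a sentiment-analysis pipeline (the distilbert model below is an assumption of this sketch; substitute your own):

```python
# handler.py
from transformers import pipeline

# Runs once at import time, when the worker starts -- not on every request.
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
```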
The pipeline function takes two arguments: the task type (such as "sentiment-analysis", "text-generation", or "image-classification") and the specific model identifier from the Hugging Face model hub.

Define the handler function
Create a handler function that extracts the input text from the request, validates it, runs inference, and returns the results. The handler follows Runpod’s standard pattern: extract input, validate it, process it, and return results. The model returns a list of predictions, so we take the first result with [0] and extract the label and confidence score.

Complete implementation

Here’s the complete code:

handler.py
Test locally
Create a test input file to verify your handler works correctly:

test_input.json
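A minimal test_input.json for this sentiment handler might look like:

```json
{
  "input": {
    "text": "This is a wonderful example!"
  }
}
```

With the runpod SDK installed, running `python handler.py` locally picks up test_input.json and prints the handler’s result, so you can verify the output before deploying.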
Adapt for other models
This pattern works for any Hugging Face model. To use a different model:

- Choose your model: Browse the Hugging Face model hub to find a model for your task.
- Update the pipeline: Change the task type and model identifier.
- Adjust input/output handling: Different models expect different input formats and return different output structures. Check the model’s documentation on Hugging Face to understand its API.
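For instance, switching to text generation might look like this (gpt2 and the prompt/max_new_tokens handling are illustrative choices, not requirements):

```python
from transformers import pipeline

# Different task type and model identifier.
generator = pipeline("text-generation", model="gpt2")

def handler(job):
    prompt = job.get("input", {}).get("prompt")
    if not prompt:
        return {"error": "No prompt provided in input."}

    # Text-generation pipelines return a list of dicts with "generated_text".
    output = generator(prompt, max_new_tokens=30)
    return {"generated_text": output[0]["generated_text"]}
```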
Production deployment
When deploying Hugging Face models to production endpoints, follow these best practices:

- Use cached models: The approach shown in this guide downloads models when workers start, which increases cold start times and costs. For production, use cached models instead. Cached models reduce cold starts to just a few seconds and eliminate charges for model download time. See the cached model tutorial for a complete example.
- Model size: Larger models require more VRAM and take longer to load. Choose the smallest model that meets your accuracy requirements.
- GPU utilization: Most Hugging Face models run faster on GPUs. Ensure your endpoint uses GPU workers for optimal performance.
- Batch processing: If your model supports batching, process multiple inputs together to improve throughput.
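As a sketch of the batching point: transformers pipelines accept a list of inputs and a batch_size argument, so a handler could score several texts per request (the "texts" input shape here is an assumption of this example):

```python
from transformers import pipeline

sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def handler(job):
    texts = job.get("input", {}).get("texts", [])
    if not texts:
        return {"error": "No texts provided in input."}

    # Passing a list lets the pipeline batch inference internally.
    predictions = sentiment_pipeline(texts, batch_size=8)
    return {
        "results": [
            {"sentiment": p["label"], "confidence": p["score"]}
            for p in predictions
        ]
    }
```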
Next steps
- For production: Learn about cached models and follow the cached model tutorial to improve cold start times and reduce costs.
- Create a Dockerfile to package your handler with its dependencies.
- Deploy your worker to a Runpod endpoint.
- Explore optimization techniques to improve performance.
- Learn about error handling for production deployments.