
Generating Structured Data with HuggingFace Inference Endpoints


This tutorial shows you how to generate structured data using the HuggingFace API. The baseline benefit of structured data is that it's easy to parse and use, so you can treat models like APIs. Beyond that, most LLMs produce more accurate results when generating structured data, and inference is often faster too.

Other libraries can produce structured output, such as LangChain's structured output support and Instructor, but for HuggingFace's Inference Endpoints the capability is built in. Good documentation of the feature has been hard to find, though: most places that use it, and are sometimes pointed to as documentation, mix it in with more complicated tasks. So I decided to write out a step-by-step example that's simple, yet useful.

Data that's more accurate, easier to work with, and generated at lower latency: structured data delivers a winning combination over unstructured generation, with benefits we want for every LLM task. In this tutorial we'll see how simple it is to gain these advantages with HuggingFace Inference Endpoints.

Why use HuggingFace Inference Endpoints

I don't like depending on proprietary models or paying excessively when I can get comparable quality from open models. With HuggingFace, I get to select from many models. For $9 per month I get a generous 20,000 API calls per day to their serverless endpoint models, and deploying a dedicated endpoint is simple if I need a model that isn't available as a serverless endpoint.

Given the quality of open models now, I avoid the restrictive terms of service of proprietary models as often as possible. The terms of both Anthropic and OpenAI could be interpreted to preclude using their models for the project I'm taking this example from, because the goal is to build a dataset suitable for training other models.

Import libraries

The first step is to import the libraries we need.

import os
from typing import List

from pydantic import BaseModel
from huggingface_hub import AsyncInferenceClient

Define the response schema

The next step is to model what we want the LLM to generate. One consideration is simply how we want to use the response in our pipeline. There are times when breaking out another field makes it easier to use in code, or simply spares us the parsing.

In general, just break out as much data as you expect to use separately. For example, in a classification task we might ask for the label, the confidence, and the rationale; for a more complex generation task where we needed to generate a set of users, we would include all of the essential fields for each user.
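
To make the classification case concrete, a hypothetical schema along those lines might look like the sketch below (the Classification class and its fields are illustrative, not part of this tutorial's pipeline):

# Hypothetical schema for a classification task; illustrative only.
# Separate fields for label, confidence, and rationale spare us any
# string parsing downstream.
class Classification(BaseModel):
    label: str
    confidence: float
    rationale: str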

Our response schema here is quite simple: a batch of prompts that we want the LLM to generate. The final dataset will include more metadata about each record, but the model we give to the LLM only needs the prompts field, because that's the only part it needs to generate.

class PromptBatch(BaseModel):
    prompts: List[str]

It's a good idea to check the outputs as we go, to make sure they are what we expect. We'll do that even though this schema is so simple.

PromptBatch.model_json_schema()

{'properties': {'prompts': {'items': {'type': 'string'}, 'title': 'Prompts', 'type': 'array'}}, 'required': ['prompts'], 'title': 'PromptBatch', 'type': 'object'}

This JSON schema tells the model how to structure the response. Pydantic lets us model it in Python instead of writing out the JSON by hand, a simpler and more readable approach, and here we verify that our class translates to the schema we want in the response.
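
As a quick illustration of the other direction, a JSON payload matching this schema parses straight back into a PromptBatch (the sample string below is made up for illustration):

# Illustrative only: a made-up payload that matches the schema round-trips cleanly.
sample = '{"prompts": ["Write a Kubernetes manifest for a simple web service."]}'
parsed = PromptBatch.model_validate_json(sample)
assert parsed.prompts == ["Write a Kubernetes manifest for a simple web service."]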

Build the model input

The model input is a list of messages in this case, because we'll use chat completion. I found that models more reliably generated exactly one prompt in the singleton case when the grammatical number was correct, so I wrote a prompt builder that handles singular and plural requests correctly. Since this logic is slightly more involved, it lives in a helper function.

def build_messages(num_generations: int) -> list[dict[str, str]]:
    """Builds the messager for the appropriate cases."""
    if num_generations > 10 or num_generations < 1:
        raise ValueError("num_generations must be between 1 and 10")
    prompt_suffix = "for designing a microservice architecture that can be answered with Kubernetes manifests."
    if num_generations > 1:
        return [
            {
                "role": "user",
                "content": f"Generate {num_generations} unique prompts {prompt_suffix}",
            }
        ]
    if num_generations == 1:
        return [
            {
                "role": "user",
                "content": f"Generate one unique prompt {prompt_suffix}",
            }
        ]
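
A quick call to the helper shows the shape of the messages it produces:

# Sanity check: the plural branch yields a single user message.
msgs = build_messages(2)
print(msgs[0]["content"])
# Generate 2 unique prompts for designing a microservice architecture that can be answered with Kubernetes manifests.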

Get the client

The AsyncInferenceClient isn't strictly necessary for this example; we could just use the synchronous InferenceClient. However, the async client makes it simpler to adapt the code for concurrent use later.

Note that we can also use dedicated Inference Endpoints by simply setting the base_url. However, the 20,000 limit on daily API calls for serverless inference endpoints is adequate for small projects and tests. More important, there's a free tier providing 5,000 API calls per day at no cost.

You can find the currently warm serverless endpoints at the link in the code comment below.

# to use serverless inference endpoints you need to select from currently warm endpoints
# see the list: https://huggingface.co/models?inference=warm&other=endpoints_compatible&sort=trending
client = AsyncInferenceClient(model="microsoft/Phi-3.5-mini-instruct", api_key=os.getenv("HF_TOKEN"))
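
If you do want a dedicated Inference Endpoint instead, the same client can point at it. A minimal sketch, assuming you have already deployed an endpoint (the URL below is a placeholder for your own):

# Sketch: use a dedicated Inference Endpoint by setting base_url.
# The URL is a placeholder; substitute your own deployed endpoint.
dedicated_client = AsyncInferenceClient(
    base_url="https://<your-endpoint-url>",
    api_key=os.getenv("HF_TOKEN"),
)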

Generate the structured output

The key to generating the structured output is the response_format parameter of the chat completion call. This is a dictionary with two keys: type and value. The type key must be set to "json" and the value key must be set to the JSON schema for the response. We can easily supply the schema by calling PromptBatch.model_json_schema(), a method Pydantic provides on the BaseModel class we subclassed.

async def generate_prompts(num_prompts: int = 3) -> PromptBatch:
    msgs = build_messages(num_prompts)
    response = await client.chat.completions.create(
        messages=msgs,
        max_tokens=4096,
        temperature=0.5,
        response_format={"type": "json", "value": PromptBatch.model_json_schema()},
    )
    return PromptBatch.model_validate_json(
        response.choices[0].message.content
    )

The hard parts are handled. Now it's time to generate the result.

prompt_batch = await generate_prompts(num_prompts=3)
assert isinstance(prompt_batch, PromptBatch)
# print as JSON
print(prompt_batch.model_dump_json(indent=2))

This produces the following:

    {
      "prompts": [
        "Design a microservice architecture for a real-time analytics platform using Kubernetes. Include manifests for a scalable, fault-tolerant deployment with auto-scaling capabilities, persistent storage for time-series data, and a service mesh for inter-service communication.",
        "Create a Kubernetes manifest for a microservice-based e-commerce application that supports multiple payment gateways. The architecture should include load balancers, service discovery, auto-scaling based on traffic patterns, and secure, isolated environments for sensitive operations like payment processing.",
        "Develop a Kubernetes manifest for a microservice-driven IoT monitoring system. Ensure the architecture can handle high-volume data streams from sensors, provide real-time analytics, and integrate with external data sources. Include manifests for a robust, self-healing cluster with auto-scaling, network policies for secure communication, and a centralized logging and monitoring solution."
      ]
    }
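
And because we chose the AsyncInferenceClient, running several batches concurrently is a small step. Here's a rough sketch, assuming you want a handful of independent batches (the batch count is arbitrary, and serverless rate limits still apply):

import asyncio

# Sketch: fire off several batch generations concurrently with the async client.
async def generate_many(num_batches: int = 4) -> list[PromptBatch]:
    tasks = [generate_prompts(num_prompts=3) for _ in range(num_batches)]
    return list(await asyncio.gather(*tasks))

batches = await generate_many()
print(sum(len(batch.prompts) for batch in batches), "prompts generated")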

Conclusion

HuggingFace Inference Endpoints make it easy to generate structured data with LLMs, with the parsing logic built in. However, the feature is documented in tutorials that tend to be more complex than necessary, so I've tried to provide a simple example that gets to the point and helps you use the built-in capabilities of the HuggingFace libraries.

I hope this helps you get started with structured data generation with HuggingFace Inference Endpoints.