Multi-modal Large Language Models (LLMs) like OpenAI's GPT-4 and Google's Gemini 1.5 can generate content from brief prompts thanks to the vast amounts of data they've devoured.
However, coaxing an LLM to spit out structured data like JavaScript Object Notation (JSON), rather than plain text, is trickier than one might expect.
In this post, I'll show how easy this is with Vertex AI's Gemini Software Development Kit (SDK), along with the hoops I jumped through and the alternatives I tried first.
Tinkering with Llama-2
To tinker locally for free, I installed Ollama, a Command-Line Interface (CLI) tool to run open-source LLMs on your machine. I asked it "Generate a JSON file containing an array named items with 3 different grocery items and their prices." and it streamed this response:
Of course! Here is a sample JSON file containing an array of three grocery
items and their prices:
```json
{
"items": [
{
"name": "Bread",
"price": "$2.50"
},
{
"name": "Milk",
"price": "$3.00"
},
{
"name": "Eggs",
"price": "$1.75"
}
]
}
```
In this example, the JSON file contains an array named `items` that
contains three grocery items and their prices. The items are represented
as objects with a `name` property and a `price` property. The prices are
in US dollars.
It's a good start, and the model plays nice, but we want JSON without the extra hassle of string manipulation, so I asked "Generate a JSON file containing an array named items with 3 different grocery items and their prices. Generate only JSON.":
Of course! Here is the JSON file containing an array of three grocery
items and their prices:
```
[
{
"name": "Bread",
"price": "$2.50"
},
{
"name": "Milk",
"price": "$3.00"
},
{
"name": "Eggs",
"price": "$1.75"
}
]
```
It managed to generate an even worse response. The array is no longer wrapped in an object under the items key as we asked, and we still have the extra text. Notice that the Markdown isn't even consistent: this time there is no "json" after the backticks.
To be fair, let's test the same with a paid closed-source model, GPT-4.
Repeating this test with GPT-4
Let's ask GPT-4 to "Generate a JSON file containing an array named items with 3 different grocery items and their prices. Generate only JSON.":
Here's a JSON file containing an array named "items" with three different grocery items and their prices:
```json
{
"items": [
{"name": "Milk", "price": 3.49},
{"name": "Bread", "price": 2.99},
{"name": "Apples", "price": 1.29}
]
}
```
OK, it's a bit sad that you can't get it by simply asking, but what if we added a simple bit of business logic? We can keep only what's between the backticks, strip the "json" indicator, and parse the rest with our programming language's tooling. We can then check that the format is right and go on with our cool LLM-generated object. If any step fails, we just ask the model again and hope that we're luckier.
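The extract, parse, check, and retry loop described above could be sketched roughly like this. It is only an illustration: `askModel` and `isValid` are hypothetical callbacks (one returns the model's reply text, the other checks the parsed object), and the limit of three attempts is an arbitrary assumption.

```javascript
// Three backticks, built programmatically so this snippet nests cleanly in Markdown.
const FENCE = '`'.repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + '(?:json)?\\s*([\\s\\S]*?)' + FENCE, 'i');

// Pull the first fenced block out of a Markdown reply and parse it as JSON.
// Falls back to parsing the whole reply when no fence is present.
function extractJson(reply) {
  const match = FENCED_BLOCK.exec(reply);
  const candidate = match ? match[1] : reply;
  return JSON.parse(candidate.trim()); // throws SyntaxError on invalid JSON
}

// Ask again until the reply parses and passes the caller's check.
// `askModel` is a hypothetical async function returning the reply text.
async function generateJsonWithRetry(askModel, isValid, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const parsed = extractJson(await askModel());
      if (isValid(parsed)) return parsed;
      lastError = new Error('Reply parsed but failed validation');
    } catch (err) {
      lastError = err; // malformed JSON or a failed request; try again
    }
  }
  throw lastError;
}
```

Every retry is another billed request, which is where the cost of this approach comes from.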
This approach might work, but you'd better have deep pockets and a lot of patience. Speaking of costs, let's compare them.
How does GPT-4 Turbo's price compare to Gemini 1.5 Pro's?
For this example, Gemini is 18.48% cheaper than GPT-4 Turbo.
Counting our string lengths with OpenAI's tool, our GPT-4 prompt was 121 characters and 23 tokens. Our output was 252 characters and 82 tokens. According to OpenAI, 1000 tokens is about 750 words.
Using Gemini 1.5's pricing and GPT-4 Turbo's pricing, let's compare the API costs in US dollars. The prices are as of April 26, 2024.
Input Length | Output Length | Variable input cost | Variable output cost | Input cost | Output cost | Total cost | Platform |
---|---|---|---|---|---|---|---|
23 tokens | 82 tokens | $10.00 / 1M tokens | $30.00 / 1M tokens | $0.000230 | $0.00246 | $0.00269 | GPT-4 Turbo API |
121 chars | 252 chars | $0.0025 / 1k chars | $0.0075 / 1k chars | $0.000303 | $0.00189 | $0.00219 | GCP Vertex-AI Gemini 1.5 Pro |
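
The arithmetic behind the table can be reproduced with a few lines of JavaScript. This is a minimal sketch: the rates and lengths are copied from the table, while the object and function names are made up for illustration.

```javascript
// Rates copied from the table above (April 26, 2024), in US dollars.
const gpt4TurboRates = { inputPerMillionTokens: 10.0, outputPerMillionTokens: 30.0 };
const gemini15ProRates = { inputPerThousandChars: 0.0025, outputPerThousandChars: 0.0075 };

// GPT-4 Turbo bills per token.
function gptCost(inputTokens, outputTokens) {
  return (
    (inputTokens * gpt4TurboRates.inputPerMillionTokens +
      outputTokens * gpt4TurboRates.outputPerMillionTokens) / 1e6
  );
}

// Gemini 1.5 Pro on Vertex AI bills per character.
function geminiCost(inputChars, outputChars) {
  return (
    (inputChars * gemini15ProRates.inputPerThousandChars +
      outputChars * gemini15ProRates.outputPerThousandChars) / 1e3
  );
}

const gptTotal = gptCost(23, 82);
const geminiTotal = geminiCost(121, 252);
const savingsPercent = (1 - geminiTotal / gptTotal) * 100;

console.log(gptTotal.toFixed(5), geminiTotal.toFixed(5), savingsPercent.toFixed(2) + '%');
```

Depending on where the intermediate values are rounded, the saving lands at roughly 18.5%, and it only holds for this particular prompt and reply, because one platform bills per token and the other per character.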
That's why Gemini is 18.48% cheaper than GPT-4 Turbo for this prompt, which is great, but what else can it do?
Using Gemini for fun and profit
The exact model I'll be using is gemini-1.5-pro-preview-0409. First, I'll be using the Google Cloud Platform Developer Console's Vertex AI studio, which is a user interface for their models. They also provide the code that you can run on your infrastructure to obtain the same result.
So I asked it to "Generate a JSON file containing an array named items with 3 different grocery items and their prices. Generate only JSON." Surprisingly, it generated:
```json
{
"items": [
{
"name": "Milk",
"price": 3.99
},
{
"name": "Bread",
"price": 2.49
},
{
"name": "Apples",
"price": 0.89
}
]
}
```
Great, it skips the unnecessary fluff. We only need to strip the Markdown code fence (the backticks and the "json" tag) and we're good to go.
That said, it can still be hit-or-miss, and we have to cross our fingers that the parsing goes smoothly.
What if we could programmatically ask Gemini to return JSON? We can now do this with their recently launched API. As the documentation states, we can pass this argument to the request payload:
"generationConfig": {
"response_mime_type": "application/json"
}
From now on I'll use Node.js with the @google-cloud/vertexai SDK. Let's start by asking it the same thing, with this parameter added. I removed the "Generate only JSON." sentence from the prompt to save us some money.
Here's the full code we will be working on:
const { VertexAI } = require('@google-cloud/vertexai');
// Initialize Vertex with your Cloud project and location
const vertex_ai = new VertexAI({
  project: 'opus-347402',
  location: 'us-central1',
});
const model = 'gemini-1.5-pro-preview-0409';
// Instantiate the models
const generativeModel = vertex_ai.preview.getGenerativeModel({
  model: model,
  generationConfig: {
    maxOutputTokens: 8192,
    temperature: 1,
    topP: 0.95,
  },
});
async function generateContent() {
  const req = {
    contents: [
      {
        role: 'user',
        parts: [
          {
            text: `Generate a JSON file containing an array named items with 3 different grocery items and their prices.`,
          },
        ],
      },
    ],
    // Ask the API to return JSON instead of free-form text
    generationConfig: {
      responseMimeType: 'application/json',
    },
  };
  const result = await generativeModel.generateContent(req);
  const response = result.response;
  // The JSON arrives as a string in the first part of the first candidate
  const text = response.candidates[0].content.parts[0].text;
  const json = JSON.parse(text);
  // Print the parsed object so we can inspect its shape
  console.log('Object:', json);
}
generateContent().catch(console.error);
The output of this code is:
Object: [
{ item: 'milk', price: 3.99 },
{ item: 'eggs', price: 2.99 },
{ item: 'bread', price: 4.99 }
]
But on the second try, I got:
Object: [
{ name: 'Milk', price: 3.99 },
{ name: 'Bread', price: 2.49 },
{ name: 'Eggs', price: 4.29 }
]
As you can see, the key item became name.
So we get valid JSON reliably, but it's not always the same shape. To fix this, we can add some parameters to our request. Pay close attention to the systemInstruction payload that I added.
const req = {
  contents: [
    {
      role: 'user',
      parts: [
        {
          text: `Generate a JSON file containing 3 different grocery items and their prices.`,
        },
      ],
    },
  ],
  systemInstruction: {
    parts: [
      {
        text: `Please print the results in json format using this JSON schema
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": [
        {
          "type": "object",
          "properties": {
            "name": {
              "type": "string"
            },
            "price": {
              "type": "number"
            }
          },
          "required": [
            "name",
            "price"
          ]
        }
      ]
    }
  },
  "required": [
    "items"
  ]
}`,
      },
    ],
  },
  generationConfig: {
    responseMimeType: 'application/json',
  },
};
By adding a system instruction containing a JSON schema that describes what our JSON should look like, we get this back:
Object: {
items: [
{ name: 'apple', price: 0.75 },
{ name: 'milk', price: 4.25 },
{ name: 'bread', price: 3.5 }
]
}
And that's exactly what we want: valid JSON whose format follows the schema we provided. It's wise to have a retry mechanism, but you'll seldom need it if you tailor your requests meticulously.
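To decide when a retry is actually needed, the parsed object has to be checked against the expected shape. Below is a minimal hand-written check, assuming the shape from the schema above (an object with an items array of name/price entries). The function name `isGroceryList` is made up for illustration, and a JSON Schema validator library could do the same job against the schema itself.

```javascript
// Returns true when `value` looks like { items: [{ name: string, price: number }, ...] }.
function isGroceryList(value) {
  if (typeof value !== 'object' || value === null || !Array.isArray(value.items)) {
    return false;
  }
  return value.items.every(
    (item) =>
      typeof item === 'object' &&
      item !== null &&
      typeof item.name === 'string' &&
      typeof item.price === 'number' &&
      Number.isFinite(item.price)
  );
}

// The schema-guided reply from this post passes the check...
const guided = {
  items: [
    { name: 'apple', price: 0.75 },
    { name: 'milk', price: 4.25 },
    { name: 'bread', price: 3.5 },
  ],
};
// ...while the earlier bare array with `item` keys does not.
const unguided = [
  { item: 'milk', price: 3.99 },
  { item: 'eggs', price: 2.99 },
  { item: 'bread', price: 4.99 },
];

console.log(isGroceryList(guided), isGroceryList(unguided));
```

A failed check is the signal to send the request again rather than pass a malformed object further into the application.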
In conclusion, while generating structured JSON data from large language models like GPT-4 and Gemini 1.5 initially seems daunting, recent advancements have significantly simplified the process. The ability to directly request JSON-formatted outputs—coupled with fine-tuned requests and advanced configuration options—opens up a plethora of practical applications. Now that it's possible to generate JSON reliably, it's very easy to solve problems that were previously quite challenging. A good example is receipt parsing, where simply providing a picture of a receipt allows the model to return an array of items. This feature offers immediate value to our users and only took an afternoon to set up.
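For the receipt example, a request could pair the image with a short prompt. The sketch below is not the implementation mentioned above, only a guess at its general shape: the helper name `buildReceiptRequest`, the prompt wording, and the default JPEG MIME type are assumptions, while inlineData with a base64-encoded data field is how the SDK accepts inline images.

```javascript
// Build a generateContent request that pairs a receipt image with a text prompt.
// `imageBytes` is a Buffer holding the image file's contents.
function buildReceiptRequest(imageBytes, mimeType = 'image/jpeg') {
  return {
    contents: [
      {
        role: 'user',
        parts: [
          // Inline images are sent as base64-encoded strings
          { inlineData: { mimeType, data: imageBytes.toString('base64') } },
          { text: 'List every item on this receipt with its name and price.' },
        ],
      },
    ],
    // Same setting as before: ask for JSON rather than free-form text
    generationConfig: { responseMimeType: 'application/json' },
  };
}
```

With the model set up as in the earlier listing, this would be sent with something like `generativeModel.generateContent(buildReceiptRequest(fs.readFileSync('receipt.jpg')))`, where receipt.jpg is a placeholder file name. A system instruction carrying a JSON schema can be added in the same way as before to pin down the shape of the result.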