Visual QA
Sieve’s Visual-QA app is a versatile tool for developers to analyze and extract insights from images, videos, and audio files. It can provide structured responses in JSON format, allowing for detailed analysis and customization of prompts. The app supports cost-effective processing with parameters like `backend` and `fps` to optimize for various use cases.
Key Features
- Image, Video, and Audio Analysis: The app can process different types of media files to answer questions.
- Structured Responses: Supports plain-text as well as structured JSON responses via the `function_json` param. For details, check out the Notes section below.
- Customizable Prompts: Allows users to specify prompts for tailored use cases.
- Cost-Effective Parameters: Allows developers to specify parameters such as `backend`, `fps`, `start_time`, and `end_time` to optimize for their use case; see the sketch after this list.
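For instance, a minimal call might look like this (a sketch: the `sieve/visual-qa` name, placeholder URL, and keyword arguments are assumptions based on the parameters described on this page; the exact signature may differ):

```python
import sieve

# Hypothetical minimal image-QA call via the Sieve Python client.
visual_qa = sieve.function.get("sieve/visual-qa")
answer = visual_qa.run(
    file=sieve.File(url="https://example.com/photo.jpg"),  # placeholder image
    prompt="What objects are visible in this image?",
    backend="gemini-1.5-flash",
)
print(answer)
```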
Use Cases
Visual-QA can be used for a variety of different tasks such as:
- Video Summarization: Summarize videos for building recommendation, categorization, and tagging systems.
- Multimodal Sentiment Analysis: Analyze the sentiment expressed in images, audio, and videos.
- Visual Search: Build visual search systems that let users search for objects in a video or image.
- Optical Character Recognition (OCR): Extract text from images and videos for digitizing documents, automating data entry, or enabling searchable media archives.
- Educational Tool: Explain graphs and diagrams, and provide explanations for videos and lectures.
- Product Information: Help users find products by answering questions about images of items, such as identifying details or comparing products.
Pricing
The pricing varies based on the selected backend and the amount of content processed.
| Backend | Media Type | < 128k tokens | > 128k tokens |
|---|---|---|---|
| gemini-1.5-flash | Image | $0.00002/image | $0.00004/image |
| | Video | $0.00002/second | $0.00004/second |
| | Audio | $0.000002/second | $0.000004/second |
| | Text Input | $0.00001875/1k chars | $0.0000375/1k chars |
| | Text Output | $0.000075/1k chars | $0.00015/1k chars |
| gemini-1.5-pro | Image | $0.00032875/image | $0.0006575/image |
| | Video | $0.00032875/second | $0.0006575/second |
| | Audio | $0.00003125/second | $0.0000625/second |
| | Text Input | $0.0003125/1k chars | $0.000625/1k chars |
| | Text Output | $0.00125/1k chars | $0.0025/1k chars |
Note: You are charged for the number of images passed plus the characters in the prompt; each modality is charged at a separate rate. For videos, you are charged according to the `fps` param. If you set `audio_context` to True, the audio is also used to make the prediction. A small processing fee of $0.40/hour is charged for compute.
Example: Rick Roll
Let's analyze Rick Astley's Never Gonna Give You Up music video. The video is 3:33 minutes long. We'll set the `fps` param to 1, since there aren't instant changes between frames, enable `audio_context`, and ask Visual-QA for a summary of the video. We'll use `gemini-1.5-flash` as the backend.
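A sketch of the call with the Python client (again assuming the `sieve/visual-qa` name and these parameter names; the exact signature may differ):

```python
import sieve

# Hypothetical invocation; the function name, URL, and keyword
# arguments are assumptions based on the parameters in this doc.
visual_qa = sieve.function.get("sieve/visual-qa")
summary = visual_qa.run(
    file=sieve.File(url="https://example.com/never-gonna-give-you-up.mp4"),  # placeholder URL
    prompt="Summarize this music video.",
    backend="gemini-1.5-flash",
    fps=1,                # one frame sampled per second
    audio_context=True,   # include the audio track in the prediction
)
print(summary)
```

Here's how much it will cost us: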
duration_in_seconds = 213s
total_frames = 1 frame per second * 213 = 213 frames
audio_duration = 213s
# assuming the prompt was 100 chars
text_input = 100 characters
# assuming the output summary was 250 chars
text_output = 250 characters
image_cost = 213 * $0.00002 = $0.00426
audio_cost = 213 * $0.000002 = $0.000426
text_input_cost = 100 * ($0.00001875/1000) = $0.000001875
text_output_cost = 250 * ($0.000075/1000) = $0.00001875
# assume processing took 50s
processing_cost = 50 * ($0.40/3600) = $0.00556
Total_cost = image_cost + audio_cost + text_input_cost + text_output_cost + processing_cost
Total_cost ≈ $0.010
Thus, a 3:33 video with audio costs us only $0.010. For image/video-only tasks we don't explicitly need audio context, which reduces the cost further! Check out the video summary here.
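For reference, here's a small Python helper (a sketch, not part of the app) that reproduces the arithmetic above using the `gemini-1.5-flash` sub-128k-token rates from the pricing table:

```python
# gemini-1.5-flash rates (< 128k tokens) from the pricing table above.
IMAGE_RATE = 0.00002        # $ per frame
AUDIO_RATE = 0.000002       # $ per second of audio
TEXT_IN_RATE = 0.00001875   # $ per 1k input chars
TEXT_OUT_RATE = 0.000075    # $ per 1k output chars
COMPUTE_RATE = 0.40 / 3600  # $ per second of processing

def estimate_cost(video_seconds: float, fps: float = 1, audio: bool = True,
                  prompt_chars: int = 0, output_chars: int = 0,
                  processing_seconds: float = 0) -> float:
    cost = video_seconds * fps * IMAGE_RATE
    if audio:
        cost += video_seconds * AUDIO_RATE
    cost += prompt_chars / 1000 * TEXT_IN_RATE
    cost += output_chars / 1000 * TEXT_OUT_RATE
    cost += processing_seconds * COMPUTE_RATE
    return cost

# The Rick Roll example above: prints ~0.0103
print(round(estimate_cost(213, fps=1, audio=True, prompt_chars=100,
                          output_chars=250, processing_seconds=50), 4))
```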
Notes
Parameter Usage
Function JSON
The `function_json` parameter allows users to specify a JSON schema to get a structured response. Function parameters must be provided in a format that's compatible with the OpenAPI schema. If the `function_json` param is not provided, the response is plain text.
For example: suppose we have images from an online shopping store and we want to categorize them into pants, shoes, bags, shirts, etc., with brief descriptions. Here is how we would do that:
{
"type": "object",
"properties": {
"product_type": {
"type": "string",
"enum": ["shirt", "bag", "pants", "belt", "shoes"],
"description": "The type of the product."
},
"color": {
"type": "string",
"description": "The color of the product."
},
"description": {
"type": "string",
"description": "A brief description of the product."
}
},
"required": ["product_type", "color", "description"]
}
If we pass an image of a brown bag, here's what the output would look like:
{
"product_type": "bag",
"color": "brown",
"description": "A brown leather tote bag."
}
Just like this, we can easily classify millions of unlabeled images. Check out the example here!
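As a sketch of what that batch classification might look like with the Python client (assuming the `sieve/visual-qa` name and parameters above; in Sieve's SDK, `push` submits a job asynchronously and returns a future whose `result()` blocks until the job finishes):

```python
import sieve

# The product schema from above, passed via function_json.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "product_type": {
            "type": "string",
            "enum": ["shirt", "bag", "pants", "belt", "shoes"],
            "description": "The type of the product.",
        },
        "color": {"type": "string", "description": "The color of the product."},
        "description": {"type": "string", "description": "A brief description of the product."},
    },
    "required": ["product_type", "color", "description"],
}

visual_qa = sieve.function.get("sieve/visual-qa")

image_urls = ["https://example.com/img1.jpg", "https://example.com/img2.jpg"]  # placeholders
jobs = [
    visual_qa.push(
        file=sieve.File(url=url),
        prompt="Categorize this product.",
        backend="gemini-1.5-flash",
        function_json=PRODUCT_SCHEMA,
    )
    for url in image_urls
]
for job in jobs:
    print(job.result())  # e.g. {"product_type": "bag", "color": "brown", ...}
```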