Florence-2
This is Sieve's open-source implementation of Florence-2, a visual foundation model for a variety of image question-and-answer tasks, which we have extended to video.
This implementation is based on the HuggingFace implementation of Florence-2 Large using the transformers library. You can find the original paper here.
For examples, click here.
Features
- Object Detection: Detect objects with an optional guiding text prompt.
- Captioning: Caption images with varying levels of detail.
- OCR: Optical character recognition to understand text in images and videos.
- Object Segmentation: Text-promptable segmentation of objects in images and videos.
Pricing
This function is hosted on an L4 GPU and is billed at a compute-based pay-as-you-go rate of $1.25/hr. You can find more information about compute pricing on Sieve here.
Parameters
file
: A sieve.File pointing to an image or video on which to perform QA tasks.
task_prompt
: A string representing the task Florence-2 should perform. For more information on the options, click here.
text_input
: An optional string that supplies an additional text prompt. This is only applicable to certain tasks, including <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, <OPEN_VOCABULARY_DETECTION>, <REGION_TO_CATEGORY>, <REGION_TO_SEGMENTATION>, and <REGION_TO_DESCRIPTION>. Other tasks will throw an error. For more info on proper usage, click here.
debug_visualization
: A boolean flag that, when set to true, enables visualization of outputs on the source image for debugging purposes. Only works for tasks that output bounding boxes.
start_frame
: An integer specifying the start frame for video processing. If -1, the video will be processed from the beginning.
end_frame
: An integer specifying the end frame for video processing. If -1, the video will be processed until the end.
frame_interval
: An integer specifying the interval between frames to process. If 1, all frames will be processed; if 2, every other frame, and so on. This speeds up video processing, and in production settings we strongly recommend skipping frames, as in the sketch below.
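For instance, here is a minimal sketch that combines the video parameters (the video path is a stand-in; any local file or URL should work):

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
response = fl2_fn.run(
    file=sieve.File("path/to/video.mp4"),  # stand-in path; point this at your own video
    task_prompt="<OD>",
    start_frame=0,       # begin at the first frame (-1 also means "from the beginning")
    end_frame=300,       # stop at frame 300 (-1 means "until the end")
    frame_interval=10,   # run inference on every 10th frame to speed things up
)
print("per-frame results:", response)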
Notes
Output Format
Image
For all image tasks, the output is a dictionary where the key is the task_prompt. The type of the value depends on the task.
For <CAPTION>, <DETAILED_CAPTION>, <MORE_DETAILED_CAPTION>, <OCR>, and <DENSE_REGION_CAPTION>, the value is a string.
For <OD> and <CAPTION_TO_PHRASE_GROUNDING>, the value is a dict with two keys, bboxes and labels, which point to a list of [x1, y1, x2, y2] boxes and their corresponding labels, respectively.
For more information and examples, refer here.
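As a quick sketch of consuming this shape (here for an <OD> run on the car photo used in the examples below), you can pair each box with its label:

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<OD>",
)

# The value under "<OD>" holds parallel lists of boxes and labels.
detections = response["<OD>"]
for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
    print(f"{label}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f})")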
Video
For all video tasks, the output is a list of dictionaries, where the keys for each dictionary are frame_number and the task_prompt. Under the task_prompt key, the values are the same as specified above, as in the sketch below.
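For instance, a minimal sketch of walking the per-frame results of an <OD> video run (assuming response is the list described above):

# Assumes `response` came from a video run with task_prompt="<OD>".
for frame_result in response:
    frame_number = frame_result["frame_number"]
    detections = frame_result["<OD>"]  # same shape as the image output
    print(f"frame {frame_number}: {len(detections['bboxes'])} boxes")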
Task Prompt
Options for the task_prompt param include:
- <OD>: Object Detection
- <CAPTION_TO_PHRASE_GROUNDING>: Caption to Phrase Grounding
- <CAPTION>: Caption Generation
- <DETAILED_CAPTION>: Detailed Caption Generation
- <MORE_DETAILED_CAPTION>: More Detailed Caption Generation
- <DENSE_REGION_CAPTION>: Dense Region Captioning
- <REGION_PROPOSAL>: Region Proposal
- <OCR>: Optical Character Recognition
- <OCR_WITH_BOXES>: OCR with Bounding Boxes
- <REGION_TO_SEGMENTATION>: Segment a Specific Region
- <REGION_TO_CATEGORY>: Categorize a Specific Region with a One-Word Descriptor
- <REGION_TO_DESCRIPTION>: Describe a Specific Region
- <REFERRING_EXPRESSION_SEGMENTATION>: Generate a Segmentation from a Text Caption
- <OPEN_VOCABULARY_DETECTION>: Open-Vocabulary Detection of Objects and OCR Text
For more information, refer here.
Text Input
For certain tasks, like <CAPTION_TO_PHRASE_GROUNDING>, you can supply a prompt to the text_input parameter to focus on detecting/segmenting particular objects. Good phrases are 1-2 words long, since adjectives, nouns, and verbs can each be picked up as separate objects.
Check out an example of this here.
For more information, refer here.
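As a sketch, here is a referring-expression segmentation call on the car photo from the examples below; the segmentation output shape isn't described above, so we simply print the raw response:

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<REFERRING_EXPRESSION_SEGMENTATION>",
    text_input="blue car",  # keep the phrase to 1-2 words
)
print("dict response:", response)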
Examples
Let's apply Sieve's Florence-2 to this photo of a car! Be sure to log in via sieve login on your terminal or by setting the SIEVE_API_KEY environment variable.
Object Detection
For object detection, use a code snippet like this.
import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
debug_image, response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<OD>",
    debug_visualization=True,
)
print("debug image path:", debug_image.path)
print("dict response:", response)
If you run on the image of the car, the response dictionary will look something like:
{
    "<OD>": {
        "bboxes": [
            [...],
            [...],
            [...],
            [...]
        ],
        "labels": [
            "car",
            "door",
            "wheel",
            "wheel"
        ]
    }
}
Guided Object Detection with Prompts
To detect specific objects using text, use a code snippet like this.
import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
text_input = "car, door."
debug_image, response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<CAPTION_TO_PHRASE_GROUNDING>",
    text_input=text_input,
    debug_visualization=True,
)
print("debug image path:", debug_image.path)
print("dict response:", response)
If you run on the image of the car, the response dictionary will look something like:
{
    "<CAPTION_TO_PHRASE_GROUNDING>": {
        "bboxes": [
            [...],
            [...]
        ],
        "labels": [
            "car",
            "door"
        ]
    }
}
The prompt matters a lot here, so we encourage you to experiment and see what works.
Captioning
To generate a detailed caption of the image, use a code snippet like this.
import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<DETAILED_CAPTION>",
)
print("dict response:", response)
If you run on the image of the car, the response dictionary will look something like:
{
    "<DETAILED_CAPTION>": "The image shows a blue Volkswagen Beetle parked in front of a yellow building with two brown doors, surrounded by trees and a clear blue sky."
}
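OCR
Along the same lines, here is a sketch of an OCR call; per the output format above, the value under <OCR> is a string. The car photo contains little text, so substitute an image of your own with visible text (the path below is a stand-in):

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
response = fl2_fn.run(
    file=sieve.File("path/to/image_with_text.jpg"),  # stand-in path; use any image containing text
    task_prompt="<OCR>",
)
print("dict response:", response)  # e.g. {"<OCR>": "...recognized text..."}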