
Florence-2

This is Sieve's open-source implementation of Florence-2, a visual foundation model for a variety of image question-and-answer tasks, which we have extended to video.

This implementation is based on the HuggingFace implementation of Florence-2 Large using the transformers library. You can find the original paper here.

For examples, click here.

Features

  • Object Detection: Detect objects with an optional guiding text prompt.
  • Captioning: Caption images with varying levels of detail.
  • OCR: Optical character recognition to understand text in images and videos.
  • Object Segmentation: Text-promptable segmentation of objects in images and videos.

Pricing

This function is hosted on an L4 GPU and is billed at a compute-based pay-as-you-go rate of $1.25/hr. You can find more information about compute pricing on Sieve here.

Parameters

  • file: A sieve.File pointing to the image or video on which to perform QA tasks.
  • task_prompt: A string representing the task Florence-2 should perform. For more information on the options, click here.
  • text_input: An optional string that supplies an additional text prompt. This is only applicable for certain tasks, including <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, <OPEN_VOCABULARY_DETECTION>, <REGION_TO_CATEGORY>, <REGION_TO_SEGMENTATION>, and <REGION_TO_DESCRIPTION>. Other tasks will throw an error. For more info on proper usage, click here.
  • debug_visualization: A boolean flag that, when set to true, enables the visualization of outputs on the source image for debugging purposes. Only works for tasks that output bounding boxes.
  • start_frame: An integer specifying the start frame for video processing. If -1, the video will be processed from the beginning.
  • end_frame: An integer specifying the end frame for video processing. If -1, the video will be processed until the end.
  • frame_interval: An integer specifying the interval between frames to process. If 1, all frames are processed; if 2, every other frame is processed, and so on. This is used to speed up video processing, and in production settings we strongly recommend skipping frames (see the example after this list).
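
For example, a call that captions only the first 300 frames of a video, processing every 5th frame, might look like the snippet below. This is a minimal sketch: the video path is a placeholder, and it assumes the parameter names listed above.

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")

# Caption frames 0 through 300, processing every 5th frame.
# Replace the placeholder path with your own video file or URL.
response = fl2_fn.run(
    file=sieve.File("path/to/video.mp4"),
    task_prompt="<CAPTION>",
    start_frame=0,
    end_frame=300,
    frame_interval=5,
)

print(response)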

Notes

Output Format

Image

For all image tasks, the output is a dictionary where the key is the task_prompt. Depending on the task, the value's type changes.

For <CAPTION>, <DETAILED_CAPTION>, <MORE_DETAILED_CAPTION>, <OCR> and <DENSE_REGION_CAPTION> the value is a string.

For <OD> and <CAPTION_TO_PHRASE_GROUNDING>, the value is a dict with two keys, bboxes and labels, which map to a list of [x1, y1, x2, y2] boxes and a list of their labels, respectively.

For more information and examples, refer here.

Video

For all video tasks, the output is a list of dictionaries, where the keys for each dictionary are frame_number and the task_prompt. Under the task_prompt, the values are the same as specified above.
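
As an illustration, here is one way to iterate over a video result for an <OD> run. This is a minimal sketch that assumes the list-of-dictionaries format described above; the video path is a placeholder.

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")

task_prompt = "<OD>"

# Detect objects in every 10th frame of a placeholder video.
response = fl2_fn.run(
    file=sieve.File("path/to/video.mp4"),
    task_prompt=task_prompt,
    frame_interval=10,
)

# Each entry holds the frame number plus the same per-image result format.
for frame_result in response:
    detections = frame_result[task_prompt]
    for box, label in zip(detections["bboxes"], detections["labels"]):
        print(frame_result["frame_number"], label, box)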

Task Prompt

Options for the task_prompt param include:

  • <OD>: Object Detection
  • <CAPTION_TO_PHRASE_GROUNDING>: Caption to Phrase Grounding
  • <CAPTION>: Caption Generation
  • <DETAILED_CAPTION>: Detailed Caption Generation
  • <MORE_DETAILED_CAPTION>: More Detailed Caption Generation
  • <DENSE_REGION_CAPTION>: Dense Region Captioning
  • <REGION_PROPOSAL>: Region Proposal
  • <OCR>: Optical Character Recognition
  • <OCR_WITH_BOXES>: OCR with Bounding Boxes
  • <REGION_TO_SEGMENTATION>: Segment a Specific Region
  • <REGION_TO_CATEGORY>: Categorize a Specific Region with a One-Word Descriptor
  • <REGION_TO_DESCRIPTION>: Describe a Specific Region
  • <REFERRING_EXPRESSION_SEGMENTATION>: Segment the Region Described by a Text Prompt
  • <OPEN_VOCABULARY_DETECTION>: Open-Vocabulary Detection of Objects and Text

For more information, refer here.

Text Input

For certain tasks, like <CAPTION_TO_PHRASE_GROUNDING>, you can supply a prompt to the text_input parameter to focus on detecting/segmenting particular objects. Good phrases are 1-2 words long, since in longer prompts adjectives, nouns, and verbs can each be picked up as separate objects.

Check out an example of this here.

For more information, refer here.

Examples

Example Image

Let's apply Sieve's Florence-2 to this photo of a car! Be sure to log in via sieve login in your terminal or by setting the SIEVE_API_KEY environment variable.

Object Detection

For object detection, use a code snippet like this.


import sieve

fl2_fn = sieve.function.get("sieve/florence-2")

debug_image, response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<OD>",
    debug_visualization=True,
)

print("debug image path:", debug_image.path)
print("dict response:", response)

If you run this on the image of the car, the response dictionary will look something like:

{
  "<OD>": {
    "bboxes": [
      [...],  
      [...],
      [...],
      [...]
    ],
    "labels": [
      "car",
      "door",
      "wheel",
      "wheel"
    ]
  }
}
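
If you prefer to overlay the detections yourself rather than rely on debug_visualization, one possible approach is to draw the returned boxes with Pillow. This is a minimal sketch: it assumes the bboxes/labels format shown above and a local copy of the input image (the filenames are placeholders).

from PIL import Image, ImageDraw

def draw_detections(image_path, response, task_prompt="<OD>"):
    # `response` is the dictionary shown above; each box is [x1, y1, x2, y2].
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    result = response[task_prompt]
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, y1), label, fill="red")
    return image

draw_detections("car.jpg", response).save("car_annotated.jpg")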

Guided Object Detection with Prompts

To detect specific objects using text, use a code snippet like this.


import sieve

fl2_fn = sieve.function.get("sieve/florence-2")

text_input = "car, door."

debug_image, response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<CAPTION_TO_PHRASE_GROUNDING>",
    text_input=text_input,
    debug_visualization=True,
)

print("debug image path:", debug_image.path)
print("dict response:", response)

If you run this on the image of the car, the response dictionary will look something like:

{
  "<CAPTION_TO_PHRASE_GROUNDING>": {
    "bboxes": [
      [...],  
      [...]
    ],
    "labels": [
      "car",
      "door"
    ]
  }
}

The prompt matters a lot here, so we encourage you to experiment and see what works.
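
One simple way to experiment is to loop over a few candidate prompts and compare which labels come back. This is a minimal sketch reusing the same image; without debug_visualization, the call returns just the response dictionary.

import sieve

fl2_fn = sieve.function.get("sieve/florence-2")
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"

# Try a few short prompts and compare the detected labels for each.
for text_input in ["car.", "car, door.", "wheel."]:
    response = fl2_fn.run(
        file=sieve.File(image_url),
        task_prompt="<CAPTION_TO_PHRASE_GROUNDING>",
        text_input=text_input,
    )
    print(text_input, "->", response["<CAPTION_TO_PHRASE_GROUNDING>"]["labels"])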

Captioning

To generate a detailed caption of the image, use a code snippet like this.


import sieve

fl2_fn = sieve.function.get("sieve/florence-2")

response = fl2_fn.run(
    file=sieve.File("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"),
    task_prompt="<DETAILED_CAPTION>",
)

print("dict response:", response)

If you run this on the image of the car, the response dictionary will look something like:

{
  "<DETAILED_CAPTION>": "The image shows a blue Volkswagen Beetle parked in front of a yellow building with two brown doors, surrounded by trees and a clear blue sky."
}