Enum VLMTask

Namespace: VisioForge.Core.Types.X.AI
Assembly: VisioForge.Core.dll

The task a Florence-2 vision-language model performs on each processed frame. The task determines which natural-language prompt is fed to the model and how the model's output is interpreted: as free text or as grounded regions.

public enum VLMTask

Fields

Caption = 0

Generate a short one-sentence caption describing the image (Florence-2 <CAPTION>).

DetailedCaption = 1

Generate a more detailed caption of the image (Florence-2 <DETAILED_CAPTION>).

MoreDetailedCaption = 2

Generate a paragraph-length, highly detailed caption (Florence-2 <MORE_DETAILED_CAPTION>).

ObjectDetection = 3

Detect objects and report a category label with a bounding box for each (Florence-2 <OD>).

DenseRegionCaption = 4

Detect regions and report a short description with a bounding box for each (Florence-2 <DENSE_REGION_CAPTION>).

Ocr = 5

Read all text in the image as a single string (Florence-2 <OCR>).

OcrWithRegion = 6

Read text in the image and report each text block with a quadrilateral region (Florence-2 <OCR_WITH_REGION>).

PhraseGrounding = 7

Ground the phrases of a caption supplied in VisioForge.Core.Types.X.AI.VLMSettings.TextInput to image regions (Florence-2 <CAPTION_TO_PHRASE_GROUNDING>).
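
The field descriptions above pair each value with a Florence-2 task token and an output shape. The following is a minimal sketch of that mapping, not part of the VisioForge API. The `VLMTaskInfo` helper and its three methods are hypothetical names introduced for illustration, and the enum is redeclared locally (mirroring the documented values) so that the snippet compiles without referencing VisioForge.Core.dll. The free-text versus region split is inferred from the descriptions above.

```csharp
using System;

// Local stand-in that mirrors the documented VLMTask values, so this sketch
// compiles on its own. In real code, use VisioForge.Core.Types.X.AI.VLMTask.
public enum VLMTask
{
    Caption = 0,
    DetailedCaption = 1,
    MoreDetailedCaption = 2,
    ObjectDetection = 3,
    DenseRegionCaption = 4,
    Ocr = 5,
    OcrWithRegion = 6,
    PhraseGrounding = 7
}

// Hypothetical helper (assumption): not a VisioForge type.
public static class VLMTaskInfo
{
    // Returns the Florence-2 task token named in each field's description.
    public static string ToPromptToken(VLMTask task) => task switch
    {
        VLMTask.Caption => "<CAPTION>",
        VLMTask.DetailedCaption => "<DETAILED_CAPTION>",
        VLMTask.MoreDetailedCaption => "<MORE_DETAILED_CAPTION>",
        VLMTask.ObjectDetection => "<OD>",
        VLMTask.DenseRegionCaption => "<DENSE_REGION_CAPTION>",
        VLMTask.Ocr => "<OCR>",
        VLMTask.OcrWithRegion => "<OCR_WITH_REGION>",
        VLMTask.PhraseGrounding => "<CAPTION_TO_PHRASE_GROUNDING>",
        _ => throw new ArgumentOutOfRangeException(nameof(task), task, "Unknown VLMTask value.")
    };

    // True when the task reports image regions (boxes or quadrilaterals)
    // instead of free text only, based on the field descriptions.
    public static bool ReturnsRegions(VLMTask task) => task switch
    {
        VLMTask.ObjectDetection or VLMTask.DenseRegionCaption
            or VLMTask.OcrWithRegion or VLMTask.PhraseGrounding => true,
        VLMTask.Caption or VLMTask.DetailedCaption
            or VLMTask.MoreDetailedCaption or VLMTask.Ocr => false,
        _ => throw new ArgumentOutOfRangeException(nameof(task), task, "Unknown VLMTask value.")
    };

    // PhraseGrounding is the only value documented as reading a caption
    // from VLMSettings.TextInput.
    public static bool RequiresTextInput(VLMTask task) => task == VLMTask.PhraseGrounding;
}
```

With this sketch, `VLMTaskInfo.ToPromptToken(VLMTask.Ocr)` yields `<OCR>`, and `VLMTaskInfo.RequiresTextInput(task)` can serve as a guard that checks a caption has been supplied before a phrase-grounding run starts.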