Class ClipEmbeddingEngine

Namespace
VisioForge.Core.AI.Clip
Assembly
VisioForge.Core.AI.dll

A CLIP dual-tower embedding engine. It owns two ONNX sessions: a vision tower that turns an image into an embedding, and a text tower that turns text into an embedding in the same space, so an image and a natural-language query can be compared by cosine similarity. Both towers include the CLIP projection head, so their outputs share the embedding dimension exposed by Dimension. All outputs are L2-normalized, so the cosine similarity of two embeddings is simply their dot product.

public sealed class ClipEmbeddingEngine : IDisposable

Inheritance

object
ClipEmbeddingEngine

Implements

IDisposable

Remarks

Input and output tensor names, and the embedding dimension, are read from the model metadata in Init(), so an fp16 or quantized re-export of the same model remains a drop-in replacement. Text is tokenized with the in-house ClipTokenizer (maximum length 77 tokens, wrapped in begin-of-text and end-of-text tokens). The engine is thread-safe for concurrent EncodeImage(SKBitmap) and EncodeText(string) calls: the two methods use separate sessions, and ONNX Runtime's Run is itself thread-safe.
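
Because every embedding is L2-normalized, comparing an image embedding with a text embedding needs only a dot product. The following is a minimal sketch of that comparison, not part of the library: EmbeddingMath, Dot, and Cosine are hypothetical helper names, and the zero-vector convention in Cosine is an assumption made for this example.

```csharp
using System;

// Hypothetical helper, not part of VisioForge.Core.AI.
public static class EmbeddingMath
{
    // Dot product of two equal-length vectors. For L2-normalized inputs this
    // equals their cosine similarity, so no division by the norms is needed.
    public static float Dot(float[] a, float[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    // General cosine similarity, for vectors that may not be normalized.
    public static float Cosine(float[] a, float[] b)
    {
        float dot = Dot(a, b);
        float normA = MathF.Sqrt(Dot(a, a));
        float normB = MathF.Sqrt(Dot(b, b));

        // Assumption for this sketch: a zero vector is similar to nothing.
        if (normA == 0f || normB == 0f) return 0f;
        return dot / (normA * normB);
    }
}
```

With embeddings returned by EncodeImage and EncodeText, calling Dot directly should be enough; Cosine is included only for vectors that come from elsewhere and may not be normalized.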

Constructors

ClipEmbeddingEngine(VideoEmbeddingSettings)

Initializes a new instance of the ClipEmbeddingEngine class.

public ClipEmbeddingEngine(VideoEmbeddingSettings settings)

Parameters

settings VideoEmbeddingSettings

The video embedding settings carrying the CLIP model and tokenizer paths.

Exceptions

ArgumentNullException

Thrown when settings is null.

Properties

ActiveProvider

Gets the execution provider that the vision session actually engaged. Valid after Init().

public OnnxExecutionProvider ActiveProvider { get; }

Property Value

OnnxExecutionProvider

Dimension

Gets the embedding dimension shared by the vision and text towers, read from the model output metadata in Init(). Zero before initialization.

public int Dimension { get; }

Property Value

int

Methods

Dispose()

Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.

public void Dispose()

EncodeImage(VideoFrameX)

Encodes an RGBA video frame into an L2-normalized CLIP image embedding.

public float[] EncodeImage(VideoFrameX frame)

Parameters

frame VideoFrameX

The source RGBA frame.

Returns

float[]

The L2-normalized embedding, or null when the frame is empty or the engine failed to initialize.

EncodeImage(SKBitmap)

Encodes a bitmap into an L2-normalized CLIP image embedding.

public float[] EncodeImage(SKBitmap image)

Parameters

image SKBitmap

The source image (any color type/size).

Returns

float[]

The L2-normalized embedding, or null when the image is null or the engine failed to initialize.
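
Since EncodeImage signals failure with a null return rather than an exception, callers that embed many images need to check each result. The sketch below shows one way to do that while building an in-memory index. EmbeddingIndexer and Build are hypothetical names, and the encoder is passed in as a delegate so the example stays independent of the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of VisioForge.Core.AI.
public static class EmbeddingIndexer
{
    // Encodes each item with `encode` and keeps only the successful results.
    // A null return is treated as "no embedding available" and counted in
    // `skipped`, mirroring the documented null result of EncodeImage.
    public static Dictionary<string, float[]> Build<T>(
        IEnumerable<KeyValuePair<string, T>> items,
        Func<T, float[]> encode,
        out int skipped)
    {
        if (items == null) throw new ArgumentNullException(nameof(items));
        if (encode == null) throw new ArgumentNullException(nameof(encode));

        var index = new Dictionary<string, float[]>();
        skipped = 0;

        foreach (var item in items)
        {
            float[] embedding = encode(item.Value);
            if (embedding == null)
            {
                skipped++;
                continue;
            }
            index[item.Key] = embedding;
        }
        return index;
    }
}
```

Assuming an initialized engine and a Dictionary<string, SKBitmap> named images, a call such as EmbeddingIndexer.Build<SKBitmap>(images, engine.EncodeImage, out int skipped) would plug the engine in. A nonzero skipped count is a hint to check for null images or a failed Init().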

EncodeText(string)

Encodes a text query into an L2-normalized CLIP text embedding, in the same space as the image embeddings.

public float[] EncodeText(string text)

Parameters

text string

The query text.

Returns

float[]

The L2-normalized embedding.

Exceptions

InvalidOperationException

Thrown when the engine is not initialized, or the text model / tokenizer files were not provided.
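
A typical use of the text embedding is to rank previously computed image embeddings against a query. The following is a minimal sketch of such a ranking under the assumption that all vectors are L2-normalized and share the same dimension. ClipSearch and Rank are hypothetical names, not library members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of VisioForge.Core.AI.
public static class ClipSearch
{
    // Scores every stored image embedding against the query embedding and
    // returns the best `topK` matches, highest score first. With
    // L2-normalized vectors the dot product is the cosine similarity.
    public static List<(string Key, float Score)> Rank(
        float[] query,
        IReadOnlyDictionary<string, float[]> imageEmbeddings,
        int topK)
    {
        if (query == null) throw new ArgumentNullException(nameof(query));
        if (imageEmbeddings == null) throw new ArgumentNullException(nameof(imageEmbeddings));

        return imageEmbeddings
            .Select(kv => (Key: kv.Key, Score: Dot(query, kv.Value)))
            .OrderByDescending(x => x.Score)
            .Take(Math.Max(0, topK))
            .ToList();
    }

    private static float Dot(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
```

Here the query would come from EncodeText and the dictionary values from EncodeImage. Because EncodeText throws InvalidOperationException instead of returning null, the call that produces the query is the one to wrap in a try/catch.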

Init()

Loads the vision and text CLIP models, resolves their input/output names and the embedding dimension, and loads the CLIP tokenizer.

public bool Init()

Returns

bool

true if initialization succeeded; otherwise, false.
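
The boolean result of Init() is the only documented signal that the models and tokenizer loaded, so it is worth checking before any encode call. The sketch below takes an already constructed engine, since the members of VideoEmbeddingSettings are not described here. ClipStartup and TryStart are hypothetical names, and the code needs a reference to VisioForge.Core.AI.dll to compile.

```csharp
using System;
using VisioForge.Core.AI.Clip;

// Hypothetical helper, not part of VisioForge.Core.AI.
public static class ClipStartup
{
    // Initializes an already constructed engine and reports what was loaded.
    // Returns false when Init() fails, so callers can skip encoding entirely.
    public static bool TryStart(ClipEmbeddingEngine engine)
    {
        if (engine == null) throw new ArgumentNullException(nameof(engine));

        if (!engine.Init())
        {
            Console.Error.WriteLine("CLIP engine failed to initialize.");
            return false;
        }

        // Dimension is zero before Init(), and ActiveProvider is documented
        // as valid only afterwards, so both are read after the check above.
        Console.WriteLine($"Embedding dimension: {engine.Dimension}");
        Console.WriteLine($"Execution provider: {engine.ActiveProvider}");
        return true;
    }
}
```

Because the class implements IDisposable, the caller would normally create the engine in a using statement so that it is disposed once encoding is finished.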

SetContext(BaseContext)

Sets the logging context.

public void SetContext(BaseContext context)

Parameters

context BaseContext

The logging context to use.