Class ClipEmbeddingEngine
- Namespace
- VisioForge.Core.AI.Clip
- Assembly
- VisioForge.Core.AI.dll
A CLIP dual-tower embedding engine. It owns two ONNX sessions — a vision tower that turns an image into an embedding and a text tower that turns text into an embedding in the same space — so an image and a natural-language query can be compared by cosine similarity. Both towers include the CLIP projection head, so their outputs share the embedding dimension exposed by VisioForge.Core.AI.Clip.ClipEmbeddingEngine.Dimension. All outputs are L2-normalized.
public sealed class ClipEmbeddingEngine : IDisposable
Inheritance
Implements
Inherited Members
Remarks
Input and output tensor names, and the embedding dimension, are read from the model metadata during VisioForge.Core.AI.Clip.ClipEmbeddingEngine.Init, so an fp16 or quantized re-export of the same model remains a drop-in replacement. Text is tokenized with the in-house ClipTokenizer (maximum length of 77 tokens, wrapped in begin-of-text and end-of-text tokens). The engine is thread-safe for concurrent VisioForge.Core.AI.Clip.ClipEmbeddingEngine.EncodeImage(SkiaSharp.SKBitmap) and VisioForge.Core.AI.Clip.ClipEmbeddingEngine.EncodeText(System.String) calls: they use separate sessions, and ONNX Runtime Run is itself thread-safe.
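Because every output is L2-normalized, the cosine similarity between an image embedding and a text embedding reduces to a plain dot product. The helper below is a minimal sketch of that comparison. `EmbeddingMath`, `Dot`, and `BestMatch` are hypothetical names introduced for illustration and are not part of the library.

```csharp
using System;

public static class EmbeddingMath
{
    // Dot product of two equal-length vectors.
    // For L2-normalized inputs this equals their cosine similarity.
    public static float Dot(float[] a, float[] b)
    {
        if (a == null) throw new ArgumentNullException(nameof(a));
        if (b == null) throw new ArgumentNullException(nameof(b));
        if (a.Length != b.Length)
            throw new ArgumentException("Embeddings must have the same dimension.");

        float sum = 0f;
        for (int i = 0; i < a.Length; i++)
        {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Returns the index of the candidate most similar to the query,
    // or -1 when there are no candidates.
    public static int BestMatch(float[] query, float[][] candidates)
    {
        int best = -1;
        float bestScore = float.NegativeInfinity;
        for (int i = 0; i < candidates.Length; i++)
        {
            float score = Dot(query, candidates[i]);
            if (score > bestScore)
            {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

In a search scenario, `query` would typically come from EncodeText and `candidates` from EncodeImage calls made earlier on stored frames.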
Constructors
ClipEmbeddingEngine(VideoEmbeddingSettings)
Initializes a new instance of the VisioForge.Core.AI.Clip.ClipEmbeddingEngine class.
public ClipEmbeddingEngine(VideoEmbeddingSettings settings)
Parameters
- settings VideoEmbeddingSettings
- The video embedding settings carrying the CLIP model and tokenizer paths.
Exceptions
- ArgumentNullException
- Thrown when settings is null.
Properties
ActiveProvider
Gets the execution provider the vision session actually engaged. Valid after VisioForge.Core.AI.Clip.ClipEmbeddingEngine.Init.
public OnnxExecutionProvider ActiveProvider { get; }
Property Value
- OnnxExecutionProvider
Dimension
Gets the embedding dimension shared by the vision and text towers, read from the model output metadata at VisioForge.Core.AI.Clip.ClipEmbeddingEngine.Init. Zero before initialization.
public int Dimension { get; }
Property Value
- int
Methods
Dispose()
Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.
public void Dispose()
EncodeImage(VideoFrameX)
Encodes an RGBA video frame into an L2-normalized CLIP image embedding.
public float[] EncodeImage(VideoFrameX frame)
Parameters
- frame VideoFrameX
- The source RGBA frame.
Returns
- float[]
- The L2-normalized embedding, or null when the frame is empty or the engine failed to initialize.
EncodeImage(SKBitmap)
Encodes a bitmap into an L2-normalized CLIP image embedding.
public float[] EncodeImage(SKBitmap image)
Parameters
- image SKBitmap
- The source image (any color type or size).
Returns
- float[]
- The L2-normalized embedding, or null when the image is null or the engine failed to initialize.
EncodeText(string)
Encodes a text query into an L2-normalized CLIP text embedding, in the same space as the image embeddings.
public float[] EncodeText(string text)
Parameters
- text string
- The query text.
Returns
- float[]
- The L2-normalized embedding.
Exceptions
- InvalidOperationException
- Thrown when the engine is not initialized, or when the text model or tokenizer files were not provided.
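Unlike the EncodeImage overloads, which return null on failure, EncodeText signals an unusable engine by throwing. A caller that wants uniform handling might wrap it as in the sketch below. `TextEncodingSketch` and `TryEncodeText` are hypothetical names, not library members.

```csharp
using System;
using VisioForge.Core.AI.Clip;

public static class TextEncodingSketch
{
    // Hypothetical wrapper: returns false instead of throwing when the engine
    // is not initialized or the text model / tokenizer files were not provided.
    public static bool TryEncodeText(ClipEmbeddingEngine engine, string query, out float[] embedding)
    {
        try
        {
            embedding = engine.EncodeText(query);
            return embedding != null;
        }
        catch (InvalidOperationException)
        {
            embedding = null;
            return false;
        }
    }
}
```

Whether to swallow the exception or let it propagate depends on the application; propagating is usually preferable when a missing text model is a configuration error.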
Init()
Loads the vision and text CLIP models, resolves their input/output names and the embedding dimension, and loads the CLIP tokenizer.
public bool Init()
Returns
- bool
- true if initialization succeeded; otherwise, false.
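A typical call sequence, sketched under a few assumptions: the caller already has a populated VideoEmbeddingSettings instance (its properties are not documented here, so none are set), and the file at `imagePath` can be decoded by SkiaSharp. `ClipSearchSketch` and `Score` are hypothetical names introduced for illustration.

```csharp
using System;
using SkiaSharp;
using VisioForge.Core.AI.Clip;
// Assumption: VideoEmbeddingSettings may live in another namespace;
// add the matching using directive if it does.

public static class ClipSearchSketch
{
    // Hypothetical helper: returns the similarity between an image file and a
    // text query, or null when the engine or the image is unavailable.
    public static float? Score(VideoEmbeddingSettings settings, string imagePath, string query)
    {
        using var engine = new ClipEmbeddingEngine(settings);

        // Init() reports failure through its return value rather than by throwing.
        if (!engine.Init())
        {
            return null;
        }

        // SKBitmap.Decode returns null when the file cannot be decoded;
        // EncodeImage is documented to return null for a null image.
        using SKBitmap bitmap = SKBitmap.Decode(imagePath);
        float[] imageVec = engine.EncodeImage(bitmap);
        if (imageVec == null)
        {
            return null;
        }

        // May throw InvalidOperationException if the text model or tokenizer is missing.
        float[] textVec = engine.EncodeText(query);

        // Both vectors are L2-normalized, so the dot product is the cosine similarity.
        float dot = 0f;
        for (int i = 0; i < imageVec.Length; i++)
        {
            dot += imageVec[i] * textVec[i];
        }
        return dot;
    }
}
```

In a real application the engine would normally be created and initialized once and reused across many frames and queries, since loading two ONNX sessions for every call is likely to be expensive.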
SetContext(BaseContext)
Sets the logging context.
public void SetContext(BaseContext context)
Parameters
- context BaseContext
- The context.