loki2.cl.model_cl

Minimal projection-only alignment module.

Module Contents

class loki2.cl.model_cl.ProjectionCL(embed_dim: int = 512, modality_dims: Tuple[int, int] = (1280, 768), *, bias: bool = False, logit_scale_init: float = 1.0, num_layers: int = 1, hidden_dim: int | None = None, dropout: float = 0.0, max_logit_scale: float = 10.0, min_logit_scale: float | None = None)

Bases: torch.nn.Module

CLIP-style symmetric contrastive projector for paired embeddings.

Parameters:

embed_dim – Shared embedding dimensionality after projection. Defaults to 512.
modality_dims – Tuple containing input dimensions for the two modalities. Defaults to (1280, 768).
bias – Whether to enable bias terms in the projection layers. Defaults to False.
logit_scale_init – Initial value (not log-space) of the logit scale multiplier s. Defaults to 1.0.
num_layers – Number of layers in each projection head. Defaults to 1.
hidden_dim – Hidden dimension for intermediate projection layers. If None, embed_dim is used. Defaults to None.
dropout – Dropout rate for hidden layers. Defaults to 0.0.
max_logit_scale – Upper bound for s (enforces a minimum temperature 1 / s). Defaults to 10.0.
min_logit_scale – Optional lower bound for s. Defaults to None.

Raises:

ValueError – If modality_dims does not contain exactly two dimensions, num_layers is less than 1, dropout is not in [0, 1], logit_scale_init is not positive, max_logit_scale is not positive, or min_logit_scale is not in (0, max_logit_scale].
RuntimeError – If a projection head does not contain at least one linear layer.
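As a rough illustration of how num_layers, hidden_dim, dropout and bias might combine into a projection head, the following is a minimal sketch and not the module's actual implementation. The helper name build_head, the GELU activation and the ordering of layers inside each hidden block are all assumptions; only the parameter semantics are taken from the list above.

```python
from typing import Optional

import torch
from torch import nn


def build_head(in_dim: int, embed_dim: int, *, num_layers: int = 1,
               hidden_dim: Optional[int] = None, dropout: float = 0.0,
               bias: bool = False) -> nn.Sequential:
    """Hypothetical helper: a stack of Linear layers ending in embed_dim."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # hidden_dim falls back to embed_dim when it is not given.
    hidden = embed_dim if hidden_dim is None else hidden_dim
    layers = []
    dim = in_dim
    for _ in range(num_layers - 1):
        # Hidden block; the GELU activation is an assumption of this sketch.
        layers += [nn.Linear(dim, hidden, bias=bias), nn.GELU(), nn.Dropout(dropout)]
        dim = hidden
    # Final linear layer maps into the shared embedding space.
    layers.append(nn.Linear(dim, embed_dim, bias=bias))
    return nn.Sequential(*layers)


# One head per modality, using the documented default modality_dims (1280, 768).
proj_a = build_head(1280, 512)
proj_b = build_head(768, 512, num_layers=2, hidden_dim=256, dropout=0.1)
out_a = proj_a(torch.randn(4, 1280))
out_b = proj_b(torch.randn(4, 768))
```

With num_layers=1 the head reduces to a single linear map, which matches the "projection-only" description of the module; deeper heads only add hidden blocks in front of that final layer.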

hidden_dim

proj_a

proj_b

init_value

logit_scale

encode_a(features: torch.Tensor) → torch.Tensor

Encode features from modality A through its projection head.

Parameters: features – Input features for modality A.
Returns: Projected features for modality A.
Return type: torch.Tensor

encode_b(features: torch.Tensor) → torch.Tensor

Encode features from modality B through its projection head.

Parameters: features – Input features for modality B.
Returns: Projected features for modality B.
Return type: torch.Tensor

forward(features_a: torch.Tensor, features_b: torch.Tensor, *, return_loss: bool = False) → Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Project both modalities, compute scaled cosine similarities, and optionally return the CLIP-style contrastive loss.

Let f_i and g_i denote the raw embeddings for sample i from modalities A and B. After the projection heads and L2 normalisation we obtain unit vectors \hat f_i and \hat g_i. The learnable logit-scale parameter s = \exp(\text{logit\_scale}) plays the role of the inverse temperature 1/\tau. The logits matrix fed to the cross-entropy loss is

\[L_{ij} = \min(s, s_{\text{max}})\; \hat f_i^\top \hat g_j,\]

where s_{\text{max}} is the configured maximum scale (max_logit_scale). When return_loss is True we minimise the symmetric InfoNCE objective

\[\mathcal{L} = \tfrac{1}{2} \bigl[ \operatorname{CE}(L, I) + \operatorname{CE}(L^\top, I) \bigr],\]

where CE is the cross-entropy and I indexes the matching pairs along the diagonal.
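The two formulas above can be sketched directly in PyTorch. This is an illustrative reimplementation under stated assumptions, not the module's code: the function name symmetric_info_nce is hypothetical, the inputs are taken to be already-projected features, and only the upper clamp from the logits formula is applied.

```python
import torch
import torch.nn.functional as F


def symmetric_info_nce(proj_a: torch.Tensor, proj_b: torch.Tensor,
                       scale: float, max_scale: float = 10.0):
    """Hypothetical sketch of the logits and symmetric loss defined above."""
    # L2-normalise so that dot products become cosine similarities.
    f_hat = F.normalize(proj_a, dim=-1)
    g_hat = F.normalize(proj_b, dim=-1)
    # L_ij = min(s, s_max) * <f_hat_i, g_hat_j>
    s = min(scale, max_scale)
    logits_ab = s * (f_hat @ g_hat.t())
    logits_ba = logits_ab.t()
    # Matching pairs sit on the diagonal, so row i has target class i.
    targets = torch.arange(logits_ab.size(0))
    loss = 0.5 * (F.cross_entropy(logits_ab, targets)
                  + F.cross_entropy(logits_ba, targets))
    return logits_ab, logits_ba, loss


torch.manual_seed(0)
a = torch.randn(8, 512)  # stand-in for projected modality A features
b = torch.randn(8, 512)  # stand-in for projected modality B features
logits_ab, logits_ba, loss = symmetric_info_nce(a, b, scale=20.0)
```

Because the targets are the row indices, the loss is only well defined when both modalities supply the same number of samples, which is consistent with the ValueError documented for return_loss=True.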

Parameters:

features_a – Input features for modality A.
features_b – Input features for modality B.
return_loss – If True, also return the contrastive loss. Defaults to False.

Returns:

Tuple[torch.Tensor, torch.Tensor]: Logits for A-B and B-A similarity if return_loss=False.
Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Logits and loss if return_loss=True.

Return type:

Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]

Raises:

ValueError – If return_loss=True and batch sizes for both modalities are not equal.

current_logit_scale() → float

Return the effective logit scale as a Python float for logging.
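The relationship between the stored parameter, the bounds and the effective scale can be sketched in plain Python. The function name effective_scale is hypothetical, and applying min_logit_scale as a lower clamp at read time is an assumption; the log-space storage follows from s = exp(logit_scale) and the note that logit_scale_init is given outside log space.

```python
import math
from typing import Optional


def effective_scale(log_scale: float, max_scale: float = 10.0,
                    min_scale: Optional[float] = None) -> float:
    """Hypothetical sketch: exponentiate the stored value, then apply bounds."""
    s = math.exp(log_scale)
    # Upper bound: caps s, which enforces a minimum temperature 1 / max_scale.
    s = min(s, max_scale)
    # Optional lower bound (assumed to be applied the same way).
    if min_scale is not None:
        s = max(s, min_scale)
    return s


# logit_scale_init is given in linear space, so the stored value is its log.
log_init = math.log(1.0)
initial = effective_scale(log_init)
capped = effective_scale(math.log(50.0))
floored = effective_scale(math.log(0.01), min_scale=0.5)
```

Under these assumptions the default configuration starts at a scale of 1.0 (temperature 1.0) and can never exceed a scale of 10.0 (temperature 0.1).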