loki2.cl.model_cl

Minimal projection-only alignment module.

Module Contents

class loki2.cl.model_cl.ProjectionCL(embed_dim: int = 512, modality_dims: Tuple[int, int] = (1280, 768), *, bias: bool = False, logit_scale_init: float = 1.0, num_layers: int = 1, hidden_dim: int | None = None, dropout: float = 0.0, max_logit_scale: float = 10.0, min_logit_scale: float | None = None)

Bases: torch.nn.Module

CLIP-style symmetric contrastive projector for paired embeddings.

Parameters:
  • embed_dim – Shared embedding dimensionality after projection. Defaults to 512.

  • modality_dims – Tuple containing input dimensions for the two modalities. Defaults to (1280, 768).

  • bias – Whether to enable bias terms in the projection layers. Defaults to False.

  • logit_scale_init – Initial value (not log-space) of the logit scale multiplier s. Defaults to 1.0.

  • num_layers – Number of layers in each projection head. Defaults to 1.

  • hidden_dim – Hidden dimension for intermediate projection layers. If None, embed_dim is used. Defaults to None.

  • dropout – Dropout rate for hidden layers. Defaults to 0.0.

  • max_logit_scale – Upper bound for s, which enforces a minimum temperature of 1 / max_logit_scale. Defaults to 10.0.

  • min_logit_scale – Optional lower bound for s. Defaults to None.

Raises:
  • ValueError – If modality_dims does not contain exactly two dimensions, num_layers is less than 1, dropout is not in [0, 1], logit_scale_init is not positive, max_logit_scale is not positive, or min_logit_scale is not in (0, max_logit_scale].

  • RuntimeError – If a projection head does not contain at least one linear layer.
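The ValueError conditions listed above can be sketched as a standalone check. This is a minimal illustration written from the documented conditions only; validate_projection_config is a hypothetical helper, not part of loki2, and the actual constructor may order or word its checks differently.

```python
from typing import Optional, Sequence


def validate_projection_config(
    modality_dims: Sequence[int] = (1280, 768),
    num_layers: int = 1,
    dropout: float = 0.0,
    logit_scale_init: float = 1.0,
    max_logit_scale: float = 10.0,
    min_logit_scale: Optional[float] = None,
) -> None:
    """Hypothetical helper mirroring the documented ValueError conditions."""
    if len(modality_dims) != 2:
        raise ValueError("modality_dims must contain exactly two dimensions")
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    # The documentation states the closed interval [0, 1].
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must lie in [0, 1]")
    if logit_scale_init <= 0:
        raise ValueError("logit_scale_init must be positive")
    if max_logit_scale <= 0:
        raise ValueError("max_logit_scale must be positive")
    # The lower bound is optional; when given it must lie in (0, max_logit_scale].
    if min_logit_scale is not None and not 0 < min_logit_scale <= max_logit_scale:
        raise ValueError("min_logit_scale must lie in (0, max_logit_scale]")
```

Called with the documented defaults the helper returns silently; passing, for example, min_logit_scale=20.0 with max_logit_scale=10.0 raises ValueError.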

hidden_dim
proj_a
proj_b
init_value
logit_scale
encode_a(features: torch.Tensor) → torch.Tensor

Encode features from modality A through its projection head.

Parameters:

features – Input features for modality A.

Returns:

Projected features for modality A.

Return type:

torch.Tensor

encode_b(features: torch.Tensor) → torch.Tensor

Encode features from modality B through its projection head.

Parameters:

features – Input features for modality B.

Returns:

Projected features for modality B.

Return type:

torch.Tensor

forward(features_a: torch.Tensor, features_b: torch.Tensor, *, return_loss: bool = False) → Tuple[torch.Tensor, torch.Tensor] | Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Project both modalities, compute scaled cosine similarities, and optionally return the CLIP-style contrastive loss.

Let f_i and g_i denote the raw embeddings for sample i from modalities A and B. After the projection heads and L2 normalisation we obtain unit vectors \hat f_i and \hat g_i. The learnable logit-scale parameter s = \exp(\text{logit\_scale}) plays the role of the inverse temperature 1/\tau. The logits matrix we feed to the cross-entropy loss is

\[L_{ij} = \min(s, s_{\text{max}})\; \hat f_i^\top \hat g_j,\]

where s_{\text{max}} is the configured maximum scale (max_logit_scale). When return_loss is True we minimise the symmetric InfoNCE objective

\[\mathcal{L} = \tfrac{1}{2} \bigl[ \operatorname{CE}(L, I) + \operatorname{CE}(L^\top, I) \bigr],\]

where CE is the cross-entropy and I denotes the target indices: row i of L is matched to column i, so the positive pairs lie on the diagonal.
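The objective above can be worked through on a tiny batch in plain Python. This is a dependency-free sketch rather than the module's torch implementation: the function names are hypothetical, the inputs are assumed to be already-projected vectors, and the mean over the batch inside CE is an assumption (the text does not state the reduction).

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: List[List[float]]) -> float:
    """Cross-entropy with target column i for row i, averaged over rows (assumed)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(
    proj_a: Sequence[Sequence[float]],
    proj_b: Sequence[Sequence[float]],
    s: float,
    s_max: float,
) -> float:
    """Symmetric InfoNCE over logits L_ij = min(s, s_max) * <f_i, g_j>."""
    a_hat = [l2_normalize(v) for v in proj_a]
    b_hat = [l2_normalize(v) for v in proj_b]
    scale = min(s, s_max)
    logits_ab = [
        [scale * sum(x * y for x, y in zip(f, g)) for g in b_hat] for f in a_hat
    ]
    # The B-to-A logits are the transpose of the A-to-B logits.
    logits_ba = [list(col) for col in zip(*logits_ab)]
    return 0.5 * (cross_entropy_diagonal(logits_ab) + cross_entropy_diagonal(logits_ba))


# Two perfectly aligned, mutually orthogonal pairs.
pairs = [[1.0, 0.0], [0.0, 1.0]]
loss_low_scale = symmetric_info_nce(pairs, pairs, s=1.0, s_max=10.0)
loss_high_scale = symmetric_info_nce(pairs, pairs, s=10.0, s_max=10.0)
```

For aligned pairs a larger scale sharpens the softmax and lowers the loss, which is why the upper bound s_{\text{max}} matters: without it the scale could grow without limit on easy batches.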

Parameters:
  • features_a – Input features for modality A.

  • features_b – Input features for modality B.

  • return_loss – If True, also return the contrastive loss. Defaults to False.

Returns:

  • Tuple[torch.Tensor, torch.Tensor]: Logits for A-B and B-A similarity if return_loss=False.

  • Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Logits and loss if return_loss=True.

Return type:

Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]]

Raises:

ValueError – If return_loss=True and the two modalities have different batch sizes.

current_logit_scale() → float

Return the effective logit scale as a Python float for logging.
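A minimal sketch of how an effective scale could be derived from the log-space parameter, assuming s = exp(logit_scale) is capped at max_logit_scale and, when min_logit_scale is given, floored by the same kind of clamp. effective_logit_scale is a hypothetical helper for illustration; the module's actual clamping may differ in detail.

```python
import math
from typing import Optional


def effective_logit_scale(
    log_scale: float,
    max_scale: float = 10.0,
    min_scale: Optional[float] = None,
) -> float:
    """Map a log-space parameter to a bounded logit scale (assumed behaviour)."""
    s = math.exp(log_scale)  # s = exp(logit_scale)
    s = min(s, max_scale)    # upper bound, i.e. minimum temperature 1 / max_scale
    if min_scale is not None:
        s = max(s, min_scale)  # optional lower bound
    return s


# logit_scale_init = 1.0 corresponds to a log-space value of log(1.0) = 0.0.
initial_scale = effective_logit_scale(math.log(1.0))
capped_scale = effective_logit_scale(math.log(50.0))
```

Logging this bounded value rather than the raw parameter shows the scale that actually multiplies the cosine similarities.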