
Scaled Dot-Product Attention

Core attention mechanism: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as introduced in the original Transformer paper (Vaswani et al., 2017).

Parameters:

  • d_k: Dimension of the key (and query) vectors; used for the scaling factor sqrt(d_k)

Shape Contract:

  • Input query: [*, seq_q, d_k] query vectors
  • Input key: [*, seq_k, d_k] key vectors
  • Input value: [*, seq_v, d_v] value vectors (seq_v == seq_k)
  • Output: [*, seq_q, d_v] attention-weighted values

Notes:

  • Scaling by sqrt(d_k) prevents softmax saturation; without it, dot-product magnitudes grow with d_k and push the softmax toward near-one-hot outputs with vanishing gradients
  • No learnable parameters (projections are in outer layer)
  • Building block for multi-head attention
  • Can take an optional attention mask for causal or padding masking (see the sketch below)
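
Since the neuron has no learnable parameters, the whole computation reduces to a few tensor operations. The following is a minimal PyTorch sketch of the same contract, not the module's actual source; the function name and the boolean-mask convention (False marks positions to ignore) are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    # query: [*, seq_q, d_k], key: [*, seq_k, d_k], value: [*, seq_k, d_v]
    d_k = query.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k): [*, seq_q, seq_k]
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Illustrative mask convention: False entries are excluded from attention
        scores = scores.masked_fill(~mask, float("-inf"))
    # Attention distribution over keys, then weighted sum of values: [*, seq_q, d_v]
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, value)
```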

Signature

neuron ScaledDotProductAttention(d_k)

Ports

Inputs:

  • query: [*, seq_q, d_k]
  • key: [*, seq_k, d_k]
  • value: [*, seq_v, d_v]

Outputs:

  • default: [*, seq_q, d_v]
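
As an illustration of the port shapes, PyTorch 2.x ships a functional equivalent, torch.nn.functional.scaled_dot_product_attention, which follows the same [*, seq_q, d_k], [*, seq_k, d_k], [*, seq_v, d_v] -> [*, seq_q, d_v] convention; the concrete sizes below are arbitrary examples.

```python
import torch
import torch.nn.functional as F

# Arbitrary example sizes: batch = 2, seq_q = 5, seq_k = seq_v = 7, d_k = 64, d_v = 32
query = torch.randn(2, 5, 64)
key = torch.randn(2, 7, 64)
value = torch.randn(2, 7, 32)

out = F.scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 32]), i.e. [*, seq_q, d_v]
```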

Implementation

Source { source: "core", path: "attention/ScaledDotProductAttention" }