MultiHeadSelfAttention
Multi-Head Self-Attention
Complete multi-head self-attention mechanism where queries, keys, and values all come from the same input. Core component of transformer architectures.
Parameters:
- dim: Model dimension (d_model)
- num_heads: Number of attention heads (dim must be divisible by num_heads)
Shape Contract:
- Input: [*, seq_len, dim] sequence of embeddings
- Output: [*, seq_len, dim] attended sequence (same shape)
Notes:
- Includes Q, K, V projections and output projection
- head_dim = dim / num_heads
- Each head attends independently; the results are concatenated (see the sketch after these notes)
- Self-attention: Q, K, V all derived from same input
- Can support causal masking for autoregressive models
- Used in BERT, GPT, and virtually all transformers
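Reference sketch
The following is a minimal PyTorch sketch of the mechanism described in the notes above, not the core implementation referenced under Implementation. The class name mirrors the neuron for readability, and the optional causal flag is an assumption used to illustrate the causal-masking note.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Q, K, V projections and the output projection (see Notes above).
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, causal: bool = False) -> torch.Tensor:
        # x: [*, seq_len, dim] -> output: [*, seq_len, dim]
        *batch, seq_len, dim = x.shape

        # Project, then split into heads: [*, num_heads, seq_len, head_dim]
        def split(t):
            return t.view(*batch, seq_len, self.num_heads, self.head_dim).transpose(-3, -2)

        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        if causal:
            # Mask out positions to the right of each query (autoregressive use).
            mask = torch.triu(
                torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
            )
            scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v  # [*, num_heads, seq_len, head_dim]

        # Concatenate heads and apply the output projection.
        out = out.transpose(-3, -2).reshape(*batch, seq_len, dim)
        return self.out_proj(out)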
Signature
neuron MultiHeadSelfAttention(dim, num_heads)
Ports
Inputs:
default: [*, seq_len, dim]
Outputs:
default: [*, seq_len, dim]
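Usage
A short usage example against the sketch above, checking that the output matches the [*, seq_len, dim] shape contract; the dimensions chosen here are arbitrary.
attn = MultiHeadSelfAttention(dim=512, num_heads=8)
x = torch.randn(2, 128, 512)        # [batch, seq_len, dim]
y = attn(x)                         # bidirectional attention (BERT-style)
y_causal = attn(x, causal=True)     # causal masking (GPT-style)
assert y.shape == x.shape and y_causal.shape == x.shape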
Implementation
Source { source: "core", path: "attention/MultiHeadSelfAttention" }