Implement Mixed-Precision Self-Attention

In this task, you will modify a self-attention mechanism to support mixed-precision computation. Specifically, the matrix multiplications should be performed in 16-bit precision (FP16), while numerically sensitive operations such as the softmax should be kept in 32-bit precision (FP32) to preserve numerical stability. The function takes the query, key, and value tensors as input and returns the output of the self-attention mechanism. A sketch of one possible implementation follows the constraints below.

Constraints

  • The input tensors (query, key, value) are provided in FP16.
  • Matrix multiplications should be performed in FP16 to leverage their computational efficiency.
  • The softmax operation should be performed in FP32 to ensure numerical stability.
  • The final output should be in FP16.
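
The following is a minimal PyTorch sketch of one way to satisfy these constraints. It assumes single-head scaled dot-product attention (scores divided by sqrt(d_model), no masking, no learned projections) and a device where FP16 matrix multiplication is supported (typically a CUDA GPU); the function name mixed_precision_self_attention is illustrative, not required.

import math
import torch

def mixed_precision_self_attention(query: torch.Tensor,
                                    key: torch.Tensor,
                                    value: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product self-attention with mixed precision.

    query, key, value: (batch_size, seq_len, d_model) tensors in FP16.
    Returns a (batch_size, seq_len, d_model) tensor in FP16.
    """
    d_model = query.size(-1)

    # FP16 matmul: attention scores, scaled by sqrt(d_model)
    # (assumed scaling convention; the task statement does not specify it).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_model)

    # Softmax in FP32 for numerical stability, then cast back to FP16.
    weights = torch.softmax(scores.float(), dim=-1).half()

    # FP16 matmul: weighted sum of the values; output stays in FP16.
    return torch.matmul(weights, value)
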

Examples

Example 1

{
  "input": {
    "query": "A tensor of shape (batch_size, seq_len, d_model) in FP16",
    "key": "A tensor of shape (batch_size, seq_len, d_model) in FP16",
    "value": "A tensor of shape (batch_size, seq_len, d_model) in FP16"
  },
  "output": "A tensor of shape (batch_size, seq_len, d_model) in FP16 representing the output of the self-attention mechanism"
}

Example 2

{
  "input": {
    "query": "A tensor of shape (1, 10, 512) in FP16",
    "key": "A tensor of shape (1, 10, 512) in FP16",
    "value": "A tensor of shape (1, 10, 512) in FP16"
  },
  "output": "A tensor of shape (1, 10, 512) in FP16"
}
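
A usage sketch matching Example 2, assuming the mixed_precision_self_attention function above and a CUDA device with FP16 support:

import torch

q = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")
k = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")
v = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")

out = mixed_precision_self_attention(q, k, v)
assert out.dtype == torch.float16
assert out.shape == (1, 10, 512)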
