In this task, you are required to modify a self-attention mechanism to support mixed-precision computation. Specifically, the matrix multiplications (the query-key product and the attention-weighted sum over the values) should be performed in 16-bit precision (FP16), while numerically sensitive operations such as the softmax should be kept in 32-bit precision (FP32) to preserve numerical stability. The function should take the query, key, and value tensors as input and return the output of the self-attention mechanism.
{
  "input": {
    "query": "A tensor of shape (batch_size, seq_len, d_model) in FP16",
    "key": "A tensor of shape (batch_size, seq_len, d_model) in FP16",
    "value": "A tensor of shape (batch_size, seq_len, d_model) in FP16"
  },
  "output": "A tensor of shape (batch_size, seq_len, d_model) in FP16 representing the output of the self-attention mechanism"
}
{
  "input": {
    "query": "A tensor of shape (1, 10, 512) in FP16",
    "key": "A tensor of shape (1, 10, 512) in FP16",
    "value": "A tensor of shape (1, 10, 512) in FP16"
  },
  "output": "A tensor of shape (1, 10, 512) in FP16"
}
Use Python data or a natural language description.
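Below is a minimal sketch of such a function in PyTorch. The function name mixed_precision_attention is an illustrative assumption, and the sketch assumes the tensors live on a device that supports half-precision matrix multiplication (e.g., a CUDA GPU).

import math
import torch

def mixed_precision_attention(query, key, value):
    # query, key, value: FP16 tensors of shape (batch_size, seq_len, d_model).
    d_model = query.size(-1)

    # Scaled dot-product scores computed in FP16 (the inputs are already half precision).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_model)

    # Softmax in FP32 for numerical stability, then cast back to FP16.
    weights = torch.softmax(scores.float(), dim=-1).half()

    # Attention-weighted sum of the values in FP16.
    return torch.matmul(weights, value)

For the concrete example above, this could be exercised as follows (device choice is an assumption):

q = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")
k = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")
v = torch.randn(1, 10, 512, dtype=torch.float16, device="cuda")
out = mixed_precision_attention(q, k, v)
# out.shape == (1, 10, 512), out.dtype == torch.float16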