Apply a padding mask to the attention scores, ensuring that padding elements do not contribute to the attention distribution. You are given a pre-implemented `MultiHeadAttention` class and should integrate your masking implementation with it.
{
"input": {
"attention_scores": "A tensor representing the raw attention scores between elements of a sequence. Example shape: (batch_size, sequence_length, sequence_length)",
"padding_mask": "A tensor representing the padding mask, where 1 indicates a real element and 0 indicates padding. Example shape: (batch_size, sequence_length)"
},
"output": "A tensor of the same shape as attention_scores, where the attention scores corresponding to padding positions have been modified to ensure they do not contribute to the attention distribution."
}
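A minimal sketch of the masking step is shown below. It assumes PyTorch tensors; the function name `apply_padding_mask` and the choice of `-inf` as the fill value are assumptions made here, not part of the provided `MultiHeadAttention` class.

```python
import torch


def apply_padding_mask(attention_scores: torch.Tensor,
                       padding_mask: torch.Tensor) -> torch.Tensor:
    """Mask out padding positions in raw attention scores.

    attention_scores: (batch_size, seq_len, seq_len)
    padding_mask:     (batch_size, seq_len), 1 = real element, 0 = padding
    """
    # Broadcast the mask over the query dimension so that every query row
    # ignores the padded key positions: shape (batch_size, 1, seq_len).
    key_mask = padding_mask.unsqueeze(1)
    # Setting masked scores to -inf drives their softmax weight to zero,
    # so padded positions receive no attention.
    return attention_scores.masked_fill(key_mask == 0, float("-inf"))
```

Filling with a large negative constant such as -1e9 instead of `-inf` is a common alternative; it avoids NaNs when an entire sequence in the batch consists of padding.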
{
"input": {
"attention_scores": "A tensor with all elements being real (no padding). Example shape: (batch_size, sequence_length, sequence_length)",
"padding_mask": "A tensor with all elements set to 1, indicating no padding. Example shape: (batch_size, sequence_length)"
},
"output": "The original attention_scores array, as no modification was necessary due to the absence of padding."
}
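A short usage example covering both cases above (the shapes and values are illustrative only):

```python
import torch

batch_size, seq_len = 2, 4
scores = torch.randn(batch_size, seq_len, seq_len)

# Case 1: the last position of the first sequence is padding.
mask = torch.tensor([[1, 1, 1, 0],
                     [1, 1, 1, 1]])
weights = torch.softmax(apply_padding_mask(scores, mask), dim=-1)
print(weights[0, :, -1])  # all zeros: the padded key receives no attention

# Case 2: no padding, so the scores pass through unchanged.
all_real = torch.ones(batch_size, seq_len, dtype=torch.long)
assert torch.equal(apply_padding_mask(scores, all_real), scores)
```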
Use Python data or a natural language description.