Implement Padding Mask Handling in Attention Calculations

Apply a padding mask to the attention scores so that padding positions do not contribute to the attention distribution. You are given a pre-implemented `MultiHeadAttention` class; integrate your masking logic into it.

Constraints

  • The function should handle batches of sequences, not just individual sequences.
  • The padding mask is applied such that positions with a value of 0 in the mask (indicating padding) should have their corresponding attention scores set to a large negative value.
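One way to satisfy both constraints is sketched below, assuming PyTorch tensors and the shapes from the examples; the helper name `apply_padding_mask` and the constant `-1e9` (standing in for "a large negative value") are illustrative choices, not part of the given `MultiHeadAttention` class.

```python
import torch

def apply_padding_mask(attention_scores: torch.Tensor,
                       padding_mask: torch.Tensor) -> torch.Tensor:
    """Mask out attention scores that point at padding positions.

    attention_scores: (batch_size, seq_len, seq_len) raw (pre-softmax) scores.
    padding_mask:     (batch_size, seq_len); 1 marks a real token, 0 marks padding.
    """
    # Broadcast the mask over the query dimension: (batch, 1, seq_len),
    # so every query row masks the same set of padded key positions.
    key_mask = padding_mask.unsqueeze(1).to(torch.bool)
    # Set scores for padded keys to a large negative value; after softmax
    # these positions receive effectively zero attention weight.
    return attention_scores.masked_fill(~key_mask, -1e9)
```

If the scores carry an extra head dimension, e.g. (batch_size, num_heads, seq_len, seq_len), the same idea applies with the mask unsqueezed to (batch, 1, 1, seq_len) so it broadcasts over heads as well.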

Examples

Example 1

{
  "input": {
    "attention_scores": "A tensor representing the raw attention scores between elements of a sequence. Example shape: (batch_size, sequence_length, sequence_length)",
    "padding_mask": "A tensor representing the padding mask, where 1 indicates a real element and 0 indicates padding. Example shape: (batch_size, sequence_length)"
  },
  "output": "A tensor of the same shape as attention_scores, where the attention scores corresponding to padding positions have been modified to ensure they do not contribute to the attention distribution."
}

Example 2

{
  "input": {
    "attention_scores": "A tensor with all elements being real (no padding). Example shape: (batch_size, sequence_length, sequence_length)",
    "padding_mask": "A tensor with all elements set to 1, indicating no padding. Example shape: (batch_size, sequence_length)"
  },
  "output": "The original attention_scores array, as no modification was necessary due to the absence of padding."
}
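A quick check against the two examples, reusing the hypothetical `apply_padding_mask` sketch above (the tensor values here are made up for illustration):

```python
import torch

batch_size, seq_len = 2, 4
scores = torch.randn(batch_size, seq_len, seq_len)

# The second sequence has its last two positions padded.
padding_mask = torch.tensor([[1, 1, 1, 1],
                             [1, 1, 0, 0]])

masked_scores = apply_padding_mask(scores, padding_mask)
weights = torch.softmax(masked_scores, dim=-1)

# Example 1: columns for padded keys in the second sequence carry ~0 weight.
print(weights[1, :, 2:])

# Example 2: the fully unpadded first sequence matches the unmasked softmax.
print(torch.allclose(weights[0], torch.softmax(scores[0], dim=-1)))  # True
```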
