Implement a Transformer Encoder Layer

Implement a PyTorch module for a single Transformer encoder layer. You should implement the forward pass of the layer, including the multi-head self-attention mechanism, the position-wise feed-forward network, and the necessary residual connections and layer normalizations. The implementation should be efficient and handle batched inputs.

Constraints

  • Do not use pre-built transformer layers from libraries like Hugging Face or PyTorch's nn.Transformer.
  • The implementation should efficiently handle batched inputs.
  • You may use the MultiHeadAttention module you implemented in the previous question (a minimal stand-in with the assumed interface is sketched after this list).
  • You may use PyTorch's built-in layer normalization module (nn.LayerNorm); residual connections can be implemented as plain tensor additions.
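
The encoder-layer sketch later in this write-up calls a MultiHeadAttention module. Since the actual module comes from the previous question, the following is only a minimal from-scratch stand-in that shows the interface assumed here: a constructor taking d_model and num_heads, and forward(query, key, value, mask=None). If your module from the previous question uses a different interface, adapt the calls accordingly.

```python
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Minimal stand-in for the module from the previous question (assumed interface)."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate projections for queries, keys, values, and the output.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return t.view(batch_size, -1, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(query))
        k = split_heads(self.w_k(key))
        v = split_heads(self.w_v(value))

        # Scaled dot-product attention, batched over (batch, num_heads).
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            # mask is expected to broadcast to (batch, num_heads, q_len, k_len).
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = attn @ v

        # Merge heads back: (batch, num_heads, seq_len, d_head) -> (batch, seq_len, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_head)
        return self.w_o(out)
```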

Examples

Example 1

{
  "input": "x = torch.randn(10, 32, 512)  # (batch_size, sequence_length, d_model)",
  "output": "A tensor of shape (10, 32, 512) representing the encoded features."
}

Example 2

{
  "input": "x = torch.randn(1, 64, 512)  # (batch_size, sequence_length, d_model)",
  "output": "A tensor of shape (1, 64, 512) representing the encoded features."
}

Code
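
A minimal sketch of a possible solution, using the post-norm ("Add & Norm") arrangement from the original Transformer and the MultiHeadAttention interface sketched under Constraints. The hyperparameter names (d_model, num_heads, d_ff, dropout) are illustrative choices, not mandated by the problem.

```python
import torch
import torch.nn as nn


class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Self-attention sublayer; MultiHeadAttention is the module from the
        # previous question (a stand-in is sketched under Constraints above).
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        # Position-wise feed-forward network: Linear -> ReLU -> Linear.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer normalization and dropout from PyTorch, as the constraints allow.
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: (batch_size, sequence_length, d_model); all operations are batched.
        attn_out = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))   # residual connection + post-norm
        ffn_out = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_out))    # residual connection + post-norm
        return x


# Shape check matching Example 1: the output shape equals the input shape.
layer = TransformerEncoderLayer(d_model=512, num_heads=8, d_ff=2048)
x = torch.randn(10, 32, 512)  # (batch_size, sequence_length, d_model)
assert layer(x).shape == (10, 32, 512)
```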
