Implement Multi-Layer Transformer Decoder with Causal Masking

Implement a multi-layer Transformer decoder with causal masking. Your implementation should include the multi-head self-attention mechanism, position-wise feed-forward networks, and the causal masking needed to preserve the autoregressive property of the decoder. The function should take as input a batched sequence of embeddings of shape (batch_size, sequence_length, d_model) and return the processed sequence, with the same shape, after applying the Transformer decoder layers.
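As a minimal sketch of the masking step (the helper name causal_mask is an illustrative assumption, not part of the required interface), positions above the diagonal of the attention score matrix are set to -inf before the softmax, so token i cannot attend to any token j > i:

import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True marks future positions that must be hidden from the current token.
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = torch.randn(1, 2, 4, 4)                           # (batch, heads, seq, seq) dummy attention scores
scores = scores.masked_fill(causal_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)                    # each row sums to 1 over past/current positions only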

Constraints

  • The implementation must ensure that the causal masking is correctly applied to prevent any future information leakage.
  • The function should be able to handle variable sequence lengths and embedding dimensions as specified by the input parameters.
  • You are allowed to use standard Python libraries and PyTorch for tensor operations. Do not use high-level Transformer implementations from libraries like Hugging Face or TensorFlow.
  • You can reuse the components you implemented in the previous questions; one way to assemble them is sketched after this list.
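A possible end-to-end structure is sketched below, assuming standard post-norm residual blocks and ReLU in the feed-forward sublayer; the class names (MultiHeadSelfAttention, DecoderLayer, TransformerDecoder) are illustrative choices, not names required by the problem:

import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, heads, seq, d_head).
        q, k, v = [z.view(b, t, self.num_heads, self.d_head).transpose(1, 2) for z in (q, k, v)]
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Causal mask: token i may only attend to positions <= i.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, t, d)
        return self.out(out)

class DecoderLayer(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-norm residual blocks, as in the original Transformer.
        x = self.norm1(x + self.attn(x))
        x = self.norm2(x + self.ff(x))
        return x

class TransformerDecoder(nn.Module):
    def __init__(self, num_layers: int, num_heads: int, d_model: int, d_ff: int):
        super().__init__()
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, sequence_length, d_model) -> same shape out.
        for layer in self.layers:
            x = layer(x)
        return x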

Examples

Example 1

{
  "input": "embeddings = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]], num_layers = 2, num_heads = 2, d_model = 2, d_ff = 4",
  "output": "Processed sequence of embeddings after applying the Transformer decoder layers, with shape maintained as (sequence_length, d_model)."
}

Example 2

{
  "input": "embeddings = [[[0.7, 0.8, 0.9], [0.2, 0.3, 0.4]]], num_layers = 1, num_heads = 3, d_model = 3, d_ff = 6",
  "output": "Processed sequence of embeddings after applying the Transformer decoder layers, with shape maintained as (sequence_length, d_model)."
}
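A hedged usage sketch for Example 1, reusing the illustrative TransformerDecoder class from above; because the weights are randomly initialized, only the output shape (not the exact values) is deterministic:

import torch

embeddings = torch.tensor([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])  # (batch_size=1, sequence_length=3, d_model=2)
decoder = TransformerDecoder(num_layers=2, num_heads=2, d_model=2, d_ff=4)
output = decoder(embeddings)
print(output.shape)  # torch.Size([1, 3, 2]): the (batch_size, sequence_length, d_model) shape is preserved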
