Implement and compare two variants of the Transformer model's layer normalization (LN) placement: Pre-LN and Post-LN. In a Pre-LN block, layer normalization is applied to the input of each sub-layer, inside the residual branch, so the block computes x + Sublayer(LN(x)); in a Post-LN block, normalization is applied after the residual addition, i.e. LN(x + Sublayer(x)). Implement the forward pass for both architectures, including the necessary components such as multi-head attention and feed-forward networks.
{
"input": "A sequence of embeddings with shape (batch_size, sequence_length, embedding_dim)",
"output": "The output of the model after processing the sequence, with shape (batch_size, sequence_length, embedding_dim)"
}
Present your solution as Python code or a natural language description.
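
The following is a minimal sketch of one possible solution, assuming PyTorch is available; the class name TransformerBlock, the pre_ln flag, and the hyperparameters (num_heads, ffn_dim) are illustrative choices, not part of the task statement.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One Transformer encoder block whose LN placement is configurable."""

    def __init__(self, embedding_dim: int, num_heads: int, ffn_dim: int, pre_ln: bool = True):
        super().__init__()
        self.pre_ln = pre_ln
        self.attn = nn.MultiheadAttention(embedding_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(embedding_dim, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, embedding_dim),
        )
        self.ln1 = nn.LayerNorm(embedding_dim)
        self.ln2 = nn.LayerNorm(embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, sequence_length, embedding_dim)
        if self.pre_ln:
            # Pre-LN: normalize the sub-layer input, then add the residual.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.ln2(x))
        else:
            # Post-LN: add the residual first, then normalize the sum.
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.ffn(x))
        return x

if __name__ == "__main__":
    batch_size, sequence_length, embedding_dim = 2, 16, 64
    x = torch.randn(batch_size, sequence_length, embedding_dim)
    pre = TransformerBlock(embedding_dim, num_heads=4, ffn_dim=256, pre_ln=True)
    post = TransformerBlock(embedding_dim, num_heads=4, ffn_dim=256, pre_ln=False)
    print(pre(x).shape, post(x).shape)  # both (2, 16, 64), matching the input shape

Both variants preserve the (batch_size, sequence_length, embedding_dim) shape; the only difference is where the LayerNorm sits relative to the residual addition.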