Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response for a question with a given scene, video, audio and the history of previous turns in the dialog. Existing systems for this task are built upon the transformers or recurrent neural network architecture using the encoder-decoder framework. This technique shows superior performance, however, it has major limitations: the model easily overfits only to memorize the grammatical patterns; the model follows the prior distribution of the vocabularies in a dataset. To alleviate the problem, we propose a Multimodal Semantic Transformer Network. It employs a transformer-based architecture with an attention-based word embedding layer that generates words by querying word embeddings. With this design, our model keeps considering the meaning of the words at the generation stage. The empirical results demonstrate the superiority of our proposed model that outperforms all the previous works for the AVSD task.