webApp_DoVoiceInteraction
#model_training#
The model_training folder contains the main code for training the speaker embedding model. Most of the code is inherited from the publicly available git repository at https://github.com/RF5/simple-speaker-embedding. The model is a GRU network preceded by a convolutional encoder; the GRU has three layers with 786 hidden units each, and the model operates on raw waveforms. The loss function is the GE2E loss introduced by Li Wan et al. (https://arxiv.org/abs/1710.10467). Training uses the VoxCeleb1 dataset, which we split into training, validation, and test sets in an 8:1:1 ratio. The trained model is stored in the 'convgru_ckpt_forvoxceleb1_strip.pt' file for local reference and may be updated in the future. The script 'show_the_tag.py' uses the model to generate embeddings and computes speaker similarity from the cosine distance between them. To make the clustering/identification work in an online manner, for each newly recorded utterance we compute the cosine distance between the new embedding and the embeddings of all existing speakers; the utterance is given a new label if the distance to every existing speaker exceeds the threshold (i.e. no existing speaker is similar enough), and is otherwise labelled as the matching existing speaker.
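For reference, below is a minimal sketch of this online labelling step. The torch.hub entrypoint name, the way the checkpoint is loaded, the threshold value, and the class name are assumptions made for illustration and may differ from what show_the_tag.py actually does.

```python
# Sketch of online speaker labelling via cosine similarity of embeddings.
# Assumptions: the 'convgru_embedder' hub entrypoint, a state_dict checkpoint,
# and an illustrative threshold; these are not taken from show_the_tag.py.
import torch
import torch.nn.functional as F

model = torch.hub.load('RF5/simple-speaker-embedding', 'convgru_embedder')
state = torch.load('convgru_ckpt_forvoxceleb1_strip.pt', map_location='cpu')
model.load_state_dict(state)   # assumption: the .pt file holds a plain state_dict
model.eval()

class OnlineSpeakerLabeler:
    """Keeps one embedding per known speaker and labels new utterances on the fly."""
    def __init__(self, embedder, threshold=0.75):
        self.embedder = embedder
        self.threshold = threshold   # assumed value; tune on the validation split
        self.speakers = []           # list of (label, normalized embedding)

    def label(self, waveform: torch.Tensor) -> int:
        # assumption: the embedder maps a raw-waveform batch [1, samples] to [1, dim]
        with torch.no_grad():
            emb = self.embedder(waveform)
        emb = F.normalize(emb, dim=-1).squeeze(0)
        if self.speakers:
            sims = [F.cosine_similarity(emb, e, dim=0).item() for _, e in self.speakers]
            best = max(range(len(sims)), key=sims.__getitem__)
            # similar enough to an existing speaker: reuse that speaker's label
            if sims[best] >= self.threshold:
                return self.speakers[best][0]
        # otherwise register the utterance as a new speaker
        new_label = len(self.speakers)
        self.speakers.append((new_label, emb))
        return new_label

labeler = OnlineSpeakerLabeler(model)
```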
#web_server#
The web server is based on Django and provides several functionalities for users. The upload function prompts the user to upload a raw WAV file with a maximum size of 10 MB. The transcription function immediately transcribes the speech into text. The label function determines whether the speaker in the currently uploaded audio file is the same speaker as in the previously uploaded files; the default label is -1. Labelling takes slightly longer because the embedding of the currently uploaded file is compared against all existing speaker embeddings.
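As an orientation only, the sketch below shows how such an upload endpoint could enforce the 10 MB limit and return the default -1 label. The view name, form field, and the transcribe/identify_speaker helpers are hypothetical placeholders, not the project's actual Django views.

```python
# Illustrative Django view enforcing the 10 MB WAV limit and the default -1 label.
# View/field names and the transcribe/identify_speaker helpers are hypothetical stubs.
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

MAX_UPLOAD_BYTES = 10 * 1024 * 1024   # 10 MB cap on uploaded WAV files

def transcribe(audio_file):
    # Placeholder: the real server transcribes the uploaded speech to text here.
    return ""

def identify_speaker(audio_file):
    # Placeholder: compare the file's embedding with all stored speaker embeddings;
    # -1 is the default label when no match has been established yet.
    return -1

@csrf_exempt
def upload_audio(request):
    if request.method != 'POST' or 'audio' not in request.FILES:
        return JsonResponse({'error': 'POST a WAV file in the "audio" field'}, status=400)

    audio = request.FILES['audio']
    if audio.size > MAX_UPLOAD_BYTES:
        return JsonResponse({'error': 'file exceeds the 10 MB limit'}, status=413)

    return JsonResponse({
        'transcription': transcribe(audio),
        'speaker_label': identify_speaker(audio),
    })
```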