DeepHelp: Understanding users of a crisis text messaging service using Machine Learning

Requirements

  • Python 3.7.9
  • pytorch 1.9
  • pandas 1.3
  • transformers 4.8.2
  • numpy 1.21.0
  • matplotlib 3.4.2
  • tqdm 4.61.2
  • scikit-learn 0.24.2
  • umap 0.5.1

Model training

For this project, three models based on GPT-2 were developed to perform response generation applied to mental health. To fine-tune these three models, respectively called GPT-2 model, ConditionalGPT2 and StyleGPT2, run a command such as the following.

python3 models/gpt2_model.py \
    --mode=train \
    --ckpt_name=best_ckpt 

The command above is for the GPT-2 model, but ConditionalGPT2 and StyleGPT2 can be trained the same way by replacing models/gpt2_model.py with models/conditional_gpt2_model.py or models/style_tokens_model/run_styleGPT2.py respectively. Make sure the --mode argument is set to 'train'. The --ckpt_name argument is not mandatory, but it allows an already trained model to be loaded for further fine-tuning.
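
Under the hood, fine-tuning any of these models amounts to a standard causal language modelling loop. The sketch below is illustrative only, with a toy stand-in for the data and assumed hyperparameters; it is not the repository's actual training code.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# toy stand-in for the pre-processed conversation data
texts = ["Texter: I feel really low today. Volunteer: I'm here to listen."]
batches = [tokenizer(t, return_tensors="pt")["input_ids"] for t in texts]

model.train()
for epoch in range(3):
    for input_ids in batches:
        optimizer.zero_grad()
        # GPT-2 returns the language modelling loss when labels are supplied
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()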

At inference time

To use the three models at inference time, the command is similar to the training one but takes more arguments. Some arguments are common to all three models:

  • --mode which must be set to 'inference' at inference time
  • --ckpt_name to load an already fine-tuned model
  • --decoding which selects the response generation strategy: 'greedy' (greedy decoding), 'beam' (beam search), 'top_k' (top-k sampling) or 'nucleus' (nucleus sampling), as illustrated in the sketch after this list
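
As a rough illustration of these four strategies (not the project's own code), this is how they map onto Hugging Face's generate() API; the base model, prompt and generation length are placeholders.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
input_ids = tokenizer.encode("I have been feeling anxious lately.", return_tensors="pt")

with torch.no_grad():
    # greedy: always pick the most probable next token
    greedy = model.generate(input_ids, max_length=50)
    # beam: keep the 5 best partial sequences at each step
    beam = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
    # top_k: sample from the 50 most probable next tokens
    top_k = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
    # nucleus: sample from the smallest token set with cumulative probability 0.9
    nucleus = model.generate(input_ids, max_length=50, do_sample=True, top_p=0.9, top_k=0)

print(tokenizer.decode(nucleus[0], skip_special_tokens=True))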

For the GPT-2 model, we can run the following command, which includes the --max_time argument: the number of successive messages used as input context when generating a reply.

python3 models/gpt2_model.py \
    --mode=inference \
    --ckpt_name=best_ckpt \
    --decoding=nucleus \
    --max_time=2
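
For intuition, --max_time plausibly behaves like the hypothetical context builder below; the function name and separator are assumptions, not the script's actual code.

def build_context(messages, max_time=2, sep=" <|endoftext|> "):
    # keep only the last max_time messages (oldest first) as the prompt
    return sep.join(messages[-max_time:])

print(build_context(["Hi", "I feel anxious", "I can't sleep"], max_time=2))
# -> "I feel anxious <|endoftext|> I can't sleep"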

For ConditionalGPT2, the --age, --gender and --topic arguments define a persona profile on which the response generation is conditioned.

python3 models/conditional_gpt2_model.py \
    --mode=inference \
    --ckpt_name=best_ckpt \
    --decoding=top_k \
    --age='over 18' \
    --gender=male \
    --topic=anxiety
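
A common way to implement this kind of conditioning is to serialise the profile into control tokens prepended to the dialogue. The format below is an assumption for illustration, not ConditionalGPT2's actual input encoding.

def persona_prefix(age, gender, topic):
    # hypothetical control-token layout; the repository's tokens may differ
    return f"<age> {age} <gender> {gender} <topic> {topic} <dialogue> "

prompt = persona_prefix("over 18", "male", "anxiety") + "I have been on edge all week."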

For StyleGPT2, the model generates replies conditioned on a chosen style token via the --style_label argument.

python3 models/style_tokens_model/run_styleGPT2.py \
    --mode=inference \
    --ckpt_name=best_ckpt \
    --decoding=greedy \
    --style_label=1
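
Conceptually, a style token is a learned vector selected by the integer label. A minimal sketch, assuming the label indexes an embedding table injected into the GPT-2 input (the repository's actual mechanism may differ):

import torch
import torch.nn as nn

n_styles, hidden_size = 4, 768  # assumed number of styles; 768 matches GPT-2 small
style_embeddings = nn.Embedding(n_styles, hidden_size)

style_label = torch.tensor([1])            # corresponds to --style_label=1
style_vec = style_embeddings(style_label)  # shape (1, 768)
# style_vec would then be added to, or prepended to, the token embeddings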

Pre-processing

The directory 'pre-processing' contains all the pre-processing performed on the Shout dataset for the project:

  • data_preprocessing.ipynb for the conversations and raw text messages
  • models_preprocessing.ipynb used to create the appropriate inputs to be fed into the models for training
  • survey_preprocessing.ipynb for the texter survey
  • sv_survey_preprocessing.ipynb for the Shout volunteer survey

Clustering

The directory 'clustering' is dedicated to the generation of conversation embeddings and cluster analysis on them:

  • clustering.ipynb performs cluster analysis on features built with simple models such as TF-IDF, and also prepares the data used to generate embeddings with transformer methods
  • gpt2_features.py generates a GPT-2 embedding for each conversation
  • gpt2_emb_clusters.py performs cluster analysis on these GPT-2 embeddings to extract style tokens, along the lines of the sketch below.
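
As a rough sketch of this pipeline (the pooling strategy, toy data and cluster count are assumptions, not the repository's exact code): each conversation is mean-pooled into a single GPT-2 vector, projected with UMAP and clustered with k-means.

import numpy as np
import torch
import umap
from sklearn.cluster import KMeans
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def embed(text):
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()  # mean-pool over tokens

conversations = ["toy conversation %d about sleep, school or anxiety" % i for i in range(10)]
X = np.vstack([embed(c) for c in conversations])

coords = umap.UMAP(n_neighbors=5, n_components=2, random_state=42).fit_transform(X)
labels = KMeans(n_clusters=2, random_state=42).fit_predict(coords)  # cluster assignments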

Helpers

The directory 'helpers' contains utility methods used for training the models, such as custom_data.py, which provides the pre-processed dataset, as well as methods for extracting information from the surveys, such as conversation topics and the texter's age and gender.
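
For illustration, custom_data.py presumably exposes something like the following PyTorch Dataset; the class name and tensor layout here are assumptions, not the actual interface.

import torch
from torch.utils.data import Dataset

class ConversationDataset(Dataset):
    # hypothetical wrapper around the pre-processed, tokenised conversations
    def __init__(self, encoded_dialogues):
        self.examples = [torch.tensor(ids) for ids in encoded_dialogues]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = self.examples[idx]
        # causal LM fine-tuning: labels mirror the inputs (shifted inside the model)
        return {"input_ids": ids, "labels": ids.clone()}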