
Environmental impact of running BERT


Analysis of data

Notebooks

  1. power_monitor_analysis/ExtractReading - gets readings from the power monitor for a given time interval. The power monitor writes a reading to a database every 3 seconds.

  2. Fine-tuningAnalysis - extracts data from nvidia-smi and the power monitor and combines them for analysis. The time-based models are compared with the empirical values from the power monitor, and the carbon footprint is calculated (a rough sketch of the energy/carbon calculation follows this list). Also plots how energy and time scale with dataset size.

  3. RunInference - runs inference with the MRPC, CoLA and STS-B models fine-tuned earlier.

  4. InferenceAnalysis - extracts nvidia-smi and power monitor data for inference and combines them for analysis. The time-based models are compared with the empirical values from the power monitor. The overall carbon footprint is calculated and combined with the pre-training and fine-tuning footprints.

  5. CompareTimeModels - compares the time-based models.
    Merges all data from pre-training, fine-tuning and inference to test how the models scale with time compared with the analytical models.

  6. nvidia-smi data exploration - extracts nvidia-smi data for the fine-tuning tasks and does the initial exploration.

  7. Time series data stationary test - data exploration and stationarity testing with the Augmented Dickey-Fuller (ADF) test.
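
The core of the energy and carbon-footprint calculation in these notebooks amounts to integrating power samples over time. Below is a minimal sketch, assuming a pandas DataFrame of power-monitor readings taken every 3 seconds; the column name, function name and grid carbon-intensity figure are placeholders, not values taken from the notebooks.

```python
# Minimal sketch: turn power-monitor samples (one every 3 seconds) into
# energy (kWh) and an approximate carbon footprint (kg CO2e).
import pandas as pd

SAMPLE_INTERVAL_S = 3                 # the power monitor logs a reading every 3 s
CARBON_INTENSITY_KG_PER_KWH = 0.233   # placeholder grid carbon intensity

def energy_and_carbon(readings: pd.DataFrame, power_col: str = "power_w"):
    """Integrate power samples (watts) into kWh and convert to kg CO2e."""
    energy_joules = readings[power_col].sum() * SAMPLE_INTERVAL_S  # rectangle rule
    energy_kwh = energy_joules / 3.6e6                             # 1 kWh = 3.6e6 J
    carbon_kg = energy_kwh * CARBON_INTENSITY_KG_PER_KWH
    return energy_kwh, carbon_kg

# Example: energy_and_carbon(pd.read_csv("power_monitor.csv"))
```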

Data collection during training and inference

Requirements:

  1. Python 3.6+
  2. TensorFlow 2.2.0
  3. PyTorch 1.5.0
  4. CUDA 10.2

Virtual environment

  1. Download Miniconda and set the paths

  2. conda update conda
    conda create -n venv python=3.7
    conda install -n venv jupyter scipy numpy matplotlib tensorflow-gpu tensorflow-hub seaborn

  3. conda activate venv
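
As a quick sanity check (not part of the original setup steps), you can confirm that the TensorFlow build in the new environment sees the GPU:

```python
# Optional check that the conda environment's TensorFlow can see the GPU.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))
```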

Pre-train

Version issues :(

Converted the TF1 code to TF2 with tf_upgrade_v2.
tensorflow/tensorflow#26854

Steps:

  1. Download the model: sh download_uncased_base.sh
  2. Get wiki data from https://github.com/pytorch/examples/tree/master/word_language_model/data
  3. Preprocess the data: sh pretrain_data.sh
  4. Run training: sh pre_train.sh
    OR
    train and record power data (a rough sketch of the recording loop follows these steps):
    sh pretrain_and_record_power.sh
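
pretrain_and_record_power.sh is not reproduced here; the sketch below shows, in Python, roughly what such a wrapper can do: start nvidia-smi logging in the background, run the training script, then stop the logger. The log file name and the 1-second polling interval are assumptions, not taken from the script.

```python
# Rough sketch of "train and record power": poll nvidia-smi in the background
# while the training script runs. Paths and the polling interval are illustrative.
import subprocess

LOG_FILE = "gpu_power_log.csv"   # illustrative output path

with open(LOG_FILE, "w") as log:
    monitor = subprocess.Popen(
        ["nvidia-smi",
         "--query-gpu=timestamp,power.draw,utilization.gpu,memory.used",
         "--format=csv", "-l", "1"],     # one CSV row per second
        stdout=log,
    )
    try:
        subprocess.run(["sh", "pre_train.sh"], check=True)  # run training
    finally:
        monitor.terminate()              # stop logging when training ends
```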

Pre-train with more data

google-research/bert#341
https://github.com/dsindex/bert

  1. Download a wiki dump

  2. Extract it using https://github.com/attardi/wikiextractor
    python ../wikiextractor/WikiExtractor.py /media/data/wikidownload.xml.bz2 --output /media/data/wikidump --processes 1 -q

  3. Clean it using
    bash create_pretraining_data.sh

    You may need to install nltk and download the punkt tokenizer (see the sentence-splitting sketch after these steps):
    pip install nltk
    import nltk
    nltk.download('punkt')

  4. Run pre-training: sh pretrain_large.sh
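
The punkt download above suggests nltk's sentence tokenizer is used when preparing the corpus; BERT's create_pretraining_data.py expects one sentence per line with a blank line between documents. A minimal sketch of that formatting step, with illustrative file paths and a crude document-splitting heuristic:

```python
# Split extracted wiki text into one sentence per line, with a blank line
# between documents, as expected by BERT's create_pretraining_data.py.
import nltk

nltk.download("punkt")  # sentence tokenizer model

def to_pretraining_format(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for doc in fin.read().split("\n\n"):           # crude document split
            text = doc.strip()
            if not text:
                continue
            for sentence in nltk.sent_tokenize(text):  # one sentence per line
                fout.write(sentence + "\n")
            fout.write("\n")                           # blank line between documents

# to_pretraining_format("/media/data/wikidump/AA/wiki_00", "pretrain_corpus.txt")
```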

Data collection for fine-tune training

Runs fine-tune training and records power draw and utilisation with nvidia-smi.

Usage: sh train_and_record_power.sh <task> <batch size> <max sequence length> <model (cased/uncased)>

Example: sh train_and_record_power.sh CoLA 32 128 bert-base-cased

Example: fine-tuning on MRPC

  1. Get the model - download it from https://github.com/google-research/bert

  2. Get the data using download_glue_data.py

    python download_glue_data.py --data_dir data --tasks MRPC

  3. Prepare the fine-tuning data using sudo sh fine_tune.sh
    (edit the fields as needed)

  4. Run: python bert_finetune.py

https://github.com/tensorflow/models/tree/master/official/nlp/bert
https://github.com/tensorflow/models/tree/master/official/nlp/bert#process-datasets

Hugging Face Transformers example

For the PyTorch implementation:

  1. pip install statsmodels

  2. git clone https://github.com/huggingface/transformers

    cd transformers

    pip install .

  3. pip install -r ./examples/requirements.txt

    (to update later:
    git pull
    pip install --upgrade .)

  4. Download the data as in the TensorFlow example. There is no need to download the model separately.

  5. cd ..

  6. sh fine_tune_example.sh MRPC 32

    The task argument can be CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE or WNLI.
    The second argument, the batch size, can be 16, 32, 64, etc.

  7. Record GPU utilisation details (a sketch of summarising the resulting log follows this list):
    sh nvidiasmi.sh
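
A small sketch of how such a CSV log from an nvidia-smi query can be summarised into mean utilisation and GPU energy. The column names, units and 1-second sampling interval are assumptions about how the log was produced, not a description of nvidiasmi.sh:

```python
# Summarise an nvidia-smi CSV log into mean GPU utilisation and energy (kWh).
# Assumes columns like "power.draw [W]" and "utilization.gpu [%]" sampled every second.
import pandas as pd

def summarise_gpu_log(path: str, interval_s: float = 1.0) -> dict:
    df = pd.read_csv(path, skipinitialspace=True)
    power_w = df["power.draw [W]"].str.rstrip(" W").astype(float)
    util = df["utilization.gpu [%]"].str.rstrip(" %").astype(float)
    energy_kwh = power_w.sum() * interval_s / 3.6e6   # W * s -> J -> kWh
    return {"mean_utilisation_pct": util.mean(), "gpu_energy_kwh": energy_kwh}

# summarise_gpu_log("gpu_power_log.csv")
```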

Language modelling

Download wikitext2 from https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

sh mlm_fine_tune_bert.sh

More details are in /modelsFT.
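
mlm_fine_tune_bert.sh wraps the Transformers language-modelling example; the sketch below illustrates the same idea directly with the library's classes. The file path, model name and hyperparameters are placeholders, and this is not the script itself:

```python
# Minimal masked-language-model fine-tuning sketch on WikiText-2 with Transformers.
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="wikitext-2/wiki.train.tokens",  # placeholder path
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="mlm_output",
                                         num_train_epochs=1,
                                         per_device_train_batch_size=16),
                  data_collator=collator,
                  train_dataset=dataset)
trainer.train()
```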
