Comprehensive data analysis pipelines for experimental linguistic research, covering EEG, speech recordings, eye-tracking, and reaction time studies. These notebooks demonstrate reproducible workflows for processing and analyzing diverse experimental data types commonly used in phonetics, psycholinguistics, and cognitive science research.
Author: Chem Vatho, PhD
Affiliation: University of Cologne, IfL – Phonetik
Contact: chemvatho@gmail.com
This repository contains analysis pipelines developed for the Experimental Service Hub at the Data Center for the Humanities (DCH), University of Cologne. Each notebook provides a complete, reproducible workflow from data loading to statistical analysis and visualization.
| Pipeline | Data Type | Key Methods |
|---|---|---|
| EEG Analysis | EEG recordings | ERP analysis, time-frequency decomposition |
| Speech Recording | Audio files | F0 extraction, formant analysis, voice quality |
| Eye-tracking | Gaze data | Fixation analysis, time course visualization |
| Reaction Time | Behavioral data | RT distributions, mixed-effects modeling |
Notebook: EEG_Analysis_Pipeline.ipynb
A comprehensive pipeline for processing electroencephalography (EEG) data in linguistic experiments, with focus on event-related potentials (ERPs) relevant to language processing.
- Preprocessing: Bandpass filtering (0.1–30 Hz), baseline correction, artifact rejection
- ERP Analysis: N400 and P600 component extraction for semantic/syntactic processing
- Time-Frequency Analysis: Morlet wavelet decomposition for oscillatory dynamics
- Statistical Analysis: T-tests, effect sizes (Cohen's d), condition comparisons
- Visualization: Multi-channel ERP plots, topographic maps, TFR spectrograms
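As a rough illustration of the preprocessing and ERP steps listed above, the following NumPy-only sketch applies baseline correction and then averages the voltage in a 300-500 ms window, the latency range typically associated with the N400. It is not taken from the notebook, which relies on MNE-Python. The function names, the -200 to 0 ms baseline, the 250 Hz sampling rate, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def baseline_correct(epochs, times, baseline=(-0.2, 0.0)):
    """Subtract the mean baseline voltage from every epoch and channel.

    epochs: array (n_epochs, n_channels, n_times), in microvolts
    times:  array (n_times,), seconds relative to stimulus onset
    """
    mask = (times >= baseline[0]) & (times <= baseline[1])
    return epochs - epochs[:, :, mask].mean(axis=-1, keepdims=True)

def window_mean(epochs, times, window=(0.3, 0.5)):
    """Mean amplitude per epoch and channel inside a latency window."""
    mask = (times >= window[0]) & (times <= window[1])
    return epochs[:, :, mask].mean(axis=-1)

# Synthetic demo data (assumed): 20 epochs, 1 channel, 250 Hz, -200 to 800 ms
rng = np.random.default_rng(0)
sfreq = 250.0
times = np.arange(-0.2, 0.8, 1.0 / sfreq)
epochs = rng.normal(0.0, 1.0, size=(20, 1, times.size)) + 3.0  # 3 µV DC offset

# Simulate a negative deflection of 4 µV in the 300-500 ms window
n400_mask = (times >= 0.3) & (times <= 0.5)
epochs[:, :, n400_mask] -= 4.0

corrected = baseline_correct(epochs, times)
amplitudes = window_mean(corrected, times)   # shape (n_epochs, n_channels)
print(amplitudes.shape, float(amplitudes.mean()))
```

Baseline correction removes the constant offset, so the window mean reflects only the simulated deflection. With real recordings the same two steps would normally be run on filtered, artifact-free epochs.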
Statistical Results at Electrode Pz
──────────────────────────────────────────────────────────
N400 Component (300-500 ms):
Congruent: M = -2.15 µV (SD = 1.82)
Incongruent: M = -5.43 µV (SD = 2.01)
t = 8.234, p < .001, Cohen's d = 1.72
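The effect size in the example output above can be approximated from the summary statistics alone. The sketch below assumes the equal-n pooled-SD form of Cohen's d. The notebook may use a different variant (for example, one based on paired differences), so the function name and formula choice here are hypothetical.

```python
import math

def cohens_d_pooled(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the equal-n pooled SD: sqrt((sd_a^2 + sd_b^2) / 2)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd

# Rounded summary values copied from the example output (µV)
d = cohens_d_pooled(-2.15, 1.82, -5.43, 2.01)
print(round(d, 2))  # 1.71
```

With the rounded means and SDs this gives about 1.71, close to the reported 1.72. The small gap would be consistent with rounding of the inputs or with a slightly different formula.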
Libraries: mne, numpy, pandas, scipy, matplotlib, seaborn, scikit-learn
Notebook: Speech_Recording_Analysis.ipynb
A complete acoustic phonetics pipeline using Praat (via Parselmouth) and librosa for analyzing speech recordings.
- F0 Extraction: Multiple algorithms (Praat autocorrelation, pYIN, CREPE)
- Formant Analysis: F1–F4 tracking with bandwidth measurements
- Voice Quality: Jitter, shimmer, harmonics-to-noise ratio (HNR)
- Visualization: Spectrograms, formant tracks, F1-F2 vowel space plots
- Batch Processing: Automated pipeline for multiple audio files
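To make the voice quality measures above more concrete, here is a minimal sketch of the usual definitions of local jitter and local shimmer: the mean absolute difference between consecutive cycles, divided by the overall mean, expressed in percent. It is only an illustration of the arithmetic, not the Parselmouth calls used in the notebook. The function name and the cycle measurements are invented for the example.

```python
import numpy as np

def local_perturbation(values):
    """Mean absolute difference of consecutive values relative to their mean, in percent."""
    values = np.asarray(values, dtype=float)
    if values.size < 2:
        raise ValueError("need at least two glottal cycles")
    return 100.0 * np.abs(np.diff(values)).mean() / values.mean()

# Hypothetical per-cycle measurements from a sustained vowel
periods = np.array([0.0100, 0.0101, 0.0099, 0.0100, 0.0102])   # seconds
amplitudes = np.array([0.50, 0.52, 0.49, 0.51, 0.50])           # arbitrary units

jitter_local = local_perturbation(periods)       # cycle-to-cycle period variation
shimmer_local = local_perturbation(amplitudes)   # cycle-to-cycle amplitude variation
print(f"Jitter (local): {jitter_local:.3f}%")
print(f"Shimmer (local): {shimmer_local:.3f}%")
```

The same function serves both measures because jitter and shimmer differ only in what is measured per cycle (period versus peak amplitude).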
Voice Quality Measures:
──────────────────────────────────────────────────────────
Jitter (local): 0.892%
Shimmer (local): 3.241%
HNR: 18.45 dB
Formant Statistics:
F1: M = 512 Hz (SD = 89)
F2: M = 1543 Hz (SD = 234)
F3: M = 2651 Hz (SD = 187)
Libraries: parselmouth, librosa, numpy, pandas, scipy, matplotlib, seaborn, soundfile
Notebook: Peekbank_Eyetracking_Analysis.ipynb
Analysis pipeline for visual world paradigm eye-tracking experiments using the Peekbank database framework.
- Data Processing: Fixation extraction, area-of-interest (AOI) mapping
- Time Course Analysis: Proportion of looks over time
- Growth Curve Analysis: Polynomial and GAM modeling of looking behavior
- Statistical Analysis: Cluster-based permutation tests, bootstrapped CIs
- Visualization: Time course plots, heatmaps, individual differences
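The time course and bootstrap steps listed above can be sketched with NumPy alone. The example computes, per subject and time bin, the proportion of valid samples that fall on the target AOI, then a percentile bootstrap confidence interval for the group mean by resampling subjects. The array layout, function names, and simulated gaze data are assumptions for illustration and do not reflect the Peekbank schema or the notebook's actual code.

```python
import numpy as np

def proportion_target(looks, valid):
    """Per-subject proportion of target looks in each time bin.

    looks: bool array (n_subjects, n_trials, n_bins), True if gaze is on the target AOI
    valid: bool array of the same shape, True if the sample is usable (no track loss)
    """
    hits = (looks & valid).sum(axis=1)
    counts = valid.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counts > 0, hits / counts, np.nan)

def bootstrap_ci(subject_curves, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the across-subject mean, resampling subjects."""
    rng = np.random.default_rng(seed)
    n_subjects = subject_curves.shape[0]
    idx = rng.integers(0, n_subjects, size=(n_boot, n_subjects))
    boot_means = np.nanmean(subject_curves[idx], axis=1)   # (n_boot, n_bins)
    lower = np.nanpercentile(boot_means, 100 * alpha / 2, axis=0)
    upper = np.nanpercentile(boot_means, 100 * (1 - alpha / 2), axis=0)
    return lower, upper

# Simulated data (assumed): target looks rise from 50% to 80% over 10 time bins
rng = np.random.default_rng(1)
n_subjects, n_trials, n_bins = 12, 16, 10
p_target = np.linspace(0.5, 0.8, n_bins)
looks = rng.random((n_subjects, n_trials, n_bins)) < p_target
valid = rng.random((n_subjects, n_trials, n_bins)) < 0.9   # about 10% track loss

curves = proportion_target(looks, valid)
mean_curve = np.nanmean(curves, axis=0)
lower, upper = bootstrap_ci(curves)
```

Resampling whole subjects rather than individual trials keeps each subject's time course intact, which is one common way to respect the repeated-measures structure.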
See the dedicated repository: github.com/chemvatho/Peekbank-Analysis
Notebook: Reaction_Time_Analysis.ipynb
A psycholinguistics-focused pipeline for analyzing reaction time data from lexical decision and similar paradigms.
- Preprocessing: Outlier removal (absolute bounds + SD-based), accuracy filtering
- Distribution Analysis: RT histograms, Q-Q plots, log transformation assessment
- Effect Analysis: Lexicality, word frequency, semantic priming effects
- ANOVA: Repeated measures with pairwise comparisons (Bonferroni correction)
- Mixed-Effects Models: Linear mixed models with random slopes for subjects/items
- Visualization: Interaction plots, by-subject distributions, effect size plots
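The two-stage outlier removal named in the preprocessing step above might look roughly like this. The sketch first applies absolute bounds, then excludes values more than a set number of standard deviations from the mean of the remaining data, and finally log-transforms the kept RTs. The thresholds (200 ms, 2500 ms, 2.5 SD) and the function name are assumptions, since the README does not state the values used in the notebook.

```python
import numpy as np

def trim_rts(rts, lower_ms=200.0, upper_ms=2500.0, n_sd=2.5):
    """Return a boolean mask of RTs that survive both trimming stages."""
    rts = np.asarray(rts, dtype=float)
    # Stage 1: absolute bounds
    in_bounds = (rts >= lower_ms) & (rts <= upper_ms)
    # Stage 2: SD-based cutoff, computed only on values that passed stage 1
    kept = rts[in_bounds]
    mean, sd = kept.mean(), kept.std(ddof=1)
    within_sd = np.abs(rts - mean) <= n_sd * sd
    return in_bounds & within_sd

# Hypothetical RTs (ms) from one participant
rts = np.array([150, 480, 510, 530, 495, 520, 505, 1400, 3000, 515, 490, 500])
keep = trim_rts(rts)
log_rts = np.log(rts[keep])   # log transform to reduce right skew

print(rts[~keep])   # [ 150 1400 3000]
```

In practice the SD-based stage is often applied separately per participant, and sometimes per condition, so that slow participants are not trimmed disproportionately.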
Repeated Measures ANOVA (Frequency × Priming):
| Source | SS | DF | MS | F | p-unc | η² |
|---|---|---|---|---|---|---|
| frequency | 125432.1 | 1 | 125432.1 | 89.234 | <.001 | 0.124 |
| priming | 45621.8 | 1 | 45621.8 | 32.451 | <.001 | 0.045 |
| freq*prime | 2341.2 | 1 | 2341.2 | 1.665 | .207 | 0.002 |
Libraries: pandas, numpy, scipy, statsmodels, pingouin, matplotlib, seaborn
Click the "Open in Colab" badge on any notebook to run it directly in your browser; no installation is required.
# Clone the repository
git clone https://github.com/chemvatho/experimental-linguistics-pipelines.git
cd experimental-linguistics-pipelines
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook
# requirements.txt
numpy>=1.21.0
pandas>=1.3.0
scipy>=1.7.0
matplotlib>=3.4.0
seaborn>=0.11.0
statsmodels>=0.13.0
pingouin>=0.5.0
mne>=1.0.0
librosa>=0.9.0
praat-parselmouth>=0.4.0  # PyPI name of Parselmouth; imported as `parselmouth`
scikit-learn>=1.0.0
soundfile>=0.10.0
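Before running the notebooks locally, it can help to confirm that the installed packages meet the minimum versions listed above. The following standard-library sketch parses `name>=version` lines and compares them with what is installed. The helper names are hypothetical and the version comparison is deliberately simplified (numeric components only), so it should be treated as a convenience check rather than a replacement for `pip`.

```python
from importlib import metadata

def parse_requirements(text):
    """Return {distribution_name: minimum_version} for 'name>=x.y.z' lines."""
    minimums = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ">=" not in line:
            continue
        name, version = line.split(">=", 1)
        minimums[name.strip()] = version.strip()
    return minimums

def version_tuple(version):
    """Turn '1.21.0' into (1, 21, 0); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_environment(minimums):
    """Map each distribution to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else "too old"
    return report

# Two lines from the list above; in practice, read requirements.txt instead
sample = """
numpy>=1.21.0
pandas>=1.3.0
"""
minimums = parse_requirements(sample)
report = check_environment(minimums)
for name, status in report.items():
    print(f"{name}: {status}")
```

To check the full list, pass the contents of the file, for example `parse_requirements(open("requirements.txt").read())`.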
These pipelines are designed to work with the following open datasets:
| Data Type | Dataset | Source |
|---|---|---|
| EEG | EEG Dataset | Kaggle |
| EEG | ZuCo (reading EEG) | OSF |
| Speech | Speech Accent Archive | Kaggle |
| Speech | LJSpeech | Kaggle |
| Eye-tracking | Peekbank | peekbank.stanford.edu |
| Reaction Time | British Lexicon Project | crr.ugent.be |
| Reaction Time | MALD Database | Springer |
experimental-linguistics-pipelines/
│
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── LICENSE                      # MIT License
│
├── notebooks/
│   ├── EEG_Analysis_Pipeline.ipynb
│   ├── Speech_Recording_Analysis.ipynb
│   └── Reaction_Time_Analysis.ipynb
│
├── data/                        # Sample data (gitignored for large files)
│   └── .gitkeep
│
├── outputs/                     # Analysis outputs
│   ├── figures/
│   └── results/
│
└── utils/                       # Helper functions
    ├── __init__.py
    ├── preprocessing.py
    └── visualization.py
- Chem, Vatho. (in prep.). Adapting forced alignment for Khmer, a low-resource language.
- Chem, Vatho. (in prep.). The Illustration of IPA: Khmer (Phnom Penh Dialect). Journal of the IPA.
- Chem, Vatho. (2020). Khmer Vowel System: Structure and Variation. CJBAR, 2(2).
- Khmer G2P Tool: huggingface.co/spaces/Vatho/Khmer-g2p
- Peekbank Analysis: github.com/chemvatho/Peekbank-Analysis
- University Profile: ifl.phil-fak.uni-koeln.de/phonetik/personen/vatho-chem-ma
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Supervisors: Prof. Dr. Martine Grice & PD Dr. Constantijn Kaland (University of Cologne)
- Mentors: Prof. Dr. Reinhold Greisbach & T. Mark Ellison (University of Cologne)
- Funding: KAAD (Katholischer Akademischer Ausländer-Dienst)
- Tools: Praat, MNE-Python, librosa
The repository was developed with the assistance of Claude (Anthropic), which supported logical reasoning and algorithm design. Its integration with NotebookLM led to the creation of a pilot project, which is documented on GitHub.
Developed for the Experimental Service Hub, Data Center for the Humanities (DCH)
University of Cologne, Germany






