Score Prediction from User Logs with BERT
Applying BERT-based sequence modeling to predict user performance scores from interaction logs, demonstrating how transformer architectures can extract learning
User logs — sequences of clicks, answers, navigation paths, and time-on-task — contain rich signals about cognitive engagement and learning progress. This project explores whether a BERT-based sequence model can extract those signals well enough to predict a user's eventual score on a knowledge assessment.
Problem Framing
Given: a sequence of user interactions with a learning platform (e.g., which questions were attempted, answer correctness, time between attempts, navigation patterns).
Predict: the user's final assessment score.
This is a sequence classification problem: the final score is binned into buckets, and the model predicts the bucket for a whole session. The natural language processing analogy: each interaction is a "token," and the full session is a "sentence" whose meaning (predicted score) we want to model.
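To make the token analogy concrete, here is a minimal sketch of how a session might be encoded. The event vocabulary, the `Event` fields, the `max_len` value, and the four equal-width score buckets are all assumptions made for illustration; the project's actual encoding is not described here.

```python
from dataclasses import dataclass

# Hypothetical event vocabulary; the real platform's event types may differ.
SPECIAL_TOKENS = ["[PAD]", "[CLS]"]
EVENT_TYPES = ["attempt_correct", "attempt_incorrect", "resource_access", "navigation"]
VOCAB = {name: idx for idx, name in enumerate(SPECIAL_TOKENS + EVENT_TYPES)}


@dataclass
class Event:
    kind: str        # one of EVENT_TYPES
    seconds: float   # time since the previous event (not used in this sketch)


def encode_session(events, max_len=8):
    """Turn a session into fixed-length token IDs plus an attention mask."""
    ids = [VOCAB["[CLS]"]] + [VOCAB[e.kind] for e in events]
    ids = ids[:max_len]                        # truncate long sessions
    mask = [1] * len(ids)                      # 1 = real token, 0 = padding
    pad = max_len - len(ids)
    return ids + [VOCAB["[PAD]"]] * pad, mask + [0] * pad


def score_to_bucket(score, n_buckets=4, max_score=100.0):
    """Map a raw score in [0, max_score] to one of n_buckets class labels."""
    bucket = int(score / max_score * n_buckets)
    return min(bucket, n_buckets - 1)          # max_score falls in the top bucket


session = [Event("attempt_incorrect", 12.0),
           Event("resource_access", 40.0),
           Event("attempt_correct", 9.5)]
ids, mask = encode_session(session)            # ids: [1, 3, 4, 2, 0, 0, 0, 0]
label = score_to_bucket(72.0)                  # label: 2
```

Timing information is carried on each event but ignored here; one plausible extension would be to discretize it into additional tokens or add it as a separate embedding.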
Approach
BERT's pre-trained representations are adapted to the interaction-log domain through fine-tuning. Rather than tokenizing text, the model ingests a structured sequence of interaction events, each represented as a learned embedding.
The architecture:
- Event embedding layer — maps each interaction type (question attempt, resource access, navigation) to a dense vector
- BERT encoder — processes the sequence with self-attention, capturing long-range dependencies between events
- Classification head — maps the [CLS] token representation to a predicted score bucket
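The data flow through the three components above can be sketched with a toy forward pass in plain Python. This is not BERT: it uses a single untrained attention head with random weights and omits positional embeddings, residual connections, layer normalization, and feed-forward sublayers. All sizes and names are assumptions, and token ID 1 is assumed to stand for [CLS].

```python
import math
import random

random.seed(0)
VOCAB_SIZE, DIM, N_BUCKETS = 6, 4, 3   # toy sizes, chosen for illustration


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (rows x cols) by vector v (length cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(xs):
    peak = max(xs)                                 # subtract max for stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


embedding = rand_matrix(VOCAB_SIZE, DIM)           # event embedding layer
w_q, w_k, w_v = (rand_matrix(DIM, DIM) for _ in range(3))
w_out = rand_matrix(N_BUCKETS, DIM)                # classification head


def forward(token_ids):
    """Return (bucket probabilities, attention weights of the [CLS] position)."""
    x = [embedding[t] for t in token_ids]          # 1. embed each event
    q_cls = matvec(w_q, x[0])                      # 2. query from [CLS] (position 0)
    keys = [matvec(w_k, xi) for xi in x]
    values = [matvec(w_v, xi) for xi in x]
    scores = [sum(a * b for a, b in zip(q_cls, k)) / math.sqrt(DIM) for k in keys]
    attn = softmax(scores)                         # one weight per position
    cls_repr = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(DIM)]
    return softmax(matvec(w_out, cls_repr)), attn  # 3. head on the [CLS] vector


probs, attn = forward([1, 3, 4, 2])                # [CLS] followed by three events
```

Each entry of `attn` is computed directly from the [CLS] query and that position's key, so an event's weight does not depend on how far it sits from the start of the session.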
The key insight: BERT's attention mechanism is well suited to this task because early events in a session (e.g., which topics a user struggles with in the first 10 minutes) have predictive value for later outcomes. Self-attention connects any two events directly, whereas a standard RNN must carry early signals through every intermediate step and tends to underweight them.
Results and Takeaways
Fine-tuned BERT outperformed LSTM and GRU baselines on the held-out test set, with the performance gap widening for longer sessions — consistent with the hypothesis that self-attention better captures long-range dependencies in behavioral sequences.
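One way to check a claim like "the gap widens for longer sessions" is to stratify test accuracy by session length. The sketch below assumes per-example records of (session length, whether the predicted bucket was correct); the bin edges are hypothetical and the sample records are invented placeholders, not results from this project.

```python
from collections import defaultdict


def accuracy_by_length(records, edges=(50, 200)):
    """Average correctness per session-length bin.

    records: iterable of (session_length, is_correct) pairs.
    edges:   hypothetical bin boundaries, in number of events.
    """
    labels = ([f"<={edges[0]}"]
              + [f"{lo + 1}-{hi}" for lo, hi in zip(edges, edges[1:])]
              + [f">{edges[-1]}"])
    hits = defaultdict(list)
    for length, is_correct in records:
        bin_index = sum(length > edge for edge in edges)   # count edges exceeded
        hits[labels[bin_index]].append(1.0 if is_correct else 0.0)
    return {label: sum(vals) / len(vals) for label, vals in hits.items()}


def gap_by_length(records_a, records_b, edges=(50, 200)):
    """Per-bin accuracy of model A minus model B, for bins both models cover."""
    acc_a = accuracy_by_length(records_a, edges)
    acc_b = accuracy_by_length(records_b, edges)
    return {label: acc_a[label] - acc_b[label] for label in acc_a if label in acc_b}


# Invented placeholder predictions, only to exercise the functions.
model_a = [(30, True), (40, False), (120, True), (300, True)]
model_b = [(30, True), (40, False), (120, False), (300, False)]
gaps = gap_by_length(model_a, model_b)
```

With real test data, bins that contain only a handful of sessions would give noisy accuracies, so reporting bin sizes alongside the gaps is worth considering.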
This project illustrates a recurring theme in applied ML: the right framing matters as much as the architecture. Treating user logs as sequences analogous to text, and adapting an NLP architecture accordingly, produced a better model than purpose-building a time-series predictor from scratch.