Real-Time Age, Race & Gender Prediction with TensorFlow
Real-time face detection and demographic attribute inference using FaceNet embeddings and TensorFlow — trained on UTKFace and IMDB-WIKI datasets.
This project builds a real-time pipeline that detects faces from a webcam stream and simultaneously predicts three demographic attributes — age, race, and gender — using deep learning. The motivation was practical: I wanted to understand how well off-the-shelf face embeddings from FaceNet could be repurposed for attribute regression and classification, without training a feature extractor from scratch.
Architecture
The pipeline has two stages:
- Face detection — OpenCV's Haar Cascade or MTCNN locates faces in each video frame and crops them to a fixed input size.
- Attribute prediction — A TensorFlow model takes the cropped face, passes it through a FaceNet-based encoder to extract a 128-dimensional embedding, then routes that embedding through three separate output heads:
  - Age: regression head (mean absolute error loss)
  - Gender: binary classification (sigmoid output)
  - Race: multi-class classification (softmax over 5 categories)
Using a shared encoder with task-specific heads keeps inference fast: one forward pass through FaceNet per detected face, three predictions out.
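To make the "three predictions out" step concrete, here is a minimal sketch of turning the raw head outputs for one face into readable labels. The function name `decode_predictions`, the 0.5 gender threshold, and the label order (assumed to follow UTKFace's integer encoding) are illustrative assumptions, not taken from the project's code.

```python
from typing import Dict, Sequence

# Assumed label order, following UTKFace's integer encoding.
GENDER_LABELS = ("Male", "Female")
RACE_LABELS = ("White", "Black", "Asian", "Indian", "Others")


def decode_predictions(
    age: float, gender_prob: float, race_probs: Sequence[float]
) -> Dict[str, object]:
    """Convert raw head outputs for one face into human-readable labels."""
    if len(race_probs) != len(RACE_LABELS):
        raise ValueError("expected one probability per race category")
    # Regression head: clamp to a plausible range and round to whole years.
    age_years = int(round(min(max(age, 0.0), 116.0)))
    # Sigmoid head: probability of class 1 (assumed "Female"), thresholded at 0.5.
    gender = GENDER_LABELS[1] if gender_prob >= 0.5 else GENDER_LABELS[0]
    # Softmax head: pick the index with the highest probability.
    race_index = max(range(len(race_probs)), key=lambda i: race_probs[i])
    return {"age": age_years, "gender": gender, "race": RACE_LABELS[race_index]}


result = decode_predictions(31.6, 0.12, [0.05, 0.10, 0.70, 0.10, 0.05])
print(result)  # {'age': 32, 'gender': 'Male', 'race': 'Asian'}
```

In a live loop, the three values would come from a single model call on the cropped face, and the decoded labels would be drawn next to the bounding box.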
Datasets
- UTKFace — ~20,000 labeled face images with age (0–116), gender (binary), and race (5 classes: White, Black, Asian, Indian, Others). Used as the primary training set.
- IMDB-WIKI — a larger, noisier dataset used for age pre-training to improve the regression head's generalization on edge cases (very young, very old).
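UTKFace stores its labels in the file name, in the form `[age]_[gender]_[race]_[timestamp].jpg`. The sketch below shows one way to read them; the names `FaceLabels` and `parse_utkface_filename` are hypothetical, and the project's own loader may differ.

```python
from typing import NamedTuple, Optional


class FaceLabels(NamedTuple):
    age: int
    gender: int  # 0 = male, 1 = female
    race: int    # 0..4: White, Black, Asian, Indian, Others


def parse_utkface_filename(filename: str) -> Optional[FaceLabels]:
    """Parse '[age]_[gender]_[race]_[timestamp].jpg'; return None if malformed."""
    parts = filename.split("_")
    if len(parts) < 4:
        return None  # a few files in the dataset are missing a field
    try:
        age, gender, race = (int(p) for p in parts[:3])
    except ValueError:
        return None
    if not (0 <= age <= 116 and gender in (0, 1) and 0 <= race <= 4):
        return None
    return FaceLabels(age, gender, race)


good = parse_utkface_filename("25_1_2_20170116174525125.jpg")
bad = parse_utkface_filename("39_1_20170116174525125.jpg")
print(good)  # FaceLabels(age=25, gender=1, race=2)
print(bad)   # None
```

Returning `None` for malformed names lets the caller skip those files instead of training on wrong labels.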
Data augmentation (horizontal flip, brightness jitter, random crop) was applied during training to reduce overfitting on UTKFace's relatively modest size.
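The three augmentations can be sketched in a few lines of NumPy. This is a framework-free illustration, not the project's training code: the 160-pixel input, the 144-pixel crop, and the brightness range are made-up values, and the real pipeline may well use TensorFlow's image ops instead.

```python
import numpy as np


def augment(
    image: np.ndarray,
    rng: np.random.Generator,
    crop_size: int = 144,               # hypothetical crop size
    max_brightness_delta: float = 0.2,  # hypothetical jitter range
) -> np.ndarray:
    """Random crop, horizontal flip, and brightness jitter for an HxWx3 float image in [0, 1]."""
    height, width, _ = image.shape
    if crop_size > min(height, width):
        raise ValueError("crop_size must not exceed the image dimensions")
    # Random crop: choose a top-left corner so the window stays inside the image.
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    out = image[top:top + crop_size, left:left + crop_size]
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Brightness jitter: add one random offset to every pixel, then clip to [0, 1].
    delta = rng.uniform(-max_brightness_delta, max_brightness_delta)
    return np.clip(out + delta, 0.0, 1.0)


rng = np.random.default_rng(seed=0)
face = rng.random((160, 160, 3))  # stand-in for a cropped face image
augmented = augment(face, rng)
print(augmented.shape)  # (144, 144, 3)
```

After a random crop, the image would normally be resized back to the encoder's input size before being fed to the model. None of these transforms changes the age, gender, or race label, which is why they are safe to apply here.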
Training Setup
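```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Shared FaceNet encoder (frozen or fine-tuned).
# load_facenet_model is not defined in this snippet; it is assumed to be a
# project helper that returns a Keras model.
base_model = load_facenet_model(weights="vggface2")

# Task-specific heads. Each layer name must match a key in the loss dicts below.
age_output = Dense(1, activation="linear", name="age")(base_model.output)
gender_output = Dense(1, activation="sigmoid", name="gender")(base_model.output)
race_output = Dense(5, activation="softmax", name="race")(base_model.output)

model = Model(
    inputs=base_model.input,
    outputs=[age_output, gender_output, race_output],
)

model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss={
        "age": "mae",
        "gender": "binary_crossentropy",
        "race": "categorical_crossentropy",
    },
    loss_weights={"age": 0.5, "gender": 1.0, "race": 1.0},
)
```

Loss weights balance the scale difference between regression (age MAE in years) and classification losses.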
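To see why the weighting matters, it helps to work through the arithmetic. The sketch below recomputes the three losses in plain Python on a made-up batch of two faces and combines them with the weights from the `compile()` call; the numbers are purely illustrative and are not results from this project.

```python
import math
from typing import Dict, Sequence

# Weights copied from the compile() call.
LOSS_WEIGHTS = {"age": 0.5, "gender": 1.0, "race": 1.0}
EPS = 1e-7  # keeps log() away from zero


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true: Sequence[int], y_pred: Sequence[float]) -> float:
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def categorical_crossentropy(
    true_classes: Sequence[int], y_pred: Sequence[Sequence[float]]
) -> float:
    # true_classes holds the index of the correct class for each sample,
    # i.e. the position of the 1 in the corresponding one-hot row.
    log_probs = (math.log(max(row[c], EPS)) for c, row in zip(true_classes, y_pred))
    return -sum(log_probs) / len(true_classes)


def weighted_total(losses: Dict[str, float]) -> float:
    return sum(LOSS_WEIGHTS[name] * value for name, value in losses.items())


# Made-up batch of two faces, purely for illustration.
losses = {
    "age": mae([30, 60], [34, 52]),
    "gender": binary_crossentropy([0, 1], [0.1, 0.8]),
    "race": categorical_crossentropy(
        [2, 0], [[0.1, 0.1, 0.7, 0.05, 0.05], [0.6, 0.1, 0.1, 0.1, 0.1]]
    ),
}
total = weighted_total(losses)
print({name: round(value, 3) for name, value in losses.items()}, round(total, 3))
```

In this toy batch the age term contributes 3.0 of a total of roughly 3.6 even after the 0.5 weight, because an MAE of 6 years is an order of magnitude larger than the two cross-entropy values.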
Results
On the UTKFace validation set after 50 epochs, with age evaluated in bins:
| Task | Validation Accuracy | Cross-Entropy Loss |
|---|---|---|
| Age (binned) | 73% | 0.63 |
| Race | 84% | 0.38 |
| Gender | 93% | 0.15 |
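The post does not say how the age bins were defined or how the binned scores were derived from the model, so the following is only a sketch of one plausible approach: map both the true and the predicted age to a bin and count matches. The bin edges in `AGE_BIN_EDGES` are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical bin edges: <13, 13-19, 20-29, 30-44, 45-59, 60+.
AGE_BIN_EDGES = (13, 20, 30, 45, 60)


def age_to_bin(age: float) -> int:
    """Return the index of the bin containing `age` (lower edge inclusive)."""
    return bisect_right(AGE_BIN_EDGES, age)


def binned_accuracy(true_ages: Sequence[float], predicted_ages: Sequence[float]) -> float:
    """Fraction of samples whose predicted age lands in the same bin as the true age."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(age_to_bin(t) == age_to_bin(p) for t, p in zip(true_ages, predicted_ages))
    return hits / len(true_ages)


# Made-up ages, purely for illustration.
accuracy = binned_accuracy([5, 25, 41, 70], [9.2, 31.0, 44.5, 58.0])
print(accuracy)  # 0.5
```

Binned accuracy is sensitive to where the edges fall: a prediction of 31 for a true age of 25 counts as a miss here, while 44.5 for 41 counts as a hit.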
Real-time throughput on a mid-range laptop GPU was sufficient for smooth video: the bottleneck is face detection, not the prediction heads.
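One way to check where the time goes is to time each stage separately per frame. The sketch below is a generic profiler with stand-in stages (`fake_detect` and `fake_predict` are hypothetical and simply sleep); in the real pipeline they would be replaced by the detector and the model call.

```python
import time
from typing import Any, Callable, Dict, Iterable


def profile_stages(
    frames: Iterable[Any], stages: Dict[str, Callable[[Any], Any]]
) -> Dict[str, float]:
    """Run the stages in order on every frame; return mean seconds per frame per stage."""
    totals = {name: 0.0 for name in stages}
    count = 0
    for frame in frames:
        data = frame
        for name, stage in stages.items():
            start = time.perf_counter()
            data = stage(data)  # each stage's output feeds the next stage
            totals[name] += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no frames to profile")
    return {name: total / count for name, total in totals.items()}


def fake_detect(frame: Any) -> Any:
    time.sleep(0.02)  # stand-in for face detection
    return frame


def fake_predict(faces: Any) -> Any:
    time.sleep(0.005)  # stand-in for the encoder plus heads
    return faces


timings = profile_stages(range(5), {"detect": fake_detect, "predict": fake_predict})
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms/frame")
```

When timing GPU work this way, the result should be materialized inside the timed region (for example by converting it to a NumPy array), otherwise the measurement may only capture the time to queue the computation.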
What I Learned
Multi-task learning with a shared backbone is a practical pattern when your tasks share low-level features (facial geometry, skin texture) but differ at the output level. The shared encoder acts as a regularizer — tasks with limited data (race categories with fewer examples) benefit from gradient signal flowing through the shared trunk from better-represented tasks.
The bigger lesson: FaceNet embeddings, trained purely for identity verification, transfer well to attribute prediction. You don't need a purpose-built attribute model to get a working prototype — but you do need to be careful about dataset bias, especially for race classification where UTKFace's label distribution is uneven.
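A common first step against an uneven label distribution is to weight each class inversely to its frequency. The sketch below computes such weights with the usual "balanced" heuristic, `total / (num_classes * count)`. It is not something the project is known to have used, and the label list is made up.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int], num_classes: int) -> Dict[int, float]:
    """Weight each class by total / (num_classes * count), so rare classes count more."""
    counts = Counter(labels)
    total = len(labels)
    weights = {}
    for cls in range(num_classes):
        if counts[cls] == 0:
            raise ValueError(f"class {cls} has no examples")
        weights[cls] = total / (num_classes * counts[cls])
    return weights


# Made-up labels: class 0 is three times as common as classes 1 and 2.
toy_labels = [0] * 6 + [1] * 2 + [2] * 2
class_weights = balanced_class_weights(toy_labels, num_classes=3)
print({cls: round(w, 3) for cls, w in class_weights.items()})
# {0: 0.556, 1: 1.667, 2: 1.667}
```

Weights like these could be turned into per-sample weights for the race loss. They rebalance the training signal but do not fix bias in how the labels were assigned, so per-class evaluation is still needed.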
Source code and a notebook walkthrough are available on GitHub.