DeepFake Detection
Project Description
In this project, we will be training and fine-tuning neural networks (NNs) to detect deepfakes in real-time video. We will experiment with CNNs to extract features from individual frames, then potentially use transformers to analyze motion and context across the sequence of frames to detect inconsistencies. We’ll be using datasets like FaceForensics++ (https://paperswithcode.com/dataset/faceforensics-1) and the DeepFake Detection Challenge (https://ai.meta.com/datasets/dfdc/) from Meta.
Source: DeepFake Detection Challenge from Meta
What Are You Planning to Do?
- Preprocess video data: extract frames with OpenCV and align faces (see the frame-extraction sketch after this list).
- (Main goal) Build a CNN, or fine-tune pretrained CNN models (like Xception or ResNet), for image-based feature extraction, frame by frame: each frame is passed through the model to produce a numerical feature vector that encodes key spatial patterns in the image, such as textures and shapes (see the fine-tuning sketch after this list).
- Without transformers: after extracting the spatial features from each frame with the CNN, these features are passed through classification layers to label the frame as real or fake. To account for whole-video context, we could aggregate predictions over multiple frames (for example, by averaging them) into a final classification for the video (see the aggregation sketch after this list).
- If we explore transformers (e.g., ViT or TimeSformer), the sequence of CNN-extracted feature vectors (one per frame) will be the input to the transformer model. In theory, the self-attention mechanism will help the model analyze relationships between frames in the video sequence: for instance, how facial expressions change over time or how other motions evolve between frames, helping detect anomalies that could indicate a deepfake (e.g., inconsistent eye blinking or unnatural head movements). A small transformer sketch follows this list.
- Either way, we will learn to refine the model through hyperparameter tuning and to evaluate its performance.
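To make the preprocessing step concrete, here is a minimal frame-extraction sketch with OpenCV; the sampling rate and output size are assumptions, and face alignment would follow as a separate step (e.g., with a face detector):

```python
import cv2

def extract_frames(video_path, every_n=10, size=(224, 224)):
    """Sample every `every_n`-th frame from a video and resize it."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV loads frames as BGR; convert to RGB for most NN pipelines.
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame, size))
        idx += 1
    cap.release()
    return frames
```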
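For the main goal, a minimal fine-tuning sketch, assuming PyTorch/torchvision and a ResNet-18 backbone (Xception is not in torchvision; it is available in third-party libraries like timm). The final layer is replaced with a single real/fake logit, and everything else is frozen at first:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classifier head
# with a single logit for real-vs-fake.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 1)

# Freeze the pretrained layers and fine-tune only the new head at first.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```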
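For the transformer-free path, a sketch of video-level aggregation by averaging per-frame probabilities; the 0.5 decision threshold is an assumption:

```python
import torch

@torch.no_grad()
def classify_video(model, frames):
    """frames: tensor of shape (num_frames, 3, H, W), already normalized."""
    model.eval()
    logits = model(frames)                  # (num_frames, 1)
    probs = torch.sigmoid(logits).squeeze(1)
    video_score = probs.mean().item()       # average over all sampled frames
    return "fake" if video_score > 0.5 else "real", video_score
```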
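If we go the transformer route, one lightweight option (rather than a full ViT/TimeSformer) is a small transformer encoder over the sequence of per-frame feature vectors. In this sketch, `feat_dim=512` assumes features taken from ResNet-18's penultimate layer; layer counts are assumptions, and positional encodings are omitted for brevity (in practice they would let the model use frame order):

```python
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    """Self-attention over a sequence of per-frame CNN feature vectors."""

    def __init__(self, feat_dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats):           # feats: (batch, num_frames, feat_dim)
        encoded = self.encoder(feats)   # attend across frames
        pooled = encoded.mean(dim=1)    # average-pool over time
        return self.head(pooled)        # one real/fake logit per video
```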
Why Is Your Project Exciting/Fun/Educational?
I’m excited about this project because it is a fun way to experiment with different technologies and see their limits firsthand. Through this project, I will be able to learn about data processing, fine-tuning, how to build a CNN, and more, skills that are transferable to other domains.
What Does Success Look Like?
- Create a model that is able to distinguish real from fake videos with good accuracy.
- Make a simple website so other people can upload a video and get a prediction (a minimal upload-endpoint sketch follows this list).
- (Most importantly) the ability to transfer these skills to other domains in machine learning: how to do data processing, how to work with large datasets, how to build a NN, etc.
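For the demo website, a minimal upload-endpoint sketch, assuming Flask; `run_detector` is a hypothetical stand-in for the extraction-plus-classification pipeline sketched earlier:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def run_detector(video_path):
    # Placeholder: wire in extract_frames + classify_video from the sketches above.
    return {"label": "real", "score": 0.12}

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded video to a temp file, then run the pipeline on it.
    file = request.files["video"]
    path = "/tmp/upload.mp4"   # assumption: fixed temp path for the demo
    file.save(path)
    return jsonify(run_detector(path))

if __name__ == "__main__":
    app.run(debug=True)
```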
Note: For me, I also think there are a lot of potential limitations to this approach, like not having access to a GPU, figuring out how to properly train transformers on video sequences (including how to handle the feature embeddings passed between the models), and making real-time detection actually feasible. It’s definitely ambitious, but with some tweaks (like fine-tuning instead of full training, or using a smaller dataset), I think it can still work. We can also explore LSTMs instead of transformers for potentially more manageable training (see the sketch below).
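A minimal sketch of the LSTM alternative, operating over the same per-frame CNN features as the transformer sketch above; the hidden size is an assumption:

```python
import torch
import torch.nn as nn

class LSTMVideoClassifier(nn.Module):
    """LSTM over per-frame CNN features: a lighter alternative to attention."""

    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats):           # feats: (batch, num_frames, feat_dim)
        _, (h_n, _) = self.lstm(feats)  # h_n: (1, batch, hidden)
        return self.head(h_n[-1])       # logit from the last hidden state
```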