This study compares the performance of two deep learning models, Convolutional Long Short-Term Memory (ConvLSTM) and Long-term Recurrent Convolutional Network (LRCN), on the task of recognizing human activity in videos. Human activity recognition is an important field in computer vision with many applications, such as security monitoring, human-computer interaction, and social media video analysis. ConvLSTM embeds convolution operations inside the Long Short-Term Memory (LSTM) cell, allowing it to capture spatial and temporal information simultaneously; this makes it well suited to video sequences, which have both spatial and temporal dimensions. LRCN, on the other hand, combines spatial feature extraction by a Convolutional Neural Network (CNN) with temporal sequence modeling by a Recurrent Neural Network (RNN), specifically an LSTM, to learn movement patterns in videos. The study used the UCF50 dataset, which contains 50 activity classes, but the experiments were restricted to five classes to focus the evaluation. The dataset was split into 80% for training and 20% for testing, and each model was trained for up to 50 epochs with early stopping to prevent overfitting. The results show that both models achieved high training performance: ConvLSTM reached a training accuracy of about 98% and a validation accuracy of 90%, while LRCN reached a training accuracy of 99.5% and a validation accuracy of 88%. Although ConvLSTM was more stable on the validation data, further testing on TikTok videos as real-world data showed that LRCN recognized activities with higher confidence, with most predictions scoring above 80%. This difference suggests that while ConvLSTM generalizes better to held-out validation data, LRCN is more robust to the variability of real-world videos.
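
To make the architectural contrast concrete, the following is a minimal Keras sketch of the two models as described above. The clip shape (20 frames of 64×64 RGB), filter counts, and early-stopping patience are illustrative assumptions, not the study's exact configuration; only the five-class output and the 50-epoch / early-stopping setup come from the abstract.

```python
# Minimal sketch of the two architectures compared in the study.
# Sequence length, frame size, and layer widths are assumptions for
# illustration; NUM_CLASSES = 5 follows the study's five-class subset.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 20, 64, 64, 3   # assumed clip shape (frames, height, width, channels)
NUM_CLASSES = 5                     # five UCF50 classes, per the study

def build_convlstm():
    """ConvLSTM: convolution inside the recurrent cell, so spatial and
    temporal structure are modeled jointly."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, H, W, C)),
        layers.ConvLSTM2D(16, (3, 3), activation="tanh", return_sequences=True),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool spatially, keep time axis
        layers.ConvLSTM2D(32, (3, 3), activation="tanh", return_sequences=False),
        layers.Flatten(),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

def build_lrcn():
    """LRCN: a CNN extracts per-frame spatial features (via TimeDistributed),
    then an LSTM models the temporal sequence of those features."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, H, W, C)),
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Conv2D(32, (3, 3), activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(32),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

# Training setup matching the abstract: up to 50 epochs with early stopping.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

model = build_lrcn()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=50, validation_split=0.2,
#           callbacks=[early_stop])  # x_train shape: (N, SEQ_LEN, H, W, C)
```

The key structural difference the study exploits is visible here: ConvLSTM fuses convolution into the recurrence itself, while LRCN keeps spatial feature extraction and temporal modeling in separate, sequential stages.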