Publication Data
Large-scale Video Classification with Convolutional Neural Networks
Abstract: Convolutional Neural Networks (CNNs) have been established
as a powerful class of models for image recognition problems. Encouraged by these
results, we provide an extensive empirical evaluation of CNNs on large-scale video
classification using a dataset of 1 million YouTube videos belonging to 487 classes. We
study multiple approaches for extending the connectivity of a CNN in time domain to
take advantage of local spatio-temporal information and suggest a multi-resolution,
foveated architecture as a promising way of regularizing the learning problem and
speeding up training. Our best spatio-temporal networks display significant performance
improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a
surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We
further study the generalization performance of our best model by retraining the top
layers on the UCF-101 action Recognition dataset and observe significant performance
improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
