Authors:
Semantically search a database of videos using generated summaries.
The following is an overview of the overall system architecture.
Below is the initial architecture of the video summarization network used to generate video summaries.
Given a minute-long video of traffic in Dhaka, Bangladesh, the network produces the following captions:
('a man riding a bike down a street next to a large truck .', 'a man riding a bike down a street next to a traffic light .', 'a green truck with a lot of cars on it', 'a green truck with a lot of cars on the road .', 'a city bus driving down a street next to a traffic light .')
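Several of the captions above are near-duplicates. As a minimal sketch of how such captions might be collapsed into a shorter summary, the snippet below drops any caption whose word overlap (Jaccard similarity) with an already kept caption is high. This is an illustrative assumption, not the project's actual merging method; `merge_captions` and the `threshold` value are hypothetical.

```python
def tokens(caption):
    """Lowercase a caption and return its set of alphabetic word tokens."""
    return {w for w in caption.lower().replace(".", " ").split() if w.isalpha()}


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_captions(captions, threshold=0.6):
    """Keep a caption only if it is not too similar to one already kept.

    `threshold` is a hypothetical tuning knob, not a value from the project.
    """
    kept = []
    for caption in captions:
        toks = tokens(caption)
        if all(jaccard(toks, tokens(k)) < threshold for k in kept):
            kept.append(caption.strip())
    return kept


captions = (
    'a man riding a bike down a street next to a large truck .',
    'a man riding a bike down a street next to a traffic light .',
    'a green truck with a lot of cars on it',
    'a green truck with a lot of cars on the road .',
    'a city bus driving down a street next to a traffic light .',
)
summary = merge_captions(captions)
for line in summary:
    print(line)
```

With this threshold, the second and fourth captions are dropped as near-duplicates of the first and third, leaving three captions.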
To set up the Python code, create a Python 3 virtual environment as follows:
# create a virtual environment
$ python3 -m venv env
# activate environment
$ source env/bin/activate
# install all requirements
$ pip install -r requirements.txt
# install data files
$ python dataloader.py
If you add a new package you will have to update the requirements.txt with the following command:
# add new packages
$ pip freeze > requirements.txt
And if you want to deactivate the virtual environment:
# deactivate the virtual env
$ deactivate

python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/train2014/
python VideoSearchEngine/ImageCaptioningNoYolo/resize.py --image_dir data/coco/val2014/ --output_dir data/val_resized2014

Broadly, our project attempts video search through video summarization. To do this, we propose the following objectives and resulting action plan:
- Break videos down into semantically different groups of frames
- Recognize objects in an image (i.e. a frame)
- Convert a frame to text
- Merge summaries of all frames of a video into one large overall summary
- Build a search engine to query videos via summary
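The last objective, querying videos by summary, can be sketched with a simple bag-of-words ranking. This is a keyword-based stand-in rather than true semantic search, and it is not the project's implementation: the `search` function, the stop-word list, and the video ids are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter

# Small, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "on", "to", "in", "with"}


def vectorize(text):
    """Turn text into a bag-of-words term-count vector, ignoring stop words."""
    words = text.lower().split()
    return Counter(w for w in words if w.isalpha() and w not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, summaries, top_k=3):
    """Rank video ids by similarity between the query and each summary."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), vid) for vid, text in summaries.items()]
    scored = [(score, vid) for score, vid in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:top_k]]


# Hypothetical video ids mapped to generated summaries.
summaries = {
    "traffic.mp4": "a man riding a bike down a street next to a large truck",
    "kitchen.mp4": "a woman cutting vegetables in a kitchen",
}
results = search("truck on a street", summaries)
print(results)
```

Swapping `vectorize` for a sentence-embedding model would move this closer to semantic search while leaving the ranking logic unchanged.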
For our project, we have set a basic goal that we plan to reach by the time of the presentation, and a stretch goal that we hope to reach if time permits.
Basic Goal: We will recognize objects with the YOLO algorithm, convert each frame to text using the algorithm described in this paper, devise a basic heuristic for skipping frames so that the summary does not contain too much overlap, and surface all of this through a simple UI for searching a video database.
Stretch Goal: Investigate other methods for reducing noise in frames (e.g. Generative Adversarial Networks), and investigate grouping semantically similar frames into one common representation to produce better summaries.
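One possible form of the frame-skipping heuristic is sketched below: a frame is kept only when it differs enough from the last kept frame. This is an assumption about what such a heuristic could look like, not the one the project uses. Frames are modeled as flat, non-empty lists of grayscale values, and `select_keyframes` and `threshold` are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last kept frame.

    `threshold` is a hypothetical value; it would need tuning on real video.
    """
    kept = []
    last = None
    for index, frame in enumerate(frames):
        if last is None or mean_abs_diff(frame, last) > threshold:
            kept.append(index)
            last = frame
    return kept


# Toy 4-pixel grayscale "frames" (values 0-255), flattened to lists.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],        # nearly identical to frame 0, skipped
    [200, 190, 180, 170],   # large change, kept
    [198, 191, 181, 169],   # nearly identical to frame 2, skipped
]
keyframes = select_keyframes(frames)
print(keyframes)
```

Only the kept frames would then be passed to the captioning step, which reduces repeated sentences in the final summary.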
Lots of labeled data for text generation of video summaries.
Paper about how data was collected and performance.
The location of the video dataset: Source
Consists of labeled images for image captioning.
Consists of action videos that can be used to test summaries.
"MED Summaries" is a new dataset for the evaluation of dynamic video summaries. It contains annotations for 160 videos: a validation set of 60 videos and a test set of 100 videos. The test set covers 10 event categories.
- Microsoft Research Paper on Video Summarization
- YOLO Paper for bounding box object detection
- Using YOLO for image captioning
- Unsupervised Video Summarization with Adversarial Networks
- Long-term Recurrent Convolutional Networks
- Coherent Multi-Sentence Video Description with Variable Level of Detail

