HubertR21/GCC_Project

Description

Functionality

This app allows uploading an image or entering a text description, and then uses a CLIP (Contrastive Language-Image Pre-Training) neural network to browse our collection of thousands of photos for pictures that are similar to the uploaded file or the provided description. The application returns dozens of results with the highest similarity scores to the input.

UI

The UI is a dynamic web page served from a container on Google Cloud. The page content is not the same for all users, as the API and the model change what is displayed based on each query.

Storage

The images are stored in Google Cloud Storage and all of them are publicly accessible from the internet. They are a subset of the [COCO dataset](https://cocodataset.org/#download). The rest of the files are shipped inside the Docker container, including the CLIP model, the index.pkl file (which holds the image and text embeddings produced by CLIP and indexed with faiss), and filenames.txt, which lists the image names.
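
Below is a minimal sketch of how these artifacts could be loaded and queried at startup. The serialization of index.pkl, the file paths, and the helper name `top_matches` are assumptions made for illustration, not the repository's actual code.

```python
# Sketch only: assumes index.pkl contains a faiss index serialized with
# faiss.serialize_index() and then pickled, which may differ from the repo.
import pickle

import faiss
import numpy as np

with open("index.pkl", "rb") as f:
    index = faiss.deserialize_index(pickle.load(f))

with open("filenames.txt") as f:
    filenames = [line.strip() for line in f]

def top_matches(query_embedding: np.ndarray, k: int = 20) -> list[str]:
    """Return the k image names whose embeddings are closest to the query."""
    # faiss expects a 2-D float32 array of shape (n_queries, dim).
    query = np.asarray(query_embedding, dtype="float32").reshape(1, -1)
    _scores, ids = index.search(query, k)
    return [filenames[i] for i in ids[0]]
```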

Processing

The OpenAI CLIP model is dockerized together with the rest of the application (hidden data + frontend) and deployed using Cloud Run. The application is fully asynchronous: communication with the model and fetching the images are handled through FastAPI.

FastAPI is built on top of the ASGI (Asynchronous Server Gateway Interface) standard, which is designed to handle asynchronous code efficiently. Moreover, it uses the async/await syntax to define asynchronous endpoints and functions.
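
As an illustration, a text-search endpoint could look like the sketch below. The route path, parameters, and response shape are assumptions, and `top_matches` refers to the hypothetical faiss lookup sketched in the Storage section; this is not the repository's actual code.

```python
# Sketch of an asynchronous FastAPI endpoint, assuming the openai/CLIP package.
import clip
import torch
from fastapi import FastAPI

app = FastAPI()
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

@app.get("/search")
async def search(query: str, k: int = 20) -> dict:
    """Encode the text query with CLIP and return the k best-matching images."""
    tokens = clip.tokenize([query]).to(device)
    with torch.no_grad():
        embedding = model.encode_text(tokens).cpu().numpy()
    # top_matches() is the hypothetical faiss lookup from the Storage sketch.
    return {"query": query, "results": top_matches(embedding, k)}
```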

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a wide variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for the task, similarly to the zero-shot capabilities of GPT-2 and GPT-3. CLIP matches the performance of the original ResNet-50 on ImageNet “zero-shot”, without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision. The model (a usage sketch follows the list):

  • Given a batch of images, returns the image features encoded by the vision portion of the CLIP model,
  • Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model,
  • Given a batch of images and a batch of text tokens, returns two Tensors, containing the logit scores corresponding to each image and text input. The values are cosine similarities between the corresponding image and text features, times 100.
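
The snippet below sketches these three behaviours with the openai/CLIP package; the model variant, image path, and captions are placeholders rather than values taken from this project.

```python
# Usage sketch for the behaviours listed above (openai/CLIP package assumed).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)     # placeholder image
text = clip.tokenize(["a dog on a beach", "a plate of food"]).to(device)  # placeholder captions

with torch.no_grad():
    image_features = model.encode_image(image)   # vision portion of CLIP
    text_features = model.encode_text(text)      # language portion of CLIP
    # logit scores: cosine similarity between image and text features, times 100
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)     # relative match probabilities
```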

About

Google Cloud project for Cloud Computing classes on MiNI, WUT
