YouTube Video Recommendation System Using Pinecone Vector DB
In this blog, we will see how to create a YouTube recommendation system using Pinecone, one of the leading vector databases on the market. A vector database is a type of database designed to efficiently store, index, and manage vector data. We will store vectorized versions of YouTube videos in Pinecone and later run a similarity search to find the videos whose vectors are closest to a given one.
Let’s break down the process step by step.
a) Audio extraction from videos: In this step, we will extract an mp3 audio file from each YouTube video using its link. We will use a library such as pytube, which downloads the audio for the YouTube link passed to it. We will store the audio file path alongside the link in a dataframe.
b) Transcription of the mp3 files: In this step, we will use OpenAI's Whisper model to transcribe the mp3 files. The dataframe now holds the video link, the audio file path, and the transcription of the audio file.
c) Vectorizing the transcriptions: We can use OpenAI's Ada embedding model or open-source alternatives such as all-MiniLM-L6-v2 from the sentence-transformers library on Hugging Face to vectorize our texts. The dataframe now holds the video link, the audio file path, the transcriptions, and the embeddings.
d) Pinecone vector storage with metadata: We will create an index in Pinecone, connect to it, and then upsert our embeddings together with their metadata into the index.
e) Querying for recommendations: Now that the vectorized versions of all the videos are in the database, we can query for the N closest videos to recommend. Pinecone supports Euclidean distance, cosine similarity, and dot product as similarity metrics; we choose one when creating the index and then query it for the closest results.
In this way, for any video in the system, we can retrieve the closest N videos as recommendations for what the user should watch next.
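To make steps d) and e) concrete, here is a minimal in-memory sketch of what the query does conceptually. It is not the Pinecone client and not the code from the notebook: the `VIDEOS` records, their tiny two-dimensional vectors, the placeholder links, and the `top_n` helper are all hypothetical, made up for illustration. The record layout (`id`, `values`, `metadata`) and the metric names (`cosine`, `euclidean`, `dotproduct`) are chosen to resemble what Pinecone uses, but a real index would hold embeddings with hundreds of dimensions and do the search for us.

```python
import math

# Hypothetical toy records. In the real pipeline, "values" would be the
# transcription embedding and "metadata" would come from the dataframe.
VIDEOS = [
    {"id": "vid-1", "values": [1.0, 0.0], "metadata": {"link": "placeholder-link-1"}},
    {"id": "vid-2", "values": [0.9, 0.1], "metadata": {"link": "placeholder-link-2"}},
    {"id": "vid-3", "values": [0.0, 1.0], "metadata": {"link": "placeholder-link-3"}},
    {"id": "vid-4", "values": [0.5, 0.5], "metadata": {"link": "placeholder-link-4"}},
]


def dot(a, b):
    """Dot product similarity: larger means closer."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product of the length-normalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


METRICS = {"cosine": cosine, "dotproduct": dot, "euclidean": euclidean}


def top_n(query_id, records, n=2, metric="cosine"):
    """Return the n records closest to the record with id `query_id`."""
    if metric not in METRICS:
        raise ValueError(f"unknown metric: {metric}")
    score_fn = METRICS[metric]
    query = next(r for r in records if r["id"] == query_id)

    # Score every other video against the query video.
    scored = [
        (r["id"], score_fn(query["values"], r["values"]))
        for r in records
        if r["id"] != query_id
    ]

    # Euclidean is a distance (sort ascending); the other two are
    # similarities (sort descending).
    scored.sort(key=lambda pair: pair[1], reverse=(metric != "euclidean"))
    return scored[:n]


if __name__ == "__main__":
    for name in METRICS:
        print(name, top_n("vid-1", VIDEOS, n=2, metric=name))
```

With this toy data, all three metrics rank `vid-2` and then `vid-4` as the closest videos to `vid-1`. On real embeddings the metrics can disagree, which is why the choice of metric for the index matters.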
Code pipeline for the entire flow:
GitHub: https://github.com/akshaytheau/Data-Science/blob/master/YouTube_recommendation_pinecone.ipynb