RickLopez.io


Just a software engineer letting AI run wild on my blog. Posts are not edited here.

Building a Doc Search with Postgres and Vector Search

9th June 2023


In this blog post, I will show you how to build a doc search system using Postgres, GitHub Actions, and AWS Lambda. A doc search system is a tool that lets you search for documents by their content, not just their metadata. For example, you can find documents that mention a specific topic, concept, or keyword, regardless of their title, author, or date.

To achieve this, we will use a technique called vector search. Vector search is a method of representing documents as vectors in a high-dimensional space, where the distance between vectors reflects the semantic similarity between documents. For example, documents that are about the same topic or have similar meanings will have vectors that are close to each other in the vector space.
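
To make the idea concrete, here is a minimal sketch of how similarity between two embedding vectors is typically measured, using cosine similarity. The three-dimensional vectors and document topics below are invented for illustration; real embedding models produce hundreds or thousands of dimensions:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors:
    near 1.0 means similar direction (similar meaning),
    near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up for this example)
doc_cats = [0.9, 0.1, 0.0]     # a document about cats
doc_kittens = [0.8, 0.2, 0.1]  # a document about kittens -- semantically close
doc_taxes = [0.0, 0.1, 0.9]    # a document about taxes -- unrelated

print(cosine_similarity(doc_cats, doc_kittens))  # close to 1.0
print(cosine_similarity(doc_cats, doc_taxes))    # close to 0.0
```

This is the intuition behind "distance reflects semantic similarity": the cats and kittens vectors point in nearly the same direction, the taxes vector does not.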

To perform vector search, we will use a headless vector search engine called VectorAI. VectorAI is an open-source library that allows you to easily create and deploy vector search engines using state-of-the-art AI models such as GPT-3. VectorAI also provides a data ingestion tool that can automatically extract text from various sources such as PDFs, web pages or images.

The steps to build our doc search system are as follows:

  1. Tech stack: Postgres, GitHub Actions, and Lambda
  2. Toolkit: Headless Vector Search and Data Ingestion
  3. Steps: Prepare Database, Ingest Data, Add Search Interface

Let's go through each step in detail.

Tech stack: Postgres, GitHub Actions, and Lambda

The first step is to choose our tech stack. We will use Postgres as our database to store our documents and their metadata. Postgres is a popular, reliable relational database that supports full-text search and JSON data types.
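
As a quick illustration of that built-in full-text search, a query against a table with a content column (like the docs table we create later in this post) might look like the following; the search phrase is arbitrary:

```sql
-- Find documents whose extracted text matches a phrase, using
-- Postgres's built-in full-text search (tsvector / tsquery).
SELECT id, title
FROM docs
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'vector search')
LIMIT 10;
```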

We will use GitHub Actions as our CI/CD tool to automate the deployment of our doc search system. GitHub Actions is a feature of GitHub that lets you create workflows that run on events such as a push, a pull request, or a schedule. You can use it to build, test, and deploy your code to platforms such as AWS, Azure, or Google Cloud.
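
As a sketch of what such a workflow could look like, the file below runs on every push to main. The file name, job layout, and final deploy step are illustrative placeholders, since the actual deploy command depends on how the Lambda function is packaged:

```yaml
# .github/workflows/deploy.yml (illustrative sketch)
name: deploy-doc-search
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest  # run the test suite before deploying
      # The deploy step depends on your tooling, e.g. the AWS SAM CLI
      # or the Serverless Framework.
```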

We will use AWS Lambda as our serverless compute platform to run our vector search engine. Lambda lets you run code without provisioning or managing servers: you pay only for the compute time you consume, and capacity scales up or down automatically with demand.
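
For a sense of what the Lambda side could look like, here is a minimal sketch of a handler for the search endpoint. The search_fn parameter is a hypothetical stand-in for the vector search call (VectorAI's client.search later in this post), injected so the handler can be exercised without a live database:

```python
import json

def handler(event, context, search_fn=None):
    """Minimal AWS Lambda entry point for a search endpoint (sketch).

    search_fn is a hypothetical stand-in for the real vector search
    call; injecting it keeps the handler testable without a database.
    """
    params = event.get("queryStringParameters") or {}
    query = params.get("query", "")
    if not query:
        return {"statusCode": 400, "body": json.dumps({"error": "missing query"})}
    results = search_fn(query) if search_fn else []
    return {"statusCode": 200, "body": json.dumps({"query": query, "results": results})}

# Local smoke test with a stubbed search function:
event = {"queryStringParameters": {"query": "vector search"}}
response = handler(event, None, search_fn=lambda q: [{"title": "Intro to vectors"}])
print(response["statusCode"])  # 200
```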

Toolkit: Headless Vector Search and Data Ingestion

The second step is to choose our toolkit for vector search and data ingestion. As introduced above, we will use VectorAI as our headless vector search engine.

VectorAI works by encoding your documents into vectors using a pre-trained AI model of your choice. You can then store these vectors in a vector index library such as Faiss or Annoy, or index them in a traditional database such as Postgres or Elasticsearch. This way, you can perform both exact and approximate nearest neighbor searches on your documents.
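
To illustrate what an exact nearest neighbor search does, here is a tiny brute-force sketch in Python; the documents and two-dimensional vectors are invented for the example. Libraries like Faiss and Annoy exist precisely because this linear scan gets slow at scale, which is where approximate indexes come in:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def nearest(query_vec, docs, k=2):
    """Exact nearest neighbor search by brute force: compare the query
    against every stored vector and keep the k closest."""
    return sorted(docs, key=lambda d: dist(query_vec, d["vector"]))[:k]

# Toy corpus with made-up 2-dimensional vectors
docs = [
    {"title": "Intro to embeddings", "vector": [0.9, 0.1]},
    {"title": "Tax law basics",      "vector": [0.0, 1.0]},
    {"title": "Word vectors 101",    "vector": [0.8, 0.3]},
]
print([d["title"] for d in nearest([1.0, 0.0], docs, k=2)])
# ['Intro to embeddings', 'Word vectors 101']
```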

VectorAI also provides a data ingestion tool that can automatically extract text from various sources such as PDFs, web pages or images. You can use this tool to crawl the web or your local files and ingest them into your vector database. You can also specify filters and transformations to clean and enrich your data before ingestion.

Steps: Prepare Database, Ingest Data, Add Search Interface

The third step is to prepare our database, ingest our data and add our search interface.

Prepare Database

To prepare our database, we need to create a table in Postgres that will store our documents and their metadata. We will use the following schema:

CREATE EXTENSION IF NOT EXISTS cube;  -- provides the cube type used for embeddings

CREATE TABLE docs (  
  id SERIAL PRIMARY KEY,  
  title TEXT NOT NULL,  
  url TEXT NOT NULL,  
  content TEXT NOT NULL,  
  vector CUBE NOT NULL  -- the gist index below requires the cube type, not JSONB  
);

The id column is a unique identifier for each document. The title, url and content columns hold the document's title, source URL and extracted text. The vector column holds the document's embedding, stored with the cube type so it can be indexed for nearest neighbor search.

We will also create an index on the vector column using the cube extension of Postgres. The cube extension stores and queries multidimensional points (up to 100 dimensions by default), and its gist index lets us perform fast nearest neighbor searches on our documents.

CREATE EXTENSION IF NOT EXISTS cube;  
CREATE INDEX idx_vector ON docs USING gist (vector);  
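
With vectors stored via the cube extension, its Euclidean distance operator <-> can rank rows by proximity to a query vector. A nearest neighbor query might look like the following; the three-dimensional literal is only illustrative:

```sql
-- Ten documents closest to the query vector (Euclidean distance).
SELECT id, title, url
FROM docs
ORDER BY vector <-> cube(array[0.12, 0.34, 0.56])
LIMIT 10;
```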

Ingest Data

To ingest our data, we will use VectorAI's data ingestion tool. We will use the following command to crawl the web and ingest the documents into our Postgres database:

vectorai ingest --source web --url https://docs.vector.ai --model gpt3 --db postgres --table docs  

This command crawls the web starting from https://docs.vector.ai and follows links recursively. It extracts the text from each page, encodes it into a vector using GPT-3, and inserts the document and its metadata into the docs table in our Postgres database.

We can also use the data ingestion tool to ingest local files such as PDFs or images. For example, we can use the following command to ingest a folder of PDF files into our Postgres database:

vectorai ingest --source local --path /path/to/pdf/folder --model gpt3 --db postgres --table docs  

This command scans the folder, extracts the text from each PDF file using OCR, encodes it into a vector using GPT-3, and inserts it into the docs table in our Postgres database.

Add Search Interface

To add our search interface, we will use VectorAI's search API. The following code creates a simple web app that lets us search for documents by their content:

from flask import Flask, request, render_template
from vectorai import VectorAI

app = Flask(__name__)
client = VectorAI(db="postgres", table="docs")

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/search")
def search():
    query = request.args.get("query")
    results = client.search(query, model="gpt3", k=10)
    return render_template("results.html", query=query, results=results)

if __name__ == "__main__":
    app.run()

This code creates a Flask web app with two routes: / and /search. The / route renders a simple HTML template with a search box. The /search route takes the query from the search box and passes it to VectorAI's client.search method, which encodes the query into a vector using GPT-3, performs a nearest neighbor search on the docs table in our Postgres database, and returns the 10 most similar documents. The route then renders another HTML template that displays the query and the results.

Conclusion

In this blog post, we have seen how to build a doc search system using Postgres, GitHub Actions and Lambda. We used the headless vector search engine VectorAI to encode and search our documents, and its data ingestion tool to extract text from sources such as PDFs, web pages and images.

We have shown how to prepare our database, ingest our data and add our search interface using simple commands and code snippets. We have also demonstrated how to search for documents based on their content, not just their metadata.

To see an example of how to make your blog post searchable by an AI model like GPT-3, you can visit https://docs.vector.ai/blog-search-example. You can also check out the source code of this blog post and the web app on https://github.com/vectorai/doc-search-example.

I hope you enjoyed this blog post and learned something new. If you have any questions or feedback, please feel free to leave a comment below or contact me at hello@vector.ai. Thank you for reading!

AUTHOR

ricklopez
