This tutorial shows how you can use an [OpenAI tutorial](https://platform.openai.com/docs/tutorials/web-qa-embeddings) and change two lines of code to create a hosted Q&A service with Embedbase.
You can try the full process by [downloading the source code](https://github.com/another-ai/embedbase-cookbook/blob/main/apps/web-crawl-q-and-a).
## Inserting data into Embedbase
Instead of storing embeddings locally, we send a simple request to the Embedbase API:
```python
import asyncio
import requests

VAULT_ID = "dev"
URL = "https://embedbase-hosted-usx5gpslaq-uc.a.run.app"
# Find your Embedbase API key at https://app.embedbase.xyz/dashboard
API_KEY = "<your Embedbase API key>"

async def embed(texts):
    # requests is blocking, so run it in a thread
    # so that asyncio.gather can actually run batches in parallel
    return await asyncio.to_thread(
        requests.post,
        URL + "/v1/" + VAULT_ID,
        headers={
            "Authorization": "Bearer " + API_KEY,
            "Content-Type": "application/json",
        },
        json={
            # "documents" is a list like [{"data": "hello world"}, ...]
            "documents": texts,
        },
    )

# create batches of 100 rows of text from the dataframe
batches = []
for i in range(0, len(df), 100):
    batches.append(df.iloc[i:i + 100].text.apply(lambda x: {"data": x}).tolist())

# run the batches in parallel (top-level await works in a notebook)
await asyncio.gather(*[embed(batch) for batch in batches])
```
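The batching loop above simply slices the dataframe into chunks of 100 rows and wraps each text in a `{"data": ...}` document. The same idea can be sketched on a plain list; the `chunk` helper below is illustrative, not part of the Embedbase API:

```python
def chunk(items, size=100):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# each document becomes {"data": <text>}, grouped into batches of 100
texts = [f"document {n}" for n in range(250)]
batches = chunk([{"data": t} for t in texts])
print(len(batches))      # 3 batches: 100 + 100 + 50
print(len(batches[-1]))  # 50
```

Sending batches instead of one document per request keeps the number of HTTP round-trips small without building one huge payload.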
This should take about 15-30 seconds; once it finishes, the embeddings are stored in Embedbase.
## Building a question answer system with your embeddings
Here, we don't need to embed and search locally; Embedbase takes care of it:
```python
import openai
import requests
import tiktoken

# Find your OpenAI API key at https://platform.openai.com/account/api-keys
openai.api_key = "<your OpenAI API key>"
tokenizer = tiktoken.get_encoding("cl100k_base")

def search(query, vault_id):
    return requests.post(
        URL + "/v1/" + vault_id + "/search",
        headers={
            "Authorization": "Bearer " + API_KEY,
            "Content-Type": "application/json",
        },
        json={"query": query},
    )

def create_context(question, max_len=1800):
    """
    Create a context for a question from the most similar
    documents stored in Embedbase
    """
    search_response = search(question, VAULT_ID).json()
    cur_len = 0
    returns = []
    # Add results to the context until the context is too long
    for similarity in search_response["similarities"]:
        sentence = similarity["data"]
        # Add the token count of the result (plus separator overhead)
        cur_len += len(tokenizer.encode(sentence)) + 4
        # If the context is too long, break
        if cur_len > max_len:
            break
        # Else add it to the text that is being returned
        returns.append(sentence)
    # Return the context
    return "\n\n###\n\n".join(returns)
```
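The selection logic in `create_context` is a token-budgeted greedy fill: take results in similarity order until the budget runs out. A minimal sketch of just that logic, using whitespace word counts as a stand-in for the tokenizer so it runs offline (`fill_context` and the sample results are illustrative):

```python
def fill_context(results, max_len):
    """Greedily keep results in order until the token budget is exceeded."""
    cur_len = 0
    kept = []
    for text in results:
        cur_len += len(text.split()) + 4  # word count as a rough token proxy
        if cur_len > max_len:
            break
        kept.append(text)
    return "\n\n###\n\n".join(kept)

results = ["one two three", "four five", "six seven eight nine"]
# budget of 15: 3+4=7, then 2+4 -> 13, the third result would exceed 15
context = fill_context(results, max_len=15)
print(context)  # keeps only the first two results
```

Because results arrive sorted by similarity, truncating at the budget drops the least relevant matches first.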
The following is not very different from the original tutorial:
```python
def answer_question(
    model="text-davinci-003",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    debug=False,
    max_tokens=150,
    stop_sequence=None,
):
    """
    Answer a question based on the most similar context from the stored documents
    """
    context = create_context(question, max_len=max_len)
    # If debug, print the context retrieved from Embedbase
    if debug:
        print("Context:\n" + context)
        print("\n\n")
    try:
        # Create a completion using the question and context
        response = openai.Completion.create(
            prompt=f"Answer the question based on the context below, and if the question can't be answered based on the context, say \"I don't know\"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:",
            temperature=0,
            max_tokens=max_tokens,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=stop_sequence,
            model=model,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
```
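The prompt sent to the model can be inspected in isolation. This sketch reproduces the template from `answer_question` with a hypothetical context string (`build_prompt` is an illustrative helper, not part of either API):

```python
def build_prompt(context, question):
    """Reproduce the prompt template used in answer_question."""
    return (
        "Answer the question based on the context below, and if the "
        "question can't be answered based on the context, "
        "say \"I don't know\""
        f"\n\nContext: {context}\n\n---\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("ChatGPT is a conversational model.", "What is ChatGPT?")
print(prompt)
```

Instructing the model to say "I don't know" when the context is insufficient is what makes the first example question below fail gracefully instead of hallucinating.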
It works:
```python
answer_question(question="What day is it?", debug=False)
answer_question(question="What is our newest embeddings model?")
answer_question(question="What is ChatGPT?")
```
```
"I don't know."
'The newest embeddings model is text-embedding-ada-002.'
'ChatGPT is a model trained to interact in a conversational way. It is able to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.'
```
## Pushing this into the real world
What can you do with this?
For example, you can build a ChatGPT-like interface for your documentation by crawling and ingesting your docs into Embedbase instead.
You could also use a Git repository as an external knowledge base, letting your users ask questions about a project, and eventually generate its documentation.