#openai #embeddings #text-embedding #similarity #cosine-similarity #infra #ai #llm

Created at 140223

# [Anonymous feedback](https://www.admonymous.co/louis030195)

# [[Epistemic status]]

#shower-thought

Last modified date: 2023-02-14
Commit: 0

# Related

- [[Computing/Intelligence/LLM fine-tuning is obsolete]]
- [[Computing/Intelligence/Machine Learning/GPT3]]
- [[Computing/Intelligence/Machine Learning/Embedding is the dark matter of intelligence]]

# TODO

> [!TODO] TODO

# OpenAI embeddings

## OpenAI content

OpenAI blog: https://openai.com/blog/new-and-improved-embedding-model/

![[Pasted image 20230214113109.png]]
![[Pasted image 20230214113128.png]]
![[Pasted image 20230214113148.png]]
![[Pasted image 20230214113156.png]]

### Non-OpenAI content

https://medium.com/@nils_reimers/openai-gpt-3-text-embeddings-really-a-new-state-of-the-art-in-dense-text-embeddings-6571fe3ec9d9

This blog post discussed the old OpenAI embeddings; someone commented about the new ones:

> Quick follow-up since OpenAI released the '_text-embedding-_**_ada-002_**' embeddings model recently, with claims of improved performance, price cuts, and more (see OpenAI's blog post).
>
> **Short Subjective Takeaways:**
>
> - similarity of the Top-10 matches (over 1M records) is in line with the t5-xl models and all-mpnet-base-v2; the first ten ranked entries tend to be similar across models
>
> - subjective text similarity in the longer tail (Top-1000) is much better in ada-002 compared to the other models. Ada-002 seems to follow my intuitive ranking of companies much better in the middle and long tail
>
> - 1536 float values are more expensive to store, transfer, and process. The semantic dilution due to the larger dimensionality compared to the typical 768-dim embedding is not felt (by me), but the costs for storage and compute are definitely felt (12 GB of memory for 1M vectors can quickly trigger OOM when kept in multiple copies in memory or on smaller compute instances)
>
> - still very expensive: it cost _$70 to run the experiment, vs. $0.50_ in electricity/depreciation to run the same experiment on gtr-t5-xl on local infrastructure - still a >100x difference for me
>
> **Using the new model or not?**
>
> Your use case will have its own tradeoff of costs, resources, and requirements for the user. In my case, I may use ada-002 over the t5-xl models as it barely fits my budget, and the value of the mid-tail improvement in ranking should be perceived by the users. However, I haven't made the switch yet because I'm still looking for an optimal production-ready vector search database.
>
> More details on my experiment are below.
>
> I've run a subjective evaluation on a dataset of 1M organization descriptions, as the update brought the price down. Still, it cost me $70 to get those expensive floats.
>
> The 1536-long ada-002 embedding seems to perform a bit better than gtr-t5-xl, all-mpnet-base-v2, and sentence-t5-xl. The subjective metric was ranking the 1M organization descriptions by cosine similarity, in other words, seeing which companies relate most closely to a given company.
>
> The post I'm attaching this comment to has been an incredible resource, and I wanted to add this comment for people who are coming back to the original post after seeing the ada-002 announcement.
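As a sanity check on the storage point above, here is a minimal back-of-the-envelope sketch. It assumes the vectors are held as float64 (e.g. NumPy's default dtype when building an array from the API's Python floats), which is roughly what the quoted 12 GB figure implies:

```python
import numpy as np  # only needed if you actually materialize the array

# 1M vectors of 1536 dimensions, as returned by text-embedding-ada-002
n_vectors, dim = 1_000_000, 1536

# float64 (8 bytes per value): ~12.3 GB, matching the "12 GB for 1M vectors" above
print(n_vectors * dim * 8 / 1e9)

# float32 (4 bytes per value) halves that to ~6.1 GB,
# typically with negligible impact on cosine-similarity rankings
print(n_vectors * dim * 4 / 1e9)
```

Keeping multiple copies (raw floats plus an index, or per-worker copies) multiplies this further, which is presumably how the commenter hit OOM on smaller instances.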
Summary bullet list of the comment above:

- OpenAI's ada-002 text-embedding model has improved performance, lower cost, and more
- Subjective tests show the similarity of the Top-10 matches is in line with other models, while subjective text similarity in the longer tail (Top-1000) is much better with ada-002
- 1536 float values are more expensive to store, transfer, and process, but the value of the mid-tail improvement may be worth it
- The experiment cost $70, which is still far more expensive than running the same experiment on local infrastructure
- Whether to use the new model or not depends on one's use case and budget

My belief is that OpenAI will keep improving their embeddings, so if you have the energy to build your own infra, do it (I made the mistake of doing it, then switched to OpenAI after some pain); otherwise, sticking to OpenAI will require ~1000x less effort.

Other:
https://towardsdatascience.com/generating-state-of-the-art-text-embeddings-with-hardware-accessible-by-everyone-46bc7d084703
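For reference, a minimal sketch of the kind of experiment described in the comment: embed texts with text-embedding-ada-002 and rank them by cosine similarity against a query. The call uses the pre-1.0 `openai.Embedding.create` API that was current when this note was written; the document texts and query here are made-up placeholders, not the commenter's dataset.

```python
import numpy as np
import openai  # pip install "openai<1.0"

openai.api_key = "sk-..."  # your API key

# Placeholder "organization descriptions"
docs = [
    "Acme Corp builds industrial robots for warehouse automation.",
    "Globex is a cloud provider focused on GPU compute for ML workloads.",
    "Initech sells accounting software to small businesses.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Return an (n, 1536) float32 array of ada-002 embeddings."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]], dtype=np.float32)

doc_vecs = embed(docs)
query_vec = embed(["companies working on machine learning infrastructure"])[0]

def normalize(x: np.ndarray) -> np.ndarray:
    # Cosine similarity is the dot product of L2-normalized vectors;
    # ada-002 embeddings come back close to unit norm, but normalizing is cheap insurance.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

sims = normalize(doc_vecs) @ normalize(query_vec)

# Print documents from most to least similar to the query
for idx in np.argsort(-sims):
    print(f"{sims[idx]:.3f}  {docs[idx]}")
```

At the 1M-record scale of the experiment, the brute-force dot product above would be replaced by an approximate nearest-neighbor index or a vector database, which is exactly the "production-ready vector search" gap the commenter mentions.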