Perplexity, the Highest-Valued AI Search Unicorn, Uses an Inverted Index for RAG

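In March I wrote a blog post, "Do We Still Need Inverted Indexes in the Age of LLMs?", which discussed why inverted indexes remain valuable even as vector databases rise.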

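Some people may still dismiss traditional search technology as worthless. A few days ago, though, I came across Lex Fridman's interview with Perplexity CEO Aravind Srinivas, in which he said that Perplexity still uses (though probably not exclusively) an inverted index and the traditional BM25 algorithm for web-scale search. Perplexity is an AI search unicorn valued at USD 3 billion, yet it also chose inverted indexes for web-scale search, and the interview puts heavy emphasis on their importance. I take that as support for my view.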

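His thinking also gave me some ideas. To preserve the internet's memory, I have recorded the relevant excerpts below, with my Chinese translation followed by the original English: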

Excerpts from the Aravind Srinivas interview, June 20, 2024

The podcast runs about 3 hours; you can start watching at 2:03:52.

Lex Fridman (02:03:52)

这是一个通过（将内容）嵌入到某种向量空间实现的完全的机器学习系统吗？
Is that a fully machine learning system with embedding into some kind of vector space?

Aravind Srinivas (02:03:57)

它并不是纯粹的向量空间。并不是说内容一旦被获取，就会有某种 BERT 模型在所有内容上运行并将其放入一个巨大的向量数据库中进行检索。
It’s not purely vector space. It’s not like once the content is fetched, there is some BERT model that runs on all of it and puts it into a big, gigantic vector database which you retrieve from.

这并不是那样的，因为将网页的所有知识打包到一个向量空间表示中是非常困难的。首先，向量嵌入在处理文本时并没有想象中的那么神奇。理解一个文档是否与一个特定检索词相关非常困难。它应该是与检索词中的人物有关，还是与检索词中的特定事件有关，更或者它是基于对检索词更深层次的理解，使得检索词可以应用到另外一个人物，他也需要被检索到？什么才是（向量）表示应该捕捉的内容？大家可能会不断争论。而让这些向量嵌入具有不同维度，彼此解耦并捕捉不同语义是非常困难的。
It’s not like that, because packing all the knowledge about a webpage into one vector space representation is very, very difficult. First of all, vector embeddings are not magically working for text. It’s very hard to understand what’s a relevant document to a particular query. Should it be about the individual in the query or should it be about the specific event in the query or should it be at a deeper level about the meaning of that query, such that the same meaning applying to a different individual should also be retrieved? You can keep arguing. What should a representation really capture? And it’s very hard to make these vector embeddings have different dimensions, be disentangled from each other, and capturing different semantics.

顺便说一句，这只是排序部分的内容。还有索引部分——假设你有一些处理后的 URL，排序模块会基于你提问的检索词，从索引中召回相关的文档和一些打分。这就是为什么，当你的索引中有数十亿个页面，而你只需要前 K 个结果时，你必须依赖近似算法来获取这些前 K 个结果。
This is the ranking part, by the way. There’s the indexing part, assuming you have a post-process version for URL, and then there’s a ranking part that, depending on the query you ask, fetches the relevant documents from the index and some kind of score. And that’s where, when you have billions of pages in your index and you only want the top K, you have to rely on approximate algorithms to get you the top K.

Lex Fridman (02:05:25)

所以这就是排序，但是将页面转换为可以存储在向量数据库中的内容，这一步似乎非常困难。
So that’s the ranking, but that step of converting a page into something that could be stored in a vector database, it just seems really difficult.

Aravind Srinivas (02:05:38)

它并不必须全存在向量数据库中。你可以使用其他数据结构和其他形式的传统检索。有一种算法叫做 BM25，它正是为此设计的，是 TF-IDF 的一个更复杂的版本。TF-IDF 是词频乘以逆文档频率，是一种非常老派的信息检索系统，实际上即使在今天也仍然非常有效。BM25 是它的一个更复杂的版本，仍然在许多排名中击败大多数嵌入。当 OpenAI 发布他们的嵌入时，围绕它有一些争议，因为在许多检索基准测试中，它甚至没有击败 BM25，这并不是因为他们做得不好，而是因为 BM25 太好了。这就是为什么仅靠纯粹的嵌入和向量空间并不能解决搜索问题。你需要传统的基于词项的检索，你需要某种基于 NGram 的检索。
It doesn’t always have to be stored entirely in vector databases. There are other data structures you can use and other forms of traditional retrieval that you can use. There is an algorithm called BM25 precisely for this, which is a more sophisticated version of TF-IDF. TF-IDF is term frequency times inverse document frequency, a very old-school information retrieval system that just works actually really well even today. And BM25 is a more sophisticated version of that, that is still beating most embeddings on ranking. When OpenAI released their embeddings, there was some controversy around it because it wasn’t even beating BM25 on many retrieval benchmarks, not because they didn’t do a good job. BM25 is so good. So this is why just pure embeddings and vector spaces are not going to solve the search problem. You need the traditional term-based retrieval. You need some kind of NGram-based retrieval.
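
To make the BM25 description above concrete, here is a minimal sketch of BM25 scoring over an in-memory inverted index. It is an illustration only and not Perplexity's implementation: the class name `InvertedIndex`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the `log(1 + ...)` form of IDF are all assumptions chosen for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: naive lowercase + whitespace tokenization, enough for a demo.
    return text.lower().split()


class InvertedIndex:
    """Hypothetical toy index: term -> {doc_id: term frequency}."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1              # term-frequency saturation (assumed value)
        self.b = b                # document-length normalization (assumed value)
        self.postings = defaultdict(dict)
        self.doc_len = {}

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, top_k=3):
        n = len(self.doc_len)
        if n == 0:
            return []
        avgdl = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            plist = self.postings.get(term)
            if not plist:
                continue  # term not in the index: contributes nothing
            df = len(plist)
            # IDF part: rare terms weigh more than common ones.
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            for doc_id, tf in plist.items():
                # TF part: saturates with k1, normalized by document length with b.
                norm = tf + self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avgdl)
                scores[doc_id] += idf * tf * (self.k1 + 1) / norm
        # Highest score first; ties broken by doc_id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


index = InvertedIndex()
index.add("a", "inverted index with bm25 ranking")
index.add("b", "vector database with embeddings")
index.add("c", "bm25 bm25 inverted index")

for doc_id, score in index.search("bm25 inverted"):
    print(doc_id, round(score, 3))
```

Only documents that share at least one term with the query are ever touched, which is the property that lets an inverted index scale to very large collections. Compared with plain TF-IDF, the two additions are the saturation of term frequency (`k1`) and the length normalization (`b`).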

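Excerpts from the Aravind Srinivas interview, December 14, 2023

In fact, when Jacob Effron of Unsupervised Learning interviewed Aravind Srinivas last year, he made similar remarks, though without stressing inverted indexes as much as in the latest interview. This podcast is also long; you can start watching at 28:08.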

Aravind Srinivas (28:08)

很多人认为，既然我们在网页搜索的 RAG 方面如此擅长，那么 Perplexity 就能轻松搞定内部搜索。不，这是完全不同的两回事。
A lot of people think that because we are so good at RAG for web search, Perplexity will just nail internal search. No, it's two completely different entities.

Google Drive 的搜索之所以糟糕是有原因的。你可能会想，Google作为网页搜索的王者，怎么会这么差劲？他们之所以差劲，是因为索引机制完全不同，需要训练的嵌入模型也完全不同。这不仅仅是嵌入模型的问题，还包括你如何对页面进行摘要、如何进行文本召回——如何使用传统的基于TF-IDF的倒排索引来构建弹性索引，这些都大不相同。
There is a reason why search on Google Drive sucks. Like would you expect Google the king of web search to be so bad? They're bad because of a reason that the indexing is so different, the embeddings that you got to train are so different. Not just the embeddings, but even the way you snippet a page, your text retrieval, the elastic index that you're building with traditional TF-IDF based inverted indexes are so different.

所以需要某家公司专注于这种场景，就像我们专注于网页搜索场景一样。RAG 是一项非常艰巨的任务，在生成式 AI 之外还有很多工作需要完成。这不仅仅是训练一个大型的嵌入模型就能解决的问题。
That you need a company to just obsessively focus on that use case. Just like how we are obsessively focused on the web search use case. So RAG is going to be pretty hard and there's a lot of work that needs to be done outside of generative AI. It's not just training a large embedding model and you're done.

我记得当 OpenAI 发布嵌入 API 时，Sam Altman（OpenAI的CEO）在推特上说，下一个万亿公司可能只需要接入这个 API 就可以建立起来。这虽然是一种很好的营销方式，但事实并非如此。他当时说的是一万亿美元。所以，当听到有人声称他们已经解决了 RAG 问题时，我会非常谨慎。他们可能只是在某个场景下做得很好。
I remember like when OpenAI released embedding API, Sam Altman tweeted the next 1 trillion company can just plug into this API and be built. That's not true. It's a good way to market it. But that's not true at all. Sorry, he said 1 trillion dollars. So I think that's why I would be very careful when somebody makes claims that they've solved RAG. They probably can do it really well for one use case.

此外，在排名中还有很多其他因素需要考虑，才能使答案真正出色。即使 LLM 最终决定了哪些链接用于答案，你也不能仅仅将一些垃圾信息放入 prompt 中，就指望它能神奇地只选择最相关的链接，并在答案中给出引用。
You know and also there are so many more things to handle in the ranking. That'll make the answer really good. Cuz even though the LLMs are finally the ones that are taking which links to use for the answer, it's not like you can just dump garbage into the prompt and it'll just be magically so good that it'll only take the top most relevant links in the answer and give you the citations with them.

事实上，你向这些长上下文模型提供的信息越多，最终出现幻觉的可能性就越大。因此，你实际上需要在检索模块上投入大量工作，不仅仅是嵌入向量，在索引、嵌入向量和排序上都要投入。排序也应该包含很多除了向量内积之外的其他信号，具体是哪些信号事实上取决于你的场景。
In fact the more information you throw at these really long context models, the more chances that you have a hallucination at the end. So you actually have to do a lot of work in the retrieval component, not just the embeddings, the indexing, the embeddings and the ranking. Ranking should also have a lot of signals outside of just the vector dot products. And then what those signals are really depends on your use case.

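Analysis (as of August 2024)

Judging from what Aravind has publicly disclosed at these different times, it seems fairly safe to say that Perplexity currently relies mainly (if not entirely) on inverted indexes in the recall stage, uses embedding vectors and other signals in the ranking stage, and places significant weight on signals other than embeddings.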
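
The two-stage shape described in this analysis (recall from an inverted index, then ranking with embeddings plus other signals) can be sketched as follows. This is a guess at the general pattern and not Perplexity's pipeline: the function names, the two-dimensional toy vectors, the `freshness` signal, and the weights are all hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    # Inverted index: term -> set of doc ids containing that term.
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in doc["text"].lower().split():
            index[term].add(doc_id)
    return index


def recall(index, query):
    # Stage 1: term-based recall. Any document sharing a query term is a candidate.
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def rerank(docs, candidates, query_vec, weights=(0.7, 0.3), top_k=2):
    # Stage 2: combine the vector dot product with a non-vector signal.
    # "freshness" and the weights are made-up stand-ins for real ranking signals.
    w_vec, w_fresh = weights
    scored = [
        (w_vec * dot(query_vec, docs[d]["vec"]) + w_fresh * docs[d]["freshness"], d)
        for d in candidates
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [d for _, d in scored[:top_k]]


docs = {
    "d1": {"text": "bm25 inverted index", "vec": (1.0, 0.0), "freshness": 0.2},
    "d2": {"text": "inverted index for web search", "vec": (0.8, 0.2), "freshness": 0.9},
    "d3": {"text": "vector database embeddings", "vec": (0.9, 0.1), "freshness": 1.0},
}
index = build_index(docs)
candidates = recall(index, "inverted index")
print(sorted(candidates))
print(rerank(docs, candidates, query_vec=(1.0, 0.0)))
print(rerank(docs, candidates, query_vec=(1.0, 0.0), weights=(1.0, 0.0)))
```

In this toy data, `d3` never reaches the ranking stage because it shares no term with the query, and the order of `d1` and `d2` flips depending on whether the non-vector signal is weighted in. That matches the point made in the interviews: which extra signals to use, and how much weight to give them, depends on the use case.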