To measure the quality of an open source LLM project, look at how long its prompts are. If the prompts are long, expect the LLM to be narrow and unable to generalize across a wide range of domains and problem spaces.

## Prompt Length as a Proxy for LLM Generalization

The hypothesis suggests that the length of the prompts required to elicit desired responses from a Language Model (LLM) inversely correlates with the model's ability to generalize. To formalize this, we can use a concept inspired by Kolmogorov complexity, which in its original form measures the complexity of a string as the length of the shortest program that produces that string as output.

### Kolmogorov-Inspired Metric for LLM Generalization:

Let's define a function `K_LLM(prompt, response)` that measures the "complexity" of an LLM with respect to a given prompt and its response. This function is the length of the shortest prompt that can be used to generate a satisfactory response from the LLM.

$$
K_{LLM}(prompt, response) = \min\{ |p| : LLM(p) = response \}
$$

Where:

- `|p|` is the length of prompt `p`.
- `LLM(p)` is the response generated by the LLM for prompt `p`.
- `response` is the desired output.

### Generalization Metric:

We can define a generalization metric `G_LLM` for an LLM as the expected value of the inverse of `K_LLM` over a distribution of tasks `T` that we want the LLM to perform.

$$
G_{LLM} = \mathbb{E}_{(prompt, response) \sim T} \left[ \frac{1}{K_{LLM}(prompt, response)} \right]
$$

A higher `G_LLM` indicates a better ability to generalize, since it implies that, on average, shorter prompts suffice to produce the correct response.

### Practical Considerations:

In practice, calculating `K_LLM` exactly is infeasible, as it would require an exhaustive search over all possible prompts. However, we can estimate it by sampling a set of tasks and measuring the prompt lengths required for the LLM to generate satisfactory responses.
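The sampling-based estimate can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `llm` callable and the per-task `judge` function (which decides whether a response is satisfactory) are hypothetical placeholders you would supply, and prompt "length" is measured here in characters for simplicity.

```python
from statistics import mean

def estimate_k_llm(llm, candidate_prompts, judge):
    """Approximate K_LLM: length of the shortest sampled prompt
    whose response the judge accepts. Returns None if none succeed."""
    lengths = [len(p) for p in candidate_prompts if judge(llm(p))]
    return min(lengths) if lengths else None

def estimate_g_llm(llm, tasks):
    """Approximate G_LLM as the mean of 1 / K_LLM over sampled tasks.
    `tasks` is a list of (candidate_prompts, judge) pairs; tasks where
    no sampled prompt succeeds are skipped."""
    ks = [estimate_k_llm(llm, prompts, judge) for prompts, judge in tasks]
    ks = [k for k in ks if k is not None]
    return mean(1 / k for k in ks) if ks else 0.0

# Toy usage with a stand-in "LLM" that just upper-cases its prompt.
def toy_llm(prompt):
    return prompt.upper()

tasks = [
    (["hi", "hello there"], lambda r: r == "HI"),
    (["please say ok", "ok"], lambda r: r == "OK"),
]
print(estimate_g_llm(toy_llm, tasks))  # both tasks have K_LLM = 2, so G_LLM = 0.5
```

Because the search is restricted to the sampled candidate prompts, this only upper-bounds each `K_LLM`, so the resulting `G_LLM` is a lower-bound estimate.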
### Conclusion:

This formalization captures the essence of the hypothesis: a more general LLM requires shorter prompts to produce correct responses across a diverse set of tasks, which is reflected in a higher `G_LLM` score.

#kolmogorov-complexity #llm-generalization #metrics #llm

### Related

- [[Kolmogorov complexity]]