To gauge the quality of an open source LLM project, look at how long its prompts are.
If the prompts are long, expect the LLM to be narrow and unable to generalise across a wide range of domains and problem spaces.
## Prompt Length as a Proxy for LLM Generalization
The hypothesis suggests that the length of the prompts required to elicit desired responses from a large language model (LLM) inversely correlates with the model's ability to generalize. To formalize this, we can borrow from Kolmogorov complexity, which in its original form measures the complexity of a string as the length of the shortest program that produces that string as output.
### Kolmogorov-Inspired Metric for LLM Generalization:
Let's define a function `K_LLM(prompt, response)` that measures the "complexity" of an LLM with respect to a given task. Its value is the length of the shortest prompt that elicits a satisfactory response from the LLM (note that the minimum depends only on the desired response; the prompt argument is kept for notational symmetry with the task distribution below).
$$ K_{LLM}(\text{prompt}, \text{response}) = \min\{\, |p| : LLM(p) = \text{response} \,\} $$
Where:
- `|p|` is the length of prompt `p`.
- `LLM(p)` is the response generated by the LLM for prompt `p`.
- `response` is the desired output.
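Computing this minimum exactly is intractable, but an empirical upper bound can be sketched by searching over a finite pool of candidate prompts. A minimal sketch, assuming hypothetical `llm` and `is_satisfactory` callables (neither is defined in this note):

```python
def estimate_k_llm(candidate_prompts, llm, is_satisfactory):
    """Empirical upper bound on K_LLM: the length of the shortest
    candidate prompt whose response passes the check.
    Returns None if no candidate succeeds."""
    lengths = [len(p) for p in candidate_prompts if is_satisfactory(llm(p))]
    return min(lengths) if lengths else None
```

Because the search covers only a finite candidate pool, this is an upper bound on the true `K_LLM`, never the exact value.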
### Generalization Metric:
We can define a generalization metric `G_LLM` for an LLM as the expected value of the inverse of `K_LLM` over a distribution of tasks `T` that we want the LLM to perform.
$$ G_{LLM} = \mathbb{E}_{(\text{prompt}, \text{response}) \sim T}\left[ \frac{1}{K_{LLM}(\text{prompt}, \text{response})} \right] $$
A higher `G_LLM` indicates a better ability to generalize, as it implies that, on average, shorter prompts are needed to produce the correct response.
### Practical Considerations:
In practice, calculating `K_LLM` exactly is infeasible, as it would require an exhaustive search over all possible prompts. However, we can estimate it by sampling a set of tasks and measuring the prompt lengths required for the LLM to generate satisfactory responses.
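One such sampling-based estimator can be sketched as follows; `llm` is again a hypothetical callable, and each sampled task pairs a pool of candidate prompts with its own satisfaction check:

```python
def estimate_g_llm(tasks, llm):
    """Monte Carlo estimate of G_LLM over a sample of tasks.

    tasks: list of (candidate_prompts, is_satisfactory) pairs.
    A task with no satisfactory prompt contributes 0 (its K is
    treated as infinite), so failures lower the score."""
    scores = []
    for candidate_prompts, is_satisfactory in tasks:
        lengths = [len(p) for p in candidate_prompts
                   if is_satisfactory(llm(p))]
        scores.append(1.0 / min(lengths) if lengths else 0.0)
    return sum(scores) / len(scores)
```

Treating unsolvable tasks as contributing zero is one design choice among several; it keeps the estimator defined even when the model fails a task outright.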
### Conclusion:
This formalization captures the essence of the hypothesis: a more general LLM requires shorter prompts to produce correct responses across a diverse set of tasks, which is reflected in a higher `G_LLM` score.
#kolmogorov-complexity #llm-generalization #metrics #llm
### Related
- [[Kolmogorov complexity]]