Abstract

We are interested in the task on inventing new conjectures about mathematical theories: a task known as theory exploration or automated conjecturing. In this talk, I will discuss our experiences of investigating the capabilities of the GPT-4 model on this task. We experiment with generating lemmas in Isabelle/HOL syntax, of a similar kind as our symbolic system, Hipster might produce. We see potential in leveraging neural methods for theory exploration, for instance leveraging the flexibility and pattern recognition abilities of large language models (LLMs) compared to purely symbolic methods. On the downside, we note that it is difficult to assess how well the space of potential conjectures has been covered, as the stochastic nature of LLMs means that similar prompt often gives different results on different runs. It is also very difficult to fairly evaluate a proprietary system like GPT-4, and we suspect many of our benchmarks may in fact be in its training data in some form or another.