This joint WG4-WG5 meeting consists of:

  • a Workshop on Natural Formal Mathematics (NatFoM) (September 6-7).

  • a Workshop on Libraries of Formal Proofs and Natural Mathematical Language (September 7-8).

The session on September 7 is common to the two workshops.

Registered participants (24)

Frédéric Blanqui, Angeliki Koutsoukou-Argyraki, Peter Koepke, Mateusz Marcol, Bernhard Schröder, Adrian De Lon, Claudio Sacerdoti Coen, Shashank Pathak, William M. Farmer, Patrick Schäfer, Katie Collins, David Berry, Anand Rao Tadipatri, Roussanka Loukanova, Kyle Miller, Ján Perháč, Mehmet Tahir Sandikkaya, Ursula Martin, Marcel Schütz, Sky Wilshaw, Adrian Groza, Ayşe Sayın, Lucy Horowitz, Farah Al Wardani.

Overall programme

Day and time Session
6 September 14:00-18:00 Workshop on Natural Formal Mathematics
7 September 14:00-18:00 Joint session between the two workshops
8 September 14:00-18:00 Workshop on Libraries of Formal Proofs and Natural Mathematical Language


Lecture Theatre 1, William Gates Building/Computer Laboratory, Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK

Note that this meeting takes place during the 16th Conference on Intelligent Computer Mathematics, held on September 4-8 at Emmanuel College. Emmanuel College, St Andrew’s St, is 15 minutes away by bus, 35 minutes on foot, and 10 minutes by bike.

NatFoM: Workshop on Natural Formal Mathematics, Cambridge, 6-7 September 2023

In (pure) mathematics there has always existed a strong informal sense of “naturality” of topics and methods. Generally, “natural” theories, notions, properties, or proofs are preferred over technical, convoluted, or counterintuitive approaches. If formal mathematics is to become part of mainstream mathematics, its formalizations and usage apparently have to become more “natural”, and thus closer to informal mathematics.

This workshop broadly addresses the issue of naturality in formal mathematics. On 7 September there will be a joint afternoon session with the Workshop on Libraries of Formal Proofs and Natural Mathematical Language. We plan open discussions on naturalness in formal mathematics.

Topics of interest include (but are not limited to):

  • The notion of naturality in mathematics generally
  • Natural mathematical language
  • Translations between informal and formal mathematics using deep learning or classical techniques
  • Natural input and output languages for formal mathematics systems
  • Controlled natural languages (CNL) for mathematics
  • Making formal mathematics documents readable
  • Naturality of foundational theories (types, sets, HOL, …)
  • Naturality of proof methods
  • Natural proof structures and granularities
  • Natural structurings of formalized mathematical texts and libraries
  • Mathematical typesetting (LaTeX) and formal mathematics
  • Natural formalizations

At least one author of each accepted extended abstract/demonstration proposal is expected to attend NatFoM and to present the submission in person. Accepted abstracts will be made available online (one-column CEUR-WS style).

Program Committee:

  • Adrian De Lon, Bonn (co-chair)
  • Peter Koepke, Bonn (co-chair)
  • Andrea Kohlhase, Neu-Ulm
  • Angeliki Koutsoukou-Argyraki, Royal Holloway, University of London and Cambridge
  • Dennis Müller, Erlangen (co-chair)

Workshop on Libraries of Formal Proofs and Natural Mathematical Language, Cambridge, 7-8 September 2023

We welcome contributions (in the form of short talks, demos or presentations on work in progress) on the topics of library descriptions, or on methodologies, algorithms and tools that tackle the challenges raised by the development, maintenance, refactoring and search of libraries of formal proofs.

We especially welcome contributions focussing on the role of natural language for formal proof libraries, in a very broad sense: natural language processing as a tool to tackle the aforementioned challenges; libraries of formal proofs as training data for large language models and autoformalisation; bridging the gap between formal mathematics and natural language mathematics.

Accepted abstracts will be made available online.

Please note that talks will be in person, therefore at least one author of each accepted submission is expected to attend the workshop and to present the submission in person.

Programme committee:

Abstract submission and funding applications


  • Deadline for funding applications: 23 July.
  • Deadline for abstract submissions: 23 July.
  • Notification of acceptance (abstracts and funding) will be given between 24 and 30 July. Priority for funding will be given to people submitting an abstract.
  • Camera-ready versions: 20 August.

Funding can be offered to some participants who are members of EuroProofNet (membership application is free).

Before applying, check the reimbursement rules. The daily allowance has been fixed at 175 euros.

For funding applications you need to provide an estimate, in euros, of your transport costs to Cambridge, with a screen capture as proof, and with the date of your arrival in Cambridge and the date of your departure from Cambridge.

As usual it is possible to arrive a few days before and leave a few days after the meeting but a maximum of 4.4 daily allowances may be reimbursed.

If you want to give a talk or if you want to apply for funding, fill out the Registration Form.


NatFoM (Wednesday, September 6th)

Time Speaker Title
    Session 10C: Invited talk & Paper presentation (Chair: Dennis Müller)
14:00 - 15:00 Roussanka Loukanova Invited Talk Type Theories for Computational Grammar of Natural Language [slides]
15:00 - 15:30 Peter Koepke Experiences with Natural Language Proof Checking [slides]
15:30 - 16:00 Adrian De Lon Scaling a natural proof assistant step by step [slides]
16:00 - 16:30   Coffee Break
    Session 11: Paper presentation (Chair: Roussanka Loukanova)
16:30 - 17:00 Bernhard Schröder Building small worlds in mathematical texts by formal and informal means [slides]
17:00 - 17:30 Peter Koepke, Mateusz Marcol and Patrick Schäfer Formalizing Sets, Numbers, and some “Wiedijk Theorems” in Naproche
17:30 - 18:00 Adrian De Lon Naproche Demo

Joint Session (Thursday, September 7th)

Time Speaker Title
    Session 16B: Invited talk & Paper presentation (Chair: Angeliki Koutsoukou-Argyraki)
14:00 - 15:00 William Farmer Invited Talk Little Theories: A Method for Organizing Mathematical Knowledge Illustrated Using Alonzo, a Practice-Oriented Version of Simple Type Theory [slides]
15:00 - 15:30 Lucy Horowitz MathGloss: Linked Undergraduate Math Concepts [slides]
15:30 - 16:00 Marcel Schütz Foundational libraries in Naproche
16:00 - 16:30   Coffee Break
    Session 17: Paper presentation (Chair: Peter Koepke)
16:30 - 17:00 David Berry Categorical Proofs are Natural Proofs [slides]
17:00 - 18:00   Panel Discussion

Workshop on Libraries of Formal Proofs and Natural Mathematical Language (Friday, September 8th)

Time Speaker Title
    Session 22: Invited talk & Paper presentation (Chair: TBA)
14:00 - 15:00 Kyle Miller Invited Talk Informalizing formalized mathematics using the Lean theorem prover [slides]
15:00 - 15:30 Katie Collins CheckMate: an adaptable prototype platform for humans to interact with and evaluate LLMs [slides]
15:30 - 16:00 Farah Al Wardani Formal Reasoning using Distributed Assertions [slides]
16:00 - 16:30   Coffee Break
    Session 23: Paper presentation (Chair: TBA)
16:30 - 17:00 Anand Rao Tadipatri LeanAIde - A statement autoformalisation tool for the Lean theorem prover [slides]
17:00 - 17:30 Adrian Groza Solving logical puzzles with ChatGPT [slides]
17:30 - 18:00 Claudio Sacerdoti Coen An indexer and query language for libraries written in LambdaPi/Dedukti


Roussanka Loukanova - Rendering Natural Language of Mathematical Texts into Formal Language (Invited Talk) [slides]

The language of mathematical texts, e.g., descriptions, definitions, statements, and proofs, is typically a specialised natural language, suitably interleaved with mathematical symbols and expressions. Fragments of mathematical language are carefully written, e.g., by:

  1. using grammatically well-formed expressions with respect to the major syntactic categories of natural language: lexemes, phrases, and sentences

  2. targeting unambiguous expressions, by eliminating or minimising the ambiguities that are abundant in natural language.

Such factors demonstrate that computational grammar, including syntax and semantics, is essential for the processing of mathematical texts. Among the goals is to provide interconnections with verification and proof systems, and integration with machine learning.

Computational syntax of natural language has been an active area since the 1950s, with significant achievements in computational grammars. Computational semantics was introduced via methods of mathematical logic in the 1970s. It continues to be an open, active area of development of specialised formal languages, logics, and type theories. For full adequacy, computational grammar targets coverage of syntax-semantics interfaces, parsing natural language in order to render it into formal language. A suitable language of logic can render mathematical texts into formal expressions, which can be used in computational interfaces with verification and proof systems, as well as with machine learning.

In the talk, I present a compositional interface in computational grammar for rendering the natural language of mathematics into terms of the formal language of acyclic recursion (LAR), by using its type theory, calculi, and syntax-semantics interfaces.

I introduce the technique for a class of expressions involving quantifiers, definite descriptors, and relative clauses in noun phrases (NPs), which are integral components of mathematical texts. I show the roles of logic operators in sentences and in other phrasal constructs.

I employ Generalised Constraint-Based Lexicalized Grammar (GCBLG) of natural language, which represents major, common syntactic characteristics of a variety of approaches to computational grammar, e.g., Head-Driven Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Categorial Grammar (CG), and Grammatical Framework (GF).

The LAR terms of mathematical language can serve several major tasks:

(1) LAR provides computational semantics of the language of mathematics, via the syntax-semantics of LAR. LAR has two layers of semantics:

(a) denotational semantics, i.e., den(A), for every LAR term A;

(b) algorithmic semantics: the calculi of LAR determine the algorithmic process, alg(A), of every meaningful LAR term A, for computing its denotation den(A).

(2) LAR provides facility for representing major semantic ambiguities and underspecification, at the object level of its formal language, without recourse to meta-language variables. Specific semantic representations can be obtained by instantiations of underspecified LAR terms, in context.

(3) LAR terms can provide input into verification and proof systems (open tasks)

(4) the output of verification and proof systems can provide inferences to be rendered back into natural language of mathematical texts (open tasks).
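As a purely illustrative sketch (my own notation, not an example from the talk), a Moschovakis-style recursion term with a where-clause makes the two semantic layers visible:

```latex
% Illustrative only: a recursion term in the style of Moschovakis' acyclic
% recursion; the rendering below is an assumption, not the talk's own example.
% A phrase such as "the sum of x and y is even" might render as
\mathit{even}(s)\ \text{where}\ \{\, s := \mathit{sum}(x, y) \,\}
% den(A) is the resulting truth value; alg(A) is the staged computation
% that first evaluates s and then applies even to it.
```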



Peter Koepke, Mateusz Marcol and Patrick Schäfer - Formalizing Sets, Numbers, and some “Wiedijk Theorems” in Naproche

Freek Wiedijk’s webpage “Formalizing 100 Theorems” [1] is sometimes considered a benchmark for formalizations in interactive proof systems. It also provides immediate comparisons between proof styles in different systems.

In our talk we present a formalization in the Naproche Natural Proof Checker [2, 3] of sets, numbers and about 10 theorems from Wiedijk’s list. The formalization is part of the Isabelle 2023 release together with a pdf-printout. Naproche formalizations aim at natural mathematical readability: texts are written in a LaTeX source format that compiles to pdf documents which read like standard mathematics. Strong automatic theorem provers are employed to achieve textbook-like proof granularities.

Naproche’s input language ForTheL [4] (for Formula Theory Language) is a controlled natural language based on patterns of tokens. The Naproche system parses an input text into an incomplete proof tree of first-order statements, where leaves correspond to axiomatic assumptions, and other nodes are supposed to follow logically from premises higher up in the tree. The system reads through the proof tree and checks statements for type correctness and provability from their premises. These checks are mainly done by external ATPs like E or Vampire. In line with the standard foundations of mathematics in first-order logic and set theory, linguistic types like “natural number” or “function” are interpreted as type guards. Sets (and classes) can be formed by abstraction terms, obeying set existence axioms.
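As a schematic illustration of what such an input text looks like, consider the following fragment; the environment names are assumptions modelled on the .ftl.tex format, not verbatim from the Naproche distribution:

```latex
% Schematic ForTheL-style fragment; environment names are assumptions
% modelled on Naproche's .ftl.tex format, not verbatim syntax.
\begin{forthel}
  \begin{signature}
    A natural number is an object.
  \end{signature}
  \begin{theorem}
    Every natural number is a natural number.
  \end{theorem}
\end{forthel}
```

The parser turns patterns of tokens like “natural number” into first-order type guards, and the theorem becomes a proof task for the external ATPs.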

Naproche is a component of the Isabelle prover platform [5]. After downloading and installing Isabelle, Naproche is immediately available and will automatically check .ftl.tex files opened in the Isabelle editor. Isabelle/Naproche ships with a couple of example files ranging from small demos to larger mathematical developments. The present formalization is included as the example file sets_and_numbers.ftl.tex.

In the formalization, set theory is developed up to the well-known theorems of Cantor and Cantor-Schröder-Bernstein which are #63 and #25 on Wiedijk’s list. The standard number systems are introduced “top-down”: we postulate field axioms for the real numbers and then introduce the subsystems of rational, integer and natural numbers. Interestingly, the principle of mathematical induction (#74) follows from the supremum principle for real numbers. Using induction we prove the summation formulas for finite geometric (#66) and arithmetic series (#68), and the number of subsets of finite sets (#52). After the introduction of division and primes in the natural numbers we prove the correctness of Euclid’s algorithm (#69), Bezout’s theorem (#60), the irrationality of \sqrt{2} (#1) and the infinitude of primes (#11). We conclude with the denumerability of the rational numbers (#3) and the uncountability of the real continuum (#22). Overall, the formalization could be used in a seminar which teaches natural and formal proving with Naproche whilst recapitulating basic and interesting facts about sets and numbers.
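For readers who want a conventional-programming counterpart to two of the results mentioned above, here is a hedged Python sketch of Euclid’s algorithm (#69) together with Bezout coefficients (#60); this is an illustration of the underlying mathematics, not the Naproche formalization itself:

```python
# Illustrative Python counterpart (not the Naproche formalization) of two
# results mentioned above: Euclid's algorithm (#69) and Bezout's theorem (#60).

def extended_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Return (g, x, y) with g = gcd(a, b) and g = a*x + b*y."""
    if b == 0:
        return (a, 1, 0)
    g, x, y = extended_gcd(b, a % b)
    # gcd(a, b) = gcd(b, a mod b); rearrange the Bezout coefficients.
    return (g, y, x - (a // b) * y)

g, x, y = extended_gcd(252, 105)
print(g, x, y)  # prints "21 -2 5", and 252*(-2) + 105*5 == 21
```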

  • [1]
  • [2]
  • [3] Adrian De Lon, Peter Koepke, Anton Lorenzen, Adrian Marti, Marcel Schütz, Makarius Wenzel: The Isabelle/Naproche Natural Language Proof Assistant. In: Automated Deduction – CADE 28: 28th International Conference on Automated Deduction (Proceedings), 2021, pp. 614–624.
  • [4] Andrei Paskevich: The syntax and semantics of the ForTheL language. 2007. URL:
  • [5]

Adrian De Lon - Scaling a natural proof assistant step by step [slides]

The Naproche proof assistant checks formalizations written in controlled natural language and fills in gaps in proofs with the help of automated provers such as E and Vampire.

Naproche aims to lower the barriers to entry for mathematicians by modelling natural language and informal proof structures, re-using familiar LaTeX syntax, and emphasizing proof automation. It is possible to formalize nontrivial mathematics in Naproche, and students have successfully completed formalizations covering parts of analysis, linear algebra, representation theory, set theory, category theory, and more. However, the current implementation struggles to scale beyond formalizations that span a few chapters of a textbook. The following are some of the issues we are facing:

  • Check times. Long formalizations take hours to check. Small inefficiencies, increasingly difficult tasks for automated provers, and lack of parallelism add up to a painful experience even on powerful PCs.
  • Libraries. These general performance issues, combined with Naproche’s limited import mechanism (similar to an include directive), make it difficult to develop more interconnected formalizations or a standard library that other formalizations can build on.
  • Structures. Naproche has no features dedicated to mathematical structures. This means that users have to set up structures themselves. Moreover, dealing with notation becomes cumbersome, such as having to explicitly annotate which structure an operation belongs to.
  • Surface area for user error. It is still too easy for users of Naproche to set themselves up for problems. Some parts of formalizations, such as some top-level functions, have to be formalized using axioms, which can lead to accidental contradictions. The grammar of the controlled language is in parts overly permissive, leading to unintended results.
  • Stability of formalizations. Adding new theorems in the middle of a formalization (or adding imports) adds additional premises to subsequent tasks exported to automated provers. This makes it more difficult for automated provers to find proofs, which then can break existing proofs in Naproche. When flying too close to the resource limits given to automated provers, formalizations become flaky or fail to be reproducible on other computers.

To address the issues described above, I’m working on a new implementation of Naproche. The new proof assistant is developed in tandem with a more interconnected and modular “standard library” of mostly undergraduate material, which shall serve as a foundation for future formalizations. Using parallelism and premise selection, the new proof assistant is currently roughly ten times faster.

In my talk I will present my current progress and discuss the development of the new proof assistant and its standard library. I will talk about how some of the issues described above are mitigated in the new implementation. I will also talk about the following:

  • Initial user impressions. How does the formalization process compare to the previous implementation? Are there any compromises?
  • Library design. Formalizing a standard library both with readability for humans and granularity for automated proofs in mind.
  • Premise selection. Arguably the biggest contributor to long checking times is the increasing difficulty of the tasks sent to automated provers. Premise selection lets automated provers remain effective in large background theories.
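The premise-selection idea in the last bullet can be sketched as follows; this is a generic symbol-overlap heuristic of my own, offered as an assumption about the family of techniques, not Naproche’s actual algorithm:

```python
# Minimal sketch of symbol-overlap premise selection (a generic heuristic,
# not Naproche's actual algorithm): rank available premises by how many
# symbols they share with the goal, and hand only the top k to the ATP.

import re

def symbols(formula: str) -> set[str]:
    """Crude tokenizer: treat alphanumeric words as the formula's symbols."""
    return set(re.findall(r"[A-Za-z_]\w*", formula))

def select_premises(goal: str, premises: list[str], k: int = 2) -> list[str]:
    goal_syms = symbols(goal)
    scored = sorted(premises,
                    key=lambda p: len(symbols(p) & goal_syms),
                    reverse=True)
    return scored[:k]

premises = [
    "subset(A, B) & subset(B, C) -> subset(A, C)",
    "prime(p) -> p > 1",
    "subset(A, A)",
]
# The prime lemma shares no symbols with the goal, so it is filtered out.
print(select_premises("subset(X, X)", premises))
```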

Bernhard Schröder - Building small worlds in mathematical texts by formal and informal means [slides]

One aspect of naturalness in natural language proof texts is the limitation of objects to a cognitively manageable number, or their grouping into a limited number of aggregates, in each part of a proof. More precisely, we should talk about “reference markers” instead of “objects”, abstracting from the ontological structure we are referring to. Here, the term “reference marker” is used in the sense that any (generalized) quantifier introduces one or two new reference markers which can be referenced in the following text. In this sense, when talking about “every group g”, only one new reference marker g and - if not yet introduced into the discourse - a discourse marker for the set of groups are added to the discourse.

Common means of grouping discourse markers in formal notations of natural language proof texts are typographical devices and diacritics; cf. the notations for a straight line through the points A and B, a ray starting at A and running through B, a line segment between A and B, and its length. Or consider how functions and their derivatives and antiderivatives are usually written.

Another technique for grouping related objects is the use of ambiguous, particularly polysemous, expressions whose readings are related metonymically. Polysemy is an ambiguity in which the different meanings of an expression are conceptually related, like “height” meaning both a dimension of measurement and a geographical place. A large portion of polysemous readings are based on metonymy. So, e.g., “school” can be used in the same text to refer to an institution or to the building which hosts that institution. The two meanings are ontologically different, but are connected by a “contiguous” relation (cf. Panther/Radden eds. 1999), here the place where an institution is situated. Most cases of metonymy can be explained by a small set of contiguity relations. Language users commonly become aware of such cases of polysemy only in very few contexts.

In a similar metonymical fashion, mathematical terms referring to the results of arithmetical operations, e.g. “sum” or “product” (cf. (1)), can also refer to the expressions of the arithmetical operations, or to various levels of abstraction from those expressions (e.g. abstracting from commutative or associative transformations of the expressions), cf. (2).

(1) The sum of x and y is not divisible by 3.

(2) r is the square-free part of the product of the numbers in M.

Mathematical notation is sometimes designed to address these levels of abstraction e.g. by leaving open the exact associative structure of an expression, cf.

(3) a+b+c

Another source of polysemy is the implicit introduction of plural entities, i.e., entities like sets, sequences, etc. comprising more than one member (see Nouwen 2015, Cramer/Schröder 2012, and Schröder in print). In a sentence like

(4) a, b and c are P.

the predicate P can be applied to a, b and c individually (distributive interpretation) or to a plural entity formed by a, b and c (collective interpretation), e.g. the set {a,b,c}. Even more interpretations become possible if further quantifications come into play.

(5) For all i, a_i, b_i and c_i are P.

could have a purely distributive interpretation, but also a collective interpretation for the plural entities constituted by a_i, b_i and c_i for each i, and collective interpretations for plural entities like {a_i, b_i, c_i | i} or {{a_i, b_i, c_i} | i}. In many proof texts the natural language mechanism of implicitly forming plural entities is used very effectively.
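In standard semantic notation (mine, not the author’s), the two basic readings of (4) can be spelled out as:

```latex
% Distributive reading of (4): P holds of each of a, b, c individually.
P(a) \land P(b) \land P(c)
% Collective reading of (4): P holds of one plural entity, e.g. the set.
P(\{a, b, c\})
```

The additional readings of (5) arise from taking either interpretation inside or outside the scope of the quantifier over i.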

In my contribution I will argue that the notational means to group reference markers and the metonymical uses of mathematical terms are essential for keeping natural language mathematical proofs readable and natural.


  • Carl, Merlin/Cramer, Marcos/Fisseni, Bernhard/Sarikaya, Deniz/Schröder, Bernhard (2021). How to Frame Understanding in Mathematics: A Case Study Using Extremal Proofs. In: Axiomathes 31, 649–676. DOI: 10.1007/s10516-021-09552-9.
  • Cramer, Marcos (2013): Proof-checking mathematical texts in controlled natural language. PhD Thesis. Rheinische Friedrich-Wilhelms-Universität Bonn. (accessed on 1 Jun 2023).
  • Cramer, Marcos/Fisseni, Bernhard/Koepke, Peter/Kühlwein, Daniel/Schröder, Bernhard/Veldman, Jip (2010): The Naproche Project. Controlled Natural Language Proof Checking of Mathematical Texts. In: Fuchs, Norbert E. (Ed.): Controlled Natural Language. CNL 2009. (= Lecture Notes in Computer Science 5972). Berlin, Heidelberg: Springer, 170-186. DOI: 10.1007/978-3-642-14418-9_11.
  • Cramer, Marcos/Schröder, Bernhard (2012). Interpreting Plurals in the Naproche CNL. In: Rosner, Michael/Fuchs, Norbert E. (Eds.): Controlled Natural Language. CNL 2010. (= Lecture Notes in Computer Science 7175). Berlin, Heidelberg: Springer, S. 43-52. DOI: 10.1007/978-3-642-31175-8_3.
  • Fisseni, Bernhard/Sarikaya, Deniz/Schmitt, Martin/Schröder, Bernhard (2019): How to frame a mathematician. In: Centrone, Stefania/Sarikaya, Deniz/Kant, Deborah (Eds.): Reflections on the foundation of mathematics. Heidelberg: Springer, 415-434.
  • Nouwen, R. (2015): Plurality. In Dekker, Paul/Aloni, Maria (Eds.): Cambridge handbook of semantics. Cambridge: Cambridge University Press.
  • Panther, Klaus-Uwe/Radden, Günter (Eds., 1999): Metonymy in Language and Thought. Amsterdam: John Benjamins Publishing.
  • Schröder, Bernhard (in print): Induktiv oder intuitiv? Die Gewinnung von Frames aus mathematischen Beweistexten

Peter Koepke - Experiences with Natural Language Proof Checking [slides]

The Naproche system (for Natural Proof Checking) [1, 2] interactively proof-checks natural language mathematical texts whilst they are being edited. Input texts are written in the controlled natural language ForTheL (for Formula Theory Language) [3], which is intended to approximate natural mathematical language. Since ForTheL uses a LaTeX format, high-quality typesetting of formalizations is immediately available. Over the last years a number of theorems from non-trivial university mathematics have been formalized in Naproche. Naproche is distributed as part of the Isabelle prover platform, which also includes a selection of Naproche formalizations. Some of these texts closely resemble material that could be found in mathematics textbooks.

Although Naproche has made it possible to achieve natural mathematical readability for strictly proof-checked formalizations, the formalization process has been slow and arduous. Users not only have to write a logically correct formal document for the intended mathematical content; at the same time they have to struggle with the weaknesses of ATPs in dealing with implicit proof tasks, they have to observe the rules of LaTeX and ForTheL, and ideally they should aim for natural readability and mathematical elegance of the typeset document. These simultaneous requirements become even more problematic as formalizations grow longer, sometimes comprising twenty or more typeset pages, so that the automated background provers have difficulty resolving proof tasks within seconds, which is incompatible with productive interactive work.

In my talk I shall illustrate and analyse the present state of affairs and suggest ways in which natural proof checking could be made usable for a wider mathematical public and for more advanced mathematics. I shall discuss several possible ways forward:

A) Invest massively in improving all components of the current ℕaproche system: e.g., with an editor like TeXmacs [4] users could directly work on typeset mathematics; comprehensive caching of proofs should accelerate proving times by an order of magnitude; tune or train external provers for the specific proof tasks generated by ℕaproche; use standard type-checking routines for some ℕaproche type-checking, etc.

B) Program a new natural system that circumvents weaknesses and complications of ℕaproche; at the moment, Adrian De Lon is working towards a novel set-theory based natural proof assistant.

C) Combine the natural language approach of ℕaproche with established systems. Such a project may have some similarities with the introduction of the Mizar-type proof language Isar in Isabelle/HOL [5].

  • [1]
  • [2] Adrian De Lon, Peter Koepke, Anton Lorenzen, Adrian Marti, Marcel Schütz, Makarius Wenzel: The Isabelle/Naproche Natural Language Proof Assistant. In: Automated Deduction – CADE 28: 28th International Conference on Automated Deduction (Proceedings), 2021, pp. 614–624.
  • [3] Andrei Paskevich: The syntax and semantics of the ForTheL language. 2007. URL:
  • [4]
  • [5]

William Farmer - Little Theories: A Method for Organizing Mathematical Knowledge Illustrated Using Alonzo, a Practice-Oriented Version of Simple Type Theory (Invited Talk) [slides]

An attractive and powerful method for organizing mathematical knowledge is the little theories method [5]. A body of mathematical knowledge is represented in the form of a theory graph [7, 8] consisting of theories as nodes and theory morphisms as directed edges. Each mathematical topic is developed in the “little theory” in the theory graph that has the most convenient level of abstraction and the most convenient vocabulary. Then the definitions and theorems produced in the development are transported, as needed, from this abstract theory to other, usually more concrete, theories in the graph via the theory morphisms in the graph.
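The transport step can be pictured with a deliberately naive sketch. All names below are invented, and the morphisms act by plain string renaming, whereas real theory morphisms act on well-typed expressions; the point is only the shape of the data: theories as nodes, renamings along edges, and theorems copied from abstract to concrete theories.

```python
# Illustrative sketch (not Alonzo's actual machinery): a theory graph with
# theories as nodes and theory morphisms as symbol renamings along edges.
theories = {
    "Monoid": {"symbols": {"op", "e"}, "theorems": ["op(x, e) = x"]},
    "NatAdd": {"symbols": {"+", "0"}, "theorems": []},
}

# A morphism maps the abstract vocabulary to a concrete one.
morphisms = {("Monoid", "NatAdd"): {"op(": "+(", "e": "0"}}

def transport(theorem, source, target):
    """Transport a theorem along the morphism from source to target."""
    rename = morphisms[(source, target)]
    for old, new in rename.items():
        theorem = theorem.replace(old, new)
    theories[target]["theorems"].append(theorem)
    return theorem

transport("op(x, e) = x", "Monoid", "NatAdd")  # yields "+(x, 0) = x"
```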

Alonzo [4] is a practice-oriented classical higher-order logic. Named in honor of Alonzo Church, Alonzo is based on Church’s type theory [1], Church’s formulation of simple type theory [3]. Unlike traditional predicate logics, Alonzo admits partial functions and undefined expressions in accordance with the approach employed in mathematical practice that we call the traditional approach to undefinedness [2]. Alonzo has a simple syntax with a formal notation for machines and a compact notation for humans that closely resembles the notation found in mathematical practice. It has two semantics, one for mathematics based on standard models and one for logic based on Henkin-style general models [6]. By virtue of its syntax and semantics, Alonzo is exceptionally well suited for expressing and reasoning about mathematical ideas in a natural manner. Alonzo is also equipped with a set of mathematical knowledge modules for constructing theories and theory morphisms and for transporting definitions and theorems from one theory to another via theory morphisms.

This talk illustrates the little theories method by showing how to formalize a sample body of mathematical knowledge in Alonzo as a theory graph using Alonzo’s mathematical knowledge modules. We have chosen monoid theory as the sample body of mathematical knowledge. A monoid is an algebraic structure consisting of a nonempty set, a total, associative binary function on the set, and a member of the set that is an identity element with respect to the function. Monoids are abundant in mathematics and computing. Our formalization of monoid theory in Alonzo consists of a theory graph containing 12 theories and 24 theory morphisms. Instead of the standard certification-oriented approach to formal mathematics, we follow a communication-oriented approach that (1) is fully formal except that proofs are written in a traditional (informal) style and (2) prioritizes the communication of mathematical ideas over the formal certification of mathematical results.


  • [1] A. Church. A formulation of the simple theory of types. Journal of Symbolic Logic, 5:56–68, 1940.
  • [2] W. M. Farmer. Formalizing undefinedness arising in calculus. In D. A. Basin and M. Rusinowitch, editors, Automated Reasoning — IJCAR 2004, volume 3097 of Lecture Notes in Computer Science, pages 475–489. Springer, 2004.
  • [3] W. M. Farmer. The seven virtues of simple type theory. Journal of Applied Logic, 6:267–286, 2008.
  • [4] W. M. Farmer. Simple Type Theory: A Practical Logic for Expressing and Reasoning About Mathematical Ideas. Computer Science Foundations and Applied Logic. Birkhäuser, 2023.
  • [5] W. M. Farmer, J. D. Guttman, and F. J. Thayer. Little theories. In D. Kapur, editor, Automated Deduction — CADE-11, volume 607 of Lecture Notes in Computer Science, pages 567–581. Springer, 1992.
  • [6] L. Henkin. Completeness in the theory of types. Journal of Symbolic Logic, 15:81–91, 1950.
  • [7] M. Kohlhase. Mathematical knowledge management: Transcending the one-brain-barrier with theory graphs. European Mathematical Society (EMS) Newsletter, 92:22–27, June 2014.
  • [8] M. Kohlhase, F. Rabe, and V. Zholudev. Towards MKM in the large: Modular representation and scalable software architecture. In S. Autexier, J. Calmet, D. Delahaye, P. D. F. Ion, L. Rideau, R. Rioboo, and A. P. Sexton, editors, Intelligent Computer Mathematics, volume 6167 of Lecture Notes in Computer Science, pages 370–384. Springer, 2010.

Lucy Horowitz - MathGloss: Linked Undergraduate Math Concepts [slides]

Mathematics is made up of a vast web of interrelated concepts and definitions. To successfully formalize a proof, a mathematician must first possess a deep understanding of the concepts involved. As mathematical knowledge continues to expand in both breadth and depth, it is becoming increasingly challenging for individual mathematicians to be fluent in multiple areas of mathematics. MathGloss seeks to partially address this issue by creating a unified framework for organizing and accessing mathematical concepts, enabling individuals to effortlessly navigate and comprehend them.

Concepts in mathematics are delineated by their definitions, and yet these definitions may take different forms depending on the context in which they are used. Notably, definitions in the proof assistant Lean look very different from the definitions found in undergraduate math textbooks. By gathering and connecting different definitions of the same concept from different sources, we hope to enable both novice and seasoned mathematicians to unify their understanding of basic concepts with an eye to formalization. This is done from several corpora as follows:

From “Chicago Notes,” we have more than 500 definitions curated by an undergraduate student from the courses she took at the University of Chicago.

From “France,” there are approximately 300 terms from the French undergraduate curriculum in mathematics as translated by the Lean Community [1]. Some of these terms have direct links to items in Lean, and these links are included where they already exist.

From “Maths Dictionary,” there are 246 terms linked manually both to Wikidata and a French-language translation by Tim Hosgood [2].

From “nLab,” there are approximately 5,000 terms from page titles of the online mathematics wiki nLab [3], with people, books, and other non-concepts removed as best we are able. These terms are linked to their pages in the nLab.

The collected terms are mapped to Wikidata [4], the database that serves Wikipedia and many other knowledge projects, using a tool called wikimapper [5]. This enables us to match concepts appearing in several of our corpora and achieve our goal of combining different sources of information about the same objects. While some terms from the corpora do not yet have representations in Wikidata, the majority do, and we aim to map as many of them as possible. In short, we combine mathematical concepts with the natural mathematical language describing them, with their representations in a well-established knowledge ontology, and with their representations in Lean.
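The merging step can be sketched as follows. This is a hypothetical toy, not MathGloss code: the Wikidata lookup is a stub standing in for wikimapper, and all terms and identifiers below (including the placeholder QIDs "Q1", "Q2") are invented.

```python
# Hypothetical sketch of the matching step: each corpus contributes terms,
# and a Wikidata lookup (here a stub standing in for wikimapper) yields
# identifiers used to merge entries across corpora. All data are invented.
def to_qid(term):
    stub = {"monoid": "Q1", "monoïde": "Q1", "group": "Q2"}
    return stub.get(term.lower())

corpora = {
    "Chicago Notes": ["Monoid", "Group"],
    "France": ["Monoïde"],
}

merged = {}
for corpus, terms in corpora.items():
    for term in terms:
        qid = to_qid(term)
        if qid is not None:  # terms without a Wikidata entry stay unmerged
            merged.setdefault(qid, []).append((corpus, term))

# merged["Q1"] now links the English and French entries for "monoid".
```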

The database can be found at It is intended to be updated easily to include definitions from other sources like the Encyclopedia of Mathematics, Wolfram MathWorld, or PlanetMath. These also have glossaries, but MathGloss is unique in that it provides an overarching view of multiple sources.

Such organization of information is essential if mathematics is to embrace formalization as a useful tool for practice and for education [6]. Given that a proof assistant will “complain that it does not understand the definitions, axioms, or reasoning steps entered by the mathematician,” MathGloss should prove useful for bridging the gap between brain and computer.

  • [1]
  • [2]
  • [3]
  • [4]
  • [5]
  • [6]

Marcel Schütz - Foundational libraries in Naproche

The latest release of the natural language proof assistant Naproche [1], developed at the University of Bonn, Germany, and shipped as a component of the current Isabelle release [2,3], comprises new foundational libraries on set theory and number theory.

The Naproche (“Natural Proof Checking”) proof assistant is being developed for a high degree of naturalness of accepted proof texts, written in the controlled natural language ForTheL (“Formula Theory Language”). ForTheL is designed to approximate the mathematical vernacular, while at the same time being a completely formal language which allows its translation into formal logics [4]. Naproche has a built-in softly typed ontology that can be regarded as a weak fragment of Kelley–Morse set theory. While this ontology can quite flexibly be extended by a user, it does not provide a sufficient basis on its own to formalize sophisticated mathematics. To overcome this issue, Naproche also provides three foundational libraries about set and number theory on top of its built-in ontology [5]:

  1. One library that provides basic set-theoretical notions and operations as well as the most commonly used Kelley–Morse axioms [6].

  2. One that extends the first library with more advanced notions from set theory like that of ordinal and cardinal numbers [7].

  3. One that extends the first library with a new (soft) type of natural numbers together with common arithmetical operations as well as some basic notions of number theory [8].
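For illustration, a ForTheL fragment in the LaTeX dialect might look roughly like this. The example is our own toy text, not taken from the libraries, and the exact environment names may differ between Naproche versions:

```latex
\begin{forthel}
  \begin{theorem}
    Every element of the empty set is an element of every set.
  \end{theorem}
  \begin{proof}
    The empty set has no elements. Hence the thesis.
  \end{proof}
\end{forthel}
```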

Since these libraries are written in a LaTeX dialect of ForTheL, they not only exhibit a high degree of human readability but can also be converted directly to PDF. Moreover, their applicability as practically usable foundations for more advanced formalizations in Naproche is demonstrated by using them as the basis for formalizations of some well-known theorems from set and number theory that ship with Naproche, e.g. the Cantor–Schröder–Bernstein theorem [9,10] and Furstenberg’s proof of the infinitude of primes [11,12]. However, the libraries still pose some serious problems that have to be addressed in future versions of Naproche:

  1. Modularity: Since Naproche lacks a mature module system for organizing formalizations, the reusability of these libraries is currently fairly limited.

  2. Scalability: The proof checking abilities of Naproche do not scale well with the size of the libraries.

  3. Time redundancy: Currently Naproche has no mechanism for providing persistent certificates of the correctness of its libraries, so they are rechecked by the system every time they are imported into a ForTheL document.

  4. Abstract mathematical structures: Naproche’s current abilities to handle abstract mathematical structures (such as groups or topological spaces), in contrast to “concrete” ones like “the” universe of sets or “the” structure of natural numbers, are very limited. This calls for an extension of ForTheL that can handle such abstract structures using language constructs that do not depart noticeably from the mathematical vernacular.

Hence, although Naproche appears to be a promising basis for dealing with human-readable, formal mathematical libraries, there is much room for further improvement of its capabilities using methods developed in the ‘intelligent computer mathematics’ community.


  • [1]
  • [2]
  • [3] De Lon, A., Koepke, P., Lorenzen, A., Marti, A., Schütz, M., Wenzel, M. (2021). The Isabelle/Naproche Natural Language Proof Assistant. In: Platzer, A., Sutcliffe, G. (eds) Automated Deduction – CADE 28. CADE 2021. Lecture Notes in Computer Science, vol 12699. Springer, Cham.
  • [4] De Lon, A., Koepke, P., Lorenzen, A., Marti, A., Schütz, M., Sturzenhecker, E. (2021). Beautiful Formalizations in Isabelle/Naproche. In: Kamareddine, F., Sacerdoti Coen, C. (eds) Intelligent Computer Mathematics. CICM 2021. Lecture Notes in Computer Science, vol 12833. Springer, Cham.
  • [5] Schütz, M., De Lon, A., Koepke, P. (2022). Setting up Set-Theoretical Foundations in Naproche. In: Work-in-progress papers presented at the 15th Conference on Intelligent Computer Mathematics, CICM 2022 (Informal Proceedings).
  • [6]
  • [7]
  • [8]
  • [9]
  • [10] Schröder, B.S.W. The fixed point property for ordered sets. In: Arab. J. Math. 1, 529–547 (2012).
  • [11]
  • [12] Furstenberg, H. On the Infinitude of Primes. In: American Mathematical Monthly 62 (1955), p. 353.

David Berry - Categorical Proofs are Natural Proofs [slides]

Proving the details required for formal proofs in Category Theory is often tedious. This is made worse when using E-Category Theory, as is often found in proof-assistant libraries for category theory today [1]: a destination often called “Setoid Hell”. It is even worse when using P-Category Theory [2] (a variant of E-Category Theory where Partial Equivalence Relations are used), which is required for our work: a new destination of “Subsetoid Hell”. This short talk will present how structuring the proof of the presheaf exponential in a more categorical fashion allows some of this tedium to be reduced (although not completely eliminated). This also results in a fuller library of categorical results, shedding new light on P-category theory and allowing for improved future development. However, the grass is not completely greener when proofs are more categorical: the proof terms are often worse and less computationally efficient. Techniques to overcome this seem to force a mild return to Hell.

All the work presented is formalized in the Coq Proof Assistant. This is a crucial aspect of the research programme where constructiveness and computability of the formalized proof is a motivating property.

  • [1] Jason Z. S. Hu and Jacques Carette. 2021. Formalizing Category Theory in Agda. In Proceedings of the 10th ACM SIGPLAN International Conference on Certified Programs and Proofs (Virtual, Denmark) (CPP 2021). Association for Computing Machinery, New York, NY, USA, 327–342.
  • [2] Djordje Čubrić, Peter Dybjer, and Philip Scott. 1998. Normalization and the Yoneda Embedding. Mathematical Structures in Computer Science 8, 2 (April 1998), 153–192.

Kyle Miller - Informalizing formalized mathematics using the Lean theorem prover (Invited Talk) [slides]

One of the applications of interactive theorem provers in pure mathematics is being able to produce machine-verified formal proofs. I will talk about a less-obvious application, which is using formalized mathematics to author interactive informal expositions. I will demonstrate a prototype of an “auto-informalization” system written in Lean that presents the reader with an interface to view proofs at a desired level of detail. I will also discuss thoughts on the impact of such tools in mathematics. This is joint work with Patrick Massot.

Katie Collins - CheckMate: an adaptable prototype platform for humans to interact with and evaluate LLMs [slides]

The standard methodology of evaluating large language models (LLMs) based on static pairs of inputs and outputs is insufficient for developing assistants: this kind of assessment fails to take into account the essential interactive element in their deployment, and therefore limits how we understand language model capabilities. We introduce CheckMate, an adaptable prototype platform for humans to interact with and evaluate LLMs. We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics, with a mixed cohort of participants ranging from undergraduate students to professors of mathematics. We release the resulting interaction and rating dataset, MathConverse. By analysing MathConverse, we derive a preliminary taxonomy of human behaviours and uncover that, despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness in LLM generations, amongst other findings. Further, we identify useful scenarios and existing issues of GPT-4 in mathematical reasoning through a series of case studies contributed by expert mathematicians. We conclude with actionable takeaways for ML practitioners and mathematicians: models that communicate uncertainty, respond well to user corrections, and are more interpretable and concise may constitute better assistants; interactive evaluation is a promising way to continually chart the capabilities of these models; and humans should be aware of language models’ algebraic fallibility and accordingly discern where they should be used.

Farah Al Wardani - Formal Reasoning using Distributed Assertions [slides]

When a proof system checks a formal proof, we can say that its kernel asserts that the formula is a theorem in a particular logic. We describe a general framework in which such assertions can be made globally available so that any other proof assistant willing to trust the assertion’s creator can use that assertion without rechecking any associated formal proof. This framework, called DAMF (Distributed Assertion Management Framework), is heterogeneous and allows each participant to decide which tools and operators they are willing to trust in order to accept external assertions. This framework can also be integrated into existing proof systems by making minor changes to the input and output subsystems of the prover. DAMF achieves a high level of distributivity using such off-the-shelf technologies as IPFS (InterPlanetary File System) [1], IPLD (InterPlanetary Linked Data) [2], and public key cryptography. We illustrate the framework by describing an implemented tool for validating and publishing assertion objects and a modified version of the Abella theorem prover [3] that can use and publish such assertions.
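The shape of such an assertion can be sketched in a few lines. This is an invented illustration in the spirit of DAMF, not its actual format: field names and the "signature" are stubs, whereas DAMF itself uses IPLD objects addressed via IPFS and real public-key cryptography.

```python
import hashlib, json

# Illustrative sketch only: a content-addressed "assertion" record.
def cid(obj):
    """A stand-in content identifier: hash of a canonical JSON encoding."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return "stub-cid-" + hashlib.sha256(blob).hexdigest()[:16]

statement = {"language": "stub-logic", "formula": "forall x, x = x"}
assertion = {
    # The claim refers to the statement by content identifier, not by copy.
    "claim": {"statement": cid(statement), "mode": "theorem"},
    # In DAMF this would be a public key whose signature consumers verify.
    "agent": "stub-public-key",
}

# A consumer that trusts "stub-public-key" may accept the claim without
# rechecking any associated formal proof; others can simply ignore it.
```

Because the identifier is derived from a canonical encoding, two parties serializing the same statement obtain the same reference, which is what makes globally shared assertions possible.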

On a higher level, we explore and develop a new dimension in the general quest for interoperability between proof systems and communities. Motivations for achieving interoperability include efficiency through reusing developments, avoiding unnecessary duplication, enhancing collaboration, cross-validation of theorems, and avoiding extreme centralization. Another reason is the appeal of modularity and composability in theories, systems, and logics. Some approaches attempt a solution between two specific systems [4] [5], which is not always feasible, mainly due to incompatibilities between proof languages and implementations. Others aim for universal interoperability, which is more challenging. The evidential tool bus [6] was proposed as a means to integrate specialized inference systems into a unified formal system. Approaches like Dedukti [7] and MMT [8] aim to create a framework in which other systems and logics can be expressed, thereby combining theorems and proofs produced by different systems in one trusted system. Program verification tools such as TLAPS [9] approach the problem from the opposite direction: an attempted proof is composed of different parts, and each part is delegated upfront to a separate proof system trusted by the main system. Standard formats and languages such as TPTP [10] have been developed as bridges between systems, and universal libraries have been proposed [11] [12].

In DAMF, we consider the natural existence of different communities, different approaches, different languages, etc. Hence, the starting criterion is to consider the diversity of logics, languages, and systems that are not necessarily completely compatible, mapped to each other, nor to a central system. We want to enable heterogeneous producers and consumers of formal proofs to communicate coherently in the light of differences, and thus we start by setting the base ground for such communication.

Publications, documentation, and source code related to DAMF can be found on the distributed-assertions website [13].


  • [1] What is IPFS?
  • [2] IPLD documentation.
  • [3] The Abella theorem prover.
  • [4] Armand, M., Faure, G., Grégoire, B., Keller, C., Théry, L., and Werner, B. A modular integration of SAT/SMT solvers to Coq through proof witnesses. In Certified Programs and Proofs (CPP 2011) (2011), J.-P. Jouannaud and Z. Shao, Eds., vol. 7086 of LNCS, pp. 135–150.
  • [5] Fontaine, P., Marion, J.-Y., Merz, S., Nieto, L. P., and Tiu, A. F. Expressiveness + automation + soundness: Towards combining SMT solvers and interactive proof assistants. In TACAS: Tools and Algorithms for the Construction and Analysis of Systems (2006), H. Hermanns and J. Palsberg, Eds., vol. 3920 of LNCS, Springer, pp. 167–181.
  • [6] Rushby, J. M. An evidential tool bus. In Formal Methods and Software Engineering, 7th International Conference on Formal Engineering Methods, ICFEM 2005, Manchester, UK, November 1-4, 2005, Proceedings (2005), K. Lau and R. Banach, Eds., vol. 3785 of LNCS, Springer, pp. 36–36.
  • [7] Assaf, A., Burel, G., Cauderlier, R., Delahaye, D., Dowek, G., Dubois, C., Gilbert, F., Halmagrand, P., Hermant, O., and Saillard, R. Dedukti: a logical framework based on the λΠ-calculus modulo theory, 2016.
  • [8] Rabe, F. The MMT Language and System., 2022.
  • [9] Chaudhuri, K., Doligez, D., Merz, S., and Lamport, L. The TLA+ proof system: Building a heterogeneous verification platform. In Proceedings of the 7th International Colloquium on Theoretical Aspects of Computing (ICTAC) (Natal, Rio Grande do Norte, Brazil, Sept. 2010), A. Cavalcanti, D. Déharbe, M.-C. Gaudel, and J. Woodcock, Eds., vol. 6256 of LNCS, Springer, p. 44.
  • [10] Sutcliffe, G. The TPTP Problem Library and Associated Infrastructure: The FOF and CNF Parts, v3.5.0. Journal of Automated Reasoning 43, 4 (2009), 337–362.
  • [11] Wiedijk, F. The QED manifesto revisited. Studies in Logic, Grammar and Rhetoric 10, 23 (2007), 121–133.
  • [12] Hales, T. Formal abstracts.
  • [13] Distributed Assertions website.

Anand Rao Tadipatri - LeanAIde - A statement autoformalisation tool for the Lean theorem prover [slides]

Contributors: Siddhartha Gadgil, Anand Rao Tadipatri, Ayush Agrawal, Ashvni Narayanan, Navin Goyal

We present LeanAIde, a tool to translate mathematical statements within Lean doc-strings into formal Lean code. The tool is integrated with the Lean 4 editor via code actions for ergonomic usage.

At its core, LeanAIde relies on the abilities of a large language model to translate between natural language and code. Further optimisations to the input and output of the language model are used to improve the success rate of translation. The prompt to the language model is customised to the statement at hand by gathering similar doc-strings from the Lean mathematical library (mathlib) using sentence embeddings. Through post-processing and filtering, which is done efficiently using Lean’s meta-programming capabilities, only the outputs corresponding to well-formed Lean expressions are retained. The tool is implemented entirely in Lean 4.
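The retrieval-and-filter pipeline described above can be sketched schematically. Everything here is a stub of our own invention: LeanAIde itself is implemented in Lean 4, uses learned sentence embeddings rather than bags of words, and checks well-formedness by elaborating candidate expressions rather than by string inspection.

```python
# Schematic sketch of the pipeline (all components stubbed and hypothetical).
def embed(text):
    """Toy 'embedding': bag of words. Real systems use learned embeddings."""
    return set(text.lower().split())

def similarity(a, b):
    """Jaccard similarity between two bag-of-words 'embeddings'."""
    return len(a & b) / max(len(a | b), 1)

# Invented stand-in for mathlib doc-string / theorem pairs.
library = [
    ("Every natural number is nonnegative.", "theorem nat_nonneg ..."),
    ("Addition of naturals is commutative.", "theorem add_comm ..."),
]

def nearest(statement, k=1):
    """Pick the k most similar doc-strings to include in the prompt."""
    scored = sorted(library,
                    key=lambda p: similarity(embed(statement), embed(p[0])),
                    reverse=True)
    return scored[:k]

def well_formed(candidate):
    """Stand-in for elaboration: keep only plausible Lean outputs."""
    return candidate.startswith("theorem")

prompt_examples = nearest("Addition of integers is commutative.")
candidates = ["theorem int_add_comm ...", "garbled output"]
survivors = [c for c in candidates if well_formed(c)]
```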

When used with OpenAI language models, the tool is able to formalise short mathematical statements at the undergraduate level with about 65% accuracy when tested on 120 theorem statements. The test dataset is split into normal, silly and false categories, where statements from the last two categories are highly unlikely to be a part of the language model’s training data and hence guard against contamination. The tool has also been benchmarked against the new ProofNet dataset for auto-formalisation.

The source code for LeanAIde is available online at


  • [1] Zhangir Azerbayev and Edward W. Ayers. lean-chat: user guide. Lean. 2023. url:
  • [2] Zhangir Azerbayev et al. ProofNet: Autoformalizing and Formally Proving Undergraduate-Level Mathematics. 2023. arXiv: 2302.12433 [cs.CL].
  • [3] The mathlib Community. “The Lean Mathematical Library”. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. CPP 2020. New Orleans, LA, USA: Association for Computing Machinery, 2020, 367–381. isbn: 9781450370974. doi:10.1145/3372885.3373824. url:
  • [4] Naman Jain et al. “Jigsaw: Large language models meet program synthesis”. In: Proceedings of the 44th International Conference on Software Engineering. 2022, pp. 1219–1231.
  • [5] Albert Q Jiang et al. “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs”. In: arXiv preprint arXiv:2210.12283 (2022).
  • [6] Leonardo de Moura and Sebastian Ullrich. “The lean 4 theorem prover and programming language”. In: Automated Deduction–CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings 28. Springer. 2021, pp. 625–635.
  • [7] Arvind Neelakantan et al. “Text and code embeddings by contrastive pre-training”. In: arXiv preprint arXiv:2201.10005 (2022).
  • [8] OpenAI. GPT-4 Technical Report. 2023. arXiv: 2303.08774 [cs.CL].
  • [9] Qingxiang Wang et al. “Exploration of neural machine translation in autoformalization of mathematics in Mizar”. In: Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs. 2020, pp. 85–98.
  • [10] Yuhuai Wu et al. “Autoformalization with large language models”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 32353–32368.

Adrian Groza - Solving logical puzzles with ChatGPT [slides]

I will show the limitations and merits of Large Language Models (LLMs) in solving logic puzzles. For the experiments I use the 144 puzzles from the library [1]. The library contains puzzles of various types, including arithmetic puzzles, logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling puzzles, grid puzzles, strange numbers, and self-reference puzzles. The correct solutions were obtained using Prover9 and Mace4, based on human modelling in equational first-order logic.

The aim is to investigate how many of these 144 puzzles are actually solved by LLMs. Some of these puzzles have graphical input that makes them difficult to textualise for ChatGPT, e.g. Futoshiki, Kakurasu, Takuzu, Kakuro. After eliminating these graphical puzzles, a set of 100 logical puzzles remains. Each puzzle was prompted to ChatGPT with “Solve this puzzle”. The results have been classified as follows. Only 7 puzzles were correctly solved and backed by correct reasoning. 1 puzzle was solved correctly, but only an incomplete justification was provided by the language model. For 9 puzzles the correct solution was provided, but backed by faulty justifications or reasoning chains; the correctness was actually a guess, since some of the puzzles ask whether a given sentence is true or false. Wrong solutions backed by faulty reasoning were provided for 70 puzzles. In addition to these wrongly solved puzzles, ChatGPT gave the following answers: claiming no solution exists when one does, claiming there is not enough information to solve the task, or providing wrong solutions without any justification at all. There was one case in which, instead of the exact solution, an inefficient but valid algorithm was provided. Of most interest might be the one puzzle for which ChatGPT admitted its inability to solve it.
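The counts reported above can be tallied directly; the residual bucket covers the remaining answer types (no-solution claims, not-enough-information claims, unjustified wrong solutions, the algorithm-only answer, and the admitted failure):

```python
# Tally of ChatGPT outcomes on the 100 textualisable puzzles (from the text).
outcomes = {
    "correct solution, correct reasoning": 7,
    "correct solution, incomplete justification": 1,
    "correct solution, faulty reasoning": 9,
    "wrong solution, faulty reasoning": 70,
}
other = 100 - sum(outcomes.values())  # remaining answer types listed above
```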

In most cases, the answers had a three-part structure: (i) task understanding, (ii) solving strategy, and (iii) the chain of reasoning. First, ChatGPT “seems” to master task understanding, since it just reformulates the input in a more structured way. Second, we identify and annotate various solving strategies, e.g. “let’s do it step by step”, “backward reasoning”, “systematic search”, “trial and error”, “backtracking”, “Euclidean algorithm”, “principle of inclusion–exclusion”, “assumption based”, “analysing possibilities”. Third, we classify the reasoning faults conveyed by ChatGPT, such as: wrong justification, inconsistency, wrong assumption, implication does not hold, wrong evaluation, bad arithmetic.

Related work has focused on mathematical problems. Frieder et al. [4] analysed the mathematical capabilities of ChatGPT and showed that it provides fewer than 50% correct answers. PaLM, enhanced with chain-of-thought prompting and an external calculator, was able to solve 58% of basic mathematical reasoning tasks (i.e. from the datasets MathQA and GSM8K) [5]. Patterns of ChatGPT failures have been classified by Borji [3], including reasoning, factual errors, math, coding, and bias. I focus here on identifying and refining the reasoning flaws of ChatGPT.

The discussion during the workshop will target: (1) what are the patterns of logical faults committed by LLMs? (2) which types of puzzles are easy or difficult for LLMs to solve? (3) what are the reasoning abilities of various free LLMs (e.g. ChatGPT, Bard, Claude)? (4) which prompts work better when solving logical puzzles?

As an alternative to engineering prompts to solve logic tasks, one could use LLMs to translate the text of the puzzles from natural language into some logic. For instance, Ishay et al. have used LLMs to generate answer set programs [2]. This approach is justified by the fact that translation is the strongest skill of LLMs. Part of our 100 puzzles and their corresponding formalisations could be used to fine-tune the GPT-Curie model, while the remaining part could be used to assess the model’s capabilities.

One dimension is how LLMs can support students in learning first-order logic. Logical puzzles are an effective pedagogical instrument to engage students in learning knowledge representation and reasoning. Along this line, pedagogical tasks include using LLMs to: (i) signal logical faults in a formalisation provided by the student; (ii) explain in natural language a formalisation, proof, or interpretation model; (iii) complete a partial formalisation provided by the student.

One output of this study is the benchmark of 100 logical puzzles, for only 7% of which ChatGPT provided both a correct answer and a correct justification. A second output is the classification of reasoning faults conveyed by ChatGPT. Third, each wrong statement within the ChatGPT answers was manually annotated, aiming to quantify the amount of faulty text generated by the language model.

  • [1] Groza, A. (2021). Modelling Puzzles in First Order Logic (pp. 1-338). Springer.
  • [2] Ishay, A., Zhun Y., and Joohyung L. “Leveraging Large Language Models to Generate Answer Set Programs.” arXiv:2307.07699 (2023).
  • [3] Borji, Ali. “A categorical archive of chatgpt failures.” arXiv preprint arXiv:2302.03494 (2023).
  • [4] Frieder, Simon, et al. “Mathematical capabilities of ChatGPT.” arXiv:2301.13867 (2023).
  • [5] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, and Gaurav Mishra et al. Palm: Scaling language modeling with pathways.arXiv preprint arXiv:2204.02311, 2022.

Claudio Sacerdoti Coen - An indexer and query language for libraries written in LambdaPi/Dedukti

I will present recent work on integrating into LambdaPi an indexer and query language for libraries written in LambdaPi/Dedukti syntax. Terms are indexed using discrimination trees to allow queries up to generalization and instantiation. Moreover, I provide a query language that combines basic queries using logical operators (conjunction/disjunction), set operators (restrictions to certain domains), and localization operators (restricting where a pattern is expected to match a statement, e.g. in the conclusion or in a hypothesis). I use LambdaPi rewriting rules to implement alignments and to remove the clutter of encodings before indexing the terms.
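Retrieval up to generalization can be illustrated with a deliberately naive sketch. The mini term syntax and patterns below are invented; a real discrimination tree shares term prefixes in a trie so that many indexed patterns are traversed at once, whereas this sketch scans linearly, returning the same answers without the efficiency.

```python
# Naive illustration of retrieval up to generalization (invented syntax).
# Terms: tuples ("f", arg, ...); strings starting with "?" are variables.

def generalizes(pattern, term, subst=None):
    """Does `pattern` match `term` under some substitution of ?-variables?"""
    subst = {} if subst is None else subst
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:               # variable already bound:
            return subst[pattern] == term  # enforce consistent use
        subst[pattern] = term
        return True
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        return (len(pattern) == len(term)
                and all(generalizes(p, t, subst)
                        for p, t in zip(pattern, term)))
    return pattern == term

# A tiny "index" of statement patterns.
index = [
    ("add", "?x", "0"),
    ("add", "?x", "?y"),
    ("mul", "?x", "1"),
]

query = ("add", ("s", "0"), "0")
hits = [pat for pat in index if generalizes(pat, query)]
# Both add-patterns generalize the query; the mul-pattern does not.
```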

Local organizer