The 8th IJCAR automated theorem proving system competition

Abstract

The CADE ATP System Competition (CASC) is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems. CASC-J8 was the twenty-first competition in the CASC series. Twenty-one ATP systems and system variants competed in the various competition divisions. An outline of the competition design, and a commentated summary of the results, are presented.

Keywords

Automated theorem proving competition

1. Introduction

The CADE ATP System Competition (CASC) [19] is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, to motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, to provide an inspiring environment for personal interaction between ATP researchers, and to expose ATP systems within and beyond the ATP community. Fulfillment of these objectives provides insight and stimulus for the development of more powerful ATP systems, leading to increased and more effective use. CASC-J8 was held on 29th June 2016 in Coimbra, Portugal, as part of the 8th International Joint Conference on Automated Reasoning.1

¹
In 2016 CADE was a constituent of the 8th International Joint Conference on Automated Reasoning, hence “J8” for “8th Joint” conference.

CASC-J8 was the twenty-first competition in the CASC series; see [22] and citations therein for information about individual previous competitions. The CASC-J8 web site provides access to all competition resources: http://www.tptp.org/CASC/J8.

CASC is divided into divisions according to problem and system characteristics. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked. (The demonstration division is for systems that cannot be entered into the competition divisions for any reason, e.g., the system runs on specialist hardware, or the entrant is a competition organizer or panel member.) Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in the division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions – the entry specifies which competition divisions’ problems are to be used. Table 1 catalogs the divisions and problem categories of CASC-J8.

Table 1

Divisions and problem categories

Division	Problems	Problem categories. The examples can be viewed at http://www.tptp.org/cgi-bin/SeeTPTP?Category=Problems.
THF	Typed Higher-order Form theorems (axioms with a provable conjecture).	TNE – THF with No Equality, e.g., NUM738^1. TEQ – THF with EQuality, e.g., SET171^3.
TFA	Typed First-order form theorems with Arithmetic (axioms with a provable conjecture).	TFI – TFA with only Integer arithmetic, e.g., DAT016=1. TFR – TFA with only Rational arithmetic, e.g., ARI621=2. TFE – TFA with only rEal arithmetic, e.g., MSC022=2.
TFN	Typed First-order form Non-theorems with arithmetic (axioms with a countersatisfiable conjecture, and satisfiable axiom sets). The problems contain only integer, or only rational, or only real, arithmetic.	No problem categories. E.g., NUM881=1.
FOF	First-Order Form theorems (axioms with a provable conjecture).	FNE – FOF with No Equality, e.g., COM003+1. FEQ – FOF with EQuality, e.g., SEU147+3.
FNT	FOF Non-Theorems (axioms with a countersatisfiable conjecture, and satisfiable axioms sets).	FNN – FNT with No equality, e.g., KRS173+1. FNQ – FNT with eQuality, e.g., MGT033+2.
EPR	Effectively PRopositional theorems and non-theorems in clause normal form (unsatisfiable and satisfiable clause sets). Effectively propositional means that the problems are known to be reducible to propositional problems, e.g., CNF problems that have no functions with arity greater than zero.	EPT – Effectively Propositional Theorems (unsatisfiable clause sets), e.g., PUZ037-3. EPS – Effectively Propositional non-theorems (Satisfiable clause sets), e.g., GRP123-2.005.
LTB	First-order form theorems (axioms with a provable conjecture) from Large Theories, presented in Batches. A large theory typically has many functors and predicates, and many axioms of which only a few are required for the proof of a theorem. Problems in a batch typically have a common core set of axioms, and the problems in a batch are given to the ATP system all at once. The batch presentation allows the ATP systems to load and preprocess the common core set of axioms just once, and to share logical and control results between proof searches.	AIM – Problems exported from the AIM loops project [7]. CML – Problems exported from CakeML [9]. HLL – Problems exported from the Flyspeck project [2]. See Section 2.2 for details of these problems’ sources. Each problem category is accompanied by a set of training problems and their solutions, taken from the same source as the competition problems. This training data can be used for ATP system tuning during (typically at the start of) the competition.

Twenty-one ATP systems and system variants, listed in Table 2, competed in the various divisions. The division winners of CASC-25 (the previous CASC) were automatically entered to provide benchmarks against which progress can be judged. Additionally, Prover9 2009-11A is entered into the FOF division each year, as a fixed point against which progress can be judged. Prover9Plus ran in the demonstration division because the developer supplied some of the LTB divisions’ problems. System descriptions of the systems are in [18] and on the CASC-J8 web site.

Table 2

The ATP systems and entrants

ATP system	Divisions	Entrant (associates)	Entrant’s affiliation
Beagle 0.9.47	TFA TFN	Peter Baumgartner (Joshua Bax)	Data61
CVC4 1.5	TFN	CASC	CASC-25 winner
CVC4 1.5.1	TFA TFN FOF FNT	Andrew Reynolds (Clark Barrett, Cesare Tinelli, Tim King)	University of Iowa
E 2.0	FOF FNT EPR LTB	Stephan Schulz (Martin Möhrmann)	DHBW Stuttgart
Geo-III 2016C	FOF FNT EPR	Hans de Nivelle	University of Wrocław
iProver 2.5	FOF FNT EPR LTB	Konstantin Korovin	University of Manchester
Isabelle 2015	THF	Jasmin Blanchette (Lawrence Paulson, Tobias Nipkow, Makarius Wenzel)	Inria Nancy
leanCoP 2.2	FOF	Jens Otten	University of Potsdam
LEO-II 1.7.0	THF	Max Wisniewski (Alexander Steen Christoph Benzmüller)	Freie Universität Berlin
Leo-III 1.0	THF	Max Wisniewski (Alexander Steen Christoph Benzmüller)	Freie Universität Berlin
Leo+III 1.0	THF	Max Wisniewski (Alexander Steen Christoph Benzmüller)	Freie Universität Berlin
Nitpick 2015	FNT	Jasmin Blanchette	Inria Nancy
Princess 160606	TFA TFN	Philipp Rümmer	Uppsala University
Prover9 2009-11A	FOF	CASC (William McCune, Bob Veroff)	CASC fixed point
Prover9Plus 1.0	LTB (demo)	Bob Veroff (William McCune, Josef Urban)	University of New Mexico
Refute 2015	FNT	Jasmin Blanchette (Tjark Weber)	Inria Nancy
Satallax 2.8	THF	CASC	CASC-25 winner
Satallax 3.0	THF	Michael Färber (Chad Brown)	Universität Innsbruck
Vampire 4.0	FOF FNT EPR LTB	CASC	CASC-25 winner
Vampire 4.1	TFA FOF FNT EPR LTB	Giles Reger (Martin Suda, Andrei Voronkov Evgeny Kotelnikov, Laura Kovacs)	University of Manchester
VampireZ3 1.0	TFA	CASC	CASC-25 winner

CASC-J8 was organized by Geoff Sutcliffe, and overseen by a panel consisting of Pascal Fontaine, Aart Middeldorp, and Christoph Wernhard. The competition was run on computers provided by the StarExec project [15] at the University of Iowa.

This paper is organized as follows: Section 2 outlines the design and organization of the competition. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 concludes and discusses plans for future CASCs.

2. Outline of design and organization

The design and organization of CASC has evolved over the years to a sophisticated state. An outline of the CASC-J8 design and organization is provided here. The details are in [18] and on the CASC-J8 web site. Important changes for CASC-J8 were:

The Typed Higher-order Non-theorem (THN) division was put into a hiatus state, due to the limited number and diversity of systems for this type of problem [22].

The TFN division had no problem categories, due to the very small number of TFN problems using only rational or real arithmetic. This dearth of problems was noted in the CASC-25 report [22], but the hope that more suitable problems would be contributed to the TPTP was not fulfilled.

The LTB division was run on StarExec. In CASC-25 the LTB division was run on a cluster at the University of Manchester, because StarExec did not have a facility to download partial results from a job, as is required to display results as the ATP systems work their way through the problems in a batch. The StarExec developers added that facility after CASC-25, so that the LTB division could run on StarExec.2

²
Thanks to Aaron Stump for getting that done for CASC!

2.1. System delivery, execution, and evaluation

The ATP systems were delivered to the competition organizer as StarExec installation packages, which the organizer installed and tested on StarExec. Source code was delivered separately for archiving on the competition web site.

The ATP systems entered into CASC are required to be fully automatic. They are executed as black boxes, on one problem (non-LTB) or one problem batch (LTB) at a time. Any command line parameters have to be the same for all problems/batches in each division. The systems are required to be sound, and are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, EPR, and LTB divisions, and theorems to the systems in the TFN, FNT, and EPR divisions. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. For the third year in a row, no systems failed this soundness testing.

The THF, TFA, FOF, FNT, and LTB divisions were ranked according to the number of problems solved with an acceptable proof/model output. The TFN and EPR divisions were ranked according to the number of problems solved, but not necessarily accompanied by a proof or model (but systems that do output proofs/models are highlighted in the presentation of results). Ties were broken according to the average time taken over problems solved, CPU time for the non-LTB divisions, wall clock time for the LTB division. Trophies were awarded to the division winners.

In addition to the ranking criteria, three other measures were made and are presented in the results: The state-of-the-art (SoTA) contribution quantifies the unique abilities of each system. For each problem solved by a system, its SoTA contribution for the problem is the reciprocal of the number of systems that solved the problem, so that if a system is the only one to solve a problem then its SoTA contribution for the problem is 1.00, and if all the systems solve a problem their SoTA contribution for the problem is the inverse of the number of systems. A system’s overall SoTA contribution is its average SoTA contribution over the problems it solved. The efficiency combines the time taken for each problem solved with the number of problems solved. It is the average solution rate over the problems solved (the solution rate for one problem is the inverse of the time taken to solve it), multiplied by the fraction of problems solved. Efficiency is computed for CPU time to measure how efficiently the systems use one core, and additionally for wall clock time in the LTB division (which has a wall clock time limit) to measure how efficiently the systems use the multiple cores that are available. The core usage directly measures the extent to which the systems use the multiple cores that are available. It is the average ratio of CPU time to wall clock time taken, over the problems solved. The competition ran on quad-core computers, thus the maximal core usage was 4.0.

2.2. Problems

2.2.1. Non-LTB problems

The problems for the non-LTB divisions were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [17], v6.4.0. The TPTP version used for CASC is not released until after the competition has started, so that new problems in the release have not been seen by the entrants. The problems have to meet certain criteria to be eligible for selection, as follows:

The TPTP tags problems that are designed specifically to be suited or ill-suited to some ATP system, calculus, or control strategy as biased. Biased problems are excluded from the competition.

The problems must be syntactically non-propositional.

The TPTP uses system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library to compute problem difficulty ratings in the range 0.00 (easy) to 1.00 (unsolved) [21]. Systems can be submitted before the competition so that their performance data is used for computing the problem ratings. Difficult problems with ratings in the range 0.21 to 0.99 are normally eligible for CASC. Problems of lesser and greater ratings may also be eligible in some divisions if there are not enough problems with the desired ratings with respect to the resources available. For CASC-J8 there were adequate resources to admit all problems with ratings greater than 0.00 in the TFA, TFN, FNT, and EPR divisions. There were 229 problems with rating less than 0.21 in the TFA division, none in TFN or FNT, and 82 in EPR. There were 46 problems with rating 1.00 in the TFA division, 22 in TFN, 48 in FNT, and 40 in EPR. The easier problems lead to higher solution rates, while the use of problems with rating 1.00 is interesting if the new (versions of) systems in the competition solve some of them.

The selection is constrained so that no division or category contains an excessive number of “very similar” problems [16]. In TPTP v6.4.0 there are very few TFR problems that are not very similar, and therefore 18 problems that are very similar to others had to be selected.

The selection is biased to select problems that are new in the TPTP version used, until 50% of the problems in each problem category have been selected, after which random selection (from old and new problems) continues. The number of new problems used depends on how many new problems are eligible and the limitation on very similar problems.

In order to ensure that no system receives an advantage or disadvantage due to the specific presentation of the problems in the TPTP, the problems are preprocessed to strip out all comment lines (in particular, the problem header), randomly reorder the formulae/clauses (include directives are left before the formulae, type declarations are kept before the symbols’ uses), randomly swap the arguments of associative connectives, randomly reverse implications, and randomly reverse equalities.

The numbers of problems used in each division and problem category are roughly proportional to the numbers of eligible problems, subject to the limitation on the number of very similar problems. The problems used are randomly selected from the eligible problems based on a seed supplied by the competition panel. The problems are given to the ATP systems in TPTP format, in increasing order of TPTP difficulty rating.

Table 3 gives the numbers of eligible problems, the maximal numbers that could have been used after taking into account the limitation on very similar problems (in the context of the numbers of problems used), and the numbers of problems used, in each division and category. There were very few new problems in CASC-J8, due to limited growth of the TPTP in the preceding year.

Table 3
Numbers of eligible and used problems

Division: THF TFA TFN FOF FNT EPR LTB

Category: TNE TEQ TFI TFR TFE FNE FEQ FNN FNQ EPT EPS AIM CML HLL

Eligible 127 616 336 87 125 58 349 3199 236 156 325 114 1155 10,227 7150

Usable 127 616 336 57 125 58 246 828 220 156 325 107 – – –

New 0 0 6 0 0 7 0 33 0 8 4 10 – – –

Used 100 400 300 75 125 50 200 300 200 100 200 100 200 200 200

New 0 0 6 0 0 7 0 33 0 8 4 10 – – –

Division:	THF	TFA	TFN	FOF	FNT	EPR	LTB
Eligible	127	616	336	87	125	58	349	3199	236	156	325	114	1155	10,227	7150
Usable	127	616	336	57	125	58	246	828	220	156	325	107	–	–	–
New	0	0	6	0	0	7	0	33	0	8	4	10	–	–	–
Used	100	400	300	75	125	50	200	300	200	100	200	100	200	200	200
New	0	0	6	0	0	7	0	33	0	8	4	10	–	–	–

2.2.2. LTB problems

The problems for the LTB division were taken from publicly available problem sets. For CASC-J8: the AIM problem category used the AIM2175 problem set,3

³
http://www.tptp.org/CASC/J8/AIM2175.tgz.

exported from the AIM loops project [7]; the CML problem category used the CML10227 problem set,4

⁴

http://grid02.ciirc.cvut.cz/~mptp/CML10227.tar.gz.

exported from CakeML [1,9] in HOL4 [13]; the HLL problem category used the HH7150 problem set,5

⁵

https://github.com/JUrban/HH7150.

exported from the Flyspeck project [2] in HOL Light [4]. Entrants were expected to honestly not use the (publicly available) problem sets for tuning before the competition. The problems and training sets were selected by the competition organizer, as follows:

The AIM2175 problem set contains 1155 test problems, and 1020 training problems with one solution each. For the competition, 200 problems were randomly chosen from the 1155. All the training problems and their solutions were given in the training set.

The CML10227 problem set contains 10,227 problems. Proofs were found by Vampire 4.0 for 2017 of them by careful preselection of axioms. The 2017 proofs were used to construct “easy problems” with only the axioms used in the proofs, which were attempted in non-batch mode by pre-CASC-J8 versions of E, iProver, and Vampire, with a 120s CPU time limit. Only 440 of these easy problems were solved by any of the three systems. One hundred of the 440 solved problems and 100 of the 1577 unsolved problems were randomly selected to be used in the competition, but with their full original axiomatizations. For the training set, 1000 problems (and their Vampire 4.0 solutions) were selected from the remaining 1817 unselected problems.

The HH7150 problem set contains 7150 problems. This problem set was also used in CASC-25, and same criteria were used for problem selection: 140 problems that were solvable in 60s by at least one of the CASC-25 non-LTB versions of the systems, and 60 problems that were not solvable in 60s. For the training set, 1000 problems (and their solutions) were selected from those that were solvable in 60s but not selected for the competition.

The problems are given to the ATP systems in TPTP format, in the natural order of their export from the underlying library. This models the work of ATP systems used to assist in formal proof development.

2.3. Resources

The computers had four Intel(R) Xeon(R) E5-2609, 2.40GHz CPUs (a quad-core chip), 128GB memory, and ran the Red Hat Enterprise Linux Workstation release 6.3 (Santiago) operating system, kernel 2.6.32-431.1.2.el6.x86_64.

In the non-LTB divisions a 360s CPU time limit was imposed for each problem. A wall clock time limit of 720s was also imposed to limit very high memory usage that causes swapping. In the LTB division a 180s wall clock time limit was imposed for each problem, and a 36,000s overall wall clock time limit (200 problems times 180s) for each problem category. Time spent before starting the first problem in a batch, e.g., learning from the training set and pre-loading the common axioms, and times spent between ending a problem and starting the next, e.g., learning from previous proofs, were not part of the time taken on problems. However, time taken on such tasks was part of the overall time taken for the batch.

Demonstration division systems can run on the competition computers, or the computer(s) can be supplied by the entrant. The CASC-J8 demonstration division system used the competition computers.

3. Results

For each ATP system, for each problem, four items of data were recorded: whether or not the problem was solved, the CPU time taken, the wall clock time taken, and whether or not a proof or model was output. This section summarizes the results, and provides commentary. The result tables below give the number of problems solved in the division, the average time over problems solved, whether or not proofs or models were output, the state-of-the-art contribution, the (micro-)efficiency, the core usage, the number of new problems solved, and the number of problems solved in each problem category. In each of the results tables, the CASC-25 winner is emphasized. Detailed results, including the systems’ output files, are available from the CASC-J8 web site.

3.1. The THF division

Table 4 summarizes the results of the THF division. Satallax 3.0 improved over Satallax 2.8, which won the division in CASC-25, by using the E prover as a subsystem (the E prover was available in Satallax 2.8, but not used because E’s proofs had not been integrated into Satallax’s proof output). If Isabelle had output proofs it would have won the division, and the developer has promised that proof output will be implemented for CASC-26 next year. It is also noteworthy that Isabelle solved 316 problems (89% of the problems it solved) within 120s, mostly using the cvc4 (101 solutions), z3 (80 solutions), blast (54 solutions), spass (53 solutions), and auto+spass (48 solutions) tactics. It is strange that the Leo-III system performed worse that LEO-II, as Leo-III uses LEO-II as a subsystem – there should have been valued added in Leo-III. This turned out to be caused by Leo-III giving LEO-II only a few seconds, which rendered LEO-II ineffective – a longer time allocation would fix this problem. Despite their weaker performances, Leo-III and Leo+III are interesting as they are new systems with a novel agent-based architecture. A brief system description of Leo-III and Leo+III is provided in Section 4. The systems’ performances in the two problem categories align with their overall performances, suggesting that there is no specialization to problems with or without equality.

Table 4
THF division results

ATP system THF/500 Avg CPU Prfs out SOTA con. μ eff. Core usage TNE/100 TEQ/400

Satallax 3.0 346 22.1 327 0.32 358 0.95 67 279

Satallax 2.8 315 19.5 313 0.32 360 0.93 63 252

LEO-II 1.7.0 238 20.9 231 0.33 344 0.74 47 191

Leo+III 1.0 89 48.4 88 0.21 38 2.12 33 56

Leo-III 1.0 74 42.8 74 0.21 33 2.17 29 45

Isabelle 2015 356 81.1 0 0.39 29 1.03 65 291

ATP system	THF/500	Avg CPU	Prfs out	SOTA con.	μ eff.	Core usage	TNE/100	TEQ/400
Satallax 3.0	346	22.1	327	0.32	358	0.95	67	279
Satallax 2.8	315	19.5	313	0.32	360	0.93	63	252
LEO-II 1.7.0	238	20.9	231	0.33	344	0.74	47	191
Leo+III 1.0	89	48.4	88	0.21	38	2.12	33	56
Leo-III 1.0	74	42.8	74	0.21	33	2.17	29	45
Isabelle 2015	356	81.1	0	0.39	29	1.03	65	291

The top four systems all failed to output proofs for some problems. In most cases the task of producing and printing a proof took more time than was available before the time limit was reached. For Satallax 3.0 this problem was exacerbated by the fact that only the time until end of the current strategy time slice was made available for proof production (rather than all the time until the time limit).

The SoTA contributions are lower for Leo-III and Leo+III, indicating that they have no distinguishing capabilities. The top three systems had significantly higher efficiencies, reflecting the fact that they solved significantly more problems with lower average CPU times. The high average CPU time of Isabelle, as was noted in the CASC-25 report, is due to a long start-up time and a strategy schedule that is not suited to the TPTP problems. The core usage of all except the Leo systems is about 1.00. The Leos’ good core usage is due to their implementation in Scala, in which garbage collection and JIT compilation is done in parallel with the main thread.

The individual problem results show that 37 problems were unsolved, 32 problems were solved by all the systems, and 66 problems were solved by only one system. Of the 66 unique solutions, 43 were by Isabelle, 14 by LEO-II, and 5 by Satallax 3.0, and 4 by Satallax 2.8. The unique solutions indicate that a portfolio approach would be effective for these THF problems. Giving Isabelle, LEO-II, and Satallax 3.0 120s each would solve 442 problems.

3.2. The TFA division

Table 5 summarizes the results of the TFA division. The winner, Vampire 4.1, is a newcomer to the TFA division, and a brief system description of Vampire 4.1 focusing on the features for the TFA division is provided in Section 4. It dominated the division, solving the most problems with the highest SOTA contribution and highest efficiency, and outperforming the CASC-25 winner VampireZ3. The performances in the integer, rational, and real problem categories align with the overall results except for Princess’s stronger performance on integer problems. This is because Princess has a native theory solver for integers (and various other heuristics, e.g., for non-linear problems), but just uses first-order axioms for rationals and reals – rationals and reals have not yet been a focus of the developer’s work.

Table 5
TFA division results

ATP system TFA/500 Avg CPU Prfs out SOTA con. μ eff. Core usage New/6 TFI/300 TFR/75 TFE/125

Vampire 4.1 419 13.4 419 0.30 585 0.78 3 230 73 116

VampireZ3 1.0 380 9.2 380 0.25 512 0.80 3 202 72 106

CVC4 1.5.1 343 5.7 343 0.27 587 1.15 0 164 73 106

Beagle 0.9.47 300 18.8 300 0.22 282 1.86 5 135 73 92

Princess 160606 342 17.6 271 0.24 294 2.41 5 185 71 86

ATP system	TFA/500	Avg CPU	Prfs out	SOTA con.	μ eff.	Core usage	New/6	TFI/300	TFR/75	TFE/125
Vampire 4.1	419	13.4	419	0.30	585	0.78	3	230	73	116
VampireZ3 1.0	380	9.2	380	0.25	512	0.80	3	202	72	106
CVC4 1.5.1	343	5.7	343	0.27	587	1.15	0	164	73	106
Beagle 0.9.47	300	18.8	300	0.22	282	1.86	5	135	73	92
Princess 160606	342	17.6	271	0.24	294	2.41	5	185	71	86

This is the second CASC in which the ranking in the TFA division has been by number of proofs output. In CASC-25 neither CVC4 1.4 nor Princess 150706 output proofs, and were thus in tie last place. It is pleasing to see that both these systems now output proofs, although Princess did not output proofs in 71 cases. This is because Princess has two procedures, the native module for non-linear integer reasoning, and the procedure for constructing free-variable tableaux, that do not yet log proof information.

Beagle and Princess had notably lower efficiencies than the other three systems, largely due to their higher CPU times (partly caused by the startup time required for the Java Virtual Machine). In contrast, Beagle and Princess had the best core usage, and did better on the six new problems. Their good core usage is (as for the Leos in THF) due to their implementation in Scala. Neither system does any explicit parallel activity across cores, e.g., strategy parallelism (Princess does do strategy parallelism, but internally using its own scheduler). The developers believe that strategy parallelism that takes advantage of multiple cores would be simple to add – a challenge for CASC-26!

As was noted in Section 2.2, 46 problems of rating 1.00 were used in the TFA division. Thirteen of them were solved in the competition, indicating progress since the problem ratings were done. Twenty-five of these 46 problems are hardware verification problems that have a small reliance on arithmetic manipulation and a large combinatorial search space. As such these problems favour systems with strong search features rather than systems that are focused on strong arithmetic capabilities. Only 3 of them were solved, and only by Vampire.

The individual problem results show that 36 problems were unsolved, 235 problems were solved by all the systems, and 43 problems were solved by only one system. Of the 43 unique solutions, 26 were by Vampire, 15 by CVC4, 2 by Beagle, and 2 by Princess. These numbers of unique solutions are reflected in the SOTA contributions. The best portfolio would be to give 90s to each of Vampire, VampireZ3, CVC4, and Beagle, which would solve 459 problems. But noting that it is unlikely that there will be further development of VampireZ3 (it was a “one off” development for CASC-25, and is now replaced by pure Vampire), it is worthy considering a portfolio of just Vampire and CVC4 – giving 250s to Vampire and 110s to CVC4 would solve 455 problems.

As was noted in Section 2.2, 229 problems with rating less than 0.21 were used in the TFA division. This is the reason why so many problems were solved by all systems, particularly in the TFR problem category where 71 of the 75 problems had rating less than 0.21.

3.3. The TFN division

Table 6 summarizes the results of the TFN division. The winner, Beagle, was declared to be the winner only several days after the competition. On the day of CASC-J8 CVC4 1.5 (the CASC-25 winner) solved 15 problems and was declared the winner. However, the following day the CVC4 developer reminded the organizer that the CASC-25 version of CVC4 1.5 is unsound in some corner cases handling non-linear arithmetic, and could incorrectly report models (this bug did not manifest itself in CASC-25). The developer had provided a fixed version of CVC4 1.5, but the CASC organizer forgot to install it for CASC-J8 (whoops!). The fixed version was then installed and run on the division. It solved only 8 problems, thus relegating it to last place, and making Beagle the winner. At no stage was there any intent to deceive, and the old CVC4 1.5 did pass the CASC-J8 soundness testing. The CVC4 developer receives an honourable mention for his report, and says he will win the division next year anyway, by developing the non-linear integer arithmetic capabilities in CVC4. CVC4 1.5.1 was the only system to output models, and this capability is important in applications.

Table 6
TFN division results

ATP system TFN/50 Avg CPU Mdls out SOTA con. μ eff. Core usage New/7

Beagle 0.9.47 10 3.1 × 0.55 58 2.83 0

CVC4 1.5.1 9 0.0 ✓ 0.50 180 0.75 2

Princess 150706 8 1.5 × 0.69 140 1.88 0

CVC4 1.5 8 22.9 × 0.44 101 0.92 1

ATP system	TFN/50	Avg CPU	Mdls out	SOTA con.	μ eff.	Core usage	New/7
Beagle 0.9.47	10	3.1	×	0.55	58	2.83	0
CVC4 1.5.1	9	0.0	✓	0.50	180	0.75	2
Princess 150706	8	1.5	×	0.69	140	1.88	0
CVC4 1.5	8	22.9	×	0.44	101	0.92	1

The various performance measures do not correspond with the numbers of problems solved: Beagle has the highest core usage, but has the lowest efficiency. Princess has the highest SOTA contribution, and CVC4 1.5.1 has the lowest average CPU time and highest efficiency. None of the systems performed well on the 7 new problems. Six of the problems are axiomatizations of some simple data structures, e.g., integer arrays, of which only 2 were solved. The other is a variant of the Collatz conjecture, which of course could not be solved.

As was noted in Section 2.2, 22 problems of rating 1.00 were used in the TFN division – almost half of the division. Nine are hardware verification problems like the 25 rating 1.00 problems in the TFA division, 8 are software verification problems that are considered to be relatively simple benchmarks for infinite state model checkers, 4 are new data structure problems as mentioned above, 2 are arithmetic tests, and the last is the variant of the Collatz conjecture. Only one of them – one of the arithmetic tests – was solved, and only by Beagle.

The individual problem results show that 31 problems were unsolved, no problems were solved by all the systems, and 6 problems were solved by only one system. Of the 6 unique solutions, 3 were by Princess, 2 by Beagle, and 1 by CVC4 1.5.1. A portfolio of Beagle, CVC4 1.5.1, and Princess, giving them 120s each, would solve 19 problems …quite a remarkable improvement over the individual systems.

As was the case in CASC-25, the deficiency of this division was the small number of eligible problems – despite some efforts only a few new eligible problems were added to the TPTP in the last year. Establishing non-theoremhood in the context of arithmetic is a naturally difficult task, but also an important task for applications of ATP. The TFN division will continue to challenge the ATP community, and hopefully over time the ATP systems will become more effective.

3.4. The FOF division

Table 7 summarizes the results of the FOF division. The CASC-25 winner, Vampire 4.0, beat the new Vampire 4.1. Vampire 4.1 used a new core theorem prover, but with Vampire 4.0’s strategy schedules. This mismatch lead to a weaker performance than Vampire 4.0, where the strategy schedule was calibrated to the core theorem prover. (This is another reminder that proof search can be fragile.) However, Vampire 4.1 had a lower average CPU time than Vampire 4.0.

Four of the 8 systems failed to output proofs for some of the problems solved. For Vampire 4.0, as was noted in the CASC-25 report [22], some proofs were not output due to a bug, which has been fixed in Vampire 4.1. For CVC4, iProver, and Prover9, after theoremhood was established, the task of producing and printing a proof took more time than was available before the time limit was reached. For example, for LCL666+1.010 it takes iProver 58s to establish theoremhood, but, as the developer reported after the competition, around an hour to produce the proof.

The problem category results align well with the overall results, with only iProver having its known preference for problems with equality. An examination of the FEQ performance plots at http://www.tptp.org/CASC/J8/WWWFiles/ResultsPlots.html#FEQProblems shows that E has a noticeable performance inflection just past 180s, when it switches from its first strategy that was given 180s (half the time limit) to it’s second strategy that is given 90s (a quarter of the time limit). There is another inflection around 270s when it switches to its third strategy. E’s strategy scheduling is clearly effective, and could be more time efficient if E took advantage of the multiple cores available to run the strategies in parallel (rather than sequentially on one core – it’s core usage is only 0.78).

Table 7
FOF division results

ATP cystem FOF/500 Avg CPU Prfs out SOTA con. μ eff. Core usage New/33 FNE/200 FEQ/300

Vampire 4.0 457 15.4 453 0.24 537 0.89 33 187 270

Vampire 4.1 447 14.2 447 0.23 547 0.87 30 182 265

E 2.0 392 30.9 392 0.23 460 0.78 27 161 231

CVC4 1.5.1 329 35.1 328 0.20 323 0.86 30 148 181

iProver 2.5 278 30.8 274 0.19 163 0.93 12 166 112

leanCoP 2.2 168 78.0 168 0.17 93 0.87 7 98 70

Prover9 1109a 101 30.0 98 0.20 78 0.87 2 25 76

Geo-III 2016C 54 41.8 54 0.23 27 0.96 0 32 22

ATP cystem	FOF/500	Avg CPU	Prfs out	SOTA con.	μ eff.	Core usage	New/33	FNE/200	FEQ/300
Vampire 4.0	457	15.4	453	0.24	537	0.89	33	187	270
Vampire 4.1	447	14.2	447	0.23	547	0.87	30	182	265
E 2.0	392	30.9	392	0.23	460	0.78	27	161	231
CVC4 1.5.1	329	35.1	328	0.20	323	0.86	30	148	181
iProver 2.5	278	30.8	274	0.19	163	0.93	12	166	112
leanCoP 2.2	168	78.0	168	0.17	93	0.87	7	98	70
Prover9 1109a	101	30.0	98	0.20	78	0.87	2	25	76
Geo-III 2016C	54	41.8	54	0.23	27	0.96	0	32	22

Table 8

FNT division results

ATP system	FNT/300	Avg CPU	Mdls out	SOTA con.	μ eff.	Core usage	New/8	FNN/200	FNQ/100
Vampire 4.1	250	40.1	248	0.27	215	0.93	1	199	51
Vampire 4.0	240	36.5	238	0.26	221	0.96	1	191	49
iProver 2.5	200	30.3	200	0.23	260	2.29	1	183	17
Nitpick 2015	139	37.9	139	0.19	26	1.11	0	120	19
CVC4 1.5.1	96	22.4	96	0.15	211	0.69	0	94	2
Geo-III 2016C	76	13.7	75	0.15	138	0.92	0	76	0
E 2.0	70	16.3	70	0.19	124	0.83	0	60	10
Refute 2015	58	69.1	0	0.14	6	1.02	0	57	1

The SOTA contributions were similar for the best performing systems. The efficiencies of the Vampires are notably high, due to the numbers of problems solved with low average CPU times. None of the systems took advantage of the multiple cores, with all of the core usages below 1.00. The systems’ abilities to solve the new problems aligned well with the overall performance, suggesting that the systems are not over tuned to existing TPTP problems.

The individual problem results show that 17 problems were unsolved, 4 problems were solved by all the systems, and 17 problems were solved by only one system. Of the 17 unique solutions, 9 were by E, and 2 by each of Vampire 4.0, CVC4, Prover9, and Geo-III. Several interesting possible portfolios emerge. An obvious one is Vampire 4.0 with E – giving them 180s each would solve 465 problems. Less obvious is the combination of Vampire 4.1, E, and iProver – giving them 120s each would solve 468 problems. Least obvious is the combination of the two Vampires, E, CVC4, and Geo-III, giving them 72s each would solve 469 problems.

3.5. The FNT division

Table 8 summarizes the results of the FNT division. The new Vampire 4.1 easily outperformed the CASC-25 winner Vampire 4.0, mainly due to improvements in its finite model finding techniques [12]. The biggest improvement is an approach that promotes inferred (monotonic) subsorts to proper sorts, and treats the problem as a multi-sorted one, allowing for a more fine-grained search for a finite model. Both the Vampires solve a couple of problems without producing a model, because of an omission in the code for cases when satisfiability is shown by a combination of definition inlining and pure predicate removal – this will be fixed for CASC-26. There is a clear gap between the top three systems and the others, although the fourth placed Nitpick did quite well for a system that is designed for model finding in higher-order logic. The problem category rankings align with the overall ranking except for E’s preference for problems with equality.

The Vampires have the highest SOTA contributions, but iProver has the highest efficiency due to it’s lower average CPU time. Nitpick and Refute have significantly lower efficiencies, because they seldom solve problems very quickly. This is due to the high start-up time of these systems – around 15s for Nitpick and around 6s for Refute. iProver is the only system to make good use of the multiple cores, as reflected in the core usages. iProver ran several strategies in parallel in the FNT division, because of the diversity of problems – some can be solved quickly by finite model finding, while some require other methods such as instantiation and resolution. None of the problems were able to solve more than one of the 8 new problems. All of the new problems are to show the satisfiability of axiomatizations, created during the development of the new Thousands of Models for Theorem Provers (TMTP) model library [20]. The axiomatizations are the bases for theorems in the TPTP, so demonstrating their satisfiability is important for ensuring that proofs for the theorems are not due to contradictory axioms.

Table 9
EPR division results

ATP system EPR/300 Avg CPU Pr M out SOTA con. μ eff. Core usage New/14 EPT/200 EPS/100

iProver 2.5 229 28.3 ✓ ✓ 0.34 300 0.88 4 156 73

Vampire 4.1 222 29.2 ✓ ✓ 0.31 310 0.86 1 153 69

Vampire 4.0 222 35.4 ✓ ✓ 0.31 304 0.92 1 155 67

E 2.0 101 21.9 ✓ ✓ 0.26 153 0.94 2 60 41

Geo-III 2016C 10 55.9 – ✓ 0.30 17 0.82 4 0 10

ATP system	EPR/300	Avg CPU	Pr M out	SOTA con.	μ eff.	Core usage	New/14	EPT/200	EPS/100
iProver 2.5	229	28.3	✓ ✓	0.34	300	0.88	4	156	73
Vampire 4.1	222	29.2	✓ ✓	0.31	310	0.86	1	153	69
Vampire 4.0	222	35.4	✓ ✓	0.31	304	0.92	1	155	67
E 2.0	101	21.9	✓ ✓	0.26	153	0.94	2	60	41
Geo-III 2016C	10	55.9	– ✓	0.30	17	0.82	4	0	10

Table 10

LTB division results

ATP system	LTB/600	Avg CPU	Avg WC	Prfs out	SOTA con.	CPU μ eff.	WC μ eff.	Core usage	AIM/200	CML/200	HLL/200
Vampire 4.0	403	31.1	11.6	403	0.24	199	246	1.95	47	199	157
Vampire L-4.1	398	22.4	9.5	398	0.23	234	277	1.65	44	198	156
Vampire 4.1	396	16.7	8.1	396	0.23	246	285	1.66	48	197	151
E 2.0	305	33.9	12.6	305	0.21	85	145	2.63	17	193	95
iProver 2.5	298	132.2	35.1	298	0.21	20	54	3.60	14	192	92
Demonstration division
Prover9Plus 1.0	85	14.5	14.4	85	0.61	77	78	1.01	85	0	0

As was noted in Section 2.2, 48 problems of rating 1.00 were used in the FNT division. These include 7 of the 8 new problems, and 23 of which are problems to establish satisfiability of axiomatizations. None of the 48 problems were solved.

The individual problem results show that 48 problems were unsolved, 17 problems were solved by all the systems, and 6 problems were solved by only one system. Of the 6 unique solutions, 4 were by Vampire 4.1, and 2 by Vampire 4.0. A portfolio approach does not help.

3.6. The EPR division

Table 9 summarizes the results of the EPR division. iProver had won the EPR division from CASC-J4 in 2008 to CASC-J7 in 2014, but was dethroned by Vampire 4.0 in CASC-25. iProver has now won the division again. The improved performance of iProver is due mainly to improved preprocessing, in particular predicate elimination [6]. The new Vampire 4.1 also beat Vampire 4.0 by virtue of a lower average CPU time. The problem category rankings align with the overall ranking.

The SOTA contributions and efficiencies are unremarkable, and correspond to the overall performances. None of the systems took advantage of the multiple cores, with all of the core usages below 1.00. The performance on the 14 new problems was weak. Four of the new problems were in the EPT problem category, 2 of which were solved by some system. Ten of the new problems were in the EPS problem category, all to show the satisfiability of a TPTP axiomatization, of which 4 were solved by some system.

As was noted in Section 2.2, 40 problems of rating 1.00 were used in the EPR division, 24 in the EPT problem category and 16 in the EPS problem category. Three of the problems were solved, indicating progress since the problem ratings were done. One of 4 new EPT problems had rating 1.00, and 6 of the 10 new EPS problems had rating 1.00, which partly explains the poor performance on the new problems.

The individual problem results show that 53 problems were unsolved, 5 problems were solved by all the systems, and 13 problems were solved by only one system. Of the 13 unique solutions, 10 were by iProver, 2 by E, and 1 by Vampire 4.0. Despite Vampire’s singular unique solution, it would contribute well to a portfolio with iProver – giving iProver and Vampire 4.0 180s each would solve 241 problems.

3.7. The LTB division

Table 10 summarizes the results of the LTB division. The Vampires dominated the division, as they did in CASC-25, with the CASC-25 winner, Vampire 4.0, winning again. The rankings in the three problem categories align with the overall performances for the competition division systems. In the demonstration division, Prover9Plus clearly outperformed the competition division systems in the AIM problem category. Prover9Plus was in the demonstration division because the developer provided the AIM problems, and it attempted only the AIM category’s problems. Note however that Prover9Plus was not tuned to the AIM problems – the settings used are generic settings that the developer uses when starting work on new problems. The strong performance is largely attributed to its use of learning, as explained below.

The competition division systems have similar SOTA contributions, indicating no particular specialization, but Prover9Plus does show special capabilities in the AIM problem category. The Vampires had the highest efficiencies, with respect to both CPU time and wall clock time, with Vampire 4.1 being the best. All the systems except Prover9Plus used multiple cores, ranging from the Vampires’ use of 2 cores to iProver’s use of all 4 cores.

The design of the LTB division makes it possible (and presumably advantageous) to load and preprocess the common core axioms of a batch before starting to solve the problems. The systems took advantage of this possibility to various extents: iProver and Prover9Plus did not preload the common axioms, the Vampires loaded the common axioms into memory but did nothing more because axiom processing is strategy dependent, and E loaded the common axioms into memory and indexed them for its SiNE-style [3] axiom selection technique.

The design of the LTB division facilitates the use of learning from existing problems and their proofs, and from proofs of the problems in the competition. Prover9Plus used the proofs in the training data and the proofs of solved problems to learn “hints” that help guide its search. A later experiment showed that Prover9Plus would have solved only 40 problems if this technique had not been used. As a learning system Prover9Plus is interesting, and a brief system description is provided in Section 4. Vampire L-4.1 uses previous proofs to learn which of its strategies are being most successful, and moves the more successful strategies to earlier in it’s schedule. The Vampire developer reported that the learning did not work as expected.

The individual problem results show that 143 problems were unsolved (109 AIM, 1 CML, 33 HLL), 268 problems were solved by all the systems (13 AIM, 187 CML, 68 HLL), and 54 problems were solved by only one system (43 AIM, 0 CML, 11 HLL). Of the 54 unique solutions, 44 were by Prover9Plus, 6 by Vampire 4.0, 3 by E, 2 by iProver, and 1 by Vampire L-4.1. There is not much to be gained from a portfolio approach in any of the problem categories.

4. System descriptions

Three systems were of particular interest in CASC-J8: Leo-III 1.0 because it’s a new system with an interesting architecture, Vampire 4.1 because of its improved performance in the TFA division, and Prover9Plus because of it’s successful use of learning in the LTB division. The following brief descriptions of these systems were written by their developers.

Leo-III 1.0 [14] is an automated theorem prover for classical higher-order logic (HOL) with Henkin-semantics. It also supports the TPTP’s TFF and FOF dialects. Leo-III is implemented in Scala on top of the LeoPARD framework [24]. It uses an agent-based blackboard architecture, in which independent agents collaborate to share information between internal and external specialists. The coordination of agents is implemented as combinatorial auction game that maximizes a cost-benefit heuristic. Problems are converted into CNF using different advanced normalization methods [25], and processed using relevance filtering. Terms are polymorphically typed and represented as spines with explicit substitutions and perfect term sharing. The overall proof search is based on an extensional paramodulation calculus that employs a higher-order term ordering for restricting the number of inferences. The traversal of the search space is currently realized as a given-clause algorithm with simple heuristics. Although not mature yet, Leo-III can already combine the advantages of different systems, and bridge between external agents – Leo-III used LEO-II as an external HOL specialist, while Leo+III used both LEO-II and Satallax. Future versions will also cooperate with first-order systems, in the spirit of LEO-II.

Vampire [8] is an ATP system that was originally designed for pure first-order logic. Since version 2.6 it has been able to reason over typed first-order logic with arithmetic, and this capability has been a focus of recent development. For arithmetic reasoning Vampire 4.1 employs three techniques: (i) evaluation of ground terms, (ii) ground theory reasoning with Z3 via AVATAR, and (iii) the addition of theory axioms. Beyond the simple operation of evaluating ground expressions, e.g., $2 + 1$ is evaluated to 3, evaluation includes two extra steps. Firstly, it normalizes terms by, e.g., rewriting inequalities in terms of <, and simplifying equations, e.g., $2 x = 4$ is simplified to $x = 2$ . Secondly, it performs simple instantiations that remove literals or lead to inconsistencies, e.g., on observing $x < 1 \lor p (x)$ it might add $p (0)$ , and $x < 1$ might be instantiated to $2 < 1$ . Theory reasoning with Z3 via AVATAR is reported in [11]. Theory axioms such as $x ⩽ y \Rightarrow x < y + 1$ are added to provide incomplete non-ground reasoning.

Prover9Plus 1.0 is an enhanced version of Prover9 [10] that improves search effectiveness for sequences of related theorems by enabling a simple kind of learning. Prover9 is a resolution/paramodulation prover for first-order logic with equality. It uses the “given clause algorithm” (sometimes called the “Otter loop”) and includes several different strategies for clause selection that can be combined in user-specified ways. The strategy most relevant to Prover9Plus is the hints strategy [23], in which a generated clause can be given a higher priority for selection if it matches (e.g., subsumes) a user-supplied hint clause. For the CASC-J8 LTB division, clauses that were in the proofs in the training set, and clauses from all proofs found thus far, were used as hints. The most significant enhancements in Prover9Plus over Prover9 concerns the handling of clauses that match hints. For example, hints can be prioritized, and subsumption relationships between clauses that match hints are now accounted for. As well as being effective in CASC-J8, the hints strategy has played a substantial role in proving several open problems, e.g., [7].

5. Conclusion

CASC-J8 was the twenty-first large scale competition for fully automatic, classical logic ATP systems. CASC-J8 fulfilled its objectives by evaluating the relative abilities of current ATP systems, and stimulating development and interest in ATP.

The highlights of CASC-J8 were the emergence of Vampire 4.1 in the TFA division, iProver regaining its EPR crown, and Prover9Plus’ successful use of learning in the LTB division. The “lowlight” was the surprisingly large number of problems solved without solution output – 29 in the THF division (ignoring Isabelle), 71 TFA, 12 in FOF, and 4 in FNT. Solution output is required for many applications of ATP, and the requirement for solution output in CASC has helped motivate developers to implement it in their systems, e.g., between CASC-25 and CASC-J8 Beagle and Princess have had proof output added. An appropriate next step might be to motivate solution output in the standard TPTP format.

Over the years the numbers of problems entered into CASC have fluctuated, and the 21 systems entered into CASC-J8 is the lowest it has been since CASC-21 in 2007, when 20 systems were entered. The highest number entered in the last 10 years was 39, in CASC-J5 in 2010. These fluctuations are affected by funding, interest in classical ATP, and the nature of the host conference (CADE vs. IJCAR vs. FLoC). Another factor in the current downward trend might be the repeated domination by a few highly mature and tuned systems, making it “impossible” for newer systems to win. The organizer hopes to return to an upward swing with the introduction of new divisions, new awards for systems that demonstrate unique capabilities (even if they do not solve the most problems), and continuing to emphasize the aims of the competition beyond evaluation (see Section 1).

While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Some changes that are being considered for CASC-26 are:

The introduction of a TH1 [5] division, with arithmetic.

Requiring TPTP format solutions for ranking by number of solutions output.

Awarding a trophy for best wall clock efficiency in the FOF and FNT divisions, thus encouraging use of the multiple cores.

Placing the previous year’s winners into the demonstration division, to ensure that one of the new systems wins.

As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.

References

Gauthier and

Kaliszyk, Premise selection and external provers for HOL4, in: Proceedings of the 4th ACM SIGPLAN Conference on Certified Programs and Proofs,

Leroy and

Tiu, eds, ACM Press, 2015, pp. 49–57.

Hales, A revision of the proof of the Kepler conjecture, Discrete and Computational Geometry 44(1) (2010), 1–34. doi:10.1007/s00454-009-9148-4.

Hoder and

Voronkov, Sine qua non for large theory reasoning, in: Proceedings of the 23rd International Conference on Automated Deduction,

Sofronie-Stokkermans and

Bjœrner, eds, Lecture Notes in Artificial Intelligence, Vol. 6803, Springer-Verlag, 2011, pp. 299–314.

Harrison, HOL light: An overview, in: Proceedings of the 22nd International Conference on Theorem Proving in Higher Order Logics,

Nipkow and

Urban, eds, Lecture Notes in Computer Science, Vol. 5674, Springer-Verlag, 2009, pp. 60–66.

Kaliszyk,

Sutcliffe and

Rabe, TH1: The TPTP typed higher-order form with rank-1 polymorphism, in: Proceedings of the 5th Workshop on the Practical Aspects of Automated Reasoning,

Fontaine,

Schulz and

Urban, eds, CEUR Workshop Proceedings, Vol. 1653, 2016, pp. 41–55.

Khasidashvili and

Korovin, Predicate elimination for preprocessing in first-order theorem proving, in: Proceedings of the 19th International Conference on Theory and Applications of Satisfiability Testing,

Creignou and

Le Berre, eds, Lecture Notes in Computer Science, Vol. 9710, Springer-Verlag, 2016, pp. 361–372.

Kinyon,

Veroff and

Vojtechovsky, Loops with Abelian inner mapping groups: An application of automated deduction, in: Automated Reasoning and Mathematics: Essays in Memory of William W. McCune,

M.P.

Bonacina and

Stickel, eds, Lecture Notes in Artificial Intelligence, Vol. 7788, Springer-Verlag, 2013, pp. 151–164. doi:10.1007/978-3-642-36675-8_8.

Kovacs and

Voronkov, First-order theorem proving and vampire, in: Proceedings of the 25th International Conference on Computer Aided Verification,

Sharygina and

Veith, eds, Lecture Notes in Artificial Intelligence, Vol. 8044, Springer-Verlag, 2013, pp. 1–35. doi:10.1007/978-3-642-39799-8_1.

Kumar,

Myreen,

Norrish and

Owens, CakeML: A verified implementation of ML, in: Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,

Sewell, ed., ACM Press, 2014, pp. 179–191. doi:10.1145/2535838.2535841.

10.

W.W.

McCune, Prover9, http://www.cs.unm.edu/~mccune/prover9/.

11.

Reger,

Bjørner,

Suda and

Voronkov, AVATAR modulo theories, in: Proceedings of the 2nd Global Conference on Artificial Intelligence,

Benzmüller,

Sutcliffe and

Rojas, eds, EPiC Series in Computing, Vol. 41, EasyChair Publications, 2016, pp. 39–52.

12.

Reger,

Suda and

Voronkov, Finding finite models in multi-sorted first order logic, in: Proceedings of the 19th International Conference on Theory and Applications of Satisfiability Testing,

Creignou and

Le Berre, eds, Lecture Notes in Computer Science, Springer-Verlag, 2016.

13.

Slind and

Norrish, A brief overview of HOL4, in: Proceedings of the 21st International Conference on Theorem Proving in Higher Order Logics,

Mohamed,

Munoz and

Tahar, eds, Lecture Notes in Computer Science, Vol. 5170, Springer-Verlag, 2008, pp. 28–32. doi:10.1007/978-3-540-71067-7_6.

14.

Steen,

Wisniewski and

Benzmüller, Agent-based HOL reasoning, in: Proceedings of the 5th International Congress on Mathematical Software,

G.-M.

Greuel,

Koch,

Paule and

Sommese, eds, Lecture Notes in Computer Science, Vol. 9725, Springer-Verlag, 2016, pp. 75–81.

15.

Stump,

Sutcliffe and

Tinelli, StarExec: A cross-community infrastructure for logic solving, in: Proceedings of the 7th International Joint Conference on Automated Reasoning,

Demri,

Kapur and

Weidenbach, eds, Lecture Notes in Artificial Intelligence, Vol. 8562, 2014, pp. 367–373.

16.

Sutcliffe, The CADE-16 ATP system competition, Journal of Automated Reasoning 24(3) (2000), 371–396. doi:10.1023/A:1006393501098.

17.

Sutcliffe, The TPTP problem library and associated infrastructure. The FOF and CNF parts, v3.5.0, Journal of Automated Reasoning 43(4) (2009), 337–362. doi:10.1007/s10817-009-9143-8.

18.

Sutcliffe, in: Proceedings of the 8th IJCAR ATP System Competition, Coimbra, Portugal, 2016, http://www.tptp.org/CASC/J8/Proceedings.pdf .

19.

Sutcliffe, The CADE ATP system competition – CASC, AI Magazine 37(2) (2016), 99–101.

20.

Sutcliffe and

Schulz, The thousands of models for theorem provers (TMTP) model library – First steps, in: Proceedings of the 11th International Workshop on the Implementation of Logics,

Konev,

Schulz and

Simon, eds, EPiC Series in Computing, Vol. 40, EasyChair Publications, 2016, pp. 106–121.

21.

Sutcliffe and

C.B.

Suttner, Evaluating general purpose automated theorem proving systems, Artificial Intelligence 131(1–2) (2001), 39–54. doi:10.1016/S0004-3702(01)00113-8.

22.

Sutcliffe and

Urban, The CADE-25 automated theorem proving system competition – CASC-25, AI Communications 29(3) (2016), 423–433. doi:10.3233/AIC-150691.

23.

Veroff, Using hints to increase the effectiveness of an automated reasoning program: Case studies, Journal of Automated Reasoning 16(3) (1996), 223–239. doi:10.1007/BF00252178.

24.

Wisniewski,

Steen and

Benzmüller, LeoPARD – A generic platform for the implementation of higher-order reasoners, in: Proceedings of the International Conference on Intelligent Computer Mathematics,

Kerber,

Carette,

Kaliszyk,

Rabe and

Sorge, eds, Lecture Notes in Computer Science, Vol. 9150, Springer-Verlag, 2015, pp. 325–330.

25.

Wisniewski,

Steen,

Kern and

Benzmüller, Effective normalization techniques for HOL, in: Proceedings of the 8th International Joint Conference on Automated Reasoning,

Olivetti and

Tiwari, eds, Lecture Notes in Artificial Intelligence, Vol. 9706, 2016, pp. 362–370.

Division:	THF		TFA			TFN	FOF		FNT		EPR		LTB

Category:	TNE	TEQ	TFI	TFR	TFE		FNE	FEQ	FNN	FNQ	EPT	EPS	AIM	CML	HLL
Eligible	127	616	336	87	125	58	349	3199	236	156	325	114	1155	10,227	7150
Usable	127	616	336	57	125	58	246	828	220	156	325	107	–	–	–
New	0	0	6	0	0	7	0	33	0	8	4	10	–	–	–
Used	100	400	300	75	125	50	200	300	200	100	200	100	200	200	200
New	0	0	6	0	0	7	0	33	0	8	4	10	–	–	–

The 8th IJCAR automated theorem proving system competition – CASC-J8

Abstract

Keywords

1. Introduction

1 In 2016 CADE was a constituent of the 8th International Joint Conference on Automated Reasoning, hence “J8” for “8th Joint” conference.

2 Thanks to Aaron Stump for getting that done for CASC!

2.2. Problems

2.2.1. Non-LTB problems

3 http://www.tptp.org/CASC/J8/AIM2175.tgz.

3. Results

3.1. The THF division

Table 6 TFN division results ATP system TFN/50 Avg CPU Mdls out SOTA con. μ eff. Core usage New/7 Beagle 0.9.47 10 3.1 × 0.55 58 2.83 0 CVC4 1.5.1 9 0.0 ✓ 0.50 180 0.75 2 Princess 150706 8 1.5 × 0.69 140 1.88 0 CVC4 1.5 8 22.9 × 0.44 101 0.92 1

3.7. The LTB division

4. System descriptions

5. Conclusion

References

¹
In 2016 CADE was a constituent of the 8th International Joint Conference on Automated Reasoning, hence “J8” for “8th Joint” conference.

²
Thanks to Aaron Stump for getting that done for CASC!

³
http://www.tptp.org/CASC/J8/AIM2175.tgz.

Table 6
TFN division results

ATP system TFN/50 Avg CPU Mdls out SOTA con. μ eff. Core usage New/7

Beagle 0.9.47 10 3.1 × 0.55 58 2.83 0

CVC4 1.5.1 9 0.0 ✓ 0.50 180 0.75 2

Princess 150706 8 1.5 × 0.69 140 1.88 0

CVC4 1.5 8 22.9 × 0.44 101 0.92 1