Abstract
The CADE ATP System Competition (CASC) is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems. CASC-J8 was the twenty-first competition in the CASC series. Twenty-one ATP systems and system variants competed in the various competition divisions. An outline of the competition design, and a commentated summary of the results, are presented.
Introduction
The CADE ATP System Competition (CASC) [19] is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, to motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, to provide an inspiring environment for personal interaction between ATP researchers, and to expose ATP systems within and beyond the ATP community. Fulfillment of these objectives provides insight and stimulus for the development of more powerful ATP systems, leading to increased and more effective use. CASC-J8 was held on 29th June 2016 in Coimbra, Portugal, as part of the 8th International Joint Conference on Automated Reasoning.1
In 2016 CADE was a constituent of the 8th International Joint Conference on Automated Reasoning, hence “J8” for “8th Joint” conference.
CASC is divided into divisions according to problem and system characteristics. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked. (The demonstration division is for systems that cannot be entered into the competition divisions for any reason, e.g., the system runs on specialist hardware, or the entrant is a competition organizer or panel member.) Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in the division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions – the entry specifies which competition divisions’ problems are to be used. Table 1 catalogs the divisions and problem categories of CASC-J8.
Divisions and problem categories
Twenty-one ATP systems and system variants, listed in Table 2, competed in the various divisions. The division winners of CASC-25 (the previous CASC) were automatically entered to provide benchmarks against which progress can be judged. Additionally, Prover9 2009-11A is entered into the FOF division each year, as a fixed point against which progress can be judged. Prover9Plus ran in the demonstration division because the developer supplied some of the LTB divisions’ problems. System descriptions of the systems are in [18] and on the CASC-J8 web site.
The ATP systems and entrants
CASC-J8 was organized by Geoff Sutcliffe, and overseen by a panel consisting of Pascal Fontaine, Aart Middeldorp, and Christoph Wernhard. The competition was run on computers provided by the StarExec project [15] at the University of Iowa.
This paper is organized as follows: Section 2 outlines the design and organization of the competition. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 concludes and discusses plans for future CASCs.
The design and organization of CASC has evolved over the years to a sophisticated state. An outline of the CASC-J8 design and organization is provided here. The details are in [18] and on the CASC-J8 web site. Important changes for CASC-J8 were:
The Typed Higher-order Non-theorem (THN) division was put into a hiatus state, due to the limited number and diversity of systems for this type of problem [22]. The TFN division had no problem categories, due to the very small number of TFN problems using only rational or real arithmetic. This dearth of problems was noted in the CASC-25 report [22], but the hope that more suitable problems would be contributed to the TPTP was not fulfilled. The LTB division was run on StarExec. In CASC-25 the LTB division was run on a cluster at the University of Manchester, because StarExec did not have a facility to download partial results from a job, as is required to display results as the ATP systems work their way through the problems in a batch. The StarExec developers added that facility after CASC-25, so that the LTB division could run on StarExec.2 Thanks to Aaron Stump for getting that done for CASC!
The ATP systems were delivered to the competition organizer as StarExec installation packages, which the organizer installed and tested on StarExec. Source code was delivered separately for archiving on the competition web site.
The ATP systems entered into CASC are required to be fully automatic. They are executed as black boxes, on one problem (non-LTB) or one problem batch (LTB) at a time. Any command line parameters have to be the same for all problems/batches in each division. The systems are required to be sound, and are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, EPR, and LTB divisions, and theorems to the systems in the TFN, FNT, and EPR divisions. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. For the third year in a row, no systems failed this soundness testing.
The THF, TFA, FOF, FNT, and LTB divisions were ranked according to the number of problems solved with an acceptable proof/model output. The TFN and EPR divisions were ranked according to the number of problems solved, but not necessarily accompanied by a proof or model (but systems that do output proofs/models are highlighted in the presentation of results). Ties were broken according to the average time taken over problems solved, CPU time for the non-LTB divisions, wall clock time for the LTB division. Trophies were awarded to the division winners.
In addition to the ranking criteria, three other measures were made and are presented in the results: The state-of-the-art (SoTA) contribution quantifies the unique abilities of each system. For each problem solved by a system, its SoTA contribution for the problem is the reciprocal of the number of systems that solved the problem, so that if a system is the only one to solve a problem then its SoTA contribution for the problem is 1.00, and if all the systems solve a problem their SoTA contribution for the problem is the inverse of the number of systems. A system’s overall SoTA contribution is its average SoTA contribution over the problems it solved. The efficiency combines the time taken for each problem solved with the number of problems solved. It is the average solution rate over the problems solved (the solution rate for one problem is the inverse of the time taken to solve it), multiplied by the fraction of problems solved. Efficiency is computed for CPU time to measure how efficiently the systems use one core, and additionally for wall clock time in the LTB division (which has a wall clock time limit) to measure how efficiently the systems use the multiple cores that are available. The core usage directly measures the extent to which the systems use the multiple cores that are available. It is the average ratio of CPU time to wall clock time taken, over the problems solved. The competition ran on quad-core computers, thus the maximal core usage was 4.0.
Problems
Non-LTB problems
The problems for the non-LTB divisions were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [17], v6.4.0. The TPTP version used for CASC is not released until after the competition has started, so that new problems in the release have not been seen by the entrants. The problems have to meet certain criteria to be eligible for selection, as follows:
The TPTP tags problems that are designed specifically to be suited or ill-suited to some ATP system, calculus, or control strategy as biased. Biased problems are excluded from the competition.
The problems must be syntactically non-propositional.
The TPTP uses system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library to compute problem difficulty ratings in the range 0.00 (easy) to 1.00 (unsolved) [21]. Systems can be submitted before the competition so that their performance data is used for computing the problem ratings. Difficult problems with ratings in the range 0.21 to 0.99 are normally eligible for CASC. Problems of lesser and greater ratings may also be eligible in some divisions if there are not enough problems with the desired ratings with respect to the resources available. For CASC-J8 there were adequate resources to admit all problems with ratings greater than 0.00 in the TFA, TFN, FNT, and EPR divisions. There were 229 problems with rating less than 0.21 in the TFA division, none in TFN or FNT, and 82 in EPR. There were 46 problems with rating 1.00 in the TFA division, 22 in TFN, 48 in FNT, and 40 in EPR. The easier problems lead to higher solution rates, while the use of problems with rating 1.00 is interesting if the new (versions of) systems in the competition solve some of them.
The selection is constrained so that no division or category contains an excessive number of “very similar” problems [16]. In TPTP v6.4.0 there are very few TFR problems that are not very similar, and therefore 18 problems that are very similar to others had to be selected.
The selection is biased to select problems that are new in the TPTP version used, until 50% of the problems in each problem category have been selected, after which random selection (from old and new problems) continues. The number of new problems used depends on how many new problems are eligible and the limitation on very similar problems.
In order to ensure that no system receives an advantage or disadvantage due to the specific presentation of the problems in the TPTP, the problems are preprocessed to strip out all comment lines (in particular, the problem header), randomly reorder the formulae/clauses (
The numbers of problems used in each division and problem category are roughly proportional to the numbers of eligible problems, subject to the limitation on the number of very similar problems. The problems used are randomly selected from the eligible problems based on a seed supplied by the competition panel. The problems are given to the ATP systems in TPTP format, in increasing order of TPTP difficulty rating.
Table 3 gives the numbers of eligible problems, the maximal numbers that could have been used after taking into account the limitation on very similar problems (in the context of the numbers of problems used), and the numbers of problems used, in each division and category. There were very few new problems in CASC-J8, due to limited growth of the TPTP in the preceding year.
Numbers of eligible and used problems
Numbers of eligible and used problems
The problems for the LTB division were taken from publicly available problem sets. For CASC-J8: the AIM problem category used the AIM2175 problem set,3
The AIM2175 problem set contains 1155 test problems, and 1020 training problems with one solution each. For the competition, 200 problems were randomly chosen from the 1155. All the training problems and their solutions were given in the training set.
The CML10227 problem set contains 10,227 problems. Proofs were found by Vampire 4.0 for 2017 of them by careful preselection of axioms. The 2017 proofs were used to construct “easy problems” with only the axioms used in the proofs, which were attempted in non-batch mode by pre-CASC-J8 versions of E, iProver, and Vampire, with a 120s CPU time limit. Only 440 of these easy problems were solved by any of the three systems. One hundred of the 440 solved problems and 100 of the 1577 unsolved problems were randomly selected to be used in the competition, but with their full original axiomatizations. For the training set, 1000 problems (and their Vampire 4.0 solutions) were selected from the remaining 1817 unselected problems.
The HH7150 problem set contains 7150 problems. This problem set was also used in CASC-25, and same criteria were used for problem selection: 140 problems that were solvable in 60s by at least one of the CASC-25 non-LTB versions of the systems, and 60 problems that were not solvable in 60s. For the training set, 1000 problems (and their solutions) were selected from those that were solvable in 60s but not selected for the competition.
The computers had four Intel(R) Xeon(R) E5-2609, 2.40GHz CPUs (a quad-core chip), 128GB memory, and ran the Red Hat Enterprise Linux Workstation release 6.3 (Santiago) operating system, kernel 2.6.32-431.1.2.el6.x86_64.
In the non-LTB divisions a 360s CPU time limit was imposed for each problem. A wall clock time limit of 720s was also imposed to limit very high memory usage that causes swapping. In the LTB division a 180s wall clock time limit was imposed for each problem, and a 36,000s overall wall clock time limit (200 problems times 180s) for each problem category. Time spent before starting the first problem in a batch, e.g., learning from the training set and pre-loading the common axioms, and times spent between ending a problem and starting the next, e.g., learning from previous proofs, were not part of the time taken on problems. However, time taken on such tasks was part of the overall time taken for the batch.
Demonstration division systems can run on the competition computers, or the computer(s) can be supplied by the entrant. The CASC-J8 demonstration division system used the competition computers.
Results
For each ATP system, for each problem, four items of data were recorded: whether or not the problem was solved, the CPU time taken, the wall clock time taken, and whether or not a proof or model was output. This section summarizes the results, and provides commentary. The result tables below give the number of problems solved in the division, the average time over problems solved, whether or not proofs or models were output, the state-of-the-art contribution, the (micro-)efficiency, the core usage, the number of new problems solved, and the number of problems solved in each problem category. In each of the results tables, the CASC-25 winner is emphasized. Detailed results, including the systems’ output files, are available from the CASC-J8 web site.
The THF division
Table 4 summarizes the results of the THF division. Satallax 3.0 improved over Satallax 2.8, which won the division in CASC-25, by using the E prover as a subsystem (the E prover was available in Satallax 2.8, but not used because E’s proofs had not been integrated into Satallax’s proof output). If Isabelle had output proofs it would have won the division, and the developer has promised that proof output will be implemented for CASC-26 next year. It is also noteworthy that Isabelle solved 316 problems (89% of the problems it solved) within 120s, mostly using the
THF division results
THF division results
The top four systems all failed to output proofs for some problems. In most cases the task of producing and printing a proof took more time than was available before the time limit was reached. For Satallax 3.0 this problem was exacerbated by the fact that only the time until end of the current strategy time slice was made available for proof production (rather than all the time until the time limit).
The SoTA contributions are lower for Leo-III and Leo+III, indicating that they have no distinguishing capabilities. The top three systems had significantly higher efficiencies, reflecting the fact that they solved significantly more problems with lower average CPU times. The high average CPU time of Isabelle, as was noted in the CASC-25 report, is due to a long start-up time and a strategy schedule that is not suited to the TPTP problems. The core usage of all except the Leo systems is about 1.00. The Leos’ good core usage is due to their implementation in Scala, in which garbage collection and JIT compilation is done in parallel with the main thread.
The individual problem results show that 37 problems were unsolved, 32 problems were solved by all the systems, and 66 problems were solved by only one system. Of the 66 unique solutions, 43 were by Isabelle, 14 by LEO-II, and 5 by Satallax 3.0, and 4 by Satallax 2.8. The unique solutions indicate that a portfolio approach would be effective for these THF problems. Giving Isabelle, LEO-II, and Satallax 3.0 120s each would solve 442 problems.
Table 5 summarizes the results of the TFA division. The winner, Vampire 4.1, is a newcomer to the TFA division, and a brief system description of Vampire 4.1 focusing on the features for the TFA division is provided in Section 4. It dominated the division, solving the most problems with the highest SOTA contribution and highest efficiency, and outperforming the CASC-25 winner VampireZ3. The performances in the integer, rational, and real problem categories align with the overall results except for Princess’s stronger performance on integer problems. This is because Princess has a native theory solver for integers (and various other heuristics, e.g., for non-linear problems), but just uses first-order axioms for rationals and reals – rationals and reals have not yet been a focus of the developer’s work.
TFA division results
TFA division results
This is the second CASC in which the ranking in the TFA division has been by number of proofs output. In CASC-25 neither CVC4 1.4 nor Princess 150706 output proofs, and were thus in tie last place. It is pleasing to see that both these systems now output proofs, although Princess did not output proofs in 71 cases. This is because Princess has two procedures, the native module for non-linear integer reasoning, and the procedure for constructing free-variable tableaux, that do not yet log proof information.
Beagle and Princess had notably lower efficiencies than the other three systems, largely due to their higher CPU times (partly caused by the startup time required for the Java Virtual Machine). In contrast, Beagle and Princess had the best core usage, and did better on the six new problems. Their good core usage is (as for the Leos in THF) due to their implementation in Scala. Neither system does any explicit parallel activity across cores, e.g., strategy parallelism (Princess does do strategy parallelism, but internally using its own scheduler). The developers believe that strategy parallelism that takes advantage of multiple cores would be simple to add – a challenge for CASC-26!
As was noted in Section 2.2, 46 problems of rating 1.00 were used in the TFA division. Thirteen of them were solved in the competition, indicating progress since the problem ratings were done. Twenty-five of these 46 problems are hardware verification problems that have a small reliance on arithmetic manipulation and a large combinatorial search space. As such these problems favour systems with strong search features rather than systems that are focused on strong arithmetic capabilities. Only 3 of them were solved, and only by Vampire.
The individual problem results show that 36 problems were unsolved, 235 problems were solved by all the systems, and 43 problems were solved by only one system. Of the 43 unique solutions, 26 were by Vampire, 15 by CVC4, 2 by Beagle, and 2 by Princess. These numbers of unique solutions are reflected in the SOTA contributions. The best portfolio would be to give 90s to each of Vampire, VampireZ3, CVC4, and Beagle, which would solve 459 problems. But noting that it is unlikely that there will be further development of VampireZ3 (it was a “one off” development for CASC-25, and is now replaced by pure Vampire), it is worthy considering a portfolio of just Vampire and CVC4 – giving 250s to Vampire and 110s to CVC4 would solve 455 problems.
As was noted in Section 2.2, 229 problems with rating less than 0.21 were used in the TFA division. This is the reason why so many problems were solved by all systems, particularly in the TFR problem category where 71 of the 75 problems had rating less than 0.21.
Table 6 summarizes the results of the TFN division. The winner, Beagle, was declared to be the winner only several days after the competition. On the day of CASC-J8 CVC4 1.5 (the CASC-25 winner) solved 15 problems and was declared the winner. However, the following day the CVC4 developer reminded the organizer that the CASC-25 version of CVC4 1.5 is unsound in some corner cases handling non-linear arithmetic, and could incorrectly report models (this bug did not manifest itself in CASC-25). The developer had provided a fixed version of CVC4 1.5, but the CASC organizer forgot to install it for CASC-J8 (whoops!). The fixed version was then installed and run on the division. It solved only 8 problems, thus relegating it to last place, and making Beagle the winner. At no stage was there any intent to deceive, and the old CVC4 1.5 did pass the CASC-J8 soundness testing. The CVC4 developer receives an honourable mention for his report, and says he will win the division next year anyway, by developing the non-linear integer arithmetic capabilities in CVC4. CVC4 1.5.1 was the only system to output models, and this capability is important in applications.
TFN division results
TFN division results
The various performance measures do not correspond with the numbers of problems solved: Beagle has the highest core usage, but has the lowest efficiency. Princess has the highest SOTA contribution, and CVC4 1.5.1 has the lowest average CPU time and highest efficiency. None of the systems performed well on the 7 new problems. Six of the problems are axiomatizations of some simple data structures, e.g., integer arrays, of which only 2 were solved. The other is a variant of the Collatz conjecture, which of course could not be solved.
As was noted in Section 2.2, 22 problems of rating 1.00 were used in the TFN division – almost half of the division. Nine are hardware verification problems like the 25 rating 1.00 problems in the TFA division, 8 are software verification problems that are considered to be relatively simple benchmarks for infinite state model checkers, 4 are new data structure problems as mentioned above, 2 are arithmetic tests, and the last is the variant of the Collatz conjecture. Only one of them – one of the arithmetic tests – was solved, and only by Beagle.
The individual problem results show that 31 problems were unsolved, no problems were solved by all the systems, and 6 problems were solved by only one system. Of the 6 unique solutions, 3 were by Princess, 2 by Beagle, and 1 by CVC4 1.5.1. A portfolio of Beagle, CVC4 1.5.1, and Princess, giving them 120s each, would solve 19 problems …quite a remarkable improvement over the individual systems.
As was the case in CASC-25, the deficiency of this division was the small number of eligible problems – despite some efforts only a few new eligible problems were added to the TPTP in the last year. Establishing non-theoremhood in the context of arithmetic is a naturally difficult task, but also an important task for applications of ATP. The TFN division will continue to challenge the ATP community, and hopefully over time the ATP systems will become more effective.
Table 7 summarizes the results of the FOF division. The CASC-25 winner, Vampire 4.0, beat the new Vampire 4.1. Vampire 4.1 used a new core theorem prover, but with Vampire 4.0’s strategy schedules. This mismatch lead to a weaker performance than Vampire 4.0, where the strategy schedule was calibrated to the core theorem prover. (This is another reminder that proof search can be fragile.) However, Vampire 4.1 had a lower average CPU time than Vampire 4.0.
Four of the 8 systems failed to output proofs for some of the problems solved. For Vampire 4.0, as was noted in the CASC-25 report [22], some proofs were not output due to a bug, which has been fixed in Vampire 4.1. For CVC4, iProver, and Prover9, after theoremhood was established, the task of producing and printing a proof took more time than was available before the time limit was reached. For example, for
The problem category results align well with the overall results, with only iProver having its known preference for problems with equality. An examination of the FEQ performance plots at http://www.tptp.org/CASC/J8/WWWFiles/ResultsPlots.html#FEQProblems shows that E has a noticeable performance inflection just past 180s, when it switches from its first strategy that was given 180s (half the time limit) to it’s second strategy that is given 90s (a quarter of the time limit). There is another inflection around 270s when it switches to its third strategy. E’s strategy scheduling is clearly effective, and could be more time efficient if E took advantage of the multiple cores available to run the strategies in parallel (rather than sequentially on one core – it’s core usage is only 0.78).
FOF division results
FOF division results
FNT division results
The SOTA contributions were similar for the best performing systems. The efficiencies of the Vampires are notably high, due to the numbers of problems solved with low average CPU times. None of the systems took advantage of the multiple cores, with all of the core usages below 1.00. The systems’ abilities to solve the new problems aligned well with the overall performance, suggesting that the systems are not over tuned to existing TPTP problems.
The individual problem results show that 17 problems were unsolved, 4 problems were solved by all the systems, and 17 problems were solved by only one system. Of the 17 unique solutions, 9 were by E, and 2 by each of Vampire 4.0, CVC4, Prover9, and Geo-III. Several interesting possible portfolios emerge. An obvious one is Vampire 4.0 with E – giving them 180s each would solve 465 problems. Less obvious is the combination of Vampire 4.1, E, and iProver – giving them 120s each would solve 468 problems. Least obvious is the combination of the two Vampires, E, CVC4, and Geo-III, giving them 72s each would solve 469 problems.
Table 8 summarizes the results of the FNT division. The new Vampire 4.1 easily outperformed the CASC-25 winner Vampire 4.0, mainly due to improvements in its finite model finding techniques [12]. The biggest improvement is an approach that promotes inferred (monotonic) subsorts to proper sorts, and treats the problem as a multi-sorted one, allowing for a more fine-grained search for a finite model. Both the Vampires solve a couple of problems without producing a model, because of an omission in the code for cases when satisfiability is shown by a combination of definition inlining and pure predicate removal – this will be fixed for CASC-26. There is a clear gap between the top three systems and the others, although the fourth placed Nitpick did quite well for a system that is designed for model finding in higher-order logic. The problem category rankings align with the overall ranking except for E’s preference for problems with equality.
The Vampires have the highest SOTA contributions, but iProver has the highest efficiency due to it’s lower average CPU time. Nitpick and Refute have significantly lower efficiencies, because they seldom solve problems very quickly. This is due to the high start-up time of these systems – around 15s for Nitpick and around 6s for Refute. iProver is the only system to make good use of the multiple cores, as reflected in the core usages. iProver ran several strategies in parallel in the FNT division, because of the diversity of problems – some can be solved quickly by finite model finding, while some require other methods such as instantiation and resolution. None of the problems were able to solve more than one of the 8 new problems. All of the new problems are to show the satisfiability of axiomatizations, created during the development of the new Thousands of Models for Theorem Provers (TMTP) model library [20]. The axiomatizations are the bases for theorems in the TPTP, so demonstrating their satisfiability is important for ensuring that proofs for the theorems are not due to contradictory axioms.
EPR division results
EPR division results
LTB division results
As was noted in Section 2.2, 48 problems of rating 1.00 were used in the FNT division. These include 7 of the 8 new problems, and 23 of which are problems to establish satisfiability of axiomatizations. None of the 48 problems were solved.
The individual problem results show that 48 problems were unsolved, 17 problems were solved by all the systems, and 6 problems were solved by only one system. Of the 6 unique solutions, 4 were by Vampire 4.1, and 2 by Vampire 4.0. A portfolio approach does not help.
Table 9 summarizes the results of the EPR division. iProver had won the EPR division from CASC-J4 in 2008 to CASC-J7 in 2014, but was dethroned by Vampire 4.0 in CASC-25. iProver has now won the division again. The improved performance of iProver is due mainly to improved preprocessing, in particular predicate elimination [6]. The new Vampire 4.1 also beat Vampire 4.0 by virtue of a lower average CPU time. The problem category rankings align with the overall ranking.
The SOTA contributions and efficiencies are unremarkable, and correspond to the overall performances. None of the systems took advantage of the multiple cores, with all of the core usages below 1.00. The performance on the 14 new problems was weak. Four of the new problems were in the EPT problem category, 2 of which were solved by some system. Ten of the new problems were in the EPS problem category, all to show the satisfiability of a TPTP axiomatization, of which 4 were solved by some system.
As was noted in Section 2.2, 40 problems of rating 1.00 were used in the EPR division, 24 in the EPT problem category and 16 in the EPS problem category. Three of the problems were solved, indicating progress since the problem ratings were done. One of 4 new EPT problems had rating 1.00, and 6 of the 10 new EPS problems had rating 1.00, which partly explains the poor performance on the new problems.
The individual problem results show that 53 problems were unsolved, 5 problems were solved by all the systems, and 13 problems were solved by only one system. Of the 13 unique solutions, 10 were by iProver, 2 by E, and 1 by Vampire 4.0. Despite Vampire’s singular unique solution, it would contribute well to a portfolio with iProver – giving iProver and Vampire 4.0 180s each would solve 241 problems.
The LTB division
Table 10 summarizes the results of the LTB division. The Vampires dominated the division, as they did in CASC-25, with the CASC-25 winner, Vampire 4.0, winning again. The rankings in the three problem categories align with the overall performances for the competition division systems. In the demonstration division, Prover9Plus clearly outperformed the competition division systems in the AIM problem category. Prover9Plus was in the demonstration division because the developer provided the AIM problems, and it attempted only the AIM category’s problems. Note however that Prover9Plus was not tuned to the AIM problems – the settings used are generic settings that the developer uses when starting work on new problems. The strong performance is largely attributed to its use of learning, as explained below.
The competition division systems have similar SOTA contributions, indicating no particular specialization, but Prover9Plus does show special capabilities in the AIM problem category. The Vampires had the highest efficiencies, with respect to both CPU time and wall clock time, with Vampire 4.1 being the best. All the systems except Prover9Plus used multiple cores, ranging from the Vampires’ use of 2 cores to iProver’s use of all 4 cores.
The design of the LTB division makes it possible (and presumably advantageous) to load and preprocess the common core axioms of a batch before starting to solve the problems. The systems took advantage of this possibility to various extents: iProver and Prover9Plus did not preload the common axioms, the Vampires loaded the common axioms into memory but did nothing more because axiom processing is strategy dependent, and E loaded the common axioms into memory and indexed them for its SiNE-style [3] axiom selection technique.
The design of the LTB division facilitates the use of learning from existing problems and their proofs, and from proofs of the problems in the competition. Prover9Plus used the proofs in the training data and the proofs of solved problems to learn “hints” that help guide its search. A later experiment showed that Prover9Plus would have solved only 40 problems if this technique had not been used. As a learning system Prover9Plus is interesting, and a brief system description is provided in Section 4. Vampire L-4.1 uses previous proofs to learn which of its strategies are being most successful, and moves the more successful strategies to earlier in it’s schedule. The Vampire developer reported that the learning did not work as expected.
The individual problem results show that 143 problems were unsolved (109 AIM, 1 CML, 33 HLL), 268 problems were solved by all the systems (13 AIM, 187 CML, 68 HLL), and 54 problems were solved by only one system (43 AIM, 0 CML, 11 HLL). Of the 54 unique solutions, 44 were by Prover9Plus, 6 by Vampire 4.0, 3 by E, 2 by iProver, and 1 by Vampire L-4.1. There is not much to be gained from a portfolio approach in any of the problem categories.
System descriptions
Three systems were of particular interest in CASC-J8: Leo-III 1.0 because it’s a new system with an interesting architecture, Vampire 4.1 because of its improved performance in the TFA division, and Prover9Plus because of it’s successful use of learning in the LTB division. The following brief descriptions of these systems were written by their developers.
Conclusion
CASC-J8 was the twenty-first large scale competition for fully automatic, classical logic ATP systems. CASC-J8 fulfilled its objectives by evaluating the relative abilities of current ATP systems, and stimulating development and interest in ATP.
The highlights of CASC-J8 were the emergence of Vampire 4.1 in the TFA division, iProver regaining its EPR crown, and Prover9Plus’ successful use of learning in the LTB division. The “lowlight” was the surprisingly large number of problems solved without solution output – 29 in the THF division (ignoring Isabelle), 71 TFA, 12 in FOF, and 4 in FNT. Solution output is required for many applications of ATP, and the requirement for solution output in CASC has helped motivate developers to implement it in their systems, e.g., between CASC-25 and CASC-J8 Beagle and Princess have had proof output added. An appropriate next step might be to motivate solution output in the standard TPTP format.
Over the years the numbers of problems entered into CASC have fluctuated, and the 21 systems entered into CASC-J8 is the lowest it has been since CASC-21 in 2007, when 20 systems were entered. The highest number entered in the last 10 years was 39, in CASC-J5 in 2010. These fluctuations are affected by funding, interest in classical ATP, and the nature of the host conference (CADE vs. IJCAR vs. FLoC). Another factor in the current downward trend might be the repeated domination by a few highly mature and tuned systems, making it “impossible” for newer systems to win. The organizer hopes to return to an upward swing with the introduction of new divisions, new awards for systems that demonstrate unique capabilities (even if they do not solve the most problems), and continuing to emphasize the aims of the competition beyond evaluation (see Section 1).
While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Some changes that are being considered for CASC-26 are:
The introduction of a TH1 [5] division, with arithmetic.
Requiring TPTP format solutions for ranking by number of solutions output.
Awarding a trophy for best wall clock efficiency in the FOF and FNT divisions, thus encouraging use of the multiple cores.
Placing the previous year’s winners into the demonstration division, to ensure that one of the new systems wins.
As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.
