Libraries for Vicare Scheme: srfi compare-procedures design

In this section we present our reasoning behind the design decisions made for this SRFI. We would like to be explicit on this because we believe that design is not about the outcome of decisions but about the alternatives considered. The section is organized as a Q&A list.

Order predicates (2-way) or 3-way comparisons?

It is mathematical tradition to specify a total order in terms of a “less or equal” (<=) relation. This usually carries over to programming languages in the form of a <= predicate procedure.

However, there are inherently three possible relations between two elements x and y with respect to a total order: x < y, x = y, and x > y. (With respect to a partial order there is a fourth: x and y are uncomparable.) This implies that any mechanism based on 2-valued operations (be it <, or (= , <), or other) has cases in which two expressions must be evaluated in order to determine the relation between two elements.

In practice, this is a problem if a comparison is computationally expensive. Examples of this are implicitly defined orders in which the order of elements depends on their relative position in some enumeration. (Think of comparing graphs by isomorphism type.) In this case, each order predicate is as expensive as a compare procedure—implying that a proper 3-way branch could be twice as fast as cascaded 2-way branches. Hence, there is a potentially considerable loss in performance, and it is purely due to the interface for comparisons.

The primary disadvantage of bare 3-way comparisons is that they are less convenient, both in use and in their definition. Luckily, this problem can be solved quite satisfactorily using the syntactic (macro) and procedural abstractions of Scheme.

How to represent the three cases?

We have considered the following alternatives for representing the three possible results of a comparison:

The representation acts as an internal interface between programs comparing objects and programs using these comparisons.

The advantage of using only three values is that the representation of each case is uniquely defined. In particular, this enables the use of case instead of if, and it ensures portability. Portability of numbers is problematic in R5RS due to underspecification and inexactness.

The advantage of using a non–unique (numerical) representation is that the result of a computation can sometimes immediately be used in a branch, much like the “non-#f means true”–convention. However, with the operations defined this advantage hardly matters. Moreover, the “non-#f means true”–convention is a major cause of unexpected program behavior itself.

The advantage of using {-1, 0, +1} over using three symbols is that the integers support additional operations, for example they can directly be used in index computations. A particularly useful operation is (* sign (compare x y)) which inverts the order relation depending on sign (either -1 or +1). In addition, the integers are unique—once it is known that comparisons result in integers it is obvious which integers. A minor consideration is that Scheme systems usually treat small integers as unboxed values, and that integers are self–evaluating literals.

The advantage of using three symbols is that they can be chosen to be more descriptive. For example, it is more instructive to see (symbol-compare 'foo 'bar) result in 'greater than in 1. Unfortunately, there is no obvious choice of name for the three symbols. Amoung the choices that make sense are 'less, 'equal, 'greater, or 'lt, 'eq, 'gt, or '<, '=, '>. A disadvantage of using symbols for the three cases is that Scheme symbols are ordered, too, and this ordering may differ from the desired ordered for the three cases.

Some Scheme implementations provide a mechanism for defining enumeration types. For example define-enumerated-type of Scheme 48 can be used to define a type comparison consisting of three objects, say lt, eq, gt. The enumeration can also (directly) be defined on top of SRFI-9 (Defining Record Types) [10] by defining three new record types, each of which having a single instance. We regard this approach as preferable over three symbols because comparison results have their own type, and a sufficiently advanced compiler could use this information to eliminate redundant type–checks.

One step further in this direction is the following design alternative we have considered: Due to the fundamental nature of the type comparison for programming, it would be worthwhile integrating it into the core language of Scheme. This could take the following form: There are three self–evaluating constants, e.g. written #< #= #>, and these are the only instances of the type comparison. The type supports two operations: comparison? and comparison-compare. Furthermore, eq?, eqv?, and equal? need to understand the comparison values. In other words, comparison is designed after boolean. It is unclear, however, which problem this tight integration of comparisons into the language is solving.

Given this situation, we have chosen for {-1, 0, +1}, while providing facilities for using this conveniently—in particular it is hardly ever necessary to deal with the integers directly.

How to order complex numbers?

Mathematically, no total order of the complex numbers exists which is compatible with the algebraic or topological structure. Nevertheless, it is useful for programming purposes to have some total order of complex numbers readily available.

Several total orders on the complex numbers are at least compatible with the natural ordering of real numbers. The least surprising of these is lexicographic on (re, im).

How to order special floating point symbols?

Floating point formats often do not only represent rational numbers but extend this set by special symbols, for example +Inf, -Inf, NaN (“Not a number”), and -0. How should these symbols be ordered with respect to the ordinary numerical values and with respect to each other? (Refer to the discussion archive starting with msg00010.)

Let us briefly recall the purpose of the special symbols. The general rationale for introducing special symbols into a floating point format is for numerical calculations to continue in the presence of data–dependent errors, while still retaining some meaningful information about the result.

As +Inf and -Inf are designed to represent extremal numbers, their ordering with respect to real numbers is obvious. For signed zeros, the ordering is also obvious. However, the notion of two zeros (or even three: -0, 0, and +0) is incompatible with the arithmetic structure of the real numbers. Hence, in most situations all zeros should be treated as equal, even though this can destroy information about results. But the alternative design may also make sense in certain situations where the full information carried in a floating point object is to be retained.

For NaN (or even several NaNs) the situation is even more ambiguous because there is not even a natural order relation of NaN with the other possible floating point values. One design alternative is to raise an error if NaN is to participate in a comparison; the reasoning being “if the control flow depends on a NaN you are in trouble anyway”. An alternative is to define some order by force; the reasoning being “if an object satisfies real? then it can be compared with real-compare.” Neither approach is obviously better than the other.

Given this situation, we have decided to leave the effect of using a special floating point value in real-compare unspecified, in line with the approach of R5RS. This approach might change once Scheme itself is more explicit about floating point representations and numerical computation.

How to define default-compare?

The purpose of default-compare is providing some well–defined way of comparing two arbitrary Scheme values. This can be used in all situations in which the user is unwilling to define a compare procedure explicitly, for example because the actual details of the total order do not really matter.

As an example, consider the task of dealing with sets of sets of integers. In this case, one could simply use sorted lists without repetition for representing lists and default-compare already provides a total order.

However, there are limits as to how default-compare can be defined. For example, default-compare cannot easily be based on a hash code derived from the pointer representing an object due to the close dependency with the garbage collection mechanism. Also, we believe it to be more useful to applications if default-compare is based on type and structure.

Unfortunately, this imposes limits on what can be compared using default-compare because it is very desireable to have a portable reference implementation. In particular, portable ways of dealing with circular structures are overly costly.

Naturally, the question arises how the types should be ordered. For this question it is useful to understand that boolean-compare and pair-compare both already define a total order for all values (at least in priciple). Hence, default-compare could refine one of them, but unfortunately not both at the same time (unless #f and () are minimum and maximum of the order, respectively). Since pair-compare is more frequently used than boolean-compare we base default-compare on pair-compare. The other portably comparable types are ordered by increasing complexity, which clearly is an arbitrary choice.

What is the “lexicographic order”?

The lexicographic order is a general way of defining an ordering for sequences from an ordering of elements:

In the lexicographic order, the empty sequence is the smallest sequence of all, and two non–empty sequences are first compared by their first element and only if these are equal the residual sequences are compared, recursively.

The lexicographic order has its name from its use in a lexicon: For example, fun < funloving < jolly.

What is the “natural order” of lists and vectors?

By “natural order” of an abstract data type we mean a total order that is defined to match the basic operations operations supported by the data type.

The basic access operations with constant execution time for Scheme lists are null?, car, and cdr. These are exactly the operations needed for comparing two sequences lexicographically.

The constant time access operations for Scheme vectors are vector-length (size) and vector-ref (ref). Using these operations, the fundamental ordering of vectors is first comparing by size, and only if the sizes are equal, by comparing the elements lexicographically.

Why are vectors not ordered lexicographically?

In this SRFI, lists and strings are ordered lexicographically (‘LEX’) by default, e.g. "12" < "2". The default order of vectors is first by length and then lexicographically (‘LENGTH-LEX’), e.g. #(2) < #(1 2). Alternatively, vectors could be ordered purely lexicographically, too. In the extreme, lists, strings, and vectors could even be ordered lexicographically as sequences without distinguishing the concrete representation, implying

"12" = (#\1
#\2) = #(#\1 #\2)

The choice affects vector-compare, default-compare, and the way orders are interpreted conceptually. Moreover, this SRFI introduces the terminology “ordered as lists” and “ordered as vectors” to refer to the two fundamental ways of lifting an order to sequences (LEX and LENGTH-LEX). The choice also has implications for any other SRFI introducing container data types (e.g. 66 and 74), in case the author wishes to specify default compare procedures compatible with this SRFI.

We consider 2. the most important due to its mathematical nature, followed by 1. because it simplifies the design. While this controversial, we think that it is preferable to introduce different orders for different data types, and not derive every order from a single one for sequences. Finally, we consider 3. a weak argument because the default ordering of strings is motivated primarily historically for ordering written words of (small alphabet) natural languages.

Concerning other vector–like data types, such as those introduced by SRFI-66 and SRFI-74, we recommend to define a default ordering which appears most natural for the type. These can conveniently be named type-as-ordering. In cases where the order is of minor importance, we recommend to be compatible with this SRFI.

Why so few higher–order constructions?

An alternative for the control structures (macros) refine-compare, select-compare, and cond-compare is a set of higher–order procedures for constructing compare procedures.

We have chosen for control structures instead of higher–order procedures for simplicity. This becomes particularly evident when a recursive compare procedure, e.g. default-compare, is to be defined. Using select-compare it is possible to define default-compare simply as a procedure calling itself in some branches. In the higher–order approach, the procedure under construction must also be able to call itself, with arguments that are application specific. Expressing this with a flexible higher–order procedure is much more indirect.

Why the operations <?, <=? etc.?

Programs need both 2-way branching and 3-way branching. For 3-way branching, the conditional if3 is provided.

For 2-way branching, the set {-1, 0, +1} of results of a comparison is mapped onto the set {#f, #t}. There are eight functions from a 3-set into a 2-set; all six non–constant functions are provided as =?, <?, etc.

The five monotonic functions can be generalized to chains of values. In order to make the compare procedure parameter optional in the ordinary comparisons, separate operations (chain<?, chain<=?, etc.) are defined for chains. For the sixth operation (not=?) the generalization to pairwise unequality is defined as pairwise-not=?. This operation can be implemented efficiently because the compare procedure also defines a total order.

As chains of length three are still frequently tested in programs (think of a range check 0 < i < n), and often two different relations are combined, there are special operations for chains of length three (</<?, </<=?, etc.)

For convenience, the compare procedure argument is made optional as often as possible. Unfortunately, this opens up a possibility for mistake: Writing (<=? x y z) where (<=/<=? x y z) is meant. Fortunately, the mistake will likely manifest itself at the time (x y z) is evaluated.

Why are <? etc. procedures, not macros?

The procedures <?, </<?, chain<? etc. could also have been specified as macros. This would have the advantage that they could make full use of “short evaluation”: A chain of comparisons stops as soon as one of the comparisons has failed; all remaining argument expressions and comparisons need not be evaluated. This is potentially more efficient.

The advantage of procedures, on the other hand, is that in Scheme they are “first class citizens,” meaning that they can be passed as arguments and returned from higher–order procedures.

Taking this approach one step further, one can even require the compare procedures to check the types of all arguments, even if the result of the comparison is already known. This is what Section 6.2.5 of R5RS calls “transitive“ behavior of the predicates =, <, etc. For example, (< 0 x y) first tests if x is positive, and only if this is the case (< x y) is tested. But even if x is not positive it is checked that y is indeed a real—otherwise an error is raised. In “short evaluation,” on the contrary, if x is not positive, y can be an arbitrary Scheme value.

Clearly, “transitive” tests have an overhead, namely that they need to execute potentially redundant type checks. Even worse, as types are only known to the compare procedure the only way to check the type of a value is to compare it, maybe with itself (which should result in 0 by definition of a compare procedure).

The advantage of “transitive” comparisons is the automatic insertion of a type assertion. For example, after

(chain<? integer-compare
x y z)

has been evaluated, no matter the result, it is known that x, y, and z are integers. We consider this advantage sufficiently important to pay the price.

Why compare-by< etc.?

It is often easier to define an order predicate, and possibly a separate equivalence relation, than it is to define a compare procedure. For this case, compare< etc. provide a convenient and robust way of constructing the associated compare procedure.

As has been learned from writing the reference implementation, despite the fact that each of these procedures is just a few lines of trivial code, they miraculously attract bugs.

How do I define a compare function from just an equivalence?

A compare function defines a total order on equivalence classes, and vice versa. Hence, a compare procedure compare can be used to test equivalence: (=? compare x y).

In reverse, one could be tempted to define a “compare function” c from just an equivalence relation ~ as c(x, y) = 0 if x ~ y and c(x, y) = 1 otherwise. However, c is not antisymmetric (unless all objects are equivalent, i.e. c(x,y) = 0 for all x, y) and hence it is not a compare function. In fact, there is no way at all of avoiding a total order on the equivalence classes.

This is also reflected in the fact that there are efficient (log–time) search data structures based on a total order, but we know of no efficient (sublinear worst–case) data structures based solely on an equivalence relation. The following program takes time and space O(h), where h is the number of equivalence classes in use:

(define (equal->compare equal)
  (let ((reps '()) (length-reps 0))
    (define (index x)
      (let loop ((i (- length-reps 1)) (rs reps))
        (if (null? rs)
            (let ((i length-reps))
              (set! reps (cons x reps))
              (set! length-reps (+ length-reps 1))
              i)
            (if (equal x (car rs))
                i
                (loop (- i 1) (cdr rs))))))
    (lambda (x y)
      (integer-compare (index x) (index y)))))

If equal is an equivalence predicate (i.e. it is reflexive, symmetric, and transitive) then (equal->compare equal) is a compare procedure for the objects comparable by equal. The total order defined is unspecified (as it depends on call sequence).

Note that the equivalence predicate equal could be defined by using a union–find data structure. But keep in mind that the equivalence relation represented by equal must not change while (equal->compare equal) is in use–so the union–find data structure must be unite classes.

How do I switch from R5RS to this SRFI?

As it happens, the specification of this SRFI is fully compatible with the 25 order predicates found in R5RS. The easiest way of switching is by defining the R5RS operations in terms of this SRFI.

Alternatively, each expression involving a reference to an R5RS order predicate can be transformed into an equivalent expression using the facilities of this SRFI. Be reminded though that this requires an understanding of the context of the expression in question, in particular variable bindings, macro definitions, and the use of eval.

However, if the meaning of an expression may be altered, it is often possible to increase type safety or simplicity. Consider for example the following potential replacements of (and (<= 0 i) (< i n)):

(and (<=? real-compare 0 i) (<? real-compare i n))
(<=/<? real-compare 0 i n)    ; always compares n
(<=/<? integer-compare 0 i n) ; only integer i, n
(<=/<? 0 i n)                 ; uses default-compare

Only the first alternative is equivalent to the original expression, but the other alternatives might be useful, too, depending on the goal.

Why be so tight with types?

Most procedures and macros in this SRFI are required to signal an error if an argument is not according to the type specified, in particular comparison values must be exact integer in

{-1, 0,
+1}

at all times. Alternatively, we could have specified that procedures and macros accept values as general as makes sense.

We believe that being tight on types at this fundamental level of a language pays off quickly. In particular, this will simplify debugging. Moreover, static analysis of a program will recognize more variables of a known type, which allows for more unboxed values and tighter compiled code. (Clearly, at the time of this writing this is speculative.)

Is there a performance penalty for this SRFI?

The focus of the reference implementation is correctness and portability; performance will very likely suffer due to the overhead of internal procedure calls and type-checking.

But as the word “SRFI” suggests, this document is a “request for implementation,” meaning we would love to see this SRFI being implemented efficiently by the implementation experts of particular Scheme systems. In practice, this means that most of the operations defined here, if not all, are supported natively.

In this case, there is no performance penalty for using the mechanisms of this SRFI—using this SRFI might even be faster due to explicit 3-way branching and better typing.

Why are there optional leading arguments?

Some operations have an optional first argument. This is in contrast to common practice in Scheme to put optional arguments after mandatory arguments.

The leading optional argument is always the argument compare, representing the total order to be used. If it is missing default-compare is used.

In the cases where we have chosen to make compare optional it is for the sake of brevity, e.g. in (<? x y) instead of enforcing

(<?
default-compare x y)

. Although an option introduces potential for confusion (e.g. (<? x y z) vs. (</<? x y z)), we consider it an important feature for interactive use and convenient programming (e.g. in (do ((i 0 (+ i 1))) ((=? i n)))).

Given our decision for optional compare, the question arises how to pass the option. In the absence of other widely accepted mechanisms for options, we can only vary the length of the argument list. For historical reasons—before case-lambda of SRFI-16—optional arguments are passed at the end of the argument list for simplified parsing. On the other hand, (<? compare x y) is more consistent with the rest of the SRFI than (<? x y compare).

Unfortunately, any particular choice here is a compromise, and it is also controversial. (Please refer to the discussion archive for details, in particular msg00051.) We have chosen for notational convenience in the common case (optional compare) and for consistency within this SRFI (leading optional argument).

Why chain<? etc. and not a predicate parameter?

This SRFI specifies the five chain predicates chain=?, chain<?, chain>?, chain<=?, and chain>=?. An alterative is to define a single chain predicate that has the ordering as a parameter. (Refer to the discussion archive starting with msg00012.)

The reason we have chosen for five chain predicates is that we use compare procedures to represent orders, not predicate procedures. There are five possible order relations predicates for which a chain test makes sense. (The sixth, not=?, is not not transitive and hence requires pairwise testing.) The five chain tests are clearly defined and can be implemented efficiently, their main overhead being the call to the compare procedure.

Why not more higher–order procedures?

In this SRFI min-compare accepts a compare procedure as a first mandatory argument, applying the minimum operation to the list of all other arguments. An alternative is to have min-compare accept only the compare procedure (possibly optional) and returing a procedure computing the minimum of all its arguments (with respect to the compare procedure.) In a similar fashion other operations can specified as higher–order procedures.

We have avoided higher–order procedures in this SRFI for simplicity and efficiency. As said repeatedly, compare procedures are the main vehicle to transport total orders from the code site definine an order to the code site using an order. Moreover, most operations made available through this SRFI appear rather infrequently in programs, so either way there is little to be gained. Finally, dealing with higher–order procedures often involves writing more parentheses and the more simple–minded Scheme systems will create many short–lived closures.

Why do <? etc. have so many options?

The procedures =?, <? etc. accept an optional compare procedure but also two optional arguments to compare. This could be made simpler by not specifying some of the cases, or by specifying different procedures for the different functions.

The operations <? etc. are the primary mechanism for using compare procedures. As such they should be versatile and concise.

Our original design had two mandatory arguments for objects to compare and an optional argument for the compare procedure, i.e. it provides a parametric comparison (<? compare x y) of two objects. Amir Livne Bar-On then raised the issue of having better support for a higher–order style of programming, i.e. ((<? compare) x y). (Refer to msg00012.)

However, in Scheme the higher–order style is less convenient than it is in curried programming languages like Haskell or ML. In practice this manifests itself as follows: The most basic and frequent case of comparing atomic objects with respect to the default ordering would read ((<=?) x y), which is just two parentheses short of optimal.

Fortunately, Dave Mason proposed a syntax for resolving the apparent alternative parametric test vs. higher order style. (Refer to msg00014.) By combining both functionalities into a single procedure, the user can choose the style at any moment.

2.28.6 Design rationale

Order predicates (2-way) or 3-way comparisons?

How to represent the three cases?

How to order complex numbers?

How to order special floating point symbols?

How to define `default-compare`?

What is the “lexicographic order”?

What is the “natural order” of lists and vectors?

Why are vectors not ordered lexicographically?

Why so few higher–order constructions?

Why the operations `<?`, `<=?` etc.?

Why are `<?` etc. procedures, not macros?

Why `compare-by<` etc.?

How do I define a compare function from just an equivalence?

How do I switch from R5RS to this SRFI?

Why be so tight with types?

Is there a performance penalty for this SRFI?

Why are there optional leading arguments?

Why `chain<?` etc. and not a predicate parameter?

Why not more higher–order procedures?

Why do `<?` etc. have so many options?

2.28.6 Design rationale

Order predicates (2-way) or 3-way comparisons?

How to represent the three cases?

How to order complex numbers?

How to order special floating point symbols?

How to define default-compare?

What is the “lexicographic order”?

What is the “natural order” of lists and vectors?

Why are vectors not ordered lexicographically?

Why so few higher–order constructions?

Why the operations <?, <=? etc.?

Why are <? etc. procedures, not macros?

Why compare-by< etc.?

How do I define a compare function from just an equivalence?

How do I switch from R5RS to this SRFI?

Why be so tight with types?

Is there a performance penalty for this SRFI?

Why are there optional leading arguments?

Why chain<? etc. and not a predicate parameter?

Why not more higher–order procedures?

Why do <? etc. have so many options?

How to define `default-compare`?

Why the operations `<?`, `<=?` etc.?

Why are `<?` etc. procedures, not macros?

Why `compare-by<` etc.?

Why `chain<?` etc. and not a predicate parameter?

Why do `<?` etc. have so many options?