Next: srfi compare-procedures related, Previous: srfi compare-procedures spec, Up: srfi compare-procedures [Index]
In this section we present our reasoning behind the design decisions made for this SRFI. We would like to be explicit on this because we believe that design is not about the outcome of decisions but about the alternatives considered. The section is organized as a Q&A list.
It is mathematical tradition to specify a total order in terms of a “less or equal” (<=) relation. This usually carries over to programming languages in the form of a <= predicate procedure.
However, there are inherently three possible relations between two elements x and y with respect to a total order: x < y, x = y, and x > y. (With respect to a partial order there is a fourth: x and y are incomparable.) This implies that any mechanism based on 2-valued operations (be it <, or the pair (=, <), or any other) has cases in which two expressions must be evaluated in order to determine the relation between two elements.
In practice, this is a problem if a comparison is computationally expensive. Examples of this are implicitly defined orders in which the order of elements depends on their relative position in some enumeration. (Think of comparing graphs by isomorphism type.) In this case, each order predicate is as expensive as a compare procedure—implying that a proper 3-way branch could be twice as fast as cascaded 2-way branches. Hence, there is a potentially considerable loss in performance, and it is purely due to the interface for comparisons.
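The performance point can be sketched as follows (classify-2way and classify-3way are hypothetical names; less? stands for any order predicate and compare for any 3-way compare procedure):

```scheme
;; Sketch: with only 2-way predicates, determining the relation
;; between x and y may evaluate an expensive predicate twice...
(define (classify-2way less? x y)
  (cond ((less? x y) 'less)      ; first (possibly expensive) test
        ((less? y x) 'greater)   ; second (possibly expensive) test
        (else 'equal)))

;; ...whereas a single call to a 3-way compare procedure suffices.
(define (classify-3way compare x y)
  (case (compare x y)
    ((-1) 'less)
    ((0)  'equal)
    ((1)  'greater)))
```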
The primary disadvantage of bare 3-way comparisons is that they are less convenient, both in use and in their definition. Luckily, this problem can be solved quite satisfactorily using the syntactic (macro) and procedural abstractions of Scheme.
We have considered the following alternatives for representing the three possible results of a comparison:

- the three exact integers -1, 0, and +1 (used in this SRFI);
- arbitrary real numbers (real?), with the sign of the number indicating the case (<, =, and >);
- three symbols (e.g. 'less, 'equal, 'greater);
- three dedicated objects (written #<, #=, and #>).
The representation acts as an internal interface between programs comparing objects and programs using these comparisons.
The advantage of using only three values is that the representation of each case is uniquely defined. In particular, this enables the use of case instead of if, and it ensures portability. Portability of numbers is problematic in R5RS due to underspecification and inexactness.
The advantage of using a non–unique (numerical) representation is that the result of a computation can sometimes immediately be used in a branch, much like the “non-#f means true” convention. However, with the operations defined this advantage hardly matters. Moreover, the “non-#f means true” convention is a major cause of unexpected program behavior itself.
The advantage of using {-1, 0, +1} over using three symbols is that the integers support additional operations; for example, they can directly be used in index computations. A particularly useful operation is (* sign (compare x y)), which inverts the order relation depending on sign (either -1 or +1). In addition, the integers are unique—once it is known that comparisons result in integers it is obvious which integers. A minor consideration is that Scheme systems usually treat small integers as unboxed values, and that integers are self–evaluating literals.
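For instance, the sign trick mentioned above derives the reversed order from any compare procedure (integer-compare here is a minimal stand-in for the SRFI's procedure of that name):

```scheme
(define (integer-compare x y)   ; minimal 3-way compare on integers
  (cond ((< x y) -1)
        ((= x y) 0)
        (else 1)))

;; (* sign (compare x y)) with sign = -1 inverts the order relation:
(define (reverse-compare compare)
  (lambda (x y) (* -1 (compare x y))))

((reverse-compare integer-compare) 1 2)   ; => 1, i.e. 1 is now "greater"
```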
The advantage of using three symbols is that they can be chosen to be more descriptive. For example, it is more instructive to see (symbol-compare 'foo 'bar) result in 'greater than in 1. Unfortunately, there is no obvious choice of name for the three symbols. Among the choices that make sense are 'less, 'equal, 'greater; or 'lt, 'eq, 'gt; or '<, '=, '>. A disadvantage of using symbols for the three cases is that Scheme symbols are ordered, too, and this ordering may differ from the desired order for the three cases.
Some Scheme implementations provide a mechanism for defining enumeration types. For example, define-enumerated-type of Scheme 48 can be used to define a type comparison consisting of three objects, say lt, eq, gt. The enumeration can also (directly) be defined on top of SRFI-9 (Defining Record Types) [10] by defining three new record types, each having a single instance. We regard this approach as preferable to three symbols because comparison results have their own type, and a sufficiently advanced compiler could use this information to eliminate redundant type–checks.
One step further in this direction is the following design alternative we have considered: due to the fundamental nature of the type comparison for programming, it would be worthwhile integrating it into the core language of Scheme. This could take the following form: there are three self–evaluating constants, e.g. written #<, #=, #>, and these are the only instances of the type comparison. The type supports two operations: comparison? and comparison-compare. Furthermore, eq?, eqv?, and equal? need to understand the comparison values. In other words, comparison is designed after boolean. It is unclear, however, which problem this tight integration of comparisons into the language would solve.
Given this situation, we have chosen {-1, 0, +1}, while providing facilities for using this representation conveniently—in particular, it is hardly ever necessary to deal with the integers directly.
Mathematically, no total order of the complex numbers exists which is compatible with the algebraic or topological structure. Nevertheless, it is useful for programming purposes to have some total order of complex numbers readily available.
Several total orders on the complex numbers are at least compatible with the natural ordering of real numbers. The least surprising of these is lexicographic on (re, im).
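A sketch of this order, using only standard R5RS operations (complex-compare-lex is a hypothetical name, not part of this SRFI):

```scheme
(define (real-compare* x y)            ; 3-way compare on real numbers
  (cond ((< x y) -1)
        ((> x y) 1)
        (else 0)))

;; Lexicographic on (re, im): compare real parts first, and only on a
;; tie the imaginary parts.  On real arguments this restricts to the
;; natural order of the reals.
(define (complex-compare-lex z w)
  (let ((c (real-compare* (real-part z) (real-part w))))
    (if (zero? c)
        (real-compare* (imag-part z) (imag-part w))
        c)))
```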
Floating point formats often not only represent rational numbers but extend this set by special symbols, for example +Inf, -Inf, NaN (“Not a Number”), and -0. How should these symbols be ordered with respect to the ordinary numerical values and with respect to each other? (Refer to the discussion archive starting with msg00010.)
Let us briefly recall the purpose of the special symbols. The general rationale for introducing special symbols into a floating point format is for numerical calculations to continue in the presence of data–dependent errors, while still retaining some meaningful information about the result.
+Inf and -Inf indicate that the calculation has produced a value exceeding the representable range.
-0 indicates that a calculation has produced a value of unrepresentably small magnitude, but retains the information that the underflow approached zero from the negative side (otherwise it would be +0). This sign information is useful in the presence of branch–cuts.
NaN indicates that the information about the value has been lost entirely (example: -Inf + Inf). NaN avoids raising an exception and allows carrying on with other parts of the calculation. It should be noted that several NaNs can exist. For example, in the IEEE 754 standard many bit patterns represent NaN (whatever the interpretation).
As +Inf and -Inf are designed to represent extremal numbers, their ordering with respect to real numbers is obvious. For signed zeros, the ordering is also obvious. However, the notion of two zeros (or even three: -0, 0, and +0) is incompatible with the arithmetic structure of the real numbers. Hence, in most situations all zeros should be treated as equal, even though this can destroy information about results. But the alternative design may also make sense in certain situations where the full information carried in a floating point object is to be retained.
For NaN (or even several NaNs) the situation is even more ambiguous because there is not even a natural order relation of NaN with the other possible floating point values. One design alternative is to raise an error if NaN is to participate in a comparison; the reasoning being “if the control flow depends on a NaN you are in trouble anyway.” An alternative is to define some order by force; the reasoning being “if an object satisfies real? then it can be compared with real-compare.” Neither approach is obviously better than the other.
Given this situation, we have decided to leave the effect of using a special floating point value in real-compare unspecified, in line with the approach of R5RS. This approach might change once Scheme itself is more explicit about floating point representations and numerical computation.
Why is there default-compare?

The purpose of default-compare is to provide some well–defined way of comparing two arbitrary Scheme values. This can be used in all situations in which the user is unwilling to define a compare procedure explicitly, for example because the actual details of the total order do not really matter.
As an example, consider the task of dealing with sets of sets of integers. In this case, one could simply use sorted lists without repetition for representing the sets, and default-compare already provides a total order.
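For example, if the sets {1, 3} and {2} are represented as the sorted lists (1 3) and (2), then default-compare (assuming this SRFI is loaded) orders them directly:

```scheme
;; Sets as sorted lists; default-compare orders lists lexicographically.
(default-compare '(1 3) '(2))    ; => -1, since 1 < 2 decides
(default-compare '(1 3) '(1 3))  ; => 0
```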
However, there are limits to how default-compare can be defined. For example, default-compare cannot easily be based on a hash code derived from the pointer representing an object, because of the close dependency on the garbage collection mechanism. Also, we believe it to be more useful to applications if default-compare is based on type and structure.
Unfortunately, this imposes limits on what can be compared using default-compare because it is very desirable to have a portable reference implementation. In particular, portable ways of dealing with circular structures are overly costly.
Naturally, the question arises how the types should be ordered. For this question it is useful to understand that boolean-compare and pair-compare both already define a total order for all values (at least in principle). Hence, default-compare could refine one of them, but unfortunately not both at the same time (unless #f and () are minimum and maximum of the order, respectively). Since pair-compare is more frequently used than boolean-compare, we base default-compare on pair-compare. The other portably comparable types are ordered by increasing complexity, which clearly is an arbitrary choice.
The lexicographic order is a general way of defining an ordering for sequences from an ordering of elements:
In the lexicographic order, the empty sequence is the smallest sequence of all, and two non–empty sequences are first compared by their first elements; only if these are equal are the residual sequences compared, recursively.
The lexicographic order has its name from its use in a lexicon: for example, fun < funloving < jolly.
By “natural order” of an abstract data type we mean a total order that is defined to match the basic operations supported by the data type.
The basic access operations with constant execution time for Scheme lists are null?, car, and cdr. These are exactly the operations needed for comparing two sequences lexicographically.
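A sketch of list comparison built from exactly these operations (list-compare-lex is a hypothetical name; element-compare supplies the order on the elements):

```scheme
(define (list-compare-lex element-compare xs ys)
  (cond ((null? xs) (if (null? ys) 0 -1))  ; empty list is smallest
        ((null? ys) 1)
        (else
         (let ((c (element-compare (car xs) (car ys))))
           (if (zero? c)                   ; first elements equal:
               (list-compare-lex element-compare (cdr xs) (cdr ys))
               c)))))                      ; otherwise they decide
```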
The constant time access operations for Scheme vectors are vector-length (size) and vector-ref (ref). Using these operations, the fundamental ordering of vectors is first comparing by size, and only if the sizes are equal, comparing the elements lexicographically.
In this SRFI, lists and strings are ordered lexicographically (‘LEX’) by default, e.g. "12" < "2". The default order of vectors is first by length and then lexicographically (‘LENGTH-LEX’), e.g. #(2) < #(1 2). Alternatively, vectors could be ordered purely lexicographically, too. In the extreme, lists, strings, and vectors could even be ordered lexicographically as sequences, without distinguishing the concrete representation, implying "12" = (#\1 #\2) = #(#\1 #\2).
The choice affects vector-compare, default-compare, and the way orders are interpreted conceptually. Moreover, this SRFI introduces the terminology “ordered as lists” and “ordered as vectors” to refer to the two fundamental ways of lifting an order to sequences (LEX and LENGTH-LEX). The choice also has implications for any other SRFI introducing container data types (e.g. 66 and 74), in case the author wishes to specify default compare procedures compatible with this SRFI.
Summarizing the discussion, there seem to be three major arguments. (Please refer to the discussion archive for details, in particular msg00054.)
We consider 2. the most important due to its mathematical nature, followed by 1. because it simplifies the design. While this is controversial, we think that it is preferable to introduce different orders for different data types, and not to derive every order from a single one for sequences. Finally, we consider 3. a weak argument because the default ordering of strings is motivated primarily historically, for ordering written words of (small alphabet) natural languages.
Concerning other vector–like data types, such as those introduced by SRFI-66 and SRFI-74, we recommend defining a default ordering which appears most natural for the type. These can conveniently be named type-as-ordering. In cases where the order is of minor importance, we recommend staying compatible with this SRFI.
An alternative to the control structures (macros) refine-compare, select-compare, and cond-compare is a set of higher–order procedures for constructing compare procedures. We have chosen control structures instead of higher–order procedures for simplicity. This becomes particularly evident when a recursive compare procedure, e.g. default-compare, is to be defined. Using select-compare it is possible to define default-compare simply as a procedure calling itself in some branches. In the higher–order approach, the procedure under construction must also be able to call itself, with arguments that are application specific. Expressing this with a flexible higher–order procedure is much more indirect.
Why are there <?, <=? etc.?

Programs need both 2-way branching and 3-way branching. For 3-way branching, the conditional if3 is provided.

For 2-way branching, the set {-1, 0, +1} of results of a comparison is mapped onto the set {#f, #t}. There are eight functions from a 3-set into a 2-set; all six non–constant functions are provided as =?, <?, etc.
The five monotonic functions can be generalized to chains of values. In order to make the compare procedure parameter optional in the ordinary comparisons, separate operations (chain<?, chain<=?, etc.) are defined for chains. For the sixth operation (not=?), the generalization to pairwise inequality is defined as pairwise-not=?. This operation can be implemented efficiently because the compare procedure also defines a total order.
As chains of length three are still frequently tested in programs (think of a range check 0 < i < n), and often two different relations are combined, there are special operations for chains of length three (</<?, </<=?, etc.).
For convenience, the compare procedure argument is made optional as often as possible. Unfortunately, this opens up a possibility for mistakes: writing (<=? x y z) where (<=/<=? x y z) is meant. Fortunately, the mistake will likely manifest itself at the time (x y z) is evaluated.
Why are <? etc. procedures, not macros?

The procedures <?, </<?, chain<? etc. could also have been specified as macros. This would have the advantage that they could make full use of “short evaluation”: a chain of comparisons stops as soon as one of the comparisons has failed; all remaining argument expressions and comparisons need not be evaluated. This is potentially more efficient.
The advantage of procedures, on the other hand, is that in Scheme they are “first class citizens,” meaning that they can be passed as arguments and returned from higher–order procedures.
Taking this approach one step further, one can even require the compare procedures to check the types of all arguments, even if the result of the comparison is already known. This is what Section 6.2.5 of R5RS calls “transitive” behavior of the predicates =, <, etc. For example, (< 0 x y) first tests whether x is positive, and only if this is the case is (< x y) tested. But even if x is not positive, it is checked that y is indeed a real—otherwise an error is raised. In “short evaluation,” on the contrary, if x is not positive, y can be an arbitrary Scheme value.
Clearly, “transitive” tests have an overhead, namely that they need to execute potentially redundant type checks. Even worse, as types are only known to the compare procedure, the only way to check the type of a value is to compare it, maybe with itself (which should result in 0 by definition of a compare procedure).
The advantage of “transitive” comparisons is the automatic insertion of a type assertion. For example, after (chain<? integer-compare x y z) has been evaluated, no matter the result, it is known that x, y, and z are integers. We consider this advantage sufficiently important to pay the price.
Why is there compare-by< etc.?

It is often easier to define an order predicate, and possibly a separate equivalence relation, than it is to define a compare procedure. For this case, compare-by< etc. provide a convenient and robust way of constructing the associated compare procedure.
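A simplified sketch of the construction (the SRFI's compare-by< can also accept the two objects directly; this version only builds the compare procedure from an order predicate lt, which is assumed to define a total order):

```scheme
(define (compare-by< lt)
  (lambda (x y)
    (cond ((lt x y) -1)
          ((lt y x) 1)
          (else 0))))          ; neither is smaller: equivalent

(define integer-compare (compare-by< <))
(integer-compare 2 3)   ; => -1
(integer-compare 3 3)   ; => 0
```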
As has been learned from writing the reference implementation, despite the fact that each of these procedures is just a few lines of trivial code, they miraculously attract bugs.
You had better not.
A compare function defines a total order on equivalence classes, and vice versa. Hence, a compare procedure compare can be used to test equivalence: (=? compare x y).
In reverse, one could be tempted to define a “compare function” c from just an equivalence relation ~ as c(x, y) = 0 if x ~ y and c(x, y) = 1 otherwise. However, c is not antisymmetric (unless all objects are equivalent, i.e. c(x, y) = 0 for all x, y) and hence it is not a compare function. In fact, there is no way at all of avoiding a total order on the equivalence classes.
This is also reflected in the fact that there are efficient (log–time) search data structures based on a total order, but we know of no efficient (sublinear worst–case) data structures based solely on an equivalence relation. The following program takes time and space O(h), where h is the number of equivalence classes in use:
(define (equal->compare equal)
  (let ((reps '()) (length-reps 0))
    (define (index x)
      (let loop ((i (- length-reps 1)) (rs reps))
        (if (null? rs)
            (let ((i length-reps))
              (set! reps (cons x reps))
              (set! length-reps (+ length-reps 1))
              i)
            (if (equal x (car rs))
                i
                (loop (- i 1) (cdr rs))))))
    (lambda (x y)
      (integer-compare (index x) (index y)))))
If equal is an equivalence predicate (i.e. it is reflexive, symmetric, and transitive) then (equal->compare equal) is a compare procedure for the objects comparable by equal. The total order defined is unspecified (as it depends on the call sequence).
Note that the equivalence predicate equal could be defined by using a union–find data structure. But keep in mind that the equivalence relation represented by equal must not change while (equal->compare equal) is in use—so the union–find data structure must no longer unite classes.
As it happens, the specification of this SRFI is fully compatible with the 25 order predicates found in R5RS. The easiest way of switching is by defining the R5RS operations in terms of this SRFI.
Alternatively, each expression involving a reference to an R5RS order predicate can be transformed into an equivalent expression using the facilities of this SRFI. Be reminded though that this requires an understanding of the context of the expression in question, in particular variable bindings, macro definitions, and the use of eval.
However, if the meaning of an expression may be altered, it is often possible to increase type safety or simplicity. Consider for example the following potential replacements of (and (<= 0 i) (< i n)):
(and (<=? real-compare 0 i) (<? real-compare i n))
(<=/<? real-compare 0 i n)      ; always compares n
(<=/<? integer-compare 0 i n)   ; only integer i, n
(<=/<? 0 i n)                   ; uses default-compare
Only the first alternative is equivalent to the original expression, but the other alternatives might be useful, too, depending on the goal.
Most procedures and macros in this SRFI are required to signal an error if an argument is not of the type specified; in particular, comparison values must be exact integers in {-1, 0, +1} at all times. Alternatively, we could have specified that procedures and macros accept values as general as makes sense.
We believe that being tight on types at this fundamental level of a language pays off quickly. In particular, this will simplify debugging. Moreover, static analysis of a program will recognize more variables of a known type, which allows for more unboxed values and tighter compiled code. (Clearly, at the time of this writing this is speculative.)
Yes and no.
The focus of the reference implementation is correctness and portability; performance will very likely suffer due to the overhead of internal procedure calls and type-checking.
But as the word “SRFI” suggests, this document is a “request for implementation,” meaning we would love to see this SRFI being implemented efficiently by the implementation experts of particular Scheme systems. In practice, this means that most of the operations defined here, if not all, are supported natively.
In this case, there is no performance penalty for using the mechanisms of this SRFI—using this SRFI might even be faster due to explicit 3-way branching and better typing.
Some operations have an optional first argument. This is in contrast to common practice in Scheme to put optional arguments after mandatory arguments.
The leading optional argument is always the argument compare, representing the total order to be used. If it is missing, default-compare is used.
In the cases where we have chosen to make compare optional, it is for the sake of brevity, e.g. (<? x y) instead of enforcing (<? default-compare x y). Although an option introduces potential for confusion (e.g. (<? x y z) vs. (</<? x y z)), we consider it an important feature for interactive use and convenient programming (e.g. in (do ((i 0 (+ i 1))) ((=? i n)))).
Given our decision for optional compare, the question arises how to pass the option. In the absence of other widely accepted mechanisms for options, we can only vary the length of the argument list. For historical reasons—before case-lambda of SRFI-16—optional arguments are passed at the end of the argument list for simplified parsing. On the other hand, (<? compare x y) is more consistent with the rest of the SRFI than (<? x y compare).
Unfortunately, any particular choice here is a compromise, and it is also controversial. (Please refer to the discussion archive for details, in particular msg00051.) We have opted for notational convenience in the common case (optional compare) and for consistency within this SRFI (leading optional argument).
Why chain<? etc. and not a predicate parameter?

This SRFI specifies the five chain predicates chain=?, chain<?, chain>?, chain<=?, and chain>=?.
An alternative is to define a single chain predicate that has the ordering as a parameter. (Refer to the discussion archive starting with msg00012.)
The reason we have chosen five chain predicates is that we use compare procedures to represent orders, not predicate procedures. There are five possible order relations for which a chain test makes sense. (The sixth, not=?, is not transitive and hence requires pairwise testing.) The five chain tests are clearly defined and can be implemented efficiently, their main overhead being the calls to the compare procedure.
In this SRFI, min-compare accepts a compare procedure as a first mandatory argument, applying the minimum operation to the list of all other arguments. An alternative is to have min-compare accept only the compare procedure (possibly optional) and return a procedure computing the minimum of all its arguments (with respect to the compare procedure). In a similar fashion, other operations could be specified as higher–order procedures.
We have avoided higher–order procedures in this SRFI for simplicity and efficiency. As said repeatedly, compare procedures are the main vehicle to transport total orders from the code site defining an order to the code site using an order. Moreover, most operations made available through this SRFI appear rather infrequently in programs, so either way there is little to be gained. Finally, dealing with higher–order procedures often involves writing more parentheses, and the more simple–minded Scheme systems will create many short–lived closures.
Why do <? etc. have so many options?

The procedures =?, <? etc. accept an optional compare procedure but also two optional arguments to compare. This could be made simpler by not specifying some of the cases, or by specifying different procedures for the different functions.
The operations <? etc. are the primary mechanism for using compare procedures. As such they should be versatile and concise.
Our original design had two mandatory arguments for the objects to compare and an optional argument for the compare procedure, i.e. it provided a parametric comparison (<? compare x y) of two objects. Amir Livne Bar-On then raised the issue of having better support for a higher–order style of programming, i.e. ((<? compare) x y). (Refer to msg00012.)
However, in Scheme the higher–order style is less convenient than it is in curried programming languages like Haskell or ML. In practice this manifests itself as follows: the most basic and frequent case of comparing atomic objects with respect to the default ordering would read ((<=?) x y), which is just two parentheses short of optimal.
Fortunately, Dave Mason proposed a syntax resolving the apparent alternative of parametric test vs. higher–order style. (Refer to msg00014.) By combining both functionalities into a single procedure, the user can choose the style at any moment.
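A dispatch sketch of the combined design (my<? and number-compare are hypothetical names; number-compare stands in for the SRFI's default-compare):

```scheme
(define (number-compare x y)
  (cond ((< x y) -1) ((= x y) 0) (else 1)))

(define (my<? . args)
  (define (test compare x y) (negative? (compare x y)))
  (cond ((null? args)                          ; ((my<?) x y)
         (lambda (x y) (test number-compare x y)))
        ((and (procedure? (car args)) (null? (cdr args)))
         (lambda (x y) (test (car args) x y))) ; ((my<? compare) x y)
        ((procedure? (car args))               ; (my<? compare x y)
         (apply test args))
        (else                                  ; (my<? x y)
         (test number-compare (car args) (cadr args)))))

(my<? 1 2)                    ; => #t  (parametric, default order)
((my<? number-compare) 2 1)   ; => #f  (higher-order style)
```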