Chapter 4
The Standard Library
4.1 General Purpose Exceptions and Functions
The following exceptions are defined:
- exception Bind is raised when a value binding (val or val rec) fails to match. The Definition of
standard ML stipulates that Bind may be raised only in situations where a match is not exhaustive, and that the
compiler must warn the user of this. Map pattern matching complicates a lot static analysis of pattern matching,
so the GimML compiler does not warn the user for now. GimML does not give any warning about redundancy
of patterns either.
A typical example of a non-exhaustive pattern is:
let val a::l = r in a end
which raises Bind if r is the empty list.
- exception Match is raised when a function cannot be applied to its argument, though they agree on types. A typical
example is:
(fn (a::l) => a) r
or the equivalent:
case r of a::l => a
which raises Match if r is the empty list.
- exception ParSweep is raised when a comprehension (or quantification, or iteration) using the || separator is
meant to sweep through domains of different cardinalities (the ni in the explanation of comprehensions in section
2.1.1).
However, the moment where this exception is raised is voluntarily left unspecified. For example, the current interpreter
checks the cardinalities before sweeping, but it would be easier to check them after sweeping in compiled code, for
example.
- exception NoMem is raised when the garbage collector fails to get enough space to complete a computation. This
exception cannot be handled in production code, because trying to correct the problem usually implies that we allocate
some memory, which only makes the situation worse.
- exception Stack was intended to be raised whenever the stack overflowed. However, in GimML, it never overflows,
because when the stack becomes full, the system reifies the current continuation, empties the stack and proceeds with the
computation.
- exception NonSense is raised when evaluating a non-sensical expression. In theory, whether it is in Standard ML
or in GimML, this can never happen, because all expressions that type-check are well-formed. However, when some
expression fails to type-check, it is usually bad practice to stop compiling immediately. The usual solution is to allow for
several errors to occur before stopping compilation. In GimML, we have chosen not to stop compilation at all,
and instead to compile a special expression raise NonSense in place of all non-sensical (ill-typed)
expressions.
General functions are:
4.2 Lists
The exception:
exception Nth
is defined. It is raised whenever it is attempted to access an element outside of a list. Recall that lists are defined by the
datatype declaration:
datatype ’a list = nil — :: of ’a * ’a list
:: : 'a ⋆ 'a list -> 'a list is the list constructor, and is declared infix, right associative of precedence
5.
- null : 'a list -> bool returns true if the list in argument is empty. This is equivalent to
fn nil => true | _ => false
- hd : 'a list -> 'a returns the first element of the list in argument, or raises Match if the list is empty. This is
equivalent to:
fun hd (x :: _) = x
- tl : 'a list -> 'a list returns the list composed of all but the first element of the list in argument, or raises
Match if the list is empty. This is equivalent to:
fun tl (_ :: x) = x
- len : 'a list -> int computes the length of a list.
- nth : 'a list ⋆ int -> 'a gets the n + 1th element of the list l when applied to the arguments l and n. If n is
out of range (n < 0 or n >=len l), Nth is raised. Note that elements are indexed starting at 0. nth is declared infix,
left associative of precedence 9.
- nthtail : 'a list ⋆ int -> 'a list gets the nth tail of the list l when applied to the arguments l and n.
If n out of range (n < 0 or n >len l), Nth is raised. Note that tails are indexed starting at 0. When
n <len l, op nth (l,n) is the first element of nthtail l,n); if n =len l, nthtail l,n) is
nil.
- @ : 'a list ⋆ 'a list -> 'a list concatenates two lists. It could have been defined as:
fun op @ (nil,l) = l | op @ (a::l1,l2) = a:: op @(l1,l2)
@ is declared infix, right associative (as in Standard ML of New Jersey; the Definition of Standard ML defines it to be left
associative) of precedence 5.
- append : 'a list list -> 'a list is the distributed concatenation of lists. It could be defined
as:
fun append nil = nil | append (l::r) = l @ append r
The application of append to a list comprehension is compiled in an optimized form, where the concatenations are done
on the fly, without building the list comprehension first.
- revappend : 'a list ⋆ 'a list -> 'a list appends the reversion of the first list to the second list. We
could define:
fun revappend (nil,l) = l | revappend (a::l1,l2) = revappend (l1,a::l2)
- rev : 'a list -> 'a list reverses a list. It could be defined as:
fun rev l = revappend (l,nil)
- map : ('a -> 'b) -> 'a list -> 'b list applies its first argument to each element in turn of the list in
second argument, and return the list of all results. This is equivalent to:
fun map f l = [f x | x in list l]
- app : ('a -> 'b) -> 'a list -> unit applies its first argument to each element in turn of the list in second
argument. This is used purely for side-effects.
fun app f l = iterate f x | x in list l end
- fold : ('a ⋆ 'b -> 'b) -> 'b -> 'a list -> 'b combines the elements of the list in third argument
by applying the binary operation in first argument on all elements of the list. The second argument is assumed to be the
neutral element of the binary operation. For example, fold (op +) 0 l computes the sum of the
elements of the list l. fold computes from the right end of the list towards the left end. It can be defined
by:
fun fold f b =
let fun f2 nil = b
| f2 (e :: r) = f(e,f2 r)
in
f2
end
- revfold : ('a ⋆ 'b -> 'b) -> 'b -> 'a list -> 'b combines the elements of the list in third
argument by applying the binary operation in first argument on all elements of the list. The second argument is assumed
to be the neutral element of the binary operation. For example, revfold (op +) 0 l computes the sum of the
elements of the list l. revfold computes from the left end of the list towards the right end. It can be defined
by:
fun revfold f b =
let fun f2 b nil = b
| f2 b (e::r) = f2 (f (e,b)) r
in
f2
end
4.3 Sets and Maps
Exceptions related to sets and maps are:
- exception MapGet, raised when a map is applied to an element outside of its domain (with the ? function).
- exception Empty, raised when taking the distributed intersection of the empty set, or choosing an element
in the empty map.
- exception Range, raised when trying to build a range {m,…,n} with non-integer bounds (with the to
operator)
Functions on sets and maps are:
- empty : (''a -m> 'b) -> bool tests whether a map is empty. It is equivalent to fn m => m={}.
- ? : (''a -m> 'b) -> ''a -> 'b applies a map to an element. It is currified, so that the expression
?m(a) retrieves the element associated with a in m. If there is no element associated with a in m, the exception
MapGet is raised.
- inset : ''a ⋆ (''a -m> 'b) -> bool returns true if its first argument belongs to the domain of
the map in second argument. It can be defined as:
fun op inset (a,m) = (?m(a); true) handle MapGet => false
It is declared infix, of precedence 4.
? and inset share a one-entry cache, where the last maplet is stored, so that testing inset then using ? incurs almost
no speed penalty.
- inmap : (''a ⋆ ''b) ⋆ (''a -m> 'b) -> bool returns true if its first argument is a pair (x,y) such that
the maplet x => y is in the map in second argument. It can be defined as:
fun op inset (a,m) = (?m(a)=b) handle MapGet => false
It is declared infix, of precedence 4.
? and inset share a one-entry cache, where the last maplet is stored, so that testing inset then using ? incurs almost
no speed penalty.
- subset : (''a -m> 'b) ⋆ (''a -m> 'b) -> bool tests the inclusion of the domains of maps in
argument. It may be defined as:
fun op subset (m1,m2) = all x inset m2 | x in set m1 end
It is declared infix, of precedence 4.
- submap : (''a -m> ''b) ⋆ (''a -m> ''b) -> bool tests the inclusion of maps: it returns true if the first
map is a submap of the second. It may be defined as:
fun op submap (m1,m2) =
all x inset m2 andalso ?m1(x) = ?m2(x) | x in set m1 end
It is declared infix, of precedence 4.
- dom : (''a -m> 'b) -> ''a set computes the domain of a map; it is the identity on sets. It is syntactic sugar
for:
fun dom m = {x | x in set m}
- rng : (''a -m> ''b) -> ''b set computes the range of a map. The range of a non-empty set is always
{()}. rng is syntactic sugar for:
fun rng m = {y | _ => y in map m}
- card : (''a -m> 'b) -> int computes the cardinal of a map. It may be defined as:
fun card {} = 0 | card {_ => _} U m' = 1+card m'
- <| : (''a -m> 'c) ⋆ (''a -m> 'b) -> (''a -m> 'b) is the “domain restrict to” function. If s is a map
and m is another map, then s <| m is the map m, whose domain is restricted to the elements in the domain of s. This
may be defined by:
fun op <| (s,m) = {x => y | x => y in map m such that x inset s}
It is declared infix, right associative of precedence 7.
- <-| : (''a -m> 'c) ⋆ (''a -m> 'b) -> (''a -m> 'b) is the “domain restrict by” function. If s is a
map and m is another map, then s <-| m is the map m, whose domain is restricted to the elements outside the domain
of s. This may be defined by:
fun op <-| (s,m) = {x => y | x => y in map m such that not (x inset s)}
It is declared infix, right associative of precedence 7.
- |> : (''a -m> ''b) ⋆ (''b -m> 'c) -> (''a -m> ''b) if the “range restrict to” function. If s is a
map and m is another map, the m |> s is the map m, where only the maplets x => y such that y is in the domain of s
are considered. This may be defined by:
fun op |> (m,s) = {x => y | x => y in map m such that y inset s}
It is declared infix, left associative of precedence 7.
- |-> : (''a -m> ''b) ⋆ (''b -m> 'c) -> (''a -m> ''b) if the “range restrict by” function. If s is a
map and m is another map, the m |-> s is the map m, where only the maplets x => y such that y is outside the
domain of s are considered. This may be defined by:
fun op |-> (m,s) = {x => y | x => y in map m such that not (y inset s)}
It is declared infix, left associative of precedence 7.
- ++ : (''a -m> 'b) ⋆ (''a -m> 'b) -> (''a -m> b) is the map overwriting operator. It takes two maps
m and m′, and returns a map m′′, whose domain is the union of the domains of m and m′, and which maps every
element x of the domain of m′ to ?m′(x), and every other element x to ?m(x). Thus m′′ is m, over which the
associations of m′ have been written.
To underwrite instead of overwriting, write m′ ++ m instead of m ++ m′. The only difference is that m′ will be
evaluated before m.
++ is declared infix, left associative of precedence 6.
- overwrite : (''a -m> 'b) list -> ''a -m> 'b is the distributed overwriting function. It returns the first
map in the list, overwritten by the second, the third, etc. It is equivalent to:
fun overwrite nil = {}
| overwrite (m1::rest) = m1 ++ overwrite rest
The application of overwrite to a list comprehension is compiled in an optimized form, where the overwritings are
done on the fly, without building the list comprehension first.
- underwrite : (''a -m> 'b) list -> ''a -m> 'b is the distributed underwriting function. It returns the
last map in the list, overwritten by the next to last, the penultimate, etc. It is equivalent to:
fun underwrite nil = {}
| underwrite (m1::rest) = underwrite rest ++ m1
The application of underwrite to a list comprehension is compiled in an optimized form, where the underwritings are
done on the fly, without building the list comprehension first.
Note that overwrite o rev=underwrite, and underwrite o rev=overwrite.
- delta : (''a -m> 'b) ⋆ (''a -m> 'b) -> (''a -m> 'b) computes the symmetric difference of two
maps. This is the overwriting of one map by another, restricted by the intersection of the domains. It may be defined
by:
fun op delta (m,m') = (m' <-| m) ++ (m <-| m')
(or equivalently, (m <-| m') ++ (m' <-| m)). It generalizes the classical notion of symmetric difference of sets.
It is declared infix, left associative of precedence 7.
- choose : (''a -m> 'b) -> ''a chooses an element in the domain of the map in argument. It raises Empty if
the map is empty. It is syntactic sugar for:
fun choose {} => raise Empty
| choose {x => _,...} = x
or for:
fun choose m = case (some x | x in set m end) of
SOME x => x
| NONE => raise Empty
- choose_rng : (''a -m> 'b) -> 'b chooses an element in the range, and raises Empty if the map is empty.
The element chosen in the range is precisely the image by the map of the one chosen by choose, as shows the
equivalent form:
fun choose_rng m = ?m(choose m)
- split : (''a -m> 'b) -> (''a -m> 'b) ⋆ (''a -m> 'b) splits a map in two disjoint maps, whose
union is the original one. In general, the splitting won’t yield maps of equal (or nearly equal) cardinalities. However, the
splitting has the following properties:
- Splitting a map of cardinality greater than or equal to 2 yields two non empty maps. Hence, recursive
procedures that decompose their map arguments with split until its cardinality goes down to 1 or 0 will
always terminate.
- If (m1,m2)=split m, then all elements in the domain of m1 are less than elements in the domain of
m2 in the system order.
- Splitting depends only on the domain, not on the range. That is, assume if dom m=dom m′,
and both (m1,m2)=split m and (m′1,m′2)=split m′, then dom m1=dom m′1 and
dom m2=dom m′2.
- Splitting has the statistical property that if two maps have similar domains (that is, their domains differ
for a small number of elements), recursively splitting both domains will on average unearth an optimum
number of common subdomains. For a more detailed description of this property, refer to [10].
As a consequence, recursive memo functions with map arguments that recurse through splittings of their
arguments should be good incremental functions, recomputing quickly their results on data similar to
previous similar data.
- U : ''a set ⋆ ''a set -> ''a set is the union of two sets. It may be defined as a special case of map
overwriting, because sets are special cases of maps:
fun op U (s : ''a set,s' : ''a set) = s ++ s'
(or equivalently, s' ++ s). It is declared infix, left associative, of precedence 6.
- & : ''a set ⋆ ''a set -> ''a set is the intersection of two sets. It may be defined as a special case of map
domain restriction, because sets are special cases of maps:
fun op & (s : ''a set,s' : ''a set) = s <| s'
It is declared infix, left associative, of precedence 7.
- intersects : (''a -m> 'b) ⋆ (''a -m> 'c) -> bool returns true if its two arguments have domains
with a non-empty intersection. It can be defined as:
fun op intersects (m,m') = not (empty (m <| m'))
It is declared infix, of precedence 4.
- \ : ''a set ⋆ ''a set -> ''a set is the difference of two sets. It may be defined as a special case of the
domain restrict by operator, because sets are special cases of maps:
fun op & (s : ''a set,s' : ''a set) = s' <-| s
It is declared infix, left associative, of precedence 7.
- union : (''a set -m> 'b) -> ''a set is the distributed union of the sets in the domain
of the argument, it is quite similar to the overwrite and underwrite functions. It can be defined
as:
fun union {} = {}
| union {s => _} = s
| union ss = let val (s1,s2) = split ss
in
union s1 U union s2
end
The application of union to a set or map comprehension is compiled in an optimized form, where the unions are done
on the fly, without building the set comprehension first.
- inter : ((''a -m> ''b) -m> 'c) -> (''a -m> ''b) is the distributed intersection of maps (and hence
of sets, too) in the domain of its arguments. It is mostly used when ''b and 'c are both unit, in which case it
computes the distributed intersection of a set of sets. It raises Empty on the empty set. It can be defined
as:
fun inter {} = raise Empty
| inter {s => _} = s
| inter ss = let val (s1,s2) = split ss
in
inter s1 & inter s2
end
The application of inter to a set or map comprehension is compiled in an optimized form, where the intersections are
done on the fly, without building the set comprehension first.
- mapadd : (''a ⋆ 'b) ⋆ (''a -m> 'b) -> (''a -m> 'b) adds one maplet to a map, overwriting the
map. It may be defined as:
fun mapadd ((x,y),m) = m ++ {x => y}
It is only a bit faster than writing m ++ {x => y}.
- mapaddunder : (''a ⋆ 'b) ⋆ (''a -m> 'b) -> (''a -m> 'b) adds one maplet to a map,
underwriting the map. It may be defined as:
fun mapaddunder ((x,y),m) = {x => y} ++ m
It is only a bit faster than writing {x => y} ++ m (and the order of evaluation is different).
- mapremove : ''a ⋆ (''a -m> 'b) -> (''a -m> 'b) removes a maplet from a map. It may be defined
as:
fun mapremove (x,m) = {x} <-| m
It is only a bit faster than writing {x} <-| m.
- inv : (''a -m> ''b) -> (''b -m> ''a) inverses a map. Its definition is the same as:
fun inv m = {y => x | x => y in map m}
so in case m is not invertible, inv returns a map that maps y to the largest x in the system order such that m maps x to
y.
- O : (''b -m> 'c) ⋆ (''a -m> ''b) -> (''a -m> 'c) is the composition of maps. It is precisely
defined by:
fun op O (m,m') = {x => ?m(y) | x => y in map m' such that y inset m}
It is declared infix, left associative of precedence 3. It is the map version of the o function composition
operator.
- to : int ⋆ int -> int set is the range function. If a and b are the first and second argument respectively, to
returns the set of all integers x such that a ≤ x ≤ b. We could have defined to by:
fun op to (a,b) =
let fun f x = if x>b then {} else {x} U f(x+1)
in
f a
end
to is declared infix, left associative of precedence 9, so that we may write a to b.
- mapoflist : 'a list -> (int -m> 'a) converts a list to a map from its indices to its elements. It may be
defined by:
fun mapoflist l =
let fun f(_,nil) = {}
| f(n,a::l) = {n => a} ++ f(n+1,l)
in
f(0,l)
end
- inds : 'a list -> int set computes the set of indices of a list l, i.e, {0,…,len l-1}. It may be defined
by:
fun inds l = 0 to (len l-1)
or by val inds = dom o mapoflist.
- elems : ''a list -> ''a set computes the set of elements of the list in argument. It may be defined
by:
fun elems l = {x | x in list l}
or by val elems = rng o mapoflist.
There is now an alternative data type for maps in GimML, the table type. This implements imperative maps: unlike
the map type, objects of type table are updated in a destructive way, just like hash-tables. Objects of type
table, which we shall just call tables, are not really faster than using applicative references, but they put less
stress on the sharing mechanism and the garbage collector of GimML, which may result in space and even time
savings.
The type of tables mapping objects of type ''a to 'b is
type (''a, 'b) table
You can think as a kind of equivalent of the type (''a -m> 'b) ref. Associated functions are:
- table : unit -> (''_a, '_b) table creates a fresh, empty table. If tables were implemented as objects of
type (''a -m> 'b) ref, this would be equivalent to:
fun table () = ref {};
- t_get : (''a, 'b) table -> ''b -> 'a option reads an element off a table. Precisely, t_get t x
returns SOME y if x is mapped to y by the table t, and NONE if x has no entry in t. If tables were implemented as
objects of type (''a -m> 'b) ref, this would be equivalent to:
fun t_get t x =
SOME (?(!t) x) handle MapGet => NONE;
- t_put : (''a, 'b) table -> ''a ⋆ 'b -> unit, called as t_put t (x, y) adds a new binding from
x to y to the table t, erasing any previously existing binding for x. If tables were implemented as objects of type
(''a -m> 'b) ref, this would be equivalent to:
fun t_put t (x, y) =
t := !t ++ {x => y};
- t_put_behind : (''a, 'b) table -> ''a ⋆ 'b -> unit, called as t_put_behind t (x, y)
adds a new binding from x to y to the table t, except if any binding for x already existed. If tables were implemented as
objects of type (''a -m> 'b) ref, this would be equivalent to:
fun t_put_behind t (x, y) =
t := {x => y} ++ !t;
- t_remove : (''a, 'b) table -> ''a -> unit, removes any entry associated with its second argument
from the table given in first argument. If tables were implemented as objects of type (''a -m> 'b) ref, this would
be equivalent to:
fun t_remove t x =
t := {x} <-| !t;
- t_iter : (''a, 'b) table -> (''a ⋆ 'b -> bool) -> boolis the standard iterator over tables. This
is meant to implement a form of existential quantification, but can be used to loop over the table, and doing some
computation on each entry, just like the iterate quantifier. Calling t_iter t f iterates over all elements of the
table t, in some indefinite order, calls f (x,y) for each entry x => y in t. This stops when the table has been fully
traversed, and then returns false, or after the first call to f returns true, in which case iter t f returns
true.
If tables were implemented as objects of type (''a -m> 'b) ref, this would be equivalent to:
fun t_iter t f =
exists
f (x, y)
| x => y in map !t
end;
This can be used to implement iterate-style loops, by writing
t_iter t (fn (x, y) => (f(x,y); false)); ()
Warning: it is definitely a bad idea to modify the table t while you iterate on it.
- t_collect : (''a, 'b) table -> (''a ⋆ 'b -> (''c -m> 'd)) -> (''c -m> 'd) is an
iterator over tables, just like t_iter. This one is rather meant to implement set and map comprehensions:
t_collect t f iterates over all elements of the table t, in some indefinite order (although it is guaranteed to be the
same as the one used by t_iter), calls f (x,y) for each entry x => y in t, and computes the overwrite of all
maps returned by each call to f.
If tables were implemented as objects of type (''a -m> 'b) ref, this would be equivalent to:
fun t_collect t f =
overwrite [f (x, y)
| x => y in map !t];
For example, getting the contents of the table t as a map can be effected as:
t_collect t (fn (x, y) => {x => y})
Building the map of all x => y in t such that the predicate P (x,y) holds can be done by:
t_collect t (fn (x, y) => if P (x,y) then {x => y} else {})
Warning: it is definitely a bad idea to modify the table t while you iterate on it.
- t_reset : (''a, 'b) table -> unit resets the table in argument. If tables were implemented as objects of
type (''a -m> 'b) ref, this would be equivalent to:
fun t_reset t = (t := {});
4.4 Refs and Arrays
Refs and arrays are the fundamental mutable data structures of GimML, as in Standard ML of New Jersey (only refs exist in the
definition of Standard ML). Ref and array types always admit equality, even if their argument types do not. Equality is decided
according to the following rules:
- Every ref or array is equal to itself.
- Any newly created ref (by ref), or newly created array (by array or arrayoflist or iarray or
iarrayoflist), is different from any other ref or array.
In short, refs and arrays are not shared, and are compared by their
addresses.
Some syntax extensions are defined to allow you to write a more readable code to handle arrays. In particular:
There is however no similar syntactic sugar for non-polymorphic integer arrays (of type intarray), only for polymorphic
arrays (of type 'a array, for any 'a).
The following exception is defined:
exception Subscript
It is raised when accessing an element outside of an array, by sub or update, or by isub or iupdate.
The following functions are provided:
- ref : '1a -> '1a ref creates a new mutable reference.
- := : 'a ref ⋆ 'a -> unit modifies the value of a reference, that is, assigns the value of the second
argument to the reference in first argument. It is declared infix, of precedence 3.
- ! : 'a ref -> 'a is the dereferencing function. It gets the value stored inside a reference. It could have
been defined as:
fun !(ref v) = v
because ref is understood in ML as a (fake) datatype constructor.
- array : int ⋆ '1a -> '1a array creates an array with n elements equal to e, where n and e
are the arguments. If n is negative, the exception Subscript is raised. n is called the length of the
array.
- sub : 'a array ⋆ int -> 'a dereferences the nth element of the array a (indices start at 0), where a and n are
the arguments. If n is negative or greater than or equal to the length of the array, the exception Subscript is
raised.
A more readable notation for sub (a,n) is a.(n).
- update : 'a array ⋆ int ⋆ 'a -> unit modifies the nth element of the array a, replacing it by the value e,
where a, n and e are the arguments. If n is negative or greater than or equal to the length of the array, the exception
Subscript is raised.
A more readable notation for update (a, n, e) is a.(b) .:= e.
- length : 'a array -> int gets the length of the array in argument.
- arrayoflist : '1a list -> '1a array converts a list into an array with the same elements in the same
order. The length of the resulting array is exactly the length of the argument list.
- iarray : int ⋆ int -> intarray creates an array with n integer elements equal to e, where n and e
are the arguments. If n is negative, the exception Subscript is raised. n is called the length of the
array.
- isub : intarray ⋆ int -> int dereferences the nth element of the integer array a (indices start at 0), where a
and n are the arguments. If n is negative or greater than or equal to the length of the array, the exception Subscript is
raised.
- iupdate : intarray ⋆ int ⋆ int -> unit modifies the nth element of the array a, replacing it by the
integer value e, where a, n and e are the arguments. If n is negative or greater than or equal to the length of the array, the
exception Subscript is raised.
- ilength : intarray -> int gets the length of the array of integers in argument.
- iarrayoflist : int list -> intarray converts a list into an array with the same integer elements in the
same order. The length of the resulting array is exactly the length of the argument list.
4.5 Strings
Strings are ordered sequences of characters (we call size of the string the length of the sequence). ML does not provide a type
of characters, though one-character strings may serve this purpose. Strings are assumed coded according to the 7-bit ASCII
standard. However, characters are usually 8 bits wide: the characters of code greater than 127 are interpreted in a
system-dependent fashion. The string length is encoded separately from the sequence of characters that composes it: no special
character is reserved to represent the end of a string, in particular, any string may contain the NUL character
(\000).
The following exceptions are defined:
- exception Ascii, raised when using a number that is not the code of an ASCII code (with 8-bit characters,
a number that is not an integer between 0 and 255).
- exception StringNth, raised when accessing an element outside a string.
The functions are:
- explode : string -> string list converts a string into the list of its characters, where characters
are represented as one element strings. For instance explode "abc" is ["a","b","c"].
- implode : string list -> string is the inverse of explode. If given a list of characters
(one-character strings), it produces the corresponding string. Actually, it accepts any list of strings, whatever
their size, and concatenates them, as a generalization of the intended semantics.
- ^ : string ⋆ string -> string concatenates two strings. It could have been defined as:
fun op ^ (s,s') = implode (explode s @ explode s')
(or implode [s,s'], noticing the generalization of implode though ^ is more efficient. ^ is declared infix, right
associative of precedence 6.
- concat : string list -> string is the distributed concatenation of strings. Because of the general character
of implode, implode and concat are synonymous.
- size : string -> int computes the size of a string. This may be defined by:
fun size s = len (explode s)
although it does not build the exploded list.
- chr : int -> string builds the string having as only character the character with code n given as argument. If n
is not the code of a character, the exception Ascii is raised.
- ord : string -> int gets the code of the first character of the string. The exception StringNth is raised if the
string is empty ("").
- ordof : string ⋆ int -> int gets the code of the n + 1th character in the string s, where s and n are the
arguments. The index n starts from 0. If n < 0 or n is greater than or equal to the size of s, the exception StringNth is
raised. This is equivalent to:
fun ordof (s,n) = ord (explode s nth n) handle Nth => raise StringNth
- substr : string ⋆ int ⋆ int -> string extracts a substring from the string s, given as first argument.
Call i the second argument, and j the third. substr(s,i,j) returns the string of all characters from indices i included
to j excluded. substr raises StringNth if i < 0 or j >len s, or i or j is not integer. It may be defined
as:
fun substr (s,i,j) =
let fun f k =
if k>=j
then ""
else chr (ordof (s,k)) ^ f(k+1)
in
f i
end
- strless : string ⋆ string -> bool compares strings in lexicographical order, the order on characters
being defined by the order on their codes. Equivalently:
fun op strless (s,s') =
let fun less (_,nil) = false
| less (nil,_) = true
| less (c::l,c'::l') =
ord c<ord c' orelse
ord c=ord c' andalso less(l,l')
in
less (explode s,explode s')
end
strless is declared infix, of precedence 4.
- intofstring : string -> int converts the input string into an integer. This assumes the integer is written in
decimal; minus signs may be written in standard notation (-) or as in ML (~). No error is returned, even if the input
string is not the decimal representation of an integer. To check this, call regexp "[~-]?[0-9]+"
first.
Note: to convert back an integer i to a string, use
let val f as |[convert, ...]| = outstring ""
in
print f (pack (i:int));
convert ()
end
- numofstring : string -> num converts the input string into a real number. This assumes the real is written in
decimal; minus signs may be written in standard notation (-) or as in ML (~). No error is returned, even if the input
string is not the decimal representation of a real. To check this, call
regexp "[~-]?[0-9]+(\\.[0-9]⋆)?([eEfFgG][~-]?[0-9]+)?"
first.
Note: to convert back a number x to a string, use
let val f as |[convert, ...]| = outstring ""
in
print f (pack (x:num));
convert ()
end
4.6 Regular expressions
GimML includes a sophisticated multiple regular expression matching engine. This rests on two abstract types:
- regex is the type of regular expressions; the function re_parse converts from a string such as "^a(b⋆)c$"
to a regular expression proper.
- 'a nfa is the type of non-deterministic finite automata, with output of type 'a. Non-deterministic finite
automata are build from regular expressions through re_make_nfa, and are run against a string by calling
nfa_run.
The following exception is defined:
- exception RE of int, raised with an integer error code when the regular expression parser re_parse
encounters an error. The remsg function can be used to get a plain text description of the error. The numbers
are:
- 1: the regular expression is not terminated, maybe there is a missing closing parenthesis or bracket;
- 2: expected a closing brace }, but found another character;
- 3: expected a comma or a closing brace, but found another character;
- 4: a repetition interval {i,j} was given, but j < i;
- 5: unused (was meant to indicate that a repetition interval {i,j} specified too large a value for j, namely
j ≥ 256; this has been deactivated, although it is not a good idea to use too large a value of j);
- 6: expected a closing parenthesis ), but found another character;
- 7: found a lone backslash \, not followed by any character;
- 8: the regular expression contains several branches (separated by |), but not all branches contain the same
number of submatches (groups enclosed between ( and )).
The functions are:
- re_parse : string -> regex reads a string, and parses it as a regular expression. A regular expression
is really something like a parse tree. The type regex is abstract, and one cannot actually inspect a regular
expression. May raise RE n, with n ranging from 1 to 8, corresponding to various kinds of syntax errors. Call
remsg n in order to obtain a short description of the error that happened.
The syntax of regular expressions is that of POSIX 1003.2. On Unix systems, type man re_format for a
documentation. The whole standard is implemented, including character classes [:alnum:], [:alpha:],
[:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:],
[:space:], [:upper:], [:xdigit:], but not collating sequences or equivalence classes. It also does not
implement the extensions [[:<:]] and [[:>:]], which are outside of POSIX 1003.2.
- remsg : int -> string converts an error number, as returned in exception RE n, to a short description
of the error.
- re_make_nfa : (regex ⋆ (string ⋆ intarray -> 'a)) list -> 'a nfa converts a list
of regular expressions, each paired with an action, into a non-deterministic finite automaton. The simplest case
is when the list contains just one pair (re,f) of one regular expression re (as obtained through re_parse)
and one action f : string ⋆ intarray -> 'a. This will produce a non-deterministic finite automaton
that matches just the regular expression re (using nfa_run), and if successful, will call f. The function f is
called on the string s passed to nfa_run and on an array of integers a that holds information about the match:
isub(a,0) and isub(a,1) holds the start and end position of the substring matched by re; if there are any
groups in re, then isub(a,2) and isub(a,3) will hold the start and end position of the first group matched,
isub(a,4) and isub(a,5) will hold the start and end position of the second group matched, and so on.
In general, re_make_nfa takes a list of such pairs (re,f). Matching then proceeds by calling the function f
of the first pair that matches. This is much more efficient than building a non-deterministic finite automaton for
each regular expression re, and running them one after another: the non-deterministic finite automaton built by
re_make_nfa shares the computations involved in matching all the regular expressions in the list.
- nfa_run : 'a nfa ⋆ string -> 'a option runs the non-deterministic finite automaton in first
argument (as produced by re_make_nfa on a list [(re1,f1),…,(ren,fn)] of pairs of regular expressions and
actions) against the string s given as second argument. If the string is matched by one of the regular expressions
rei, then it picks the first one, namely the one with the smallest index i, it calls the action fi on s and the
intarray of positions (see re_make_nfa), getting an object x: 'a; and it returns SOME x. Otherwise,
returns NONE.
- nfa_run_substr : 'a nfa ⋆ string ⋆ int ⋆ int -> 'a option words as nfa_run,
except runs on the substring of the second argument between positions given as third and fourth
arguments. An important difference between calling nfa_run_substr (nfa,s,i,j) and calling nfa_run
(nfa,substr(s,i,j)), apart from the fact that the former is faster, is that positions reported in the intarrays
given to the actions are relative to the original string s, not the substring substr(s,i,j).
- re_subst : string ⋆ intarray ⋆ int -> string takes a string s, an intarray as submitted
to an action f given to re_make_nfa, and an index n, and returns the substring of s at the position matched
by the nth group in the corresponding regular expression. The expression re_subst(s,a,n) is equivalent to
substr(s,isub(a,2 * n),isub(a,2 * n + 1)).
Let us give a few examples. First, we build a regular expression that matches substrings of the form ab followed by an
arbitrary number of letters c, followed by one d:
val r = re_parse "abc⋆d";
We build the corresponding non-deterministic finite automaton:
val nfa1 = re_make_nfa [(r, fn _ => ())];
The action fn _ => () does nothing. Let us run the non-deterministic finite automaton nfa1:
nfa_run (nfa1, "toto"); (⋆ fails, i.e., returns NONE ⋆)
nfa_run (nfa1, "abccd"); (⋆ succeeds, i.e., returns SOME () ⋆)
nfa_run (nfa1, "xyaabdz"); (⋆ succeeds, too ⋆)
In the last example, notice that nfa_run looks for substrings that match the regular expression. In order to match the string
itself, use "^abc⋆d$" instead; the symbol ^ matches the empty substring provided it occurs at the beginning of the string, and
$ matches the empty substring at the end of the string.
Let us modify our non-deterministic finite automaton in order to return the positions of the substring that
matches:
val nfa2 = re_make_nfa [(r, fn (_, a) => (isub(a,0), isub(a,1)))];
Then nfa_run (nfa2, "xyaabdz") should return SOME (3,6).
Let us proceed with a more complicated example. Let us write:
val r = re_parse "^([0-9]+) (.⋆)";
val nfa = re_make_nfa [(r, fn (_, a) => a)];
val s = "80 action: dismiss";
val SOME a = nfa_run (nfa, s);
Subexpressions between parentheses such as ([0-9]+) and (.⋆) are groups. The first one matches any non-empty
sequence of digits, the second one matches any sequence of characters. Note that r specifies a space that should be matched
between the two groups. (In order to match a parenthesis, write \\( or \\), and similarly with square brackets.) Now look at
the parts that have been matched in the string s:
re_subst (s,a,0); (⋆ returns the whole substring that was matched ⋆)
re_subst (s,a,1); (⋆ returns the substring matched by the first group ⋆)
re_subst (s,a,2); (⋆ returns the substring matched by the second group ⋆)
Let us finish by an example of simultaneous matching of several regular expressions. This is an excerpt from an actual
application, which analyzes messages from the Linux log auditing mechanism.
val r1 = re_parse "^(Accepted|Failed|Postponed) (password|publickey|hostbased)\
\ for ([^ ]+) from ([0-9.]+) port ([0-9]+) (.⋆)$";
val r2=re_parse "^(Accepted|Failed|Postponed) (password|publickey|hostbased)\
\ for (invalid user|illegal user) ([^ ]+)\
\ from ([0-9.]+) port ([0-9]+) (.⋆)$";
val r3 = re_parse "^(Accepted|Failed) (rhosts-rsa)\
\ for ([^ ]+) from ([0-9.]+) port ([0-9]+) ruser (.⋆)$";
val r4 = re_parse "^userauth_hostbased mismatch:\
\ client sends ([^,]+), but we resolve ([0-9.]+) to (.⋆)$";
val nfa = re_make_nfa [(r1, fn (_,a) => (1,a)),
(r2, fn (_,a) => (2,a)),
(r3, fn (_,a) => (3,a)),
(r4, fn (_,a) => (4,a))];
val s = "Failed rhosts-rsa for azumi from 192.10.1.7 port 443 ruser zorro";
val SOME (n, a) = nfa_run (nfa, s);
Run it by yourself. Why is n equal to 3? Which regular expression was matched successfully? Type re_subst(s,a,1), …,
re_subst(s,a,6) in order to see which substrings were matched.
4.7 Integer Arithmetic
GimML integers are machine integers, that is, integers in the range [min_int,max_int], and are represented with nbits
bits.
Exceptions on integer operations can be:
- exception Arith, which is raised whenever an arithmetic error occurs, for example division by 0.
(Overflow in addition or multiplication is normally not dealt with, and just produces wrong results—more
precisely, results modulo the word size—on most machines. This is implementation-dependent.)
Non-negative integers are 0, 1, 2, …, 42, …Negative integers are written with the ~ negation operator in
front.
The following operations are defined. Basic operations like +, -, etc. are not overloaded as in Standard ML. For example, +
is always an integer function. For analogous numerical (floating-point) functions, see Section 4.8. For example, floating-point
addition is #+, not +.
- + : int ⋆ int -> int is integer addition. It is declared infix, left associative of precedence 6.
- - : int ⋆ int -> int is integer subtraction. It is declared infix, left associative of precedence 6.
- ⋆ : int ⋆ int -> int is integer multiplication. It is declared infix, left associative of precedence 7.
- ~ : int -> int is integer negation.
- sqr : int -> int squares its argument.
- abs : int -> int computes the absolute value of its argument.
- < : int ⋆ int -> bool returns true if its first argument is less than its second argument.
- > : int ⋆ int -> bool returns true if its first argument is greater than its second argument.
- <= : int ⋆ int -> bool returns true if its first argument is less than or equal to its second argument.
- >= : int ⋆ int -> bool returns true if its first argument is greater than or equal to its second argument.
- min : '#a ⋆ '#a -> '#a returns the least of its two arguments.
- max : int ⋆ int -> int returns the greatest of its two arguments.
- div : int ⋆ int -> int is quotient extraction. It is declared infix, left associative of precedence 7. The
definition conforms to the definition of Standard ML, where the quotient has the smallest possible absolute value
(the remainder has the same sign as x).
- mod : int ⋆ int -> int is the remainder operation.
It is syntactic sugar for fn (x,y) => x - y⋆(x div y). It is declared infix, left associative of
precedence 7.
- divmod : int ⋆ int -> int ⋆ int returns both the quotient and remainder of the integers in
argument. This could be defined as fn (x,y) => (x div y,x mod y), but is normally faster.
- bor : int ⋆ int -> int computes the binary inclusive or (the disjunction) of the two integers in
argument. There is a one bit at some position in the result if there was a one at the same position in at least one
of the argument integers.
- bxor : int ⋆ int -> int computes the binary exclusive or of the two integers in argument. There is
a one bit at some position in the result if there was a one at the same position in exactly one of the argument
integers.
- band : int ⋆ int -> int computes the binary and (the conjunction) of the two integers in argument.
There is a one bit at some position in the result if there was a one at the same position in both of the argument
integers.
- bnot : int -> int computes the binary negation of the integer in argument. There is a one but at some
position in the result if there was a zero at the same position in the argument.
- lsl : int ⋆ int -> int computes the logical shift left of the first integer by the number of bits specified
in the second argument. Its behavior is unspecified if the number of bits is negative.
- lsr : int ⋆ int -> int computes the logical shift right of the first integer by the number of bits
specified in the second argument. Its behavior is unspecified if the number of bits is negative.
- asr : int ⋆ int -> int computes the arithmetic shift right of the first integer by the number of bits
specified in the second argument (arithmetic means that the sign bit is kept as is, and propagates through the
shifting). Its behavior is unspecified if the number of bits is negative.
- inc : int ref -> unit increments the integer pointed to by the reference in argument.
- dec : int ref -> unit decrements the integer pointed to by the reference in argument.
- rand : unit -> int draws a pseudo-random equidistributed number in the integer interval [min_int,
max_int]. The method used is based on a generalized feedback shift register [7], with generating polynomial
x55 + x24 + 1. The period is 255 - 1, which means that, practically, the generator never loops back to any
previous state.
- irand : int -> int draws a pseudo-random equidistributed number in the integer interval [0,x[,
where x is the given argument. If x ≤ 0, it instead draws an equidistributed pseudo-random integer in
[0,max_int] ∪ [min_int,x[ (think of integers as two’s complement integers). So if x = 0, this is equivalent
to rand. This calls zrand anyway.
This can be used to draw equidistributed integers in [a,b[ (with a < b) by calling a + irand(b - a).
4.8 Numerical Operations
The real and imaginary parts of numbers are floating-point numbers, which are classified according to the IEEE754 standard. A
floating-point number may be:
The fact that an infinity or a Nan is produced (in IEEE754 systems), or an Arith exception is raised (in non-IEEE754
systems) by a computation means usually that the computation is wrong. In this case, the exception system is
better suited to find numerical bugs, provided we have a means of locating the place where the exception is
raised.
However, in production code, these error conditions may still crop up for limit situations. It is then unacceptable for code to
stop functioning because of this: so infinities and NaNs should be used. In general, users had better use a system
obeying the IEEE754 standard (or its successors). The IEEE754 standard defines trapping mechanisms to be
used for debugging numerical codes; they are not used in GimML now, but may be in a future version of the
debugger.
Classification of numbers is accomplished with the auxiliary types:
datatype numClass = MNaN of int | MInf | MNormal | MDenormal
| Zero | PDenormal | PNormal | PInf | PNaN of int
numClass encodes both the class and the sign of a real number. In the case of a NaN, the NaN code is also provided as an
integer. NaN codes are system-dependent; only NaN codes from 1 to 7 are recognized in GimML; unrecognized codes are
represented by 0. NaN codes and the constructors MNaN, MInf, MDenormal, PDenormal, PInf, PNan are never used in a
non-IEE754 system. The exception:
exception Arith
is raised for all arithmetical errors in non-IEEE754 systems, and is never used in IEEE754 systems.
On a different subject, remember that testing complex numbers for equality is hazardous. Therefore, = is not to be used
lightly on numbers (for example, 0.2 ⋆ 5 may print as 1 but differ from 1 internally). Normally, though,
computations on sufficiently small integers should be exact (at least on real IEEE754 systems). This warning applies
not only to the explicit = equality function, but also to all operations that use internally the equality test, in
particular the test for being inside a set inset or the application of a map to an object ?, when this object contains
numbers.
Moreover, following the principle, valid in GimML, that an object is always equal to itself, +inf = +inf,
~inf = ~inf, and all NaNs are equal to themselves, though they are incomparable with any number (for example,
+NAN(0) <= +NAN(0) is false, but +NAN(0) = +NAN(0) is true).
Numerical constants are entered and printed in GimML as x or x:y, where x and y represent real numbers, possibly
followed by a scale. The notation x:y denotes the complex number x + i.y. There can be no confusion with the : character
used in type constraints, because no type may begin with a character lying at the start of a number. A real number is defined by
the regular expression (in lex syntax):
where DIG = [0-9] and INT = (\~?{DIG}+), with the obvious interpretations. Infinities and NaNs cannot be entered
in a non-IEEE754 system.
The exception:
is raised whenever an operation defined only on real values is applied on a complex with a non-zero imaginary part (even if
it is denormalized).
The numerical functions are:
- classify : '#a -> numClass ⋆ numClass classifies the real and imaginary parts of the numerical
value in argument.
- denormal : '#a -> bool returns true if and only if its argument is zero or a denormalized number. That
is, it returns true if and only if classify returns a pair of classifications, which are both MDenormal, Zero
or PDenormal. denormal is useful because dividing by a number is in fact legal only when this number
is not denormal, in the IEEE754 representation system. In short, being denormalized is the right criterion for
testing whether a number if invertible (and not comparison with 0).
- overflown : '#a -> bool returns true if and only if its argument is infinite or a NaN. That is, it returns
true if and only if its real part or its imaginary part is infinite or a NaN. This function is useful to detect when
a computation has produced an infinite number; usually, when doing some operations on infinite numbers, like
subtraction, the results are NaNs, hence the test for being either infinite or a NaN.
- #+ : '#a ⋆ '#a -> '#a is addition. It is declared infix, left associative of precedence 6.
- #- : '#a ⋆ '#a -> '#a is subtraction. It is declared infix, left associative of precedence 6.
- #⋆ : '#a ⋆ '#b -> '#a ‘ '#b is multiplication. It is declared infix, left associative of precedence 7.
- #/ : '#a ⋆ '#b -> '#a ‘ '#b^ ~1 is division; it does not raise an exception when dividing by 0 or
a denormalized, except on non-IEEE754 systems (the Arith exception). It is declared infix, left associative of
precedence 7.
- #^ : num ⋆ num -> num is exponentiation (as in FORTRAN); the typing engine knows a special case
about this one: when n is a constant number and e is an expression, then #^ in the expression e #^ n is given
type '#a ⋆ num -> '#a^n. It is declared infix, right associative of precedence 8.
We could almost define #^ as fn (a,b) => exp (b #⋆ log a), except for rounding mechanisms and
type inference.
- #~ : '#a -> '#a is negation.
- pi : num is the famous number π.
- fsqr : #a -> #a^2 squares its argument.
- fsqrt : '#a -> '#a^0.5 takes the square root of its argument. It is the principal determination of the
square root, defined by
=
.ei.θ∕2
for r ≥ 0, -π < θ ≤ π.
- fabs : '#a -> '#a computes the norm of its argument. The norm of z is |z| =
, where z is the
complex conjugate of z. When z is real, the norm of z is simply its absolute value.
- conj : '#a -> '#a computes the conjugate of a complex numerical value. The complex conjugate of
x + i.y is x + i.y = x - i.y. When z is real, the conjugate of z is z itself.
- re : '#a -> '#a computes the real part of its argument. It is always a real.
- im : '#a -> '#a computes the imaginary part of its argument. It is always a real.
- #< : '#a ⋆ '#a -> bool returns true if its first argument is less than its second argument, considered as
real quantities. The exception Complex is raised if one of the numbers was complex. Note that comparison of
NaNs may yield surprising results. For quantities with dimensions, the arguments are scaled to the default scale
of the dimension '#a, and then compared. Confusing results can ensue if you are using negative scaling factors.
- #> : '#a ⋆ '#a -> bool returns true if its first argument is greater than its second argument, considered
as real quantities. The exception Complex is raised if one of the numbers was complex. Note that comparison
of NaNs may yield surprising results. For quantities with dimensions, the arguments are scaled to the default
scale of the dimension '#a, and then compared. Confusing results can ensue if you are using negative scaling
factors.
- #<= : '#a ⋆ '#a -> bool returns true if its first argument is less than or equal to its second argument,
considered as real quantities. The exception Complex is raised if one of the numbers was complex. Note that
comparison of NaNs may yield surprising results. For quantities with dimensions, the arguments are scaled
to the default scale of the dimension '#a, and then compared. Confusing results can ensue if you are using
negative scaling factors.
- #>= : '#a ⋆ '#a -> bool returns true if its first argument is greater than or equal to its second
argument, considered as real quantities. The exception Complex is raised if one of the numbers was complex.
Note that comparison of NaNs may yield surprising results. For quantities with dimensions, the arguments are
scaled to the default scale of the dimension '#a, and then compared. Confusing results can ensue if you are
using negative scaling factors.
- fmin : '#a ⋆ '#a -> '#a returns the least of its two arguments, considered as real quantities. The
exception Complex is raised if one of the numbers was complex. Comparison of NaNs may give some surprises,
as well as the use of negative scaling factors. (See <.)
- fmax : '#a ⋆ '#a -> '#a returns the greatest of its two arguments, considered as real quantities. The
exception Complex is raised if one of the numbers was complex. Comparison of NaNs may give some surprises,
as well as the use of negative scaling factors. (See >.)
- exp : num -> num is the exponential, or natural anti-logarithm.
- log : num -> num is the principal determination of the natural logarithm (base e). In the complex plane,
log(r.ei.θ) = log r + i.θ for r ≥ 0 and -π < θ ≤ π.
- exp1 : num -> num is almost syntactic sugar for fn x => exp x #- 1.0, but gives precise results
even when the argument tends to zero.
- log1 : num -> num is almost syntactic sugar for fn x => log (1.0 #+ x), but gives precise
results even when the argument tends to zero.
- sin : num -> num is the (trigonometric) sine.
The default angle scale is the radian (defined as scale 1‘r = 1), but of course the scale system may be used
to input angles in any other unit.
- cos : num -> num is the (trigonometric) cosine. The default angle scale is the radian.
- tan : num -> num is the (trigonometric) tangent, quotient of sine by cosine.
- asin : num -> num is the principal determination of the arc sine. In the complex plane, asinx is defined
as -i.log(i.z +
).
- acos : num -> num is the principal determination of the arc cosine. In the complex plane, acosx is defined
as -i.log(z +
).
- atan : num -> num is the principal determination of the arc tangent. In the complex plane, atanx is
defined as i∕2.log i+z
i-z .
- sh : num -> num is the hyperbolic sine.
Equivalently, it might be defined as fn z => 0.5 #⋆ (exp z #- exp (#~ z)), except that this
would not be as precise near 0.
- ch : num -> num is the hyperbolic cosine function.
It could have been defined as fn z => 0.5 #⋆ (exp z #+ exp (#~ z)).
- th : num -> num is the hyperbolic tangent, quotient of the hyperbolic sine by the hyperbolic cosine.
- ash : num -> num is the principal determination of the argument hyperbolic sine. We have ashz =
log(z +
).
- ach : num -> num is the principal determination of the argument hyperbolic cosine. We have achz =
log(z +
).
- ath : num -> num is the principal determination of the argument hyperbolic tangent. We have athz =
1∕2.log 1+z
1-z .
- arg : num -> num is the principal determination of the argument. This is always a real in the semi-closed
interval ] - π,π]. It could be defined as fn x => im(log x).
- floor : num -> num computes the lower integer part, or floor, of a real: this is the greatest integer less
than or equal to the argument. It raises Complex if the argument is not real.
- ceil : num -> num computes the upper integer part, or ceiling, of a real: this is the smallest integer
greater than or equal to the argument. It raises Complex if the argument is not real. It is syntactic sugar for
fn x => #~ (floor (#~ x)).
- fdiv : '#a ⋆ '#a -> num is quotient extraction, it always returns an integer-valued real. This is
syntactic sugar for fn (x,y) => floor(re(x/y)). It is declared infix, left associative of precedence 7.
- fmod : '#a ⋆ '#a -> '#a is the remainder operation.
It is syntactic sugar for fn (x,y) => x #- y #⋆ (x fdiv y). It is declared infix, left associative of
precedence 7.
- fdivmod : '#a ⋆ '#a -> num ⋆ '#a returns both the quotient and remainder of the numerical
quantities in argument. This could be defined as fn (x,y) => (x fdiv y,x fmod y). fdivmod is
not especially quicker than computing fdiv and fmod separately, because the three functions maintain a
common one-entry cache for quotients and remainders.
- ldexp : '#a ⋆ int -> '#a applied to x and n computes x.2n.
- frexp : num -> num ⋆ int applied to a real number x≠0 returns a real number y of the same sign as x
and an integer n such that 0.5 ≤|y| < 1 and x = y.2n. If x = 0.0, returns (0.0,0). In general, if x is complex
and ℜx≠0, then it returns (y,n) such that 0.5 ≤|ℜy| < 1 and x = y.2n; if x is complex and ℜx = 0, then it
returns (x,0).
- modf : num -> num ⋆ num splits its real argument into integer and fractional part. If its argument is not
real, then it raises Complex. This function might be defined as fn x => (floor x, x #- floor x)
on non-negative numbers, but it differs on negative numbers; e.g., modf ~3.15 is (~3, ~0.15), not
(~4, 0.85).
- real : '#a -> bool tests whether the argument is real, that is, if it has a zero imaginary part. This is the
same as fn z => im z=0.0.
- integer : num -> bool tests whether the argument is an integer (in particular, real). This is the same as
fn z => floor(re z)=z.
- natural : num -> bool tests whether the argument is a natural number.
This is the same as fn z => integer z andalso z>=0.0.
- int : num -> int returns the integer value of the number in argument, as an integer. It raises Complex
if the argument is a complex number. If the argument is real, but not in the range [min_int,max_int], the
result is unspecified. If the argument is not an integer, rounding direction is unspecified.
- num : int -> num converts an integer to a real number having the same value.
- random : unit -> num draws a pseudo-random equidistributed number in the real interval [0,1[. The
method used is based on a generalized feedback shift register [7], with generating polynomial x55 + x24 + 1.
The period is 255 - 1, which means that, practically, the generator never loops back to any previous state.
- zrandom : unit -> num draws a pseudo-random equidistributed number in the complex square [0,1[×
[0,1[. The method used is the same as for random.
- maybe : unit -> bool draws a pseudo-random equidistributed boolean. The method used is the same as
for random and zrandom, and is quicker than the semantically equivalent
fn () => random() >= 0.5.
4.9 Large Integer Arithmetic
GimML includes a port of Arjen Lenstra’s large integer package, which provides arbitrary precision arithmetic, i.e., arithmetic
over a type Int that represents actual integers, without any size limitation as with the int type.
The exception:
exception Lip of int
is raised whenever an error occurs in any of the functions of this section. The numbers as arguments to the Lip exception are
as follows:
- 1: division by zero;
- 2: modulus is zero in modular arithmetic functions;
- 3: bad modulus in Montgomery arithmetic functions—this means the modulus is 0, negative or positive but
even;
- 4: Montgomery arithmetic modulus is undefined—use zmontstart to define it first;
- 5: moduli in the chinese remaindering function zchirem are not coprime;
- 6: a function was called that expected a positive argument, and got a zero or negative argument; this can be
raised with zln, zsqrt, and a few other functions like zchirem;
- 7: zbezout was called with both arguments zero, or zroot was called to compute an nth root with n = 0;
- 8: a bug occurred in zbezout, please report it to goubault@lsv.ens-cachan.fr, see MAINTENANCE
at the end of the OPTIONS file;
- 9: zchirem was called with identical moduli but different remainders;
- 10: an exponentiation function was called with a negative exponent;
- 11: the square root or some nth root with n even of a negative number was attempted;
- 12: a bug occurred in zpollardrho, please report it to goubault@lsv.ens-cachan.fr, see
MAINTENANCE at the end of the OPTIONS file;
- 13: the second argument to zjacobi or zsjacobi is even;
- 14: zrandompprime was given a non-positive value for q;
- 15: a bug occurred in zecm, please report it to goubault@lsv.ens-cachan.fr, see MAINTENANCE at
the end of the OPTIONS file.
- 16: a wrong base was supplied to one of the functions converting integers to a list of digits (e.g., zstobas); a
wrong base is one that is < 2.
- 17: too many small primes were requested.
- 18: one of the random prime generator functions failed.
- 19: zecm failed to factor the input argument, but did not estimate it was prime with any sufficiently high
probability.
The functions are classified into several subgroups.
Conversions
- Int : int -> Int converts a machine integer to an actual integer.
- zint : Int -> int converts an actual integer to a machine integer. In case of overflow, no exception is
raised: on machines with integers in two’s complement arithmetic on k bits (which is most common), zint n
the unique m, -2k-1 ≤ m < 2k-1, such that n = m mod 2k.
- znum : Int -> num converts an integer to an approximate floating-point value. Overflows are not checked,
and may yield unspecified results.
- zsbastoz : int ⋆ int list -> Int converts a list of digits in a given base to an integer. More
precisely, zsbastoz (b,[an,…,a1,a0]) = anbn + … + a1b + a0.
- zbastoz : Int ⋆ Int list -> Int converts a list of digits in a given base to an integer. More
precisely, zbastoz (b,[an,…,a1,a0]) = anbn + … + a1b + a0.
- zstobas : int ⋆ Int -> int list converts an integer to a list of digits in the given base. More
precisely, given a base b and an integer k, returns a list [an,…,a1,a0] such that 0 ≤ ai < b for every i, 0 ≤ i ≤ n,
an≠0 if k≠0, and anbn + … + a1b + a0 = |k|. Converts k = 0 to nil, and raises Lip 16 if b < 2.
- zstosymbas : int ⋆ Int -> int list converts an integer to a list of digits in the given base. More
precisely, given a base b and an integer k, returns a list [an,…,a1,a0] such that |ai|≤ b∕2 for every i, 0 ≤ i ≤ n,
an≠0 if n≠0, and anbn + … + a1b + a0 = |k|. Converts k = 0 to nil, and raises Lip 16 if b < 2.
- ztobas : Int ⋆ Int -> Int list converts an integer to a list of digits in the given base. More
precisely, given a base b and an integer k, returns a list [an,…,a1,a0] such that 0 ≤ ai < n for every i, 0 ≤ i ≤ n,
an≠0 if n≠0, and anbn + … + a1b + a0 = |k|. Converts k = 0 to nil, and raises Lip 16 if b < 2.
- ztosymbas : Int ⋆ Int -> Int list converts an integer to a list of digits in the given base. More
precisely, given a base b and an integer k, returns a list [an,…,a1,a0] such that |ai|≤ b∕2 for every i, 0 ≤ i ≤ n,
an≠0 if n≠0, and anbn + … + a1b + a0 = |k|. Converts k = 0 to nil, and raises Lip 16 if b < 2.
Large Integer Arithmetic
- zcompare : Int ⋆ Int -> int compares two integers, returns -1 if the first is less than the second, 0
if they are equal, and 1 if the first is greater than the second.
- zneg : Int -> Int returns the opposite of the integer in argument.
- zabs : Int -> Int returns the absolute value of the argument.
- zsign : Int -> int returns the sign of the argument, -1 if negative, 0 if zero, 1 if positive.
- zsadd : Int ⋆ int -> Int adds two integers. The first is a large integer, the second a machine integer.
- zadd : Int ⋆ Int -> Int adds two large integers.
- zsub : Int ⋆ Int -> Int takes two large integers m and n, and returns m - n.
- zsmul : Int ⋆ int -> Int multiplies two integers. The first is a large integer, the second a machine
integer.
- zmul : Int ⋆ Int -> Int multiplies two large integers. This uses Karatsuba’s algorithm, which
computes the product of two n-digit numbers in O(n1.58) steps. This is faster than naive multiplication (O(n2)
steps), and in theory is slower than Toom-Cook multiplication (O(n23.5
) steps) or Fast Fourier Transform
algorithms (O(nlog n) steps). But the latter are more complex, and start to be faster in practice for rather large
integers only.
- zsqr : Int -> Int takes the square of the large integer given as argument. This is faster than multiplying
the argument by itself, and uses a variant of Karatsuba’s algorithm.
- zsdivmod : Int ⋆ int -> Int ⋆ int computes the quotient and remainder of two integers given as
argument. The dividend m is a large integer, while the divisor n is a machine integer. This returns (q,r), where
m = nq+r, |r| < |n|, and the sign of r is the same as n. Note that this is not the same condition as for divmod.
Raises Lip 1 if n = 0.
- zsmod : Int ⋆ int -> int computes the remainder of two integers given as argument. The dividend
m is a large integer, while the divisor n is a machine integer. This returns r, where m = nq + r, |r| < |n|, and
the sign of r is the same as n. Note that this is not the same condition as for mod. Raises Lip 1 if n = 0.
- zdivmod : Int ⋆ Int -> Int ⋆ Int computes the quotient and remainder of two integers given as
argument. The dividend m and the divisor n are large integers. This returns (q,r), where m = nq +r, |r| < |n|,
and the sign of r is the same as n. Note that this is not the same condition as for divmod. Raises Lip 1 if
n = 0.
- zmod : Int ⋆ Int -> Int computes the remainder of two integers given as argument. The dividend m
and the divisor n are large integers. This returns r, where m = nq + r, |r| < |n|, and the sign of r is the same
as n. Note that this is not the same condition as for mod. Raises Lip 1 if n = 0.
- zsexp : Int ⋆ int -> Int, applied to (a,e), computes ae; note that, although a is a large integer, e is
a machine integer. Raises Lip 10 if e < 0 and |a|≠1. Note that this function may take some time and build a
huge result: time and space is O(elog a), i.e., proportional to the size of a times e itself, that is, an exponential
of its size. By convention, 00 = 1.
- zexp : Int ⋆ Int -> Int, applied to (a,e), computes ae; both a and e are large integers. Raises
Lip 10 if e < 0 and |a|≠1. Note that this function may take some time and build a huge result: time and space
is O(elog a), i.e., proportional to the size of a times e itself, that is, an exponential of its size. In short, although
e is a large integer, it should not be too large for zexp to return at all. By convention, 00 = 1.
- zlsl1 : Int -> Int multiplies the argument by 2, that is, shifts it left 1 bit.
- zasl : Int ⋆ int -> Int, called on (a,n), shifts a left n bits (if n ≥ 0), or right -n bits (if n < 0).
That is, it returns the greatest integer in absolute value that is less in absolute value than a.2n (this is exactly
a.2n unless n < 0).
- zasr1 : Int -> Int divides the argument by 2, that is, shifts it right 1 bit. For negative arguments, this
does not work as zdivmod, rather as divmod; this does not work as asr either. That is, while asr (~3,1)
returns ~2, zasr1 (Int ~3) returns ~1.
- zasr : Int ⋆ int -> Int, called on (a,n), shifts a right n bits (if n ≥ 0), or left -n bits (if n < 0).
That is, it returns the greatest integer in absolute value that is less in absolute value than a.2-n.
- zodd : Int -> bool returns true if and only if the argument is odd. Testing zodd (n) is equivalent to
zbit (n,0).
- zmakeodd : Int -> int ⋆ Int returns (k,a) with k highest such that the argument equals 2k.a. Note
that k is a machine integer, while a is a large integer. Raises Lip 7 if argument is 0.
- sqrt : int -> int returns the floor of the square root of the argument, which is a machine integer. Raises
Lip 11 if argument is negative.
- zsqrt : Int -> Int ⋆ Int returns (r,d), where r is the floor of the square root of the argument a, and
d is the remainder a - r2. Raises Lip 11 if argument is negative.
- zroot : Int ⋆ int -> Int returns the floor of the nth root of a, where (a,n) is provided as argument.
If a is negative and n is odd, returns the opposite of the nth root of -a. Raises Lip 7 if n = 0, Lip 11 if a
is negative but n is even, Lip 1 if a = 0 and n < 0.
- zlog : Int -> num returns the natural logarithm of the argument, or at least a good approximation (on
32-bit architectures, if computes it from the upper 60 bits, for a 56-bit result). If the argument x is negative, the
principal branch of the logarithm is chosen, and log(-x) + iπ is returned. If x = 0, then Lip 7 is raised.
- zjacobi : Int ⋆ Int -> intapplied to (m,n), returns the Jacobi symbol
. This is an extension
of the Legendre symbol (defined when n is prime), defined whenever n > 0 is odd, and which can also be
computed in polynomial time.
If n = ∏
i=1npi, where pi is prime, the Jacobi symbol
equals ∏
i=1n
, a product of Legendre
symbols. If p is prime and odd, the Legendre symbol
is 0 if m and p are not coprime, otherwise it is 1 if
m is a quadratic residue modulo p (i.e., m = x2 mod p for some x), otherwise it is -1.
This implies that in general, zjacobi (m,n) returns 0 is m and n are not coprime, and if it returns -1 then m
is not a quadratic residue modulo n.
Computation is based on the identities (where n > 0):
= 0;
= 1;
= (-1)(n-1)∕2
when
n is odd;
= (-1)(n2-1)∕8
when n is odd;
= (-1)(m-1)(n-1)∕4
when both m and n
are odd.
Raises Lip 6 if n ≤ 0, Lip 13 if n is not odd.
- zsjacobi : int ⋆ int -> int computes the Jacobi symbol of the integers in argument. See
zjacobi for details.
Bit Manipulation
Large integers also serve as bit vectors, of varying size. Every non-negative integer represents a finite set of bits, those that are
set in the binary representation of the integer; while every negative integer represents a cofinite set of bits (i.e., a set of bits
whose complementary is finite), those that are set in the two’s complement binary representation of the integer. For example, 23
is …00010111 in binary, and represents the set {0,1,2,4}, since 23 = 24 + 22 + 21 + 20. And -23 is …11101001, and
represents the set {0,3,5,6,7,…}.
- zsnbits : int -> int returns the number of bits needed to represent the argument x, which is a machine
integer. This is ⌈log 2(|x| + 1)⌉, where log 2 is base 2 logarithm.
- znbits : Int -> int returns the number of bits needed to represent the argument x, which is a large
integer. This is ⌈log 2(|x| + 1)⌉, where log 2 is base 2 logarithm.
- zsweight : int -> int returns the number of bits set to one in the binary representation of the machine
integer in argument. When the argument is negative, this is also the number of 0 bits in the binary negation of
the argument (see bnot), on two’s complement machines.
- zweight : Int -> int returns the number of bits set to one in the binary representation of |x|, where x
is the large integer in argument. In particular, for negative n, zweight (Int n) does not return the same
number as zsweight n.
- zlowbits : Int ⋆ int -> Int, applied to the large integer m and the machine integer n, returns the n
lowest bits of m as a large integer. Works as though m is in two’s complement, and always returns a non-negative
number.
- zhighbits : Int ⋆ int -> Int, applied to the large integer m and the machine integer n, returns the
n highest bits of |m| as a large integer. zhighbits (m,n) is equivalent to zasr (|m|,zsnbits (m) - n)
if zsnbits (m) > n, |m| otherwise.
- zcat : Int ⋆ Int -> Int concatenates the bit vectors in argument. Calling zcat (m,n) is equivalent
to zor (zasl (m,zsnbits (n)),n). Concatenating two bit vectors m and n, where n is instead known to
contain k significant bits, should instead be done by calling zor (zasl (m,k),n).
- zsreverse : int ⋆ int -> int returns the bit reversal of the machine integer m in first argument.
Second argument k is the number of bits in m that should be considered, and is truncated to 0 if negative, and
to the number of bits in a machine integer if greater.
- zreverse : Int ⋆ int -> Int returns the bit reversal of the large integer m in first argument. The
second argument k is the number of bits in m that should be considered, and is truncated to 0 if negative. The
integer m is viewed in two’s complement, and even if m < 0, zreverse (m,k) is ≥ 0. Just reversing m is
done by calling zreverse (m, znbits (m)), provided m ≥ 0.
- setofInt : Int -> int set returns the set of bits set to 1 in the integer argument. Raises Lip 10 if
the argument is negative, in which case the set would be infinite and hence not representable of type int set.
- Intofset : (int -m> 'a) -> Int returns the integer whose 1 bits are at positions specified by the
set in argument. To be precise, it returns ∑
i2i, where i ranges over all non-negative machine integers in the
domain of the map in argument. This function is not intended for sets of machine integers that are too large,
otherwise there is a risk of memory overflow.
- znot : Int -> Int returns the binary negation of the argument n. This is exactly the same as computing
-n - 1. Seeing integers as sets of machine integers, this computes the complement of the given set.
- zand : Int ⋆ Int -> Int return the bitwise conjunction of the integers in argument. As sets of machine
integers, zand computes the intersection of the sets described by the integers in argument.
- zor : Int ⋆ Int -> Int return the bitwise disjunction of the integers in argument. As sets of machine
integers, zor computes the union of the sets described by the integers in argument.
- zxor : Int ⋆ Int -> Int return the bitwise exclusive or of the integers in argument. As sets of machine
integers, zxor computes the symmetric difference of the sets described by the integers in argument.
- zbit : Int ⋆ int -> bool, applied to (m,n), returns bit number n of the large integer m. One bits are
returned as true, zero bits as false. Returns false if n < 0. The large integer m is taken to be in two’s
complement. This is equivalent to zand (m,Intofset{n})≠Int 0.
- zaddbit : Int ⋆ int -> Int, applied to (m,n), returns the large integer m with bit n set. Does
nothing if n < 0. The large integer m is taken to be in two’s complement. This is equivalent to zor
(m,Intofset{n}).
- zrembit : Int ⋆ int -> Int, applied to (m,n), returns the large integer m with bit n cleared. Does
nothing if n < 0. The large integer m is taken to be in two’s complement. This is equivalent to zand (m,znot
(Intofset{n})).
Euclidean Algorithms
- zgcd : Int ⋆ Int -> Int computes the greatest common divisor of the arguments by the binary
method. Raises Lip 7 if both arguments are 0.
- zgcde : Int ⋆ Int -> Int computes the greatest common divisor of the arguments, by the classical
Euclidean algorithm. This might be faster than zgcd in special cases. Raises Lip 7 if both arguments are 0.
- zbezout : Int ⋆ Int -> (Int ⋆ Int) ⋆ Int : zbezout (a,b) computes the gcd d of a and b,
and two coefficients xa, xb as in Bezout’s Theorem, i.e., such that a.xa + b.xb = d. This is also sometimes
called the extended Euclidean algorithm. The gcd is always ≥ 1.
This raises Lip 7 if both a and b are zero, in which case the gcd is undefined.
This might raise Lip 8, but should not. If this happens, this is a bug, and this should be reported to the author
goubault@lsv.ens-cachan.fr, see MAINTENANCE at the end of the OPTIONS file.
- zchirem : Int ⋆ Int ⋆ Int ⋆ Int -> Int applied to (ma,a,mb,b), returns d such that d = a
modulo ma, and d = b modulo mb, and 0 ≤ d < ma.mb—unless ma = mb, in which case a and b should be
equal, and d = a = b.
Raises Lip 6 if ma ≤ 0 or mb ≤ 0. Raises Lip 9 if ma = mb but a≠b, and Lip 5 if ma≠mb but ma and
mb are not coprime.
Standard Modular Arithmetic
- zmadd : Int ⋆ Int ⋆ Int -> Int adds modulo m: zmadd (a,b,m) computes (a + b) mod m. It is
assumed that 0 ≤ a,b < m; result is also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2 if
m = 0.
- zmsub : Int ⋆ Int ⋆ Int -> Int subtracts modulo m: zmsub (a,b,m) computes (a-b) mod m.
It is assumed that 0 ≤ a,b < m; result is also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2
if m = 0.
- zmmul : Int ⋆ Int ⋆ Int -> Int multiplies modulo m: zmmul (a,b,m) computes ab mod m. All
arguments are large integers. It is assumed that 0 ≤ a,b < m; result is also ≥ 0 and < m. Produces erratic
results otherwise. Raises Lip 2 if m = 0.
- zsmmul : Int ⋆ int ⋆ Int -> Int multiplies modulo m: zsmmul (a,b,m) computes ab mod m.
The first argument a is a large integer, while b is a machine integer. It is assumed that 0 ≤ a,b < m; result is
also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2 if m = 0.
- zmsmul : int ⋆ int ⋆ int -> int, called on (a,b,n), returns ab mod n; raises Lip 2 if n = 0.
All arguments are machine integers.
- zmsqr : Int ⋆ Int -> Int squares modulo m: zmsqr (a,m) computes a2 modm. It is assumed that
0 ≤ a < m; result is also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2 if m = 0.
- zmdiv : Int ⋆ Int ⋆ Int -> Int divides modulo m: zmdiv (a,b,m) computes a∕b mod m. It is
assumed that 0 ≤ a,b < m; result is also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2 if
m = 0, Lip 1 if b = 0. If m is not coprime with b, it might also be the base that the quotient is undefined, in
which case no exception is raised, rather some factor of m is returned.
- zminv : Int ⋆ Int -> Int computes inverses modulo m: zminv (a,m) computes a-1 mod m. It is
assumed that 0 ≤ a < m; result is also ≥ 0 and < m. Produces erratic results otherwise. Raises Lip 2 if
m = 0, Lip 1 if a = 0. If m is not coprime with a, it might also be the base that the inverse is undefined, in
which case no exception is raised, rather some factor of m is returned—namely gcd(a,m).
- zmsexp : int ⋆ int ⋆ int -> int, applied to (a,e,m), computes a|e| mod m. All arguments are
machine integers. Raises Lip 2 if m = 0. By convention, 00 = 1.
- zsmexp : Int ⋆ int ⋆ Int -> Int, applied to (a,e,m), computes ae mod m. The exponent e is a
machine integer; however, this is not faster than calling zmexp, since zsmexp calls zmexp. Raises Lip 2 if
m = 0, Lip 10 if e is negative and a and m are not coprime. By convention, 00 = 1.
- zmexp : Int ⋆ Int ⋆ Int -> Int, applied to (a,e,m), computes ae mod m. The exponent e is a
large integer. Raises Lip 2 if m = 0, Lip 10 if e is negative and a and m are not coprime. By convention,
00 = 1.
- zmexpm : Int ⋆ Int ⋆ Int -> Int, applied to (a,e,m), computes ae mod m, just like zmexp.
However, it uses the m-ary method, which is faster than zmexp if a is not too small. Raises Lip 2 if m = 0,
Lip 10 if e is negative and a and m are not coprime. By convention, 00 = 1.
- zm2exp : Int ⋆ Int -> Int, applied to (e,m), computes 2e mod m. Raises Lip 2 if m = 0,
Lip 10 if e is negative and m is even.
- zmexp2 : Int ⋆ Int ⋆ Int ⋆ Int ⋆ Int -> Int, called on (a1,e1,a2,e2,m), computes
a1e1.a2e2 mod m. Raises Lip 2 if m = 0, Lip 10 if e1 or e2 is negative. This uses Shamir’s method with
sliding window of size 1, 2 or 3, depending on the maximal size of e1 or e2. This should be used if e1 and e2 are
approximately of the same size.
Montgomery Modular Arithmetic
Modular multiplications can be done division free and therefore somewhat faster (about 20%), if the Montgomery
representation is used [9]. Converting to and from Montgomery representation takes one Montgomery multiplication each, so it
only pays to use Montgomery representation if many multiplications have to be carried out modulo a fixed odd
modulus.
To use Montgomery arithmetic, first initialize the modulus N by using zmontstart, and convert all operands to their
Montgomery representation by ztom, but do not convert exponents. Use the addition, subtraction, multiplication, squaring,
division, inversion, and exponentiation functions, which all start with zmont, below on the converted operands, just as you
would use the ordinary modular functions (starting with zm). The results can be converted back from Montgomery
representation to ordinary numbers modulo N using zmonttoz.
- zmontstart : Int -> unit initializes Montgomery arithmetic mod N, the argument. Raises Lip 3 if
N is not both positive and odd.
If no exception is raised, all subsequent computations using Montgomery arithmetic will use modulus N. Mixing
ordinary large integers and Montgomery numbers, or Montgomery numbers based on different moduli yields
surprising results; this should be left to experimented users only.
- ztomont : Int -> Int converts an ordinary large integer to the corresponding Montgomery number mod
N. Raises Lip 3 if N is undefined.
- zmonttoz : Int -> Int converts a Montgomery number mod N to the corresponding ordinary large
integer. Raises Lip 3 if N is undefined.
- zmontadd : Int ⋆ Int -> Int adds two Montgomery numbers mod N. This actually just calls zmadd
with N as modulus. Raises Lip 3 if N is undefined.
- zmontsub : Int ⋆ Int -> Int subtracts two Montgomery numbers mod N. This actually just calls
zmsub with N as modulus. Raises Lip 3 if N is undefined.
- zsmontmul : Int ⋆ int -> Int takes one Montgomery number mod N and an ordinary machine
integer, multiplies them and returns the result as a Montgomery number mod N. This actually just calls zsmmul
with N as modulus. Raises Lip 3 if N is undefined.
- zmontmul : Int ⋆ Int -> Int multiplies two Montgomery numbers mod N. This is essentially the
only function, apart from zmontsqr and zmontinv, that justifies using Montgomery arithmetic. Raises
Lip 3 if N is undefined.
- zmontsqr : Int -> Int squares the Montgomery number mod N in argument. This is essentially the
only function, apart from zmontmul and zmontinv, that justifies using Montgomery arithmetic. Raises
Lip 3 if N is undefined.
- zmontdiv : Int ⋆ Int -> Int divides two Montgomery numbers mod N. Raises Lip 3 if N is
undefined, Lip 1 if second argument is 0 mod N. If N is not coprime with the second argument, it might also
be the case that the quotient is undefined. In this case, no exception is raised, rather some factor of N is returned
(as a regular integer, not as a Montgomery number).
- zmontinv : Int -> Int computes the inverse of a Montgomery number mod N. Raises Lip 3 if N is
undefined, Lip 1 if argument is 0 mod N. If N is not coprime with the argument, it might also be the case that
the inverse is undefined. In this case, no exception is raised, rather some factor of N is returned (as a regular
integer, not as a Montgomery number).
This is one of the rare functions, apart from zmontmul and zmontsqr, that justifies using Montgomery
arithmetic. Indeed, computing the inverse mod N in Montgomery representation means doing just one
Montgomery multiplication by a constant.
- zmontexp : Int ⋆ Int -> Int applied to (a,e), computes ae mod N. Although a is a Montgomery
number, e is not, and is a regular large integer. Raises Lip 3 if N is undefined, Lip 10 if e is negative and a
and N are not coprime. By convention, 00 = 1.
- zmontexpm : Int ⋆ Int -> Int, called on (a,e), computes ae mod N, just like zmontexp.
However, it uses the m-ary method, which is faster than zmontexp if a is not too small. Raises Lip 3 if N is
undefined, Lip 10 if e is negative and a and N are not coprime. By convention, 00 = 1.
- zmontexp2 : Int ⋆ Int ⋆ Int ⋆ Int -> Int, called on (a1,e1,a2,e2), computes a1e1.a2e2 mod
N. Raises Lip 2 if m = 0, Lip 10 if e1 or e2 is negative. This uses Shamir’s method with sliding window
of size 1, 2 or 3, depending on the maximal size of e1 or e2. This should be used if e1 and e2 are approximately
of the same size.
Primes, Factoring
- primes : int -> unit -> int is a small prime enumerator generator. That is, first call primes n for
some integer n; this returns a function, call it nextprime. Then calling nextprime () repeatedly returns
2, then 3, 5, 7, 11, …, i.e., all primes that hold in a machine integer (small primes). After nextprime has
exhausted all small primes, raises Lip 17.
Note that each new call to primes generates a new prime enumerator. That is, just calling primes n ()
repeatedly will just return 2 each time, by computing a new enumerator each time, and calling it only once.
The small prime number enumerator primes is basically an implementation of Eratosthenes’ sieve. The
number n is an estimation of the largest prime you will need. It serves to allocate a table of m elements, where m
is the largest integer such that 2m(2m+1) ≤n. On 32 machines, calling primes max_int is recommended.
On 64 machines, doing so will generate a huge table, which will take a few seconds just to initialize, so a smaller
value of n, say, 5 000 000 000, is recommended.
- zpollardrho : Int ⋆ int -> (Int ⋆ Int) option is an implementation of Pollard’s ρ
algorithm. zpollardrho (n,k) tries to factor n, in k iterations or less. The value k = 0 is special and is
meant to denote no bound on the number of iterations.
If the algorithm succeeds, it returns SOME (r,c), where r is a non-trivial factor of n, c = n∕r and r ≤ c. If
n < 0, returns SOME (-1,n). If the algorithm fails, returns NONE.
Raises Lip 12 if a bug occurred in zpollardrho. If this happens, this should be reported to the author
goubault@lsv.ens-cachan.fr, see MAINTENANCE at the end of the OPTIONS file.
- ztrydiv : Int ⋆ int ⋆ int -> (int ⋆ Int) option applies to a large integer n, and two
machine integers a and b, and computes the smallest (positive) prime divisor p of n that is ≥ a and ≤ b. If p
exists, it returns SOME (p,n∕p), otherwise NONE.
This works by calling primes b to get a function that enumerates all small primes. All primes < a are first
discarded, then all primes between a and b are tested. This is only intended for large n, as it does not stop as
soon as it goes past
, for small a, as enumerating all primes < a takes some time for large a, and for b not
too large (see remark on the efficiency of primes on 64 bit machines).
- zprime : Int ⋆ int -> bool takes a large integer m and a machine integer k, and tests whether m is
prime. This runs a probabilistic tests for at most k + 1 runs. It first uses ztrydiv to find small factors first. If
none is found, the Miller-Rabin test is run k + 1 times: when zprime returns false, then m is definitely not
prime, otherwise it is prime with probability at least 1 - (1∕4)k+1.
- zrandl : int ⋆ (int -> int) -> Int draws an equidistributed large integer of |k| bits, where k
is the first argument. If k < 0, returns opposites of such numbers. The second argument is a pseudo-random
generator function, like irand, which takes a bound x and returns an equidistributed random integer in
[0,x[. Using irand is not recommended for cryptographic applications, instead cryptographically secure
pseudo-random number generators should be used.
The second argument should preferrably be a function that does not raise any exception, otherwise memory
leaks may occur.
- zrandl1 : int -> Int is equivalent to fn nbits => zrandl (nbits, irand), but faster.
- zrand : Int ⋆ (int -> int) -> Int draws an equidistributed large integer of absolute value < m,
where m is the first argument (or returns 0 if m = 0), and of the same sign as m. The second argument is a
pseudo-random generator function, like irand, which takes a bound x and returns an equidistributed random
integer in [0,x[. Using irand is not recommended for cryptographic applications, instead cryptographically
secure pseudo-random number generators should be used.
The second argument should preferrably be a function that does not raise any exception, otherwise memory
leaks may occur.
- zrand1 : Int -> Int is equivalent to fn nbits => zrand (nbits, irand), but faster.
- zrandprime : int ⋆ int ⋆ (int -> int) -> Int applied to k, n and f, draws random primes
of n bits. This calls f to draw random numbers, which are then tested for primality.
More precisely, if |n|≥ 2, then this returns a random probable prime of |n| bits, where prime testing is done by
calling zprime with argument k; if |n| < 2, then it raises Lip 18. If n < 0, in addition, the returned prime
number is congruent to 3 modulo 4.
This works by picking odd numbers of the right size, keeping adding two until it is probably prime; or it
is too large, in which case we pick again, and start adding again, in accordance with NIST DSS Appendix.
Using f =irand is not recommended for cryptographic applications, instead cryptographically secure
pseudo-random number generators should be used.
The third argument should preferrably be a function that does not raise any exception, otherwise memory leaks
may occur.
- zrandprime1 : int ⋆ int -> Int is equivalent to
fn (nbits, ntries) => zrandprime (nbits, ntries, irand)
- zrandpprime : int ⋆ int ⋆ Int ⋆ (int -> int) -> Int ⋆ Int is used to generate a probable
prime number p of exactly |k| bits such that q divides p- 1, where q is given. The arguments are k, a machine integer n
(used in calling zprime, so that p is prime with probability ≥ 1 - (1∕4)n+1), the large integer q, and a
pseudo-random generator function f taking a bound x and returning an equidistributed random integer in
[0,x[, e.g., irand. Raises Lip 14 if q ≤ 0, Lip 18 if failed to generate p, typically because q already
uses ≥|k| bits. Otherwise, returns (p,⌊p∕q⌋). If k < 0, in addition, p will be congruent to 3 modulo
4.
This works by generating p at random amongst |k|-bit numbers such that q divides p - 1, until p is probably prime, in
accordance with NIST DSS Appendix. This only works if n is substantially larger than znbits (q). Using f =irand
is not recommended for cryptographic applications, instead cryptographically secure pseudo-random number generators
should be used.
zrandpprime is used to generate pairs (p,q) of prime numbers such that q divides p- 1, where p and q have specified
numbers of bits. To do this, first generate q using zrandprime, then call zrandpprime to generate
p.
The fourth argument should preferrably be a function that does not raise any exception, otherwise memory leaks may
occur.
- zrandpprime1 : int ⋆ int ⋆ Int -> Int ⋆ Int is equivalent to
fn (k, n, q) => zrandpprime (k, n, q, irand)
- zrandfprime : int ⋆ int ⋆ Int ⋆ (int -> int) -> Int ⋆ Int generates a pair of probable prime
numbers p and q such that q has |k| bits and p = qr + 1, where k and r are given. The arguments are (k,n,r,f), where n
is given to zprime to test whether both p and q are probably prime, r is the machine integer above, and f is a
pseudo-random generator taking a bound x and returning an equidistributed random integer in [0,x[, e.g., irand. If
k < 0, in addition q will be congruent to 3 modulo 4. Raises Lip 18 if fails, typically if r < 2 or r is
odd.
Using f =irand is not recommended for cryptographic applications, instead cryptographically secure pseudo-random
number generators should be used.
The fourth argument should preferrably be a function that does not raise any exception, otherwise memory leaks may
occur.
- zrandfprime1 : int ⋆ int ⋆ Int -> Int ⋆ Int is equivalent to
fn (k, n, r) => zrandfprime (k, n, r, irand)
- zrandgprime : int ⋆ int ⋆ bool ⋆ (int -> int) -> Int ⋆ Int ⋆ Int takes four
arguments k, n, first, and f, and returns p,q,g where p and q random probable primes (tested with zprime
with argument n; if k < 0, in addition, q is congruent to 3 modulo 4) such that p = 2q + 1 (this uses
zrandfprime), and a generator g of the multiplicative group of all numbers prime to p less than p. If first is
true, then g will be the smallest generator, otherwise g will be selected at random. Raises Lip 18 if it
fails.
Using f =irand is not recommended for cryptographic applications, instead cryptographically secure pseudo-random
number generators should be used.
The fourth argument should preferrably be a function that does not raise any exception, otherwise memory leaks may
occur.
- zrandgprime1 : int ⋆ int ⋆ bool -> Int ⋆ Int ⋆ Int is equivalent to
fn (k, n, first) => zrandfprime (k, n, first, irand)
- zecm : Int ⋆ int ⋆ int ⋆ int -> (Int ⋆ bool) option, called on (n,ncurves,ϕ1,ntries), tries to
factor the large integer n. It returns NONE if n was found to be probably prime after ntries primality tests. Otherwise, it
returns SOME (f,b): if b =true, then f is a non-trivial factor of n; if b =false (which should only happen in very
exceptional circumstances), then a non-trivial factor of n can be obtained by looking at the factorization of
fn - f.
This runs by using Arjen Lenstra’s elliptic curve algorithm. Up to ncurves random elliptic curves are drawn with first
phase bound ϕ1 (this increases by 1.02 at each new elliptic curve).
This may raise Lip 19 if zecm fails to reach any conclusion, or if n is too small (on 32-bit machines,
n ≤ 215 = 32768; in general, n >
is never too small: in these cases, use ztrydiv first).
Raises Lip 15 if a bug occurred in zecm. In case this happens, this should be reported to the author
goubault@lsv.ens-cachan.fr, see MAINTENANCE at the end of the OPTIONS file.
4.10 Input/Output
Two types are defined, first that of output streams outstream:
type 'rest outstream = |[put:string -> unit,... : 'rest]|
means that output streams are extensible records (classes) with at least a put method, to output a string to the stream. For
example, to make a stream that prints to a string stored inside a ref sr:string ref:
|[put = fn text => (sr := !sr ^ text)]|
The type of input streams instream is:
type 'rest instream : |[get:int -> string,... : 'rest]|
providing at least a get method, that reads up to n characters from the stream, n being the number in argument. If less than
n characters have been read, it usually means that an end of stream has been reached. (After the end of stream is reached,
repeated calls to get will return the empty string.) For example, an input stream reading from a string s:string, with
current position stored in a ref posr:num ref is:
|[get = fn n => let val goal = !posr+n
val newpos = (if goal<size s
then goal
else size s)
in
substr(s,!posr,newpos)
before (posr := newpos)
end]|
Note that if you want to read from a string, the instring primitive does this faster.
The following exception is defined:
exception IO of int
is raised when the system was unable to open a file or any other I/O-related operation; it returns as argument the error code
returned by the operating system. (Note that this is OS-dependent.)
The other methods that may be present in predefined streams are:
The input/output values and functions are:
- print : 'a outstream -> dynamic -> unit prints the value stored in the dynamic on the
specified output stream, in the same format that is used by the toplevel loop. The type is not
printed, though. Strings are printed with quotes, and with control characters escaped. If it is desired
to print strings merely as the sequence of their characters, apply the put method of the stream. Try:
#put stdout "Hello, world\n". (And don’t forget to #flush stdout () afterwards to see your
result printed.)
- pretty : 'a outstream -> dynamic -> unit pretty-prints the value stored in the dynamic on the
specified output stream, as done by the toplevel loop. The type is not printed, and the right margin is the system
default (usually, 80).
- stdout : |[put:string -> unit,flush:unit -> unit]| is the standard output stream, that
prints on the console if not redirected. This stream is buffered, the flush method flushes the buffer.
- stderr : |[put:string -> unit,flush:unit -> unit]| is the standard error stream; it prints
on the console if not redirected. This stream is buffered, the flush method flushes the buffer (on most operating
systems, an end-of-line also flushes the buffer).
-
outfile : string -> |[put:string -> unit,
flush:unit -> unit,
seek:int -> unit,
advance:int -> unit,
seekend:int -> unit,
tell:unit -> int,
truncate:unit -> unit,
close:unit -> unit]|
opens the file whose name is given, creating it if it does not exist, reinitializing it to zero length, and returns an output
stream. If the file could not be opened, the IO exception is raised, applied to the error code.
The stream is buffered, the flush method flushes the buffer.
-
appendfile : string -> |[put:string -> unit,
flush:unit -> unit,
seek:int -> unit,
advance:int -> unit,
seekend:int -> unit,
tell:unit -> int,
truncate:unit -> unit,
close:unit -> unit]|
opens the file whose name is given, creating it if it does not exist, setting the current position to the end of file, and
returns an output stream. If the file could not be opened, the IO exception is raised, applied to the error
code.
The stream is buffered, the flush method flushes the buffer.
-
outprocess : string ⋆ string list -> |[put:string -> unit,
flush:unit -> unit,
kill:unit -> unit]|
creates a process which will execute the shell command given in argument in parallel with the current GimML process.
This shell command is given in the form (command name, arguments). The command name is searched for, using the
PATH shell variable, with arguments as given.
Strings can be sent to the standard input of this process by using the put method (followed by flush to really send the
message), and the process can be terminated by calling the kill method.
Contrarily to files, which can be closed and then revived when necessary, a killed process cannot be revived, and an IO
exception will be raised when attempting to write to a dead process.
If the process could not be created, then an IO exception is raised (normally, IO 2, “no such file or directory”).
The kill method raises an IO exception when the process exited with a non-zero exit code (as returned by the C
function exit). Then, if n is this code, IO n is raised. (To be fair, n is first taken modulo 256, and only if the result is
non-zero is the exception raised, with n mod 256 as argument.)
The child process can exit by itself, and this can be detected by the fact that putting a string to the child (then
flushing, to be sure that the message has really been sent) will raise an IO error (normally, IO 32, “broken pipe”). It
is then good policy to kill the process, as it allows the operating system to reclaim process structures allocated for the
child (at least on Unix, where this is necessary).
- stdin : |[get:int -> string,getline:unit -> string]| is the standard input stream, that reads
from the console if not redirected. This stream is usually buffered, so that characters cannot be read until a newline
character \n is typed as input.
-
readline : |[prompt : string,
masked : bool,
multiline : bool,
completion : (string ⋆ string
-> (string ⋆ string ⋆ string) list) option,
hints : (string -> (string ⋆ int ⋆ bool) option) option,
history : (int ⋆ string list) ref option ]|
-> string
reads a line from stdin, with line editing facilities. One may obtain a line from stdin by calling
#getline stdin (), but readline does a better job in several respects. It features line editing, so that, say, cursor
movement keys actually do what you expect; a command line history; the possibility of masking user
input, e.g., for entering passwords; the possibility of multi-line editing; auto-completion; and suggesting
hints.
For a simple example, type the following:
readline |[prompt="$ ",masked=false,multiline=false,completion=NONE,
hints=NONE,history=NONE]|;
then type something after the $ prompt, and see what you get.
The readline takes a record as input, with the following fields.
- prompt : string, the command-line prompt, shown at the beginning of the input line;
- masked : bool, set to true if user input should be masked (namely, each input letter will be shown
as ⋆ instead);
- multiline : bool, set to true for multi-line input, to false otherwise; if set to false, then long
lines will scroll horizontally instead of overflowing to the next line;
- completion :
(string ⋆ string -> (string ⋆ string ⋆ string) list) option is an optional
completion function; auto-completion is deactivated if set to NONE; if set to SOME f for some function f,
then, whenever the user type the tabulation key, f will be called with two strings as arguments, respectively
the contents of the current line to the left of the cursor and the contents of the current line to the right of the
cursor; f is then expected to return a list of triples (p,c,s) of strings: each one is the result of a possible
completion, and if chosen, the line will be updated to be the concatenation of p, c, and s, with cursor placed
at the end of c.
At this point, you may try to type the following in order to understand what auto-completion is.
fun my_completion (prefix, suffix) =
if #matches (regexp "abc$") prefix
then [(prefix, "def", suffix), (prefix, "bcba", suffix)]
else [];
readline |[prompt="$ ",masked=false,multiline=false,
completion=SOME my_completion,hints=NONE,history=NONE]|;
After the $ prompt, try to type abc then the tabulation key: you should be proposed to add "def" to what you
have just typed. Type the tabulation key a second time, and the choice should change to "bcba". Typing it a third
time should emit a beep, telling you that you have reached the end of the list of choices. Typing the escape key
allows you to abort the auto-completion process. Any other key will accept the current auto-completion and resume
line editing.
- hints : (string -> (string ⋆ int ⋆ bool) option) option is an optional function for
suggesting hints; if set to NONE, then the hints functionality is deactivated; if set to SOME f, for some function f,
then f will be repeatedly called on the current contents of the input line; f is meant to return NONE if no hint should
be displayed, and SOME (s,color,bold) otherwise: then s will be the displayed hint, color is the integer code of the
color to be used to display that hint, and bold is set to true in order to display the hint in bolface. Color codes
include the following.
|
|
|
|
|
|
|
|
Black | 30 | Red | 31 | Green | 32 | Yellow | 33 |
|
|
|
|
|
|
|
|
Blue | 34 | Magenta | 35 | Cyan | 36 | White | 37 |
|
|
|
|
|
|
|
|
Gray | 90 | Bright Red | 91 | Bright Green | 92 | Bright Yellow | 93 |
|
|
|
|
|
|
|
|
Bright Blue | 94 | Bright Magenta | 95 | Bright Cyan | 96 | Bright White | 97 |
|
|
|
|
|
|
|
|
|
For an example, type the following.
fun my_hints buf =
if #matches (regexp "gag$") buf
then SOME (" foo!", 35, true)
else NONE;
readline |[prompt="$ ",masked=false,multiline=false,completion=NONE,
hints=SOME my_hints,history=NONE]|;
and observe what happens whenever the line eventually ends in "gag". Try also to use SOME my_completion
for the completion field instead of NONE.
- history : (int ⋆ string list) ref option is an optional history; if set to NONE, then readline
will share the same history as the GimML toplevel; otherwise, history contains a reference to a
pair (n,h), where n is taken to be the maximum number of recorded lines, and h is a list of lines
recorded in the history; when readline returns, h will be updated to the new history. Try the
following:
val history = ref (100, []);
readline |[prompt="$ ",masked=false,multiline=false,
completion=NONE, history=SOME history, hints=NONE]|;
and look at the contents of history after the call. You may also use SOME my_completion for the
completion field, and SOME my_hints for the hints field.
The usual key commands, such as cursor left, cursor left, delete, work as expected. The cursor up and down keys navigate
throught the command line history. Tabulate calls auto-completion, escape escapes auto-completion mode. The following
Emacs-like keys are also recognized:
- control-C: end line input prematurely, forcing readline to return the empty string;
- control-D: remove character at the right of cursor if there is one; if there is none and line is empty, force
readline to return the empty string; otherwise do nothing;
- control-T: swap character with previous one; do nothing if at end of line;
- control-B: cursor left;
- control-F: cursor right;
- control-P: go to previous line in history, as with cursor up;
- control-N: go to next line in history, as with cursor down;
- control-U: delete the whole line;
- control-K: delete from current position to end of line;
- control-A: go to start of line (the home key does the same, if you have one);
- control-E: go to end of line (the end key does the same, if you have one);
- control-L: clear screen;
- control-W: delete previous word;
- escape-B: move back one word;
- escape-F: move forward one word;
- escape-D: delete next word.
-
infile : string -> |[get:int -> string,
getline:unit -> string,
seek:int -> unit,
advance:int -> unit,
seekend:int -> unit,
tell:unit -> int,
close:unit -> unit]|
opens the file whose name is given, and returns an input stream. If the file could not be opened, the IO exception
is raised, applied to the error code. The stream is buffered for speed, but no flushing method should be
necessary.
-
inprocess : string ⋆ string list -> |[get:int -> string,
getline:unit -> string,
kill:unit -> unit]|
creates a process which will execute the shell command given in argument in parallel with the current GimML process.
This shell command is given in the form (command name, arguments). The command name is searched for, using the
PATH shell variable, with arguments as given.
All text that is printed on the parallel process’s standard output can be read by the GimML process by using the get and
getline methods, and the process can be terminated by calling the kill method.
Contrarily to files, which can be closed and then revived when necessary, a killed process cannot be revived, and an IO
exception will be raised when attempting to read from a dead process.
If the process could not be created, then an IO exception is raised (normally, IO 2, “no such file or directory”).
The kill method raises an IO exception when the process exited with a non-zero exit code (as returned by the C
function exit). Then, if n is this code, IO n is raised. (To be fair, n is first taken modulo 256, and only if the result is
non-zero is the exception raised, with n mod 256 as argument.)
The child process can exit by itself, and this can be detected by the fact that the get and getline methods will all
return empty strings (end of file, at least after the internal buffer is emptied). It is then good policy to kill the process,
as it allows the operating system to reclaim process structures allocated for the child (at least on Unix, where this is
necessary).
-
instring : string -> |[get:int -> string,
getline:unit -> string,
seek:int -> unit,
advance:int -> unit,
seekend:int -> unit,
tell:unit -> int]|
opens a stream for reading on the string given as argument . This is a souped up version of the input stream example at
the beginning of the section.
-
outstring : string -> |[put:string -> unit,
seek:int -> unit,
advance:int -> unit,
seekend:int -> unit,
tell:unit -> int,
truncate:unit -> unit,
convert:unit -> string]|
opens a stream to write on, as if it were a file. The convert method is used to get the current contents of the stream in
the form of a string. This is useful notably to print data to a string. The stream is initialized with the string given as
argument to outstring.
This is a souped up version of the output stream example at the beginning of the section.
-
inoutprocess : string ⋆ string list -> |[get:int -> string,
getline:unit -> string,
put:string -> int,
flush:unit -> unit,
kill:unit -> unit]|
creates a process which will execute the shell command given in argument in parallel with the current GimML process.
This shell command is given in the form (command name, arguments). The command name is searched for, using the
PATH shell variable, with arguments as given.
All text that is printed on the parallel process’s standard output can be read by the GimML process by using the get and
getline methods; moreover, GimML can sent data, as text, by writing to its standard input with the put method,
while flush empties the output buffers to really send the text to the process; and the latter can be terminated by calling
the kill method.
Contrarily to files, which can be closed and then revived when necessary, a killed process cannot be revived, and an IO
exception will be raised when attempting to read from a dead process. For other remarks, see items inprocess and
outprocess.
- delete : string -> unit deletes the file whose name is given in argument. If any I/O error occurs, then the
exception IO is raised, applied to the corresponding error code. In particular, if the argument is the name of a directory,
rmdir should be used instead.
- rename : string ⋆ string -> unit renames the file whose name is given as first argument to the name given
as second argument. If any I/O error occurs, then the exception IO is raised, applied to the corresponding error code.
Note that the capabilities of rename vary greatly from system to system. For example, rename can move files from any
place to any other place on BSD Unix; this is restricted on Unix System V to move files only inside the same volume (file
system).
- filetype : string -> string set takes the name of a file and returns a set of properties of this file as
character strings. If the file does not exist or any other I/O error occurs, then the exception IO is raised, applied to the
corresponding error code. Otherwise, the following strings are used for properties:
- "b" means this is a block special file (Unix only);
- "c" means this is a character special file (Unix only);
- "d" means the file is a directory;
- "e" means the file is erasable, either by delete or by rmdir (by the owner, on Unix systems; by all,
on other systems); on Unix, "eg" means the file is writable by users in the same group, "eo" means the
file is writable by all other users.
- "f" means this is a fifo, a.k.a. a pipe (Unix only);
- "g" means the file has the setgid bit set (Unix only);
- "l" means the file is a symbolic link (Unix only);
- "n" means this is a regular (normal) file;
- "r" means the file is readable (by the owner, on Unix systems; by all, on other systems); on Unix, "rg"
means the file is readable by users in the same group, "ro" means the file is readable by all other users.
- "S" means the file is a socket (Unix only);
- "u" means the file has the setuid bit set (Unix only);
- "v" means the file has the “save swapped text after use” option (Unix only, should be obsolete);
- "w" means the file is writable (by the owner, on Unix systems; by all, on other systems); on Unix systems,
if a file is writable, it is also erasable; on Unix again, "wg" means the file is writable by users in the same
group, "wo" means the file is writable by all other users.
- "x" means the file is executable (by the owner, on Unix systems; by all, on other systems); on Unix again,
"xg" means the file is writable by users in the same group, "xo" means the file is writable by all other
users.
- dir : string -> string set takes a path as input, which should be a correct path name for a directory, and
returns the set of all file names (except . and .. on Unix systems) inside this directory. If any I/O error occurs, then the
exception IO is raised, applied to the corresponding error code.
- cd : string -> unit changed the current directory to the one specified by the path given as input. If any I/O error
occurs, then the exception IO is raised, applied to the corresponding error code.
- pwd : unit -> string returns a name for the current directory, as changed by cd. If any I/O error occurs, then the
exception IO is raised, applied to the corresponding error code. This may notably be the case if the path name is too
long.
- mkdir : string -> unit creates a new directory by the name given in argument. If any I/O error occurs, then the
exception IO is raised, applied to the corresponding error code.
- rmdir : string -> unit creates a new directory by the name given in argument. If any I/O error occurs, then the
exception IO is raised, applied to the corresponding error code. In particular, on most operating systems, rmdir can be
used on empty directories only. To delete files, use delete.
- system : string -> int issues the shell command given as argument. On Unix systems, this calls the
system(3S) function, which calls a sh-compatible shell to execute the command. system launches the command,
waits for it to return, and returns the exit code of the command (0 if all went well). If an error occurs (not enough
memory, not enough processes available, command interrupted by a signal, typically), then an IO exception is
raised.
- getenv : string -> string option reads the value of the environment variable whose name is given, and
returns NONE if it has not been defined, and SOME of its value, otherwise.
- args : unit -> string list returns the list of command-line options given after the -- switch on GimML’s
command-line.
- iomsg : int -> string translates the I/O error code in argument (as returned as argument of a IO exception) into
a string in human-readable form. This is the same message as the one printed by the C function perror() on Unix
systems.
- leftmargin : int ref defines the left margin for printing, as a number of spaces to print at the beginning of a
new line when pretty-printing. It is ref 0 by default.
- rightmargin : int ref defines the right margin for printing, as a number of columns (counted as characters)
from the beginning of the line where a new line has to be forced by the pretty-printing functions. It is ref 80 by
default.
- maxprintlines : int ref defines the limit on the number of lines printed by pretty-printing functions, mainly to
avoid looping while printing infinite structures, or to avoid printing structures of humongous sizes fully. This is also valid
for printing values defined on the toplevel, since pretty is used for this purpose. The default value is ref 100. To
suppress the limit in practice, write maxprintlines := max_int.
- numformat : string ref is a reference to the string that is used to format floating-point values for printing. Any
C-style format for printing doubles may be used, i.e. it is %[-∣+∣ ∣#] * [0 - 9] * (\.[0 - 9]+)?[feEgG], where - forces
left adjustment, + forces a sign in front of the value, a blank puts a space instead of a plus sign in front of the value, a
hash sign (#) forces a radix character (i.e, a dot in general; besides, trailing zeroes are not removed in the case of g and G
conversions); the following optional decimal number specifies the minimum field width (with the left adjustment flag -,
the number is padded on the right if it is not large enough); if this is followed by a dot, and a decimal number, this
specifies the number of digits to appear after the radix character for e and f conversions, the maximum number of
significant digits for the g conversion. The possible conversion characters are: f prints in the format [ ] ddd.ddd
with 6 digits by default (this can be modified by precision specifications; 0 says not to output the radix
character); e or E print in the format [ ] d.dde[+ ]dd (or with E instead of e, respectively), with one digit
before the radix character, and 6 digits by default; g or G choose between f and e (resp. E) conversions:
the latter is chosen if the exponent is less than -4 or greater than or equal to the precision (by default,
6).
The default is "%G".
4.11 Lexing and Parsing
GimML comes with a lexical analyzer generator, glex, and a parser generator, gyacc, inspired from the standard tools lex
and yacc. They are documented separately, and GimML offers a few supporting datatypes, exceptions and functions that are
documented here.
The supporting datatypes are:
- glex_data is the type of internal glex states;
- 'a glex_tables is the type of automata that glex simulates, returning lexemes of type 'a';
- glex_buffer is the type of input buffers, which are an abstration of input files used by glex;
- ('a, 'b) gyacc_data is the type of internal gyacc states, where 'a is the type of lexemes, and 'b is
the type of gyacc’s semantic values;
- 'b gyacc_tables is the type of automata that gyacc simulates, with semantic values of type 'b.
There are also exceptions:
- Glex n, for various values of the integer n, documented under glexmsg below;
- Gyacc n, for various values of the integer n, documented under gyaccmsg below;
- Parse ℓ, where ℓ is a list of strings.
The supporting primitives for glex are the following. Not all are useful. The ones at the bottom are used internally by the
scanner generated by the glex tool.
- glex_data : |[get:int -> string, getline : unit -> string, ...:'a]|
⋆ (unit -> bool) -> glex_datacreates a new glex_data internal scanner state object for use by
a glex scanner. The scanner will read its input from the input stream given as first argument. The second
argument is a function called by the scanner on reaching an end of file, and usually called yywrap. If it returns
true, the scanner calls the corresponding end-of-file action, otherwise it assumes that the input source has been
changed by using glex_switch, and scanning continues.
- glex_loc : glex_data -> intarray takes a glex_data, as created by glexdata, and returns a
mutable array of 4 integers (start line, start position, end line, end position) giving the location of the current
scanned token at every moment.
- glex_begin : glex_data ⋆ int -> unit takes a glex_data, as created by glexdata, and places
the scanner in the start condition given as second argument (those defined between < and > in the scanner file,
or INITIAL). Used inside the actions of the scanner, where the glex_data object is called yyd.
- glex_start : glex_data -> int returns the current start condition.
- glex_text : glex_data -> string takes a glex_data, as created by glexdata, and returns the
text matched by the current token. Used inside the actions of the scanner, where the glex_data object is
called yyd.
- glex_length : glex_data -> int returns the length of the current token. Equivalent to
size o glex_text, but faster.
- glex_flush : glex_data ⋆ glex_buffer - unitflushes the buffer given as second argument.
Mainly used in the scanner, when matching <<EOF>> (end of file), where you should call
glex_flush (yyd, glex_current_buffer yyd), so that the next time the scanner attempts to
match a token, it will first refill the buffer using glex_input, hopefully reading from another file.
- glex_current_buffer : glex_data -> glex_buffer returns the current buffer used by the
scanner whose state is given as first argument.
- glex_input : glex_data -> int reads the next character from the input stream, returning its ASCII
code, or -1 on end of file.
- glex_unput : glex_data ⋆ int -> unit pushes back the character given as second argument onto
the input buffer for future reading. Raises Glex 3 if too much is pushed back.
- glex_less : glex_data ⋆ int -> unit returns all but the first n characters of the current token
back to the input stream, where n is the second argument; they will be rescanned when the scanner looks for the
next match. Raises Glex 3 if too much is pushed back, Glex 6 if n is negative, Glex 7 if n is larger than
the length of the current token.
- glex_push_state : glex_data ⋆ int -> unit pushes the start condition given as second
argument onto an internal stack, held inside the glex_data given as first argument.
- glex_pop_state : glex_data -> unit pops the start condition from the internal stack of start
conditions, held inside the glex_data given as first argument; raises Glex5 if that stack is empty.
- glex_top_state : glex_data -> int returns the start condition on top of the internal stack of start
conditions, held inside the glex_data given as first argument; raises Glex5 if that stack is empty.
- glex_set_interactive : glex_data ⋆ bool -> unit sets the scanner to interactive mode if
second argument is true, to non-interactive mode otherwise. Interactive scanners are meant to be faster than
non-interactive ones, but may behave strangely on stdin.
- glex_interactive : glex_data -> bool returns the interactive status of the scanner.
- glex_set_bol : glex_data ⋆ bool -> unit can be used to control whether the current buffer’s
scanning context for the next token match is done as though at the beginning of a line. A true macro argument
makes rules anchored with '^' active, while a zero argument makes '^' rules inactive.
- glex_at_bol : glex_data -> bool returns true if the next token scanned from the current buffer
will have '^' rules active, false otherwise.
- glexmsg : int -> string returns a printable explanation of the number given in argument, which is typically the
argument of a Glex exception. The messages are:
0 | no action found |
1 | end of buffer missed |
2 | scanner input buffer overflow |
3 | scanner push-back overflow |
4 | out of memory expanding start-condition stack |
5 | start-condition stack underflow |
6 | less on negative argument |
7 | less on argument>glex_length |
8 | scanner jammed |
|
- glex_buffer : glex_data ⋆|[get:int -> string, getline : unit -> string, ...:'a]|
-> glex_buffer takes an internal scanner state and an input stream, and creates a buffer associated with the given
stream.
- glex_switch : glex_data ⋆ glex_buffer -> unittakes an internal scanner state and a buffer, as
typically created by glex_buffer, or as returned by a previous call to glex_current_buffer, and switches the
scanner’s input buffer so that subsequent tokens will come from the buffer in argument. It is recommended to use
glex_switch inside the yywrap function passed as second argument to glex_data in case the input stream should
change on reading an end of file.
- glex_tables : (glex_data -> 'a option) list ⋆ int ⋆ int ⋆ int list ⋆ int list
⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int ⋆ int
⋆ int -> 'a glex_tables is used internally in the scanner generated by the glex tool in order to build the
tables used by the glex function.
- glex : 'a glex_tables -> glex_data -> 'a is used internally in the scanner generated by the glex tool:
applied to the tables generated by glex_tables, it returns the actual scanner.
The supporting primitives for gyacc are the following.
- gyacc_data : 'c ⋆ ('c -> 'a)
⋆ 'b ⋆ 'b ref ⋆ intarray ⋆ (string list -> unit) -> ('a, 'b) gyacc_data
creates a new gyacc_data internal parser state object for use by a gyacc parser. The first argument is the
internal scanner state, the second argument is the scanner itself. The third argument is the default semantic
value. The fourth argument is a reference to the current token, which the scanner is expected to update. The fifth
argument is a 4-element integer array which the scanner may update to inform the parser of the location of the
last token read (see glex_loc). The sixth argument is an error reporting function. In case a parse error occurs,
this will be called with the list of all tokens that were expected at this point of the parsing. A typical parser is
created by:
let val f = infile ⟨my-file-name⟩
val yyd = glex_data (f, fn _ => true)
val yyloc = glex_loc yyd
val yyvalue = ref⟨default-value⟩
val hyd = gyacc_data (yyd, yylex,⟨default-value⟩, yyvalue, yyerror yyloc)
where yylex is a scanner generated by the glex tool, the ⟨default-value⟩ is some value as specifed in the
%union gyacc declaration in the grammar file, and yyerror is an error reporting function that you will have
written.
- gyacc_max_depth : ('a,'b) gyacc_data -> int returns the current maximum depth of the
parser stacks, where the machine state is given as argument. When the stacks grow larger than this limit,
Gyacc 0 is raised.
- gyacc_set_max_depth : ('a,'b) gyacc_data ⋆ int -> unit changes
the current maximum depth of the parser stacks to its second argument; the first argument is the machine state.
This is useful to avoid stack overflows in complex grammars.
- gyacc_char : ('a,'b) gyacc_data -> int returns the integer value of the current look-ahead
token. Error-recovery rule actions may examine this variable.
- gyacc_default_value : ('a,'b) gyacc_data -> 'b returns the default semantic value of the
current parser. The argument is the current machine state. It was created by gyacc_data, and the default
semantic value was given as its third argument then.
- gyacc_value : ('a,'b) gyacc_data -> int -> 'b returns the semantics value for the nth
component of the current rule, where n is given as second argument. The first argument is the internal
machine state. You should never need to use this function. Instead, writing $n in a semantic action expands to
gyacc_value hyd n; hyd is a variable that will always hold the current machine state, inside any semantic
action.
- gyacc_location : ('a,'b) gyacc_data -> int -> intarrayreturns the location of the nth
component of the current rule, where n is given as second argument. The first argument is the internal
machine state. You should never need to use this function. Instead, writing @n in a semantic action expands
to gyacc_location hyd n ; hyd is a variable that will always hold the current machine state, inside any
semantic action.
- gyacc_error_ok : ('a,'b) gyacc_data -> unit informs the parser to resume outputting error
messages. Must be applied to hyd inside semantics actions; hyd is a variable that will always hold the current
machine state, inside any semantic action. (Normally, and to prevent an outpouring of error messages, the
parser will output no error message for another syntax error that happens shortly after the first; only after three
consecutive input tokens have been successfully shifted will error messages resume, unless gyacc_error_ok
is called.)
- gyacc_clear_in : ('a,'b) gyacc_data -> unit clears the previous lookahead token. Must be
applied to hyd inside semantics actions; hyd is a variable that will always hold the current machine state, inside
any semantic action. Typically used in error handling routines that advance the input stream to some point where
parsing should once again commence. The next symbol returned by the lexical scanner is probably correct. The
previous look-ahead token then ought to be discarded with gyacc_clear_in.
- gyacc_recovering : ('a,'b) gyacc_data -> bool returns true if the parser is currently
recovering from an error, and that error messages are temporarily suppressed. Must be applied to hyd inside
semantics actions; hyd is a variable that will always hold the current machine state, inside any semantic action.
- gyacc_set_debug : ('a,'b) gyacc_data ⋆ bool -> unit will set the parser debug facility
on or off depending on the second argument.
- gyaccmsg : int -> string returns a printable explanation of the number given in argument, which is typically
the argument of a Gyacc exception. The messages are:
0 | parser stack overflow |
1 | illegal $ value |
2 | illegal @ value |
|
- gyacc_tables :(('a,'b) gyacc_data -> 'b option) list ⋆ string array ⋆ int ⋆ int
⋆ int ⋆ int ⋆ int ⋆ int ⋆ int list ⋆ int list ⋆ int list ⋆ int list
⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int list ⋆ int list
⋆ int list -> 'b gyacc_tables is used internally in the parser generated by the gyacc tool in order to
build the tables used by the gyacc function.
- gyacc : 'b gyacc_tables -> (int,'b) gyacc_data -> 'b optionis used internally in the parser
generated by the gyacc tool: applied to the tables generated by gyacc_tables, it returns the actual
parser.
4.12 Miscellaneous
Interesting values are:
- it : unit is a special variable containing the last expression evaluated at toplevel. It is implicitly redefined
every time an expression (not a declaration) is input at toplevel. Indeed, an expression e at toplevel is
conceptually equivalent to writing the declaration:
-
features : |[OS : string,
reftyping : string,
continuations : string set,
numbers : string set,
structures : string set,
profiling : string set,
debugging : string set,
floatformat : string,
version : int ⋆ int ⋆ string ⋆ int,
recompiled : string,
started : string,
maintenance : string,
creator : string]|
(may be extended in future versions) is a description of the features in the implementation and the session. Its use is
informative, but it may also serve for conditional compilation. The fields currently defined are:
- OS defines the operating system for which this GimML compiler was compiled; it may
be "Unix System V" or "Unix BSD" for now. Mac OS X versions are reported as
"Unix System V".
- reftyping defines the scheme used to handle types of imperative constructs in GimML. Its value
may be "imperative tyvars", meaning the Standard ML typing scheme [5], "weak tyvars",
meaning the Standard ML of New Jersey typing scheme with weak type variables (assumed throughout
this document), or "effect tyvars", meaning typing with more precise effect descriptions[13], or
yet another, not yet implemented.
Frankly, the latter was never fully implemented, and while the weak tyvars discipline remained the default
for years, and was buggy and not really required. Today, GimML uses the imperative tyvars discipline.
- continuations defines the style of continuation reifying mechanism. It is a set of strings, which is
currently empty: GimML does not implement any first-class continuations.
- numbers is a set of strings defining the styles of numerical objects we may handle: it may
contain "complex floating point", which defines complex floating point values with attached
dimensions and scales, and also "big rationals" which defines infinite precision rational numbers.
Only the first of these options has been implemented.
- structures defines the structuring mechanism used in GimML (modules). Its value is a set of strings
that may contain "Standard ML structures", indicating the structure system described in [5]
and [8], and "separate compilation a la CaML", indicating we have a separate compilation
module system a la CaML. The first (not yet implemented) provides a sound way of structuting name
spaces for types and values. The second only provides separate compilation facilities. It is described in
Section 5.5.
- profiling defines the profiling possibilities. It is a set that may contain:
- "call counts", meaning that the number of times each function is called is tallied,
- "proper times", meaning that a statistical estimation of the time spent in each function
(excluding any other profiled function it has called) is computed,
- "total times" (same, but including time spent in every called function),
- "proper allocations" and "total allocations". The last two options mean that
memory consumption is recorded, as well, but have not been implemented yet. See Section 5.4 for a
detailed description of the profiler.
- debugging defines the debugging support. It may contain:
- "standard", which means we can set breakpoints and read values at breakpoints, and we can
consult the call stack;
- and "replaying debugger", if the debugger may reverse execution up to specified control
points to show the origins of a bug. Only the last option has not been implemented in GimML. (See
Section 5.3 for a description of the debugger.)
- floatformat defines the format of floating point numbers. This is important since certain numerical features are
not provided in certain modes, and since the management of rounding may differ with a format or another. The
floatformat may be "IEEE 754" or "unknown".
- version defines the current version of GimML. It is composed of:
- a major version (currently 1), incremented each time a fundamental decision is made by the creator of
the language which affects deeply the way we program in it (deletions or modifications of values or
constructions in the language may qualify if they are important enough)
- a minor version (currently 0), incremented for any minor revision (addition of functionalities qualify;
deletion or modification of minor functions qualify too)
- the code status, a string which may be:
-
"alpha"
- , if only the creator (or close friends) have access to the implementation;
-
"beta"
- , if released for beta-testing to a group of people;
-
"gamma"
- , if deemed enough stable and bug-free to be used by final users;
-
"release"
- , if distributable.
- the revision (currently 11), incremented each time a bug is corrected or something is made better inside the
language, without changing its overall behavior.
- recompiled is the last date of recompilation of this revision of GimML.
- started is the date the current session was started. This may be used inside a GimML program to see if it has
been restarted from disk or if it has just been transferred to another session.
- maintenance is a description of the person or service the user should contact if there is any problem with
GimML.
- creator is the name of the creator of the language, that is, "Jean Goubault".
- times : unit -> time ⋆ time computes the user time (first component of the returned pair) and the system
time (second component). On Unix systems, this time takes its origin at the moment the current GimML process was
launched. On other systems, the origin may be the time at which the machine was last booted. These times are in
seconds, the default scale of dimension time, defined as:
dimension time(s)
- force : 'a promise -> 'a forces evaluation of a promise (created with the delay construct), and returns the
result. A delayed expression is evaluated at most once: the result of forcing is memoized.
- gc : unit -> unit forces a garbage collection—not necessarily a major one—to take place. This is usually not
needed. It may be used in alpha status to scramble the heap and see if it incurs any nasty memory management bug
in the GimML run-time. It may also be used prior to doing some benchmarks, but as of yet, there is no
guarantee that the garbage collection will indeed collect every free objects (this does not force a major
collection).
- major_gc : string -> unit forces a major garbage collection. This can be used in theory to get back some
memory that you know can be freed right away, but it takes time. Its main use is in conjunction with the
-gctrace <file-name> command-line option: then each garbage collection will write some statistics to <file-name>,
which can be used to detect space leaks. Typically, when you wish to see how much memory is allocated or freed in a call
to some function f, call major_gc "before f" before calling f, and major_gc "after f" afterwards. Then
consult <file-name>. There should be some information on the status of memory before f, beginning with a line of the
form:
==[before f]=========================
then sometime later another block of information on the status of memory after the call to f, beginning with a line of the
form:
==[after f]=========================
- abort : unit -> 'a aborts the current computation, and returns to toplevel without completing the current
toplevel declaration (actually, it invokes the toplevel continuation). This is the function that is internally
used when syntax errors, type errors or uncaught exceptions are triggered. Following the definition of
Standard ML, the current toplevel declaration is abandoned, so that its effect are nil, except possibly on the
store.
- quit : int -> 'a aborts all computations, and returns to the operating system with the return code in
argument.
- stamp_world : unit -> unit internally records a checksum and a copy of the current GimML environment. All
modules (see Section 5.5) are compiled in the latest stamped copy of the GimML environment, and are saved to disk with
the corresponding checksum. This way, when a module is reloaded, its checksum is compared to the checksums of all
stamped environments in memory, and an error is triggered if no environment exists with the same checksum (meaning
that we cannot safely use the module, either because it needed a richer or incompatible environment, or because the
version of GimML has changed too much). The environment is automatically stamped just before GimML starts
up.
- system_less : ''a ⋆ ''a -> bool is the system order. It returns true if the first argument is less than the
second in the system order. You may use this ordering as a total ordering on all data of equality admitting
types.
Another total ordering on equality admitting types is
fun system_less_2 (x,y) =
case {x,y} of {z,...} => z=x
but this is slower. It is also a different order.
- inline_limit : int ref is a reference to a limit on the size of functions that the compiler will accept to inline.
Its default value is 30. (Units are unspecified, but are roughly the numbers of basic operations and of variables in the
function.) In version 1.0, there is no compiler yet, but one that produces C source code to feed to your favorite C
compiler is in the works.