Chapter 5 Running the System
5.1 Starting up
HimML runs on Unix systems, and Amigas.
Previous versions also worked on Apple Macintoshes,
but this one lacks some functions. On a Mac, the only way to launch a
HimML session is to double-click on the HimML icon; a text window
opens, asking you to enter Unix-style command-line arguments: enter the
arguments to the HimML command, omitting the command's name itself.
From then on, all work happens in this console, at
toplevel, as on Amiga and Unix systems.
On Amigas and Unix boxes, type himml followed by a list of
arguments. The legal arguments are obtained by typing himml ?,
to which HimML should answer:
Usage: himml [-replay replay-file] [-mem memory-size]
[-cmd ML-command-string] [-init ML-init-string] [-path path]
[-col number-of-columns]
[-grow memory-grow-factor] [-maxgrow max-memory-grow-factor]
[-nthreads max-cached-threads] [-threadsize thread-cache-size]
[-maxcells max-cells] [?] [-gctrace file-name]
[-pair-hash-size #entries] [-int-hash-size #entries]
[-real-hash-size #entries]
[-string-hash-size #entries] [-array-hash-size #entries]
[-pwd-prompt format-string] [-core-trace] [-data-hash-size #entries]
[-c source-file-name] [-inline-limit max-inlined-size] [-nodebug]
[-- arguments...]
and exit. Launching HimML without any arguments is fine. There are other
HimML tools, used to compile, link and execute bytecode compiled files; they
are listed at the end of this section.
To load a file, the use keyword may be used; it begins a declaration,
just like val or type, that asks HimML to load a file and interpret it
as if it were input at the keyboard (except that it does not use stdin).
The path that use searches can be extended on the command
line by the -path switch, or inside HimML by changing
the contents of usepath : string list ref, which is
a reference to the list of volumes or directories in which to search
for files, from left to right.
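The left-to-right search can be pictured with a short Python model. This is only a sketch of the lookup order described above, not HimML's actual implementation; the function name find_in_usepath and the choice of Python are assumptions made for illustration.

```python
import os
import tempfile

def find_in_usepath(filename, usepath):
    """Return the first existing path for filename, scanning usepath left to right.

    Models the documented search order only; returns None if nothing matches.
    """
    for directory in usepath:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None

# Demonstration: two temporary directories that both contain "lib.ml".
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    for d in (first, second):
        with open(os.path.join(d, "lib.ml"), "w") as f:
            f.write("(* placeholder *)\n")
    found = find_in_usepath("lib.ml", [first, second])
    found_in_first = found == os.path.join(first, "lib.ml")
    missing = find_in_usepath("absent.ml", [first, second])
```

When the same file name exists in several directories, the earliest entry in the list wins, which is why the order of -path switches matters.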
The various options are explained below:
-replay <replay-file>:
whenever himml is executed, it records every single word it parses
in a replay file, named HimML.trace by default; this is useful
notably in debugging the language (see section 5.10),
but also to see a computation evolve (because HimML writes
its garbage collection statistics there as
ML comments). To replay such a file, the -replay option is
used with the name of the replay file. This option is incompatible
with all the others.
-core-trace instructs HimML to
create and fill in a replay file for use with the -replay option.
The replay file is named HimML.trace and is created in the current
directory. It records every single line typed on standard input or in
any included file (by use).
-mem <memory-size> sets the initial amount of
memory HimML takes for its heap. By
default, it is 400000 (400 Kbytes) on Unix systems, and this is
expanded on demand by chunks of 400 Kbytes. On Macs, where the
heap cannot be expanded, the initial memory size is the maximum
memory size; by default, it is 16000000 (16 Mbytes), and the
system tries to allocate a heap at most that large on launching.
If it cannot do so, it automatically reduces the figure until the
heap can be allocated or until it finds out there is not enough
memory to load. On Amigas, the policy is the same as on Unix systems.
-mem can be used only once on the command line.
-maxcells <max-cells> sets an
upper bound on the number of cells to allocate. This limit is not
strict: if the system feels it absolutely needs more memory, it
will grab some, but only a minimal amount of it, to avoid aborting
the current HimML process. By default, the number of cells is
unlimited, but it may be useful in some situations to limit it,
otherwise HimML will take as much memory as it wants to feel at
ease, without consideration of any other processes on the same
machine.
-maxcells can be used only once on the command line.
-nthreads <max-cached-threads> sets
the number of cache threads that are
available for the implementation of callcc and catch.
In the current implementation, the process stack is
split into several distinct stack regions that act as stacks for
threads, and callcc and catch use these as caches
for heap-based thread stacks (the more, the better). By default,
there are 10 such caches, i.e. the default option is -nthreads 10.
-nthreads can be used only once on the command line.
-threadsize <thread-cache-size> sets
the size in bytes of the cache threads that
are available for the implementation of callcc and catch.
In the current implementation, the process stack is
split into several distinct stack regions that act as stacks for
threads, and callcc and catch use these as caches
for heap-based thread stacks. The larger the thread caches, the
less often cache overflows will occur (in which case, catch
is used transparently to switch threads and resume computation in
a new empty thread); the smaller the thread caches, the faster
catch and especially callcc will be (the former might
have to copy thread caches to or from memory; the latter needs
to). By default, thread caches are 20000 bytes wide, with 4000
bytes left as a safety zone; i.e., the default option is
-threadsize 20000.
-threadsize can be used only once on the command line.
-pair-hash-size <#entries>
sets the number of slots in the global hash-table that is used
to keep a record of all shared pairs (couples, list cells,
basic blocks for sets, and so on).
You may wish to give it a value higher than the default (typically
23227) for memory-hungry programs. A rule of thumb is to evaluate
how many cells your program needs (one tuple is one cell, an
n-element list uses up n cells, an n-element set or map uses
up some 2n cells; or, more practically, run HimML with the
-gctrace option on, and look at the statistics on total
live homo cells), and then to divide this number by, say, 3.
On the other hand, having a big table for few values makes for
longer garbage collection times, so you may also wish to reduce
this value on programs that do not use much memory, or which only
allocate very short-lived data.
-int-hash-size <#entries>
is the same as -pair-hash-size, except it is concerned
with integers. (All integers are boxed and shared in HimML.)
You may wish to raise its value if your program builds and keeps
lots of integers in data-structures, or you may wish to decrease
its value if you use few integers or only allocate them for
temporary computations.
-real-hash-size <#entries>
is the same as -pair-hash-size, except it is concerned
with reals (floating-point values).
You may wish to raise its value if your program builds and keeps
lots of numbers in data-structures, or you may wish to decrease
its value if you use few numerical quantities or only allocate
them for temporary computations. (In particular, there is no need
to increase its value for ordinary number-crunching, except if you
are handling big matrices.)
-string-hash-size <#entries>
is the same as -pair-hash-size, except it is concerned
with strings.
You may wish to raise its value if your program builds and keeps
lots of strings, or does a lot of text processing. It is not
advised to reduce its value, as many strings are used internally
by the compiler and the type-checker.
-array-hash-size <#entries>
is the same as -pair-hash-size, except it is concerned
with arrays (in general, with n-tuples, n ≥ 3, or records
with at least two fields, or arrays with at least 3 entries).
You may wish to raise its value if your program builds and keeps
lots of tuples, records and arrays, or you may wish to decrease
its value in case you don't use many of these structures.
-cmd <ML-command-string> is used to launch HimML as
a batch process. It makes HimML execute the program
whose text appears in the string <ML-command-string>, and exit upon
termination. The string is parsed and executed as if it were input at the
keyboard; e.g., this might be of the form "use \"myfile.ml\";",
where `myfile.ml' contains declarations and a call to the main
function in the project. No welcome banner, no result of typing or
evaluation and no spurious message is printed; to print a message, you
must use the input/output functions. Moreover, standard
input is not used by the parser, and can be read
by input/output functions.
-cmd can be used only once on the command line, and is
incompatible with -init.
-init <ML-init-string>
is used to initialize a HimML process. It
makes HimML execute the program whose text appears in the string
<ML-init-string>, and then present the usual
toplevel interface. The string is parsed and
executed as if it were input at the keyboard, though standard
input is not used to this end; typically,
this string will be of the form "use \"myfile.ml\";", where
`myfile.ml' contains declarations.
-init can be used only once on the command line, and is
incompatible with -cmd.
-path <path> instructs HimML to look for files to load
(by the use keyword) in the directory <path> if it
didn't find them before. The current directory is always the first
searched directory. Then come the paths
specified on the command line, in the order in
which they appear.
-gctrace <file-name> is
an option that is off by default. Specifying it turns it on:
then, each time HimML triggers
a garbage collection, some information is written
to the specified file. This information is a sequence
of lines of the form:
=========================
GC...done:
total number of allocated memory cells [nCells] = 54272
allocated homo cells in young generation : 4608 (~73728 bytes,
not counting sharing overhead)
allocated hetero cells in young generation : 512 (~8192 bytes,
not counting sharing and contents overhead)
live homo cells in young generation : 4019
live hetero cells in young generation : 25
total live homo cells : 4019
total live hetero cells : 25
strings freed : 43 bytes
patcheckbits freed : 0 bytes
stacks freed : 360 bytes
vectors (environments, arrays, tuples, records) freed : 6608 bytes
16 externals freed
garbage collection time = 0.089s.
There were 0 old generations, plus one new;
there are 1 old generations, plus one new.
Number of stacks (threads) allocated since startup: 7
Number of allocated bytes of temporary (stack) storage: 52400
This output means the following: one garbage collection has just
been done (if the system crashes during a GC, you will just get
GC...), the number of cells in the system is 54272, of
which 4608+512 = 5120 are considered young (i.e., will be
considered as highly likely to become garbageable at the next GC);
among these, 4019+25 = 4044 are live, i.e., not free. And the
system as a whole also contains 4019+25 = 4044 live cells. The
purpose of the “homo” and “hetero” figures is to separate
between homogeneous cells (couples, integers, maps, reals,
complexes, etc.) and heterogeneous cells (which point to
non-first-class data: strings, which point to an area of
memory where their contents lie, or arrays, n-tuples with
n ≥ 3, or records with at least 2 fields, which are allocated
as a cell pointing to an internal array of values). For hetero
cells, the amount of additional memory freed is shown: 43 bytes
of strings, none of patcheckbits (an internal structure of the
compiler), 360 bytes of stacks (i.e., of local thread
structure), 6608 bytes of vectors, and 16 externals were
freed. Externals are interfaces between HimML and non-HimML data,
typically files. The time taken to do this garbage collection was
0.089s., and the heap had only one generation (the so-called
young generation) before garbage collection, and is segmented in
two generations afterwards. It allocated 7 local threads since
startup (it allocates at least one at each toplevel command), most
of which have been freed since then. And it allocated and freed
so many bytes of temporary storage (typically for local HimML
variables during execution of code), of which 52400 remain
allocated at the end of garbage collection.
Calling major_gc invokes a major collection, and the
argument passed to major_gc is printed at the start of the
information block, e.g.:
==[test]=========================
GC...done:
total number of allocated memory cells [nCells] = 54272
allocated homo cells in young generation : 36864 (~589824 bytes,
not counting sharing overhead)
allocated hetero cells in young generation : 512 (~8192 bytes,
not counting sharing and contents overhead)
live homo cells in young generation : 6145
live hetero cells in young generation : 14
total live homo cells : 6145
total live hetero cells : 14
strings freed : 29 bytes
patcheckbits freed : 0 bytes
stacks freed : 240 bytes
vectors (environments, arrays, tuples, records) freed : 156204 bytes
16 externals freed
garbage collection time = 0.119s.
There were 1 old generations, plus one new;
there are 1 old generations, plus one new.
Number of stacks (threads) allocated since startup: 11
Number of allocated bytes of temporary (stack) storage: 118832
-col <number-of-columns> specifies the width of the
screen of a HimML session, in
characters (by default, 80). This is used by the HimML
toplevel when it prints types and values, and by
the debugger.
-grow <memory-grow-factor>
specifies the initial ratio of the
size of the heap to the size occupied by live data that the
garbage collector tries to maintain.
By default, it is 2.0. The greater the number, the less time
will be spent in garbage collection overall, but the more time a
single garbage collection may take. This number cannot go lower
than 1.0, and evolves across garbage collections to adapt to the
evolving nature of the computations.
-maxgrow <max-memory-grow-factor>
puts an upper bound on the ratio of the size of the heap to the size of
the live data space. By default, it is 8.0.
-pwd-prompt followed by a format string
tells HimML that it should use a prompt that mentions the HimML
current directory (as modified by the HimML cd function and
read by the HimML pwd function). The HimML Emacs mode uses
a format string starting with
the escape character and continuing with |%s|%s: the first %s
will be replaced by the HimML current directory, the second by the
current prompt (normally, >). This is used by Emacs to
synchronize its current directory with HimML's in HimML mode.
-inline-limit <limit> installs
a new size limit that the compiler reads when deciding whether it
should inline functions or not. This is essentially the same
as setting inline_limit to
<limit> at HimML initialization time.
-c <himml-source-file> compiles
the given source file, and produces a compiled module file: see
Section 5.6.3.
-nodebug disables the debugger: typing
control-C will still interrupt the currently running HimML program,
but instead of entering the debugger, it will stop the program.
Moreover, raised exceptions won't enter the debugger either.
Finally, using -nodebug will direct the bytecode compiler
not to output any debugging information. This can be used to
produce stripped modules (i.e., without any debugging information),
typically to save space or to make reverse engineering
of production code more difficult.
Note that, if you compile a module with -nodebug, and execute
it under HimML (with the debugger on), then typing control-C
or raising non-benign exceptions will enter the debugger, but
the debugger won't be able to extract any information from
the compiled code.
-- stops parsing of all options, and instructs
HimML that the rest of the command-line consists of options and
arguments that will be available from HimML programs by looking
at the list args().
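The rule of thumb given above for -pair-hash-size (take the number of live homo cells and divide by about 3) can be applied mechanically to a -gctrace file. The following Python sketch is one possible helper, not part of the HimML distribution: the function names are hypothetical, the sample text is an abridged copy of the trace block shown above, and taking the peak value over all collections is an assumption.

```python
import re

# Abridged copy of one -gctrace block from the text above.
SAMPLE = """\
=========================
GC...done:
total number of allocated memory cells [nCells] = 54272
live homo cells in young generation : 4019
live hetero cells in young generation : 25
total live homo cells : 4019
total live hetero cells : 25
garbage collection time = 0.089s.
"""

def total_live_homo_cells(trace_text):
    """Return the 'total live homo cells' figure of every GC block, in order."""
    pattern = re.compile(r"^total live homo cells\s*:\s*(\d+)", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(trace_text)]

def suggested_pair_hash_size(trace_text, divisor=3):
    """Apply the rule of thumb: peak live homo cells divided by about 3."""
    counts = total_live_homo_cells(trace_text)
    if not counts:
        raise ValueError("no GC statistics found in trace")
    return max(counts) // divisor

suggestion = suggested_pair_hash_size(SAMPLE)
print(suggestion)
```

For a trace this small the result is far below the default of 23227, so the default would simply be kept; the figure only becomes interesting for memory-hungry runs.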
5.2 Compiling, Linking, Finding Dependencies
As said earlier, the HimML distribution includes other tools to
compile, link and run bytecode compiled files:
himml -c compiles a module (see
Section 5.6 for details). That is, himml -c foo.ml
compiles "foo.ml", and produces a bytecode file "foo.mlx".
This does exactly the same thing as typing #compile "foo.ml"
at the HimML prompt; typing open "foo.ml" does almost the same
thing, except HimML will then print a list of all types and identifiers
defined in "foo.ml", and will declare them in the current
toplevel.
himmllnk links a series of bytecode files
into one; this works both as a linker and as an archiver. Syntax is:
himmllnk archive-file file1.mlx ... filen.mlx
to create an archive file (in which case it is recommended to give it
a .mla extension) or a bytecode executable file.
himmlrun runs the HimML bytecode
interpreter: himmlrun "foo.mlx" followed by arguments
will execute the main () function in file "foo.mlx"
(that the name ends in .mlx is, by the way, totally
irrelevant), and the HimML args() function will get back the
command-line arguments. There is in fact no need to explicitly call
himmlrun, as (at least on Unix) launching "foo.mlx"
will invoke himmlrun automatically, if properly installed.
himmldep computes dependencies between
HimML source files. This is used in building
makefiles, as used by make.
A typical use of himmldep is to run
himmldep *.ml >.depend
at the (Unix) command-line. This will produce a file
.depend listing all dependencies between files, which can
be used by make to help reconstruct all proper .mlx files.
In fact, the standard makefile for projects using HimML is
as follows:
%.mlx : %.ml
	himml -c $<
OBJS = a.mlx b.mlx c.mlx
prog: $(OBJS)
	himmllnk prog $(OBJS)
clean:
	-rm *.mlx prog
cleanall : clean
	-rm *~
depend:
	himmldep *.ml >.depend
include .depend
The first line (which works only with GNU make) tells make that
to build or rebuild any bytecode file, say foo.mlx, it
should call himml -c foo.ml. The OBJS = line is a
macro definition, stating what bytecode files we would like to
build. The prog: line states the main rule, which is to
build a HimML executable file or a library file prog, by
calling himmllnk to link all bytecode files in OBJS.
The clean and cleanall targets are meant to remove
compiled files, and are called with make clean or
make cleanall respectively. Dependencies are recomputed by
typing make depend, which creates dependencies in the
.depend file; the latter is in turn included in the current
makefile using GNU make's include directive.
If you don't have GNU make, then you cannot include
.depend, and you will have to copy its contents manually at
the end of makefile. Additionally, the %.mlx : %.ml
line should be replaced by:
.SUFFIXES: .ml .mlx
.ml.mlx:
	himml -c $<
5.3 Debugger
HimML contains a debugger, as shown by consulting the set
#debugging features, which should be non-empty. It can be
called by the break function:
break : unit -> unit
enters the debugger.
Another way of entering the debugger is when an
exception is raised but not caught by any handler.
On entry, the debugger shows how it was entered by one of two
messages: stop on break (we entered the debugger through break,
or by typing control-C or DEL when evaluating an expression), or stop at
... (we entered the debugger at a breakpoint located just
before the execution of an expression).
In any case, the debugger enters a command loop, under which you can
examine the values of expressions, see the call stack, step through
code, set breakpoints, resume or abort execution. The debugger
presents a prompt, normally (debug). It then
waits for a line to be typed, followed by a carriage return, and
executes the corresponding command. These commands are:
The way that the interpreter gives control to the debugger is by
means of code points, which are points
in the code where the compiler adds extra instructions.
These instructions usually do nothing. When you set
a breakpoint, they are patched to become the equivalent of
break. Alternatively, these instructions also enter
the debugger when we are single-stepping through some code.
These instructions are added by default by the compiler, but
they tend to slow the interpreter. If you wish to dispense
with debugging information, you may issue the directive:
(*$D-*)
which turns off generation of debugging information (of code points).
If you wish to reinclude debugging information, type:
(*$D+*)
These directives are seen as declarations by the compiler,
just like val or type declarations. As such,
they obey the same scope rules. It is recommended to
use them in a properly scoped fashion, either inside a let
or local expression, or confined in a module.
5.4 Profiler
The way that the interpreter records profiling information is by
means of special instructions that do the tallying.
These instructions are not added by default by the compiler,
since they tend to slow the interpreter by roughly a factor of 2,
and you may not wish to gather profiling information of every piece
of code you write. To use the profiler, you first have to issue the directive:
(*$P+*)
which turns generation of profiling instructions on.
The functions that will be profiled are exactly those that
were declared with the fun or the memofun keyword.
If you wish to turn it off again, type:
(*$P-*)
These directives are seen as declarations by the compiler,
just like val or type declarations. As such,
they obey the same scope rules. It is recommended to
use them in a properly scoped fashion, either inside a let
or local expression, or confined in a module.
Usually, you will want to profile a collection of modules.
It is then advised to add (*$P+*) at the beginning of
each. Time spent in non-profiled functions will be taken
into account as though it had been spent in their profiled callers.
Then, the HimML system provides the following functions to help
manage profiling data:
report_profiles : unit
  -> |[location : string * string * (int * int) * (int * int),
       ncalls : int,
       proper : |[time : time * time,
                  ngcs : int, gctime : time * time]|,
       total : |[time : time * time,
                 ngcs : int, gctime : time * time]|
      ]| set
returns the set of all profiling data that the interpreter has accumulated
until now on all profiled functions. This is a dump of all internal
profiling structures of the interpreter.
The location field describes where the function that is profiled
is located. Its first component is the function name, its second component
is the file name where this function was defined (or the empty string
"" if this function was defined at the toplevel prompt),
its third and fourth components are respectively the starting and ending
positions of the definition in this file, as line/column pairs.
Note that the function name alone is not enough to denote accurately
which function is intended, as you can build anonymous functions
(by fn, for example): it was chosen to let these functions
inherit the name of the function in which they are textually enclosed.
The file name and positions in the file are then intended to give
a more precise description of which function is being described.
The ncalls field shows how many times this function was called.
The proper and total fields contain statistics in the
same format: time is the time spent in the function (in the
format returned by times, i.e. user time and system time),
ngcs is the number of garbage collections that were done while
executing the function (this gives a rough idea of the memory
consumption of the function), and gctime is the time spent
garbage collecting in this function. While the statistics in
proper only include information on what happened when the
interpreter was really executing the function, total also
includes the times spent executing all its callees.
report_profiles only reports statistics for those functions
that were called at least once (or at least once since the last
call to reset_profiles).
report_profiles is pretty low-level, and is intended to be used
as a building block for more useful report generators. One such
generator is located in "Utils/profile.ml". To get a meaningful
report, execute your program, then type:
open "Utils/profile";
prof stdout;
to get a report on your console, or:
fprof "prof.out"
to get a report in a file named "prof.out" in the current directory.
(To open the module "profile" on a Macintosh,
write open "Utils:profile"; in general, it's better to
modify the path to include the Utils directory, and
not bother with directory names.)
reset_profiles : unit -> unit
resets all profile information, so that a new profiling round
can be launched on a clear basis.
clear_profiles : unit -> unit
purges the system of all profiling instructions. That is,
executing the same code again won't generate any profiling information;
the code should go a bit faster, but not as fast as if it
had been compiled without profiling in the first place (it patches the
profiling instructions to become no-ops).
What can you do with profile information? The main goal is to detect
what takes up too much time in your code, so as to focus your efforts
of optimization on what really needs it. A good strategy to do this
is the following:
- Identify the functions in which the most (proper) time is spent,
and optimize them.
- If the latter are already optimized, or do almost nothing, then
look at the number of times they are called. Usually, such functions
take time just because they are called often; then, identify their
callers and rewrite them so that they don't go through this subroutine
over and over again (i.e., take shortcuts in common situations).
- Finally, on rare occasions, strange cases may occur: it may be the case that some
function appears to be more costly than another one, which does the
same amount of work or more. In general, this is because the interpreter
needed to do some extra work behind the scenes. Typically, this happens because
it keeps on getting a full stack when entering this function,
and has to switch threads (which is fast, but takes some time when
done repetitively); in this case, try to make your programs less
recursive (though this is really a misfeature on HimML's part).
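The first two steps of this strategy amount to sorting profile records by proper time and then comparing call counts. The Python sketch below illustrates this on made-up data: the records only mimic the field layout of report_profiles, times are assumed to be plain seconds as (user, system) pairs, and the helper names are hypothetical.

```python
def make_record(name, ncalls, user, system):
    """Build a record shaped like one element of report_profiles' result."""
    return {
        "location": (name, "demo.ml", (1, 0), (2, 0)),
        "ncalls": ncalls,
        "proper": {"time": (user, system)},
    }

def proper_seconds(record):
    """Proper time of a function: user time plus system time."""
    user, system = record["proper"]["time"]
    return user + system

def hot_spots(records, top=3):
    """Step 1: the functions in which the most proper time is spent."""
    return sorted(records, key=proper_seconds, reverse=True)[:top]

def seconds_per_call(record):
    """Step 2: a cheap function that is called very often has a tiny value here."""
    return proper_seconds(record) / record["ncalls"]

# Hypothetical figures, for illustration only.
records = [
    make_record("parse", 10, 0.50, 0.02),
    make_record("lookup", 200000, 1.90, 0.10),
    make_record("main", 1, 0.01, 0.00),
]

ranking = [r["location"][0] for r in hot_spots(records)]
lookup, parse = records[1], records[0]
called_often = seconds_per_call(lookup) < seconds_per_call(parse)
```

Here lookup tops the ranking although each call is cheap, which points at its callers rather than at lookup itself. Division by ncalls is safe because report_profiles only reports functions called at least once.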
5.5 Conditional Compilation
HimML offers a feature known as conditional
compilation. A language
like C, through the use of its preprocessor,
provides directives named #if...[#else]...#endif,
which may be used to compile one chunk of code or another, depending
on the condition after the #if keyword being true or false.
Because HimML cannot have exactly the same features on each platform,
this is a desirable feature to have to ensure portability
of HimML applications across different OSes. This is already the main
use of these directives in C: we can use the fork()
call on Unix
systems, but it is usually impossible even to emulate on other systems,
like the Amiga or the Macintosh.
This portability concern also extends to different versions of HimML,
even on the same platform. Some versions of HimML use a type system
slightly different from other versions, or some may include a module
system, or some may be interfaced with a graphical user interface
library, and so on. We want the HimML programmer to be able to write
code that will correctly use the available facilities on each
version of HimML. Detecting these differences, whether related
to the processor type, operating system, or HimML version, can be
done by examining the value of the special variable features.
Finally, having conditional compilation directives allows one to
parameterize one's applications with respect to a file of global
declarations. For example, if you want one version of your code
with and one without debugging code interspersed, you might
define a variable to be true in one case, false in the other, and
then test it at all points where conditional behaviour should occur.
There is no such feature in Standard ML, probably because having
conditional compilation would pose too many problems in general.
Using if...then...else won't work in
general. Consider:
if "callcc" inset #continuations features
then callcc (fn k => ...)
else ...
which is intended to test whether we have callcc
in the implementation, and if so, to use it. This presents
two problems. First, the type checker will be run on
the whole expression, not just the part that will indeed
be executed: if callcc is not provided in the
implementation, then the type-checker will just fail
on the second line of the example. Then, even if we could
overcome this problem, we would need an optimizer to recognize
that the test expression above is actually a constant,
and that only one branch of the test has to be compiled;
and there are versions of HimML with no such optimizer
(actually, none yet has one).
Instead, you should use an alternative conditional, built with
#if...[#elsif...][#else]...#endif,
namely the same preprocessor directives as in C. The #else
and #elsif parts are optional, but don't forget the
#endif: HimML needs to see it to know that the #if
clause is over. Each keyword must lie at the beginning of the
line. If this keyword is #if or #elsif, then the
rest of the line is taken to be a test expression, in HimML
syntax. Otherwise, the rest of the line is ignored. (So, don't
write HimML code on the same line as an #else or an #endif!)
This conditional works mostly as in C, with the following important
difference: the expression tested by #if or #elsif,
which is the one after the keyword and extending to the end of the
line, can be any HimML expression, which is evaluated to determine
its truth value. (In C, we can only test for what the preprocessor
knows, namely #defines and a few arithmetic comparisons.)
But note well that these expressions are always evaluated in
the toplevel environment, not the current environment.
For example:
fun f x =
#if x=3
frozzle ()
#else
foo ()
#endif
;
does not evaluate x as the value of the argument to
f, which cannot be guessed at compile-time. Instead, the value
is taken in the toplevel environment. Most likely, x won't be
defined in the toplevel environment. This won't cause an error,
though: if the test expression is ill-typed or gives rise to an
uncaught exception, then it is assumed that its value is false.
So, in this case, it is likely that f x will be compiled as
just calling foo (), although the possibility that it will
be compiled as frozzle () is not zero. It is doubtful
that this is the intended code.
The same problem happens in the following less clear situation:
let val x=3
in
#if x=3
frozzle ()
#else
foo ()
#endif
end;
because the local binding introduced by let is not a toplevel
binding.
The final case where an unexpected behaviour can occur is when
writing toplevel declarations not ended by a semicolon (;).
For example:
val x=3 (* and not 'val x=3;' *)
#if x=3
frozzle ()
#else
foo ()
#endif
;
will also have the same problem. The reason is that the HimML parser
only processes declarations when it sees a semicolon or an end of file.
Then, it parses, type-checks, compiles and evaluates all declarations
before this semicolon or this end of file, as a whole. This is for
efficiency reasons. Then, when it tries to evaluate the #if
test above, the previous declaration val x=3 has not yet been
processed, since it did not end with a semicolon.
The keywords #if, #else, #elsif, #endif
must lie at the beginning of the line to be taken into account as
conditional compilation directives. This is because #elsif and
#endif are otherwise perfectly valid HimML expressions, namely
the functions that select the field named elsif,
resp. endif, from a record argument. If you wish to use the
#elsif field selection function, you can do it by invoking
(#elsif), its parenthesized version. It is strongly recommended
not to use these keywords as fields, in fact.
For indenting reasons, any number of spaces or tabulations are
allowed before the sharp (#) sign, and between the sharp
sign and the if, else, elsif, or endif part.
You may test the conditional compilation directives in an interactive
toplevel session, but it is generally advised not to use the
conditional compilation directives at the toplevel prompt. If you have
no prompt after typing return, most probably there is an unclosed
#if expression (type #endif on a line by itself), or you
haven't typed a semicolon at the end of the line, so that the parser
does not know that you have completed your input.
Finally, the test expression e may have side-effects, or may loop,
or may raise an exception, in which case the whole
when
clause has side-effects, loops or raises the
exception, but it is advised to avoid these behaviours.
Examples of uses of #if
are as follows:
-
Testing an operating system dependency:
#if #matches (regexp "Unix") (#OS features)
val parentdir = ".."
#elsif #matches (regexp "Amiga") (#OS features)
val parentdir = "/"
#elsif #matches (regexp "Mac") (#OS features)
val parentdir = "::"
#else #put stderr "Unknown OS.\n"; #flush stderr (); quit 1
#endif
;
to define how the parent directory is named in a portable way,
for example. (Notice that we have not put the final semicolon on
the same line as the #endif
, where it would just be ignored.)
- Testing a feature in HimML:
#if #reftyping features = "weak tyvars"
val fifo : '2a -> '2a -> '2a ref = ...
#else val fifo : '_a -> '_a -> '_a ref = ...
#endif
;
This illustrates a source of difficulty with typing imperative
features. Although the second declaration for fifo
would also be valid in the first case (when we use Standard
ML of New Jersey's notion of weak type variables), it would
actually expand into fifo : '1a -> '1a -> '1a ref
,
which does not have the required type. In fact, when such type
declarations bother you, just drop them: the type inferencer
is smart enough to find them all by itself. The problem
is only real when declaring, say, datatypes with imperative
features.
- Making several versions of a program: assume you want
a testing version of your program that runs tests, prints
statistics to the output, and so on, and a production version
in which all of this is gone. The easiest way to do this
is to begin your program with a declaration of the form
val testing = true
(for testing) or val testing = false
(for the production version). Then all calls to tests
or statistic printing routines can be compiled conditionally as follows.
Say you want to print your statistics by using a function
print_statistics
, with two arguments this
and that
.
You would declare it as:
#if testing
fun print_statistics (this, that) =
  ...definition of your statistic printing routine...
#endif
;
and then use it as, for example:
fun frozzle (x,y,z) =
  ( ...do something...
#if testing
  ; print_statistics (this, that)
#endif
  )
The syntax is not particularly elegant, but this is due to
the interaction between the #if
syntax and the syntax
of sequences of statements in Standard ML: we need to put
the semi-colon just before print_statistics
above, since
if testing
is false, leaving it at the end of the
previous line would yield a syntax error. We also need
the parentheses, since otherwise the parser would believe
that the semicolon ends the definition of frozzle
.
5.6 Separate Compilation and Modules
The main goal of the HimML module system is to implement separate
compilation, where you can build your
program as a collection of modules that you can compile independently
from each other, and then link them together.
The HimML module system was designed so that it integrated well with
the rest of the core language, while remaining simple and intuitive.
For the time being, the HimML module system does not provide the other
feature that modules are useful for, namely management of name spaces.
The module system of Standard ML seems best for this purpose, although
it is much more complex than the HimML module system.
Consider the following example. Assume that your program consists
naturally of three files, a.ml
, b.ml
and c.ml
.
The most natural way of compiling it would be to type:
use "a.ml";
use "b.ml";
use "c.ml";
But, b.ml
will probably use some types and values that were
defined in a.ml
, and similarly c.ml
will probably use
some types and values defined in a.ml
or b.ml
. In
particular, if you want to modify a definition in a.ml
, you
will have to reload b.ml
and c.ml
to be sure that
everything has been updated.
This is not dramatic when you have a few files, and provided they are
not too long. But if they are long or many, this will take a lot of
time. Separate compilation is the cure: with it, you can compile
a.ml
, b.ml
, and c.ml
separately, without having
to reload other files first.
The paradigm that has been implemented in HimML is close to that used
in CaML, and even closer in spirit to the C language. In particular,
modules are just source files, as in C. Two new keywords
are added to HimML: extern
and
open
. Note that the Standard ML module system
also has an open
keyword, but there is no ambiguity as it is
followed by a structure identifier like Foo
in Standard ML, and by
a module name like "foo"
in HimML.
The extern
keyword specifies some type or some value that we
need to compile the current file, telling the type-checker and
compiler that it is defined in some other file. Otherwise, if you
say, for example, val y=x+1 in b.ml, while x is defined in a.ml,
the type-checker would complain that x is undefined when compiling
b.ml. To avoid this, just precede the declaration for y by:
extern val x : int
This tells the compiler that x
has to be defined in some other
file, and that it will know its value only when linking all files
together. This is called importing the value
of x
from another module.
Not only values, but datatypes can be imported:
extern datatype foo
imports a datatype foo
. The compiler will then know
that some other module defines a datatype (or an abstype) of
this name. However, it won't know whether this datatype admits
equality, i.e. whether you can compare objects of this datatype by =
.
If you wish to import foo
as an equality-admitting datatype,
then you should write:
extern eqtype foo
Of course, if foo
is a parameterized datatype, you have
to declare it with its arity, for example:
extern datatype 'a foo
for a unary (not necessarily equality-preserving) datatype, or
extern eqtype ('a, 'b) foo
for an equality-preserving datatype with two type parameters.
Finally, dimensions can be imported as well:
extern dimension foo
imports foo
as a dimension (type of a physical quantity,
typically).
Given this, what does the following mean? We write a file
"foo.ml"
, containing:
extern val x:int;
val y = x+1;
Then this defines a module that expects to import a value
named x
, of type int
(alternatively, to
take x
as input), and will then define a new value
y
as x+1
and export it.
Try the following at the toplevel (be sure to place file
"foo.ml"
above somewhere on the load path, as referenced by the
variable usepath
):
val x = 4;
open "foo";
You should then see something like:
x : int
y : int
x = 4
y = 5
Opening "foo"
by the open
declaration above proceeded
along the following steps:
-
First,
open
precompiled the textual description of the module "foo.ml"
into an object module "foo.mlx"
in the
same directory as "foo.ml"
. This object module contains, in
a binary format, all information that was in "foo.ml"
, plus
the types it has computed, as well as a representation of the
interpreted code for what's in "foo.ml"
.
In fact, open
will recompile .mlx
files from
the corresponding .ml
files whenever one of the .ml
files on which it depends has been updated, so as to maintain
consistency between the textual versions of the modules (in
.ml
files, usually) and their precompiled versions
(the .mlx
files). On the other hand, if an up-to-date
.mlx
file is present, it won't recompile it, and will
proceed directly to the next step.
- Then,
open
opened the
module by loading the contents of foo.mlx
.
- Finally,
open
linked the
module by resolving all extern
declarations. In this
example, open
checked that there was a variable named
x
in the environment in which we issued the open
declaration, checked that its type was int
(to be more
precise, that its type could be instantiated to int
), and defined the value of x inside the module to be the same as
the value of x in the outside environment.
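The remark on instantiation in the last step can be illustrated as follows. This is only a sketch: the module name "bar" and the function id are invented for the example, and the behaviour described is what the linking rule above suggests rather than a recorded session. Since 'a -> 'a can be instantiated to int -> int, the link step should accept the toplevel id as the definition of the imported one.

```sml
(* File "bar.ml" (hypothetical): imports id at type int -> int. *)
extern val id : int -> int;
val z = id 3;

(* At the toplevel: id is polymorphic, of type 'a -> 'a. *)
fun id x = x;
open "bar";
```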
A variant on open is open*
, which does just the same,
except it does not try to recompile the source file "foo.ml"
: it just
assumes that "foo.mlx"
is up to date, or fails. This is useful when
shipping compiled bytecode modules, and is used internally in the
himmlpack
and himmllnk
tools.
Assume now that we didn't have any value x
handy; then
open
would still have precompiled and opened the resulting
object module "foo.mlx"
. Only, it would have failed to link it
to the rest of the system. If you wish to just compile
"foo.ml"
without loading it and linking it, issue the
directive:
#compile "foo.ml"
at the toplevel. (The #
sign must be at the start of the
line.) This compiles, or re-compiles, "foo.ml"
and writes the
result to "foo.mlx"
.
5.6.2 Header Files
Another problem pertaining to separate compilation is how to share
information between separate modules. For example, you might want to
define again three modules a.ml
, b.ml
and c.ml
,
where a.ml
would define some value f
(say, a function
from string
to int
), and b.ml
and c.ml
would use it.
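A first way to do this would be to write:
- in a.ml:
fun f name = ...
- in b.ml:
extern val f : string -> int;
... f "abc" ...
- and in c.ml:
extern val f : string -> int;
... f "foo" ...
but this approach suffers from several defects. First, no check is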
done that the type of f
is the same in all three files; in
fact, the check will eventually be performed at link time, that is,
when doing:
open "a";
open "b";
open "c";
but we would rather be warned when first precompiling the modules.
Then, whenever the type of f
changes in a.ml
, we
would have to change the extern
declarations in all other
files, which can be tedious and error-prone.
The idea is then to do as in the C language, namely to use one header file common to all three modules.
(This approach still has one defect, and we shall see later on
how it should really be done.) That is, we would define an auxiliary
file "a_h.ml"
(although the name is not meaningful, the
convention in HimML is to add _h
to a module name to get
the name of a corresponding header file), which would contain
only extern
declarations. This file, which contains
in our case:
extern val f : string -> int;
is then called a header file.
We then write the files above as:
-
a.ml
:
use "a_h.ml";
fun f name = ...
- in
b.ml
:
use "a_h.ml";
... f "abc" ...
- and in
c.ml
:
use "a_h.ml";
... f "foo" ...
This way, there is only one place where we have to change
the type of f
in case we wish to do it: the header
file a_h.ml
.
What is the meaning of using a_h.ml
in a.ml
, then?
Well, this is the way that type checks are effected across modules.
The meaning of extern
then changes: in a.ml
, f
is
defined after having been declared extern in a_h.ml
, so that
f
is understood by HimML not as being imported, rather as being
exported to other modules. This allows HimML to type-check
the definition of f
against its extern
declaration, and
at the same time to resolve the imported symbol f
as the
definition in a.ml
. This is more or less the way it is done in
C.
One thing that still does not work with this scheme, however, is how we
can share datatypes. This is because datatype declarations are generative. Try the following. In a_h.ml
, declare a new
datatype:
datatype foo = FOO of int;
extern val x:foo;
In a.ml
, define the datatype and the value x
:
use "a_h.ml";
val x = FOO 3;
Now in b.ml
, write:
use "a_h.ml";
val y = x : foo;
Then, open "a"
, then "b"
. This does not work: why? The
reason is that the definition of the datatype foo
in
a_h.ml
is read twice, once when compiling a.ml
,
then when compiling b.ml
, and that both definitions created
fresh datatypes (which just happen to have the same name foo
).
These datatypes are distinct, hence in val y = x : foo
,
x
has the old foo
type, whereas the cast to foo
is to the new foo
type.
The remedy is not to use header files, but to open them instead.
So write the following in a.ml:
open "a_h";
val x = FOO 3;
and in b.ml
:
open "a_h";
val y = x : foo;
Opening a_h
produces a compiled module a_h.mlx
, which
holds the definition for foo
and the declaration for x
.
In the compiled module, the datatype declaration for foo
is
precompiled, so that opening a_h
does not re-generate a new
datatype foo
each time a_h
is opened, rather it
re-imports the same one.
Technically, imagine that fresh datatypes are produced by pairing
their name foo
with a counter, so that each time we type
datatype foo = FOO of int
at the toplevel, we generate a type
(foo
, 1), then (foo
, 2), and so on. This process
is slightly changed when compiling modules, and the datatype name
is paired with the name of the module instead, say, (foo
, a_h
).
Opening a_h
twice then reimports the same datatype.
The same works for exceptions, except there is no extern exception
declaration. The reason is just that it would do exactly the same as what
exception
already does in a module. If you declare:
exception Bar of string;
in a_h.ml
, and import a_h
as above, by writing open "a_h"
in a.ml
and b.ml
, then both a.ml
and b.ml
will be able to share the exception Bar
. Typing the following
in a_h.ml
would not work satisfactorily, since Bar
would
not be recognized as a constructor in patterns:
extern val Bar : string -> exn;
That is, it would then become impossible to write expressions such as:
f(x) handle Bar message => #put stdout message
in a.ml
. However, if you don't plan to use pattern matching on
Bar
, then the latter declaration is perfectly all right.
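Putting the pieces together, a shared exception might be set up as in the following sketch. The three files are shown in one block for brevity; the body of f and the messages are placeholders chosen for the example, not taken from an actual program:

```sml
(* a_h.ml: header module, to be opened rather than used *)
exception Bar of string;
extern val f : string -> int;

(* a.ml: defines (exports) f, and may raise Bar *)
open "a_h";
fun f name = if name = "" then raise Bar "empty name\n" else 1;

(* b.ml: imports f, and matches on the shared constructor Bar *)
open "a_h";
val n = f "" handle Bar message => (#put stdout message; 0);
```

Because Bar is declared with exception in the header module, it remains a constructor in both a.ml and b.ml, which is what makes the handler pattern in b.ml legal.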
The following commands are available in HimML:
It is easier to compile modules by typing the following under the shell:
himml -c foo.ml
which does exactly the same as launching HimML, and typing
#compile "foo"; quit 0;
under the HimML toplevel.
You can then use himml
as a HimML standalone compiler, and
compile each of your modules with himml -c
. This is especially
useful when using the make
utility. A typical
makefile would then look like:
%.mlx : %.ml
	himml -c $<
a_h.mlx: a_h.ml
a.mlx: a.ml a_h.mlx
b.mlx: b.ml a_h.mlx
pack.mlx: pack.ml a.mlx b.mlx
The first lines define a rule for making compiled HimML modules from
source files ending in .ml. It has a syntax specific to GNU make.
If your make utility does not support it, replace it by:
.SUFFIXES: .ml .mlx
.ml.mlx:
	himml -c $<
The last lines of the above makefile represent dependencies: that a.mlx
depends
on a.ml
and a_h.mlx
means that make should rebuild a.mlx
(from a.ml
, then) whenever it is older than a.ml
or a_h.mlx
.
Such dependencies can be found automatically by the himmldep
utility. For example, the dependency line for a.mlx
was obtained by typing:
himmldep a.ml
at the shell prompt.
There is no specific way to link compiled modules together, since open
already does a link phase. To link a.mlx
and b.mlx
, write
a new module, say pack.ml
, containing:
open "a";
open "b";
then compile pack.ml. The resulting pack.mlx file can also
be executed, provided it has no pending imported identifiers. One way
is to launch HimML, open pack, and run main (); (provided pack.ml
exports such a function), but it is even easier to type the following from
the shell:
himmlrun pack
Under Unix, every module starts with the line:
#!/usr/local/bin/himmlrun
assuming that /usr/local/bin
is the directory where himmlrun
was installed, so that you can even make pack.mlx
have an executable
status:
chmod a+x pack.mlx
and then run it as though it were a proper executable file:
pack.mlx
This will launch himmlrun
on module pack.mlx
, find a
function main
and run it.
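For instance, a linking module with an entry point might look like the sketch below. The text above only says that main is called as main (); what result himmlrun expects from it is not stated, so the body here is an assumption made for illustration:

```sml
(* pack.ml (sketch): links a and b, and exports an entry point. *)
open "a";
open "b";
(* main takes (); its body is a placeholder. *)
fun main () = #put stdout "done\n";
```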
5.7 Editor Support
Any ASCII text editor can be used to write HimML
sources. But an editor can also be used as an environment for HimML.
In GNU Emacs, there is a special mode for
Standard ML, called `sml-mode.el' and that comes with the Standard ML
of New Jersey distribution, that
can be adapted to deal with HimML: this is the `ml-mode.el'
file. However, it was felt
that it did not indent properly in all cases, because of the
complicated nature of the ML syntax. A replacement version is in the
works, called `himml-mode.el'; it is not yet
operational.
Remember: a feature is nothing but a documented bug! You may
therefore consider the following as features :-)
.
-
Continuations always capture the toplevel, but due to all sorts
of trickeries that can be played internally, it is unsafe to capture
and store continuations in the toplevel environment, and then to
throw them. In particular, it is not advised to throw a continuation
that was captured during a
use
: after throwing it, the system
would find itself in a situation where it believes it is loading a
file, but where no such file is open. A core dump is almost certain
to ensue. I don't plan to fix this soon.
- Strange behaviour from
#if
conditional compilation
directives can happen; see Section 5.5. This seems
hard to fix, too.
- Scale syntax is kludgy, but I don't see any way of fixing it
nicely.
- Toplevel should provide a secondary prompt on incomplete input.
Currently, it does not show any, which can be confusing. Also, the
toplevel parser shows two prompts after successfully
use
ing a
file.
- inoutprocess
exhibits quirky behaviour. This seems to be
due to some cruftiness inside Unix, where opening a bidirectional
channel with a child process by using two pipes has strange
consequences. In particular, try inoutprocess
on the Unix
command cat
. You would think that sending the child
cat
process a newline-terminated line, and then reading the
output from cat
would give you back your message, but it
won't on most Unix machines. This is not related to flushing
buffers, either in HimML or in the child process. This is
unfortunate, since HimML will block on reading, deadlocking both
processes. To avoid this, you should first test your own
communication protocols by hand on small examples using
inoutprocess
.
5.9 Common Problems
5.9.1 Problems When Installing HimML
-
P:
- When I type
make
, nothing happens except that
I get a message telling me to type a sequence of commands.
This is normal. The installation procedure needs to make configuration
files, for interpreting your favorite options (in file OPTIONS
)
or for determining system or compiler behaviours. So, just do as
indicated.
- P:
- I don't understand the meaning of an option in file
OPTIONS
.
Then leave it alone. Most options have reasonable default values.
- P:
- When I run HimML, it just stops on
abort: attempt to longjmp() to lower stack
or a similar message.
See next question.
- P:
- After typing make, I get messages such as:
mksyscc: 20847 Abort - core dumped
longjmp() is brain-damaged (won't allow you to jump to a lower stack)
trying to find a standard patch...
Some operating systems (mostly BSD systems, although the only example
I know is AIX) implement a “smart” longjmp()
routine that
first checks whether the current stack pointer is lower than the one
it is trying to restore, and aborts if this is not the case. HimML
needs to be able to do just that, in order to implement continuations
(and continuations are heavily used internally, even if you don't plan
to use them). The best solution I've come up with on AIX is to write
a small patching utility (dpxljhak
) that hunts for a specific
piece of code in the prologue of the longjmp()
function and
puts no-ops instead. A better solution would be to rewrite the function
in assembler, but I've been unable to do this.
If this happens to you, try to rewrite longjmp()
so that it
does not check for stack levels and link your new definition.
Or write a patch, just like me; you'll need to experiment a bit.
Please also contribute your modification so that I can include it
in the next HimML release. (See MAINTENANCE at the end of the OPTIONS file
to know whom to write to.)
- P:
- My machine is a Cray/VMS machine/PC-Dos machine, and I cannot
manage to make the darn thing compile or execute.
Cray machines have a weird stack format, and my scheme for capturing
continuations has no hope of working on these machines. If it's
absolutely necessary for you, I'll see what I can do, provided you
promise to tell me whether it works or not. (See MAINTENANCE at the
end of the OPTIONS file to know my address.)
I don't have any VMS machine handy, so I cannot test HimML on it.
The HimML implementation is pretty much centered around Unix, so
I would be surprised if it worked without changes. Please tell
me what you have been forced to do to make it work.
PC-Dos machines won't do. 640K is not enough for HimML, and HimML
has no knowledge of extended or expanded memory. HimML must run in
one segment only, lest its sharing mechanism be defeated by one
physical address having two distinct representations (from two
different segments). This may work on 486's or higher, which can use
large segments, but the operating system (Dos or Windows, any
version until now) is the stumbling block. Your best bet is to switch to
Linux or any other Unix for PCs. Windows/NT or OS/2 is expected
not to pose any problem.
- P:
- When I run HimML, it just core dumps.
Check the OPTIONS
file: there is no safeguard against illegal
values there (in particular stack values). Put back the default
values; if this does not work, try to increase the stack parameters
(notably SAFETY_SIZE
and SECURITY
). See also previous
questions; it is quite likely that this is due to stack problems. If
nothing works, mail me (goubault@lsv.ens-cachan.fr
, see
MAINTENANCE at the end of the OPTIONS file).
5.9.2 Problems When Running HimML
-
P:
- I have typed a command line at the toplevel prompt,
then typed return, but nothing happens.
Most probably, you have not terminated your command line with a
semicolon (;
). Although the syntax of Standard ML makes
semicolons optional between declarations, the toplevel parser has no
way of knowing that input is complete unless it finds a terminating
semicolon (or an end of file). Consider also all the ways to
complete input such as, say, 1
: if you write a semicolon
afterwards, then this is an abbreviation of val it=1;
, but if
you write +2;
, even on the following line, then you really
meant val it=1+2;
, and if you type return just after
1
, the parser has no way to know which possibility you
intended.
It may happen that typing a semicolon does not cure the problem.
This may happen if you have not closed all parentheses and brackets.
Consider (frozzle ()
: if you type a semicolon afterwards,
then your input is still incomplete, as you may want to write, say,
(frozzle (); foo)
. The semicolon is not only a declaration
separator, but also the sequence instruction.
Finally, it may be the case that you are in the middle of a
conditionally compiled phrase. See Section 5.5 for
details.
- P:
- When opening modules that open header modules, I keep
getting type errors, and the explanation is that some datatypes
are not the same in each type?
First, check that you are not defining or declaring datatypes (or
dimensions) in header files that you use
instead of
open
ing. Each time you use a given file, it creates new
versions of the datatypes or dimensions inside it. To avoid it,
open
the file instead; this creates unique stamps for the
datatype (or dimension), which it records in a file of the same
name, with .mlx
at the end. This will work only if
your header file can be compiled separately, so be prepared to
modularize your code.
If the above does not apply, it may happen that your .ml
files have inconsistent modification dates. The module system
always tries to recompile a .ml
file when the .ml
file appears to be newer than the corresponding .mlx
file.
Therefore, if the last modification date of the .ml
file
is some future date, it will always recompile it, as many times
as it is open
ed; and this leads to the same problem as
above. A quick fix is to set the modification date manually
(with touch
on Unix, or setdate
on Amigas; there's
probably a public-domain utility to fix this on Macintoshes, but
I don't know). In any case, there's probably something wrong
with the way the date is set up on your system, and it's worth
having a look at it.
5.10 Reporting Bugs, Making Suggestions
This is an alpha revision of HimML. This means that I do not consider
it as a distributable version. This means that I deem the product
robust enough to be given only to my friends, counting on their
comprehensive support, mostly as far as bugs are concerned. This also
means that I want some feedback on the usability of the language, and
on reasonable ways to improve the implementation.
To help me improve the implementation
(and possibly the language, though I am not eager
to), you can submit a note to the person in charge of maintaining the
system (type #maintenance features
at the
toplevel to know who, where and when). The preferred
communication means is electronic mail, but others (snail-mail
notably) are welcome. If you think you have found a bug
in HimML, or if you want something changed in HimML, you should send
the person in charge a message that should contain:
-
whether it is a bug or a suggestion of improvement;
- what the problem or suggestion is. You should give it a
meaningful title, and a precise description.
In case of a bug, the preferred description is in form of a short
piece of code, together with the symptoms, and the kind of machine
and operating system you are working on. It should be possible for
someone other than you to replay the bug. If you don't find any
small code that would exhibit the same buggy behaviour as the one
you've just experienced, send the contents of the HimML.trace
file: every time you use HimML, it logs every single toplevel or
file input in this file, so as to ease replaying your actions. This
may not always work, but it can help. (This file may have another
name, if you have chosen to use the -replay
command-line
option.)
In case of a suggestion, please refrain from submitting your idea of
what would be a cute extension of the language. Suggestions should
improve the level of comfort you can have from using the
implementation, and should be implementable without destroying the
spirit of HimML. If you want to propose a suggestion, definitely
argue that it will be needed, and the maintainer will try and see if
it is doable.