If such a model exists, then it may be useful in teaching, learning, and using C-like languages. If such a model does not exist then the process of finding this out will improve the methods used to construct mathematical models used in software development.
In what follows the aim will be to define various sets, functions, and relations that form the semantic domains for C. The meaning of statements will be defined in terms of relations describing possible changes of state plus formulae describing - loosely - the timing of the statements.
In normal denotational semantics of a statically scoped language like C the state of the system is described by two functions: one describes a binding of variables to a set of locations and the other a binding from locations to values. Traditionally these locations have been identified with the natural numbers.
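As a rough illustration only (not part of the formal model, and with the names Env, Store, and lookup invented for this sketch), the traditional two-function picture can be written down directly in C:

#include <stdio.h>
#include <string.h>

#define MAX_VARS 16

typedef int Location;   /* traditionally a natural number */
typedef int Value;

/* Environment: binds variable names to locations */
typedef struct { const char *name[MAX_VARS]; Location loc[MAX_VARS]; int n; } Env;
/* Store: binds locations to values */
typedef struct { Value cell[MAX_VARS]; } Store;

static Location lookup(const Env *e, const char *v) {
    int i;
    for (i = 0; i < e->n; i++)
        if (strcmp(e->name[i], v) == 0) return e->loc[i];
    return -1;   /* unbound */
}

int main(void) {
    Env env = { {"x", "y"}, {0, 1}, 2 };
    Store sto = { {42, 7} };
    /* the value of x is Store(Env(x)) */
    printf("x = %d\n", sto.cell[lookup(&env, "x")]);
    return 0;
}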
C is described in terms of bytes. It seems necessary to take a more machine-oriented view of this part of the semantics of C.
Note.
Term::=Meaning.
I extend it by adding:
Term::Type=Meaning.
Term::Type.
The latter indicates a term that is going to be used in several places in a document - or even outside this document. Bindings local to a formula use a single colon. For more on the notation see MATHS_lexicon & MATHS_syntax.
However, knowing the address of an object does not tell us its value. A char and an int can have the same address but have different values. So Objects are of different types:
For a given type and address there is a value:
One value in #Byte is the sequence with no bytes in it. We can identify this with the dreaded "void" found in C programs.
Each type in C has a size: the number of bytes used to represent its values in the particular implementation being used. As a rule the C standards do not specify the precise sizes of most data types; the exception is "char", whose size is defined to be one byte.
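For instance, a short test program (an illustration only; its output depends on the implementation it is compiled with) shows which sizes are fixed and which are not:

#include <stdio.h>

int main(void) {
    printf("char   : %lu byte(s)\n", (unsigned long)sizeof(char));   /* always 1 */
    printf("short  : %lu byte(s)\n", (unsigned long)sizeof(short));  /* implementation-defined */
    printf("int    : %lu byte(s)\n", (unsigned long)sizeof(int));    /* implementation-defined */
    printf("long   : %lu byte(s)\n", (unsigned long)sizeof(long));   /* implementation-defined */
    printf("double : %lu byte(s)\n", (unsigned long)sizeof(double)); /* implementation-defined */
    return 0;
}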
int i=0, *pi=&i; char *pc=(char*)&i; *pc='a';
Here *pc and *pi are partially aliased, and so the assignment to *pc changes i. The expression (*pc.char='a'.NORMAL_RAM) does not allow i to change (since i is not *pc).
So if we define assignment by using (a.t:=b) with NORMAL_RAM assumptions then we implicitly invalidate a large number of C programs.
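A complete version of the fragment above may make the problem clearer. This is only a sketch of one possible run; the value printed for i depends on the implementation's byte order and sizes:

#include <stdio.h>

int main(void) {
    int i = 0;
    int *pi = &i;
    char *pc = (char *)&i;   /* pc aliases the first byte of i */
    *pc = 'a';               /* changes one byte of i's representation */
    printf("i is now %d\n", *pi);   /* value depends on byte order and sizes */
    return 0;
}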
Instead we start by defining the set of addresses associated with a given address and type: For Address a, Type t, area(a,t)={ a':Address. a <= a' < a+sizeof(t) }.
Now we can define (a.t:=b)::RAM->RAM.
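The following sketch shows one way these definitions could be realised, modelling RAM as an array of bytes; the names RAM, areas_overlap, and assign are mine and not part of the formal model:

#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define RAM_SIZE 1024
typedef unsigned char Byte;
typedef size_t Address;

static Byte RAM[RAM_SIZE];

/* area(a,t) = { a' | a <= a' < a + sizeof(t) }: here, a test for overlap */
static int areas_overlap(Address a1, size_t size1, Address a2, size_t size2) {
    return a1 < a2 + size2 && a2 < a1 + size1;
}

/* (a.t := b): the bytes in area(a,t) become the representation of b;
   every byte outside area(a,t) is left unchanged */
static void assign(Address a, const void *b, size_t sizeof_t) {
    memcpy(&RAM[a], b, sizeof_t);
}

int main(void) {
    int b = 258;
    assign(100, &b, sizeof(int));   /* performs (100.int := 258) on the model RAM */
    printf("overlap of area(100,4) and area(102,2): %d\n",
           areas_overlap(100, 4, 102, 2));   /* prints 1: they share addresses 102 and 103 */
    return 0;
}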
Notice this does not define the map (a.t:=b) precisely enough to predict the outcome of the code with the partial alias or overlap shown above. The C specifications refuse to state whether the first byte in an integer is a high order or a low order byte. Changing the first byte of i may therefore change its most significant or its least significant bits. In fact there is nothing to stop a truly weird implementation of the 'int' and 'char' data types making the middle of the integer change.
However we can hope that an implementor could give a precise definition of the function if necessary. Further, we can still prove some useful results about the behavior of most C programs without having the precise effect of (a.t:=b) defined.
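For example, an implementor's definition would have to commit to a byte order. The following small program (illustrative only) reveals which bits the first byte of an int holds on a particular implementation:

#include <stdio.h>

int main(void) {
    int i = 1;
    unsigned char *p = (unsigned char *)&i;
    if (p[0] == 1)
        printf("the first byte holds the low-order bits (little-endian)\n");
    else
        printf("the first byte holds high-order (or other) bits\n");
    return 0;
}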
Hence we can define a relationship between parts of the address space:
Now suppose that variable v1 is assigned value b and is bound to address a1 with type t1, and v2 is a variable of type t2 with address a2; then we can show that
We will assume that some expressions are elementary - variables, constants, and so on:
Expressions also have a type - the type of value returned when the expression is evaluated:
We assume that the value returned has the right size to fit in data of the type:
This (expr0) together with (t0) and some facts about binary representations of numbers has an interesting effect. Sums and products are forced to produce values in a finite subset of the numbers.
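A small example of this finiteness (illustrative; the exact bounds are implementation-defined): unsigned sums are defined to wrap around, while signed sums that leave the finite subset are simply not defined:

#include <stdio.h>
#include <limits.h>

int main(void) {
    unsigned int u = UINT_MAX;
    printf("UINT_MAX + 1 wraps to %u\n", u + 1u);   /* defined: modular arithmetic */
    printf("INT_MAX is %d; sums beyond it are not defined\n", INT_MAX);
    return 0;
}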
In C, expressions have side effects - unlike in theoretical languages. Expressions also define a function that describes the effect of an expression on RAM:
A change_of_state has to be left fairly loose, for now, and so will be modelled as a relation between two States:
Some useful state changes are
In some cases we can only assign a value to an expression if its subexpressions have no side-effects. I'll write the property of not having side-effects using the word "pure".
If an expression is not Pure then it will be difficult to predict the behavior of expressions that contain it unless we can show that the effected part of RAM is not involved in any other subexpression. So I will assume that for any expression e we can define a collection of bytes in RAM that can be changed when the expression is evaluated:
Suppose that an Expression e has the effect of changing state s to s'; then we can (in theory) compare the two states and list the addresses that have changed: { a:Address. s(a)<>s'(a) }. An address is effected by e if there exists a state s in which evaluating e can change it, so we can define:
We can show the following properties: For Pure e, Effected(e)={}.
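As an informal illustration in C itself: evaluating i + 1 changes no byte of RAM, so it is Pure, while evaluating i++ changes the bytes of i:

#include <stdio.h>

int main(void) {
    int i = 5;
    int a = i + 1;   /* evaluating i + 1 changes no byte: it is Pure */
    int b = i++;     /* evaluating i++ changes the bytes of i        */
    printf("a=%d b=%d i=%d\n", a, b, i);   /* prints a=6 b=5 i=6 */
    return 0;
}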
Note. Because change of state is a relation (not necessarily a function) it would allow us to describe changes of state that give rise to no future states... in other words crashes. Similarly, we could (if needed) allow a change_of_state to be non-deterministic. However there is a subtle problem with this approach [ Limitations in rjb95a.Relations.vs.Programs ]
A simple resolution to this problem is to include a model of timing as part of the meaning of the command or expression [Hehner Circa 1990]. So define
Clearly if one expression contains another then the delay will be at least the delay caused by the subexpression plus the delay caused by the rest of the expression. However in the case of conditional and boolean expressions short-circuit evaluation means that we cannot just add up the delays of all the subexpressions to get a lower bound on the delay of the whole.
For the elementary parts of expressions we want a way to express the fact that they terminate at some future and finite time. However we do not want to get into over-specifying the precise timings of each element. A suitable assumption is that there is a worst elementary Expression - one that takes longer than the rest. So we suppose that there is a constant
Clearly:
In other words the arithmetic operations in C and C++ are defined by the compiler - even by compiler options. For more on this see Hoare's groundbreaking (and readable) classic on the axiomatization of computer programs. [Hoare69]
From here on in I will assume that I can use the normal symbols for addition, subtraction and so on, but without postulating any properties for them other than closure. In other words when arithmetic is used to model an operation on a value of a given C type all we can be sure of is that the result is also of the same type.
Pointer arithmetic will be tackled later. [ Pointer Arithmetic ]
Even ignoring the ambiguous nature of the arithmetic in C, one cannot give precise semantics for something as simple as an addition because the two added subexpressions may have side-effects and the C standards follow the Algol tradition of not specifying the order in which subexpressions are evaluated. For example the following expression does not have a clear value:
3*(++i)+2*(++i)
The obvious rules are that when subexpressions are pure then the arithmetic operators add the results:
It may be possible to suggest some stronger rules. For example if the area of storage used by each of two subexpressions is free of the side-effects of the other subexpression then the result is predictable:
If Effected(e1)&Uses(e2) = Effected(e2)&Uses(e1) = {} then value(e1 + e2,s)=value(e1,s)+value(e2,s) and effect(e1+e2)=effect(e1);effect(e2)=effect(e2);effect(e1).
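For example (an informal illustration, not a proof): in the first sum below each subexpression effects only storage that the other never uses, so the value is predictable; uncommenting the second sum re-introduces the interference discussed above:

#include <stdio.h>

int main(void) {
    int i = 1, j = 10;
    int ok = (i++) + (j++);        /* each side effects only what the other never uses: value 11 */
    /* int bad = (i++) + (i++); */ /* both sides effect and use i: no predictable value */
    printf("ok=%d i=%d j=%d\n", ok, i, j);   /* prints ok=11 i=2 j=11 */
    return 0;
}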
Hence we need a model of the areas of storage that are used in an expression. These are the addresses that when changed may change the value or effect of an expression:
I will delay discussing the ways that the areas of RAM used by an expression actually depend on the state of the RAM until Pointers are discussed.
The above does not discuss long, char, unsigned, and short versions of the '%' operation. It also leaves open the question of precisely which version of integer division is expected. Unlike in Ada 83, the standards permit integer division with negative numbers to behave in several different ways. For example 5/-3 may be either -2 (round down) or -1 (round towards zero).
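What the standards do pin down is the identity (a/b)*b + a%b == a, so the sign of a%b follows from whichever rounding the implementation chooses for a/b. A small check (the first line of output may differ between implementations):

#include <stdio.h>

int main(void) {
    int a = 5, b = -3;
    printf("5 / -3 == %d, 5 %% -3 == %d\n", a / b, a % b);
    printf("(a/b)*b + a%%b == a holds: %d\n", (a / b) * b + a % b == a);   /* prints 1 */
    return 0;
}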
Implicit coercions and casts further confuse the issue as well.
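One well-known instance of the confusion (a fact about the usual arithmetic conversions, shown here only as an illustration): in -1 < 1u the int -1 is converted to unsigned, becoming a very large value, so the comparison is false:

#include <stdio.h>

int main(void) {
    printf("-1 < 1u evaluates to %d\n", -1 < 1u);   /* prints 0: -1 is converted to unsigned */
    printf("(unsigned)-1 == %u\n", (unsigned)-1);   /* prints UINT_MAX */
    return 0;
}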
It is somewhat surprising that C programs look predictable to the moderately experienced programmer.
x != 0 ? y/x : 0
Notice that these are defined to give short-circuit evaluation.
The formulae for e1 && e2 are similar and left as an exercise.
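The practical point of the short-circuit definition can be seen in a common idiom (illustrative only): the right operand of && is not evaluated at all when the left operand is 0:

#include <stdio.h>
#include <stddef.h>

int main(void) {
    int *p = NULL;
    int hits = 0;
    if (p != NULL && *p == 0)   /* *p is never evaluated when p is NULL */
        hits = 1;
    printf("hits=%d (the dereference was skipped)\n", hits);
    return 0;
}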
Note: this allows for the extraction of the address of v and copying as elementary operations.
It avoids the issue of what happens when the left hand side is not a simple variable:
a[++i] = i++;
We also have void expressions:
We will suppose that some values are also addresses. We can express this by defining a partial map from Values into Addresses:
Now all pointers are the same size:
Further, only values of type address(T) are defined...
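A quick, non-normative way to look at the equal-size assumption above is to print a few pointer sizes; on most hosted implementations these agree, though the standards do not require it for every pointer type:

#include <stdio.h>

int main(void) {
    printf("sizeof(char*)   = %lu\n", (unsigned long)sizeof(char *));
    printf("sizeof(int*)    = %lu\n", (unsigned long)sizeof(int *));
    printf("sizeof(double*) = %lu\n", (unsigned long)sizeof(double *));
    printf("sizeof(void*)   = %lu\n", (unsigned long)sizeof(void *));
    return 0;
}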
TBA.
TBA.
(UML model of the data types in C):
[ C.png?root=atlantic-zoos ]
. . . . . . . . . ( end of section Towards a Semantics for C and C++)