In Minuska, one can describe the semantics of a programming language using term rewriting. As term rewriting is quite low-level and expressing common patterns in it is repetitive, Minuska, inspired by the K-framework, implements some shortcuts (“syntactic sugar”) whose meaning is defined only by their translation to term rewriting rules. This post motivates the need for, and presents a design of, a higher-level language-description language that I plan to implement in Minuska.
Reduction contexts
One frequently used approach to specifying the operational semantics of programming languages is based on the concept of “reduction contexts”, described in the 1992 paper of Felleisen and Hieb. The approach consists of defining a class of terms deemed to be “values”, which are meant not to be reduced/rewritten further, and a class of contexts (terms with a designated free variable, known as the hole), which describe the positions inside a term at which reductions can happen. Typically, reduction contexts and values are defined using grammars, and those grammars can refer to each other, leading to tricky questions about the precise meaning of the metatheory.
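To recall the style, here is a minimal sketch of values and evaluation contexts for a call-by-value lambda calculus in OCaml; this is my own illustration of the setup of Felleisen and Hieb, not Minuska code:

```ocaml
(* Grammar being modeled:
     v ::= \x. t                 (values)
     E ::= [] | E t | v E        (evaluation contexts)                    *)
type term    = Var of string | Lam of string * term | App of term * term
type value   = VLam of string * term
type context =
  | Hole                      (* the designated free variable: the hole   *)
  | AppL of context * term    (* reduce the function position first       *)
  | AppR of value * context   (* then the argument position               *)

(* Plugging a term into the hole of a context. *)
let rec plug (e : context) (t : term) : term =
  match e with
  | Hole                   -> t
  | AppL (e1, t2)          -> App (plug e1 t, t2)
  | AppR (VLam (x, b), e2) -> App (Lam (x, b), plug e2 t)
```

Note how the grammars of values and contexts already refer to each other here: contexts mention values in the AppR case.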
Naturally, frameworks like PLT Redex, K-Framework, and Minuska want to support reduction contexts. Currently, Minuska supports them only in a limited form, inspired by K-Framework. In this form, reduction contexts do not have a direct semantics; instead, they are used to generate rewriting rules that mimic the “intended behavior” of reduction contexts using term rewriting only. This is problematic in theory, because one cannot prove the generated rules “correct”, and in practice, because people describing a programming language in such a framework have no means of reasoning about the language in a formal yet high-level way. Another practical issue with the “rule generation” approach is that program executions are cluttered with intermediate states that are uninteresting and hard to understand.
For Minuska, the solution to these problems would be to give reduction contexts a direct semantics, and to implement a transformation that gets rid of the reduction contexts in a provably correct way.
Language composition
I am currently working on ways to reason about programs written in multiple languages and interacting through a foreign-function interface (FFI); specifically, my work targets Rust and C. Naturally, I want to have a language definition for the combination of Rust and C, and I want to somehow claim that the definition is “reasonable”. One criterion for reasonableness would be conservativeness: in the definition for the combination of Rust and C, programs that consist only of Rust code behave as in Rust (and programs that consist only of C code behave as in C). And of course, I do not want to spend time proving such theorems for rather ad-hoc formulations of the languages.
A solution would be to have language combinators that are easy to prove conservative, and to create the mixed Rust-C definition by applying these combinators to the pure Rust and pure C definitions. Having a library of combinators might also benefit the development of larger language definitions. However, while one can perform various transformations on term-rewriting systems, I claim that higher-level descriptions are more suitable for these tasks.
For example, I want to take the C language and add a new kind of values to it: these would represent foreign pointers (that is, pointers that were created in Rust code and then passed to C through the FFI). However, the notion of values does not exist in the term-rewriting-based language-description language, and thus the transformation that adds these foreign values would not be straightforward to specify and implement.
Language composition motivates another key concept of a higher-level language-description language: the separation of control and data. When composing Rust and C, I want the control part of the language to be composed differently than the data part. For the data part, I want to have pairs of C data and Rust data in the configurations, while for the control part, I want some interleaving of Rust and C code. But on the term-rewriting level, everything is just a term, and it is not straightforward to distinguish which part of a term is control and which is data.
Hidden algebras
There are certain operations that we would typically want to perform on the data part of a configuration. For example, we may want to write a value to a certain memory location, or to create a thread or a new mutex lock. We would also want to read from memory locations. It would make sense to represent the data part as an element of an algebra with those operations. But should it be the same algebra as the algebra of values?
Typically, the data part of a configuration would contain larger items than just values - it would contain collections of values (such as the store). Also, the data part would be manipulated differently than values. For example, all operations would take only one parameter that is the data, and we would have only one instance of the data in the configuration at all times. This is in contrast with values, which can be freely duplicated and, in the case of integers, added together. Also, as I have written earlier, tests for equality are problematic for, say, heaps, because they tend to break the separation-logic frame rule. Again, this is not the case for a typical value.
This motivates me to consider language definitions defined over two algebras: an algebra of values and an algebra of data. Operations on data would take at most one data argument and an arbitrary number of value arguments, and return either data or a value. These operations can be broken down into the following categories.
- Data constructors: no data argument, arbitrary number of value arguments, data result. Useful for initialization of the configuration.
- Queries: single data argument, arbitrary number of value arguments, value result. For example, “Is this a valid address?”
- Methods: single data argument, arbitrary number of value arguments, data result. For example, “write this value to this address”.
- Useless: no data argument, arbitrary number of value arguments, value result. These should probably belong to the Value algebra.
This is known in the literature as “hidden algebra”.
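To make the split concrete, here is a hypothetical OCaml signature for the two algebras; the operation names (init, is_valid_address, write) are placeholders of my own, not part of Minuska:

```ocaml
(* The Data carrier is abstract ("hidden"): rules can observe it only
   through queries, and can neither duplicate nor compare data elements. *)
module type HIDDEN_DATA = sig
  type value                             (* visible Value algebra carrier *)
  type data                              (* hidden Data algebra carrier   *)

  val init : value list -> data                 (* data constructor       *)
  val is_valid_address : data -> value -> value (* query: value result    *)
  val write : data -> value -> value -> data    (* method: data result    *)
end
```

Keeping data abstract enforces, at the type level, that rules cannot duplicate the data or test two data elements for equality.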
So what would I do with such a separation? First, one could use queries on the right-hand sides of rewriting rules and in side conditions without mentioning the data argument, since there would always be exactly one. This would improve the user experience of writing a language definition. Second, all my separation-logic stuff would be independent of the Value algebra and concern only the hidden Data algebra. Intuitively, this makes sense, as modular reasoning (as in separation logic) requires some form of information hiding.
New language definition
So, a language definition would consist of (see the OCaml sketch after the next list):
- A Value algebra V,
- A hidden Data algebra D,
- A collection S of symbols (for the control part),
- A collection R of rewriting rules,
- A collection C of context declarations,
- A data constructor “init” for initializing the data part.
A rewriting rule would consist of:
- A LHS control pattern,
- A guard,
- A RHS control pattern,
- A data effect.
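To fix the shape of these two lists concretely, here is a rough OCaml sketch. It is my own illustration, not Minuska code, and the pattern, guard and effect types are deliberately left as type parameters:

```ocaml
(* A rough rendering of the proposed shape of a language definition. *)
type symbol = string

type ('pat, 'guard, 'eff) rule = {
  lhs    : 'pat;                (* LHS control pattern *)
  guard  : 'guard;
  rhs    : 'pat;                (* RHS control pattern *)
  effect : 'eff;                (* data effect         *)
}

type ('pat, 'guard) context_decl = {
  ctx  : 'pat;                  (* pattern with one occurrence of the hole *)
  hole : string;                (* name of the hole variable               *)
  cond : 'guard;
}

type ('v, 'd, 'pat, 'guard, 'eff) language_def = {
  symbols  : symbol list;                         (* S *)
  rules    : ('pat, 'guard, 'eff) rule list;      (* R *)
  contexts : ('pat, 'guard) context_decl list;    (* C *)
  init     : 'v list -> 'd;                       (* the "init" constructor *)
}
```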
Some examples:
(X := V) => unit() with store.write(X, V)
where isValue(V)
Here, “store.write” is the name of a data method (taking the old store as an implicit argument), “X” and “V” are pattern variables, and “unit” is a nullary value expression.
Since data can’t be duplicated or combined, all data expressions with explicit data arguments degenerate from trees to lists. We can write such lists using the semicolon notation, as in the following example of writing into a pair of variables.
(X,Y := V1,V2) => unit() with store.write(X, V1); store.write(Y, V2)
where isValue(V1) and isValue(V2)
If, for some reason, we want to write the data expression explicitly, we could use the following syntax:
(X,Y := V1,V2) => unit() with @store.write(store.write(X, V1), Y, V2)
where isValue(V1) and isValue(V2)
where the “@”-prefixed data methods take the data argument explicitly in the first position. Naturally, elements of the “;”-delimited lists can’t have “@”-prefixed method names. But it should be possible to write something like
(X,Y,Z := V1,V2,V3) => unit() with @store.write(store.write(X, V1); store.write(Y, V2), Z, V3)
where isValue(V1) and isValue(V2) and isValue(V3)
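One way to read the “;” notation operationally: a data effect is a list of method calls, each with its data argument left implicit, folded over the single data element from left to right. Here is a hypothetical, self-contained OCaml sketch; the toy value and store types and the name “write” are placeholders of my own:

```ocaml
type value = Int of int | Str of string        (* toy Value algebra        *)
type data  = (string * value) list             (* toy store                *)

type call   = { meth : data -> value list -> data; args : value list }
type effect = call list                        (* the ";"-delimited list   *)

(* Running an effect threads the data through the calls, left to right. *)
let run_effect (d : data) (eff : effect) : data =
  List.fold_left (fun d c -> c.meth d c.args) d eff

(* "store.write" on the toy store. *)
let write d = function
  | [Str x; v] -> (x, v) :: List.remove_assoc x d
  | _ -> d

(* store.write(X, V1); store.write(Y, V2) from the pair-assignment rule: *)
let _ = run_effect [] [ { meth = write; args = [Str "x"; Int 1] };
                        { meth = write; args = [Str "y"; Int 2] } ]
```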
Semantics
I have sketched the semantics of a similar language in my PhD thesis. A language definition would induce an inductively defined transition relation ⇝ over T × D (where T is the set of terms and D the carrier of the data algebra). The most interesting rule would be the one for contexts. It would say that to make a transition (c₁, d₁) ⇝ (c₂, d₂), it is enough that:
- (r, d₁) ⇝ (v, d₂) (where “r” is an arbitrary term but v ∈ V),
- There is a context declaration “context c with h if g” in the semantics (c is a pattern with a single occurrence of the variable h, and g is a guard),
- There are valuations ρ₁, ρ₂ that agree on all variables excluding h,
- c[r/h] matches c₁ in ρ₁, c[v/h] matches c₂ in ρ₂, and g is satisfied in ρ₁.
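Put together, the context rule could be rendered as the following inference rule (my own sketch of the conditions above; I write ρ(p) for the instantiation of a pattern p under a valuation ρ, and ρ₁ ≈ₕ ρ₂ for agreement on all variables except h):

```
(r, d₁) ⇝ (v, d₂)   v ∈ V   ρ₁ ≈ₕ ρ₂   ρ₁(c[r/h]) = c₁   ρ₂(c[v/h]) = c₂   ρ₁ ⊨ g
──────────────────────────────────────────────────────────────────────────────────
                              (c₁, d₁) ⇝ (c₂, d₂)
```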
The reason why I am not formulating it as “(c[r/h]) ⇝ (c[v/h]) if …” is that the context c may have free variables other than h, and typically will.
The base case - for single rules - is easy to imagine.
Conclusion
So what is next? In no particular order:
- Describe the semantics in Coq, with all the extensions that I have there (most notably, the “Program info” reader monad and labels over arrows),
- Extend Minuska to have the hidden Data algebra,
- Translate such languages into the term-rewriting formalism,
- Implement some language transformations on the new formalism.