Jacaranda Language Specification, draft 0.45 "Inside every large problem is a small problem struggling to get out." -- C. A. Hoare Preface ======= This document specifies the Jacaranda object-capability programming language. From the specification of Cajita [CajitaDraft]: An *object-capability language* is essentially a memory-safe object language with encapsulation, with additional restrictions that _protect the outside world from the objects_. In a memory-safe object language such as JavaScript, object A can only invoke object B if A has a reference to B. If A already has references to B and C, A can invoke B passing C as an argument, giving B access to C. Memory-safe object languages with encapsulation, such as Java, _protect objects from the outside world_. The clients of an encapsulated object can make requests using its public interface, but how an object reacts to a request is up to the object. An encapsulated object can ensure that the only way to invoke its code or change its state is through its public interface. In an object-capability language, an object can only cause effects outside itself by using the references it holds to other objects. Objects have no powerful references by default, and are granted new references only by normal message passing rules. Object references thereby become the sole representation of rights to affect the world, and normal message passing (method invocation) is the only rights transfer mechanism. An object can be denied authority simply by not giving it those references which would provide that authority. Jacaranda is an object-capability language in the same sense as Cajita, or E [Miller2006]. Like Cajita, it is approximately a sublanguage of JavaScript; more precisely of ECMAScript 5th edition. Our intention is to demonstrate that verification is a feasible approach to implementing an object-capability sublanguage of ECMAScript, and that such a language can be implemented efficiently and with relatively low complexity (given a mostly correct ECMAScript implementation as a base). An important concern in the design of Jacaranda is the ability to program "defensively consistent" object abstractions ([Miller2006]). A reference to an external object of a defensively consistent object abstraction can be given to several clients that potentially do not trust each other, and they will only be able to manipulate it according to the intended behaviour of the abstraction. That is, Jacaranda implements an object-capability model at the finest achievable granularity: any desired collection of objects can represent an independent subject potentially acting according to its own interests, in cooperation with other such collections. One of the most problematic constructs in JavaScript from the point of view of defensive security analysis, is the "this" keyword. A "this" construct (sometimes called "self") is common in object-oriented languages; it appears in the earliest OO languages such as Smalltalk-72 and Simula-67, and also in recent language such as Java. In Smalltalk, Simula, Java, and many other OO languages, "this" behaves essentially like a lexically scoped variable that, within a method definition, will be bound to the receiver of the current method invocation. However, JavaScript's treatment of "this" is deeply flawed; not only does it not always refer to the expected object within a method, its semantics also breaks some of the most basic expected equivalences between programs (such as substitutability for side-effect-free expressions), that hold in almost all other programming languages. To a first approximation, Jacaranda could be described as 'a subset of JavaScript in which "this" behaves sanely'. This approach differs from that of Cajita and ADsafe, which forbid use of "this" altogether. A reference implementation of the Jacaranda verifier is available from . It is written using the ANTLR parser generator with additional Java code, and closely follows this specification. Introduction ============ Define language A to be an *attribute-verifiable sublanguage* of language B if-and-only-if: - there exists an attribute grammar [Knuth1968, Knuth1971, Swierstra2005] for A whose underlying grammar recognises a subset of texts recognised by B, and that deterministically computes an 'errors' attribute (whose value is a set of error indicators) on the start symbol; and the semantics of executing a text in A are: - if the text is not recognised by A's underlying grammar or if the 'errors' attribute on the start symbol is non-empty, report these errors. Otherwise, execute the program according to the specification of language B, in a given execution environment. By *ESF*, we mean the programming language specified by [ES5Draft], subject to changes described in the "[ES5Draft] -> ESF" section below. Jacaranda is an attribute-verifiable sublanguage of ESF. (ESF also supports syntax for declaring «const» variables along lines proposed for ES-Harmony, and an optional improvement to the semantics of the GetValue operation that is also similar to a previous ES4 proposal.) In fact, the underlying lexical and syntactic grammars used to specify Jacaranda are very close to those of ESF; apart from the addition of rules to compute attributes, the only other significant changes are those needed to handle grammatical semicolon insertion. Note that _any_ implementation of language B that satisfies B's specification when applied to programs in the A sublanguage, is sufficient to execute a verified A program. In the case of Jacaranda, the sublanguage avoids many known bugs in commonly used JavaScript implementations, and so verified code does not necessarily need to be run on a fully correct implementation of ESF. By using this definition of subsetting, Jacaranda differs from some other proposals for security-focussed sublanguages of ECMAScript, such as Caja and Cajita [Miller2008] or FBJS [Facebook]. There are many advantages of specifying and implementing a safe sublanguage by attribute-verification rather than by rewriting: - In implementations of the languages mentioned above, the code is rewritten to a form that can introduce additional run-time errors (or other changes in behaviour, called "gotchas" in the Caja/Cajita specification), that would not occur when running it directly as ECMAScript. In Jacaranda, there are necessarily no changes in the possible run-time behaviour of a verified program relative to the same program in ESF. There are cases where an existing JavaScript program might have to be modified in such a way that additional run-time errors could occur -- for example some property accesses may have to be replaced by function calls, which can throw exceptions -- but these can only happen where there are explicit changes in the source code. - A verifier is easier to understand and has fewer sources of potential error, both in the specification and implementation, than a rewriter. To understand a rewriter, it is necessary to keep in mind the semantics of both the source and target languages, and the run-time library that the translated code depends on, and be convinced that: - the source language semantics are preserved by the translation; - no source program is translated in a way that will lead to the translated code breaking the intended security properties. To specify and implement a verified sublanguage, on the other hand, it is only necessary to define how the language is restricted syntactically; then the security properties of the sublanguage can be understood independently. If the restriction is done by an attribute grammar, there is little room for ambiguity in the specification of the verifier. The semantics are already defined by the original language specification, modulo bugs and ambiguities in that specification. A specification that defines a rewriting into another language is likely to be more complicated, and give a result that is less directly useful and understandable to most users of the sublanguage. (Jacaranda also requires a run-time library written in ESF, but that library is relatively small and of low implementation complexity. In fact it is mostly a set of utility functions which, although they are written in ESF and therefore have to be trusted, are each very short and can be understood independently.) - As mentioned above, for Jacaranda the underlying lexical and syntactic grammars on which attributes are computed are exactly as specified by ESF (and almost exactly as specified by [ES5Draft]). This means that: - the subsetting relation between Jacaranda and ESF holds by construction; - it is much easier to see that all possible cases have been accounted for in the specification, since they are all enumerated by an existing grammar; - an existing ECMAScript parser can be used, provided that it follows the ESF grammar closely enough and enforces it strictly; - it is possible to parse any syntactically valid ECMAScript program (even an invalid ECMAScript or JavaScript program if the parser can recover from errors) and report deviations from the Jacaranda sublanguage in one pass, rather than requiring the programmer to fix each error before the rest of the program can be parsed. - It is easy to perform white-box testing of a verifier based on an attribute grammar, because the intermediate results are deterministic and do not depend on any implementation choices. The attribute grammar specifies precisely what tree of attribute values are to be computed for any given input. So a single white-box test suite can be run against all implementations, or implementations can be compared against each other for "fuzzed" input , and any differences in the resulting attribute tree (not just differences in the overall pass/fail outcome of verification) can be treated as bugs. - In general an attribute grammar has *synthesized* and *inherited* attributes. The synthesized attributes of a node depend only on the attributes and contents of its immediate child nodes; while the inherited attributes of a node depend only on its parent node (and on its other attributes, possibly). A grammar that has only synthesized attributes is called *S-attributed*. Such grammars admit a verifier implementation that is particularly simple and efficient in time and memory: programs can be parsed and verified in a single pass, by either a "top-down" or "bottom-up" style of parser, without necessarily constructing any syntax tree explicitly. The Jacaranda attribute grammar is "almost" S-attributed, in the sense that it has only one inherited attribute (Options), which is determined by the toplevel grammar production, and has the same value in all nodes. In practice this case is as easy to implement as an S-attributed grammar; it requires just a single post-order traversal of the parse tree. If an explicit parse tree representation is required (e.g. for a code generation framework), the representation could also ensure that only error-free subtrees are constructed, since the attributes for any subtree can be computed independently if the expected value of the Options attribute is given. Attribute computation can be implemented using the "semantic actions" supported by most existing general-purpose parser systems, and is also straightforward to implement in an ad-hoc recursive descent parser. - There are no difficulties in reverifying a program that has already been verified. A non-idempotent rewriter, on the other hand, cannot usefully process a code fragment twice -- even if the result is correct, it will be too inefficient. (This also means that it would be possible to apply N-version programming, with verifiers written using different languages and/or parser libraries by independent teams of programmers: a simple harness could run all versions concurrently and if they differ, reject the input program and report a bug. Of course this will only directly catch implementation rather than specification bugs, and it only applies to the verifier; the Jacaranda runtime and the JavaScript implementation(s) used to execute verified programs still need to be tested more conventionally.) - Avoiding rewriting means that less inefficiency is introduced into the code that is being run. While there is some overhead due to changing property accesses to function calls, this is explicitly visible in the source code, and can be avoided in many cases (especially in the 'attach' ∈ Options mode, as explained later). - Debugging is simplified because it is the original code that is being debugged. Although it is possible in principle to provide a debugger with a mapping between lines of the input code and its rewritten form, in our experience this often doesn't work very well (particularly when the ordering of code is changed, as it is by many of the Caja rewriting rules). Despite these differences, Jacaranda is strongly influenced by Caja and Cajita, and by ADsafe [Crockford], which is also a static sublanguage of ECMAScript that could in principle be specified by an attribute grammar. (Differences between Jacaranda and ADsafe are out of scope for this document, but in general Jacaranda allows a qualitatively larger sublanguage of ECMAScript than ADsafe. Also, Jacaranda is an object-capability language with per-object encapsulation, whereas ADsafe only provides encapsulation between module instances.) We encountered the idea of using attribute grammars for validation (as opposed to their more common uses in assigning semantics or for code generation) from Meredith Patterson and Len Sassaman [PS2009]. The earliest reference we can find to a sublanguage of an existing programming language being defined in terms of an attribute grammar, is a 1977 paper defining a sublanguage of Algol 68 [Simonet1977]. Edgar Irons did much of the early work that led to the development of attribute grammars [Irons1963]. [Irons1963] Edgar T. Irons, Towards more versatile mechanical translators, Proceedings of Symposia in Applied Mathematics 15, pages 41-50. American Mathematical Society, 1963. [Knuth1968] Donald Knuth, Semantics of Context-Free Languages, California Institute of Technology, Theory of Computing Systems, Volume 2 Number 2. Springer, New York, June 1968. [Knuth1971] Donald Knuth, Semantics of Context-Free Languages: Correction, Stanford University, Theory of Computing Systems, Volume 5 Number 2. Springer, New York, June 1971. [Simonet1977] M. Simonet, An attribute description of a subset of Algol 68, Proceedings of the Strathclyde ALGOL 68 conference, Glasgow, Scotland, pages 129-137, 1977. [Swierstra2005] Wouter Swierstra, Why Attribute Grammars Matter, The Monad.Reader, Issue Four, 01-07-05. [PS2009] Meredith L. Patterson and Len Sassaman, Validation by Parse Tree Comparison Extended to Context-Sensitive Grammars, In preparation, 2009. [Miller2006] Mark S. Miller, Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control, PhD thesis, Johns-Hopkins University, May 2006. [CajitaDraft] Mark S. Miller, Mike Samuel, Ben Laurie, Ihab Awad, Mike Stay, Caja: Safe active content in sanitized JavaScript, In preparation, 2009. Latest version available from [Crockford] Douglas Crockford, ADsafe specification, [Facebook] FBJS - Facebook Developers Wiki, [ES5Draft] Candidate draft specification for ECMAScript 5, with errata through 17 September 2009. . [ES4Last] Last public draft of ECMAScript 4 proposal, 30 May 2008. Conventions =========== Text in a subsection following "Rationale:" is non-normative; that is, it does not impose any conformance requirements on Jacaranda implementations or programs. The same programming language would be specified if this text were removed. Text surrounded by [# #] is a non-normative editing note. Definitions of technical terms used in this specification should be taken from the first of the following documents that provides a definition for the term: 1. This document. 2. [ES5Draft]. 3. The Unicode Standard, version 5.1.0. Defining occurrences of terms in this document are shown like this: *term*. A *string* is a sequence of zero or more UTF-16 code units (not necessarily well-formed). In this specification, strings are often presented in single quotes; this has the same meaning, including use of \u escape sequences, as the corresponding string literal in ESF. Terms applied to strings such as *starts with*, *ends with*, etc. are defined in terms of this code unit sequence, regardless of whether it is a well-formed UTF-16 encoding. For example, '\uD800__' is considered to end in '__'. Code fragments that appear "inline" in the text are surrounded by double angle quotation marks: « ». In general, these fragments should be interpreted as ESF code; they may or may not be valid in the Jacaranda sublanguage. In code fragments, \ stands for itself. Grammar productions that are referenced "inline" in the text are given in single angle brackets: < >. \u escapes are used in grammar productions to mean the corresponding unescaped code unit. An optional occurrence of a production is shown with a trailing '?'. Alternatives of a production are separated by '|'. There is no implied priority between alternatives. Productions that do not correspond to an [ES5Draft] production are preceded by '+'. There are three grammars used in this specification: - In the lexical grammar, productions are named with an initial uppercase letter, and :: is used to separate a nonterminal from its definition. The lexical grammar is an S-attribution of the ESF lexical grammar. The input is a string, and the output is a sequence of tokens, each of which is a string that may have attributes. ([ES5Draft] specifies some characteristics of the ECMAScript lexical grammar only informally. It is possible to use the attribute grammar formalism to specify the required tokenisation very precisely. We do not do this in the current draft because it would add apparent complexity to the Jacaranda specification that would only be concerned with clarifying ESF, not with defining how Jacaranda is a sublanguage of ESF.) In lexical rules, a string delimited by single quotes denotes the corresponding sequence of input code units. - In the syntactic grammar, productions are named with an initial lowercase letter, and : is used to separate a nonterminal from its definition. The syntactic grammar is technically an LR-attribution of the ESF syntactic grammar (it would be an S-attribution without the Options attribute). Its input is the token sequence obtained from the lexical grammar, and its principal output is an 'errors' attribute giving the set of errors (if any), other than parse errors, that would prevent the source text from being a valid Jacaranda program. - In the string grammar, ::: is used to separate a nonterminal from its definition. The input to the string grammar is a sequence of 16-bit code units. It is a regular grammar that computes no attributes and is only used to recognise classes of strings. Lexical and syntactic grammar productions that do not require an explicit attribution rule but are directly used by a production that does, are shown with a reference to the [ES5Draft] section number. The symbol ε in a production means the empty input sequence. The syntactic grammar can be parsed unambiguously, independently of attribute computation. That is, while attribute values affect whether errors result from a given parse, it is never necessary to compute attributes in order to choose between different parse trees. (In a small number of cases, as part of processing a production's attribution rules, the node must be reparsed against a more specific production.) It is possible to implement a parser for ESF that is LR(2), i.e. uses LR parsing with two-token lookahead. The grammar given in this specification, like that in [ES5Draft], is not LR(2) because it is not left-factored. The metalanguage used to define attribution rules has values of the following types: - true and false - arbitrary-precision decimal numbers - the undefined attribute value - finite sets of other attribute values - finite sequences of other attribute values - finite mappings from attribute names to attribute values. A parse tree node is represented as a finite mapping of its attributes. All nodes have a 'text' attribute holding the corresponding source text as a string. A convention is used to present attribution rules more concisely, by making computation of "boilerplate" values for attributes implicit: - If a production has exactly one child node, or a child node annotated with a trailing ^, then its boilerplate attribute values are given by the attributes of that child. - Otherwise, - for attributes whose name is a plural noun (which are set-valued), the boilerplate value is computed by taking the union of the corresponding attribute values on all child nodes; - for attributes with names starting 'is' or 'has' (which are boolean-valued), the boilerplate value is false; - for other attributes, the boilerplate value is the undefined value. Boilerplate values are only used for attributes for which no explicit definition is given, or when 'U=' is used as described below. This convention is to be used for _all_ syntactic and lexical grammar productions, including those that are used by ESF but not explicitly mentioned in this document. Notation and auxiliary functions used in attribution rules: // comment a comment that is non-normative and has no effect N.foo the value of attribute 'foo' of node N N is P true iff N.text matches production

(grammatically, by reparsing with attribute rules ignored) N isnot P not(N is P) N ispresent not(N is '') not(b) true if b is false; otherwise false { x <- A | p(x) } the subset of values x taken from A for which p(x) UNION S the union of all sets that are members of S, or undefined if any of the members are undefined x ∈ A true if x is in set A; otherwise false x ∉ A not(x ∈ A) A U B UNION {A, B} A \ B the set difference { x <- A | x ∉ B } S ++ T the concatenation of sequence S followed by T A U= B define A to be the union of its boilerplate value and B S startswith T true if S is not undefined and sequence S is a prefix of T; otherwise false X == Y true if X and Y are identical attribute values; otherwise false cond ? X : Y if cond is true then X, otherwise Y max(X, Y) X > Y ? X : Y min(X, Y) X < Y ? X : Y match N as P reparse N as production

, possibly binding additional variables in the attribution rule. Evaluate to true if N can be parsed as

, else evaluate to false and bind variables declared in

to undefined. match N as P ... like "match N as P", but match an initial token subsequence of N. // the following definition uses Haskell-style pattern matching on sequences: let errorIfMultipleDeclaration([]) = {}, errorIfMultipleDeclaration([x] ++ xs) = { err(id, "multiple declaration") | id <- x.looseNames ∩ UNION { y.looseNames | y <- xs } } U errorIfMultipleDeclaration(xs); let errorIfMultipleDeclarationOf(id, x) = errorIf(id ∈ x.looseNames, id, "multiple declaration"); let errorIfLineTerminatorBetween(t1, t2) = error(t2, "LineTerminator not allowed here"), if there is a LineTerminator between t1 and t2 = {}, otherwise let errorIf(cond, arg, msg) = cond ? error(arg, msg) : {}; let error(arg, msg) = { err(arg, msg) }; let err(arg, msg) = representation of an error with auxiliary information 'arg' (an identifier or node) and message 'msg' let NumberToString(x) = the result of applying ToString as defined in [ES5Draft] section 9.8.1 to "the number value for x" as defined in [ES5Draft] section 8.5. Any conformance requirements specified in the normative text are intended to be redundant with, i.e. specify precisely the same restrictions as, the Jacaranda lexical and syntactic grammars. If this is not the case then it indicates an ambiguity in the specification that needs to be clarified. The descriptions of Jacaranda restrictions relative to ESF are given in groups, each with an uppercase name. Attribution rules given in one group may also impose constraints needed to implement the rules in other groups, in which case cross-references are given in most cases. Rationale: Analogous to type declarations, some degree of redundancy in a specification is useful in order to provide a better chance of detecting specification errors. This is the reason for providing informal descriptions that are redundant with the attribute grammar. Conformance =========== The terms *SHALL*, *SHALL NOT* or *SHALL ONLY* are used to describe constraints on a conforming implementation of Jacaranda. The terms *MUST*, *MUST NOT* or *MUST ONLY* are used to describe static constraints on a module definition (see below). This indirectly specifies that a conforming implementation of Jacaranda SHALL reject module definitions that fail to satisfy such a constraint. The terms *SHOULD*, *SHOULD NOT* or *SHOULD ONLY* are used to describe a recommendation for Jacaranda code to follow, in order to mimimise the risk of potential incompatibilities with future versions of this specification or implementations of ECMAScript. It has no formal consequence for conformance. For any constraint that is described as implementation-defined, a conforming implementation SHALL make a consistent choice and SHALL document which choice is made. An *ESF interpreter* is an instance of an ESF implementation conforming to the ECMA-262 5th edition draft amended as described in the section "[ES5Draft] -> ESF". A *context* is an ESF execution environment that has copies of the standard built-in objects independent of other contexts in the same ESF interpreter. Contexts may share other objects. (In typical web browser implementations of JavaScript/ECMAScript, there is usually one context for each frame or iframe.) A *visible side effect* is an effect that changes the state of one or more objects in an ESF interpreter in a way that is observable to other code running in that interpreter by overt means (that is, without making use of timing of events where this timing is nondeterministic). Allocation of memory is not by itself considered to be a visible side effect. Jacaranda programs are organised into *modules*. Typically, these are independently developed units of source code. A *module definition* is a string, representing Jacaranda source code intended to define a module. This takes the form of a call to a function called $module. The last argument passed to $module is a function expression defining the *module function*, which can contain arbitrary Jacaranda code. An *internal module representation* is a representation of a module as a collection of values in a given ESF interpreter. An instance of a Jacaranda module is called a *caplet*. The creator of a caplet must provide an *environment* and a *powersource*. The environment specifies *imports* that will be in scope throughout the caplet's module code. The powersource is an object that will only be initially available to the caplet's *powerbox*, which is the function returned by its module function. The caplet will have no overt authority other than that granted via its environment and its powersource, and the ability to perform pure computation. (Jacaranda does not attempt to prevent a caplet from denying computational service to other caplets.) If a caplet is created by the Jacaranda implementation or by a container, then any of its imports that are callable objects SHALL be simple functions. A *module name* is a string, representing the name of a module. A module definition specifies a list of *option strings*, which can specify variations in the behaviour of the Jacaranda interpreter and/or verifier for that module. If an option string starts with the character '!', it indicates an *important option*, which must be recognised by the Jacaranda interpreter in order for the module to be accepted, and by the verifier in order for the module definition to pass verification. Other options are ignored by the interpreter and verifier if they are not recognised. A *Jacaranda interpreter* is an ESF interpreter with the following additional characteristics: - an interpreter has a per-context state consisting of a mapping from module names, to internal module representations of modules that have been accepted in that context. - it is possible to create a context with an initially empty mapping. - it is possible to submit a module definition to a given context, with the following effect: - if the module definition violates any of the constraints given in sections "Module Definitions" and "Static Constraints" below, a *static rejection error* SHALL be reported. - if the module definition specifies an important option that is not supported by the interpreter, then an *unsupported important option error* SHALL be reported. - if the module definition does not give a 'name' property that is different from any module name already in the mapping, a *module name clash error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to submit a module. Such a failure SHALL be distinguishable from other errors. - otherwise, a mapping from the module's name to its internal module representation SHALL be added to the context's state. In any of these cases, there SHALL be no other visible side effects of submitting the module definition. - it is possible to create a caplet by *instantiating* a named module, providing an environment and a powersource as discussed above. - the *actual import environment* of the caplet is obtained by reading the public properties of the provided environment object at the time of instantiation, augmented with the values specified in the "Module Run-time Environment" section, the latter taking priority over the former. - if the caplet's module function has any formal parameter whose name is not the name of a public propery of the caplet's actual import environment, then an *unavailable import error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to instantiate a caplet. Such a failure SHALL be distinguishable from other errors. - the caplet's module function is called, with the properties of the actual import environment providing the correspondingly named parameters and free variables of the module function. - if the module function throws an exception, this exception SHALL be reported to the instantiator of the caplet. - if the return value of the module function is callable, then that object SHALL be called with the given powersource object as its only parameter. (Note that the return value may be a function that has captured variables from its surrounding scope; consistent with lexical scoping, these refer to the actual import environment of the caplet.) - the function given by the 'powerbox' property may interact with the environment of the Jacaranda implementation in order to set up *event handlers* which cause functions created by the module to regain control at some later point in time. The specifications of such event-handling features are implementation-defined. In cases where the above description says that an error SHALL occur as a result of submitting or instantiating a module, other visible side effects SHALL NOT occur. At any time, and for any or no reason, a Jacaranda implementation may *lock* a given context, in which case it SHALL NOT visibly run any further code in that context. Other aspects of the interaction between a Jacaranda interpreter and its environment are implementation-defined. (One possibility is that a *container* is written in ESF code to control this interaction.) Rationale: The idea of using different conformance terms to indicate constraints on the implementation vs constraints on programs is borrowed from W3C specifications. The term "lock" is named after the LOCK utility in the Incompatible Timesharing System: . The allowance for implementation failures makes Jacaranda particularly easy to implement: #include int main(int argc, char **argv) { puts("Implementation failure.\n"); return 1; } More seriously, there are good reasons to allow nondeterministic failures, given that a practical Jacaranda implementation has no control over what memory or execution resources are available, and machines with unbounded resources do not exist. The requirement for these failures to be distinguishable from other errors is actually quite strong. [ES5Draft] -> ESF ================= Start with [ES5Draft]. Change the productions for a variable declaration statement to allow «const» to occur in place of «var». Also allow «const» before catch variables and function parameter formals. («const» is a in ES3/5, so it already cannot occur as an identifier.) The detailed changes to these productions are given in the DECLARATIONS, TRY_CATCH, and FUNCTIONS groups, but are mentioned here because they extend the language rather than restricting it. Section 2 Conformance - Replace A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form. with A conforming implementation of ESF shall interpret characters in conformance with the Unicode standard, version 5.1.0 or later (see section 6 for further detail). - The following paragraph does not apply to ESF: A conforming implementation of ECMAScript is permitted to provide additional types, values, objects, properties, and functions beyond those described in this specification. In particular, a conforming implementation of ECMAScript is permitted to provide properties not described in this specification, and values for those properties, for objects that are described in this specification. Rationale: Permission to provide arbitrary implementation-defined extensions is not appropriate for a security-focussed language. In the case of ESF, some kinds of extension can make it impossible to define a secure sublanguage by static restriction. For more detail see the constraints on extensions given in the changes to sections 8.6.2, 15, and 16 below. Section 3 References - Replace the reference to Unicode version 3.0 with references to Unicode version 5.1.0. Rationale: The conformance requirements for Unicode have been significantly tightened since version 3.0. In particular, there were security- relevant changes at around versions 3.2 and 4.0. Note that this does not require support for characters added in later versions. Section 6 Source Text - Replace the section before "Throughout the rest of this document" with: ESF source text is represented as a sequence of code units in the UTF-16 transformation format of Unicode, version 5.1.0 or later. The text should be in Unicode Normalization Form C as described in Unicode Standard Annex #15. An ESF implementation SHALL NOT perform any normalisation of source text itself (but may warn if the text is not in Unicode Normalization Form C). This does not preclude an ESF source text from being stored or transmitted in other character encodings, but the process by which these encodings are converted to UTF-16 is outside the scope of this document. - delete NOTE 1, and change "NOTE 2" to "NOTE". Section 7.5 Tokens - In ESF, «/*const*/» and «/*fallthru*/» are also tokens, but are not included in the production. Section 8.6.2 Internal Properties and Methods - Replace Host objects may implement these internal methods with any implementation-dependent behaviour, or it may be that a host object implements only some internal methods and not others. with A *Jacaranda-accessible property* is a property that has an accessible name as defined in group PROPERTY_NAMES. A value is *built-in-reachable* iff it is reachable in the sense of garbage collection terminology, given that only the properties listed in ESF_BUILTINS of the global object of any context, or the Jacaranda-accessible properties of any other reachable object can be obtained. ESF_BUILTINS = { 'undefined', 'Infinity', 'NaN', 'Array', 'Boolean', 'Decimal', 'Math', 'Number', 'RegExp', 'String', 'Error', 'EvalError', 'RangeError', 'ReferenceError', 'SyntaxError', 'TypeError', 'URIError' 'decodeURI', 'decodeURIComponent', 'encodeURI', 'encodeURIComponent', 'isFinite', 'parseFloat', 'parseInt' } An ESF implementation SHALL NOT add Jacaranda-accessible properties other than those specified in [ES5Draft] to an object that is at any time built-in-reachable, except as described in Section 15 below. A host object may have observably different semantics to any possible native object only if it is either not built-in-reachable, or if those semantics can affect only properties that are not Jacaranda-accessible. Section 8.7 The Reference Type - Replace the last sentence of section 8.7 with: The abstract operations GetValue and PutValue are used in this specification to operate on references. The definition of GetValue is dependent on an 'Attach' flag associated with each function or program. An ESF implementation is not required to support the change to the semantics of GetValue (relative to ES5) described here. If it does not, then the Attach flag SHALL be false for all functions and programs. If it does support this change, then the directive "use attach"; in the directive prologue of a function or program SHALL set the Attach flag to true for that program or function (including all nested functions), in the same manner that "use strict;" enables strict mode. All uses of the GetValue operation in [ES5Draft], except for the use in section 11.2.3 (Function Calls), should be considered to pass as a second argument, the value of the Attach flag for the function or program being evaluated. A function object created by step 6 of the GetValue operation below is called an *attached function object*. - Replace section 8.7.1 with: GetValue (/V/, /Attach/) 1. If Type(/V/) is not Reference, return /V/. 2. Let /base/ be the result of calling GetBase(/V/). 3. If IsUnresolvableReference(/V/), throw a ReferenceError exception. 4. If IsPropertyReference(/V/), then a. If HasPrimitiveBase(/V/) is false, then let /get/ be the [[Get]] internal method of /base/, otherwise let /get/ be the special [[Get]] internal method defined below. b. Let /R/ be the result of calling the /get/ internal method using /base/ as its *this* value, and passing GetReferencedName(/V/) for the argument. c. If /Attach/ is false or if /R/ is not a callable object that could refer to its «this» value, then return /R/. d. Let /F/ be a new function object as specified in 13.2, such that the new function throws a ReferenceError exception if its «this» is bound to any other value than /base/, but otherwise acts identically to the function given by /R/. e. Return /F/. 7. Return F. 5. Else, /base/ must be an environment record. a. Return the result of calling the GetBindingValue (see 10.2.1) concrete method of /base/ passing GetReferencedName(/V/) and IsStrictReference(/V/) as arguments. For the purpose of step 4.c, a native function object "could refer to its «this» value" if and only if its function body, excluding nested function bodies, contains the keyword «this». Whether a built-in function or callable host object "could refer to its «this» value" SHALL be determined conservatively. The following [[Get]] method is used by GetValue when /V/ is a property reference with a primitive base value. It is called using /base/ as its *this* value and with property /P/ as its argument. The following steps are taken: 1. Let /O/ be ToObject(/base/). 2. Let /desc/ be the result of calling the [[GetProperty]] internal method of /O/ with property name /P/. 3. If /desc/ is *undefined*, return *undefined*. 4. If IsDataDescriptor(/desc/) is *true*, return /desc/.[[Value]]. 5. Otherwise, IsAccessorDescriptor(/desc/) must be true so, let /getter/ be /desc/.[[Get]]. 6. If /getter/ is *undefined*, return *undefined*. 7. Return the result of calling the [[Call]] internal method of /getter/ providing /base/ as the *this* value and providing no arguments. NOTE The object that may be created in step 1 is not accessible outside of the above method. An implementation might choose to avoid the actual creation of the object. The only situation where such an actual property access that uses this internal method can have visible effect is when it invokes an accessor function. Rationale: The motivation for changing the semantics of GetValue is explained in the rationale for group EXPRESSIONS. (Briefly, it allows removing restrictions on how the result of a property access can be used.) The "use attach"; syntax is consistent with that of "use strict";. It is possible for the Jacaranda runtime to detect whether "use attach"; is supported by attempting to call a function that would be attached, for example: function isUseAttachSupported() { "use attach"; var test = {method: function() { return this; }}; var m = test.method; // should attach to test try { m(); return false; } catch (e) { return e instanceof ReferenceError; } } When GetValue creates a new attached function object, it does not automatically cache this function object in the original property. That would not be difficult to specify, but it would not be a transparent optimization, because: a) the base object B (Result(2)) may be used as a prototype for some other object, O. In that case, a lookup of the property in O could find the cached function that is attached to B, when it should find an unattached function that will be later attached to O. b) suppose that an unattached function is found in some prototype P of B, and later the property is mutated in some object in B's prototype chain, up to and including P. Meanwhile an attached version of the original function has been cached in B. For the caching to be transparent, subsequent lookups should find the mutated property value, not the cached value. There is no obvious way to ensure that the cache is invalidated in this case. c) code that is not using the attachment semantics should not be able to see cached values. These problems are not insoluble, but in combination they make the caching optimization too complicated to be worthwhile. Versions of Jacaranda before 0.4 used automatic binding (a function automatically uses the "correct" this) rather than attachment (a function that would use the "wrong" this instead throws an exception). Attachment may be more politically acceptable because it ensures that the restricted language is a fail-stop subset of ECMAScript 5. Section 11.2.3 Function Calls - In step 3 of the algorithm, replace 3. Call GetValue(Result(1)). with 3. Call GetValue(Result(1), false). Rationale: Without this change the semantics would be the same, but the algorithm would unnecessarily specify the generation of an attached function for every call to a function that refers to «this». - Add after the NOTE: An ESF implementation SHALL NOT return a value of type Reference as a result of calling a function that is built-in-reachable. Section 15 Native ECMAScript Objects - Replace Unless otherwise specified in the description of a particular function, if a function or constructor described in this section is given more arguments than the function is specified to allow, the behaviour of the function or constructor is undefined. In particular, an implementation is permitted (but not required) to throw a TypeError exception in this case. NOTE Implementations that add additional capabilities to the set of built-in functions are encouraged to do so by adding new functions rather than adding new parameters to existing functions. with Unless otherwise specified in the description of a particular function, if a function or constructor described in this section is given more arguments than the function is specified to allow, then the function SHALL either ignore the extra arguments, or throw a TypeError exception. It is implementation-defined (per function) which of these alternatives is chosen. Rationale: A object-capability language must be much more conservative in any additions to libraries that are built-in or implicitly imported, since these libraries should provide no significant authority (other than pure computation). Any vendor-specific extensions should be made in libraries that are not implicitly imported. - All references to "undefined behaviour" in the specification shall be interpreted as "implementation-dependent behaviour". Rationale: A specification of a safe language cannot allow undefined behaviour in any situation. Since this is a change in the specification of run-time behaviour, it must be changed in ESF rather than Jacaranda. Section 16 Errors - Replace * An implementation may provide additional types, values, objects, properties, and functions beyond those described in this specification. This may cause constructs (such as looking up a variable in the global scope) to have implementation-defined behaviour instead of throwing an error (such as ReferenceError). with * An implementation may provide additional values in the global scope beyond those described in this specification, subject to the constraints described in sections 8.6.2 and 15. This may cause looking up a variable in the global scope that is not described in this specification to have implementation-defined behaviour instead of throwing an error (such as ReferenceError). Rationale: The Jacaranda run-time will only rely on properties of the global scope specified in ESF_BUILTINS. Jacaranda code will not be able to access other properties added to the global scope. Note that this section also includes the following quite scary provision: * An implementation may extend program and regular expression syntax. To permit this, all operations (such as calling eval, using a regular expression literal, or using the Function or RegExp constructor) that are allowed to throw SyntaxError are permitted to exhibit implementation-defined behaviour instead of throwing SyntaxError when they encounter an implementation-defined extension to the program or regular expression syntax. However, we don't need to delete this provision, because: - we are relying on the verifier to catch any code that is syntactically invalid in ESF; - Jacaranda code cannot call eval or use the Function constructor; - RegExp can be tamed to verify that no regexp strings and flags use extensions. (In practice it must be tamed anyway, because it is broken in SpiderMonkey, as discussed in the rationale for group EXPRESSIONS.) This text must not be interpreted to allow the behaviour of ESF extensions to ES5 (const and "use attach";) to be implementation-defined. Annex B Compatibility - The additional syntax specified in section B.1 SHALL NOT be part of the ESF grammar. - An ESF program SHOULD NOT rely on the presence of any of the additional properties of String specified in sections B.2.1 to B.2.6. Lexical Grammar =============== The lexical grammar of Jacaranda is that of ESF with the additional constraints listed below. FILTER_CDATA: A Jacaranda source text MUST NOT contain any of the sequences of unescaped code units «», or any initial subsequence of these at the end of the text. A Jacaranda source text MUST NOT contain the code unit '&' unless immediately followed by a code unit from . (If followed by '&', the same restriction applies to the next code unit.) + SourceText :: ε | '<' AfterLT | ']' After1RB | '&' AfterAmp | (CodeUnit but not '<' or ']' or '&') SourceText + c=AfterLT :: ε { errors U= error(c, "unexpected end of text") } | '<' AfterLT | ']' After1RB | '&' AfterAmp | ('!' | '/' | '?') SourceText { errors U= error(c, "may be misinterpreted in HTML or XML") } | (CodeUnit but not '<' or ']' or '&' or '!' or '/' or '?') SourceText + c=After1RB :: ε { errors U= error(c, "unexpected end of text") } | '<' AfterLT | ']' After2RB | '&' AfterAmp | (CodeUnit but not '<' or ']' or '&') SourceText + c=After2RB :: ε { errors U= error(c, "unexpected end of text") } | '<' AfterLT | ']' After2RB | '&' AfterAmp | '>' SourceText { errors U= error(c, "may be misinterpreted as end of CDATA") } | (CodeUnit but not '<' or ']' or '&' or '>') SourceText + c=AfterAmp :: ε { errors U= error(c, "unexpected end of text") } | '&' AfterAmp | AmpFollower SourceText | ('<' AfterLT | ']' After1RB | (CodeUnit but not '<' or ']' or '&' or AmpFollower) SourceText) { error U= error(c, "may be misinterpreted as a character entity in HTML or XML") } + AmpFollower :: one of '=' '(' '+' '-' '!' '~' '\u0027' '"' '/' '\u005C' ' ' '\u0009' '\u000D' '\u000A' DecimalDigit Rationale: This allows verified Jacaranda source to be embedded (as a Unicode string) in an XML CDATA section or an HTML Jacaranda source may not be embedded in an HTML or XHTML _attribute_ value, even if its content is defined as CDATA, since processing of attribute values does not preserve tab, carriage return or line feed characters. LEX_CODE_UNITS: Source code units, line terminators, and whitespace MUST ONLY be used as follows: SourceCharacter :: c=[\u0000-\uFFFF] { errors U= errorIf(c isnot ValidChar, c, "invalid code unit") } + ValidChar :: one of '\u0009' '\u000A' '\u000D' // exclude ASCII controls except TAB, LF, CR. [\u0020-\u007E] // \u007F-\u009F controls [\u00A0-\u00AC] // \u00AD :Cf: [\u00AE-\u05FF] // \u0600-\u0603 :Cf: [\u0604-\u06DC] // \u06DD :Cf: [\u06DE-\u070E] // \u070F :Cf: [\u0710-\u17B3] // \u17B4-\u17B5 :Cf: [\u17B6-\u200A] // \u200B-\u200F :Cf:; \u200C-\u200F !Safari !IE6 [\u2010-\u2027] // \u2028-\u2029 !Cajita; \u202A-\u202E :Cf: !Safari !IE6 [\u202F-\u205F] // \u2060-\u2064 :Cf:; \u2065-\u2069 !Cajita; // \u206A-\u206F :Cf: !Safari !IE6 [\u2070-\uD7FF] // \uD800-\uDFFF UTF-16 surrogates [\uE000-\uFDCF] // \uFDD0-\uFDEF noncharacters [\uFDF0-\uFEFE] // \uFEFF (BOM) :Cf: !Safari !IE6 [\uFF00-\uFFEF] // \uFFF0-\uFFF8 !Safari !Firefox; // \uFFF9-\uFFFB :Cf: // \uFFFC (object replacement character) // \uFFFD (replacement character) // \uFFFE-\uFFFF noncharacters + PrintableChar :: [\u0020-\u007E] + BOM :: '\uFEFF' c=WhiteSpace :: '\u0009' | '\u000B' { errors U= error(c, "VT character not allowed as whitespace") } | '\u000C' { errors U= error(c, "FF character not allowed as whitespace") } | '\u0020' | '\u00A0' { errors U= error(c, "NBSP character not allowed as whitespace") } | USP { errors U= error(c, "non-ASCII character not allowed as whitespace") } LineTerminator :: '\u000A' | '\u000D' | c='\u2028' { errors U= error(c, "LS not allowed as line terminator") } | c='\u2029' { errors U= error(c, "PS not allowed as line terminator") } LineTerminatorSequence :: // only used in LineContinuation '\u000A' | '\u000D' [lookahead ∉ '\u000A'] | '\u2028' | '\u2029' | '\u000D\u000A' lc=LineContinuation :: '\' LineTerminatorSequence { errors U= error(c, "line continuation not allowed") } Note that in [ES5Draft] sections 7.2 and 7.3, both the tables and the grammar must be changed. This rule does not restrict the set of code units that may be represented by \u escapes. This rule does not affect the set of whitespace code units used in the specification of ToNumber ([ES5Draft] section 9.3.1). A BOM (Byte Order Mark) code unit is treated as a token for the purpose of lexing Jacaranda. Note that BOM is not included in the ValidChar production, but is nevertheless valid where allowed by the syntactic grammar. Rationale: Start with \u0000-\uFFFF, and - remove code units that are not in XML 1.1's or that are in XML 1.1's (see below) - remove code units corresponding to BMP characters that are in category :Cf: (Format Control) in Unicode 5.1.x: (this is a superset of category :Cf: in all previous Unicode versions) - remove code units corresponding to BMP characters that are in category :Zs: (Whitespace) in Unicode 5.1.1: other than \u0009 and \u0020. - remove code units that are disallowed by Cajita according to . - remove \uFFFC and \uFFFD. and from XML 1.1 are used just because that is a slightly less arbitrary choice than us deciding on our own what control characters should be restricted. Disallowing :Cf: characters is in principle the wrong thing for proper internationalisation support. However, we eventually concluded that it's the only way to guarantee consistent and secure behaviour across browsers, which in this case trumps internationalisation. (The right thing would be to allow :Cf: characters other than \uFEFF only in string literals and comments, and to allow \uFEFF [byte order mark] only between tokens. Relaxing the rules later is easier than tightening them.) Whitespace is restricted to US-ASCII tab and space characters. There is no reason to internationalise whitespace used to separate tokens in source code. (Other space code units are still allowed in strings.) Also, the fact that vertical tab and form feed were treated as whitespace rather than as line separators was potentially confusing; Jacaranda disallows them. BOM tokens (and therefore BOM code units) are only allowed immediately before a module definition -- see group MODULE_DEFINITION. LEX_COMMENT_FIRST_CHAR: If the first code unit in a after «/*» is not «*» or «\u0020» or «\u000A» or «\u000D», or if the first code unit in a after «//» is not «/» or «\u0020», then all code units in the comment MUST be printable US-ASCII (that is, [\u0020-\u007E]). The necessary grammar change is given in group LEX_COMMENTS. Rationale: This prevents an attack relevent to adversarial code review, where a comment that looks like «/*const*/» does not actually enforce constness, and similarly for other comments used by JSLint, for example. It also prevents the first code unit in a comment from being a non-US-ASCII code unit that is stripped on parsing, which might plausibly result in the LEX_COMMENT_NO_ATSIGN_EXTENSIONS group being bypassed. LEX_COMMENT_NO_ATSIGN_EXTENSIONS: The first code unit in any comment after «/*» or «//» MUST NOT be «@». The necessary grammar change is given in group LEX_COMMENTS. Rationale: This may trigger non-ESF-compliant extensions (for example or ). ADsafe rejects «@cc» anywhere in a comment. This is sufficient to prevent enabling of JScript conditional compilation, but not the Venkman debugger extensions. Caja and Cajita do not need to reject «@» in comments, because the output of cajoling does not include any comments. See . LEX_COMMENTS: «/*const*/» and «/*fallthru*/» are treated as tokens. (Other comments are not tokens.) They MUST ONLY appear where specified in the Jacaranda grammar. Note the lack of spaces before and after «const» or «fallthru». Also note that only these exact spellings are recognised; for example, «/*\u0063onst*/» is not recognised as a «/*const*/» token, even though «\u0063» would represent «c» within a string literal. A single-line comment at the end of the text MUST be terminated by a LineTerminatorSequence. The lexical grammar changes resulting from this and groups LEX_COMMENT_FIRST_CHAR and LEX_COMMENT_NO_ATSIGN_EXTENSIONS are: + Printable :: ε | PrintableChar Printable + MultiLineNotAsteriskOrSpaceOrCROrLFOrAtChar :: SourceCharacter but not '*' or '\u0020' or '\u000A' or '\u000D' or '@' MultiLineCommentChars :: see [ES5Draft] section 7.4 PostAsteriskCommentChars :: see [ES5Draft] section 7.4 MultiLineNotAsteriskChar :: see [ES5Draft] section 7.4 MultiLineNotForwardSlashOrAsteriskChar :: see [ES5Draft] section 7.4 mlc=MultiLineComment :: '/**/' | '/**' PostAsteriskCommentChars? '*/' | '/*' ('\u0020' | '\u000A' | '\u000D') MultiLineCommentChars? '*/' | '/*@' MultiLineCommentChars? '*/' { errors U= error(mlc, "multi-line comment starts with '@'") } | '/*' first=MultiLineNotAsteriskOrSpaceOrCROrLFOrAtChar rest=MultiLineCommentChars? '*/' { errors U= errorIf(first isnot Printable or rest isnot Printable, mlc, "multi-line comment does not start with '*' or space or " ++ "newline and is not printable ASCII") } + SingleLineNotForwardSlashOrSpaceOrAtChar ::" SourceCharacter but not LineTerminator or '/' or '\u0020' or '@' SingleLineCommentChars :: see [ES5Draft] section 7.4 SingleLineCommentChar :: see [ES5Draft] section 7.4 + SingleLineCommentEnd :: LineTerminator | eot=EndOfText { errors U= error(eot, "single-line comment at end of text must end with a " ++ "line terminator") } slc=SingleLineComment :: '//' SingleLineCommentEnd | '///' SingleLineCommentChars? SingleLineCommentEnd | '//\u0020' SingleLineCommentChars? SingleLineCommentEnd | '//@' SingleLineCommentChars? SingleLineCommentEnd { errors U= error(slc, "single-line comment starts with '@'") } | '//' first=SingleLineNotForwardSlashOrSpaceOrAtChar rest=SingleLineCommentChars SingleLineCommentEnd { errors U= errorIf(first isnot Printable or rest isnot Printable, slc, "single-line comment does not start with '/' or space " ++ "and is not printable ASCII") } Rationale: /*const*/ and /*fallthru*/ must be ES3 or ES5 comments in order for Jacaranda to remain a sublanguage of ESF. (They are lexed as multiline comments, but should then be retained as tokens, while other comment tokens are thrown away.) See group ASSIGNABLE_EXPRESSIONS for a description of the use of /*const*/ (it is essentially the same as 'const' as implemented in some current browsers and proposed for ES-Harmony, but with static rejection instead of silent failure). A SingleLineComment must be terminated by a LineTerminatorSequence, to avoid the possibility that any end-of-CDATA marker or end-tag added when embedding verified source might be treated as part of a comment: (It should not be; we're just being paranoid.) LEX_NUMERIC_LITERALS: A MUST NOT have more then 20 significant digits. As specified in [ES5Draft] section 7.8.3, the source code unit immediately following a MUST NOT be an or . The following attribute definitions are used to compute the value of a numeric literal, which is needed for other rules. They correspond exactly to the procedure in [ES5Draft] section 7.8.3, expressed in the same attribute grammar notion as the rest of this specification. nl=NumericLiteral :: | dl=DecimalLiteralNoExp (AfterNumeric but not ExponentIndicator or DecimalDigit)? { MV = dl.MV; SV = NumberToString(MV) } | dl=DecimalLiteralNoExp ExponentIndicator ep=SignedInteger (AfterNumeric but not DecimalDigit)? { MV = dl.MV * 10^(ep.MV); SV = NumberToString(MV) } | '0' HexIndicator hi=HexInteger (AfterNumeric but not HexDigit)? { MV = hi.MV; SV = NumberToString(MV) } + AfterNumeric :: dd=DecimalDigit { errors U= error(dd, "disallowed DecimalDigit character after numeric literal") } | is=IdentifierStart { errors U= error(is, "disallowed IdentifierStart character after numeric literal") } + dl=DecimalLiteralNoExp :: dil=DecimalIntegerLiteral '.'? { MV = dil.MV; errors U= errorIf(dil.significant > 20, dl, "> 20 significant digits in decimal literal") } | dil=DecimalIntegerLiteral '.' dd=DecimalDigits { if (match (dil ++ dd) as digits=DecimalDigits) { MV = dil.MV + dd.MV*10^-(dd.n); errors U= errorIf(digits.significant > 20, dl, "> 20 significant digits in decimal literal") } else { errors U= error(dl, "cannot happen") } } | '.' dd=DecimalDigits { MV = dd.MV*10^-(dd.n); errors U= errorIf(dd.significant > 20, dl, "> 20 significant digits in decimal literal") } DecimalIntegerLiteral :: d=DecimalDigit { MV = d.MV; n = 1 } | d=NonZeroDigit dd=DecimalDigits { MV = d*10^(dd.n) + dd.MV; n = dd.n + 1 } DecimalDigits :: '0' { MV = 0; significant = 0; n = 1 } | d=NonZeroDigit { MV = d.MV; significant = 1; n = 1 } | '0' rest=DecimalDigits { MV = rest.MV; significant = rest.significant; n = rest.n + 1 } | d=NonZeroDigit rest=DecimalDigits { MV = d.MV*10^(rest.n) + rest.MV; significant = rest.significant + 1; n = rest.n + 1 } DecimalDigit :: '0' { MV = 0 } | NonZeroDigit NonZeroDigit :: '1' { MV = 1 } | '2' { MV = 2 } | '3' { MV = 3 } | '4' { MV = 4 } | '5' { MV = 5 } | '6' { MV = 6 } | '7' { MV = 7 } | '8' { MV = 8 } | '9' { MV = 9 } ExponentIndicator :: one of 'e' 'E' SignedInteger :: DecimalDigits | '+' dd=DecimalDigits { MV = dd.MV } | '-' dd=DecimalDigits { MV = -(dd.MV) } + HexIndicator :: one of 'x' 'X' + HexInteger :: hd=HexDigit { MV = hd.MV } | hi=HexInteger hd=HexDigit { MV = hi.MV*16 + hd.MV } HexDigit : see group LEX_STRING_LITERALS Rationale: [ES5Draft] section 7.8.3 says that: Once the exact MV for a numeric literal has been determined, it is then rounded to a value of the Number type. If the MV is 0, then the rounded value is +0; otherwise, the rounded value must be /the/ number value for the MV (in the sense defined in 8.5), unless the literal is a and the literal has more than 20 significant digits, in which case the number value may be either the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit or the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit and then incrementing the literal at the 20th significant digit position. Therefore decimal literals with more than 20 significant digits cannot be used to obtain any additional precision. (Note that 20 significant digits are more than sufficient to round-trip any IEEE double value.) This restriction is also necessary in order to enforce that properties in object literals are unique. The issue here is that if the conversion to an IEEE double is nondeterministic, then the verifier may fail to detect names that should be treated as colliding in group OBJECT_LITERALS. LEX_STRING_LITERALS: A MUST ONLY use one of the escape characters «b», «t», «n», «f», «r», double-quote, single-quote, backslash, or «/». A string token MUST NOT contain a . The following attribution rules are used to compute the value of a string literal, which is needed for other rules. They correspond exactly to the procedure in [ES5Draft] section 7.8.4, expressed in the same attribute grammar notion as the rest of this specification. The grammar has been altered to not require lookahead in the production. StringLiteral :: '"' ds=DoubleStringCharacters '"' { SV = ds.SV } | '\u0027' ss=SingleStringCharacters '\u0027' { SV = ss.SV } DoubleStringCharacters :: ε { SV = [] } | c=DoubleStringCharacter rest=DoubleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDDoubleStringCharacters // no initial decimal { SV = ['\u0000'] ++ rest.SV } + NIDDoubleStringCharacters :: ε { SV = [] } | c=NIDDoubleStringCharacter rest=DoubleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDDoubleStringCharacters { SV = ['\u0000'] ++ rest.SV } DoubleStringCharacter :: c=SourceCharacter but not '"' or '\u005C' or LineTerminator { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation + NIDDoubleStringCharacter :: c=SourceCharacter but not '"' or '\u005C' or LineTerminator or DecimalDigit { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation SingleStringCharacters :: ε { SV = [] } | c=SingleStringCharacter rest=SingleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDSingleStringCharacters { SV = ['\u0000'] ++ rest.SV } + NIDSingleStringCharacters :: ε { SV = [] } | c=NIDSingleStringCharacter rest=SingleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDSingleStringCharacters { SV = ['\u0000'] ++ rest.SV } SingleStringCharacter :: c=SourceCharacter but not '\u0027' or '\u005C' or LineTerminator { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation + NIDSingleStringCharacter :: c=SourceCharacter but not '\u0027' or '\u005C' or LineTerminator or DecimalDigit { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation LineContinuation :: '\u005C' e=LineTerminatorSequence { errors U= error(e, "line continuation not allowed") } EscapeSequence :: CharacterEscapeSequence | HexEscapeSequence | UnicodeEscapeSequence CharacterEscapeSequence :: SingleEscapeCharacter | NonEscapeCharacter se=SingleEscapeCharacter :: 'b' { CV = '\u0008' } | 't' { CV = '\u0009' } | 'n' { CV = '\u000A' } | 'f' { CV = '\u000C' } | 'r' { CV = '\u000D' } | '"' { CV = '\u0022' } | '\u0027' { CV = '\u0027' } // ' | '\u005C' { CV = '\u005C' } // \ | 'v' { CV = '\uerrors U= error(se, "\\v escape") } ne=NonEscapeCharacter :: '/' { CV = '\u002F' } // explicitly allowed by JSON | SourceCharacter but not '/' { errors U= error(ne, "unrecognised escape") } HexEscapeSequence = 'x' a=HexDigit b=HexDigit { CV = [a.MV*16 + b.MV] } UnicodeEscapeSequence :: 'u' a=HexDigit b=HexDigit c=HexDigit d=HexDigit { CV = [a.MV*4096 + b.MV*256 + c.MV*16 + d.MV] } HexDigit :: '0' { MV = 0 } | '1' { MV = 1 } | '2' { MV = 2 } | '3' { MV = 3 } | '4' { MV = 4 } | '5' { MV = 5 } | '6' { MV = 6 } | '7' { MV = 7 } | '8' { MV = 8 } | '9' { MV = 9 } | 'A' { MV = 10 } | 'B' { MV = 11 } | 'C' { MV = 12 } | 'D' { MV = 13 } | 'E' { MV = 14 } | 'F' { MV = 15 } | 'a' { MV = 10 } | 'b' { MV = 11 } | 'c' { MV = 12 } | 'd' { MV = 13 } | 'e' { MV = 14 } | 'f' { MV = 15 } Rationale: JScript does not implement «\v» (it treats it as «v»). This could allow lexical confusion attacks. Unlike Caja, we cannot rewrite it to a different escape. There is no reason to allow «\» followed by an escape character that does not have a defined meaning. However, we must allow «\/» in order to remain a superset of JSON. (Unfortunately, JSON does not allow «\'», otherwise JSON and Jacaranda would support identical syntax for double-quoted strings. We considered disallowing «\'» in double-quoted strings, but that would create an inconsistency between single- and double-quoting.) Line continuations cannot be allowed, even if 'ES5' ∈ Options, because an ES3 parser may lex a module definition incorrectly if they are used. LEX_NO_REGEXP_LITERALS: Regular expression literals MUST NOT be used. Rationale: ES3 requires that each regexp literal in a program share a single RegExp object. However RegExp objects are mutable, which makes this sharing error-prone, and provides implicit communication channels that should not be present in an object-capability language. In ES5, regexp literals create a new RegExp object each time they are evaluated, but the fact that RegExp objects are mutable is still undesirable, and ConstRegExp SHOULD be used in preference. To get an idea of how often the mutability of RegExp objects trips up developers, see the huge number of duplicate bugs for (many of which concern mutability in general, not the issue that is fixed in ES5). Removing mutable RegExp objects entirely would unfortunately break too much code. Supporting regexp literals, however, is not feasible without complicating lexing enormously and incurring unacceptable security risks. A JavaScript lexer must be implemented using feedback from the parser, because whether a «/» character (not in a comment) is lexed as part of a or a depends on whether a is syntactically allowed at that position. In Jacaranda, it would have been possible to distinguish such positions by using a finite state machine in the lexer, and this approach was seriously considered for Jacaranda draft 0.4. However, it was abandoned when it became clear (as a consequence of a bug in JScript.Net), that this would make Jacaranda highly vulnerable to a complete break in the case of any noncompliance in a JavaScript implementation's lexer. In more detail, the bug in the JScript.Net lexer is that «/=» is incorrectly tokenized as a division-assignment even in contexts where a regexp literal can occur: An exploit for this bug could have been of the form: try {/=} catch (e) {toast();} try{1/+1;} catch (e2) {} except that JScript.Net is a fully static compiler, so syntax errors cause failure of compilation rather than a run-time exception. Although this specific bug in JScript.Net would not have been exploitable, the risk of similar unknown bugs in other implementations, given the complexity of distinguishing and correctly, was considered to be too great. The ANTLR-based implementation of Jacaranda does lex regular expression literals (using feedback from the parser to the lexer), but it does so only after signalling a verification error, so that no security failure can occur even if the verifier's lexing would differ from that of an implementation under which the code might have been run. LEX_SEMICOLON_INSERTION: Section 7.9.1 of [ES5Draft] describes the rules by which semicolons are implicitly inserted into ES5 programs under certain conditions that would otherwise be a syntax error. In Jacaranda, those conditions are made more restrictive, _as though_ the following *grammatical semicolon insertion* rule were in effect: A *nonsubstatement* is a statement that is not a substatement of a , , , , , or ; and is not the last statement of a case or default clause of a . When, as the program is parsed from left to right, a «}» token is encountered where the «;» token that would terminate a nonsubstatement is expected, then parsing occurs as-if by inserting a «;» token before the «}». The effect of this rule is actually specified by modifications to the Jacaranda grammar, which remains context-free. In addition, * A postfix «++» or «--» operator MUST be on the same line as the last token of its operand; * If a continue, break, return or throw statement takes an operand (label or expression), then the keyword MUST be on the same line as the first token of its operand. Rationale: First read . It is obvious that we do not want "syntactic semicolon insertion" [SSI below] -- it is too difficult to specify and too difficult for programmers to understand its consequences. But is the above wiki page correct in asserting that "grammatical semicolon insertion" [GSI below] is "relatively harmless"? The wiki page oversimplifies how GSI could be specified or implemented: it is not sufficient to simply make semicolons optional in statement productions, since they are not optional when followed by another statement. Nevertheless, it is feasible to specify this in a context-free grammar. GSI is almost equivalent to treating ';' as a statement _separator_. The assertion that GSI in this sense is harmless is probably not entirely correct. In The Impact of Language Design on the Production of Reliable Software an experiment was performed comparing two languages -- TOPPS, which used semicolons as separators, and a redesigned language TOPPSII, which used semicolons as terminators (as well as having many other changes). The results were interpreted as follows: In using the semicolon as a separator, rather than as a statement terminator, TOPPS was following a long and honorable tradition (ALGOL 60, Pascal, BLISS, etc.). However, the TOPPSII form (similar to that of PL/I) led to an order of magnitude reduction in the number of semicolon errors. Of course, most semicolon errors are rather trivial (i.e., they generally do not persist more than one run). However, a small modification to the language would have eliminated errors that occurred on more than a quarter of all compilations. It is also interesting to note that over 14% of the second-half TOPPS errors are still semicolon errors (compared to 1% for TOPPSII). On the other hand, this study considered programmers who had previously only used PL/I (which as stated, uses semicolons as terminators), and had only very limited programming experience. It is not clear that its conclusions are relevant to JavaScript -- not least because TOPPS prohibited trailing semicolons in a block rather than making them optional. We consider the benefit of compatibility with a slightly larger set of JavaScript programs to outweigh the potential for additional errors in this case. Jacaranda does not make semicolons optional when they terminate a substatement, for example the semicolons after both «return» statements in this example are required: (function (foo) { if (foo) return 1; else return 2; })(); This is to simplify the grammar, and because omitting the semicolon would obscure the fact that the substatement is a separate statement to its parent. Due to a bug in most JavaScript implementations other than Opera, the behaviour of programs such as the following is not conformant to ES3 or ES5 on those implementations: (function () { a: for (;;) { for (;;) { break/* */a } return 1 } return 2 })(); ES3 requires the comment that is split over lines to be treated as a line terminator, so that «break [LineTerminator] a;» will not be syntactically valid, and a semicolon will be inserted after «break». The buggy implementations do not perform this insertion, instead interpreting the statement as «break a;» and returning 2. Jacaranda is not vulnerable to this issue, because of the rules requiring postfix «++» or «--» to be on the same line as the last token of its operand, and the first token of the operand (if present) of a «continue», «break», «return» or «throw» statement to be on the same line as the keyword. The example above, for instance, is not valid because «a» is not on the same line as «break». Module Definitions ================== MODULE_DEFINITION: A module definition MUST be lexable according to the Jacaranda lexical grammar; then, the resulting token stream MUST be parseable as the production below. The *directive prologue* of a is the longest initial sequence of in its body that consist of whose expression is a . The passed as an argument to $module MUST NOT be named, and its directive prologue MUST contain the directives "use strict" and "use jacaranda". If 'attach' ∈ Options then the directive prologue MUST also contain the directive "use attach". [# Design issue: it would be nice for the runtime to be able to detect the "use attach" directive, so that we wouldn't need the 'attach' option. #] Every freely-used identifier of a module (as defined by the SCOPE_OF_IDENTIFIERS group) MUST occur in IMPLICIT_IMPORTS. A module MUST NOT have any freely-assigned names. [# FIXME: directive prologues not implemented correctly in grammar. #] [# FIXME: need to be closer to implementation here. #] program : BOM* programElements? + programElements module BOM* programElements? | [lookahead ... {'$module'}] sourceElements sl=statementList { if (match sl as mdef=moduleDefinition) { errors = mdef.errors } else { if (match "function() {" ++ sl.text ++ "}" as f=functionExpression) { errors = error(sl, "not a module definition") U f.errors U errorIf('this' ∈ f.freelyUsedNames, sl, "'this' used freely at top-level") } } else { errors = error(sl, "not a module definition, and parsing as a function body did not work (please report this)") } } } + module : '$module' '(' name=StringLiteral ',' ol=optionsList f=function ')' ';' { Options = ol.options; directives = {}; errors U= errorIf('use strict' ∉ f.directives, f, "module function has no \"use strict\" directive") U errorIf('use jacaranda' ∉ f.directives, f, "module function has no \"use jacaranda\" directive") U errorIf('attach' ∈ Options and 'use attach' ∉ f.directives, f, "attach option specified but module function has no \"use attach\" directive") U errorIf(f.looseNames != {}, f, "module function must not be named") U { err(f, "freely-used identifier "++quote(id)++ " (not in implicit imports) at module level") | id <- f.freelyUsedNames \ IMPLICIT_IMPORTS } U { err(f, "freely-assigned identifier "++quote(id)++" at module level") | id <- f.freelyAssignedNames } } + optionsList : ε | opt=StringLiteral ',' list=optionsList { let strip('!' ++ S) = S, strip(_) = _; options = strip(opt.SV) U list.options; errors U= errorIf(opt.SV startswith '!' and opt.SV != '!ES5' and opt.SV != '!attach', opt, "important option is not recognised") } + BOM :: see group LEX_CODE_UNITS IMPLICIT_IMPORTS = ESF_BUILTINS U functionExpression : see group FUNCTIONS Rationale: The format of a module definition has been redesigned for Jacaranda 0.4x. The imports of a module are now specified as parameters to the module function. (In Jacaranda 0.3, a 'with' statement was used to inject bindings into the scope of the module.) Because ES5 does not support keyword parameters, the Jacaranda runtime uses a hack involving parsing the result of 'toString()' on the module function to find its parameter names; it then maps the corresponding properties of the imports object to the function arguments in the correct order. There will be a runtime exception if this results in access to a protected property of the imports object. This approach simplifies the specification by avoiding the need to compute the imports from the module object literal; in fact, there is no module literal any more (eliminating all of the 'sef' grammar productions from Jacaranda 0.3), nor any concept of a "module object". It also makes module definitions prettier and less obscure, and allows the import names to be unquoted. To maintain some of the flexibility of the approach that used a module literal, it is possible to pass an arbitrary sequence of "options", specified as string literals, to $module. Because occurrences of explicit imports are now bound occurrences, the remaining freely-used names of a module function must all be implicit imports. Since the set of implicit imports is fixed for a given version of Jacaranda, it is possible to run the module code in a lexical scope that provides bindings for all of these imports. Requiring the imports to be declared also has software engineering advantages, and could possibly detect some errors due to misspelled words. In these respects it is similar to a /*globals*/ declaration in JSLint. Note that future versions of Jacaranda may add new implicit imports. In that case, a module that was verified by the later version's verifier should not be run under the previous version's runtime, since access to the added imports will not be blocked. (The approach used in Jacaranda 0.3 also had this deficiency.) There is no longer any concept of optional imports. However, if a module has a parameter named "imports", it will be mapped to the imports object, enabling the use of $get(imports, 'foo') and $hasPublic(imports, 'foo'). Imports can be directly used, but not directly assigned to. In the case of explicit imports, this restriction is now an automatic consequence of declaring these imports as parameters, which are const in Jacaranda. This ensures that there is no implicit per-module or per-caplet mutable state. (Explicit per-caplet mutable state can be declared by using «var» in some outer function.) The main rationale for this restriction is to allow confinement at the granularity of groups of objects, rather than only between module instances (see the message quoted at ). Note that the module function expression must be anonymous, since a named function expression might have the side effect of defining the name in the surrounding scope. (Since Jacaranda 0.41, it is no longer required that all function expressions are anonymous.) One way to submit a module to a Jacaranda interpreter is to simply execute the module definition as ESF code -- for example in a