Jacaranda Language Specification, draft 0.4 "Inside every large problem is a small problem struggling to get out." -- C. A. Hoare Introduction ============ This document is draft version 0.4 of a specification for the Jacaranda programming language. Jacaranda is an object-capability [Miller2006] sublanguage of JavaScript. To be more precise: Define language A to be an *attribute-verifiable sublanguage* of language B if-and-only-if: - there exists an attribute grammar [Knuth1968, Knuth1971, Swierstra2005] for A whose underlying grammar recognises a subset of texts recognised by B, and that deterministically computes an 'errors' attribute (whose value is a set of error indicators) on the start symbol; and the semantics of executing a text in A are: - if the text is not recognised by A's underlying grammar or if the 'errors' attribute on the start symbol is non-empty, report these errors. Otherwise, execute the program according to the specification of language B, in a given execution environment. By *ESF*, we mean the programming language specified by [ES5Draft], subject to changes described in the "[ES5Draft] -> ESF" section below. Jacaranda is an attribute-verifiable sublanguage of ESF. (ESF also supports syntax for declaring «const» variables along lines proposed for ES-Harmony, and an optional improvement to the semantics of the GetValue operation that is also similar to a previous ES4 proposal.) In fact, the underlying lexical and syntactic grammars used to specify Jacaranda are almost identical to those of ESF; the only differences are the addition of rules to compute attributes, and some minor changes needed to handle grammatical semicolon insertion [# CHECK #]. Note that _any_ implementation of language B that satisfies B's specification when applied to programs in the A sublanguage, is sufficient to execute a verified A program. In the case of Jacaranda, the sublanguage avoids many known bugs in commonly used JavaScript implementations, and so verified code does not necessarily need to be run on a fully correct implementation of ESF. By using this definition of subsetting, Jacaranda differs from some other proposals for security-focussed sublanguages of ECMAScript, such as Caja and Cajita [Miller2008] or FBJS [Facebook]. There are many advantages of specifying and implementing a safe sublanguage by attribute-verification rather than by rewriting: - In implementations of the languages mentioned above, the code is rewritten to a form that can introduce additional run-time errors (or other changes in behaviour, called "gotchas" in the Caja/Cajita specification), that would not occur when running it directly as ECMAScript. In Jacaranda, there are necessarily no changes in the possible run-time behaviour of a verified program relative to the same program in ESF. There are cases where an existing JavaScript program might have to be modified in such a way that additional run-time errors could occur -- for example some property accesses may have to be replaced by function calls, which can throw exceptions -- but these can only happen where there are explicit changes in the source code. - A verifier is easier to understand and has fewer sources of potential error, both in the specification and implementation, than a rewriter. To understand a rewriter, it is necessary to keep in mind the semantics of both the source and target languages, and the run-time library that the translated code depends on, and be convinced that: - the source language semantics are preserved by the translation; - no source program is translated in a way that will lead to the translated code breaking the intended security properties. To specify and implement a verified sublanguage, on the other hand, it is only necessary to define how the language is restricted syntactically; then the security properties of the sublanguage can be understood independently. If the restriction is done by an attribute grammar, there is little room for ambiguity in the specification of the verifier. The semantics are already defined by the original language specification, modulo bugs and ambiguities in that specification. A specification that defines a rewriting into another language is likely to be more complicated, and give a result that is less directly useful and understandable to most users of the sublanguage. (Jacaranda also requires a run-time library written in ESF, but that library is relatively small and of low implementation complexity. In fact it is mostly a set of utility functions which, although they are written in ESF and therefore have to be trusted, are each very short and can be understood independently.) - As mentioned above, for Jacaranda the underlying lexical and syntactic grammars on which attributes are computed are exactly as specified by ESF (and almost exactly as specified by [ES5Draft]). This means that: - the subsetting relation between Jacaranda and ESF holds by construction; - it is much easier to see that all possible cases have been accounted for in the specification, since they are all enumerated by an existing grammar; - an existing ECMAScript parser can be used, provided that it follows the ESF grammar closely enough and enforces it strictly; - it is possible to parse any syntactically valid ECMAScript program (even an invalid ECMAScript or JavaScript program if the parser can recover from errors) and report deviations from the Jacaranda sublanguage in one pass, rather than requiring the programmer to fix each error before the rest of the program can be parsed. - It is easy to perform white-box testing of a verifier based on an attribute grammar, because the intermediate results are deterministic and do not depend on any implementation choices. The attribute grammar specifies precisely what tree of attribute values are to be computed for any given input. So a single white-box test suite can be run against all implementations, or implementations can be compared against each other for "fuzzed" input , and any differences in the resulting attribute tree (not just differences in the overall pass/fail outcome of verification) can be treated as bugs. - In general an attribute grammar has *synthesized* and *inherited* attributes. The synthesized attributes of a node depend only on the attributes and contents of its immediate child nodes; while the inherited attributes of a node depend only on its parent node (and on its other attributes, possibly). A grammar that has only synthesized attributes is called *S-attributed*. Such grammars admit a verifier implementation that is particularly simple and efficient in time and memory: programs can be parsed and verified in a single pass, by either a "top-down" or "bottom-up" style of parser, without necessarily constructing any syntax tree explicitly. The Jacaranda attribute grammar is "almost" S-attributed, in the sense that it has only two inherited attributes ('DependOnAutoAttach' and 'DependOnES5'), which are both determined by the first token of the source text, and have the same value in all nodes. In practice this case is as easy to implement as an S-attributed grammar; it requires just a single post-order traversal of the parse tree. If an explicit parse tree representation is required (e.g. for a code generation framework), the representation could also ensure that only error-free subtrees are constructed, since the attributes for any subtree can be computed independently if the expected value of the 'DependOnAutoAttach' and 'DependOnES5' attributes are given. Attribute computation can be implemented using the "semantic actions" supported by most existing general-purpose parser systems, and is also straightforward to implement in an ad-hoc recursive descent parser. - There are no difficulties in reverifying a program that has already been verified. A non-idempotent rewriter, on the other hand, cannot usefully process a code fragment twice -- even if the result is correct, it will be too inefficient. (This also means that it would be possible to apply N-version programming, with verifiers written using different languages and/or parser libraries by independent teams of programmers: a simple harness could run all versions concurrently and if they differ, reject the input program and report a bug. Of course this will only directly catch implementation rather than specification bugs, and it only applies to the verifier; the JavaScript implementation(s) used to execute verified programs still need to be tested more conventionally.) - Avoiding rewriting means that less inefficiency is introduced into the code that is being run. While there is some overhead due to changing property accesses to function calls, this is explicitly visible in the source code, and can be avoided in many cases (especially in the 'DependOnAutoAttach' mode, as explained later). - Debugging is simplified because it is the original code that is being debugged. Although it is possible in principle to provide a debugger with a mapping between lines of the input code and its rewritten form, in our experience this often doesn't work very well (particularly when the ordering of code is changed, as it is by many of the Caja rewriting rules). Despite these differences, Jacaranda is strongly influenced by Caja and Cajita, and by ADsafe [Crockford], which is also a static sublanguage of ES3 that could in principle be specified by an attribute grammar. (Differences between Jacaranda and ADsafe are out of scope for this document, but in general Jacaranda allows a qualitatively larger sublanguage of ECMAScript than ADsafe. Also, Jacaranda is an object-capability language with per-object encapsulation, whereas ADsafe only provides encapsulation between module instances.) We encountered the idea of using attribute grammars for validation (as opposed to their more common uses in assigning semantics or for code generation) from Meredith Patterson and Len Sassaman [PS2009]. The earliest reference we can find to a sublanguage of an existing programming language being defined in terms of an attribute grammar, is a 1977 paper defining a sublanguage of Algol 68 [Simonet1977]. Edgar Irons did much of the early work that led to the development of attribute grammars [Irons1963]. [Irons1963] Edgar T. Irons, Towards more versatile mechanical translators, Proceedings of Symposia in Applied Mathematics 15, pages 41-50. American Mathematical Society, 1963. [Knuth1968] Donald Knuth, Semantics of Context-Free Languages, California Institute of Technology, Theory of Computing Systems, Volume 2 Number 2. Springer, New York, June 1968. [Knuth1971] Donald Knuth, Semantics of Context-Free Languages: Correction, Stanford University, Theory of Computing Systems, Volume 5 Number 2. Springer, New York, June 1971. [Simonet1977] M. Simonet, An attribute description of a subset of Algol 68, Proceedings of the Strathclyde ALGOL 68 conference, Glasgow, Scotland, pages 129-137, 1977. [Swierstra2005] Wouter Swierstra, Why Attribute Grammars Matter, The Monad.Reader, Issue Four, 01-07-05. [PS2009] Meredith L. Patterson and Len Sassaman, Validation by Parse Tree Comparison Extended to Context-Sensitive Grammars, In preparation, 2009. [Miller2006] Mark S. Miller, Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control, PhD thesis, Johns-Hopkins University, May 2006. [Miller2008] Mark S. Miller, Caja draft specfication, Latest version available from [Crockford] Douglas Crockford, ADsafe specification, [Facebook] FBJS - Facebook Developers Wiki, [ES5Draft] "Sunnyvale" draft specification for ECMAScript 3.1/5, updated through Errata 8. Available from . Conventions =========== Text in a subsection following "Rationale:" is non-normative; that is, it does not impose any conformance requirements on Jacaranda implementations or programs. The same programming language would be specified if this text were removed. Text surrounded by [# #] is a non-normative editing note. Definitions of technical terms used in this specification should be taken from the first of the following documents that provides a definition for the term: 1. This document. 2. [ES5Draft]. 3. The Unicode Standard, version 5.1.0. Defining occurrences of terms in this document are shown like this: *term*. A *string* is a sequence of zero or more UTF-16 code units (not necessarily well-formed). In this specification, strings are often presented in single quotes; this has the same meaning, including use of \u escape sequences, as the corresponding string literal in ESF. Terms applied to strings such as *starts with*, *ends with*, etc. are defined in terms of this code unit sequence, regardless of whether it is a well-formed UTF-16 encoding. For example, '\uD800__' is considered to end in '__'. Code fragments that appear "inline" in the text are surrounded by double angle quotation marks: « ». In general, these fragments should be interpreted as ESF code; they may or may not be valid in the Jacaranda sublanguage. In code fragments, \ stands for itself. Grammar productions that are referenced "inline" in the text are given in single angle brackets: < >. \u escapes are used in grammar productions to mean the corresponding unescaped code unit. An optional occurrence of a production is shown with a trailing '?'. Alternatives of a production are separated by '|'. There is no implied priority between alternatives. Productions that do not correspond to an [ES5Draft] production are preceded by '+'. There are three grammars used in this specification: - In the lexical grammar, productions are named with an initial uppercase letter, and :: is used to separate a nonterminal from its definition. The lexical grammar is an S-attribution of the ESF lexical grammar. The input is a string, and the output is a sequence of tokens, each of which is a string that may have attributes. ([ES5Draft] specifies some characteristics of the ES3 lexical grammar only informally. It is possible to use the attribute grammar formalism to specify the required tokenisation very precisely. We do not do this in the current draft because it would add apparent complexity to the Jacaranda specification that would only be concerned with clarifying ESF, not with defining how Jacaranda is a sublanguage of ESF.) In lexical rules, a string delimited by single quotes denotes the corresponding sequence of input code units. - In the syntactic grammar, productions are named with an initial lowercase letter, and : is used to separate a nonterminal from its definition. The syntactic grammar is technically an LR-attribution of the ESF syntactic grammar (it would be an S-attribution without the 'DependOnAutoAttach' and 'DependOnES5' attributes). Its input is the token sequence obtained from the lexical grammar, and its principal output is an 'errors' attribute giving the set of errors (if any), other than parse errors, that would prevent the source text from being a valid Jacaranda program. - In the string grammar, ::: is used to separate a nonterminal from its definition. The input to the string grammar is a sequence of 16-bit code units. It is a regular grammar that computes no attributes and is only used to recognise classes of strings. The regexp literal grammar in ESF is used unchanged by Jacaranda. Lexical and syntactic grammar productions that do not require an explicit attribution rule but are directly used by a production that does, are shown with a reference to the [ES5Draft] section number. The symbol ε in a production means the empty input sequence. The syntactic grammar can be parsed unambiguously, independently of attribute computation. That is, while attribute values affect whether errors result from a given parse, it is never necessary to compute attributes in order to choose between different parse trees. (In a small number of cases, as part of processing a production's attribution rules, the node must be reparsed against a more specific production.) The metalanguage used to define attribution rules has values of the following types: - true and false - arbitrary-precision decimal numbers - the undefined attribute value - finite sets of other attribute values - finite sequences of other attribute values - finite mappings from attribute names to attribute values. A parse tree node is represented as a finite mapping of its attributes, with a 'text' attribute holding the corresponding source text. A convention is used to present attribution rules more concisely, by making computation of "boilerplate" values for attributes implicit: - If a production has exactly one child node, or a child node annotated with an initial ^, then its boilerplate attribute values are given by the attributes of that child. - Otherwise, - for the attributes 'freelyUsed', 'freelyAssigned', 'allVars', 'constVars', 'varVars' [# fixme #], 'statementVars', 'usedProperties', 'aliasesForThis', and 'errors', the boilerplate value is computed by taking the union of the corresponding attribute values on all child nodes; - for attributes with names starting 'is' (which are boolean-valued), the boilerplate value is false; - for other attributes, the boilerplate value is the undefined value. Boilerplate values are only used for attributes for which no explicit definition is given, or when 'U=' is used as described below. The attributes mentioned in b) are all set-valued, so that taking the union of their values for all children of a node is well-defined. The convention is to be used for _all_ syntactic and lexical grammar productions, including those that are used by ESF but not explicitly mentioned in this document. Notation and auxiliary functions used in attribution rules: // comment a comment that is non-normative and has no effect N.foo the value of attribute 'foo' of node N N is P true iff N.text matches production

(grammatically, by reparsing ignoring attributes) N isnot P not(N is P) not(b) true if b is false; otherwise false { x <- A | p(x) } the subset of values x taken from A for which p(x) UNION S the union of all sets that are members of S, or undefined if any of the members are undefined A U B UNION {A, B} A \ B the set difference { x <- A | x notin B } S ++ T the concatenation of sequence S followed by T A U= B define A to be the union of its boilerplate value and B S startswith T true if S is not undefined and sequence S is a prefix of T; otherwise false X == Y true if X and Y are identical values; otherwise false cond ? X : Y if cond is true then X, otherwise Y max(X, Y) X > Y ? X : Y min(X, Y) X < Y ? X : Y reparse N as P else Action reparse N as production

, possibly binding additional variables in the attribution rule. If N cannot be parsed as

, perform Action instead of the remainder of the rule. // the following definition uses Haskell-style pattern matching on sequences: let checkNoMultipleDeclaration([]) = {}, checkNoMultipleDeclaration(x cons xs) = { error(id, 'multiple declaration') | id <- x.allVars ∩ UNION { y.allVars | y <- xs } } U checkNoMultipleDeclaration(xs); let check(cond, arg, msg) = cond ? {} : {error(arg, msg)}; let error(arg, msg) = representation of an error with auxiliary information 'arg' (an identifier or node) and message 'msg' let NumberToString(x) = the result of applying ToString as defined in [ES5Draft] section 9.8.1 to "the number value for x" as defined in [ES5Draft] section 8.5. let DecimalToString(x) = undefined // for future support of decimal literals [# check whether any other notation should be defined here #] Any conformance requirements specified in the normative text are intended to be redundant with, i.e. specify precisely the same restrictions as, the Jacaranda lexical and syntactic grammars. If this is not the case then it indicates an ambiguity in the specification that needs to be clarified. The descriptions of Jacaranda restrictions relative to ESF are given in groups, each with an uppercase name. Attribution rules given in one group may also impose constraints needed to implement the rules in other groups, in which case cross-references are given in most cases. Rationale: Analogous to type declarations, some degree of redundancy in a specification is useful in order to provide a better chance of detecting specification errors. This is the reason for providing informal descriptions that are redundant with the attribute grammar. Conformance =========== The terms *SHALL*, *SHALL NOT* or *SHALL ONLY* are used to describe constraints on a conforming implementation of Jacaranda. The terms *MUST*, *MUST NOT* or *MUST ONLY* are used to describe static constraints on a module definition (see below). This indirectly specifies that a conforming implementation of Jacaranda SHALL reject module definitions that fail to satisfy such a constraint. The terms *SHOULD*, *SHOULD NOT* or *SHOULD ONLY* are used to describe a recommendation for Jacaranda code to follow, in order to mimimise the risk of potential incompatibilities with future versions of this specification or implementations of ECMAScript. It has no formal consequence for conformance. For any constraint that is described as implementation-defined, a conforming implementation SHALL make a consistent choice and SHALL document which choice is made. An *ESF interpreter* is an instance of an ESF implementation conforming to ECMA-262 3rd edition amended as described in the section "[ES5Draft] -> ESF". A *context* is an ESF execution environment that has copies of the standard built-in objects independent of other contexts in the same ESF interpreter. Contexts may share other objects. (In typical web browser implementations of JavaScript/ECMAScript, there is usually one context for each frame or iframe.) A *visible side effect* is an effect that changes the state of one or more objects in an ESF interpreter in a way that is observable to other code running in that interpreter by overt means (that is, without making use of timing of events where this timing is nondeterministic). Memory usage is not considered to be a visible side effect. Jacaranda programs are organised into *modules*. Typically, these are independently developed units of source code. A *module definition* is a string, representing Jacaranda source code intended to define a module. An *internal module representation* is a representation of a module as a collection of values in a given ESF interpreter. An internal module representation includes a *module object*, which is a deeply immutable object with the properties specified in the module definition. [# is this an implementation detail? #] An instance of a Jacaranda module is called a *caplet*. Each caplet will have a *powerbox* object defined by a specific region of its module code. The creator of a caplet must provide an *environment* and a *powersource*. The environment specifies *imports* that will be in scope throughout the caplet's module code. If an import is callable, it SHALL be a simple function. The powersource is an object that will only be initially available to the caplet's powerbox. The caplet will have no overt authority other than that granted via its environment and its powersource, and the ability to perform pure computation (Jacaranda does not attempt to prevent a caplet from denying computational service to other caplets). It will also have access to the module object, but that object provides no additional authority. A *module name* is a string, representing the name of a module. A *Jacaranda interpreter* is an ESF interpreter with the following additional characteristics: - an interpreter has a per-context state consisting of a mapping from module names, to internal module representations of modules that have been accepted in that context. - it is possible to create a context with an initially empty mapping. - it is possible to submit a module definition to a given context, with the following effect: - if the module definition violates any of the constraints given in sections "Module Definitions" and "Static Constraints" below, a *static rejection error* SHALL be reported. - the module definition may specify a variant of the Jacaranda language that is unsupported by the implementation. In that case, an *unsupported Jacaranda variant error* SHALL be reported. - if the module definition does not give a 'name' property that is different from any module name already in the mapping, a *module name clash error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to submit a module. Such a failure SHALL be distinguishable from other errors. - otherwise, a mapping from the module's name to its internal module representation SHALL be added to the context's state. In any of these cases, there SHALL be no other visible side effects of submitting the module definition. - it is possible to create a caplet by *instantiating* a named module, providing an environment and a powersource as discussed above. - the *actual import environment* of the caplet is the provided environment, augmented with the objects specified in the "Module Run-time Environment" section, the latter taking priority over the former; then intersected with the module's import set as defined in group MODULE_DEFINITION. - if an import named in the caplet's 'imports' list is not available in the actual import environment, an *unavailable import error* SHALL be reported. - the module MUST have a 'powerbox' property that holds a function (callable) value -- if not then a *missing powerbox property error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to instantiate a caplet. Such a failure SHALL be distinguishable from other errors. - the function given by the 'powerbox' property of the module SHALL be called, with the given powersource object and the module object as parameters in that order. The bindings available in the caplet's outer lexical scope SHALL be given by its actual import environment. - the function given by the 'powerbox' property may interact with the environment of the Jacaranda implementation in order to set up *event handlers* which cause functions created by the module to regain control at some later point in time. The specifications of such event-handling features are implementation-defined. In cases where the above description says that an error SHALL occur as a result of submitting or instantiating a module, other visible side effects SHALL NOT occur. At any time, and for any or no reason, a Jacaranda implementation may *lock* a given context, in which case it SHALL NOT visibly run any further code in that context. Other aspects of the interaction between a Jacaranda interpreter and its environment are implementation-defined. (One possibility is that a *container* is written in unrestricted ESF code to control this interaction.) Rationale: The idea of using different conformance terms to indicate constraints on the implementation vs constraints on programs is borrowed from W3C specifications. The term "lock" is named after the LOCK utility in the Incompatible Timesharing System: . The allowance for implementation failures makes Jacaranda particularly easy to implement: #include int main(int argc, char **argv) { puts("Implementation failure.\n"); return 1; } More seriously, there are good reasons to allow nondeterministic failures, given that a practical Jacaranda implementation has no control over what memory or execution resources are available, and machines with unbounded resources do not exist. The requirement for these failures to be distinguishable from other errors is actually quite strong. [ES5Draft] -> ESF ================= Start with [ES5Draft]. Change and related productions to allow «const» to occur in place of «var». («const» is a in ES3/3.1, so it already cannot occur as an identifier.) The detailed changes to these productions are given in the DECLARATIONS group, but are mentioned here because they extend the language rather than restricting it. Section 2 Conformance - Replace A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form. with A conforming implementation of ESF shall interpret characters in conformance with the Unicode standard, version 5.1.0 or later (see section 6 for further detail). - The following paragraph does not apply to ESF: A conforming implementation of ECMAScript is permitted to provide additional types, values, objects, properties, and functions beyond those described in this specification. In particular, a conforming implementation of ECMAScript is permitted to provide properties not described in this specification, and values for those properties, for objects that are described in this specification. Rationale: Permission to provide arbitrary implementation-defined extensions is not appropriate for a security-focussed language. In the case of ESF, some kinds of extension can make it impossible to define a secure sublanguage by static restriction. For more detail see the constraints on extensions given in the changes to sections 8.6.2, 15, and 16 below. Section 3 References - Replace the reference to Unicode version 3.0 with references to Unicode version 5.1.0. Rationale: The conformance requirements for Unicode have been significantly tightened since version 3.0. In particular, there were security- relevant changes at around version 3.2. Note that this does not require support for characters added in later versions. Section 6 Source Text - Replace the section before "Throughout the rest of this document" with: ESF source text is represented as a sequence of code units in the UTF-16 transformation format of Unicode, version 5.1.0 or later. The text should be in Unicode Normalization Form C as described in Unicode Stanard Annex #15. An ESF implementation shall not perform any normalisation of source text itself (but may warn if the text is not in Unicode Normalization Form C). This does not preclude an ESF source text from being stored or transmitted in other character encodings, but the process by which these encodings are converted to UTF-16 is outside the scope of this document. - delete NOTE 1, and change "NOTE 2" to "NOTE". Section 7 Lexical Conventions - Insert after the first paragraph: Note that tokens include «/», «/=», «/*const*/», and «/*fallthru*/», even though these are not included in the production. Section 7.1 Unicode Format Control Characters - Note that this section no longer allows Format Control characters to be "removed from the source text before applying the lexical grammar", as was specified in ES3. Section 7.8.3 Numeric Literals - The grammar for HexIntegerLiteral is incorrect. Change it to HexIntegerLiteral :: 0x HexInteger 0X HexInteger HexInteger :: HexDigit HexInteger HexDigit where '0x' and '0X' are in bold and production names are in italic. Change the rules for calculating MV of a HexIntegerLiteral correspondingly. (These rules are given in the Jacaranda grammar, but the bug is in ES3.) Section 8.6.2 Internal Properties and Methods - Replace Host objects may implement these internal methods with any implementation-dependent behaviour, or it may be that a host object implements only some internal methods and not others. with A *Jacaranda-accessible property* is a property that has an accessible name as defined in group PROPERTY_NAMES. A value is *built-in-reachable* iff it is reachable in the sense of garbage collection terminology, given that only the properties listed in ESF_BUILTINS of the global object of any context, or the Jacaranda-accessible properties of any other reachable object can be obtained. ESF_BUILTINS = { 'Array', 'Boolean', 'Decimal', 'Math', 'Number', 'RegExp', 'String', 'Error', 'EvalError', 'SyntaxError', 'TypeError', 'RangeError', 'ReferenceError', 'URIError', 'Infinity', 'NaN', 'undefined', 'decodeURI', 'decodeURIComponent', 'encodeURI', 'encodeURIComponent', 'isFinite', 'isNaN', 'parseFloat', 'parseInt' } An ESF implementation SHALL NOT add Jacaranda-accessible properties other than those specified in [ES5Draft] to an object that is at any time built-in-reachable, except as described in Section 15 below. Host objects may implement the [[Get]], [[Put]], [[ThrowablePut]], [[CanPut]], and [[HasProperty]] internal methods with implementation- dependent behaviour when they are used to access properties that are not Jacaranda-accessible. Host objects may also implement internal methods with implementation-dependent behaviour in other cases subject to the following constraint: for any given host object, either - the object is never built-in-reachable, or - the object has behaviour for all of its internal methods that is equivalent to that of some possible non-host object. [# Is there a conflict between use of 'built-in' here and in [ES5Draft]? #] [# Are these constraints strong enough to rule out oddball objects that would prevent us from implementing the Jacaranda run-time securely? Should [[Get]], [[Put]], [[ThrowablePut]], [[CanPut]], and [[HasProperty]] be further restricted? Is "reachable in the sense of garbage collection terminology" clear or precise enough? The problem here is that this is the ESF spec, but we care mainly about whether it will be possible to prevent an oddball object from being reachable in Jacaranda (hence the restriction to Jacaranda-accessible properties). #] [# Since global constructors are now 2nd-class, we might be able to relax these constraints. #] Section 8.6.2.6 [[DefaultValue]] (hint) - In steps 3 and 7 of each of the algorithms given for [[DefaultValue]], it is specified to "call the [[Call]] method of" a particular object, even though that object is not guaranteed to implement a [[Call]] method. Modify these steps to specify that if the object does not implement [[Call]], a TypeError exception SHALL be thrown. Rationale: Throwing a TypeError appears to be the behaviour of at least SpiderMonkey and JScript, tested using «String({toString: {}})». [# Test other implementations. #] Section 8.7 The Reference Type - Replace the last sentence of section 8.7 with: The abstract operations GetValue and PutValue are used in this specification to operate on references. The definition of GetValue is dependent on a 'AutoAttach' flag associated with each function or program fragment. An ESF implementation is not required to support the change to the semantics of GetValue (relative to ES3) described here. If it does not, then it SHALL NOT define any variable called 'useAutoAttach' in the initial global environment of a new context, and the AutoAttach flag for all fragments SHALL be false. If it does support this change, then: - it SHALL define a ReadOnly variable called 'useAutoAttach' in the global environment of each context, that is set permanently to true. - if a has the «useAutoAttach;» as its first statement, then the AutoAttach flag for that function fragment, including all nested functions, SHALL be true. [# FIXME #] - the AutoAttach flag for the global scope, and for functions not included in the preceding point, SHALL be false. All uses of the GetValue operation in [ES5Draft], except for the use in section 11.2.3 (Function Calls), should be considered to pass as a second argument, the value of the AutoAttach flag for the fragment being evaluated. A function object created by step 6 of the GetValue operation is called an *attached function object*. - Replace section 8.7.1 with: GetValue (V, AutoAttach) 1. If Type(V) is not Reference, return V. 2. Call GetBase(V). 3. If Result(2) is null, throw a ReferenceError exception. 4. Call the [[Get]] method of Result(2), passing GetPropertyName(V) for the property name. 5. If AutoAttach is false or if Result(4) is not a function object with a body that refers to «this», then return Result(4). 6. Create a new function object as specified in 13.2, such that the new function throws an exception [# which one? #] if its «this» is bound to any other value than Result(2), but otherwise acts identically to the function given by Result(4). 7. Return Result(6). Rationale: The motivation for changing the semantics of GetValue is explained in the rationale for group EXPRESSIONS. (Briefly, it allows removing restrictions on how the result of a property access can be used.) The 'useAutoAttach;' syntax has the following characteristics: - by default, it will throw a ReferenceError in ES3, and so ESF code that depends on attachment will not fail silently. - if a programmer wishes to deliberately suppress this ReferenceError, they can write the following: if (typeof useAutoAttach === 'undefined') useAutoAttach = false; - it is possible for the Jacaranda run-time to suppress the ReferenceError for Jacaranda modules (by defining useAutoAttach in an import environment) without doing so for other ESF code. - it is possible to test easily whether a given implementation supports the attachment semantics (whether or not the ReferenceError has been suppressed as described above). When GetValue creates a new attached function object, it does not automatically cache this function object in the original property. That would not be difficult to specify, but it would not be a transparent optimization, because: a) the base object B (Result(2)) may be used as a prototype for some other object, O. In that case, a lookup of the property in O could find the cached function that is attached to B, when it should find an unattached function that will be later attached to O. b) suppose that an unattached function is found in some prototype P of B, and later the property is mutated in some object in B's prototype chain, up to and including P. Meanwhile an attached version of the original function has been cached in B. For the caching to be transparent, subsequent lookups should find the mutated property value, not the cached value. There is no obvious way to ensure that the cache is invalidated in this case. c) code that is not using the attachment semantics should not be able to see cached values. These problems are not insoluble, but in combination they make the caching optimization too complicated to be worthwhile. [# Consider having some way to test whether a given function has the AutoAttach flag set. Would that be needed/helpful for the Jacaranda runtime? #] Versions of Jacaranda before 0.4 used automatic binding (a function automatically uses the "correct" this) rather than attachment (a function that would use the "wrong" this instead throws an exception). Attachment may be more politically acceptable because it ensures that the restricted language is a fail-stop subset of ECMAScript. Section 11.2.3 Function Calls - In step 3 of the algorithm, replace 3. Call GetValue(Result(1)). with 3. Call GetValue(Result(1), false). Rationale: Without this change the semantics would be the same, but the algorithm would unnecessarily specify the generation of an attached function for every call to a function that refers to «this». - Add after the NOTE: An ESF implementation SHALL NOT return a value of type Reference as a result of calling a function that is built-in-reachable. [# CHECK already prohibited by ES5? #] Section 15 Native ECMAScript Objects - Replace Unless otherwise specified in the description of a particular function, if a function or constructor described in this section is given more arguments than the function is specified to allow, the behaviour of the function or constructor is undefined. In particular, an implementation is permitted (but not required) to throw a TypeError exception in this case. NOTE Implementations that add additional capabilities to the set of built-in functions are encouraged to do so by adding new functions rather than adding new parameters to existing functions. with Unless otherwise specified in the description of a particular function, if a function or constructor described in this section is given more arguments than the function is specified to allow, then the function SHALL either ignore the extra arguments, or throw a TypeError exception. It is implementation-defined (per function) which of these alternatives is chosen. ESF adds the following properties not defined by ES3: Array.prototype: every, filter, forEach, indexOf, lastIndexOf, map, reduce, reduceRight, some which are specified as described in [ES5Draft]. [# FIXME: indexOf and lastIndexOf have bugs in that draft #] [# If the underlying JavaScript implementation supports it, ESF also adds the Decimal global constructor, as specified in the ES3.1 Kona draft. This constructor, and the corresponding syntax for decimal literals, are only supported in modules declared to depend on ES5 (i.e. using the $module31 or $newmodule FIXME keywords). #] Other than the above, implementations are not encouraged to add additional capabilities to the set of built-in functions, either by adding new functions or by adding new parameters to existing functions. Rationale: A object-capability language must be much more conservative in any additions to libraries that are built-in or implicitly imported, since these libraries should provide no significant authority (other than pure computation). Any vendor-specific extensions should be made in libraries that are not implicitly imported. Array.prototype.{every, filter, forEach, map, some} have a 'thisObject' argument that is bound to 'this' in each call. For an explanation of why it is safe for these methods to be usable from Jacaranda (even though Function.prototype.{call, apply, bind, bindAsEventListener} must not be), see the rationale for the PROPERTY_NAMES group. - Delete the following: In section 15.7.4.5, If the toFixed method is called with more than one argument, then the behaviour is undefined (see clause 15). In section 15.7.4.6, If the toExponential method is called with more than one argument, then the behaviour is undefined (see clause 15). In section 15.7.4.7, If the toPrecision method is called with more than one argument, then the behaviour is undefined (see clause 15). Rationale: A specification of a safe language cannot allow undefined behaviour in any situation. Since this is a change in the specification of run-time behaviour, it must be changed in ESF rather than Jacaranda. Section 15.3.4.2 Function.prototype.toString() - Replace An implementation-dependent representation of the function is returned. This representation has the syntax of a FunctionDeclaration. Note in particular that the use and placement of white space, line terminators, and semicolons within the representation string is implementation-dependent. with The string value "function () {...}" is returned. Rationale: See the discussion of function opaqueness under 'toSource' in group PROPERTY_NAMES. Section 16 Errors - Replace * An implementation may provide additional types, values, objects, properties, and functions beyond those described in this specification. This may cause constructs (such as looking up a variable in the global scope) to have implementation-defined behaviour instead of throwing an error (such as ReferenceError). with * An implementation may provide additional values in the global scope beyond those described in this specification, subject to the constraints described in sections 8.6.2 and 15. This may cause looking up a variable in the global scope that is not described in this specification to have implementation-defined behaviour instead of throwing an error (such as ReferenceError). Rationale: The Jacaranda run-time will only rely on properties of the global scope specified in ESF_BUILTINS. Jacaranda code will not be able to access other properties added to the global scope. Note that this section also includes the following quite scary provision: * An implementation may extend program and regular expression syntax. To permit this, all operations (such as calling eval, using a regular expression literal, or using the Function or RegExp constructor) that are allowed to throw SyntaxError are permitted to exhibit implementation-defined behaviour instead of throwing SyntaxError when they encounter an implementation-defined extension to the program or regular expression syntax. However, we don't need to delete this provision, because: - we are relying on the verifier to catch any code that is syntactically invalid in ESF; - Jacaranda code cannot call eval or use the Function constructor; - RegExp can be tamed to verify that no regexp strings and flags use extensions. (In practice it must be tamed anyway, because it is broken in SpiderMonkey, as discussed in the rationale for group EXPRESSIONS.) This text must not be interpreted to allow the behaviour of ESF extensions to ES5 (const and useAutoAttach) to be implementation-defined. Annex B Compatibility - The additional syntax specified in section B.1 SHALL NOT be part of the ESF grammar. - An ESF program SHOULD NOT rely on the presence of any of the additional global functions, or properties of String, specified in sections B.2.1 to B.2.6. Lexical Grammar =============== The lexical grammar of Jacaranda is that of ESF with the additional constraints listed below. LEX_OK_FOR_CDATA: A Jacaranda source text MUST NOT contain any of the sequences of unescaped code units «», or an initial subsequence of these at the end of the text. A Jacaranda source text MUST NOT contain the code unit «&» unless immediately followed by «&», «=», «(», «+», «-», «!», «~», «'», «"», «/», «\», , , or . (If followed by «&», the same restriction applies to the next code unit.) + SourceText :: ε | AmpText | (SourceCharacter but not ']' or '<' or '&') SourceText | ('<' | ']<' | ']]<') ((SourceCharacter but not '!' or '/' or '&') SourceText | AmpText) | ']' ((SourceCharacter but not ']' or '<' or '&') SourceText | AmpText) | ']]' ((SourceCharacter but not '>' or '<' or '&') SourceText | AmpText) | c=('') SourceText { errors U= error(c, 'may be misinterpreted in HTML or XML') } + AmpText :: '&&' AmpText | '&' AmpFollower SourceText | '&' (c=SourceCharacter but not '&' or AmpFollower) SourceText { error U= error(c, 'may be misinterpreted as character entity in HTML or XML') } + AmpFollower :: one of '=' '(' '+' '-' '!' '~' '\u0027' '"' '/' '\u005C' DecimalDigit WhiteSpace LineTerminator Rationale: This allows verified Jacaranda source to be embedded (as a Unicode string) in an XML CDATA section or an HTML Jacaranda source may not be embedded in an HTML or XHTML _attribute_ value, even if its content is defined as CDATA, since processing of attribute values does not preserve tab, carriage return or line feed characters. LEX_CODE_UNITS: Source code units, line terminators, and whitespace MUST ONLY be used as follows: SourceCharacter :: c=CodeUnit { errors U= check(c in ValidCodeUnit, c, 'invalid code unit') } + ValidCodeUnit :: one of '\u0009' '\u000A' '\u000D' // exclude ASCII controls except TAB, LF, CR. [\u0020-\u007E] // \u007F-\u009F controls [\u00A0-\u00AC] // \u00AD :Cf: [\u00AE-\u05FF] // \u0600-\u0603 :Cf: [\u0604-\u06DC] // \u06DD :Cf: [\u06DE-\u070E] // \u070F :Cf: [\u0710-\u17B3] // \u17B4-\u17B5 :Cf: [\u17B6-\u200A] // \u200B-\u200F :Cf:; \u200C-\u200F !Safari !IE6 [\u2010-\u2027] // \u2028-\u2029 !Cajita; \u202A-\u202E :Cf: !Safari !IE6 [\u202F-\u205F] // \u2060-\u2064 :Cf:; \u2065-\u2069 !Cajita; \u206A-\u206F :Cf: !Safari !IE6 [\u2070-\uD7FF] // \uD800-\uDFFF UTF-16 surrogates [\uE000-\uFDCF] // \uFDD0-\uFDEF noncharacters [\uFDF0-\uFEFE] // \uFEFF (BOM) :Cf: !Safari !IE6 [\uFF00-\uFFEF] // \uFFF0-\uFFF8 !Safari !Firefox; // \uFFF9-\uFFFB :Cf: // \uFFFC (object replacement character) // \uFFFD (replacement character) // \uFFFE-\uFFFF noncharacters + PrintableCodeUnit :: \u0020-\u007E LineTerminator :: [# FIXME #] \u000A | \u000D WhiteSpace :: [# FIXME #] \u0009 | \u0020 { errors U= check(c == '\u0009' or c == '\u0020', c, 'invalid whitespace code unit') } Note that in [ES5Draft] sections 7.2 and 7.3, both the tables and the grammar must be changed. This rule does not restrict the set of code units that may be represented by \u escapes. This rule does not affect the set of whitespace characters used in the specification of ToNumber ([ES5Draft] section 9.3.1). Rationale: Start with \u0000-\uFFFF, and - remove code units that are not in XML 1.1's or that are in XML 1.1's (see below) - remove code units corresponding to BMP characters that are in category :Cf: (Format Control) in Unicode 5.1.1: (this is a superset of category :Cf: in all previous Unicode versions) - remove code units corresponding to BMP characters that are in category :Zs: (Whitespace) in Unicode 5.1.1: other than \u0009 and \u0020. - remove code units that are disallowed by Cajita according to . - remove \uFFFC and \uFFFD. and from XML 1.1 are used just because that is a slightly less arbitrary choice than us deciding on our own what control characters should be restricted. Disallowing :Cf: characters is in principle the wrong thing for proper internationalisation support. However, we eventually concluded that it's the only way to guarantee consistent and secure behaviour across browsers, which in this case trumps internationalisation. (The right thing would be to allow :Cf: characters other than \uFEFF only in string literals and comments, and to allow \uFEFF [byte order mark] only between tokens. Relaxing the rules later is easier than tightening them.) Whitespace is restricted to US-ASCII tab and space characters. There is no reason to internationalise whitespace used to separate tokens in source code. (Other space characters are still allowed in strings.) Also, the fact that vertical tab and form feed were treated as whitespace rather than as line separators was potentially confusing; Jacaranda disallows them. LEX_COMMENT_FIRST_CHARACTER: If the first code unit in a after «/*» is not «*» or «\u0020», or if the first code unit in a after «//» is not «/» or «\u0020», then all code units in the comment MUST be printable US-ASCII (that is, [\u0020-\u007E]). The necessary grammar change is given in group LEX_COMMENT_KEYWORDS. Rationale: This prevents an attack relevent to adversarial code review, where a comment that looks like «/*const*/» does not actually enforce constness, and similarly for other comments used by JSLint, for example. It also prevents the first character in a comment from being a non-US-ASCII character that is stripped on parsing, which might plausibly result in the LEX_COMMENT_NO_ATSIGN_EXTENSIONS group being bypassed. LEX_COMMENT_NO_ATSIGN_EXTENSIONS: The first character in any comment after «/*» or «//» MUST NOT be «@». The necessary grammar change is given in group LEX_COMMENT_KEYWORDS. Rationale: This may trigger non-ESF-compliant extensions (for example or ). ADsafe rejects «@cc» anywhere in a comment. This is sufficient to prevent enabling of JScript conditional compilation, but not the Venkman debugger extensions. Caja and Cajita do not need to reject «@» in comments, because the output of cajoling does not include any comments. See . LEX_COMMENT_KEYWORDS: «/*const*/» and «/*fallthru*/» are treated as tokens. (Other comments are not tokens.) They MUST ONLY appear where specified in the Jacaranda grammar. Note the lack of spaces before and after «const» or «fallthru». Also note that only these exact spellings are recognised; for example, «/*\u0063onst*/» is not recognised as a «/*const*/» token, even though «\u0063» would represent «c» within a string literal. The lexical grammar changes resulting from this and groups LEX_COMMENT_FIRST_CHARACTER and LEX_COMMENT_NO_ATSIGN_EXTENSIONS are: + Printable :: ε | PrintableCodeUnit Printable mlc=MultiLineComment :: '/**/' | '/**' PostAsteriskCommentChars? '*/' | '/*\u0020' MultiLineCommentChars? '*/' | '/*@' MultiLineCommentChars? '*/' { errors U= error(mlc, 'multi-line comment starts with @') } | '/*' first=MultiLineNotAsteriskOrSpaceOrAtChar rest=MultiLineCommentChars? '*/' { errors U= check(first is Printable and rest is Printable, mlc, 'multi-line comment does not start with * or space and is not printable ASCII') } + MultiLineNotAsteriskOrSpaceOrAtChar :: SourceCharacter but not '*' or '\u0020' or '@' MultiLineCommentChars :: see [ES5Draft] section 7.4 PostAsteriskCommentChars :: see [ES5Draft] section 7.4 MultiLineNotAsteriskChar :: see [ES5Draft] section 7.4 MultiLineNotForwardSlashOrAsteriskChar :: see [ES5Draft] section 7.4 slc=SingleLineComment :: '//' SingleLineCommentEnd | '///' SingleLineCommentChars? SingleLineCommentEnd | '//\u0020' SingleLineCommentChars? SingleLineCommentEnd | '//@' SingleLineCommentChars? SingleLineCommentEnd { errors U= error(slc, 'single-line comment starts with @') } | '//' first=SingleLineNotForwardSlashOrSpaceOrAtChar rest=SingleLineCommentChars SingleLineCommentEnd { errors U= check(first is Printable and rest is Printable, slc, 'single-line comment does not start with / or space and is not printable ASCII') } + SingleLineCommentEnd :: LineTerminator | EndOfText + SingleLineNotForwardSlashOrSpaceOrAtChar :: SourceCharacter but not LineTerminator or '/' or '\u0020' or '@' SingleLineCommentChars :: see [ES5Draft] section 7.4 SingleLineCommentChar :: see [ES5Draft] section 7.4 Rationale: /*const*/ and /*fallthru*/ must be ES3 comments in order for Jacaranda to remain a sublanguage of ESF. (They are lexed as multiline comments, but should then be retained as tokens, while other comment tokens are thrown away.) See group ASSIGNABLE_EXPRESSIONS for a description of the use of /*const*/ (it is essentially the same as 'const' as implemented in some current browsers and proposed for ES-Harmony, but with static rejection instead of silent failure). See also . LEX_STRING_LITERALS: A MUST ONLY use one of the escape characters «b», «t», «n», «f», «r», double-quote, single-quote, backslash, or «/». A string token MUST NOT contain a . The following attribution rules are used to compute the value of a string literal, which is needed for other rules. They correspond exactly to the procedure in [ES5Draft] section 7.8.4, expressed in the same attribute grammar notion as the rest of this specification. The grammar has been altered to not require lookahead in the production. StringLiteral :: '"' ds=DoubleStringCharacters '"' { SV = ds.SV } | '\u0027' ss=SingleStringCharacters '\u0027' { SV = ss.SV } DoubleStringCharacters :: ε { SV = [] } | c=DoubleStringCharacter rest=DoubleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDDoubleStringCharacters // no initial decimal { SV = ['\u0000'] ++ rest.SV } + NIDDoubleStringCharacters :: ε { SV = [] } | c=NIDDoubleStringCharacter rest=DoubleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDDoubleStringCharacters { SV = ['\u0000'] ++ rest.SV } DoubleStringCharacter :: c=SourceCharacter but not '"' or '\u005C' or LineTerminator { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation + NIDDoubleStringCharacter :: c=SourceCharacter but not '"' or '\u005C' or LineTerminator or DecimalDigit { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } SingleStringCharacters :: ε { SV = [] } | c=SingleStringCharacter rest=SingleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDSingleStringCharacters { SV = ['\u0000'] ++ rest.SV } + NIDSingleStringCharacters :: ε { SV = [] } | c=NIDSingleStringCharacter rest=SingleStringCharacters { SV = [c.CV] ++ rest.SV } | '\u005C' '0' rest=NIDSingleStringCharacters { SV = ['\u0000'] ++ rest.SV } SingleStringCharacter :: c=SourceCharacter but not '\u0027' or '\u005C' or LineTerminator { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } | LineContinuation + NIDSingleStringCharacter :: c=SourceCharacter but not '\u0027' or '\u005C' or LineTerminator or DecimalDigit { CV = c } | '\u005C' e=EscapeSequence { CV = e.CV } LineContinuation :: '\u005C' e=LineTerminatorSequence { errors U= check(e, 'line continuation not allowed') } EscapeSequence :: ^CharacterEscapeSequence | ^HexEscapeSequence | ^UnicodeEscapeSequence CharacterEscapeSequence :: ^SingleEscapeCharacter | ^NonEscapeCharacter se=SingleEscapeCharacter :: 'b' { CV = '\u0008' } | 't' { CV = '\u0009' } | 'n' { CV = '\u000A' } | 'f' { CV = '\u000C' } | 'r' { CV = '\u000D' } | '"' { CV = '\u0022' } | '\u0027' { CV = '\u0027' } // ' | '\u005C' { CV = '\u005C' } // \ | 'v' { errors U= error(se, '\\v escape') } ne=NonEscapeCharacter :: '/' { CV = '\u002F' } // explicitly allowed by JSON | SourceCharacter but not '/' { errors U= error(ne, 'unrecognised escape') } HexEscapeSequence = 'x' a=HexDigit b=HexDigit { CV = [a.MV*16 + b.MV] } UnicodeEscapeSequence :: 'u' a=HexDigit b=HexDigit c=HexDigit d=HexDigit { CV = [a.MV*4096 + b.MV*256 + c.MV*16 + d.MV] } HexDigit :: '0' { MV = 0 } | '1' { MV = 1 } | '2' { MV = 2 } | '3' { MV = 3 } | '4' { MV = 4 } | '5' { MV = 5 } | '6' { MV = 6 } | '7' { MV = 7 } | '8' { MV = 8 } | '9' { MV = 9 } | 'A' { MV = 10 } | 'B' { MV = 11 } | 'C' { MV = 12 } | 'D' { MV = 13 } | 'E' { MV = 14 } | 'F' { MV = 15 } | 'a' { MV = 10 } | 'b' { MV = 11 } | 'c' { MV = 12 } | 'd' { MV = 13 } | 'e' { MV = 14 } | 'f' { MV = 15 } Rationale: JScript does not implement «\v» (it treats it as «v»). This could allow lexical confusion attacks. Unlike Caja, we cannot rewrite it to a different escape. There is no reason to allow «\» followed by an escape character that does not have a defined meaning. However, we must allow «\/» in order to remain a superset of JSON. (Unfortunately, JSON does not allow «\'», otherwise JSON and Jacaranda would support identical syntax for double-quoted strings. We considered disallowing «\'» in double-quoted strings, but that would create an inconsistency between single- and double-quoting.) Line continuations cannot be allowed, even if DependOnES5 is set, because an ES3 parser may lex a module definition incorrectly if they are used. LEX_NUMERIC_LITERALS: A MUST NOT have more then 20 significant digits. As specified in [ES5Draft] section 7.8.3, the source character immediately following a MUST NOT be an or . The following attribute definitions are used to compute the value of a numeric literal, which is needed for other rules. They correspond exactly to the procedure in [ES5Draft] section 7.8.3, expressed in the same attribute grammar notion as the rest of this specification. nl=NumericLiteral :: dl=DecimalLiteralNoExp 'm' AfterNumeric? { errors U= error(nl, 'floating-decimal literals not yet supported')); MV = dl.MV; SV = DecimalToString(MV) } | dl=DecimalLiteralNoExp (AfterNumeric but not ExponentIndicator or 'm' or DecimalDigit)? { MV = dl.MV; SV = NumberToString(MV) } | dl=DecimalLiteralNoExp ExponentIndicator ep=SignedInteger 'm' AfterNumeric? { errors U= error(nl, 'floating-decimal literals not yet supported')); MV = dl.MV * 10^(ep.MV); SV = DecimalToString(MV) } | dl=DecimalLiteralNoExp ExponentIndicator ep=SignedInteger (AfterNumeric but not 'm' or DecimalDigit)? { MV = dl.MV * 10^(ep.MV); SV = NumberToString(MV) } | '0' HexIndicator hi=HexInteger (AfterNumeric but not HexDigit)? { MV = hi.MV; SV = NumberToString(MV) } + AfterNumeric :: is=IdentifierStart { errors U= error(is, 'disallowed IdentifierStart character after numeric literal') } | dd=DecimalDigit { errors U= error(dd, 'disallowed DecimalDigit character after numeric literal') } + dl=DecimalLiteralNoExp :: dil=DecimalIntegerLiteral '.'? { MV = dil.MV; errors U= check(dil.sigDigits <= 20, dl, '> 20 significant digits in decimal literal') } | dil=DecimalIntegerLiteral '.' dd=DecimalDigits { reparse (dil ++ dd) as digits=DecimalDigits else { errors U= error(dl, 'cannot happen') }; MV = dil.MV + dd.MV*10^-(dd.n); errors U= check(digits.sigDigits <= 20, dl, '> 20 significant digits in decimal literal') } | '.' dd=DecimalDigits { MV = dd.MV*10^-(dd.n); errors U= check(dd.sigDigits <= 20, dl, '> 20 significant digits in decimal literal') } DecimalIntegerLiteral :: d=DecimalDigit { MV = d.MV; n = 1 } | d=NonZeroDigit dd=DecimalDigits { MV = d*10^(dd.n) + dd.MV; n = dd.n + 1 } DecimalDigits :: '0' { MV = 0; sigDigits = 0; n = 1 } | d=NonZeroDigit { MV = d.MV; sigDigits = 1; n = 1 } | '0' rest=DecimalDigits { MV = rest.MV; sigDigits = rest.sigDigits; n = rest.n + 1 } | d=NonZeroDigit rest=DecimalDigits { MV = d.MV*10^(rest.n) + rest.MV; sigDigits = rest.sigDigits + 1; n = rest.n + 1 } DecimalDigit :: '0' { MV = 0 } | ^NonZeroDigit NonZeroDigit :: '1' { MV = 1 } | '2' { MV = 2 } | '3' { MV = 3 } | '4' { MV = 4 } | '5' { MV = 5 } | '6' { MV = 6 } | '7' { MV = 7 } | '8' { MV = 8 } | '9' { MV = 9 } ExponentIndicator :: one of 'e' 'E' SignedInteger :: ^DecimalDigits | '+' dd=DecimalDigits { MV = dd.MV } | '-' dd=DecimalDigits { MV = -(dd.MV) } + HexIndicator :: one of 'x' 'X' + HexInteger :: hd=HexDigit { MV = hd.MV } | hi=HexInteger hd=HexDigit { MV = hi.MV*16 + hd.MV } HexDigit : see group LEX_STRING_LITERALS Rationale: [ES5Draft] section 7.8.3 says that: Once the exact MV for a numeric literal has been determined, it is then rounded to a value of the Number type. If the MV is 0, then the rounded value is +0; otherwise, the rounded value must be /the/ number value for the MV (in the sense defined in 8.5), unless the literal is a and the literal has more than 20 significant digits, in which case the number value may be either the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit or the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit and then incrementing the literal at the 20th significant digit position. Therefore decimal literals with more than 20 significant digits cannot be used to obtain any additional precision. (Note that 20 significant digits are more than sufficient to round-trip any IEEE double value.) This restriction is also necessary in order to enforce that properties in object literals are unique. The issue here is that if the conversion to an IEEE double is nondeterministic, then the verifier may fail to detect names that should be treated as colliding (in groups OBJECT_LITERALS and MODULE_DEFINITION). A mistake in the ES3 grammar for HexIntegerLiteral has been corrected. LEX_REGEXP_LITERALS: If DependOnES5 is not set, regular expression literals MUST NOT be used. If DependOnES5 is set, regular expression literals MUST NOT contain unescaped «/» within a character class (i.e. after «[» and before the next «]»); also, a regular expression literal MUST NOT start with «/=». re=RegularExpressionLiteral :: '/' RegularExpressionBody '/' RegularExpressionFlags { errors U= check(DependOnES5, re, 'regexp literal not allowed in ES3 mode') } RegularExpressionFlags :: ε | flags=RegularExpressionFlags f=IdentifierPart { errors U= check(f is 'g' or f is 'i' or f is 'm', f, 'invalid regexp literal flag') U check(not(flags contains f), f, 'duplicated regexp literal flag') } body=RegularExpressionBody :: RegularExpressionFirstChar RegularExpressionChars { reparse body as Pattern else { errors U= error(body, 'invalid regexp literal pattern') } } RegularExpressionChars :: ε RegularExpressionChars RegularExpressionChar RegularExpressionFirstChar :: NonTerminator but not '*' or '\' or '/' or '[' or '=' | RegularExpressionBackslashSequence | RegularExpressionClass | c='=' { errors U= error(c, 'regexp literal cannot start with /=, use /\\= instead') } RegularExpressionChar :: RegularExpressionNonTerminator but not '\' or '/' or '[' | BackslashSequence | RegularExpressionClass RegularExpressionBackslashSequence :: '\' RegularExpressionNonTerminator RegularExpressionNonTerminator :: SourceCharacter but not LineTerminator RegularExpressionClass :: '[' RegularExpressionClassChars ']' RegularExpressionClassChars :: ε | RegularExpressionClassChars RegularExpressionClassChar RegularExpressionClassChar :: RegularExpressionNonTerminator but not '\' or '/' or ']' | c='/' { errors U= error(c, 'unescaped / in character class of regexp literal') } | RegularExpressionBackslashSequence Pattern :: see [ES5Draft] section 15.10.1 Rationale: ES3 requires that each regexp literal in a program share a single RegExp object. However RegExp objects are mutable, which makes this sharing error-prone, and provides implicit communication channels that should not be present in an object-capability language. In ES5, regexp literals create a new RegExp object each time they are evaluated. The fact that RegExp objects are mutable is still undesirable (and ConstRegExp should be used in preference), but this is not sufficient reason to disallow them. The restriction that a regexp literal must not start with «/=» avoids a lexer bug in JScript.Net, which incorrectly tokenizes «/=» as a division-assignment even in contexts where a regexp literal can occur: JScript.Net is not a compliant ES5 implementation, but this restriction avoids lexical confusion when parsing a module definition. Note that such confusion would be exploitable even though the treatment of «/=» as a division-assignment will cause a syntax error, for example: try {/=} catch (e) {toast();} try{1/+1;} catch (e2) {} will call toast() in JScript.Net [# not tested #]. Allowing regexp literals significantly complicates lexing. A JavaScript lexer must be implemented using feedback from the parser, because whether a '/' character (not in a comment) is lexed as part of a or a depends on whether a is syntactically allowed at that position. However in Jacaranda, it is possible to distinguish such positions using a finite state machine in the lexer, as described in the LEX_TOKENS group. [# CHECK #] LEX_TOKENS: The tokens '++' or '--' MUST NOT immediately precede a , or precede a with only tokens intervening. [# This rule may not be necessary any more. #] Rationale: This restriction allows a Jacaranda lexer to assume that the context needed to distinguish a from a can be inferred only from the preceding token. That is, it can be assumed that a valid token stream will match in the following (nonnormative) grammar: + TokenStream :: text=SourceText { reparse text as RELTokenStream else { errors U= error(...) } } RELTokenStream :: [EOF] | RELPreceder RELTokenStream | DivPreceder DivTokenStream | Neutral RELTokenStream | '.' PropTokenStream | RegularExpressionLiteral DivTokenStream DivTokenStream :: [EOF] | RELPreceder RELTokenStream | DivPreceder DivTokenStream | Neutral DivTokenStream | '.' PropTokenStream | DivPunctuator RELTokenStream PropTokenStream :: [EOF] | RELPreceder DivTokenStream | DivPreceder DivTokenStream | Neutral PropTokenStream | '.' PropTokenStream | DivPunctuator RELTokenStream RELPreceder :: one of '{' '(' '[' ';' ',' '<' '>' '<=' '>=' '==' '!=' '===' '!==' '+' '-' '*' '%' '<<' '>>' '>>>' '&' '|' '^' '!' '~' '&&' '||' '?' ':' '=' '+=' '-=' '*=' '%=' '<<=' '>>=' '>>>=' '&=' '|=' '^=' RELPrecederReservedWord RELPrecederReservedWord :: one of 'case' 'delete' 'do' 'else' 'in' 'instanceof' 'new' 'return' 'throw' 'typeof' 'void' DivPreceder :: '}' | ')' | ']' | ReservedWord but not RELPrecederReservedWord | Identifier | NumericLiteral | StringLiteral Neutral :: '++' | '--' | WhiteSpace | LineTerminator | Comment LEX_NO_SYNTACTIC_SEMICOLON_INSERTION: A Jacaranda source text MUST NOT trigger "automatic semicolon insertion" as defined in [ES5Draft] section 7.9 when interpreted as an ESF program. The following restrictions are needed to ensure that this does not occur: * The syntactic grammar is interpreted strictly, without applying the rules of [ES5Draft] section 7.9.1; * A postfix ++ or -- operator MUST be on the same line as the last token of its operand; * The next non-comment token (i.e. identifier, expression, or semicolon) after a 'continue', 'break', 'return' or 'throw' keyword MUST be on the same line as that keyword. Rationale: First read . [# FIXME: not stable reference #] It is obvious that we do not want "syntactic semicolon insertion" [SSI below] -- it is too difficult to specify and too difficult for programmers to understand its consequences. But is the above wiki page correct in asserting that "grammatical semicolon insertion" [GSI below] is "relatively harmless"? The wiki page confuses the issue by giving an incorrect description of how GSI could be implemented (which led to this rationale section in Jacaranda draft 0.3 misinterpreting what was intended to be meant by GSI, and criticising the proposal for allowing cases that were intended to be categorized as SSI). In fact GSI can be correctly defined and implemented by treating ';' as a statement _separator_. The assertion that GSI in this sense is harmless is probably not quite correct: in The Impact of Language Design on the Production of Reliable Software ... On the other hand, this study considered programmers with only limited experience in PL/I, and it is not clear that its conclusions are relevant to JavaScript. We consider the benefit of compatibility with a slightly larger set of JavaScript programs to outweigh the potential for additional errors in this case. In 'Why Pascal is Not My Favorite Programming Language', , Brain Kernighan gives the following argument against ';' as statement separator, which is as relevant to JavaScript as to Pascal. (Pascal source examples have been transliterated to JavaScript.) Pascal, in common with most other Algol-inspired languages, uses the semicolon as a statement separator rather than a terminator (as it is in PL/I and C). As a result one must have a reasonably sophisticated notion of what a statement is to put semicolons in properly. Perhaps more important, if one is serious about using them in the proper places, a fair amount of nuisance editing is needed. Consider the first cut at a program: if (a) b; c; But if something must be inserted before «b», it no longer needs a semicolon, because it now precedes a «}»: if (a) { b0; b } c; Now if we add an «else», we must remove the semicolon on the 'end': if (a) { b0; b } else d; c; And so on and so on, with semicolons rippling up and down the program as it evolves. Kernighan's argument here is dubious: it is not the case that we "must" remove semicolons before '}' ('end' in Pascal), only that we can. That is, the effort he is concerned about simply need not be expended. The last example where the 'then' clause is a block but the 'else' clause is a non-block statement is clearly bad style, and easily avoided. [# FIXME: rest of this is obsolete #] First note that the page is not correct in stating that grammatical semicolon insertion can be implemented by removing terminating semicolons from the grammar. That is true in some cases: specifically, it is true for the semicolons that terminate 'do', 'continue' and 'break' statements. In these cases, if the semicolon were removed from the grammar but present in a given source text, it would be interpreted as an , which has no semantic effect. So, after 'do', 'continue', and 'break' statements, semicolon insertion would indeed be relatively harmless: at most, a human reader or a parser would need one token lookahead to understand/parse the resulting language unambiguously. For the 'return', 'var', 'throw' and expression statements, however, this approach fails -- in a way that demonstrates the basic problem with implicit semicolon insertion of any kind. Consider the example: { var a = x + y; } If the semicolon terminating a were simply omitted from the grammar, then the grammar would be ambiguous because this example could be parsed as either { var a = x; + y; } or { var a = x + y; } both of which are syntactically valid. In ES3 it will be parsed as the latter -- but if a programmer mistypes + as ++, for example: { var a = x ++ y; } then that will be parsed as { var a = x; ++ y; } In other words, if the programmer makes any syntax error in a variable initialiser, return, throw, or expression statement, then under ES3's semicolon insertion rules it might silently be interpreted as something quite different. To determine the resulting parse, we would need to consider every possible location at which a semicolon could be inserted in the program beyond that point, with _unbounded_ lookahead (it is not even sufficient to consider until the next explicit semicolon, in general). This is clearly unacceptable. By this argument, if Jacaranda were to support grammatical semicolon insertion, it could only safely do so after 'do', 'continue' and 'break' statements, and not after 'return', 'var', 'throw' or expression statements. That would create an ugly inconsistency in the language. It is better to simply disallow semicolon insertion entirely, at the cost of accepting fewer JavaScript programs. [A somewhat related Mozilla bug report is . This report is interesting because the developers indicate that they know ES3 explicitly excludes the case in question from the semicolon insertion rules (because there is no LineTerminator), but they change the SpiderMonkey implementation to perform it anyway, in order to work around an error in *one* web page. This attitude, that perceived compatibility problems should trump standards compliance, seems not to be uncommon among JavaScript implementors, and is quite worrying from a security point of view. (The specific case, and similar cases, will be correctly rejected by a Jacaranda verifier.)] Due to a bug in most JavaScript implementations other than Opera, the behaviour of programs such as the following is not conformant to ES3 on those implementations: (function () { a: for (;;) { for (;;) { break/* */a } return 1 } return 2 })(); ES3 requires the comment that is split over lines to be treated as a line terminator, so that «break [LineTerminator] a;» will not be syntactically valid, and a semicolon will be inserted after «break». The buggy implementations do not perform this insertion, instead interpreting the statement as «break a;» and returning 2. [# FIXME #] Note that currently suggests that the bug in implementations other than Opera should be grandfathered into ES3.1. This would imply that neither behaviour can be relied on even from a correct implementation. Jacaranda is not vulnerable to this issue, because of the rules requiring postfix «++» or «--» to be on the same line as the last token of its operand, and the next non-comment token after a «continue», «break», «return» or «throw» keyword to be on the same line as that keyword. In the example above, for instance, «a» is not on the same line as «break». Module Definitions ================== MODULE_DEFINITION: A module definition MUST be parseable according to the Jacaranda lexical grammar; then, the resulting token stream MUST be parseable as the production below. The "use strict,jacaranda" string literal appearing in MUST be written as a double-quoted string exactly as shown, without escapes or line continuations. Each of the 'imports' and 'optionalImports' properties of a module, if present, MUST be a literal array of string literals, none of which are a . The *import set* of a module is the union of the set of strings that occur in the module's 'import' property (or the empty set if not present) and its 'optionalImports' property (or the empty set if not present), and the implicit imports defined below. For each freely-used identifier of a module (as defined by the SCOPE_OF_IDENTIFIERS group), a string corresponding to the identifier name MUST occur in the module's import set. A module MUST NOT have any freely-assigned identifiers. A that freely-uses 'this' must not be assigned to a property of a that is an . program : sl=statementListFudgy { reparse sl as mdef=moduleDefinition { errors = mdef.errors } else { reparse "function() {" ++ sl.text ++ "}" as f=functionExpression { errors = error(sl, 'not a module definition') U f.errors U check('this' notin f.freelyUsed, sl, '"this" used freely at top-level') } } else { errors = error(sl, 'not a module definition, and parsing as a function body did not work (please report this)') } } } + moduleDefinition : k=ModuleKeyword '(' 'function' '(' '$_' ')' '{' 'with' '(' '$_' ')' '{' 'return' 'function' '(' ')' '{' "use strict,jacaranda"' ';' 'return' mod=moduleLiteral ';' '}' ';' '}' '}' ')' ';' { DependOnAutoAttach = k.text == '$newmodule'; DependOnES5 = k.text == '$module31' || k.text == '$newmodule' } + moduleLiteral : mlit=sefObjectLiteral { errors U= { error(mlit, id, 'freely-used identifier not declared as an import') | id <- mlit.freelyUsed \ (mlit.imports U mlit.optionalImports U IMPLICIT_IMPORTS) } U { error(mlit, id, 'freely-assigned identifier at module level') | id <- mlit.freelyAssigned } IMPLICIT_IMPORTS = ESF_BUILTINS U + sefObjectLiteral : // sef stands for "side-effect free" '{' sefPropertyNameAndValueList '}' + sefPropertyNameAndValueList : nv=sefPropertyNameAndValue | list=sefPropertyNameAndValueList ',' nv=sefPropertyNameAndValue { errors U= { error(id, 'duplicate property name in module object literal') | id <- list.dataProps isect nv.dataProps } + sefPropertyNameAndValue : n=PropertyName ':' v=sefExpression { dataProps = {n.name}, if (n.name == 'imports') { imports = v.elements; errors U= check(nv.imports != undefined, nv, 'value of imports property is not a list of string literals') U { check(id isnot ProtectedName, id, 'imported identifier has protected name') | id <- nv.imports }; } if (n.name == 'optionalImports') { optionalImports = v.elements; errors U= check(optionalImports != undefined, nv, 'value of optionalImports property is not a list of string literals') U { check(id isnot ProtectedName, id, 'optionally imported identifier has protected name') | id <- optionalImports }; } // Spec maintenance hazard: this duplicates checks in . freelyUsed = v.freelyUsed \ {'this'}; if ('this' in v.freelyUsed) { errors U= check(v is functionExpression, v, 'this used freely outside a function expression') U check(n.name isnot ExposedName, v, 'this used freely in an exposed module property') } } + sefExpression : 'null' | 'true' | 'false' | 'undefined' | sl=StringLiteral { SV = sl.SV } | NumericLiteral | '-' NumericLiteral | ^sefObjectLiteral | ^sefArrayLiteral | ^functionExpression sefArrayLiteral : '[' ']' { elements = {} } | '[' ^sefElementList ']' sefElementList : elision sefExpression | sefElementList ',' elision sefExpression | e=sefExpression { elements = (e.SV == undefined) ? undefined : {e.SV} } | list=sefElementList ',' e=sefExpression { elements = (e.SV == undefined) ? undefined : (list.elements U {e.SV}) } elision : see [ES5Draft] section 11.1.4 functionExpression : see group FUNCTIONS Rationale: An object literal is used as the argument to enable extensibility. Evaluating the object literal itself must have no side effects (because it will be evaluated before the call to the function given by ). This is not a significant constraint on the main code of the module, which can be defined in a function expression. A deeply immutable copy of the object created from the object literal will be provided to caplets created from this module. This allows caplets created from the same module to share an arbitrary amount of immutable data (without needing one copy of it per caplet). Note that only anonymous function expressions are allowed, since a named function expression might have the side effect of defining the name in the surrounding scope. (This restriction applies to Jacaranda in general, not just the top level of a module definition -- see group FUNCTIONS.) The module definition could technically have been passed in as a string literal and then 'eval'ed, but that option was quickly rejected because it would require excessive use of quoting in the module code. does not attempt to be the largest possible side-effect-free subset of ESF expressions, just one that is sufficiently expressive for any meta-information that we might want to use to describe a module. In fact it is essentially JSON, except that: - anonymous functions are included; - comments are included; - string literals use the syntax specified by ESF, not JSON; - property names in object literals can be identifiers, not only string literals. [Thanks to Kris Zyp for the idea that "JSON plus anonymous functions" is a potentially useful sublanguage of JavaScript, although we're not using it exactly as he suggested.] Trailing commas are not allowed in , because Jacaranda also does not allow them in (see group ARRAY_LITERALS). does not allow 'get' or 'set' property assignments, because this allows the runtime to assume that the object it creates is constant. One way to submit a module to a Jacaranda interpreter is to simply execute the module definition as ESF code -- for example in a