Jacaranda Language Specification, draft 0.2 "Inside every large problem is a small problem struggling to get out." -- C. A. Hoare Introduction ============ This document is draft version 0.2 of a specification for the Jacaranda programming language. Jacaranda is an object-capability [Miller2006] sublanguage of JavaScript. To be more precise, Define language A to be an *attribute-verifiable sublanguage* of B if-and-only-if: - there exists an attribute grammar [Knuth1968, Knuth1971, Swierstra2005] for A whose underlying grammar recognises a subset of texts recognised by B, and that deterministically computes an 'errors' attribute (whose value is a set of error indicators) on the start nonterminal; and the semantics of executing a text in A are: - if the text is not recognised by A's underlying grammar or if the 'errors' attribute on the start nonterminal is non-empty, report these errors. Otherwise, execute the program according to the specification of language B, in a given execution environment. If the grammar for A is S-attributed , then A is an S-attribute-verifiable sublanguage of B. By *ES3*, we mean the programming language specified by ECMAScript, 3rd edition, subject to corrections for known specification errors in that standard as described in the "ES3 Amendments" section below. Jacaranda is an S-attribute-verifiable sublanguage of ES3. Note that _any_ implementation of language B that satisfies B's specification when applied to programs in the A sublanguage, is sufficient to execute a verified A program. In the case of Jacaranda, the sublanguage avoids many known bugs in commonly used JavaScript implementations, and so verified code does not necessarily need to be run on a fully correct implementation of ES3. By using this definition of subsetting, Jacaranda differs from some other proposals for security-focussed sublanguages of ECMAScript, such as Caja and Cajita [Miller2008] or FBJS [Facebook]. In implementations of those languages, the code is rewritten to a form that can introduce additional run-time errors (or other changes in behaviour, called "gotchas" in the Caja/Cajita specification), that would not occur when running it directly as ECMAScript. In Jacaranda, there are necessarily no changes in the possible run-time behaviour of a verified program relative to the same program in ES3. There are cases where an existing JavaScript program might have to be modified in such a way that additional run-time errors could occur -- for example some property accesses may have to be replaced by function calls, which can throw exceptions -- but these can only happen where there are explicit changes in the source code. Other advantages of specifying and implementing a safe sublanguage by attribute-verification rather than by rewriting include: - A verifier is easier to understand and has fewer sources of potential implementation error than a rewriter. To understand a rewriter, one must keep in mind the semantics of both the source and target languages, and convince onesself that the source language semantics are preserved by the translation (taking into account the run-time library that the translated code depends on). For a verifier, the subsetting relation is automatic. - It is easy to perform white-box testing of a verifier based on an attribute grammar, because the intermediate results are deterministic and do not depend on any implementation choices. The attribute grammar specifies precisely what tree of attribute values are to be computed for any given input. So a single white-box test suite can be run against all implementations, or implementations can be compared against each other for "fuzzed" input , and any differences in the resulting attribute tree (not just differences in the overall pass/fail outcome of verification) will necessarily be bugs. - There are no difficulties in reverifying a program that has already been verified. (A non-idempotent rewriter cannot usefully process a code fragment twice -- even if the result is correct, it will be too inefficient.) It would also be possible to apply N-version programming, with verifiers written using different languages and/or parser libraries by independent teams of programmers: a simple harness could run all versions concurrently and if they differ, reject the input program and report a bug. (Of course these techniques will only directly catch implementation rather than specification bugs. They also only apply to the verifier; the JavaScript implementation(s) used to execute verified programs still need to be tested more conventionally.) - In the S-attributed grammars (lexical and syntactic) used to specify Jacaranda, all attributes are *synthesized*; that is, each node's attributes can be computed only from the attributes and contents of its immediate child nodes. This admits a particularly simple and efficient (in time and memory) verifier implementation, in which programs can be parsed and verified in a single pass, without necessarily constructing a syntax tree, abstract or otherwise. The verifier can use almost any form of parser (Jacaranda omits two misfeatures of ECMAScript syntax that may complicate parsers: semicolon insertion, and context-sensitive lexing of regexp literals). Attribute computation can be implemented using the "semantic actions" supported by most existing general-purpose parser systems, and is also straightforward to implement in an ad-hoc recursive descent parser. - to specify a sublanguage, you only need to define how the language is restricted syntactically. If that is done by an attribute grammar, there is little room for ambiguity in the specification of the verifier. The semantics are already defined by the original language specification, modulo bugs and ambiguities in that specification. (Jacaranda also requires a run-time library written in JavaScript, but that library is relatively small and of low implementation complexity. In fact it is mostly a set of utility functions which, although they are written in JavaScript and therefore have to be trusted, are each very short and can be understood independently.) A specification that defines a rewriting into another language is more complicated, and gives a result that is less directly useful and understandable to most users of the subset language. - There is less inefficiency introduced into the code that is being run. While there is some overhead due to changing property accesses to function calls, this is explicitly visible in the source code (and can be avoided in many cases). - Debugging is simplified because it is the original code that is being debugged. Although it is possible in principle to provide a debugger with a mapping between lines of the input code and its rewritten form, in my experience this often doesn't work very well (particularly when the ordering of code is changed, as it is by many of the Caja rewriting rules). Despite these differences, Jacaranda is strongly influenced by Caja and Cajita, and by ADsafe [Crockford], which is also a static sublanguage of ES3 that could in principle be specified by an attribute grammar. (Differences between Jacaranda and ADsafe are out of scope for this document, but in general Jacaranda allows a qualitatively larger sublanguage of ECMAScript than ADsafe. Also, Jacaranda is an object-capability language with per-object encapsulation, whereas ADsafe only provides encapsulation between module instances.) I encountered the idea of using attribute grammars for validation (as opposed to their more common uses in assigning semantics or for code generation) from Meredith Patterson and Len Sassaman [PS2008]. The earliest reference I can find to a programming language being defined in terms of an attribute grammar, is a 1977 paper defining a subset of Algol 68 [Simonet1977]. [Knuth1968] Donald Knuth, Semantics of Context-Free Languages, California Institute of Technology, Theory of Computing Systems, Volume 2 Number 2. Springer, New York, June 1968. [Knuth1971] Donald Knuth, Semantics of Context-Free Languages: Correction, Stanford University, Theory of Computing Systems, Volume 5 Number 2. Springer, New York, June 1971. [Simonet1977] M. Simonet, An attribute description of a subset of Algol 68, Proceedings of the Strathclyde ALGOL 68 conference, Glasgow, Scotland, pages 129-137, 1977. [Swierstra2005] Wouter Swierstra, Why Attribute Grammars Matter, The Monad.Reader, Issue Four, 01-07-05. [PS2008] Meredith L. Patterson and Len Sassaman, Validation by Parse Tree Comparison Extended to Context-Sensitive Grammars, In preparation, 2008. [Miller2006] Mark S. Miller, Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control, PhD thesis, Johns-Hopkins University, May 2006. [Miller2008] Mark S. Miller, Caja draft specfication, Latest version available from [Crockford] Douglas Crockford, ADsafe specification, [Facebook] FBJS - Facebook Developers Wiki, Conventions =========== Definitions of technical terms used in this specification should be taken from the first of the following documents that provides a definition for the term: 1. This document. 2. ECMA-262, 3rd edition. 3. The Unicode Standard, version 5.1.0. Defining occurrences of terms in this document are shown like this: *term*. A *string* is a sequence of zero or more UTF-16 code units. In this specification, strings are often presented in single quotes; this has the same meaning, including use of \u escape sequences, as the corresponding string literal in ES3. Terms applied to strings such as *starts with*, *ends with*, etc. are defined in terms of this code unit sequence, regardless of whether it is a well-formed UTF-16 encoding. For example, '\uD800__' is considered to end in '__'. Code fragments are given in double angle quotation marks: « ». In general, these fragments should be interpreted as ES3 code; they may or may not be valid in the Jacaranda sublanguage. In code fragments, \ stands for itself. Grammar productions are given in single angle brackets: < >. An optional occurrence of a production is shown as < >opt. \u escapes are used in grammar productions to mean the corresponding unescaped code unit. (These conventions are different to those in ECMA-262 because in plain text, we cannot distinguish terminals from nonterminals by font.) There are three grammars used in this specification: - in the lexical grammar, :: is used to separate a nonterminal from its definition. The lexical grammar is an S-attributed context-free grammar (CFG) [*], and is specified in terms of a set of changes to the ES3 lexical grammar. With some minor exceptions (/*const*/, /*fallthru*/, and the treatment of ++ and -- tokens), these only cause additional rejections; they do not change the tokenisation of lexically valid ES3 source texts. The input to the lexical grammar is a string, and the output is a sequence of tokens given by the top-level 'tokens' attribute. Each output token is a string that may also have 'SV', 'MV', and/or 'errors' attributes. (Note that the production defined in ECMA-262 is misnamed, because it does not include s, which are tokens.) - in the syntactic grammar, : is used to separate a nonterminal from its definition. The syntactic grammar is also an S-attributed CFG [*], and is specified in terms of a set of changes to the ES3 syntactic grammar. Its input is the token sequence computed by the lexical grammar, and its principal output is an 'errors' attribute giving the set of errors (if any), other than parse errors, that would prevent the source text from being a valid Jacaranda program. - in the string grammar, ::: is used to separate a nonterminal from its definition. The input to the string grammar is a sequence of 16-bit code units. It is a regular grammar that computes no attributes and is only used to recognise classes of strings. The regexp literal grammar in ES3 has no equivalent in Jacaranda, which does not allow regexp literals. The symbol ε in a production means the empty input sequence. Lexical and syntactic grammar productions that are used by modified productions, but have not themselves been modified from ECMA-262, 3rd edition, are explicitly marked with a reference to the ECMA-262 section number. The underlying lexical and syntactic grammars are only given as general CFGs [*], rather than a more specific class such as LALR(k), because the grammars given in the ES3 specification on which they are based are not in any more specific class. The syntactic CFG is unambiguous, independently of the attribute rules; that is, a possible parse is never rejected based on the values of attributes. (This property does not quite hold for the lexical grammar, where attributes are used in the LEX_TOKENISATION rule to control tokenisation.) [[FIXME: . is ambiguous. To fix that we need to distinguish *syntactically* between public and exposed identifiers.]] [*] The lexical and syntactic grammars use "but not" and "lookahead ∉" constructs, which do not obviously result in a CFG by construction. In all but one case, "but not" is only used to specify sets of individual code units, so it does not increase the power of the grammar notation. The single exception is: :: but not from ECMA-262 section 7.6, which is context-free (although it would be ugly to write it as such) because is finite. Only one use of the 'lookahead ∉' construct is retained by the Jacaranda grammar, in . This use could have been avoided by duplicating all of the expression-related productions with versions that do not allow an object literal or function expression. It would be undesirable to actually do so, but the possibility of this rewriting shows that the grammar is context-free. The types used for attribute values are: - rational numbers - the undefined value - finite sets of other values - finite sequences of other values. The following auxiliary functions are used in attribute rules: check(cond, node, arg, msg) = cond ? {} : {error(node, arg, msg)} error(node, arg, msg) = // representation of an error detected at 'node', with auxiliary // information 'arg' (usually an identifier) and message 'msg' last(sequence) = (length(sequence) > 0) ? sequence[length(sequence)-1] : undefined NumberToString(x) = // the result of applying ToString as defined in ECMA-262 section 9.8.1 // to "the number value for x" as defined in ECMA-262 section 8.5 A convention is used in the syntactic grammar specification in order to present grammar rules more concisely, by making some "boilerplate" attribute computation implicit: if no explicit rule is given to compute an attribute, it is computed by taking the union of the corresponding attribute values on all child nodes. This convention applies to the 'freelyUsed', 'freelyAssigned', 'vars', 'usedProperties', 'aliasesForThis', and 'errors' attributes in the syntactic grammar, and to the 'errors' attribute in the lexical grammar. (Other attributes are undefined if no specific rule is given.) These are all set-valued attributes, so that taking the union of their values for all children of a node is well-defined. The convention is to be used for _all_ syntactic and lexical grammar productions, including those that are otherwise unchanged from ES3. Each rule that restricts Jacaranda relative to ES3 is given an uppercase name. Any conformance requirements given in the normative text of these rules are intended to be redundant with, i.e. specify precisely the same restrictions as, the Jacaranda lexical and syntactic grammars. If this is not the case then it indicates an ambiguity in the specification, and clarification should be sought. Grammar productions given in one rule may also impose constraints needed to implement other rules, in which case cross-references are given. Text in a subsection following "Rationale:" is non-normative; that is, it does not impose any conformance requirements on Jacaranda implementations or programs. The same programming language would be specified if this text were removed. Rationale: Analogous to type declarations, some degree of redundancy in a specification is useful in order to provide a better chance of detecting specification errors. This is the reason for providing informal descriptions that are redundant with the grammar rules. Conformance =========== The terms *SHALL*, *SHALL NOT* or *SHALL ONLY* are used to describe constraints on a conforming implementation of Jacaranda. The terms *MUST*, *MUST NOT* or *MUST ONLY* are used to describe static constraints on a module definition (see below). This indirectly specifies that a conforming implementation of Jacaranda SHALL reject module definitions that fail to satisfy such a constraint. The terms *SHOULD*, *SHOULD NOT* or *SHOULD ONLY* are used to describe a recommendation for Jacaranda code to follow, in order to mimimise the risk of potential incompatibilities with future versions of this specification or implementations of ECMAScript. It has no formal consequence for conformance. For any constraint that is described as implementation-defined, a conforming implementation SHALL make a consistent choice and SHALL document which choice is made. An *ECMAScript interpreter* is an instance of an ECMAScript implementation conforming to ECMA-262 3rd edition amended as described in the section "ES3 Amendments". The *primordial objects* are the constructors Array, Function, Object and String, and any other objects necessary for them to function correctly in a given ECMAScript implementation (for example the objects comprising their prototype chains). A *context* is an ECMAScript execution environment that has copies of the primordial objects independent of other context in the same ECMAScript interpreter. Contexts may share other objects. (In typical web browser implementations of JavaScript/ECMAScript, there is usually one context for each frame or iframe [[CHECK]].) A *visible side effect* is an effect that changes the state of one or more objects in an ECMAScript interpreter in a way that is observable to other code running in that interpreter by overt means (that is, without making use of timing of events where this timing is nondeterministic). Memory usage is not considered to be a visible side effect. Jacaranda programs are organised into *modules* (typically, these are independently developed units of source code). A *module definition* is a string, representing Jacaranda source code intended to define a module. An *internal module representation* is a representation of a module as a collection of values in a given ECMAScript interpreter. An internal module representation includes a *module object*, which is a deeply immutable object with the properties specified in the module definition. An instance of a Jacaranda module is called a *caplet*. Each caplet will have a *powerbox* object defined by a specific region of its module code. The creator of a caplet must provide an *environment* and a *powersource*. The environment specifies *imports* that will be in scope throughout the caplet's module code. The powersource is an object that will only be initially available to the caplet's powerbox. The caplet will have no overt authority other than that granted via its environment and its powersource. (It will also have access to the module object, but that object provides no authority.) A *module name* is a string, representing the name of a module. A *Jacaranda interpreter* is an ECMAScript interpreter with the following additional characteristics: - an interpreter has a per-context state consisting of a mapping from module names, to internal module representations of modules that have been accepted in that context. - it is possible to create a context with an initially empty mapping. - it is possible to submit a module definition to a given context, with the following effect: - if the module definition violates any of the constraints given in sections "Module Definitions" and "Static Constraints" below, a *static rejection error* SHALL be reported. - the module definition may specify a variant of the Jacaranda language that is unsupported by the implementation. In that case, an *unsupported Jacaranda variant error* SHALL be reported. - if the module definition does not give a 'name' property that is different from any module name already in the mapping, a *module name clash error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to submit a module. Such a failure SHALL be distinguishable from other errors. - otherwise, a mapping from the module's name to its internal module representation SHALL be added to the context's state. In any of these cases, there SHALL be no other visible side effects of submitting the module definition. - it is possible to create a caplet by *instantiating* a named module, providing an environment and a powersource as discussed above. - if the import environment specifies a property with an unshadowable name, an *unshadowable import error* SHALL be reported. - if an import named in the caplet's 'imports' list is not available in the import environment, an *unavailable import error* SHALL be reported. - the module MUST have a 'powerbox' property that holds a function -- if not then a *missing powerbox property error* SHALL be reported. - the implementation may, for any or no reason, report an *implementation failure* when an attempt is made to instantiate a caplet. Such a failure SHALL be distinguishable from other errors. - the function given by the 'powerbox' property of the module SHALL be executed, in an environment specified by the section "Module Run-time Environment", with the given powersource object and the module object as parameters in that order, and with the imports specified in the given import environment available in the module's outer lexical scope. - the function given by the 'powerbox' property may interact with the environment of the Jacaranda implementation in order to set up *event handlers* which cause functions created by the module to regain control at some later point in time. The specifications of event-handling features are implementation-defined. In cases where the above description says that an error SHALL occur as a result of submitting or instantiating a module, other visible side effects SHALL NOT occur. At any time, and for any or no reason, a Jacaranda implementation may *lock* a given context, in which case it SHALL NOT run any further code in that context. Other aspects of the interaction between a Jacaranda interpreter and its environment are unspecified. (One possibility is that a *container* is written in unrestricted ECMAScript to control this interaction.) Rationale: The idea of using different conformance terms to indicate constraints on the implementation vs constraints on programs is borrowed from W3C specifications. The term "lock" is named after the LOCK utility in the Incompatible Timesharing System: . The allowance for implementation failures makes Jacaranda particularly easy to implement: #include int main(int argc, char **argv) { puts("Implementation failure.\n"); return 1; } More seriously, there are good reasons to allow nondeterministic failures, given that a practical Jacaranda implementation has no control over what memory or execution resources are available, and machines with unbounded resources do not exist. The requirement for these failures to be distinguishable from other errors is actually quite strong. ES3 Amendments ============== Apply the errata given in . (Note that the grammar productions changed in the errata are disjoint from those changed for Jacaranda. [[CHECK again]]) Change and related productions to allow «const» to occur in place of «var». The detailed changes to these productions are given in the DECLARABLE_IDENTIFIERS rule, but are mentioned here because they are not a sublanguage of ES3. Section 2 Conformance - The following paragraphs do not apply to Jacaranda: A conforming implementation of ECMAScript is permitted to provide additional types, values, objects, properties, and functions beyond those described in this specification. In particular, a conforming implementation of ECMAScript is permitted to provide properties not described in this specification, and values for those properties, for objects that are described in this specification. A conforming implementation of ECMAScript is permitted to support program and regular expression syntax not described in this specification. In particular, a conforming implementation of ECMAScript is permitted to support program syntax that makes use of the “future reserved words” listed in 7.5.3 of this specification. Rationale: permission to provide arbitrary implementation-defined extensions is not appropriate for a security-focussed language. Any proposed extension should undergo thorough public security review and, if accepted, be added to the Jacaranda specification. This does not prevent experimentation with proposed extensions, but an implementation that supports such experiments is not considered to be conformant to this version of the specification. Section 3 References - Replace the reference to Unicode version 2.0 with references to Unicode version 5.1.0. Rationale: The conformance requirements for Unicode have been significantly tightened since version 2.0. (In particular, there were security-relevant changes at around version 3.2.) Note that this does not require support for characters added in later versions. Sections 4.2 and 4.2.1 Language Overview - This (non-normative) section should not be considered to apply to Jacaranda. Section 6 Source Text [[TBD]] Rationale: The terms used here are inconsistent with the definitions in the Unicode standard. For example, "character" is used in place of "code unit", and "code point" is used entirely incorrectly. It's possible to infer what was probably meant, so I won't attempt to rewrite it in this draft; the important points are that: - strings are sequences of code units - an escape sequence represents exactly one code unit. Section 7.1 Unicode Format Control Characters - This section does not apply to Jacaranda; in particular, Format Control characters SHALL NOT be "removed from the source text before applying the lexical grammar." (Format Control characters are not allowed anywhere in a Jacaranda source text.) Section 7.6 Identifiers - The following does not apply to Jacaranda: Identifiers are interpreted according to the grammar given in Section 5.16 of the upcoming version 3.0 of the Unicode standard, with some small modifications. This grammar is based on both normative and informative character categories specified by the Unicode standard. The characters in the specified categories in version 2.1 of the Unicode standard must be treated as in those categories by all conforming ECMAScript implementations; however, conforming ECMAScript implementations may allow additional legal identifier characters based on the category assignment from later versions of Unicode. The allowed syntax for identifiers in Jacaranda is specified in rule LEX_IDENTIFIER_SYNTAX. Rationale: the syntax of identifiers needs to be independent of which version of the Unicode character data is otherwise used by the implementation. Section 7.9 Semicolon Insertion - Replace the entire section (including 7.9.1 and 7.9.2) with: 7.9 No "Automatic Semicolon Insertion" Jacaranda does not allow programs that would trigger "automatic semicolon insertion" as defined in ECMA-262 3rd edition, section 7.9. Any such program is invalid according to the grammar defined in this specification. This is implemented in rule LEX_TOKENISATION (and by the use in some syntactic productions of and tokens, which correspond to ++ and -- tokens that are not separated from the preceding token by a line terminator). Rationale: In the ES3 grammar, ... In some cases, "grammatical" semicolon insertion can be supported simply by removing the terminating semicolon from ... If the semicolon is present, it will be interpreted according to this grammar as an , which has no semantic effect. In the case of the 'do', 'continue', and 'break' statements, this would be relatively harmless because a human reader or a parser would need only one token lookahead to understand/parse the resulting language unambiguously. For the 'return', 'var', 'throw' and expression statements, however, this approach fails -- in a way that demonstrates the basic problem with implicit semicolon insertion. Consider the example: var a = x + y; If the .. were simply omitted, then the grammar would be ambiguous because .. could be parsed as either var a = x; + y; or var a = x + y; both of which are syntactically valid. In ES3 it will be parsed as the latter -- but if a programmer mistypes + as ++, for example: var a = x ++ y; then that will be parsed as var a = x; ++ y; In other words, if we make any syntax error in a variable initialiser, return, throw, or expression statement, then under ES3's semicolon insertion rules it might silently be interpreted as something quite different. To determine the resulting parse, we would need to consider every possible location at which a semicolon could be inserted in the program beyond that point, with unbounded lookahead (it is not even sufficient to consider until the next explicit semicolon, in general). This is clearly unacceptable. By this argument, if Jacaranda were to support grammatical semicolon insertion, it could only safely do so after 'do', 'continue' and 'break' statements, and not after 'return', 'var', 'throw' or expression statements. That would create an ugly inconsistency in the language. It is better to simply disallow semicolon insertion entirely, at the cost of accepting fewer programs. Annex B Compatibility - The additional syntax specified in section B.1 SHALL NOT be valid in Jacaranda. - The 'escape' and 'unescape' global functions in section B.2.1 and B.2.2 SHALL NOT be available to Jacaranda modules, unless they are explicitly provided by an import environment. - Whether the additional properties of String and Date specified in sections B.2.3 to B.2.6 are available to Jacaranda modules is implementation-defined. A Jacaranda program SHOULD NOT rely on the presence of these properties. Lexical Grammar =============== The lexical grammar of Jacaranda is that of ES3 with the additional constraints listed below. LEX_CODE_UNITS: Source code units, line terminators, and whitespace MUST ONLY be used as follows: :: // Rationale: \u0009 // exclude ASCII controls except TAB, \u000A // LF, \u000D // CR. \u0020-\u007E // \u007F-\u009F controls \u00A0-\u00AC // \u00AD :Cf: \u00AE-\u05FF // \u0600-\u0603 :Cf: \u0604-\u06DC // \u06DD :Cf: \u06DE-\u070E // \u070F :Cf: \u0710-\u17B3 // \u17B4-\u17B5 :Cf: \u17B6-\u200A // \u200B-\u200F :Cf:; \u200C-\u200F !Safari !IE6 \u2010-\u2027 // \u2028-\u2029 !Cajita; \u202A-\u202E :Cf: !Safari !IE6 \u202F-\u205F // \u2060-\u2064 :Cf: \u2065-\u2069 // \u206A-\u206F :Cf: !Safari !IE6 \u2070-\uD7FF // \uD800-\uDFFF UTF-16 surrogates \uE000-\uFDCF // \uFDD0-\uFDEF noncharacters \uFDF0-\uFEFE // \uFEFF (BOM) :Cf: !Safari !IE6 \uFF00-\uFFEF // \uFFF0-\uFFF8 !Safari !Firefox; // \uFFF9-\uFFFB :Cf: // \uFFFC (object replacement character) // \uFFFD (replacement character) // \uFFFE-\uFFFF noncharacters :: \u0020-\u007E :: \u000A \u000D :: \u0009 \u0020 This rule does not restrict the set of code units that may be represented by \u escapes. Rationale: Start with \u0000-\uFFFF, and - remove code units that are not in XML 1.1's or that are in XML 1.1's (see below) - remove code units corresponding to BMP characters that are in category :Cf: (Format Control) in Unicode 5.1.1: (this is a superset of category :Cf: in all previous Unicode versions) - remove code units that are disallowed by Cajita according to . - remove \uFFFC and \uFFFD. and from XML 1.1 are used just because that is a slightly less arbitrary choice than me deciding on my own what control characters should be restricted. Disallowing :Cf: characters is in principle the wrong thing for proper internationalisation support. However, I eventually concluded that it's the only way to guarantee consistent and secure behaviour across browsers, which in this case trumps internationalisation. (The right thing would be to allow :Cf: characters other than \uFEFF only in string literals and comments, and to allow \uFEFF [byte order mark] only between tokens. Relaxing the rules later is easier than tightening them.) Whitespace is restricted to US-ASCII tab and space characters. There is no reason to internationalise whitespace used to separate tokens in source code. (Other space characters are still allowed in strings.) Also, the fact that vertical tab and form feed were treated as whitespace rather than as line separators was potentially confusing; Jacaranda disallows them. LEX_COMMENT_FIRST_CHARACTER: If the first code unit in a after "/*" is not '*' or ' ', or if the first code unit in a after "//" is not '/' or ' ', then all code units in the comment MUST be printable US-ASCII (that is, [\u0020-\u007E]). The necessary grammar change is given in rule LEX_COMMENT_KEYWORDS. Rationale: This prevents an adversarial code review attack where a comment that looks like /*const*/ does not actually enforce constness, and similarly for other comments used by JSLint, for example. It also prevents the first character in a comment from being a non-US-ASCII character that is stripped on parsing, which might plausibly result in the LEX_COMMENT_NO_ATSIGN_EXTENSIONS rule being bypassed. LEX_COMMENT_NO_ATSIGN_EXTENSIONS: The first character in any comment after «/*» or «//» MUST NOT be «@». The necessary grammar change is given in rule LEX_COMMENT_KEYWORDS. Rationale: This may trigger non-ES3-compliant extensions (for example or ). ADsafe rejects «@cc» anywhere in a comment. This is sufficient to prevent enabling of JScript conditional compilation, but not the Venkman debugger extensions. Caja and Cajita do not need to reject «@» in comments, because the output of cajoling does not include any comments. See . LEX_COMMENT_KEYWORDS: «/*const*/» and «/*fallthru*/» are treated as non-comment tokens. They MUST ONLY appear where specified in the Jacaranda grammar. Note the lack of spaces before and after «const» or «fallthru». The lexical grammar changes resulting from this and rules LEX_COMMENT_FIRST_CHARACTER and LEX_COMMENT_NO_ATSIGN_EXTENSIONS are: :: ε :: * / { c.lineBreak = false } opt * / { c.lineBreak = left.lineBreak or right.lineBreak; c.errors = check(right is , c, right, 'comment does not start with * or space and is not printable ASCII') } * opt * / { c.lineBreak = pacc.lineBreak } \u0020 opt * / { c.lineBreak = mcc.lineBreak } :: but not * or \u0020 or @ { mc.lineBreak = sc is } :: as in ECMA-262 section 7.4 { [[FIXME lineBreak]] } :: as in ECMA-262 section 7.4 { [[FIXME lineBreak]] } :: as in EMCA-262 section 7.4 { [[FIXME lineBreak]] } :: as in ECMA-262 section 7.4 { [[FIXME lineBreak]] } :: { c.errors = check(right is , c, right, 'comment does not start with / or space and is not printable ASCII') } / opt \u0020 opt :: :: but not or / or \u0020 or @ :: as in ECMA-262 section 7.4 :: as in ECMA-262 section 7.4 Rationale: /*const*/ and /*fallthru*/ must be ES3 comments in order for Jacaranda to remain a sublanguage of ES3. (The approach taken by this and the LEX_TOKENISATION rules is to lex them as multiline comments, but then retain them as tokens, when other comment tokens are thrown away.) See the LEFT_HAND_SIDE rule for a description of the use of /*const*/ (it is essentially the same as 'const' in ES4, but with static rejection instead of silent failure). See also . LEX_STRING_LITERALS: A MUST ONLY use one of the escape characters defined in below. The following attribute definitions are used to compute the value of a string literal, which is needed for other rules. They correspond exactly to the procedure in ECMA-262 section 7.8.4, expressed in the same attribute grammar notion as the rest of this specification. The grammar has been altered to not require lookahead in the production. :: " " { s.SV = ds.SV } ' ' { s.SV = ss.SV } :: ε { ds.SV = "" } { ds.SV = [c.CV] ++ rest.SV } \u005C 0 // no initial decimal { ds.SV = ['\u0000'] ++ rest.SV } :: ε { ds.SV = "" } { ds.SV = [c.CV] ++ rest.SV } \u005C 0 { ds.SV = ['\u0000'] ++ rest.SV } :: but not " or \u005C or { dc.CV = c } \u005C { dc.CV = e.CV } :: but not " or \u005C or or { dc.CV = c } \u005C { dc.CV = e.CV } :: ε { ss.SV = "" } { ss.SV = [c.CV] ++ rest.SV } :: ε { ss.SV = "" } { ss.SV = [c.CV] ++ rest.SV } \u005C 0 { ss.SV = ['\u0000'] ++ rest.SV } :: but not ' or \u005C or { sc.CV = c } \u005C { sc.CV = e.CV } :: but not ' or \u005C or or { sc.CV = c } \u005C { sc.CV = e.CV } { e.CV = ce.CV } { e.CV = he.CV } { e.CV = ue.CV } { ce.CV = se.CV } :: b { se.CV = '\u0008' } t { se.CV = '\u0009' } n { se.CV = '\u000A' } f { se.CV = '\u000C' } r { se.CV = '\u000D' } " { se.CV = '\u0022' } ' { se.CV = '\u0027' } / { se.CV = '\u002F' } \u005C { se.CV = '\u005C' } :: x { he.CV = [a.MV*16 + b.MV] } :: u { ue.CV = [a.MV*4096 + b.MV*256 + c.MV*16 + d.MV] } :: 0 { hd.MV = 0 } 1 { hd.MV = 1 } 2 { hd.MV = 2 } 3 { hd.MV = 3 } 4 { hd.MV = 4 } 5 { hd.MV = 5 } 6 { hd.MV = 6 } 7 { hd.MV = 7 } 8 { hd.MV = 8 } 9 { hd.MV = 9 } A { hd.MV = 10 } B { hd.MV = 11 } C { hd.MV = 12 } D { hd.MV = 13 } E { hd.MV = 14 } F { hd.MV = 15 } a { hd.MV = 10 } b { hd.MV = 11 } c { hd.MV = 12 } d { hd.MV = 13 } e { hd.MV = 14 } f { hd.MV = 15 } Rationale: JScript does not implement «\v» (it treats it as «v»). This could allow lexer confusion attacks. Unlike Caja, we cannot rewrite it to a different escape. There is no reason to allow «\» followed by an escape character that does not have a defined meaning. However, we must allow «\/» in order to remain a superset of JSON. (Unfortunately, JSON does not allow «\'», otherwise JSON and Jacaranda would support identical syntax for double-quoted strings. I considered disallowing «\'» in double-quoted strings, but that would create an inconsistency between single- and double-quoting.) LEX_NUMERIC_LITERALS: A MUST NOT have more then 20 significant digits. The following attribute definitions are used to compute the value of a numeric literal, which is needed for other rules. They correspond exactly to the procedure in ECMA-262 section 7.8.3, expressed in the same attribute grammar notion as the rest of this specification. :: { nl.MV = dl.MV } { nl.MV = hl.MV } :: . { dl.MV = dil.MV; dl.errors = check(dil.sig <= 20, dl, '> 20 significant digits in decimal literal') } . { dl.MV = dil.MV + dd.MV*10^-(dd.n); dl.errors = check((dil ++ dd)::DecimalDigits.sig <= 20, dl, '> 20 significant digits in decimal literal') } [[FIXME: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ avoid reparsing a computed value]] . { dl.MV = dil.MV*10^(ep.MV); dl.errors = check(dil.sig <= 20, dl, '> 20 significant digits in decimal literal') } . { dl.MV = (dil.MV + dd.MV*10^-(dd.n))*10^(ep.MV); dl.errors = check((dil ++ dd)::DecimalDigits.sig <= 20, dl, '> 20 significant digits in decimal literal') } [[FIXME: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ avoid reparsing a computed value]] . { dl.MV = dd.MV*10^-(dd.n); dl.errors = check(dd.sig <= 20, dl, '> 20 significant digits in decimal literal') } . { dl.MV = dd.MV*10^(ep.MV - dd.n); dl.errors = check(dd.sig <= 20, dl, '> 20 significant digits in decimal literal') } { dl.MV = dd.MV*10^(ep.MV); dl.errors = check(dil.sig <= 20, dl, '> 20 significant digits in decimal literal') } :: { dil.MV = d.MV; dil.n = 1 } { dil.MV = nzd*10^(dd.n) + dd.MV; dil.n = dd.n + 1 } :: 0 { dd.MV = 0; dd.sig = 0; dd.lead = 0; dd.n = 1 } { dd.MV = d.MV; dd.sig = 1; dd.lead = 1; dd.n = 1 } 0 { dd.MV = rest.MV; dd.sig = rest.sig; dd.lead = rest.lead === 0 ? 0 : (rest.lead + 1); dd.n = rest.n + 1 } { dd.MV = nzd.MV*10^(rest.n) + rest.MV; dd.sig = rest.sig + 1; dd.lead = rest.sig + 1; dd.n = rest.n + 1 } [[CHECK do we still need 'lead'?]] :: 0 { d.MV = 0 } { d.MV = nzd.MV } 1 { dd.MV = 1 } 2 { dd.MV = 2 } 3 { dd.MV = 3 } 4 { dd.MV = 4 } 5 { dd.MV = 5 } 6 { dd.MV = 6 } 7 { dd.MV = 7 } 8 { dd.MV = 8 } 9 { dd.MV = 9 } :: { ep.MV = si.MV } :: one of e E :: { si.MV = dd.MV } + { si.MV = dd.MV } - { si.MV = -(dd.MV) } :: 0x { hils.MV = hd.MV } 0X { hils.MV = hd.MV } { hils.MV = hil.MV*16 + hd.MV } : see rule LEX_STRING_LITERALS Rationale: ECMA-262 section 7.8.3 says that: Once the exact MV for a numeric literal has been determined, it is then rounded to a value of the Number type. If the MV is 0, then the rounded value is +0; otherwise, the rounded value must be /the/ number value for the MV (in the sense defined in 8.5), unless the literal is a and the literal has more than 20 significant digits, in which case the number value may be either the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit or the number value for the MV of a literal produced by replacing each significant digit after the 20th with a 0 digit and then incrementing the literal at the 20th significant digit position. Therefore decimal literals with more than 20 significant digits cannot be used to obtain any additional precision. (Note that 20 significant digits are more than sufficient to round-trip any IEEE double value.) This restriction is also necessary in order to enforce that properties in object literals are unique. The issue here is that if the conversion to an IEEE double is nondeterministic, then the verifier may fail to detect names that should be treated as colliding (in rules SCOPE_OF_THIS and MODULE_DEFINITION_SYNTAX). LEX_RESERVED_WORDS: The following are considered to be reserved words (a larger set than that specified in ECMA-262 section 7.5): :: $module arguments Function Object Rationale: $module can be used as a keyword to introduce a module definition, but is otherwise unmentionable for two reasons: - the implementation of $module assumes that it is called with a module definition that has already been verified, so for security, the code within a module cannot be allowed to call it with an unrestricted argument; - a Jacaranda parser implementation can assume that a «$module» token always starts a module. When more than one module is parsed at once, this permits any backtracking or other information associated with a module to be thrown away each time «$module» is seen, and can allow better diagnostics when a module is not terminated correctly. «arguments» can only be used in the contexts specified in the PUBLIC_PROPERTY_READ rule. Function is unmentionable because the Function constructor is unsafe. Object is unmentionable because the "static" methods of Object proposed to be added in ES3.1, should not be accessible untamed from Jacaranda. Note that unlike the other globals that are accessible by default, we cannot run module code with a scope chain that rebinds Array, Function, Object, or String, because the resulting behaviour is not clearly defined by ES3. In some implementations, this changes the behaviour of functions, objects, arrays and strings created using literals; in others it does not. As it happens, we do not need to rebind Array or String (which have useful "static" methods, i.e. methods referenced as «Array.method(...)» or «String.method(...)»), and so those can remain first-class (although not shadowable). LEX_IDENTIFIER_SYNTAX: Identifiers MUST follow the syntax given below. :: as defined in ECMA-262 section 7.6 :: [\$A-Z_a-z\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff] :: 0-9 :: t h i s :: $ Rationale: Letters from ISO-Latin-1 are supported in identifiers. Serious consideration was given to supporting full Unicode. However, - the only script that is adequately supported in the common set of characters allowed by Safari, IE6 and Mozilla, is extended Latin. Allowing extended Latin, but no other scripts in identifiers would not provide a compelling complexity / functionality tradeoff. - Jacaranda defines whether a property name is "exposed" or not based partly on whether the name starts with a lowercase letter. For consistency, the definition of lowercase should be internationalised if Unicode identifiers are allowed, but that would imply doing a Unicode property lookup on each access of an object property. [[CHECK]] Note that this syntax does not allow the use of \u escapes in identifiers. JavaScript implementations do not implement these escapes consistently (or in conformance to ES3) in the case where the unescaped identifier spells a reserved word. Since \u escapes are essentially useless outside string and regexp literals (and would be even if Jacaranda supported Unicode identifiers), it is simpler to disallow them completely, than to do so only for identifiers that would spell a reserved word. and are kinds of , which means that the token classes are no longer entirely disjoint, as they are in the ES3 grammar. This is not likely to cause any significant implementation problem (if it does, then it is possible to change uses of etc. to , and add explicit attribute checks enforcing that the identifier is of the correct form). LEX_TOKENISATION: For clarification, ECMA-262 section 7.3 specifies that "A line terminator cannot occur within any token, not even a string." Therefore, the "line continuation" syntax supported by some ECMAScript implementations, in which a line terminator occurs immediately after a «\» code unit, MUST NOT be used in Jacaranda. The following restrictions are needed to ensure that a Jacaranda program will not trigger semicolon insertion as defined in ECMA-262 section 7.9: * A postfix ++ or -- operator MUST be on the same line as the last token of its operand. * The next token (i.e. identifier, expression, or semicolon) after a 'continue', 'break', 'return' or 'throw' keyword MUST be on the same line as that keyword. As specified in ECMA-262 section 7.8.3, the source character immediately following a MUST NOT be an or . is treated as a distinguished element of the input sequence distinct from any other code unit. A source text is tokenised by appending and then applying the following attribute grammar: :: ε { text.tokens = [], text.lineBreak = false, text.partial = '' } { text.tokens = left.tokens ++ (left.partial === '' ? [] : [left.partial]) ++ e.tokens; text.lineBreak = (left.partial === '' and left.lineBreak) or e.lineBreak; text.partial = ''; text.errors = check(left.partial === '' or left.partial is , text, left.partial, 'invalid token') union check(not(last(ltokens) in {'continue', 'break', 'return', 'throw'} and e.lineBreak), text, last(ltokens), 'misplaced line terminator') union left.errors union e.errors} / { text.tokens = left.tokens ++ ['/']; text.lineBreak = false; text.partial = c; } { // This is supposed to implement the longest-match rule (for tokens not starting with / or ' or "). // I'm not sure it is correct yet. if (not((left.partial ++ c) is )) { if (left.partial ++ c === '++' and not left.lineBreak) { token = ; } else if (left.partial ++ c === '--' and not left.lineBreak) { token = ; } else { token = left.partial; } text.tokens = left.tokens ++ token; text.lineBreak = false; text.partial = c; text.errors = check(token is , text, token, 'invalid token') union check(not(token is and (c is or c is )), text, token, 'NumericLiteral cannot be followed by IdentifierStart or DecimalDigit'); } else { text.tokens = left.tokens; text.lineBreak = left.lineBreak; text.partial = left.partial ++ c; } } :: { e.tokens = []; e.lineBreak = false } { e.tokens = []; e.lineBreak = true } / * { e.tokens = (c === 'const*/' or c === 'fallthru*/') ? ['/*' ++ c] : []; e.lineBreak = c.lineBreak } / / { e.tokens = []; e.lineBreak = true } / = { e.tokens = ['/=']; e.lineBreak = false } { e.tokens = [sl]; e.lineBreak = false } { e.tokens = []; e.lineBreak = false } :: but not * or = :: but not or The following _syntactic_ productions are also needed: : ++ : -- Rationale: ECMA-262 defines the process of tokenising a source text only informally. The above attribute grammar formalises this process. [[I only recently realized that it was possible to do this using an attribute grammar, so more thorough checking is needed to ensure that it captures all of the hairy special cases in lexing JavaScript.]] A left-to-right lexer requires lookahead of up to two code units in order to distinguish the alternatives of the production. [[Explain how and work.]] Module Definitions ================== MODULE_DEFINITION_SYNTAX: A module definition MUST be parseable according to the Jacaranda lexical grammar; then, the resulting token stream MUST be parseable as the following production. Each of the 'imports' and 'optionalImports' properties of a module, if present, MUST be a literal array of string literals, none of which are a . The *import set* of a module is the union of the set of strings that occur in the module's 'import' property (or the empty set if not present) and its 'optionalImports' property (or the empty set if not present), and the implicit imports defined below. For each freely-used identifier of a module (as defined by the SCOPE_OF_IDENTIFIERS rule), a string corresponding to the identifier name MUST occur in the module's import set. A module MUST NOT have any freely-assigned identifiers. [[FIXME: choose which module boilerplate to use]] : $module ( function ( _ ) { with ( _ ) return { } } ); OR ( function ( ) { eval ( $module ( ) ) ; } ) ( ) ; { md.errors = { error(mlit, "freely-used identifier " ++ quote(id) ++ " not declared as an import") | id <- mlit.freelyUsed - mlit.imports - mlit.optionalImports - IMPLICIT_IMPORTS} union { error(mlit, "freely-assigned identifier " ++ quote(id) ++ " at module level") | id <- mlit.freelyAssigned } : // SEF stands for "side-effect free" { } : { nvl.names = {nv.name} } , { nvl.names = list.names union {nv.name}, nvl.errors = check(nv.name not in list.names, nvl, nv.name, 'duplicate property name in module object literal') } : : : { nv.name = n, if (n === 'imports') { nv.imports = literalStringsToSet(v.value); nv.errors = check(nv.imports !== undefined, nv, 'value of imports property is not a list of string literals') union { error(id is , nv, 'imported identifier is protected') | id <- nv.imports } union n.errors union v.errors; } } if (n === 'optionalImports') { nv.optionalImports = literalStringsToSet(v.value); nv.errors = check(nv.optionalImports !== undefined, nv, 'value of optionalImports property is not a list of string literals') union { error(id is , nv, 'optionally imported identifier is protected') | id <- nv.optionalImports } union n.errors union v.errors; } } } [[SEFPropertyNameAndValue should *not* allow 'this' in function expressions]] : null true false - { t.errors = check('this' notin f.freelyUsed, f, "'this' used freely by a function referenced in a module property") union f.errors } : { t.error = check(s.value in , t, "string literals in module definitions must be printable") } : [ opt ] : { sel.value = undefined } , { sel.value = undefined } { sel.value = exp.value } , { sel.value = (exp.value === undefined) ? undefined : list.value ++ [exp.value] } : as in ECMA-262 section 11.1.4 : see the NO_NAMED_FUNCTION_EXPRESSION rule function literalStringsToSet(list) { if (list is not an array of strings) return undefined; return the set of strings in list; } IMPLICIT_IMPORTS = { 'Array', 'Boolean', 'Class', 'ConstArray', 'ConstDate', 'ConstRegExp', 'Date'??, 'decodeURI', 'decodeURIComponent', 'encodeURI', 'encodeURIComponent', 'Error', 'EvalError', 'Infinity', 'isFinite', 'isNaN', 'Math'??, 'NaN', 'Number', 'parseFloat'??, 'parseInt'??, 'RangeError', 'ReferenceError', 'RegExp'??, 'String', 'SyntaxError', 'TypeError', 'undefined', 'URIError' } [[FIXME: imports cannot replace 'Array', 'Object' etc. Specify this in $makeCaplet?]] Rationale: An object literal is used as the argument to enable extensibility. Evaluating the object literal itself must have no side effects (because it will be evaluated before the call to $module). This is not a significant constraint in practice, since the main code of the module can be defined in a function expression. A deeply immutable copy of the object created from the object literal will be made accessible to caplets created from this module. This allows caplets created from the same module to share an arbitrary amount of immutable data (without needing one copy of it per caplet). Note that only anonymous function expressions are allowed, since a named function expression might have the side effect of defining the name in the surrounding scope. (This restriction applies to Jacaranda in general, not just the top level of a module definition -- see NO_NAMED_FUNCTION_EXPRESSION.) The module definition could technically have been passed in as a string literal and then 'eval'ed, but that option was quickly rejected because it would require excessive use of quoting in the module code. does not attempt to be the largest possible side-effect-free subset of ES3 expressions, just one that is sufficiently expressive for any meta-information that we might want to use to describe a module. In fact it is essentially JSON, except that: - anonymous functions are included; - comments are included; - string literals use the syntax specified by ES3, not JSON; - property names in object literals can be identifiers, not only string literals. [Thanks to Kris Zyp for the idea that "JSON plus anonymous functions" is a potentially useful sublanguage of JavaScript, although I'm not using it exactly as he suggested.] Trailing commas are not allowed in , because Jacaranda also does not allow them in (see ARRAY_LITERAL_NO_TRAILING_COMMAS below). One way to submit a module to a Jacaranda interpreter is to simply execute the module definition as ECMAScript code -- for example in a