The Janus Language Specification

Introduction

Janus is a simple, dynamically and strongly typed, interpreted language aimed at being as easy as possible to implement.

This document is the primary reference for the Janus programming language grammar and semantics.

This document is heavily inspired by, and based on, The Rust Reference and Rust Grammar, including some paragraphs copied from them directly.

Notation

Janus' grammar is defined over Unicode codepoints, each conventionally denoted U+XXXX, for 4 or more hexadecimal digits X. Most of Janus' grammar is confined to the ASCII range of Unicode, and is described in this document by a dialect of Extended Backus-Naur Form (EBNF) which can be defined self-referentially as follows:

grammar        : rule+
rule           : nonterminal ':' productionrule
productionrule : production [ '|' production ]*
production     : term+
term           : element repeats
element        : LITERAL | IDENTIFIER | '[' productionrule ']'
repeats        : [ '*' | '+' ] NUMBER? | NUMBER? | '?'

Where:

Whitespace in the grammar is ignored.
Square brackets are used to group rules.
LITERAL is a single printable ASCII character, or an escaped hexadecimal ASCII code of the form \xQQ, in single quotes, denoting the corresponding Unicode codepoint U+00QQ.
IDENTIFIER is a nonempty string of ASCII letters and underscores.
The repeats forms apply to the adjacent element, and are as follows:
- ? means zero or one repetition
- * means zero or more repetitions
- + means one or more repetitions
- NUMBER trailing a repeat symbol gives a maximum repetition count
- NUMBER on its own gives an exact repetition count

This EBNF dialect should hopefully be familiar to many readers.
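To make the repeat forms concrete, the following Python sketch maps the text of a repeats suffix to a (minimum, maximum) pair of repetition counts. The function name repeat_bounds is hypothetical, and treating an empty suffix as "exactly once" is an assumption that the grammar implies but does not state.

```python
def repeat_bounds(repeats: str):
    """Return (min, max) repetitions for a repeats suffix; max=None means unbounded."""
    parts = repeats.split()
    if not parts:
        # Assumption: an element with no suffix occurs exactly once.
        return (1, 1)
    if parts == ["?"]:
        return (0, 1)
    if parts[0] in ("*", "+") and len(parts) <= 2:
        low = 0 if parts[0] == "*" else 1
        # A NUMBER trailing a repeat symbol is a maximum repetition count.
        high = int(parts[1]) if len(parts) == 2 else None
        return (low, high)
    if len(parts) == 1 and parts[0].isdigit():
        # A NUMBER on its own is an exact repetition count.
        count = int(parts[0])
        return (count, count)
    raise ValueError(f"not a valid repeats suffix: {repeats!r}")
```

Under this reading, the fragment hex_digit 2 used later in common_escape requires exactly two digits, while hex_digit+ 6 allows between one and six.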

Unicode productions

A few productions in Janus' grammar permit Unicode codepoints outside the ASCII range. We define these productions in terms of character properties specified in the Unicode standard, rather than in terms of ASCII-range codepoints. The section Special Unicode Productions lists these productions.

String table productions

Some rules in the grammar — notably unary operators, binary operators, and keywords — are given in a simplified form: as a listing of a table of unquoted, printable whitespace-separated strings. These cases form a subset of the rules regarding the token rule, and are assumed to be the result of a lexical-analysis phase feeding the parser, driven by a DFA, operating over the disjunction of all such string table entries.

When such a string enclosed in double-quotes (") occurs inside the grammar, it is an implicit reference to a single member of such a string table production. See tokens for more information.

Lexical structure

Input format

Janus input is interpreted as a sequence of Unicode codepoints encoded in UTF-8. Most Janus grammar rules are defined in terms of printable ASCII-range codepoints, but a small number are defined in terms of Unicode properties or explicit codepoint lists.

Special Unicode Productions

The following productions in the Janus grammar are defined in terms of Unicode properties: ident, non_null, non_eol, non_single_quote and non_double_quote.

Identifiers

The ident production is any nonempty Unicode string of the following form:

The first character has property XID_start
The remaining characters have property XID_continue

that does not occur in the set of keywords.

Note: XID_start and XID_continue as character properties cover the character ranges used to form the more familiar C and Java language-family identifiers.
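As a rough illustration of the ident rule, the Python sketch below leans on str.isidentifier, which accepts a character with XID_Start (or an underscore) followed by characters with XID_Continue. The helper name is_ident is hypothetical, the keyword set is copied from the Keywords section, and the result depends on the Unicode database bundled with the Python version in use, so this is an approximation rather than a normative check.

```python
# Keyword list as given in the Keywords section of this document.
KEYWORDS = {
    "and", "as", "break", "case", "class",
    "const", "continue", "do", "else", "enum",
    "False", "fn", "for", "if", "in",
    "let", "loop", "mod", "or", "return",
    "trait", "True", "type", "while", "yield",
}

def is_ident(text: str) -> bool:
    """Approximate the ident production: XID_start, then XID_continue*, not a keyword."""
    if not text.isidentifier():
        return False
    # Python also allows a leading underscore, but '_' is not XID_start.
    if text.startswith("_"):
        return False
    return text not in KEYWORDS
```

Because keywords are case-sensitive, is_ident("true") holds while is_ident("True") does not.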

Delimiter-restricted productions

Some productions are defined by exclusion of particular Unicode characters:

non_null is any single Unicode character aside from U+0000 (null)
non_eol is non_null restricted to exclude U+000A ('\n')
non_single_quote is non_null restricted to exclude U+0027 (')
non_double_quote is non_null restricted to exclude U+0022 (")

Miscellaneous productions

These productions do not have any special Janus grammar meaning, but are defined in order to simplify definitions of more sophisticated productions.

hex_digit   : 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
            | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
            | dec_digit
oct_digit   : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7'
dec_digit   : '0' | nonzero_dec
nonzero_dec : '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Comments

comment            : block_comment | line_comment
block_comment      : "/*" block_comment_body "*/"
block_comment_body : [ block_comment | character ]*
line_comment       : "//" non_eol*
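Since block_comment_body may itself contain a block_comment, block comments nest. A minimal Python sketch of how a lexer might skip one is shown below; the function name skip_block_comment is hypothetical, and because the character production is not defined in this document, the sketch assumes it means any single codepoint.

```python
def skip_block_comment(src: str, start: int) -> int:
    """Return the index just past the block comment that begins at src[start]."""
    if not src.startswith("/*", start):
        raise ValueError("no block comment at this position")
    depth = 0
    i = start
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif src.startswith("*/", i):
            depth -= 1          # leaving the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1              # assumed meaning of `character`: any codepoint
    raise ValueError("unterminated block comment")
```

A depth counter is enough here because the only nested structure inside a comment is another comment.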

Whitespace

whitespace_char : '\x20' | '\x09' | '\x0a' | '\x0d'
whitespace      : [ whitespace_char | comment ]+

Tokens

token : [ keyword | op | ident | literal | symbol ] whitespace

Keywords


and	as	break	case	class
const	continue	do	else	enum
False	fn	for	if	in
let	loop	mod	or	return
trait	True	type	while	yield

Keywords are case-sensitive. Each of these has special meaning in its grammar, and all of them are excluded from the ident rule.

Not all of these keywords are used by the language as of now. Some of them were reserved to make space for possible future features.

Literals

literal : string_lit | char_lit | num_lit | bool_lit | unit_lit

Character and string literals

char_lit   : '\x27' char_body '\x27'
string_lit : '"' string_body* '"'

char_body : non_single_quote
          | '\x5c' [ '\x27' | common_escape ]

string_body : non_double_quote
            | '\x5c' [ '"' | common_escape ]

common_escape  : '\x5c' | 'n' | 'r' | 't' | '0'
               | 'x' hex_digit 2
               | 'u' '{' hex_digit+ 6 '}'
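The grammar above only gives the shape of escape sequences. The Python sketch below decodes them under the assumption that they carry their conventional meanings (for example, \n is a line feed and \x41 is U+0041), which this document does not state explicitly. The name unescape is hypothetical, and for brevity the sketch accepts both \' and \" in either kind of literal, which is slightly more permissive than the grammar.

```python
import string

# Assumed meanings of the single-character escapes.
SIMPLE_ESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t", "0": "\0",
                  "'": "'", '"': '"'}

def _is_hex(digits: str) -> bool:
    return all(c in string.hexdigits for c in digits)

def unescape(body: str) -> str:
    """Decode the escape sequences in the body of a char or string literal."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[kind])
            i += 2
        elif kind == "x":
            digits = body[i + 2:i + 4]          # hex_digit 2: exactly two digits
            if len(digits) != 2 or not _is_hex(digits):
                raise ValueError("\\x needs exactly two hex digits")
            out.append(chr(int(digits, 16)))
            i += 4
        elif kind == "u":
            close = body.find("}", i + 2)
            digits = body[i + 3:close]          # hex_digit+ 6: one to six digits
            if (body[i + 2:i + 3] != "{" or close == -1
                    or not 1 <= len(digits) <= 6 or not _is_hex(digits)):
                raise ValueError("malformed \\u{...} escape")
            out.append(chr(int(digits, 16)))
            i = close + 1
        else:
            raise ValueError(f"unknown escape \\{kind}")
    return "".join(out)
```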

Number literals

num_lit : nonzero_dec [ dec_digit | '_' ]* float_suffix?
        | '0' [       [ dec_digit | '_' ]* float_suffix?
              | 'b'   [ '1' | '0' | '_' ]+
              | 'o'   [ oct_digit | '_' ]+
              | 'x'   [ hex_digit | '_' ]+  ]

float_suffix : exponent | '.' dec_lit exponent?

exponent : ['E' | 'e' ] [ '-' | '+' ]? dec_lit

dec_lit : [ dec_digit | '_' ]+
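The num_lit rules can be transcribed almost mechanically into a regular expression. The Python sketch below does that; the names NUM_LIT and is_num_lit are hypothetical, and the pattern only checks the shape of a literal, not its value.

```python
import re

DEC_LIT = r"[0-9_]+"
EXPONENT = rf"[Ee][+-]?{DEC_LIT}"
FLOAT_SUFFIX = rf"(?:{EXPONENT}|\.{DEC_LIT}(?:{EXPONENT})?)"

NUM_LIT = re.compile(
    rf"(?:[1-9][0-9_]*{FLOAT_SUFFIX}?"                  # nonzero_dec ...
    rf"|0(?:b[01_]+|o[0-7_]+|x[0-9a-fA-F_]+"            # binary, octal, hex
    rf"|[0-9_]*{FLOAT_SUFFIX}?))"                       # decimal starting with 0
)

def is_num_lit(text: str) -> bool:
    """True if the whole of text matches the num_lit production."""
    return NUM_LIT.fullmatch(text) is not None
```

Read literally, the grammar accepts forms such as 007 and 0x_ff, and rejects 1. and .5 because a fractional part needs digits on both sides of the point.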

Boolean literals

bool_lit : "True" | "False"

The two values of the boolean type are written True and False.

Unit literal

unit_lit : "()"

Symbols

symbol : '[' | ']' | '(' | ')' | '{' | '}' | ',' | ';'

Symbols are a general class of printable tokens that play structural roles in a variety of grammar productions. They are cataloged here for completeness as the set of remaining miscellaneous printable tokens that do not otherwise appear as operators or keywords.

Language grammar

The entry rule of a Janus source file is called program.

program : whitespace? stmt*

Statements

stmt      : decl_stmt | subst_stmt | expr_stmt | ";"
decl_stmt : let_decl ";" | item ";"?
expr_stmt : expr ";"

Variable bindings

let_decl : "let" ident "=" expr

A variable binding introduces a new subscope containing the new variable. This prevents a variable from being visible before its declaration and helps the programmer avoid unexpected changes to a variable's value (though a value can still be changed explicitly with a substitution statement).

Substitution statements

subst_stmt : lvalue ":=" expr ";"
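One way to picture the difference between binding and substitution is a chain of scopes, where each let adds a new link and := updates an existing one. The Python sketch below models that idea; Scope, bind, lookup and substitute are hypothetical names, and the whole model is an assumption about how an implementation might work rather than something this specification prescribes.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Scope:
    """One link in the scope chain: a single variable plus the enclosing scope."""
    name: str
    value: Any
    parent: Optional["Scope"] = None

def bind(scope: Optional[Scope], name: str, value: Any) -> Scope:
    """Model `let name = value`: return a new subscope, leaving the old one untouched."""
    return Scope(name, value, scope)

def _find(scope: Optional[Scope], name: str) -> Scope:
    # Walk outwards until the nearest binding of `name` is found.
    while scope is not None:
        if scope.name == name:
            return scope
        scope = scope.parent
    raise NameError(name)

def lookup(scope: Optional[Scope], name: str) -> Any:
    return _find(scope, name).value

def substitute(scope: Optional[Scope], name: str, value: Any) -> None:
    """Model `name := value`: overwrite the nearest existing binding in place."""
    _find(scope, name).value = value
```

In this model, code that holds the scope from before a let cannot see the new variable, while a substitution is visible through every scope that shares the updated link.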

Items

item : fn_item

Functions

fn_item   : "fn" ident "(" fn_params? ")" block
fn_params : ident [ "," ident ]*

Blocks

A block is a sequence of statements, possibly ending with an expression. The return value of the block is the value of the last expression statement, or () otherwise.

block      : "{" stmt* "}"

Lvalues

An lvalue is a reference to something in memory (either a variable or an item).

lvalue   : index_lv | path
index_lv : path "[" expr "]"
path     : ident

Expressions

expr : literal_expr
     | block_expr
     | op_expr
     | if_expr
     | while_expr
     | loop_expr
     | break_expr
     | continue_expr
     | return_expr
     | lvalue_expr

literal_expr : literal
block_expr   : block
lvalue_expr  : lvalue

Operators

The op_expr production covers unary and binary operator expressions. For brevity, it is specified in this document by the following precedence table, in which a higher number binds more tightly:

Precedence	Operator	Associativity	Operation
20	`(...)`	n/a	Grouping
19	-	-	-
18	`... ( ... )`	left-to-right	Function call
17	`... ++`	n/a	Postfix increment
	`... --`	n/a	Postfix decrement
16	`! ...`	right-to-left	Logical NOT
	`~ ...`	right-to-left	Bitwise NOT
	`+ ...`	right-to-left	Unary plus
	`- ...`	right-to-left	Unary minus
	`++ ...`	n/a	Prefix increment
	`-- ...`	n/a	Prefix decrement
15	`... ** ...`	right-to-left	Exponentiation
14	`... * ...`	left-to-right	Multiplication
	`... / ...`	left-to-right	Division
	`... mod ...`	left-to-right	Remainder
13	`... + ...`	left-to-right	Addition
	`... - ...`	left-to-right	Subtraction
12	`... << ...`	left-to-right	Bitwise left shift
	`... >> ...`	left-to-right	Bitwise right shift
11	`... & ...`	left-to-right	Bitwise AND
10	`... ^ ...`	left-to-right	Bitwise XOR
9	`... | ...`	left-to-right	Bitwise OR
8	`... == ...`	left-to-right	Equality
	`... != ...`	left-to-right	Inequality
	`... < ...`	left-to-right	Less than
	`... > ...`	left-to-right	Greater than
	`... <= ...`	left-to-right	Less than or equal
	`... >= ...`	left-to-right	Greater than or equal
7	-	-	-
6	`... and ...`	left-to-right	Logical AND
5	`... or ...`	left-to-right	Logical OR
4	-	-	-
3	-	-	-
2	-	-	-
1	-	-	-
0	-	-	-
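A table like this is commonly implemented with precedence climbing. The Python sketch below applies that technique to a subset of the binary operators, using the precedence and associativity values from the table; unary operators, calls and most comparison and bitwise operators are left out, and the names BINARY, tokenize and parse are hypothetical. The parser returns nested (operator, left, right) tuples rather than evaluating anything.

```python
import re

# operator -> (precedence, is_right_associative), taken from the table above
BINARY = {
    "**": (15, True),
    "*": (14, False), "/": (14, False), "mod": (14, False),
    "+": (13, False), "-": (13, False),
    "==": (8, False), "<": (8, False),
    "and": (6, False),
    "or": (5, False),
}

TOKEN = re.compile(r"\s*(\*\*|==|[-+*/<()]|[A-Za-z_]\w*|\d+)")

def tokenize(text):
    """Split text into operator, parenthesis, name and number tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected input at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_atom(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)           # grouping restarts at the lowest precedence
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
        return inner
    if tok in BINARY or tok == ")":
        raise SyntaxError(f"unexpected token {tok!r}")
    return tok

def parse(tokens, min_prec=0):
    """Precedence climbing: consume operators whose precedence is >= min_prec."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in BINARY and BINARY[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, right_assoc = BINARY[op]
        # Right-associative operators let the same precedence recurse on the right.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = (op, left, right)
    return left
```

With these rules, 8 - 3 - 2 groups to the left, 2 ** 3 ** 2 groups to the right, and a or b and c parses the and first because it has the higher precedence.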

If expressions

if_expr : "if" expr
          block
          else_tail?

else_tail : "else" [ if_expr | block ]

The value of an if-else expression is the value of whichever block is executed: the if block or the else block. If the condition is false and no else block is provided, the expression evaluates to (), e.g.:

let a = if False { 1234 }; // a == ()

While loops

while_expr : "while" expr
             block

The while loop is similar to its counterparts in other languages, and it always returns ().

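The while loop is syntactic sugar for the following snippet, which tests the condition before each iteration and leaves the loop as soon as it is false:

loop {
    if !(/* condition */) { break };
    /* body */
}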

Infinite loops

loop_expr : "loop" block

loop always returns ().

Break expressions

break_expr : "break"

break yields no value to the surrounding code because it performs a jump; formally, it evaluates to ().

Continue expressions

continue_expr : "continue"

continue yields no value to the surrounding code because it performs a jump; formally, it evaluates to ().

Return expressions

return_expr : "return" expr

return yields no value to the surrounding code because it performs a jump; formally, it evaluates to ().
