Tokens - Acorn Reference

This chapter describes the atomic tokens used to build all Acorn programs. These are assembled together using the grammar described by the Extended BNF chapter.

Unicode characters

An Acorn program is a stream of UTF-8 encoded Unicode characters. If the stream starts with the byte-order mark (U+FEFF), that mark is ignored. The program ends at the first U+0000 character or at the end of the stream, whichever comes first.
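
As a rough illustration of these rules (not Acorn's actual loader), the following Python sketch decodes a byte stream, drops a leading byte-order mark, and cuts the program at the first null character. The function name load_source is hypothetical.

```python
BOM = "\ufeff"   # byte-order mark
NUL = "\u0000"   # null character, treated as end of program


def load_source(raw: bytes) -> str:
    """Hypothetical helper: return the program text a tokenizer would see."""
    text = raw.decode("utf-8")
    # A byte-order mark at the very start of the stream is ignored.
    if text.startswith(BOM):
        text = text[len(BOM):]
    # The program ends at the first U+0000, or at end-of-stream if there is none.
    end = text.find(NUL)
    return text if end == -1 else text[:end]
```

For example, load_source(b"\xef\xbb\xbfa=1\x00junk") yields the three characters a=1.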

Lines and Tokens

A program consists of one or more lines, separated from each other using the line-feed (U+000A) character. Line numbering begins with 1.

A program also consists of one or more tokens, each made up of one or more characters. Each token's type is determined by the first character of the token. The token type sets the rules for which of the following characters are also part of that token. Tokens can sometimes span more than one line.

Certain token types (more fully described in sections below) are used to generate executable code:

Number, which begins with a numeric digit from '0' to '9'.
Text, which begins with a double quote (").
Symbol, which begins with a single quote (').
Name, which begins with any of the following: a Latin letter ('a' through 'z' or 'A' through 'Z'), a dollar sign '$', an underscore '_', or any Unicode character above U+009F locally defined as a letter.
Operator, which begins with one of the common punctuation characters between U+0021 and U+007F.

In addition, these token types play an important role:

Comment, which begins with a sharp sign '#'.
Line indent, which begins with a tab (U+0009) at the beginning of a line (and not contained within a multi-line text, symbol or comment).
End-of-file, which is either the null character (U+0000) or the end of the program.
White space, which should typically be the space character (U+0020), but could be any character not matching the starting character for the other tokens listed above.
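
The first-character dispatch described above can be sketched in Python as follows. This is an illustration under assumptions, not Acorn's tokenizer: the function name token_kind is hypothetical, Python's str.isalpha() stands in for "locally defined as a letter", and string.punctuation stands in for "the common punctuation characters".

```python
import string


def token_kind(ch: str, at_line_start: bool = False) -> str:
    """Hypothetical helper: classify a token by its first character."""
    if ch == "" or ch == "\u0000":
        return "end-of-file"
    if ch == "\t" and at_line_start:
        return "line indent"
    if "0" <= ch <= "9":
        return "number"
    if ch == '"':
        return "text"
    if ch == "'":
        return "symbol"
    if ch == "#":
        return "comment"
    if (ch.isascii() and ch.isalpha()) or ch in "$_":
        return "name"
    # Assumption: isalpha() approximates "locally defined as a letter".
    if ch > "\u009f" and ch.isalpha():
        return "name"
    # Assumption: any remaining ASCII punctuation starts an operator.
    if ch in string.punctuation:
        return "operator"
    # Anything that starts no other token is treated as white space.
    return "white space"
```

Note that the order of the checks matters: '#', '$', '_' and the quote marks are punctuation characters too, so they must be tested before the operator case.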

The sections below describe each of these tokens in greater detail.

Line Indentation, Blocks and Statements

With most programming languages, the indented-line layout of a program's code is irrelevant to the computer. However, with some languages (e.g., Python and CoffeeScript) the line layout does matter. So it is with Acorn, which adheres to the off-side rule. The benefit is code less cluttered with statement-end and block delimiters.

In most cases, the concept is simple and obvious:

Each line is a statement, implicitly understood to be terminated with a semi-colon ';'.
Each successive level of tab indentation at the start of lines begins a block attached to the preceding line. Each such block is implicitly surrounded by curly braces.

So:

a=1
while a<4
	wander(a)
	if outside?
		a = 2
	a = 3

Is the same as:

a=1; while a<4 {wander(a); if outside? {a = 2;}; a = 3;};

As the latter example shows, single-line brevity is allowed. Use the semicolon to pack multiple statements onto a line. Likewise, wrap curly braces around a block placed on the same line as the "preceding" line.
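
To make the equivalence concrete, here is a minimal Python sketch that rewrites tab-indented lines into the single-line form with semicolons and curly braces. It is only an illustration of the rule: the function name to_braces is hypothetical, and the sketch deliberately ignores line continuations, multi-line literals, and comments that trail a statement.

```python
def to_braces(source: str) -> str:
    """Hypothetical helper: rewrite indented lines using ';' and '{ }'."""
    # Keep (indent depth, statement text) for each meaningful line.
    lines = []
    for raw in source.split("\n"):
        text = raw.lstrip("\t")
        if text.strip() == "" or text.lstrip().startswith("#"):
            continue  # blank and comment-only lines do not affect indentation
        lines.append((len(raw) - len(text), text.strip()))

    def block(i: int, depth: int):
        """Render consecutive statements at `depth` or deeper, starting at i."""
        parts = []
        while i < len(lines) and lines[i][0] >= depth:
            cur, stmt = lines[i]
            i += 1
            # Deeper lines that follow form a block attached to this statement.
            if i < len(lines) and lines[i][0] > cur:
                body, i = block(i, lines[i][0])
                stmt += " {" + body + "}"
            parts.append(stmt + ";")
        return " ".join(parts), i

    return block(0, 0)[0]
```

Applied to the six indented lines of the example above, this sketch produces exactly the single-line form shown after "Is the same as:".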

Note: The leading tabs on a line are ignored when:

The line is part of a multi-line comment or symbol/text literal.
It is a blank line, containing only white space or comments.

Line continuations

Sometimes, one wants to split a statement across multiple lines. To ensure Acorn does not automatically insert a semicolon at the end of each line, begin the continuation lines with a backslash at the same indentation:

a = 
\ b or
\ c       # equivalent to: a = b or c

This backslash works properly even when a statement has an indented block in the middle:

# Below equivalent to: abc {monkey: foo;} if condition;
abc
	monkey: foo
\ if condition
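
The simple case (no indented block between the first line and its continuations) can be sketched in Python as below. The function name join_continuations is hypothetical, and the sketch makes simplifying assumptions: it does not check that the backslash sits at the same indentation as the first line, and it does not handle a block in the middle of a statement.

```python
def join_continuations(source: str) -> list[str]:
    """Hypothetical helper: merge backslash-led lines into the line before."""
    logical = []
    for raw in source.split("\n"):
        stripped = raw.lstrip("\t")
        if stripped.startswith("\\") and logical:
            # Continuation: append its text to the previous logical line,
            # so no implicit semicolon separates the two.
            logical[-1] = logical[-1].rstrip() + " " + stripped[1:].strip()
        else:
            logical.append(raw)
    return logical
```

With the first example above (minus its comment), the three physical lines collapse into the single logical line a = b or c.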

Comments

Comments document the code to make it more easily understood by people. Comments have no impact on program execution.

Comments may be placed between any two tokens, but not within a text or symbol literal. A comment always begins with the sharp character '#'.

There are two types of comments:

A line comment begins with a sharp sign ('#') and continues to the end of the line.
A block comment starts with ###. It ends at the next ###, extending across multiple lines, if desired.
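
A small Python sketch of how a scanner might skip both comment types while leaving text and symbol literals untouched. The function name strip_comments is hypothetical, and treating an unterminated block comment as running to the end of the program is an assumption based on the End-of-File rules.

```python
def strip_comments(source: str) -> str:
    """Hypothetical helper: remove line and block comments from source."""
    out = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # Copy a text or symbol literal verbatim, honoring backslash
            # escapes, so a '#' inside it is not mistaken for a comment.
            j = i + 1
            while j < n and source[j] != ch:
                j += 2 if source[j] == "\\" else 1
            out.append(source[i:j + 1])
            i = j + 1
        elif source.startswith("###", i):
            # Block comment: skip to just past the next '###'.
            end = source.find("###", i + 3)
            i = n if end == -1 else end + 3
        elif ch == "#":
            # Line comment: skip to (but keep) the line feed.
            end = source.find("\n", i)
            i = n if end == -1 else end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

The block-comment test comes before the line-comment test because both start with the same character.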

White Space

Spaces are regularly used to improve code clarity and separate tokens. Except within text or symbol literals, spaces are ignored. Other control or unexpected characters, such as tabs not at the start of a line or the carriage return (U+000D), are also ignored.

End-of-File

The program code ends when it reaches a null character (U+0000) or has no more characters. Acorn will tidy up whatever is unfinished (e.g., still open blocks, literals, or comments) and then will complete the compilation.

Number Token

A number token can represent either an Integer or a Float literal. Both begin with a digit character from '0' to '9'. The token is a Float if it contains a decimal point ('.') or an exponent marker ('e' or 'E').

Note: A negative number (e.g., -123) is actually two tokens, the operator '-' and the number '123'. It must be done this way, as a '-' before a number might instead be a minus operator.

Integer Literal

An Integer literal may be:

Decimal integer. A simple sequence of numeric digits which stops at the first non-digit character.
Hexadecimal integer. Starts with '0x' followed by hexadecimal digits: numeric digits, or 'a'-'f' (or 'A'-'F') to represent the values ten through fifteen.

Float Literal

A Float literal is a sequence of numeric digits that starts with a digit from '0' to '9' and contains exactly one decimal point ('.').

The Float token may end with an exponent, which is indicated by an 'e' or 'E' followed by an optional minus sign and additional numeric digits.

A number token that lacks a decimal point or exponent will be considered an Integer.
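
The number rules can be summarized with a short Python sketch. It reflects one reading of this section rather than Acorn's actual scanner: the name scan_number is hypothetical, at least one digit is assumed to be required after the decimal point, and a token with an exponent but no decimal point is assumed to be a Float.

```python
import re

# Hexadecimal integer, or decimal digits with optional fraction and exponent.
_NUMBER = re.compile(r"""
    (?P<hex>0x[0-9a-fA-F]+)
  | (?P<dec>\d+) (?P<frac>\.\d+)? (?P<exp>[eE]-?\d+)?
""", re.VERBOSE)


def scan_number(text: str, pos: int = 0):
    """Hypothetical helper: return (kind, value, end index), or None."""
    m = _NUMBER.match(text, pos)
    if m is None:
        return None  # not a number token (e.g. starts with '-' or a letter)
    if m.group("hex"):
        return ("Integer", int(m.group("hex"), 16), m.end())
    if m.group("frac") or m.group("exp"):
        return ("Float", float(m.group()), m.end())
    return ("Integer", int(m.group()), m.end())
```

Because a number token must begin with a digit, scan_number("-123") returns None; the '-' is left for the operator scanner.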

Text or Symbol Token

Text and Symbol tokens look nearly identical, except that a Text token begins and ends with the double quote (") and the Symbol token begins and ends with the single quote (').

Any Unicode character may appear between the opening and closing quote marks. However, carriage return (U+000D), line feed (U+000A) and line-beginning tabs (U+0009) are stripped and will not appear in the resulting literal's value. Why strip them? So that multi-line text literals may be indented to match the code they are part of, without that code formatting being imposed wherever the text is ultimately displayed.

Use one of these backslash '\' escape sequences to inject control, delimiter or other characters where they belong in the resulting literal:

\n: New-line (U+000A)
\r: Return (U+000D)
\t: Tab (U+0009)
\char: char character (typically: \\, \' or \")
\unnnn: Unicode character whose code point is given by the four hexadecimal digits nnnn.
\Unnnnnnnn: Unicode character whose code point is given by the eight hexadecimal digits nnnnnnnn.
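
The stripping and escape rules interact: characters produced by an escape sequence must not be stripped again. The Python sketch below shows one way to compute a literal's value in a single pass. The function name literal_value is hypothetical, and the input is assumed to be the characters between the quote marks with well-formed \u and \U escapes.

```python
def literal_value(body: str) -> str:
    """Hypothetical helper: compute the value of a text or symbol literal."""
    simple = {"n": "\n", "r": "\r", "t": "\t"}
    out = []
    i, n = 0, len(body)
    line_start = False  # True right after a raw line feed
    while i < n:
        ch = body[i]
        if ch == "\\" and i + 1 < n:
            code = body[i + 1]
            if code in simple:
                out.append(simple[code])
                i += 2
            elif code in "uU":
                # \u takes four hexadecimal digits, \U takes eight.
                width = 4 if code == "u" else 8
                out.append(chr(int(body[i + 2:i + 2 + width], 16)))
                i += 2 + width
            else:
                out.append(code)  # \\, \' or \" yield the character itself
                i += 2
            line_start = False
        elif ch == "\n":
            line_start = True  # raw line feeds are stripped
            i += 1
        elif ch == "\r" or (ch == "\t" and line_start):
            i += 1  # raw carriage returns and line-beginning tabs are stripped
        else:
            out.append(ch)
            line_start = False
            i += 1
    return "".join(out)
```

A raw line feed followed by tabs disappears from the value, while an escaped \n or \t survives, which is what lets a long literal be indented along with the surrounding code.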

Name Token

A name is typically made up of letters. A letter is a lower-case Latin letter from 'a' to 'z', an upper-case Latin letter from 'A' to 'Z', or (for now) any Unicode character above U+00A0.

The characters allowed in a name vary depending on its position:

The first character must be a letter, '$' or '_'.
Subsequent characters may be a letter, digit, '$' or '_'.
The final character may also be a '?'.

The following are all valid names:

balance toReturn True _temp_ $ funny? π

The case of the letters is significant. Thus, 'abc' is a different name than 'ABC'.
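
The positional rules for names map naturally onto a regular expression. The Python sketch below is an illustration only: the name is_name is hypothetical, and "any Unicode character above U+00A0" is taken literally as the range U+00A1 through U+10FFFF.

```python
import re

# Letters: 'a'-'z', 'A'-'Z', or (assumed) any character above U+00A0.
_LETTER = "A-Za-z\u00a1-\U0010ffff"

# First character: letter, '$' or '_'. Then letters, digits, '$' or '_'.
# The final character may also be '?'.
_NAME = re.compile("[" + _LETTER + "$_][" + _LETTER + "0-9$_]*[?]?")


def is_name(token: str) -> bool:
    """Hypothetical helper: report whether token is a well-formed name."""
    return _NAME.fullmatch(token) is not None
```

Every name in the list of valid names above passes this check, while a token such as 9lives (leading digit) or fun?ny ('?' not in final position) does not.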

In certain contexts, a name is treated as a literal symbol rather than a variable. This happens when the name is immediately followed by ':' (with no intervening white space), or preceded by '.', '.:' or '::'.

Certain names are reserved for exclusive use by Acorn and may not be used as variables.

URL Token

The '@' operator followed by a URL or IRI address is treated as a static Resource literal. The end of the URL is detected by the first space, tab, line-feed or carriage-return found.

If the character right after the '@' is a space, tab, line-feed, carriage-return, ", ', or (, then it is not treated as a single token, but as the '@' operator token followed by tokenizing the subsequent characters as normal.
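
A minimal Python sketch of this decision follows. The function name scan_resource and the example address are hypothetical, and treating an '@' at the very end of the program as a plain operator is an assumption the text does not cover.

```python
def scan_resource(text: str, pos: int = 0):
    """Hypothetical helper: return (url, end index) for a Resource literal,
    or None when '@' should be tokenized as a plain operator."""
    if not text.startswith("@", pos):
        return None
    nxt = text[pos + 1:pos + 2]
    # These followers mean '@' is just an operator, not the start of a URL.
    if nxt == "" or nxt in " \t\n\r\"'(":
        return None
    # The URL runs up to the first space, tab, line-feed or carriage-return.
    end = pos + 1
    while end < len(text) and text[end] not in " \t\n\r":
        end += 1
    return text[pos + 1:end], end
```

For instance, scanning @http://example.com/a.acn next yields the address http://example.com/a.acn and leaves the scanner positioned at the space before next.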

Operator Token

An operator is a sequence of one or more punctuation characters with no intervening white space. Acorn is greedy, and will look for the longest character sequence that matches one of these operators. See Reserved Operators for a list.
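
Greedy matching can be sketched in Python by trying the longest candidate first. The operator set below is a hypothetical sample chosen for illustration (only '.', ':', '.:' and '::' are mentioned in this chapter); the real list is in Reserved Operators. The name scan_operator is also hypothetical.

```python
# Hypothetical sample; see Reserved Operators for the actual list.
OPERATORS = {"+", "-", "=", "==", "<", "<=", "<=>", ".", ":", ".:", "::"}
_LONGEST = max(len(op) for op in OPERATORS)


def scan_operator(text: str, pos: int = 0):
    """Hypothetical helper: return the longest operator at pos, or None."""
    # Try the longest possible candidate first, then shorter ones.
    for length in range(min(_LONGEST, len(text) - pos), 0, -1):
        candidate = text[pos:pos + length]
        if candidate in OPERATORS:
            return candidate
    return None
```

With this sample set, the characters <=> are read as one three-character operator rather than as '<' followed by '=' and a stray '>'.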