Saturday, July 6, 2013

Clojure: Symbols, Reading, and Execution

I wanted to read a function name from a string, and apply that function to an argument. This seemingly simple task is surprisingly tricky in Lisp-like languages. The basic steps are straightforward: read the contents of a string, parse them as a function name, and apply the corresponding function to the desired arguments. But Lisp-based languages are homoiconic, and also tend to have support for read-time evaluation. The combination makes the reader powerful, and makes reading untrusted input dangerous.

I tried to rely on my prior Common Lisp knowledge, but that turned out to be information I needed to unlearn. There might still be errors in my understanding. And, while my writeup is accurate as of Clojure 1.5.1, the language is still fast evolving. Take the information here with a grain of salt.

Of Symbols and Namespaces

While reading, one has to deal with symbols, as they are invariably part of the input stream. In Common Lisp (CL) a symbol always belongs to a package, except when it doesn't. Every symbol the reader encounters is automatically interned into a package, so two symbols with the same name in the same package are always identical. The exception is that you can create uninterned symbols that don't belong to any package. These symbols aren't equal to each other even when they have the same name.

;; By default intern into the current package
? (intern "FOO")
FOO
NIL
;; A quoted symbol is interned
? (eq (intern "FOO") 'foo)
T
;; Two quoted symbols are the same
? (eq 'foo 'foo)
T
;; This is an uninterned symbol
? '#:foo
#:FOO
;; Two uninterned symbols aren't the same
? (eq '#:foo '#:foo)
NIL
;; This is how you create a new uninterned symbol
? (make-symbol "FOO")
#:FOO
;; Two created uninterned symbols still aren't the same
? (eq (make-symbol "FOO") (make-symbol "FOO"))
NIL
;; But an uninterned symbol is, as expected, identical to itself
? (let ((uninterned '#:foo)) (eq uninterned uninterned))
T

Clojure is different. Instead of packages we have namespaces, and quoted symbols aren't automatically interned into namespaces. You can ask ns-interns for a map of everything interned in a namespace. Quoting will give the uninterned symbol, even if there's an interned symbol with the same name. There's some more information and examples in this Google Groups post. The CL transcript above can be approximated in Clojure thus:

;; intern takes the namespace explicitly and returns the interned var
user> (intern 'user 'abc)
#'user/abc
;; Hash-quoted (#') returns the interned var, but only if previously interned
user> #'foo
CompilerException java.lang.RuntimeException: Unable to resolve var: foo in this context, compiling:(NO_SOURCE_PATH:1:602) 
user> #'abc
#'user/abc
;; Two hash-quoted symbols are the same
user> (identical? #'abc #'abc)
true
;; This is an uninterned symbol
user> 'abc
abc
;; Two uninterned symbols aren't the same
user> (identical? 'abc 'abc)
false
;; This is how you create a new uninterned symbol
user> (symbol "abc")
abc
;; Two created uninterned symbols still aren't the same
user> (identical? (symbol "abc") (symbol "abc"))
false
;; But an uninterned symbol is, as expected, identical to itself
user> (let [uninterned (symbol "abc")] (identical? uninterned uninterned))
true
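One thing the transcript doesn't show is that identity and equality part ways for Clojure symbols. The sketch below is my own illustration rather than part of the original transcript: it compares two freshly created symbols with = and identical?, then looks up an interned var through ns-interns. The names a, b, and v are made up for the example, and it interns into whatever the current namespace happens to be.

```clojure
;; Two symbols created from the same name are distinct objects...
(def a (symbol "abc"))
(def b (symbol "abc"))

(println (identical? a b)) ; prints false
;; ...but = compares symbols by namespace and name.
(println (= a b))          ; prints true

;; intern creates (or finds) a var in the given namespace and returns it.
;; The optional third argument sets the var's root value.
(def v (intern *ns* 'abc 42))

;; ns-interns maps symbols to the vars interned in a namespace.
;; The lookup works with a freshly quoted symbol because map lookup uses =.
(println (identical? v (get (ns-interns *ns*) 'abc))) ; prints true
(println @v)                                          ; prints 42
```

So code that looks symbols up in maps or compares them with = behaves the way a CL programmer would expect, even though identical? says otherwise.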

Keywords in Clojure are different, and they have a distinct appearance. This is also the case in CL, and here Clojure's behavior closely resembles that of CL:

  1. All keyword symbols begin with a :.
  2. Keywords are automatically interned. One needn't explicitly create an interned keyword.
  3. A corollary is that two keywords with the same printed representation will always be identical.

Keywords have a special role in Clojure, in their ability to access content in data structures. These properties are essential in enabling that role.
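A minimal sketch of those properties, using made-up names (k1, k2, m) purely for illustration: a keyword literal and a keyword built from a string turn out to be the same object, and a keyword can be called as a function to look itself up in a map.

```clojure
;; A keyword literal and one built at run time from a string.
(def k1 :color)
(def k2 (keyword "color"))

;; Keywords are interned automatically, so both are the same object.
(println (identical? k1 k2)) ; prints true

;; Keywords act as functions that look themselves up in a map.
(def m {:color "red"})
(println (:color m))         ; prints red
(println (:size m :unknown)) ; prints :unknown (default for a missing key)
```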

One of the read-time pitfalls possible in CL is interning a very large number of symbols into a package, which is effectively a memory leak. An interned symbol can be garbage collected only after it has been uninterned. And the automatic interning of symbols is a subtle bug that can creep into many programs. This concern isn't realized in Clojure for two reasons:

  1. Symbols aren't automatically interned, except keywords.
  2. Interned symbols are based on interned strings, so they can be garbage collected when there are no references to them.

The Pitfalls of read-string

Another type of problem occurs during read-time evaluation. The reader can execute code as it's reading an s-expression from a stream. Here's a far better writeup on the matter than I could have done.
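To make the danger concrete, here is a small sketch of my own (not taken from the linked writeup). It uses the #= reader macro with clojure.core/read-string, then shows two ways of refusing it: binding *read-eval* to false, and reading with clojure.edn, which ships with Clojure 1.5 and later. The helper throws? and the other names are made up for the example.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical helper: true if calling f throws an exception.
(defn throws? [f]
  (try (f) false (catch Exception _ true)))

;; With *read-eval* true, #= makes the reader evaluate the form it reads.
(def evaluated
  (binding [*read-eval* true]
    (read-string "#=(+ 1 2)")))
(println evaluated) ; prints 3

;; With *read-eval* false, the same input is rejected at read time.
(def blocked?
  (throws? #(binding [*read-eval* false]
              (read-string "#=(+ 1 2)"))))
(println blocked?) ; prints true

;; The edn reader has no #= at all, so it rejects the input as well.
(def edn-blocked? (throws? #(edn/read-string "#=(+ 1 2)")))
(println edn-blocked?) ; prints true

;; Ordinary forms come back as plain data and are never evaluated.
(def plain (edn/read-string "(+ 1 2)"))
(println plain) ; prints (+ 1 2)
```

Here the evaluated form is harmless arithmetic, but the same mechanism would run whatever the input asks for, which is why untrusted strings shouldn't go through the default reader.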

The Solution?

Add tools.reader (the org.clojure/tools.reader artifact, which provides the clojure.tools.reader namespaces) to your project.clj. Then the following works:

user=> (require 'clojure.tools.reader.edn)
nil
user=> (clojure.tools.reader.edn/read-string "clojure.core/list")
clojure.core/list
user=> (resolve (clojure.tools.reader.edn/read-string "clojure.core/list"))
#'clojure.core/list
user=> @(resolve (clojure.tools.reader.edn/read-string "clojure.core/list"))
#< clojure.lang.PersistentList$1@37f6b4df>
user=> (@(resolve (clojure.tools.reader.edn/read-string "clojure.core/list")) 1 2)
(1 2)
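The transcript above can be wrapped into a single function. What follows is only a sketch under a few assumptions of mine: it uses clojure.edn, which ships with Clojure 1.5 and later, in place of clojure.tools.reader.edn; the names allowed-fns and call-by-name are hypothetical; and it adds an allowlist, since resolving an arbitrary name would let the input call any function that happens to be loaded.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical allowlist: the only function names the input may refer to.
(def allowed-fns '#{clojure.core/list clojure.core/str})

(defn call-by-name
  "Reads s as edn, checks that it is an allowed symbol, resolves it to a
  var, and applies the var's value to args."
  [s & args]
  (let [sym (edn/read-string s)]
    (when-not (contains? allowed-fns sym)
      (throw (IllegalArgumentException.
              (str "Not an allowed function name: " (pr-str sym)))))
    (let [v (resolve sym)]
      (when-not (var? v)
        (throw (IllegalArgumentException.
                (str "Unable to resolve var: " sym))))
      ;; Deref the var to get the function, then apply it.
      (apply @v args))))

(println (call-by-name "clojure.core/list" 1 2)) ; prints (1 2)
(println (call-by-name "clojure.core/str" "a" "b")) ; prints ab
```

Anything outside the allowlist, such as "clojure.core/eval" or a string that doesn't read as a symbol at all, ends in an IllegalArgumentException before resolve is ever called.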