Monday, January 23, 2012

Compiling and loading in ClojureCLR

Wherein I document environment variables and other factors influencing compiling and loading files in ClojureCLR and how ClojureCLR differs from Clojure in this regard.

Compiler variables

During AOT-compilation, the following vars are consulted to control aspects of the compilation process:

Var                     Doc says
*compile-path*          Specifies the directory where 'compile' will write out .class files. This directory must be in the classpath for 'compile' to work. Defaults to "classes"
*unchecked-math*        While bound to true, compilations of +, -, *, inc, dec and the coercions will be done without overflow checks. Default: false.
*warn-on-reflection*    When set to true, the compiler will emit warnings when reflection is needed to resolve Java method calls or field accesses. Defaults to false.

If you compile by invoking the compile function, such as from a REPL, you will have had a chance to set these vars to appropriate values. However, when compiling from the command line by running Clojure.Compile.exe, you do not have a chance to run Clojure code to initialize these vars. Instead, you can set environment variables to initialize these vars prior to compilation.

The same is true for Clojure. In fact, ClojureCLR and Clojure used the same environment variables for these vars until just recently. Starting with the 1.4.0-alpha5 release (already in the master branch), ClojureCLR has changed the environment variable names to be strictly POSIX-compliant. This is due to problems with periods in environment variable names in Cygwin's bash -- see this thread for more information. Here are the names:

Clojure & older ClojureCLR             New in ClojureCLR
clojure.compile.path                   CLOJURE_COMPILE_PATH
clojure.compile.unchecked-math         CLOJURE_COMPILE_UNCHECKED_MATH
clojure.compile.warn-on-reflection     CLOJURE_COMPILE_WARN_ON_REFLECTION

BTW, ClojureCLR defaults *compile-path* to ".".  "classes" didn't seem to make sense given that ClojureCLR creates assemblies.
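The renaming in the table above looks mechanical: periods and hyphens become underscores and the name is upper-cased. Here is a minimal sketch of that rule, assuming it holds for exactly the three names listed; posix-env-name is a hypothetical helper for illustration, not something ClojureCLR provides.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of ClojureCLR): derive the new-style
;; environment variable name from the old dotted name.
(defn posix-env-name
  [old-name]
  (-> old-name
      (str/replace #"[.-]" "_")   ; periods and hyphens become underscores
      str/upper-case))

(posix-env-name "clojure.compile.warn-on-reflection")
;; => "CLOJURE_COMPILE_WARN_ON_REFLECTION"
```

This is handy if you have build scripts written against the old names and want to generate the new ones rather than retype them.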

Locating files

For identifying libraries for loading, Clojure relates the symbol naming the library to a Java package name and uses Java's mapping of package name to a classpath-relative path. For example, evaluating (compile 'a.b.c) causes Clojure to look for a file a/b/c.clj relative to some root listed on the classpath.  The result of the compilation will be a set of classfiles, written to classes/a/b/c.

ClojureCLR follows Clojure in mapping dotted symbol names to relative paths.  Not having classpaths, ClojureCLR instead uses the value of the environment variable CLOJURE_LOAD_PATH to supply roots for the file probes. In addition, it will look (first) in the current directory and the directory of the entry assembly.

The same holds for load, use, require and other lib-loading functions.
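To make the probing concrete, here is a rough sketch of the symbol-to-path mapping and the list of files that would be tried. Both functions are hypothetical and written only for illustration; the real lookup lives inside the ClojureCLR runtime and may differ in detail. The hyphen-to-underscore step is what Clojure does on the JVM, and I am assuming the same applies here. The roots are passed in by the caller, so nothing is assumed about how CLOJURE_LOAD_PATH is split.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers for illustration only.
(defn lib->relative-path
  "Maps a lib symbol such as 'a.b.c to the relative source path \"a/b/c.clj\"."
  [lib]
  (-> (name lib)
      (str/replace "-" "_")    ; assumption: hyphens are munged as on the JVM
      (str/replace "." "/")
      (str ".clj")))

(defn probe-paths
  "Candidate source files to try, in order, given a seq of root directories."
  [roots lib]
  (for [root roots]
    (str root "/" (lib->relative-path lib))))

(probe-paths ["." "/libs"] 'a.b.c)
;; => ("./a/b/c.clj" "/libs/a/b/c.clj")
```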

Assembly output

The Clojure compiler outputs (many) class files.  The ClojureCLR compiler outputs (not as many) assemblies.  All classes resulting from (compile 'a.b.c) will go into an assembly named a.b.c.clj.dll located in *compile-path*.

When evaluating (load "a/b/c"),  ClojureCLR will look for both <AppDomain.CurrentDomain.BaseDirectory>\a.b.c.clj.dll and <any_load_path_root>\a\b\c.clj, and load the assembly if it exists and has a timestamp newer than the .clj file (if it exists).  At the moment the same set of roots (as named above) is used for assemblies and source code.  

AppDomain.CurrentDomain.BaseDirectory is used as the root for ClojureCLR assembly probes as that is also the CLR's root for resolving assembly references.  
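The naming and the timestamp rule can be sketched as two small pure functions. This is a simplified model, not the actual loader code: the function names are made up, timestamps are plain numbers, and nil stands for a file that does not exist. I am also assuming that a tie goes to the source file, since the rule above asks for an assembly that is strictly newer.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers modelling the rule described above.
(defn assembly-name
  "Maps a load path such as \"a/b/c\" to the assembly name \"a.b.c.clj.dll\"."
  [load-path]
  (str (str/replace load-path "/" ".") ".clj.dll"))

(defn choose-load-target
  "Decides what to load. Each argument is a timestamp (any number),
  or nil if the corresponding file does not exist."
  [assembly-time source-time]
  (cond
    (and assembly-time source-time)
    (if (> assembly-time source-time) :assembly :source) ; tie goes to source (assumption)

    assembly-time :assembly
    source-time   :source
    :else         :not-found))

(assembly-name "a/b/c")        ;; => "a.b.c.clj.dll"
(choose-load-target 200 100)   ;; => :assembly
(choose-load-target 100 200)   ;; => :source
```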

Too many assemblies

Each file loaded during compilation will go into its own assembly.  I find this terribly inelegant.  The distribution for ClojureCLR itself needs Clojure.Main.exe, Clojure.Compile.exe, and the DLR support assemblies, of course, but also thirty-plus assemblies resulting from compiling the Clojure source that defines the initial environment. The pprint lib alone contributes eight assemblies.  They are not really independent.   Conceivably that code all could go into one assembly.  

I've not been able to think of a way to make this work.  I know that the eight files making up pprint are related.  They get compiled because the main pprint file loads each of them, and loading a file while compiling causes that file to be compiled also.  I could very easily write the compiler to output the code into the same assembly as the parent.  However, pprint could load support code that should not be part of its assembly, that should have its own assembly.  In fact, it does;  pprint loads clojure.walk.  It happens to do this with a :use clause in its ns form, but it doesn't have to.  Without a mechanism in Clojure that allows us to distinguish these uses of load, I'm afraid we're stuck with some inelegance.

Tuesday, January 17, 2012

Porting effort for Clojure contrib libs

Looking for Clojure contrib lib projects to port to ClojureCLR?

I looked at the most popular libs on https://github.com/clojure, the official libs of the clojure project.  I defined popularity by the number of watchers, lacking a better criterion.  Here are the top projects sorted by number of watchers when I looked recently.  Ignoring those in single digits and all java.* projects, here they are:

Watchers  Project             Watchers  Project
129       core.logic          23        test.generative
69        core.match          20        core.cache
60        tools.nrepl         19        core.memoize
37        tools.cli           18        algo.monads
36        data.finger-tree    15        data.xml
35        tools.logging       11        test.benchmark
32        core.unify          10        core.incubator
28        data.json           10        data.csv
                              10        tools.macro

There are some fairly trivial edits that are required in porting most libs.  These include:

  1. Substituting an appropriate CLR exception class.  For example, IllegalArgumentException becomes ArgumentException.  If a throw uses Exception, that will work as is.
  2. Substituting interop method names.  For example, toString becomes ToString, hashCode becomes GetHashCode, etc.  Most String methods and some I/O methods just need capitalization.  BTW, ClojureCLR preserves case on most clojure.lang class method names so they don't need to be changed.  (You're welcome.)  Also, method names on protocols won't need to be changed.

I'll refer to these kinds of changes below as the usual.
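As a rough illustration of the method-name half of the usual, here is a sketch of a lookup that falls back to capitalizing the first letter. The table holds only the two method renamings mentioned above, and clr-name is a hypothetical helper, not a tool that exists anywhere. Treat its output as a first guess to check against the .NET documentation; plenty of Java methods have CLR equivalents that are not simple capitalizations.

```clojure
(require '[clojure.string :as str])

;; Renamings taken from the text above; a real port needs a longer table.
(def known-renames
  {"toString" "ToString"
   "hashCode" "GetHashCode"})

;; Hypothetical helper: look the name up, else capitalize the first letter.
;; Assumes a non-empty name.
(defn clr-name
  [jvm-name]
  (or (known-renames jvm-name)
      (str (str/upper-case (subs jvm-name 0 1)) (subs jvm-name 1))))

(clr-name "hashCode")   ;; => "GetHashCode"
(clr-name "append")     ;; => "Append"
```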

I did a quick scan of the source of each project to estimate the effort required to port the project to ClojureCLR. In the order given above, here are some comments on each.

core.logic: This is one of the larger projects.  The usual, and not that much of it.  The only thing I saw that might take a little more investigation is that the deftype Pair implements java.util.Map$Entry.  (See below for more.)  Easy. (Unless it requires actual thought, in which case you'd have to understand the code, and that would make it a Challenge.)

core.match:  Another large project.  The usual, and not much of it.  The bean-match function will require adaptation to CLR classes and the regular expression matcher will need to be examined -- JVM vs CLR regexes always requires a look.  Of most concern is the deftype MapPattern that mentions java.util.Map.  The question is always dealing with IDictionary and IDictionary<K,V> -- support for arbitrary generics is always tricky.  Probably Easy, with the same caveat as core.logic.

tools.nrepl: This is likely to be tricky.  There are some Java classes that will have to be ported.  Of greater concern is the amount of low-level I/O on sockets.  At best, a Medium project, likely a Challenge.  Given that this project is being redesigned, it might be wise to wait for 2.0 and then put in the effort.

tools.cli: The usual, and not much of it.  There is a test that uses an Integer method.  Trivial.

data.finger-tree:  The usual.  The only concern is the mention of java.util.Set.  There is no System.Collections.ISet, only System.Collections.Generic.ISet<T>, so some thought will be required. At worst, Medium; more likely Easy.

tools.logging: This will take some work because adapters for .Net logging tools will have to be developed.  One might consider log4net, ELMAH, NLog.  The good news is that the code is designed to plug different adapters into its framework, so developing new adapters should be easy, requiring mostly a decent knowledge of the target logging framework.  Most of the tests will have to be rewritten.  Medium, probably fun.

core.unify: The usual.  The same concern about java.util.Map mentioned for core.match.  I'm guessing this is trivial here.  Easy.

data.json: We know exactly how much work this will take.  See Porting libs to ClojureCLR: an example.

test.generative: Needs tools.namespace.  That didn't make the popularity cut, but it should be barely Medium to port, mostly due to the need to think a little about the I/O interop.  In test.generative, there are some library calls, to Random, Math.* methods, system time, etc., that will take a little more work than just the usual.  Barely Medium.

core.cache: A moment's thought about replacing java.lang.Iterable in the definition of defcache.  Otherwise, just the usual.  Easy.

core.memoize: Needs core.cache.  Might work as-is!  Trivial.

algo.monads: Might work as-is!  Trivial.  Hey, when was the last time you saw 'trivial' and 'monads' in such proximity?

data.xml:  The README notes that it is not yet ready for use.  Really, this should be called java.xml because of its dependence on org.xml.sax, javax.xml.parsers, etc.  This will require a major rewrite.  Until this is complete, I can't say how hard it will be.

test.benchmark: Looks straightforward.  Easy.

core.incubator: The toughest thing is reference to java.util.Map (see above).  Trivial.

data.csv: The I/O will take some time, but at worst a Medium.  A very Easy Medium at that.

tools.macro: Appears to be Trivial.

So, what are you waiting for?  Plenty of easy ones to get started with and a few more challenging ones.  Whatever you pick, you'll have a chance to read some good Clojure code, always a worthwhile exercise.

Where are the hard ones, you ask?  They certainly exist, just not among the official contrib libs.  There are plenty of other Clojure projects floating around that will require significant effort.

Port a lib today!

A note on java.util.Map$Entry:  clojure.lang.IMapEntry extends java.util.Map$Entry on the JVM. ClojureCLR could not do that because the equivalent to Map$Entry, System.Collections.DictionaryEntry, is a struct and can't be subclassed. Also, we have the problem with the generic System.Collections.Generic.KeyValuePair<TKey,TValue>. I shudder when I see Map$Entry; this is a sign that real thinking will be required.

Friday, January 6, 2012

Porting libs to ClojureCLR: an example

Responses to the 2011 ClojureCLR survey gave high priority to the porting of Clojure contrib libs to ClojureCLR.  One action item resulting from an analysis of the responses was to provide examples of the porting process.  As an example, I decided to port the data.json contrib lib, authored by Stuart Sierra.  In looking across all the authorized contrib libs on github.com/clojure, this port was above trivial but below daunting in complexity.  (In a later post, I will analyze all the contrib libs for complexity.)

The following recipe will work for simple ports.

Figure out how the original code works.  This step might be optional on the simplest ports involving just simple interop method renaming.  For this port, I had to modify one algorithm and think through extending a protocol to CLR types.

Create a project for the port.  I did not fork the original data.json project on github; the project needs its own identity and you're not going to be issuing pull requests back to the original.

You are on your own for picking project names and namespace structures for these ports.  I chose the name cljclr.data.json for both the project and the namespace.  You will find my project here.

I would appreciate input on naming these projects.  Here are my current thoughts.  I didn't want to call this project clr.data.json--I plan to use clr.X.X for ports of contrib libs that have names like java.X.X.  The namespace for data.json is actually clojure.data.json, so I decided to use cljclr.data.json.  I thought of data.json.clr and other variations adding clr as a component, but was offended by having an extra layer of subdirectory injected.  I don't actually like cljclr--I can't read it and I always mistype it.  Help!

In the absence of tools such as leiningen or a working Visual Studio extension, for now you will have to come up with your own project structure.  I just used

  • <root>
    •  src
    •  test

as the two main branches, with subdirectories corresponding to the namespaces.

Copy files from the source project.  Usually, you can preserve the basic code structure, relocating files appropriately for your namespace structure.  The data.json project really has only two files of significance:  json.clj and json_test.clj.  I copied them into src/cljclr/data/json.clj and test/cljclr/data/json_test.clj.

Modify file headers. I changed the ns directive in each file to match my namespace structure. If you are interested in things like copyrights, you will have to figure out how you want to acknowledge the original author according to the license on the original project.

Scan for trouble.  I look for trouble.  Here are some of the things I look for.

  • Imports in the ns directive. Imports of clojure.lang classes are usually okay--I've been careful to maintain class names and public method names to match Clojure as best I can.  Anything starting with java. will have to be replaced.
  • Interop calls. Scan for (.name,  (name.,  and (CapName/name.  If you are lucky, a simple capitalization will handle many of the (.name calls.
  • Type hints.  Most type hints of classes outside clojure.lang will need to be changed.  Carets feel like sticks.
  • Exceptions.  The only exception class that works unchanged is Exception itself.
  • CapitalLetters. Given that Clojure code is mostly lowercase, anything with CapitalLetters is likely to be trouble.

Do simple renamings.  Hack out the bad imports and add new ones as you work through the code.  Work through each interop call and determine equivalents; ditto for type hints and exceptions.
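The scan for interop calls can be partly automated. Below is a minimal sketch that runs three regular expressions over the source text, one for each of the call shapes listed above. The patterns and the find-interop function are my own rough approximations for illustration; they do not understand strings, comments, or fully qualified class names that start with a lowercase package, so expect both misses and false hits.

```clojure
;; Hypothetical scanner; the patterns only approximate what the reader sees.
(def interop-patterns
  {:method-call #"\(\.[a-zA-Z]\w*"            ; (.name
   :constructor #"\([A-Z][\w.$]*\.(?=[\s)])"  ; (Name.
   :static-call #"\([A-Z][\w.$]*/\w+"})       ; (CapName/name

(defn find-interop
  "Returns a map from kind of interop form to the matches found in source."
  [source]
  (into {}
        (for [[kind re] interop-patterns]
          [kind (vec (re-seq re source))])))

(find-interop "(if (Character/isWhitespace c) (.read stream) (StringBuilder. 16))")
;; => {:method-call ["(.read"],
;;     :constructor ["(StringBuilder."],
;;     :static-call ["(Character/isWhitespace"]}
```

Counting the matches per name gives you a quick estimate of how much renaming a file needs before you start.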

I adopt a uniform method of marking the places where I make changes.  This makes it easier to update the port as the original lib changes.  You will see lines like this in my code:

 (Char/IsWhiteSpace c) (recur (.Read stream) result)             ;DM: Character/isWhitespace .read   

where the comment shows what was changed from in the original code.

For json.clj, I made the following simple changes.

Method renamings:

From                        To                      Count
(.read                      (.Read                  30
(.append                    (.Append                14
(.print                     (.Write                 11
(.unread                    (.Unread                3
(Character/isWhitespace     (Char/IsWhiteSpace      3
(.isArray                   (.IsArray               2
(.charAt                    (.get_Chars             1
(.toString                  (.ToString              1

Type renamings:

From              To                      Count
PushbackReader    PushbackTextReader      9
PrintWriter       TextWriter              5
PrintWriter       StreamWriter            1
CharSequence      String                  1
EOFException      EndOfStreamException    5

PrintWriter was mostly used as a type hint, and could have TextWriter substituted.  In one place, it was new'd.  TextWriter is abstract and can't be constructed, so StreamWriter had to be used in that place.

This project is obviously quite I/O-centric.  Fortunately, there is sufficient commonality between the I/O libraries on the JVM and CLR to make this part of the translation fairly routine.

At this point, you are probably very close to being able to load the file. You can try loading or compiling and let the errors guide you.

Check for reflection warnings.  I find it useful to  (set! *warn-on-reflection* true) at the start of each code file.  I don't do this for performance, but it can catch bad type hints and misnamed methods in the code.   The perf I care about is coding perf, not runtime.

Do the hard parts. The remaining changes were not as easy.  There were two other type renamings, not I/O-related.
java.util.Map => System.Collections.IDictionary
java.util.Collection => System.Collections.ICollection
If you see Java collection types, you are going to have to think about the correct alternatives.  In general, we have no way to deal with CLR generic collection types -- most uses of these type names will require the type parameters to be instantiated, so you can't say you want all System.Collections.Generic.ICollection<T> to be handled. However, most of the generic collection types will provide non-generic interfaces to work with.  That is the solution I chose for this code:

(defn- pprint-json-dispatch [x escape-unicode]
 (cond (nil? x) (print "null")
       (instance? System.Collections.IDictionary x) 
                        (pprint-json-object x escape-unicode)        ;DM: java.util.Map
       (instance? System.Collections.ICollection x) 
                        (pprint-json-array x escape-unicode)         ;DM: java.util.Collection
       (instance? clojure.lang.ISeq x) 
                        (pprint-json-array x escape-unicode)
       :else (pprint-json-generic x escape-unicode)))

Another common trouble spot is dealing with the Java primitive wrapper classes, such as Integer.  You most likely won't see many type hints, but you will see calls to static methods.  There is only one in this code.  In read-json-hex-character, I translated (Integer/parseInt s 16) to  (Int32/Parse s NumberStyles/HexNumber).  (I added an import for System.Globalization.NumberStyles.)  I don't have a list of rules for this.  You're on your own.

Extending protocols to Java library classes will also cause you to spend some time thinking.  The primary difficulty here came in extensions to the Write-JSON protocol.  Extensions to things like nil and clojure.lang.Named were unchanged.  Other extensions had to be modified.

You will have a problem anytime you run into java.lang.Number.  This is a base class for all the primitive numeric wrapper classes such as Integer.  There is no equivalent in the CLR. I created extensions for each primitive numeric type.

(extend System.Byte Write-JSON {:write-json write-json-plain})
(extend System.SByte Write-JSON {:write-json write-json-plain})
(extend System.Int16 Write-JSON {:write-json write-json-plain})
...

Also, java.math.BigInteger and java.math.BigDecimal will need to be translated to clojure.lang.BigInteger and clojure.lang.BigDecimal, and CharSequence usually goes to String.

The only significant algorithmic change was in write-json-string.  Stuart was careful to properly handle full Unicode, which means not just iterating through the string character by character but dealing with actual Unicode code points as expressed in the UTF-16 encoding used in Java strings.  CLR strings use the same encoding, but the proper way to iterate through Unicode code points is different.  This took the most time for me to translate due to my own ignorance.  I ended up with

(defn- write-json-string [^String s ^PrintWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Integer(count s))]
    (.append sb \")
    (dotimes [i (count s)]
      (let [cp (Character/codePointAt s i)]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.append sb "\\\"")
...
         (< 31 cp 127) (.append sb (.charAt s i))
...
         :else (if escape-unicode?
   ;; Hexadecimal-escaped
   (.append sb (format "\\u%04x" cp))
   (.appendCodePoint sb cp)))))
 ...

becoming

(defn- write-json-string [^String s ^TextWriter out escape-unicode?]
  (let [sb (StringBuilder. ^Int32 (count s))
        chars32 (StringInfo/ParseCombiningCharacters s)]
    (.Append sb \")
    (dotimes [i (count chars32)]
      (let [cp (Char/ConvertToUtf32 s (aget chars32 i))]
        (cond
         ;; Handle printable JSON escapes before ASCII
         (= cp 34) (.Append sb "\\\"")
...
         (< 31 cp 127) (.Append sb (.get_Chars s (aget chars32 i)))   ; index into s via chars32, not the loop counter
...
         :else (if escape-unicode?
   ;; Hexadecimal-escaped
   (.Append sb (format "\\u%04x" cp))
   (.Append sb  (Char/ConvertFromUtf32 cp))))))
...

Translate your tests.  This is generally easier.  The only changes I had to make were to some setup code for certain tests.  Only a handful of changes were required.

Test. Repair. Repeat.  Good luck.

After just a few iterations, I had all tests passing except one.  In the pretty-print test, I was getting output for some floating point numbers that were exact integers without a .0 at the end, as was required by the test.  Having run into this before, I recalled that I had defined a helper function named fp-str to take care of this.  I added

(defn- write-json-float [x ^TextWriter out escape-unicode?] 
  (.Write out (fp-str x)))                                  

and then modified the two float-type extends:

(extend System.Double Write-JSON {:write-json write-json-float})
(extend System.Single Write-JSON {:write-json write-json-float})

GREEN!!!!

Celebrate.  
Publish.

I leave these steps to you.


Next steps? Choose something to port and go for it.  


Many of the contrib libs will require significantly less work than this.  There are others that will require almost complete rewrites.  Outside of the core contrib libs, many of the requested ports such as leiningen and the web frameworks are going to be a lot of work.

Resources

* data.json:  https://github.com/clojure/data.json
* clrclj.data.json -- https://github.com/dmiller/clrclj.data.json

Tuesday, January 3, 2012

Referring to types

The basics for referring to types for CLR interop are the same for ClojureCLR as for Clojure on the JVM. I will assume you are familiar with interop as covered in http://clojure.org/java_interop or your favorite Clojure intro.

Standard Clojure allows use of the symbols int, double, float, etc. in type hints to refer to the corresponding primitive types. ClojureCLR allows this and extends this to the numeric types present in the CLR but not in the JVM: uint, ulong, etc. Similarly, the shorthand array references such as ints and doubles work, and are joined by uints, ulongs, etc.

The CLR is not C#.
Do not let the presence of int and company for type hinting put you in a C# frame of mind. When specifying generic types, you cannot use C# notation:

System.Collections.Generic.IList<int>

Instead, you must use the actual CLR type name:

System.Collections.Generic.IList`1[System.Int32]

Remember that float becomes System.Single; I've had to dope-slap myself on that one a few times.
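If you are coming from C#, it can help to build the CLR name mechanically until the pattern sticks: the base name, a backquote, the number of type parameters, and then the fully qualified arguments in square brackets. The sketch below does just that with plain string manipulation. The function and the small alias table are hypothetical helpers of mine; the aliases are the standard C# keyword-to-type mappings.

```clojure
(require '[clojure.string :as str])

;; C# keyword aliases and the CLR types they stand for.
(def csharp-alias->clr
  {"int"    "System.Int32"
   "long"   "System.Int64"
   "float"  "System.Single"
   "double" "System.Double"})

;; Hypothetical helper: builds the CLR name of an instantiated generic type.
(defn generic-type-name
  [base type-args]
  (let [args (map #(get csharp-alias->clr % %) type-args)] ; expand aliases, keep other names
    (str base "`" (count args) "[" (str/join "," args) "]")))

(generic-type-name "System.Collections.Generic.IList" ["int"])
;; => "System.Collections.Generic.IList`1[System.Int32]"
```

The result is only a string; getting it into code as a type reference is a separate problem.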

Clojure uses symbols to refer to types. This works on the JVM because package-qualified class names are lexically compatible with symbols. Not so here. The backquote and square brackets in the type name shown above cannot be part of a symbol name. If you type that string of characters into the REPL, you will get

user=> System.Collections.Generic.IList`1[System.Int32]
CompilerException System.InvalidOperationException:
  Unable to resolve symbol: System.Collections.Generic.IList in this context
   at ...
1
[System.Int32]

The input string is parsed as separate entities:

System.Collections.Generic.IList
`1
[System.Int32]

Not what was intended.

In addition to backquotes and square brackets, a fully-qualified type name can contain an assembly identifier--that involves spaces and commas. In fact, CLR typenames can contain arbitrary characters. Backslashes can escape characters that do have special meaning in the typename syntax (comma, plus, ampersand, asterisk, left and right square bracket, left and right angle bracket, backslash).

To allow symbols to contain arbitrary characters, ClojureCLR extends the reader syntax using "vertical bar quoting". Vertical bars are used in pairs to surround the name or a part of the name of a symbol. Any characters between the vertical bars are taken to be part of the symbol name. For example,

|A(B)|
A|(|B|)|
A|(B)|

all mean the symbol whose name consists of the four characters A, (, B, and ).  I consider only the first one to be readable; quoting the entire name is to be preferred.  To quote the IList example above, you would write

|System.Collections.Generic.IList`1[System.Int32]|

To include a vertical bar in a symbol name that is |-quoted, use a doubled vertical bar.

|This has a vertical bar in the name ... || ...<>@#$@#$#$|
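If you generate code, the quoting rule is easy to apply mechanically: double any vertical bar in the name, then wrap the whole name in a pair of bars. A minimal sketch follows; bar-quote is a hypothetical helper, and it only produces the text you would hand to the ClojureCLR reader. It does no reading or symbol validation itself.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: produce the |-quoted text for a symbol name.
(defn bar-quote
  [s]
  (str "|" (str/replace s "|" "||") "|"))

(bar-quote "A(B)")   ;; => "|A(B)|"
(bar-quote "a|b")    ;; => "|a||b|"
```

Quoting the entire name, as this does, matches the style preferred above.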

With this mechanism we can make a symbol for a fully-qualified typename, such as:

(|com.myco.mytype+nested, MyAssembly, Version=1.3.0.0, Culture=neutral, PublicKeyToken=b14a123334343434|/DoSomething x y)

or

(reify 
 |AnInterface`2[System.Int32,System.String]| 
 (m1 [x] ...)
 I2
 (m2 [x] ...))


There are a number of things you should note about |-quoting and about generic type references.

First, what |-quoting does is to prevent characters from stopping the token scan. Checks on symbol validity that follow token scanning are still in effect. These include not starting with a digit, containing a non-initial colon, and a few others. When scanning A(B), the left parenthesis stops scanning the token that begins with A.  When scanning |A(B)|, the left and right parentheses do not stop the scan.  However, scanning |ab:| is the same as scanning ab:, a colon being a perfectly fine token constituent.  The colon at the end is a no-no, though, and so the token is rejected and the reader throws an exception.

(I could have taken a more radical approach and allowed |ab:|.   One then gets into all kinds of edge cases that I didn't want to solve.  I feel that a more radical quoting approach requires consultation and agreement with the Clojure powers-that-be.)


Second, be careful with namespaces for symbols. Any / appearing |-quoted does not count as a namespace/name separator. If you have special characters in both the namespace name and the symbol name, you must |-quote each one separately. Thus,

(namespace 'ab|cd/ef|gh)    ;=> nil
(name 'ab|cd/ef|gh)         ;=> "abcd/efgh"
 
(namespace 'ab/cd|ef/gh|ij) ;=> "ab"
(name 'ab/cd|ef/gh|ij)      ;=> "cdef/ghij"

Rather than

ab/cd|ef/gh|ij

it would be more readable to write

ab/|cdef/ghij|

Third, you will usually need to fully namespace-qualify generic types and their parameters.  For example,

|System.Collections.Generic.IList`1[System.Int32]|

works as a type reference while

|System.Collections.Generic.IList`1[Int32]|

does not.

Also, aliasing via import is not of much help.  After eval'ing

(import '|System.Collections.Generic.IList`1|)

the symbol |IList`1| will refer in the current namespace to the generic type |System.Collections.Generic.IList`1|, but that is of no help in referring to instantiated IList types. You cannot then refer to

|IList`1[System.Int32]|

Perhaps someday we will introduce a more compositional approach to generic types and symbols that will accommodate this.

Fourth, if you are familiar with |-quoting in Common Lisp, the ClojureCLR mechanism is not as inclusive.   In CL you could include a literal vertical bar in a symbol name with backslash-escaping: abc\|123 has name “abc|123”. CL has  \-escaping for characters in symbol tokens; ClojureCLR does not.


Finally, note that when printing with *print-dup* true, symbols with 'bad' characters will be |-quoted.