[Update: the jvmr package has been replaced by a new package called rscala. I have a new post which explains it.]
Introduction
In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work.
Since I use both Scala and R for statistical computing, it is very convenient to have a degree of interoperability between the two languages. I could call R from Scala code or Scala from R code, or both. Fortunately, some software tools have been developed recently which make this much simpler than it used to be. The software is jvmr, and as explained at the website, it enables calling Java and Scala from R and calling R from Java and Scala. I have previously discussed calling Java from R using the R CRAN package rJava. In this post I will focus on calling Scala from R using the CRAN package jvmr, which depends on rJava. I may examine calling R from Scala in a future post.
On a system with Java installed, it should be possible to install the jvmr R package with a simple
install.packages("jvmr")
from the R command prompt. The package has the usual documentation associated with it, but the draft paper describing the package is the best way to get an overview of its capabilities and a walk-through of simple usage.
A Gibbs sampler in Scala using Breeze
For illustration I’m going to use a Scala implementation of a Gibbs sampler which relies on the Breeze scientific library, and will be built using the simple build tool, sbt. Most non-trivial Scala projects depend on various versions of external libraries, and sbt is an easy way to build even very complex projects trivially on any system with Java installed. You don’t even need to have Scala installed in order to build and run projects using sbt. I give some simple complete worked examples of building and running Scala sbt projects in the github repo associated with my recent RSS talk. Installing sbt is trivial as explained in the repo READMEs.
For this post, the Scala code, gibbs.scala is given below:
package gibbs object Gibbs { import scala.annotation.tailrec import scala.math.sqrt import breeze.stats.distributions.{Gamma,Gaussian} case class State(x: Double, y: Double) { override def toString: String = x.toString + " , " + y + "\n" } def nextIter(s: State): State = { val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw) } @tailrec def nextThinnedIter(s: State,left: Int): State = if (left==0) s else nextThinnedIter(nextIter(s),left-1) def genIters(s: State, stop: Int, thin: Int): List[State] = { @tailrec def go(s: State, left: Int, acc: List[State]): List[State] = if (left>0) go(nextThinnedIter(s,thin), left-1, s::acc) else acc go(s,stop,Nil).reverse } def main(args: Array[String]) = { if (args.length != 3) { println("Usage: sbt \"run <outFile> <iters> <thin>\"") sys.exit(1) } else { val outF=args(0) val iters=args(1).toInt val thin=args(2).toInt val out = genIters(State(0.0,0.0),iters,thin) val s = new java.io.FileWriter(outF) s.write("x , y\n") out map { it => s.write(it.toString) } s.close } } }
This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.
name := "gibbs" version := "0.1" scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature") libraryDependencies ++= Seq( "org.scalanlp" %% "breeze" % "0.10", "org.scalanlp" %% "breeze-natives" % "0.10" ) resolvers ++= Seq( "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/", "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/" ) scalaVersion := "2.11.2"
Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple
sbt "run output.csv 50000 1000"
Calling via R system calls
Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands
system("sbt \"run output.csv 50000 1000\"") out=read.csv("output.csv") library(smfsb) mcmcSummary(out,rows=2)
This works fine, but won’t work so well for code which needs to be called repeatedly. For this, tighter integration between R and Scala would be useful, which is where jvmr comes in.
Calling sbt-based Scala projects via jvmr
jvmr provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. For an sbt project such as we are considering here, it is relatively easy to get sbt to provide us with all of the information we need in a fully automated way.
First, we need to add a new task to our sbt build instructions, which will output the full classpath in a way that is easy to parse from R. Just add the following to the end of the file build.sbt:
lazy val printClasspath = taskKey[Unit]("Dump classpath") printClasspath := { (fullClasspath in Runtime value) foreach { e => print(e.data+"!") } }
Be aware that blank lines are significant in sbt build files. Once we have this in our build file, we can write a small R function to get the classpath from sbt and then initialise a jvmr scalaInterpreter with the correct full classpath needed for the project. An R function which does this, sbtInit(), is given below
sbtInit<-function() { library(jvmr) system2("sbt","compile") cpstr=system2("sbt","printClasspath",stdout=TRUE) cpst=cpstr[length(cpstr)] cpsp=strsplit(cpst,"!")[[1]] cp=cpsp[1:(length(cpsp)-1)] scalaInterpreter(cp,use.jvmr.class.path=FALSE) }
With this function at our disposal, it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.
sc=sbtInit() sc['import gibbs.Gibbs._'] out=sc['genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'] library(smfsb) mcmcSummary(out,rows=2)
Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.
Summary
The CRAN package jvmr makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to add a task to the sbt build file and a function to R in order to initialise the jvmr Scala interpreter with the full classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.
