Wednesday, May 11, 2011

Embracing Functional Programming

Few months ago when learning Scala I ported few personal projects from Java, I wanted to share my experience and discoveries on a particular one, since I thought it demonstrates IMO how I embraced functional programming into my thinking process, applied to something useful. Scala was a good fit for this experiment due to its hybrid [OO+F], so I'll try to depict an implementation transitioning from imperative into functional. I found an old class (good candidate) performing statistical analysis on some collected data, so let's set the stage with summary stats , here's the class definition in the REPL:
Welcome to Scala version 2.9.1.final (Java HotSpot(TM) Server VM, Java 1.6.0_23).
Type in expressions to have them evaluated.
Type :help for more information.
scala>
final class Stats( private val list: List[Double] ) {
    require( list != Nil   , "data list can't be Nil")    
    assert ( !list.isEmpty , "data list can't be empty")
                  
    lazy val data = list.sorted

    lazy val len : Int = data.length

    lazy val sum = data.sum
        
    lazy val min = data.min

    lazy val max = data.max

    lazy val range = max - min
                                                
    lazy val median = if (len%2==0) ((data((len/2)-1) + data((len/2))) /2)
                      else data(((len+1)/2)-1)
                      
    lazy val mean = sum / len
                                                                                
    lazy val variance = {
      @scala.annotation.tailrec
      def vrec(list: List[Double] , tot: Double): Double = list match {
        case head :: tail => vrec( tail , tot + math.pow( head - mean , 2 ) )
        case _ => tot / ( len - 1 )
      }
      vrec(data,0D)
    }

    lazy val standardDeviation = math.sqrt( variance )

    lazy val mode = histogram.maxBy(_._2)._1 

    lazy val histogram = {
       scala.collection.immutable.TreeMap[Double,Int]() ++
       data.groupBy( x => x ).mapValues( _.length )
    }

    def dump { data foreach println }

    override def toString() = {
    new scala.collection.mutable.StringBuilder()
    .append("DATA     : ").append( data )
    .append("\nN        : ").append( len )
    .append("\nMIN      : ").append( min )
    .append("\nMAX      : ").append( max )
    .append("\nRANGE    : ").append( range )
    .append("\nMODE     : ").append( mode )
    .append("\nMEDIAN   : ").append( median )
    .append("\nMEAN     : ").append( mean )
    .append("\nVARIANCE : ").append( variance )
    .append("\nSTRD-DEV : ").append( standardDeviation )
    .append("\nHISTOGRAM: ").append( histogram )
    .mkString
    }             
}
defined class Stats
scala>
let's instantiate our new class with some test data
scala> val stats = new Stats(List[Double](60,10,20,30,20,40,20,50,90,80))
stats: Stats =
DATA     : List(10.0, 20.0, 20.0, 20.0, 30.0, 40.0, 50.0, 60.0, 80.0, 90.0)
N        : 10
MIN      : 10.0
MAX      : 90.0
RANGE    : 80.0
MODE     : 20.0
MEDIAN   : 35.0
MEAN     : 42.0
VARIANCE : 751.1111111111111
STRD-DEV : 27.406406388125955
HISTOGRAM: Map(10.0->1,20.0->3,30.0->1,40.0->1,50.0->1,60.0->1,80.0->1,90.0->1)
so let's then focus on the variance in particular: variance = (1/(N-1) × (the sum of (each data point - mean)2) let's start with the following implementations series, we begin the imperative way with var re-assignment, not FP friendly (but ok since within method scope)
def variance() = {
var sum:Double = 0D
for( x <- data ) {
  sum += math.pow( x - mean , 2 )
}
( sum / ( len - 1 ) )
}
and now with built in map + sum from List impl, this will yield O(n) for map and O(n) for the sum respective collection traversals, not too optimal but better
def variance = data.map( x => math.pow( x - mean , 2 ) ).sum / ( len -1 )

(or use data.view.map() for O(n) traversal -> as explained below)
we move on to a more interesting functional approach with view (could be applied to above one as well), note this time we're forcing a non-strict view on our data list, the new Seq from non-strict-view optimises performance since collection is traversed only once and accumulating the sum without a var, this important 'laziness concept' is something that Stream could achieve:
def variance = {
 val sum = (data.view foldLeft 0D) {
  case (sum,x) => sum + math.pow( x - mean , 2 )
 }
( sum / ( len - 1 ) )
}
now with inner helper function & pattern matching & recursion, calculating on head and adding via recursion, in order to be able to do this I had to define an inner helper function, (note that: recursive calls create a new stack frame and we could potentially reach at some point in time stack-overflow)
def variance = {
def vrec( list:List[Double] ) : Double = list match {
 case head :: tail => math.pow( head - mean , 2 ) + vrec( tail )
 case _ => 0D
}
vrec( data ) / ( len - 1 )
}
and finally (for this particular impl, and falling back to strict since data is not large), the favourite and refined version is optimal since we added an accumulator on the recursive helper signature to sum up our total via tail-recursion (hinting with @tailrec the compiler with optimisation ):
def variance = {
@scala.annotation.tailrec
def vrec(list: List[Double] , tot: Double): Double = list match {
 case head :: tail => vrec( tail , tot + math.pow( head - mean , 2 ) )
 case _ => tot / ( len - 1 )
}
vrec(data,0D)
}
Finally, when comparing the java and scala sources afterwards, it did seem a bit dated to review the imperative way (particularly its redundant verbosity and boilerplate), I don't claim to be an expert on FP nor favour nor force you any specific path, and I'm sure there are more clever ways to do this in FP with syntactic sugar etc, my personal experience and thoughts are subjective, my goal here is to inspire you enough to give it a try yourself (with an open mind) and then you come up with your own conclusions. What I find important to identify here is the fact that when forcing ourselves to think differently (which applies in general terms not just FP, and regardless of language or syntax flavour or complexity at hand) we should always strive for:
  • Eliminate repetitive duplication: [inspired by DRY & abstraction principles].
  • Eradicate side effects, specially on unrelated matters, foster immutable things: [inspired by the orthogonal principle].
  • Design implementations (with intermediate results) that are self-contained, independent, and with a single loose coupled well-defined purpose [inspired by law of demeter].
We must admit though, the polyglot of new JVM languages is raising the bar in language design and engineering, the peril of the ones falling behind is the opportunity for the new ones to blossom, exciting to see languages evolve and improve :-) Till next one, cheers!

0 comments: