Tuesday, July 15, 2008

System.identityHashCode

A recent thread on the Scala IRC channel piqued my curiosity about exactly how System.identityHashCode works.  It looks like, as less heated remarks fleshed out the conversation, the definition turned out to be irrelevant to the argument, but regardless, I think it's interesting.  Thankfully Sun open-sourced their implementation of the Java platform, so it was pretty easy to find out exactly how it works.

It turns out the Sun JVM contains three different algorithms for calculating the identity hash, none of which is guaranteed to yield unique hash codes.  That's not surprising, because on a 64-bit architecture guaranteeing uniqueness would be almost impossible.  I say almost because one could just hand out hash codes as sequential numbers, and then crash the JVM when it runs out of 32-bit values (an OutOfHashCodesError ;-).
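Before digging into the JVM internals, here's a quick Scala sketch (the object and value names are mine, not from the IRC thread) showing what the method looks like from the caller's side: it hands back a plain int for any object, ignores any overridden hashCode, and promises nothing about those ints being distinct.

object IdentityHashDemo {
  def main(args: Array[String]): Unit = {
    val a = new Object
    val b = new Object
    // Two distinct objects; their identity hashes are plain ints, and the
    // contract makes no promise that they differ (though they usually do).
    println(System.identityHashCode(a))
    println(System.identityHashCode(b))

    // The identity hash ignores an overridden hashCode:
    val s = "hello"
    println(s.hashCode)                  // String's value-based hash
    println(System.identityHashCode(s))  // the JVM-assigned identity hash
  }
}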

From share/vm/runtime/synchronizer.cpp:

static inline intptr_t get_next_hash(Thread * Self, oop obj) {
  intptr_t value = 0 ; 
  if (hashCode == 0) { 
     // This form uses an unguarded global Park-Miller RNG, 
     // so it's possible for two threads to race and generate the same RNG.
     // On MP system we'll have lots of RW access to a global, so the
     // mechanism induces lots of coherency traffic.  
     value = os::random() ; 
  } else
  if (hashCode == 1) { 
     // This variation has the property of being stable (idempotent)
     // between STW operations.  This can be useful in some of the 1-0
     // synchronization schemes.  
     intptr_t addrBits = intptr_t(obj) >> 3 ; 
     value = addrBits ^ (addrBits >> 5) ^ GVars.stwRandom ; 
  } else 
  if (hashCode == 2) { 
     value = 1 ;            // for sensitivity testing
  } else { 
     // Marsaglia's xor-shift scheme with thread-specific state
     // This is probably the best overall implementation -- we'll
     // likely make this the default in future releases.
     unsigned t = Self->_hashStateX ; 
     t ^= (t << 11) ; 
     Self->_hashStateX = Self->_hashStateY ; 
     Self->_hashStateY = Self->_hashStateZ ; 
     Self->_hashStateZ = Self->_hashStateW ; 
     unsigned v = Self->_hashStateW ; 
     v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ; 
     Self->_hashStateW = v ; 
     value = v ; 
  } 

  value &= markOopDesc::hash_mask;
  if (value == 0) value = 0xBAD ; 
  assert (value != markOopDesc::no_hash, "invariant") ; 
  TEVENT (hashCode: GENERATE) ;
  return value;
}

What's interesting is that there's an algorithm in there (hashCode == 2) where the identity hash code always turns out to be the same for every object.  I'm pretty sure it's there solely for testing purposes, so they can run tests against the JVM and standard library and ensure that no piece of code relies on the uniqueness of the identity hash code.
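The practical upshot is that you should never treat the identity hash as if it were the identity.  As a hedged sketch (the Registry object and its names are mine, not from any real code base), here's the kind of code that degenerate algorithm would flush out, next to a safe alternative built on java.util.IdentityHashMap, which keys on reference identity rather than on hash codes:

import java.util.IdentityHashMap
import scala.collection.mutable

object Registry {
  // Broken: if two live objects ever share an identity hash (and under the
  // hashCode == 2 scheme above they always will), their entries silently collide.
  private val byHash = mutable.Map.empty[Int, String]
  def registerBroken(obj: AnyRef, label: String): Unit =
    byHash(System.identityHashCode(obj)) = label

  // Safe: key on the object reference itself.  IdentityHashMap compares keys
  // by reference equality, so distinct objects can never clobber each other.
  private val byRef = new IdentityHashMap[AnyRef, String]
  def register(obj: AnyRef, label: String): Unit =
    byRef.put(obj, label)
}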

This relates back to my previous point about transparency.  Because Sun's JVM is open source, it's possible for anyone to peek inside and see how the various bits work, and what assumptions the Sun JVM engineers are and are not making about those pieces.


Trust in Authority vs Trust in Transparency

This morning Murph posted a blog entry on the "censorship" of Wikipedia by over-zealous article owners, citing a posting by Lawrence Solomon about his experience editing an article related to global warming.  Murph uses this to support one of his recurring conspiracy theories: that Wikipedia, and social media in general, is doomed because of this kind of censorship and deliberate distortion of the facts.

What's interesting is that this censorship took place in the open.  Anyone who knows where to look can see exactly what changes were made by Solomon, and their disposition relative to the current official article or any other version of it.  Here's Solomon's version versus the current version.

The core issue here isn't censorship.  Editors will always censor; their job is to act as filters.  Furthermore, every editor is subject to biases, whether his own or those imposed on him by a third party such as his employer.  With most publications the editorial process happens behind closed doors, driven by unseen forces.  With Wikipedia the process happens right before the eyes of the world.

The real issue here is that Murph trusts authority more than transparency.  Someone corrected an article, and the correction was subsequently removed due to political bias.  I'm sure that's happened time and time again in every encyclopedia that has ever been published.  The difference is that in this case the change was transparent, with the politics open for all to judge, whereas with a traditional model we would never have known.

If history has taught us anything, it's that placing too much faith in authority is a bad idea.  Our authority figures are all human, and our authoritative organizations are still organizations of people.  They are both fallible and corruptible.  We can't entirely strip authority from our society, because that would lead to anarchy, but we can make authority more transparent, and with transparency we can judge for ourselves.

In this particular case the editorial process of Wikipedia probably failed to yield the most accurate article possible, at least as the article stands today.  I'm not an expert on the subject matter, but I think Solomon's corrections were most likely correct.  That being said, I think the process has succeeded.  The changes Solomon made have not been entirely censored; they have merely been driven from the main page.  Discourse continues with regard to their validity.  The biases of the editors have been made public.  The last thing we want to do is reseal that process behind closed doors simply because in this case we didn't like some of the results.


Sunday, July 13, 2008

Love, Hate, and Type Inference

Ever since I started using Scala, I've had somewhat of a love-hate relationship with type inference.  At the local level, type inference can make code much more concise, easier to read, and easier to refactor.  It makes code easier to read because, in cases where the type of a variable is obvious, type annotations just add noise that distracts from the meaning of the code.  It makes code easier to refactor because changes to the type of something in one place in a program can often ripple throughout the rest of the program through a simple recompile, assuming the new type supports the methods being used on the original type.  This leads to something akin to structural typing without the associated overhead.
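To make the local case concrete, here's a trivial, made-up example of the sort of code where I think inference is a pure win:

// The types here are obvious from the right-hand sides, so annotations
// would only add noise.
val names = List("foo", "bar", "baz")            // inferred as List[String]
val lengths = names.map(s => s.length)           // inferred as List[Int]
val shortOnes = names.filter(s => s.length < 4)  // inferred as List[String]

// If names later became some other collection of strings, this code would
// recompile unchanged, as long as the new type still offers map and filter.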

However, at a more macro level I think type inference can make code much more difficult to read.  For example, consider the following code from the Scalax library (full source here):

//  Scalax - The Scala Community Library
//  Copyright (c) 2005-8 The Scalax Project. All rights reserved
abstract class InputStreamResource[+I <: InputStream] extends CloseableResource[I] {
    def buffered : InputStreamResource[BufferedInputStream] =
        new InputStreamResource[BufferedInputStream] with Wrapper {
            type Handle = BufferedInputStream
            override def wrap(is : SelfHandle) = new BufferedInputStream(is)
            
            override def buffered = this
        }

    def slurp() = for (is <- this) yield StreamHelp.slurp(is)
    
    /* Obtains a Reader using the system default charset. */
    def reader =
        new ReaderResource[Reader] with Wrapper {
            type Handle = Reader
            // XXX: should be UTF-8 by default instead of OS default
            // practically, here in Russia I never used default charset
            override def wrap(is : SelfHandle) = new InputStreamReader(is)
        }
    
    def gunzip =
        new InputStreamResource[GZIPInputStream] with Wrapper {
            type Handle = GZIPInputStream
            override def wrap(is : SelfHandle) = new GZIPInputStream(is)
        }
    
    /** Obtains a Reader using the supplied charset. */
    def reader(charset : String) = {
        // Do this lookup before opening the file, since it might fail.
        val cs = Charset.forName(charset)
        new ReaderResource[Reader] with Wrapper {
            type Handle = Reader
            override def wrap(is : SelfHandle) = new InputStreamReader(is, cs)
        }
    }
    
    def lines = reader.lines
    
    def lines(charset : String) = reader(charset).lines
    
    def readLines() = reader.readLines()
    
    def readLine() = reader.readLine()
    
    def pumpTo[O <: OutputStream](osr : OutputStreamResource[O]) {
        // Note InputStream should be opened before OutputStream
        for (is <- this; os <- osr) StreamHelp.pump(is, os)
    }
}

This is an example of where type inference makes code easier to refactor yet more difficult to read.  If one were to change the return type of the reader methods, or the subsequent lines and/or readLines methods on that type, then the return types of these methods would automatically change on recompile.  However, now try to figure out the return types of the lines and readLines methods.  In order to do that, you need to know the return type of the reader method and the structure of that type.  Figuring out the return type of reader is reasonably straightforward, as it is defined in the same file.  However, the base class of that return type is not, so in order to figure it out you need to trace through several class definitions, potentially spread across other source files.  I doubt this is a big deal for people who are intimately familiar with the code, but I pity the new guy who has to sort it out when he's trying to create a new subclass of this abstract class.  Of course, there's always the API docs, so users of the code, and the poor new guy, do have a place to turn.
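For contrast, here's a hedged sketch (not the real scalax source) of how the public signatures read once the inferable return types are written out.  I've left off lines and readLines precisely because spelling out their types requires chasing ReaderResource into another file, which is exactly the problem.

import java.io.{BufferedInputStream, InputStream, Reader}
import java.util.zip.GZIPInputStream

// InputStreamResource and ReaderResource are the scalax types used above,
// so compiling this sketch would require scalax on the classpath.
trait AnnotatedSignatures[+I <: InputStream] {
  def buffered: InputStreamResource[BufferedInputStream]
  def reader: ReaderResource[Reader]
  def reader(charset: String): ReaderResource[Reader]
  def gunzip: InputStreamResource[GZIPInputStream]
}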

So that's Scala, which supports a relatively limited form of type inference.  Now let's consider a snippet of OCaml, a language with much fuller type inference, that I stole from Mauricio Fernandez:

   (* Copyright (C) 2008 Mauricio Fernandez  http//eigenclass.org
      Solution to the coding challenge at
      http://beust.com/weblog/archives/000491.html *)
   open Printf
   module S = Set.Make(struct type t = int let compare = (-) end)
   let (--) set elm = S.remove elm set
  
   let permutations f zero digits len =
     let base = S.cardinal digits in
     let rec aux n digits = function
         0 -> f n
       | len -> S.iter (fun s -> aux (n * base + s) (digits -- s) (len-1)) digits
     in S.iter (fun s -> aux s (digits -- s) (len - 1)) (digits -- zero)
 
   let () =
     let max = 10_000_000_000 in
     let digits = List.fold_right S.add [0; 1; 2; 3; 4; 5; 6; 7; 8; 9] S.empty in
  
     let count = ref 0 and prev = ref 0 and maxj = ref 0 and base = ref 0 in
     let report () = printf "Found %d numbers under %d.\n" !count max;
                     printf "Max jump: %d (%d -- %d).\n" !maxj !base (!base + !maxj)
   
     in try
       for i = 1 to 10 do
         permutations
           (fun num ->
              if num >= max then raise Exit;
              (* printf "%d\n" num; *)
              incr count;
              let jump = num - !prev in
                if jump > !maxj then (maxj := jump; base := !prev);
                prev := num)
           0 digits i
       done;
       report ()
     with Exit -> report ()

It's not entirely fair that I am picking on this OCaml code, because I don't know the language.  Also, this is a short, self-contained program that I personally would be likely to write in Python without any type annotations at all.  So in a way, for short programs like this, I think full type inference is great.  You get all of the protection of static typing with none of the hassle.  That being said, I think it would be extremely difficult to approach a large code base that is this devoid of type annotations.

All this adds up to my love-hate relationship with type inference.  I love it when I'm writing my own code, and I hate it (aside from local variables) when I'm reading other people's code.  Over the past months I've decreased my use of it in Scala, preferring instead to explicitly specify types wherever they would not be entirely obvious from looking at the code, even though all my Scala code is hobby work, so no one else has to read it.  Of course, being hobby work, I often go days, weeks, or even months without looking at a chunk of code, so I often need help remembering what I wrote.  Overall I would say that excessive reliance on type inference obfuscates code, but there are plenty of ways for programmers to obfuscate their code, so it is a style issue as opposed to a language design issue.

I also think it will be interesting to see how many dynamic language people migrate over to languages with strong type inference as these languages gain more attention.  As you can see from the OCaml code above, there is no extra "typing burden" placed on the programmer.  If the program is correctly typed, it will compile and run quickly.  If not, it won't.  In theory this should satisfy the "type annotations are too much work/add too much noise to my program" camp, but not the "I don't want the compiler telling me what I can and can't do" camp.  My guess is that more people fall into the latter than the former, even the ones who claim they don't, but only time will tell.
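To be concrete about the style I said I've drifted toward (explicit types on the signatures other people will read, inference everywhere inside), here's a made-up sketch, not from any real code base:

// The public signature gets an explicit return type...
def wordCounts(text: String): Map[String, Int] = {
  // ...while the locals, whose types are obvious from their right-hand
  // sides, are left for the compiler to infer.
  val words = text.split("\\s+").toList
  var counts = Map.empty[String, Int]
  for (w <- words)
    counts = counts + (w -> (counts.getOrElse(w, 0) + 1))
  counts
}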

Update: I just had a thought. Try thinking of type inference like using pronouns in natural languages. Imagine talking to a person who rarely uses pronouns. It's slow, cumbersome, and occasionally makes you think the speaker thinks you are an idiot who can't remember what he said 30 seconds ago. Likewise, when someone speaks entirely in pronouns, especially when they use a pronoun to refer to something that isn't part of the immediate conversation, it leads to utter confusion. Context is key, and the compiler can usually keep a much more distant context in its working memory than a programmer can. Type annotations provide context. Type inference assumes context.

Update 2: Here's a link to some OCaml API docs, where, just like in Scala, you can clearly see the inferred types.
