Stalwarts in Tech – An Interview with Donald Raab – On Why GS Collections is awesome for Java!

About this series

This is the first in a regular series of interviews with stalwarts of the technology industry. We want to highlight many of the unsung heroes of the industry: the people and projects that have made huge impacts on our lives as developers and technologists.

Update: 07/Mar/2014

Donald has informed us that there is an update to the GS Collections library (4.2.x). You can find it on their GitHub wiki and via a direct link.

Don Raab – Goldman Sachs and GS Collections

We’re really pleased to have Don Raab from Goldman Sachs open this series, as he and Goldman Sachs have been major contributors to Java and Open Source for many, many years. In particular, Don and his team have been responsible for the very popular GS Collections library, one that we use ourselves here at jClarity. So without further ado, let’s find out about Don’s contributions, the motivations behind GS Collections, and some deep-dive details about GS Collections itself!

1. Would you like to introduce yourself Don?

My name is Donald Raab. I am a Tech Fellow at Goldman Sachs. I love to code, and also enjoy teaching other developers how to code. I’ve programmed in 20 different languages over the years, and have been paid to develop in at least 10 of them. My favorite programming language is Smalltalk. My second favorite language is Java using GS Collections. I first learned Java in 1997, and have been programming professionally using Java for the past 13 years. I manage a core team at Goldman Sachs which, among other things, is responsible for the continued development of GS Collections, a Java Collections Framework which we open sourced on GitHub almost 2 years ago.

2. What would you say is the key benefit for developers of using GS-Collections over the existing java.util Collections package or say Google’s well known Guava library? What makes them unique?

Scope and completeness.

GS Collections is a feature-rich collections framework. It started out as a supplement to the Java Collections Framework but has grown to the point where it can be used as a complete replacement for the JCF.

Guava also started out as a supplement to the JCF. It has grown into a feature-rich supplement for all of the standard libraries, not just collections. There’s a lot of overlap. Both frameworks add Multimaps, BiMaps, Bags (a.k.a. Multisets) and Immutable Collections. Guava includes utilities for I/O, reflection, and EventBuses, which are out of scope for GS Collections. GS Collections includes optimized replacements for Lists, Sets, and Maps, which are out of scope for Guava.

Both GS Collections and Guava are more complete and richer in their interfaces than the base JDK Collections library. GS Collections combines a set of collections features and capabilities that you would otherwise only get by adding several Java collections frameworks to your classpath like Apache Commons Collections, Guava, Trove, and Functional Java.

GS Collections includes a rich functional API inspired by the Smalltalk protocol. Any Smalltalker or Ruby programmer should recognize method names like select, reject, collect, detect, and injectInto. Developers familiar with other programming languages might know these as filter, filterNot, map, find, and foldLeft. We’ve added a lot of other methods over the years that were inspired by our experience with other languages like Ruby, Haskell, and Scala, including groupBy, partition, flatCollect, and zip. We have over 90 methods available on our parent RichIterable interface which most of our types extend.
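To make the naming correspondence concrete, here is a minimal sketch of what those five operations do, written against the plain JDK 8 only. The class SmalltalkStyle and its static helpers are hypothetical stand-ins invented for this illustration. They are not the GS Collections implementation, where the methods live directly on the collection types, and returning an Optional from detect is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper class: mirrors the Smalltalk-style names using only the JDK.
final class SmalltalkStyle {
    // select (a.k.a. filter): keep the elements that satisfy the predicate
    static <T> List<T> select(List<T> items, Predicate<T> predicate) {
        return items.stream().filter(predicate).collect(Collectors.toList());
    }

    // reject (a.k.a. filterNot): keep the elements that do NOT satisfy the predicate
    static <T> List<T> reject(List<T> items, Predicate<T> predicate) {
        return items.stream().filter(item -> !predicate.test(item)).collect(Collectors.toList());
    }

    // collect (a.k.a. map): transform every element
    static <T, R> List<R> collect(List<T> items, Function<T, R> function) {
        return items.stream().map(function).collect(Collectors.toList());
    }

    // detect (a.k.a. find): the first element that satisfies the predicate, if any
    static <T> Optional<T> detect(List<T> items, Predicate<T> predicate) {
        return items.stream().filter(predicate).findFirst();
    }

    // injectInto (a.k.a. foldLeft): thread an accumulator through the elements, left to right
    static <T, A> A injectInto(A initial, List<T> items, BiFunction<A, T, A> function) {
        A accumulator = initial;
        for (T item : items) {
            accumulator = function.apply(accumulator, item);
        }
        return accumulator;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(select(numbers, n -> n % 2 == 0));     // [2, 4]
        System.out.println(reject(numbers, n -> n % 2 == 0));     // [1, 3, 5]
        System.out.println(collect(numbers, n -> n * 10));        // [10, 20, 30, 40, 50]
        System.out.println(detect(numbers, n -> n > 3));          // Optional[4]
        System.out.println(injectInto(0, numbers, Integer::sum)); // 15
    }
}
```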

GS Collections includes optimized replacements for the standard JDK Collections classes like ArrayList, HashSet, and HashMap. We wanted more memory efficient and performant versions of these classes in addition to a rich functional API. We also wanted memory efficient immutable collections and primitive collections.

GS Collections is tuned for both large and small scale performance. We’ve supported data-level parallelism for years. In the earliest years of the framework, we added parallel utilities using Doug Lea’s original version of the Fork/Join framework. When we migrated from Java 4 to Java 5 we switched over to using Executors. We have since revived our usage of Fork/Join by providing a separate forkjoin module. GS Collections supports Java version 5 and above, so we had to add a Java 7 specific module to include support for Fork/Join again.
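As a rough illustration of what data-level parallelism over an Executor can look like, the sketch below splits a list into chunks, filters each chunk on a thread pool, and joins the partial results in their original order. It uses only the JDK. The class ChunkedParallelSelect, its select method, and the chunkSize parameter are hypothetical names chosen for this example and do not reflect how the GS Collections parallel utilities are actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical sketch of an eager, chunked, parallel filter built on an ExecutorService.
final class ChunkedParallelSelect {
    static <T> List<T> select(List<T> input, Predicate<T> predicate, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            // Submit one filtering task per chunk of the input.
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int start = 0; start < input.size(); start += chunkSize) {
                List<T> chunk = input.subList(start, Math.min(start + chunkSize, input.size()));
                Callable<List<T>> task = () -> {
                    List<T> kept = new ArrayList<>();
                    for (T item : chunk) {
                        if (predicate.test(item)) {
                            kept.add(item);
                        }
                    }
                    return kept;
                };
                futures.add(pool.submit(task));
            }
            // Join the partial results in submission order so encounter order is preserved.
            List<T> result = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for chunks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            numbers.add(i);
        }
        System.out.println(select(numbers, n -> n % 2 == 0, 3)); // [2, 4, 6, 8, 10]
    }
}
```

Creating a pool per call keeps the sketch self-contained. A real utility would more likely reuse a shared pool and pick the chunk size based on the input size.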

Here is a list of the feature combinations that I believe set GS Collections apart in the Java Collections framework space today.

Container Interface Hierarchies

  • Readable
  • Mutable
  • Immutable
  • Fixed Size

Iteration styles

  • Eager (Serial, Parallel)
  • Lazy (Serial)

Container Concurrency options

  • Thread-unsafe (Mutable)
  • Thread-safe (Immutable)
  • Concurrent (Map only)
  • Synchronized (unsafe on iterator)
  • Multi-reader (List, Set and Bag only – throws on iterator)

Container Implementations

  • Generified Object Containers
  • Primitive Containers (boolean, char, byte, short, int, float, long, double)
  • Unmodifiable Collections

Container Types

  • RichIterable
  • Collection
  • List
  • Set, SortedSet
  • Map, SortedMap
  • BiMap
  • Stack
  • Bag, SortedBag
  • Multimap (List, Set, SortedSet, Bag, SortedBag)
  • Interval
  • LazyIterable

Utility Classes

  • Iterate
  • MapIterate
  • ArrayIterate
  • StringIterate
  • ParallelIterate, ParallelArrayIterate, ParallelMapIterate
  • Predicates, Functions

API Style

  • Rich base API w/ 90+ methods
  • Object-Oriented – API directly available on the appropriate types (List, Set, etc.)
  • Functional (API leveraging lambdas/immutability)
  • Static Utility

Lambda Ready Functional Interfaces

  • Function, Predicate, Procedure (with arity from 0 up to 3)
  • Primitive Functions, Predicates, Procedures with combination of primitive/object types

3. How well does GS-Collections work with Java 8 and how does GS-Collections compare to the overhauled collections library in Java 8?

GS Collections works very well today with Java 8 lambdas and method references. Most of the functionality a developer would use from the new Streams API is already available on the GS Collections types. However, our types extend the Java types (for example MutableList extends List) so they inherit the new Streams API. Unlike the built-in container types, GS Collections has not been tuned for the Streams API yet, so the performance characteristics may vary. Java 8 adds just a few new methods to the existing collections interfaces, like forEach(), sort(), and stream(). The remaining new API is on the Stream interface returned by stream(). GS Collections has the equivalent methods right on the collections. For example, the new functionality collection.stream().filter() is already available as collection.select() in GS Collections.
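For readers who have not used Java 8 yet, the following JDK-only sketch shows where those new methods live: sort() is a default method on List, forEach() comes from Iterable, and filtering requires a detour through stream(). The class Java8CollectionMethods and the method shortNames are hypothetical names used only for this illustration. Per the answer above, the filter step is the part GS Collections exposes directly on the collection as select().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class: the handful of Java 8 additions to the collection interfaces.
final class Java8CollectionMethods {
    // Filtering is not on List itself; it lives on the Stream returned by stream().
    static List<String> shortNames(List<String> names) {
        return names.stream()
                .filter(name -> name.length() <= 4)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList("Smalltalk", "Java", "Ruby", "C"));

        names.sort(Comparator.naturalOrder()); // new default method on List
        names.forEach(System.out::println);    // new default method on Iterable

        System.out.println(shortNames(names)); // [C, Java, Ruby]
    }
}
```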

GS Collections has a lot of types you will not find in Java 8, like Bags, Multimaps, Immutable Collections, and Primitive Collections. We also support an eager API by default, as well as a lazy API on request via a call to asLazy(). The Streams API is primarily lazy. Lazy is often a good thing, but can be slower when you want to ultimately get back a collection.
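The eager versus lazy distinction can be seen with the JDK alone. In the sketch below, the eager version evaluates its predicate and builds the result list as soon as it is called, while the stream-based version does no work until a terminal operation asks for a result. The class EagerVersusLazy and the call counters are hypothetical devices for this illustration. The example does not use GS Collections or its asLazy() method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class contrasting eager and lazy evaluation of the same filter.
final class EagerVersusLazy {
    // Eager: the predicate runs and the result list is built before the method returns.
    static List<Integer> eagerEvens(List<Integer> input, AtomicInteger calls) {
        List<Integer> result = new ArrayList<>();
        for (Integer n : input) {
            calls.incrementAndGet();
            if (n % 2 == 0) {
                result.add(n);
            }
        }
        return result;
    }

    // Lazy: only a pipeline description is returned; the predicate has not run yet.
    static Stream<Integer> lazyEvens(List<Integer> input, AtomicInteger calls) {
        return input.stream().filter(n -> {
            calls.incrementAndGet();
            return n % 2 == 0;
        });
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6);

        AtomicInteger eagerCalls = new AtomicInteger();
        List<Integer> eager = eagerEvens(numbers, eagerCalls);
        System.out.println(eagerCalls.get()); // 6, all the work is already done

        AtomicInteger lazyCalls = new AtomicInteger();
        Stream<Integer> lazy = lazyEvens(numbers, lazyCalls);
        System.out.println(lazyCalls.get());  // 0, nothing has been evaluated yet

        List<Integer> collected = lazy.collect(Collectors.toList());
        System.out.println(lazyCalls.get());  // 6, the terminal operation triggered the work
        System.out.println(eager.equals(collected)); // true
    }
}
```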

And there is much more… GS Collections has been in development for 10 years, so it has accumulated a lot of functionality not easily found in a single library. We do not shy away from adding new types and new APIs, so GS Collections will continue to evolve and grow. We follow semantic versioning, and will keep deprecated APIs through one major release before removing them. We don’t remove much these days, but we often add new APIs in major versions. We strive to provide serialization compatibility between releases by leveraging a large battery of serialization tests.

4. Was there any particular inspiration or spark behind the design of the libraries themselves?

In the early days, the design inspiration was from Smalltalk. Later we saw similarities between a lot of our designs and the Scala collections library since Scala 2.8. This was a nice validation of our designs. We have been inspired to add a parallel lazy API to GS Collections since seeing it first in Scala, and now in Java 8. We have eager parallel utilities today, but I believe utility classes are harder to use and discover than an object-oriented API directly available and discoverable on containers.

5. How easy have new recruits to Goldman Sachs found it to pick-up the GS Collections library?

We train new hires using the GS Collections Kata. It is the best way to learn the basics of the framework. The Kata is available on GitHub as well. It’s set up as a series of unit tests which fail. You can fork the repository, read the training materials, and get the tests to pass. Then you can compare your solutions with our solutions branch. Eventually we hope to have online videos available for the GS Collections Kata so internal and external developers can learn directly from some of our GS Collections contributors and instructors.

6. We haven’t seen many banks open source their internal libraries. What motivated Goldman Sachs to open source the GS-Collections library and get involved with the wider Java community?

We use a lot of open source software in Goldman Sachs. We wanted to give something back that we thought would be valuable to the community. We also thought it was important to be part of the discussion that would lead to lambdas finally being added to Java. We have a lot of experience to offer in the use of a lambda ready collections framework solving real business problems. Having GS Collections out in the open has helped us provide valuable feedback and input to the process of adding lambdas to Java 8.

7. We’re a performance tuning company, so we’re always interested in the performance characteristics of new libraries. What are the performance characteristics of GS-Collections like compared to other collections libraries?

It’s easiest to talk about static memory usage. The most dramatic difference is in our implementations of hash tables. GSC’s UnifiedMap uses half the memory of the JCF HashMap because it doesn’t hold onto Entry objects. GSC’s UnifiedSet uses one quarter the memory of the JCF HashSet because it’s not implemented by delegating to a Map (wasting memory on values when only the keys are used) plus it also doesn’t hold onto Entry objects. Our Multimaps, BiMaps, and Bags are backed by our own hash tables; Guava’s are backed by the JCF’s HashMap.
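To illustrate the general idea of a set that neither delegates to a Map nor allocates an Entry object per element, here is a minimal sketch that stores elements directly in a single array using open addressing with linear probing. The class DirectArraySet is hypothetical and exists only for this example. The interview does not describe UnifiedSet's actual layout or collision handling, so this should not be read as its implementation.

```java
import java.util.Objects;

// Hypothetical sketch: elements live directly in one Object[] (no Map, no Entry objects).
// Collisions are resolved by linear probing; this is an assumption, not UnifiedSet's design.
final class DirectArraySet<T> {
    private Object[] table = new Object[16]; // capacity is always a power of two
    private int size;

    boolean add(T element) {
        Objects.requireNonNull(element, "null elements are not supported in this sketch");
        if ((size + 1) * 2 > table.length) {
            grow(); // keep the table at most half full so probing always finds a free slot
        }
        boolean added = insert(table, element);
        if (added) {
            size++;
        }
        return added;
    }

    boolean contains(Object element) {
        if (element == null) {
            return false;
        }
        int mask = table.length - 1;
        for (int i = spread(element) & mask; table[i] != null; i = (i + 1) & mask) {
            if (table[i].equals(element)) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return size;
    }

    // Places the element in the first free slot of its probe sequence; false if already present.
    private static boolean insert(Object[] slots, Object element) {
        int mask = slots.length - 1;
        int i = spread(element) & mask;
        while (slots[i] != null) {
            if (slots[i].equals(element)) {
                return false;
            }
            i = (i + 1) & mask;
        }
        slots[i] = element;
        return true;
    }

    // Doubles the capacity and re-inserts every element into the new array.
    private void grow() {
        Object[] bigger = new Object[table.length * 2];
        for (Object element : table) {
            if (element != null) {
                insert(bigger, element);
            }
        }
        table = bigger;
    }

    // Mixes the high bits of the hash code into the low bits used for indexing.
    private static int spread(Object element) {
        int hash = element.hashCode();
        return hash ^ (hash >>> 16);
    }

    public static void main(String[] args) {
        DirectArraySet<String> languages = new DirectArraySet<>();
        System.out.println(languages.add("Smalltalk"));  // true
        System.out.println(languages.add("Java"));       // true
        System.out.println(languages.add("Java"));       // false, already present
        System.out.println(languages.size());            // 2
        System.out.println(languages.contains("Ruby"));  // false
    }
}
```

The only per-element cost here is one array slot, whereas a set layered on a map pays for a key, a value reference, and an entry object for each element.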

It’s harder to compare runtime performance. Over the last 10 years of working on GS Collections, I have written lots of really bad micro-benchmark code, and then spent lots of time improving the micro-benchmarks I wrote in the hope of proving that our framework was “fast” or “faster”. This winds up being a bit of a fool’s folly in my opinion, as “fast” taken out of an application context can be mostly meaningless. I believe it is better for experts like yourselves to help applications profile and understand “their” code and tune it appropriately.

Our memory optimizations have some positive but difficult to measure effects in terms of performance, because we give the garbage collector a break by not generating garbage unnecessarily. We have memory benchmarks that compare the memory footprints of our containers alongside the equivalent types in the JDK, Guava and Trove. The benchmarks are in the performance-tests module in GitHub. Anyone can run these tests on their own with the latest version of the libraries. We also published slides to the GS Collections GitHub wiki which show a graphical version of the comparisons.

8. Is there any particular programming first-love that you have – for example what first set you on the road to being a developer?

I started programming by teaching myself BASIC on an Epson HX-20 when I was 11 years old. I knew then what I know now – that I would be a developer for the rest of my life. I learned Smalltalk in my 20 something years, and this changed my world-view completely. I then decided that I would not only enjoy coding the rest of my life, but I would teach others to learn how to love to code. I’ve spent the greater part of the past 10+ years trying to convince Java developers that there is a better, easier way to do things. GS Collections is a part of that story.

9. Anything else you would like to mention?

I hope Java developers discover and give GS Collections a try and enjoy using it as much as I have over the years. Check out the library in GitHub and the documentation and Javadoc available from the GitHub wiki. The binaries are available in Maven Central. The Kata is a great way not just to learn GS Collections, but to also learn how to use Java 8 Lambdas and method references.

The one link that can help you find everything you are looking for in regards to GS Collections is https://github.com/goldmansachs. You will find a project for GS Collections and a project for the GS Collections Kata here.

Thank you for asking me such great questions. I hope folks find the answers and information helpful. If there are any further questions about GS Collections I would encourage folks to ask them on StackOverflow tagged gs-collections, and we’ll be happy to answer them.

=============

Once more we’d like to thank Don for his detailed and thoughtful answers and encourage all of you to experiment with GS Collections for your next project or hack session!

Cheers,
Martijn (CEO) and the jClarity Team!
