I recently attended two conferences: Hadoop World 2010, here in New York City, and Strange Loop 2010, in St. Louis, MO. Strange Loop’s location was, you could say, the stranger of the two, but it proved astoundingly strategic: being near the middle of the U.S., it drew people from all over the country, and sure enough I met a lot of people from both the West and East coasts, and other places in between.
Also, St. Louis is a quaint little city without many distractions, yet full of good places to eat and nice people.
While it is not generally easy to socialize with tech people, as anyone in the industry can attest, I found the Strange Loop crowd a surprisingly friendly bunch. It probably has to do with the content of the conference, which I could describe as an eclectic mix of ideas from the forefront of technology.
One big theme was parallelism and concurrency; several very smart people, including Guy Steele, contributed talks here. The other big theme was the NoSQL trend, with practical examples from various types of databases and business scenarios.
Before I begin, I would like to point out that the PDF slides of most of the talks can currently be found at http://strangeloop2010.com/talk/presentations
Below, I will attempt to describe very briefly some of the talks that caught my interest.
A book recommended for the ideas presented was “Open Source SOA”.
The talk began by defining some of the building blocks.
Complex Event Processing (CEP) is a technology to process events and discover complex patterns among multiple streams of event data.
Event Stream Processing (ESP) involves processing multiple streams of event data with the intention of identifying meaningful events within those streams and deriving meaningful information from them.
Finally, ESPER (http://www.infoq.com/news/2007/10/esper) is an open source ESP framework and a CEP engine. It also provides an Event Processing Language (EPL) for dealing with high-frequency, time-based event data.
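To make this concrete, here is a minimal sketch of what ESPER usage looks like in Java, roughly in the style of the 3.x-era API of the time; the StockTick event class and the query are my own illustration, not taken from the talk:

```java
import com.espertech.esper.client.*;

public class EsperDemo {
    // A hypothetical event type: a plain POJO with getters.
    public static class StockTick {
        private final String symbol;
        private final double price;
        public StockTick(String symbol, double price) { this.symbol = symbol; this.price = price; }
        public String getSymbol() { return symbol; }
        public double getPrice() { return price; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("StockTick", StockTick.class.getName());
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // EPL: a continuous query over a 30-second sliding window of events.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
            "select symbol, avg(price) as avgPrice " +
            "from StockTick.win:time(30 sec) group by symbol");

        // The listener fires as new results become available.
        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                System.out.println(newEvents[0].get("symbol")
                    + " avg=" + newEvents[0].get("avgPrice"));
            }
        });

        engine.getEPRuntime().sendEvent(new StockTick("IBM", 140.25));
        engine.getEPRuntime().sendEvent(new StockTick("IBM", 141.00));
    }
}
```

Note that the query runs continuously as events stream in, which is the essential inversion from the traditional “data at rest, queries on demand” database model.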
Another relevant and recommended resource mentioned was the book on Enterprise Integration Patterns (eaipatterns.com).
Edward then went on to give a practical example of an electronic trading system.
Yehuda Katz, “Making Your Open Source Project More Like Rails”
Yehuda, best known from Ruby/Rails circles, gave a non-technical talk that was in no way specific to Rails; it applied to open source projects in general and the lessons they can learn from the success of Rails.
The emerging ideas were:
Rails was optimized for “developer happiness” and not for “performance”
You have to “optimize” for something, and as such make compromises. Once you choose your main “optimization factor”, it is possible afterwards to improve other factors (for example, performance). The point is that it is very important to have “developer happiness” as a central focus, and that unfortunately many open source projects neglect this and thus get neglected themselves.
Nothing Beats Adoption
Release early, and get people to contribute and to give suggestions.
Try NOT to tie the project to a particular company. Although it was spearheaded by 37signals, Rails has in no way been tied to them; rather, it was given into the hands of the community at large. Yehuda thinks that having a company behind a project is therefore dangerous and non-optimal. Examples that come to mind are MongoDB (10gen) and jQuery (the jQuery company).
Attribution and Credit Builds Community
The idea is that we, as human beings, often tend to underestimate the potential of the network effect. The MIT license is probably best suited for taking advantage of the network effect around open source developer software, while the GPL (for example) would be more suited to smaller, more contained applications such as Adium.
Another point is that the importance of “marketing” should not be underestimated: things like blog posts and showing practical applications of your software are very powerful tools for encouraging adoption. To add my own comment here, this is to be contrasted with “geeky”, “dark room” projects that, apart from technical discussion, show little interest in discussing any practical application. Unfortunately, in my experience this is all too often the case with open source projects.
NoSQL
I attended several very interesting talks on NoSQL databases, another hotly debated topic. More specifically, I got a good mix of “success stories” (Steve Smith, Real World Modeling with MongoDB) and “failure stories” about avoiding the pitfalls (Billy Newport, Enterprise NoSQL: Silver Bullet or Poison Pill?).
Steve Smith, “Real World Modeling with MongoDB”
I was attracted from the start by the title of this presentation, and I was not disappointed. Steve talked about real-world experiences at his startup (Harmony, http://get.harmonyapp.com/ - a CMS for building websites) and what motivated their move from MySQL to MongoDB.
As a general observation of mine, whenever your application has to deal with “dynamic data types” (data types that can only be decided at runtime), there are a few possible approaches:
1. Model using SQL relationships (polymorphic associations, multiple table inheritance, etc.)
2. Model using Entity Attribute Value (EAV), http://en.wikipedia.org/wiki/Entity-attribute-value_model
3. Model using a NoSQL database
We can think of a few real-life scenarios where this is the case. One example is any kind of CMS; a more specific one is an eCommerce platform. (Unrelated to this conference, a friend of mine has blogged about the need for NoSQL/MongoDB within the eCommerce space: http://www.doctrine-project.org/blog/mongodb-for-ecommerce )
The second approach, EAV, has traditionally proven unacceptably slow. Some frameworks still use it (the Spree Rails eCommerce project, for example) and I don’t know about their performance, but in general it is not a recommended approach. It is actually one of the reasons why the semantic web ideas, which rely essentially on this kind of modeling, have not been more successful so far.
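To see why EAV queries hurt, consider that every attribute you filter on costs another self-join against the same table. Here is a hedged sketch in Java/JDBC; the entity_attributes table and the connection details are hypothetical, purely for illustration:

```java
import java.sql.*;

public class EavQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: entity_attributes(entity_id, attr_name, attr_value),
        // one row per attribute of each entity.
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/shop", "user", "pass");

        // Filtering on just two attributes already needs a self-join;
        // N attributes means N-1 joins over the same (large) table.
        String sql =
            "SELECT a.entity_id " +
            "FROM entity_attributes a " +
            "JOIN entity_attributes b ON b.entity_id = a.entity_id " +
            "WHERE a.attr_name = 'color' AND a.attr_value = 'red' " +
            "  AND b.attr_name = 'size'  AND b.attr_value = 'XL'";

        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) System.out.println(rs.getLong("entity_id"));
        }
    }
}
```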
Going back to Steve’s talk: their initial approach was to model the dynamic data types in MySQL. After developing the entire application that way, they realized it had become an entangled mess that was very hard to maintain and no fun at all to extend. Switching to Mongo worked out very well for them, and they have been very pleased with the results.
When data “belongs together”, a document store like Mongo offers an alternative to SQL joins in the concept of “embedding”: related information is embedded inside each data item (as opposed to being spread over several tables) - for example, a “template” document that contains particular subfields embedded into it. Of course, this will not work well if you need frequent independent access to the subdata; the idea is to model for what you will do “most” (e.g. 99%) of the time.
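Here is a minimal sketch of embedding with the 2.x-era Java driver (the talk itself was Ruby-centric, and the "pages" collection and its fields are my own invented example, not Harmony's actual schema):

```java
import com.mongodb.*;
import java.util.Arrays;

public class EmbeddingDemo {
    public static void main(String[] args) throws Exception {
        DB db = new Mongo("localhost", 27017).getDB("harmony_demo");
        DBCollection pages = db.getCollection("pages");

        // The template and its regions live inside the page document itself,
        // instead of in separate joined tables.
        BasicDBObject page = new BasicDBObject("slug", "home")
            .append("title", "Home")
            .append("template", new BasicDBObject("name", "two-column")
                .append("regions", Arrays.asList("header", "main", "sidebar")));
        pages.insert(page);

        // The common case, rendering a page with its template,
        // becomes a single lookup with no joins.
        DBObject home = pages.findOne(new BasicDBObject("slug", "home"));
        System.out.println(home);
    }
}
```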
When it comes to storing images and files (binary data), Mongo is well suited for keeping them directly in the db (done efficiently through its GridFS storage specification). This brings many benefits: for example, when backing up your database you automatically include all the binary files, without having to process them separately as with traditional SQL.
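As a sketch of what that looks like with the same era's Java driver (the file names here are hypothetical), storing and retrieving a binary file through GridFS is just a couple of calls:

```java
import com.mongodb.*;
import com.mongodb.gridfs.*;
import java.io.File;

public class GridFsDemo {
    public static void main(String[] args) throws Exception {
        DB db = new Mongo("localhost", 27017).getDB("harmony_demo");
        GridFS fs = new GridFS(db);   // backed by the fs.files / fs.chunks collections

        // Store a binary file directly in the database...
        GridFSInputFile stored = fs.createFile(new File("logo.png"));
        stored.setContentType("image/png");
        stored.save();

        // ...and read it back; a normal database backup now covers it too.
        GridFSDBFile found = fs.findOne("logo.png");
        found.writeTo("logo-copy.png");
    }
}
```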
Steve gave specific examples of data types such as items, templates, and activity streams. Despite not going into the guts and internals of why things work the way they do, this was altogether a useful and very practical talk, and it essentially convinced me to go with Mongo/NoSQL for my own startup idea, as opposed to experimenting with SQL first.
Billy Newport, “Enterprise NoSQL: Silver Bullet or Poison Pill?”
Billy Newport’s talk was situated on the “enterprise” side of things, as opposed to Steve’s earlier experience with a fairly small startup. He has implemented NoSQL solutions for large clients, and in some cases they proved to be failures (from which lessons were learned) because the clients didn’t realize the drawbacks that come with NoSQL.
One thing I need to point out is the drawback of labeling things “NoSQL” when in fact so many different databases fall under that category: for example, key-value stores like Redis, column-family stores like Cassandra, and document stores like Mongo. This was mentioned in the talk, but perhaps not emphasized enough.
It was pointed out that there are no “joins” as in SQL, so you have to do a full (and thus very inefficient) table scan, implementing database access programmatically through algorithms such as MapReduce. That is certainly a valid concern with many of the NoSQL dbs, although some, such as Mongo, provide very fast querying abilities.
Other topics discussed included having a single System of Record (SOR), as in SQL, versus having multiple ones, and choosing to denormalize data (in NoSQL) for performance versus normalizing it the SQL way. A host of new problems arises when there are multiple clusters of data (on separate machines), which is often the case in large enterprises. There, with NoSQL it will often be impossible to do multi-table scans “online”, meaning with real-time responses; those kinds of scans have to be done programmatically. The preferred approach is instead to run map/reduce algorithms offline and to cache as much as possible, for as many queries as possible. Because of the many possible combinations, things like group by/limit/joins will always be hard to cover.
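To illustrate what “doing it programmatically” means, here is a hedged sketch in Java of the kind of GROUP BY the database no longer does for you; the Order type and the data are invented stand-ins for a full collection scan:

```java
import java.util.*;

public class GroupByByHand {
    static final class Order {
        final String customerId;
        final double amount;
        Order(String customerId, double amount) { this.customerId = customerId; this.amount = amount; }
    }

    public static void main(String[] args) {
        // Stands in for scanning every document in an "orders" collection.
        List<Order> scan = Arrays.asList(
            new Order("c1", 19.99), new Order("c2", 5.00), new Order("c1", 7.50));

        // The "map" and "reduce" done by hand: bucket by key, fold the values.
        Map<String, Double> totals = new HashMap<>();
        for (Order o : scan)
            totals.merge(o.customerId, o.amount, Double::sum);

        System.out.println(totals);   // {c1=27.49, c2=5.0}
    }
}
```

Run once over a real dataset, the result of such a pass is exactly what you would precompute offline and cache, rather than recompute per query.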
Another very important idea was that, with NoSQL, you need to know in advance the kinds of operations (queries, updates) you will perform on your data types, and that will dictate the way you design and model your data. If, for example, you decide to embed or partition data in a certain way, you can say goodbye to efficient querying in ways that don’t match that model. This is in sharp contrast to the misconception that with NoSQL “anything goes” and “upfront modeling” is unimportant - quite the contrary, correct upfront modeling is essential with NoSQL.
The NoSQL panels and these talks drove home the idea that there is no “magic bullet” solution that will work in all circumstances: different needs are served optimally by different dbs. Also, DBAs will still be needed, since a lot of work needs to be done on the db side, whether that db is SQL or NoSQL.
Concurrency and Parallelism
There were several interesting talks dealing with the related concepts of parallelism (multi-core, multi-processor or distributed processing) and concurrency.
Guy Steele, “How to Think about Parallel Programming: Not!”
Guy L. Steele Jr. is one of the brightest minds alive in Computer Science, and I found his talk incredible in both its delivery and content. Guy began in a very funny way - by telling us how he spent his weekend reverse-engineering a computer program he wrote decades ago, from which all he had left was a paper card of zeroes and ones (from back in the times when punch cards were in use).
What began as a half-serious joke left the audience stupefied as Guy turned the instructions into assembly code and went on and on about the intricacies of writing that particular assembly code. He recalled the old ways and tricks: register specifics, interrupt codes, bits for communicating with a matrix printer, even bit patterns. It turned out this whole apparently crazy part had an extremely good point - it showed how things used to be, and how difficult it really was to get things done. And then how things have steadily evolved:
from coding in octal or decimal
to relocating assemblers and linkers
to expression compilation
to register allocation
to stack management of local data
to heap management
to virtual memory / address mapping
In other words, things have been evolving toward higher “abstractions” while the lower-level work has been automated. Steele then explained that when it comes to parallel computing, what we want is to have that layer automated as well - in other words, to not have the developer deal with it directly.
In order to achieve that, however, applications have to be written in ways that are amenable to parallelism. Enter divide-and-conquer, and out with the “accumulator pattern”, the latter being how applications have traditionally been written.
Then came a beautiful example of such an algorithm applied to a practical problem: splitting a string into words. It turns out that expressing the problem in these terms requires fairly different thinking, and ingenuity in the choice of data structures and the ways of combining solutions to subproblems. In essence, we end up implementing “MapReduce”-like algorithms.
What also came out of this example is that certain algebraic properties of the chosen data structures and operators are essential to the process - namely associativity, commutativity, idempotence, and the existence of an identity and a zero. (Who would have thought pesky math would come to be of such use! :)
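As a rough reconstruction (mine, in Java - not Guy’s original code from the talk), the key is a partial-result type whose combine operation is associative, so the two halves of the string can be processed independently and merged:

```java
import java.util.*;

public class WordSplit {

    // A partial answer over a substring: either a Chunk (no space seen yet)
    // or a Segment with a possibly-empty unfinished word on each edge plus
    // the complete words in between.
    static final class Part {
        final boolean chunk;
        final String left, right;   // for a Chunk, only 'left' is used
        final List<String> words;
        Part(boolean chunk, String left, List<String> words, String right) {
            this.chunk = chunk; this.left = left; this.words = words; this.right = right;
        }
    }

    static Part single(char c) {
        return c == ' '
            ? new Part(false, "", new ArrayList<String>(), "")
            : new Part(true, String.valueOf(c), null, null);
    }

    // The associative combine: the right fringe of 'a' glues onto the left
    // fringe of 'b'. Associativity means the pieces can be grouped in any
    // order, hence computed independently and in parallel.
    static Part combine(Part a, Part b) {
        if (a.chunk && b.chunk) return new Part(true, a.left + b.left, null, null);
        if (a.chunk) return new Part(false, a.left + b.left, b.words, b.right);
        if (b.chunk) return new Part(false, a.left, a.words, a.right + b.left);
        List<String> words = new ArrayList<String>(a.words);
        String glued = a.right + b.left;   // the word straddling the split point
        if (!glued.isEmpty()) words.add(glued);
        words.addAll(b.words);
        return new Part(false, a.left, words, b.right);
    }

    // Divide and conquer instead of a left-to-right accumulator.
    static Part split(String s, int lo, int hi) {
        if (hi - lo == 1) return single(s.charAt(lo));
        int mid = (lo + hi) / 2;
        return combine(split(s, lo, mid), split(s, mid, hi));
    }

    public static void main(String[] args) {
        String s = "This is a sample sentence";
        Part p = split(s, 0, s.length());
        List<String> out = new ArrayList<String>();
        if (!p.chunk) {
            if (!p.left.isEmpty()) out.add(p.left);
            out.addAll(p.words);
            if (!p.right.isEmpty()) out.add(p.right);
        } else if (!p.left.isEmpty()) {
            out.add(p.left);
        }
        System.out.println(out);   // [This, is, a, sample, sentence]
    }
}
```

Because combine is associative, the recursion above could hand each half to a separate core (e.g. via a fork/join pool) without changing the result - which is exactly the point of banishing the sequential accumulator.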
Paul King, “Groovy and Concurrency”
Somewhat related to Guy’s concepts was Paul King’s talk on Groovy and concurrency. Although it involved Groovy, the concepts described were very general. King gave a tour de force of several ways of doing concurrency, and actually implemented Guy’s problem in several different ways using Groovy’s concurrency features. The slides for this talk are not up at the time of this writing, but I will link to them as soon as they are.