Tuesday, March 31, 2009

10 Years in one Project

For about ten years now I have been the CTO at WorldLingo. During those years I have seen quite a few people join and eventually leave us. Below is a small snapshot of how time has passed. Obviously I am quite proud to have been something of a rock in the sea.


If you would like to know how the video was created, then read on.

I downloaded the source of the code_swarm project following the description, i.e. I used


svn checkout http://codeswarm.googlecode.com/svn/trunk/ codeswarm-read-only
cd codeswarm-read-only
ant all

to get the code and then ran ant all in its root directory:

C:\CODESW~1>ant all
Buildfile: build.xml

init:
[echo] Running INIT

build:
[echo] Running BUILD
[mkdir] Created dir: C:\CODESW~1\build
[javac] Compiling 18 source files to C:\CODESW~1\build
[copy] Copying 1 file to C:\CODESW~1\build

jar:
[echo] Running JAR
[mkdir] Created dir: C:\CODESW~1\dist
[jar] Building jar: C:\CODESW~1\dist\code_swarm.jar

all:
[echo] Building ALL

BUILD SUCCESSFUL
Total time: 6 seconds

Note that this is on my Windows machine. After the build you will have to edit the config file so that your settings and regular expressions match your project. I simply took the supplied sample config file, copied it, and modified these lines:

# This is a sample configuration file for code_swarm

...

# Input file
InputFile=data/wl-repevents.xml

...

# Project time per frame
#MillisecondsPerFrame=51600000

...

# Optional Method instead of MillisecondsPerFrame
FramesPerDay=2

...

ColorAssign1="wlsystem",".*wlsystem.*", 0,0,255, 0,0,255
ColorAssign2="www",".*www.*", 0,255,0, 0,255,0
ColorAssign3="docs",".*docs.*", 102,0,255, 102,0,255
ColorAssign4="serverconfig",".*serverconf.*", 255,0,0, 255,0,0

# Save each frame to an image?
TakeSnapshots=true

...

DrawNamesHalos=true

...

This just adjusts the labels and turns on the snapshots so that a video can be created at the end. I found a tutorial that explained how to set this up.
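For reference, had mencoder cooperated (see below), the snapshot frames could have been stitched into a video with an invocation along these lines. This is a sketch only - the frame glob, fps, and codec are assumptions on my part, not something I actually ran:

```shell
# Assemble the PNG snapshot frames from the frames/ directory into an
# MPEG-4 AVI; adjust the glob and fps to match your config settings.
mencoder "mf://frames/*.png" -mf fps=24:type=png \
  -ovc lavc -lavcopts vcodec=mpeg4 -o code_swarm.avi
```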

What did not work for me was getting mencoder to run. I downloaded the MPlayer Windows installer from its official site, and although it is supposed to include mencoder, it does not. Or I am blind.

So, I simply ran

mkdir frames
runrepositoryfetch.bat data\wl.config

to fetch the history of our repository spanning about 10 years - going from Visual SourceSafe to CVS, and currently running on Subversion. One further problem was that the output file of the above script was not named as I had specified in the config file, so I had to rename it like so:
cd data
ren realtime_sample1157501935.xml wl-repevents.xml

After that I was able to use run.bat data\wl.config to see the full movie in real time.

With the snapshots created, but not willing to dig further into the absence of mencoder, I fired up my trusted MacBook Pro and used QuickTime to create the movie from an image sequence.

Once QuickTime had done its magic I saved the .mov file and used VisualHub to convert it to a proper video format to upload to Vimeo. And that was it, really.

Monday, March 16, 2009

CouchDB and CouchApp

As mentioned in my previous post, I wanted to see CouchDB in action and decided to "push" one of the available sample applications into it. CouchDB has a built-in web server to support its REST-based API. The developers were smart enough to see its broader use and allow applications to be uploaded into a database. An application is a set of static HTML files, images, and JavaScript files that can form a fully functional web application, with the data stored directly in that very same database. With CouchDB's built-in replication you get a fully distributed application. Sweet!
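To get a feel for that REST-based API, here is a minimal sketch of talking to a local CouchDB instance with curl. The "albums" database name and the document content are made up for illustration; this assumes a server running on the default port 5984:

```shell
# Create a database with a plain HTTP PUT
curl -X PUT http://localhost:5984/albums

# Store a JSON document under an id of our choosing
curl -X PUT http://localhost:5984/albums/abba-arrival \
  -d '{"artist":"ABBA","title":"Arrival"}'

# Read it back with a simple GET
curl http://localhost:5984/albums/abba-arrival
```

Everything - databases, documents, even the application files pushed by CouchApp - is reachable through this same HTTP interface.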

Sure, you will have to install a load balancer in front of multiple instances of CouchDB, but that is a simple engineering task. I am not sure how session handling will work, though. Without a somewhat standard session handling framework it may be difficult to build state-aware applications. You can of course save the session in the database and reload it upon each request. Is it replicated fast enough, though, for random access to cluster nodes?

Back to the sample application. I chose the Twitter client provided by Chris Anderson from the CouchDB team (btw, his blog is now hosted directly on CouchDB). I got the tar ball from the above GitHub repository and unpacked it. To get it "pushed" into a CouchDB database you need the actual CouchApp tool as well. Download its tar ball too and unpack it. Now we can install CouchApp first. I went with the README file provided and tried the Ruby version, installing it as a gem:


$ sudo gem update --system
$ sudo gem install couchapp

While that installed fine, the syntax documented, for example, for pushing the application into the database does not match the Ruby version of CouchApp. After talking to Chris on the IRC channel he advised uninstalling the Ruby version and using the Python-based version instead. I uninstalled the gem and ran:

$ sudo easy_install couchapp

Now the syntax of couchapp did match and I was ready to upload the Twitter sample application - or was I?

Not so fast though: every attempt to upload the application resulted in partial failures. I compared what I had in the uploaded database with what Chris had. His had the required attachments, mine did not. The database had been created, but half of the files were missing. After another quick chat with Chris we realized that he was using the trunk version of CouchDB while I was using the latest release. Mine was simply outdated and missing the new application extensions. Here are the next steps I had to run:


$ cd /downloads
$ svn co http://svn.apache.org/repos/asf/couchdb/trunk couchdb
$ cd couchdb/
$ ./bootstrap
$ ./configure
$ make && sudo make install
$ sudo -i -u couchdb couchdb
and to push the application:

$ cd ../jchris-couchdb-twitter-client-6bee14ae1b3525d56d77dd9c114002582dc0abe8
$ couchapp push http://localhost:5984/test-twitter

As expected, the application push succeeded now. I was able to see all the files in the newly created database. The screen shot shows the design view of the new "test-twitter" database. Everything you need to serve the application is there, even the favicon.ico displayed in the browser's address bar. If you click on "index.html" the newly uploaded application starts, and after logging into Twitter I had everything running as I wanted.

Here is a screen shot of the application running. This is really great. I just have to figure out for myself how to use it, either for work or privately. So many choices - but only 24 hours in a day.

Sunday, March 15, 2009

Erlang and CouchDB on MacOS

As I mentioned before, I am looking into how I could use Erlang in our own efforts. Sure, it is not the silver bullet for all problems ("Can it make coffee?") despite all the hype on the developer ether. But it is certainly built to be the basis of large concurrent systems, for example Facebook's chat system. Or the below-mentioned Amazon Dynamo clone called Dynomite.

For starters I purchased The Pragmatic Programmers' screencast Erlang in Practice with Kevin Smith. I have to say, it is worth every cent! I watched it on a flight from Munich to Las Vegas and it made the hours literally "fly by". There is something really cool about seeing a program being developed in front of your eyes and rewritten many times to explain more advanced concepts as you go along. Highly recommended.

Now, I have a background in Prolog and Lisp, so getting into the Erlang way of doing things was not too difficult. Of course, the difficulty yet again is how to use it for something of my own. I decided to first build Erlang on my MacBook Pro and then try an Erlang-based system to see it working. Building Erlang on MacOS is described here, but overall it is a simple "wget && tar -zxvf && ./configure && make" obstacle course with a few hiccups thrown in for good measure. As the post describes, you first need to install XCode from either the supplied MacOS disks or by downloading it from the net - easy peasy. Next is building libgd. The above post has a link to the details required to build libgd on MacOS. It first required downloading all the required library tar balls and then running the following commands:


$ cd /downloads/
$ tar -zxvf zlib-1.2.3.tar.gz
$ tar -zxvf gd-2.0.35.tar.gz
$ tar -zxvf freetype-2.3.8.tar.gz
$ tar -zxvf jpegsrc.v6b.tar.gz
$ tar -zxvf libpng-1.2.34.tar.gz

This assumes all tar balls are saved in the "/downloads" directory. Next is zlib:

$ cd zlib-1.2.3 ; ./configure --shared && make && sudo make install
$ ./example

Then libpng:

$ cd ../libpng-1.2.34
$ cp scripts/makefile.darwin Makefile
$ vim Makefile
$ make && sudo make install
$ export srcdir=.; ./test-pngtest.sh

Next is the jpeg library. Here I did not have to symlink libtool as described in the post. I assume that is because I am on MacOS 10.5. So all I did was this:

$ cd ../jpeg-6b/
$ cp /usr/share/libtool/config.sub .
$ cp /usr/share/libtool/config.guess .
$ ./configure --enable-shared
$ make
$ sudo make install
$ sudo ranlib /usr/local/lib/libjpeg.a

We are getting closer. The freetype library needs these steps, where the last line is for the subsequent libgd build:

$ cd ../freetype-2.3.8
$ ./configure && make && sudo make install
$ sudo ln -s /usr/X11R6/include/fontconfig /usr/local/include

OK, now the libgd library:

$ cd ../gd-2.0.35
$ ln -s `which glibtool` ./libtool
$ ./configure
$ make && sudo make install
$ ./gdtest test/gdtest.png

With this all done we can build Erlang from sources like so:

$ cd /downloads
$ tar -zxvf otp_src_R12B-5.tar.gz
$ cd otp_src_R12B-5
$ ./configure --enable-hipe --enable-smp-support --enable-threads
$ make && sudo make install
$ erl

All cool, the Erlang shell starts up and is ready for action:

$ erl
Erlang (BEAM) emulator version 5.6.5 [source] [smp:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.6.5 (abort with ^G)
1> "hello world".
"hello world"
2>

With this in place I decided to try CouchDB. Again, here are the steps to get it running:


$ cd /downloads
$ tar -zxvf apache-couchdb-0.8.1-incubating.tar.gz
$ cd apache-couchdb-0.8.1-incubating
$ less README
$ sudo port install automake autoconf libtool help2man
$ sudo port install icu spidermonkey
$ ./configure
$ make
$ sudo make install
$ sudo -i -u couchdb couchdb

I did not have to execute the line $ open /Applications/Installers/Xcode\ Tools/XcodeTools.mpkg. I assume that is because I had done the full XCode install beforehand. You can also see that you need to install the MacPorts tools. With those you will have to add a few binary packages to be able to build CouchDB, especially the Unicode library ICU and SpiderMonkey, the C-based JavaScript library provided by the Mozilla organization. The last line starts up the database, and directing your browser to http://localhost:5984/_utils/ lets you see its internal UI. Now relax! ;)

If you look closely at the screen shot you will see an inconsistency with the notes above. I will describe this in more detail in a future post about getting the sample Twitter client for CouchDB running using CouchApp.

Saturday, March 7, 2009

HBase vs. CouchDB in Berlin

I had the pleasure of presenting our involvement with HBase at the 4th Berlin Hadoop Get Together. It was organized by Isabel Drost. Thanks again to Isabel for having me there, I thoroughly enjoyed it. First off, here are the slides:



HBase @ WorldLingo

The second talk was given by Jan Lehnardt, a CouchDB team member. I have been looking into Erlang for the last few months to see how we could use it for our own efforts. CouchDB is one of the projects you come across when reading articles about Erlang. So it was really great to have Jan present too.

At the end of both our talks it was great to see how the questions from the audience at times tried to compare the two. So, is HBase better or worse than CouchDB? Of course, you cannot compare them directly. While they share common features (well, they store data, right?) they are made to solve different problems. CouchDB offers schema-free storage with built-in replication, which can even be used to create offline clients that sync their changes with another site once they have connectivity again. One of the features puzzling me most is the ability to use it to serve your own applications to the world. You create the pages and scripts you need and push them into the database using CouchApp. Since the database already has a built-in web server, it can handle your application's requirements implicitly. Nice!
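That built-in replication is itself just another REST call. A minimal sketch, assuming two hypothetical copies of a "mydb" database (the names and the second host are made up for illustration):

```shell
# Ask the local CouchDB to replicate one database into another
# via the _replicate resource; the target can be a remote instance.
curl -X POST http://localhost:5984/_replicate \
  -d '{"source":"mydb","target":"http://otherhost:5984/mydb"}'
```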

I asked Jan if he had concerns about scaling this, or if it wouldn't be better to use Apache or Nginx to serve the static content. His argument was that Erlang can handle many, many more concurrent requests than Apache can, for example. I read up on Yaws and saw his point. So I guess it is then a question of memory and CPU requirements. The former is apparently another strength of CouchDB, which has been shown to serve thousands of concurrent requests while needing only about 10MB of RAM - how awesome is that?!?! I am not sure about the CPU side - but I would wager it is equally sane.

Another Erlang project I am interested in is Dynomite, an Erlang-based Amazon Dynamo "clone" (or rather implementation). Talking to Cliff, it seems it is awesome, leveraging the Erlang OTP abilities to create something that a normal Java developer and their JRE are just not used to.

And that brings me to HBase. I told the crowd in Berlin that as of version 0.18.0 HBase is ready for anyone to get started with - given they read the Wiki to set the file handles right and a few other bits and pieces.

Note: I was actually thinking about suggesting an improvement to the HBase team: a command line check that can be invoked separately or is called when "start-hbase.sh" runs, checking a few of these common parameters and printing warnings to the user. I know that the file handle count is printed in the log files, but for a newbie that is a bit too deep down. What could be checked? First off, the file handles being, say, 32K. Next are the newer resource limits introduced with Hadoop that now need tweaking, for example the "xciever" (sic) value. This again is documented in the Wiki, but who reads it, right? Another common issue is RAM. If the master knows the number of regions (or while it is scanning the META table to determine it) it could warn if the JRE is not given enough memory. Sure, there are no hard boundaries, but better to see a Warning: Found x regions. Your configured memory for the JRE seems too low for the system to run stable.
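The file handle part of such a preflight check could be as simple as a small shell function. This is only a sketch of the idea - check_file_handles is my own hypothetical helper, not part of HBase, with the 32K threshold matching the Wiki recommendation:

```shell
#!/bin/sh
# Hypothetical preflight check: warn if the open file handle
# limit looks too low for HBase (check_file_handles is made up).
check_file_handles() {
  limit=$1
  if [ "$limit" -lt 32768 ]; then
    echo "WARNING: open file handle limit is $limit, recommended at least 32768"
  else
    echo "OK: open file handle limit is $limit"
  fi
}

# Check the limit of the current shell
check_file_handles "$(ulimit -n)"
```

The same pattern would extend to the dfs.datanode.max.xcievers value and a region-count-versus-heap warning.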

Back to HBase. I also told the audience that as of HBase 0.19.0 scanning was much improved speed-wise and that I am happy with where we are nowadays in terms of stability and speed. Sure, it could be faster for random reads, so that I might be able to drop my MemCached layer. And the team is working on that. So, here's hoping that we will see the best HBase ever in the upcoming version. I for myself am 100% sure that the HBase guys can deliver - they have done so in the past and will do so now as well. All I can say is - give it a shot!

So, CouchDB is lean and mean while HBase is, from my experience, a resource hog. But HBase is also built to scale to petabyte-size data. With CouchDB you would have to add sharding on top of it, including all the issues that come with it, for example rebalancing, fail-over, recovery, adding more servers, and so on. For me HBase is the system of choice - for this particular problem. That does not mean I could not use CouchDB, or even Erlang for that matter, in a separate area. Until then I will keep a very close eye on this exciting (though, in the case of Erlang, not new!) technology. May the open-source projects rule and live long and prosper!