concurrency: part 2 - actors

Posted by anton
on Friday, September 19, 2008

message-passing

if shared memory makes concurrent programming difficult, what else is there that an app developer can use?

one way of representing coarse-grained parallelism is through message-passing concurrency.

the idea is pretty simple – the only way to share state between isolated components is through message passing. what happens inside the components is their own business.

there is no global state of the whole system, unlike in a shared-memory program, which behaves like one giant state machine. sending and receiving of messages happens concurrently with the computations performed by the components themselves. this approach is much easier for the developer to reason about, and it maps easily to multiple CPUs and multiple machines.

a lot of the enterprise workflow-like processing falls into this model. it is basically a pipeline of worker pools, configured independently and possibly running on separate machines. a unit of work travels through the pipeline, as it is being handed off from one set of workers to another.

actors

one of the common implementations of message-passing concurrency is the actor model. i'll take the liberty of interpreting this model to fit my own developer needs, even though i am butchering decades of academic research in the process.

the actor model is a good fit for multiple computers talking to each other using network packets, components exchanging messages over JMS, independent processes or threads talking to each other using messages – basically anything where isolation is possible and interactions are loosely coupled.

usually each actor has a mailbox associated with it (often represented with a queue), where messages are stored until an actor processes them. messages map well to physical artifacts in the real world – they are immutable, and only one actor can handle a given message at a time.
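
to make the mechanics concrete, here is a minimal sketch in plain java (the class and message names are made up) – a mailbox can be as little as a blocking queue drained by a single thread:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // an immutable message - safe to hand off, nobody can modify it in flight
    final class Greeting {
        final String text;
        Greeting(String text) { this.text = text; }
    }

    // a minimal actor: a mailbox plus a single thread draining it,
    // so only one message is ever processed at a time
    class GreetingActor implements Runnable {
        private final BlockingQueue<Greeting> mailbox = new LinkedBlockingQueue<Greeting>();

        void send(Greeting msg) { mailbox.add(msg); }  // asynchronous, the sender does not block

        public void run() {
            try {
                while (true) {
                    Greeting msg = mailbox.take();      // wait for the next message
                    System.out.println("hello, " + msg.text);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();     // asked to shut down
            }
        }
    }

you would start it with new Thread(actor).start() and talk to it only through send().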

actors are connected with channels; individual actors are isolated from each other – a failure of an actor does not affect another actor. no other actor can change the inner state of a given actor – the only way to communicate is through message-passing.

messaging is usually asynchronous, but synchronous messaging could also be useful.

depending on the implementation, beware of deadlocks if you are using synchronous messaging. another issue to keep in mind is message ordering – some implementations do not preserve it.

while some advocate an “everything is an actor” approach, and i get dizzy imagining the possibilities, the pragmatic app developer in me lives in the real world among existing apps. in that case actors work best as a library for an existing language.

erlang

although i shied away from the “actors everywhere” approach above, erlang is the most successful implementation that actually does just that. it is not just a language, but a whole platform that transparently runs actors within a single process as well as across multiple machines.

as this topic is heating up, one should at least read the book and play with the language. after all, a language that doesn’t affect the way you think about programming is not worth knowing, and erlang is enough of a paradigm shift to kickstart your concurrency thinking.

Tibco BusinessWorks

as i’ve described before, BusinessWorks (BW) is an example of an integration DSL that happens to use actors.

given an integration process (e.g. receive a message on JMS queue A, enrich it from a database, transform it, and send it to a JMS topic B), you describe it using BW language constructs. then it becomes an actor definition that you can deploy on an engine (really a managed JVM instance). there could be multiple engines running on multiple machines, and each engine can have many process instances (aka actors in our terminology) running inside of it. a process instance gets created from a process definition whenever a new message arrives on a queue (mailbox in actors’ terminology).

a scheduler inside the individual engine takes care of creating process instances (there could be thousands) and scheduling them on the worker threads.

all of this mapping happens at deploy time; as a developer you do not have to worry about it.

actors talk to each other using message-passing, so your actor implementation does not even have to worry about threads or concurrency – you just express your integration logic. you could use shared memory instead, but it would not scale well, since you would be limited to one JVM; nor would it be natural, since you would have to use explicit language constructs. the language support for immutability is very convenient here, as i have mentioned earlier.

if you use a JMS server to pass messages around, it becomes a sort of a mailbox, holding messages for you in the queue. each incoming message would eventually spawn an instance of the actor, feeding it the message as an argument. multiple instances of the same actor can read from the same queue, thus achieving load-balancing.

once you recall that jms supports selectors (filters, essentially), you have an actors implementation that curiously resembles something like erlang.
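
a rough sketch of what the receiving side looks like with the plain JMS API (the queue name and selector are made up; the connection factory would normally come from JNDI):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageListener;
    import javax.jms.Queue;
    import javax.jms.Session;

    // each consumer behaves like an actor reading from its mailbox (the queue);
    // several consumers on the same queue give you load-balancing for free
    public class OrderWorker {
        public static void start(ConnectionFactory factory) throws JMSException {
            Connection connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue mailbox = session.createQueue("orders");

            // the selector narrows down which messages this worker cares about,
            // a bit like selective receive in erlang
            MessageConsumer consumer = session.createConsumer(mailbox, "region = 'EMEA'");

            consumer.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                    // process the message; hand off further work here
                }
            });
            connection.start();
        }
    }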

note that this is not fine-grained parallelism; your units of work are more coarse-grained and very loosely coupled. fundamentally, though, the model is the same, and it scales like crazy, achieving massive throughput.

even if you do not end up using BW, you can implement this model by hand relatively easily.

so what if i wanted more fine-grained and more efficient support for actors in my language of choice (provided i am not using erlang)?

ruby

the revactor networking library includes an actors implementation (also see this great intro to actors by Tony Arcieri), but i have not seen a more generic approach yet.

note that ruby is really hampered by its lack of proper threading support; this is why the jruby guys would be in much better shape if they were to roll their own actors implementation.

scala

this is probably the most mature implementation i've seen (see this paper). they take advantage of scala language features to simplify the syntax and to unify synchronous and asynchronous message-passing. individual actors are represented either as threads or as more lightweight primitives that get scheduled to run on threads in a thread pool. it is type-safe, but it relies on convention to make sure you do not mutate your messages.

although i could see how representing actors as threads could be too heavyweight for some tasks, in the case of java and scala your mileage may vary (see this presentation from Paul Tyma).

groovy

given language features like closures and a generally simpler syntax, together with the fact that it sits on top of a JDK that includes java.util.concurrent, one would imagine groovy would be a perfect candidate for an actors implementation. however, the only thing i have found so far is groovy actors, and it seems to have been dormant for a while.

python

i do not know enough about python's memory model and its implementation, but i suspect it suffers from the same “feature” as ruby – the global interpreter lock – which means that it won't be able to scale to multiple CPUs (and, similar to ruby, jython, which builds on the JVM, comes to the rescue).

the only thing i've looked at so far is stackless python, a modified version of python that makes concurrency easier (see this tutorial by Grant Olson, which also includes actors). it introduces tasklets (aka fibers), channels, and a scheduler, among other things.

java

this is where i am a bit surprised – i do not see a good drop-in-a-jar-and-go actors library blessed and used by all. there seem to be some research projects out there, but i want something that works for me now and supports in-memory zero-copy message passing, sync/async messaging, and type safety. i am OK with abiding by conventions instead of the compiler checking things for me.

i suspect the reason is that some rudimentary form of actors can be implemented relatively easily using existing concurrency libraries, and this approach is intuitive enough without putting labels on it.
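
as a proof of that point, here is a bare-bones sketch of what such a do-it-yourself actor might look like on top of java.util.concurrent (the names are made up, and it cuts every possible corner – no supervision, no back-pressure):

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.atomic.AtomicBoolean;

    // a rudimentary actor multiplexed onto a shared pool: thousands of these can
    // share a handful of threads, and each processes at most one message at a time
    abstract class PooledActor<T> {
        private final ConcurrentLinkedQueue<T> mailbox = new ConcurrentLinkedQueue<T>();
        private final AtomicBoolean scheduled = new AtomicBoolean(false);
        private final ExecutorService pool;

        PooledActor(ExecutorService pool) { this.pool = pool; }

        protected abstract void receive(T message);

        public void send(T message) {
            mailbox.add(message);
            schedule();
        }

        private void schedule() {
            // only one drain task per actor is ever in flight
            if (scheduled.compareAndSet(false, true)) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            T msg;
                            while ((msg = mailbox.poll()) != null) {
                                receive(msg);
                            }
                        } finally {
                            scheduled.set(false);
                            // re-check in case a message sneaked in after the last poll
                            if (!mailbox.isEmpty()) {
                                schedule();
                            }
                        }
                    }
                });
            }
        }
    }

usage is just subclassing it, implementing receive(), and sharing one Executors.newFixedThreadPool(n) among as many actors as you like.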

nevertheless, this is what i found:

  • jetlang is a port of a .NET library and looks at Scala actors for inspiration. it is still quite beta, but it looks promising
  • kilim (from one of the principal engineers of weblogic server) still seems to be a bit too much of a research project for my taste, but the theory behind it is sound

and there are a number of other research projects out there.

bottom line

actors are a great abstraction, and a “good enough” version of them is easy to implement – think about it, consider it, use it!

it helps if your language/platform supports concurrency primitives to build upon. this includes true threading support that scales to many CPUs, although we could also benefit from a standard fibers implementation, since they are more lightweight than typical threads and would allow creation of a large number of actors that later could be mapped onto threads for execution.

each language could benefit from a well thought-out actors library, since it would push developers in the right direction.

it is not right for everything though – it might not be fine-grained enough, and it might not map well to problems that rely on the ordering of messages or on other state shared across multiple actors or multiple messages.

to be continued

what is on the horizon that is worth noting? what are some of the interesting research topics? what have we forgotten over the years? what other heuristics/patterns and libraries could be immediately useful?

concurrency: part 1

Posted by anton
on Friday, September 12, 2008

true to the purpose of this blog, below is an attempt to organize my (admittedly very superficial) experience with concurrency.

my 10GHz CPU

you probably noticed that moore's law no longer translates into higher CPU clock speeds. if it did, we would have had 10GHz CPUs by now, but for half a decade we haven't really moved past 3GHz.

that is to be expected for the current generation of hardware – the gains have to happen elsewhere. for a little while we'll get a performance boost from increases in the size and speed of caches, which improve locality, but in the long run it seems that multiple CPUs are where the improvements will be mined from (this also includes specialized CPUs like the Cell and GPUs in general).

this means that more and more people will have to think about their applications in terms of parallel processing. this also means that optimizations will become more and more important for those workloads that cannot be parallelized and therefore will be stuck on a single CPU (for a good introduction see The Free Lunch Is Over at Dr. Dobb’s Journal).

the bottom line is that as an app developer you cannot ignore the problem any longer; to make matters worse, there is no automagical solution on the horizon that would make your application take advantage of multiple processors.

my concurrency story

in the past decade most of the stuff i've worked with had some sort of coarse-grained parallelism; the rest was taken care of by the underlying framework.

i started with the unix philosophy of small programs connected via pipes, each performing a simple task. a little later came fork and signals. things were simple, and the OS took care of everything.

then came the web – it was mostly stateless, with the database doing all the heavy lifting when it came to shared state. we just added boxes if we needed to grow. in ETL a multi-box, multi-cpu setup was also natural, and the tools were designed to conceal concurrency; the same goes for integration, where concurrency was at the level of data flows, which made things rather simple.

it is only in the past year or so that i have had to dive deeper into relatively low-level concurrent development with java.

my dog-eared copy of Java Concurrency in Practice has proved to be quite an indispensable reference. the book is a bit uneven, and the editor should have spent more time on it, but you get used to it. it is a great practical resource, especially in the presence of so much confusing and incomplete information online.

jsr-166, introduced in java 5 (and the primary subject of the book), is such a productivity boost; being part of the JDK, it is a big step towards letting mere mortals like me really embrace concurrent programming.

i find myself using the Executors convenience methods all the time: it is so easy to create a pool and then just feed it Callable instances, getting a Future instance back as a handle. if more flexibility is needed, i use ThreadPoolExecutor. Queues are great as communication channels for any sort of producer/consumer scenario – anything that requires message-passing or any other kind of work hand-off. Atomics are also great – i do not have to think twice when implementing counters or other simple data structures.
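
for reference, the whole dance is only a few lines (a toy example):

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PoolExample {
        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // hand the pool a unit of work and keep the Future as a handle
            Future<Integer> answer = pool.submit(new Callable<Integer>() {
                public Integer call() {
                    return 6 * 7;   // stand-in for some expensive computation
                }
            });

            System.out.println(answer.get());   // blocks until the work is done
            pool.shutdown();
        }
    }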

most of the time i do not even have to work with threads or low-level synchronization primitives directly – they are buried deep within the libraries. i have fewer nightmares, since i do not have to touch volatile as often.

at some point i read both editions of doug lea's book, but i was always hesitant to recommend it; i'd rather rely on libraries that abstract all of this away. now that java.util.concurrent has been out for four years, and Java Concurrency in Practice has become a bestseller, there are no more excuses.

one thing i've learned though – when you think you've got this stuff, you discover a whole new class of problems that makes you realize how complicated all of this really is, and how truly difficult it is to write larger concurrent programs.

you really, really have to think hard about how you share your objects, how you compose them and operate on them. you need to really understand how the language and the runtime work (i find myself checking the JLS quite often). this is where good OO practices like encapsulation become even more important, since you are not just risking maintenance overhead, you are risking the correctness of your program.

now, i have always told myself that programming is not an exercise in manliness. i am just an app developer; i want to ship working code that solves customers' problems, not spend countless hours trying to reason through non-blocking algorithms just because i decided to do something non-trivial with ConcurrentHashMap. at the same time i do not want to waste my precious CPUs, so what am i to do? shouldn't this stuff be easier? is there something i am missing?

threads considered harmful

actually, there is no problem with threads per se; the problem is with shared state.

in a normal sequential program you only worry about the logic as it unfolds before you – one statement after another, in order. in a concurrent program that uses threads and shared state, on top of all your usual problems you also have the problem of non-deterministic state: at any point in time any thread can come in and mess with your data, even in the middle of operations you used to consider atomic (like counter++), so the number of states your program can be in suffers a combinatorial explosion. this makes it really hard to reason about correctness.
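
the canonical illustration is counter++, which is really a read, an increment, and a write; run it from several threads and updates get lost (a toy example, with the atomic version alongside for contrast):

    import java.util.concurrent.atomic.AtomicInteger;

    public class LostUpdates {
        static int unsafeCounter = 0;                              // plain field, no protection
        static AtomicInteger safeCounter = new AtomicInteger(0);   // atomic read-modify-write

        public static void main(String[] args) throws InterruptedException {
            Thread[] threads = new Thread[4];
            for (int i = 0; i < threads.length; i++) {
                threads[i] = new Thread(new Runnable() {
                    public void run() {
                        for (int j = 0; j < 100000; j++) {
                            unsafeCounter++;                // read, add, write - three steps, not one
                            safeCounter.incrementAndGet();  // a single atomic operation
                        }
                    }
                });
                threads[i].start();
            }
            for (Thread t : threads) t.join();

            // the unsafe counter will typically come up short of 400000
            System.out.println(unsafeCounter + " vs " + safeCounter.get());
        }
    }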

your code becomes brittle, sacrificing failure isolation – one misbehaving thread can potentially harm the whole runtime (a good analogy is BSOD caused by a device driver).

in addition, things don’t compose – a transfer operation performed by a thread-safe customer between two thread-safe accounts is not going to be automatically thread-safe.

to make matters worse, some of these errors remain hidden when the code runs on commodity 1-2 CPU IA32 hardware, but as the number of CPUs grows, or as their memory model becomes more relaxed, things start to break down.

for a more thorough discussion see The Problem With Threads by Edward A. Lee and Cliff Click's We Don't Know How To Program…

now what?

a natural reaction is to forget about fine-grained parallelism and offload the hard stuff onto someone else. after all, i am an app programmer, i care about business problems, what’s all of this yak shaving about?!

in some cases we can get away with firing up individual processes to take advantage of multiple CPUs. most of the time, though, it means that the problem has been pushed further down the stack, which often turns out to be the database. this is the route the rails folks went, and it certainly was a pragmatic approach at the time (now that they are forced to deal with efficiency, threading is back in the picture; for a discussion of the issues see Q/A: What Thread-safe Rails Means).

if you can get away with using individual processes, go for it (see google chrome) – you get failure isolation, you get isolation from other processes (it won't be as easy for another process to mess with your data), and as an additional benefit you get to use all the standard tools the OS has for managing and troubleshooting processes (as opposed to the often incomplete and idiosyncratic tools for thread management that your runtime platform of choice offers – if any).

still, as we need more and more fine-grained concurrency and as the level of concurrency increases (it is not just a handful of CPUs now, but dozens, and even hundreds), one process per task becomes too expensive (context switching, high costs of creating a new process, memory overhead, etc). so we are back to some sort of lightweight thread-like primitives running within the same process, sharing some common resources.

most of the popular languages/platforms these days provide some sort of threading and shared memory support. but as outlined above, they suffer from some fundamental problems. there are some practical things at various levels of abstraction that can help: low-level constructs within the language/platform itself, tooling, and higher-level libraries/mini-languages.

language

  • make immutability easier – take note of functional languages, but also make it practical. in java's case, for instance, it could mean extending immutability to some core data structures (see scala collections) or making it easier to tag an instance as immutable (see ruby's freeze; this reeks of boilerplate though) – this way errors will be caught at compile time (a hand-rolled java sketch follows this list)
  • consider sharing data only through explicit means, ideally checked at compile time. by default nothing is shared, and in order to make something shared you have to explicitly tag it as such. ideally, this would also come with some sort of namespace support, thus limiting mutations to a sandbox (see clojure for reference)
  • make the language safer to use when it comes to exposing shareable state (this is where something like static becomes a problem – see Shared Data Considered Harmful for an example that applies to concurrency)
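
for contrast, this is what hand-rolled immutability looks like in today's java – final fields, defensive copies, and nothing but convention and code review keeping it honest (a toy example):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // immutable by convention: final class, final fields, defensive copies on the way in,
    // an unmodifiable view on the way out - none of it enforced by the compiler
    public final class OrderPlaced {
        private final String orderId;
        private final List<String> items;

        public OrderPlaced(String orderId, List<String> items) {
            this.orderId = orderId;
            this.items = Collections.unmodifiableList(new ArrayList<String>(items));
        }

        public String getOrderId() { return orderId; }
        public List<String> getItems() { return items; }
    }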

tooling

  • static analysis tools might help, but we need to give them a bit more to work with than an infinite number of states. findbugs, for instance, supports concurrency annotations, and something like chord could also be promising. this stuff is complex, though, and there are limits to static analysis (and i do not even want to bring up formal proofs using process calculi)
  • i want more support from the platform to help me troubleshoot lock contention, deadlocks, cpu-intensive threads, and other concurrency-related issues. sun's hotspot has some rudimentary stuff in place, but i want more things out of the box (azul claims to have always-on built-in tools in their product, but i have not played with them)
  • speaking of azul, i need to study them more. although perceived as a boutique solution, they are addressing issues that everyone will be facing in just a few years. it seems they ported sun's hotspot to their hardware, which allowed them to scale by automatically replacing synchronization with optimistic concurrency, which scales much better. incidentally, this truism about optimistic concurrency has been obvious to database folks for decades

libraries/mini-languages

one approach is to focus on your problem domain and come up with a library/language that solves your particular problem and abstracts away concurrency. web frameworks (J2EE, rails), ETL tools, and even databases are all examples of this approach.

this is where my interest lies as an app developer – how can i make concurrent programming easier for me, the layman.

the bottom line is that if we insist on using low-level synchronization primitives, it would be really hard to paper over the underlying complexities. right now there is no generic universal approach that will simplify concurrent programming. so at this point a pragmatic programmer is left with patterns, supporting libraries, and heuristics.

to be continued

there are some patterns (for lack of a better word) that i have found helpful in dealing with concurrency; there is also some stuff on the horizon that promises all sorts of benefits – is there really a silver bullet? and there is plenty of stuff that has been with us for decades, and i would be the first to bow my head in shame, acknowledging my ignorance.

performance optimization: combating the evil

Posted by anton
on Monday, December 31, 2007

face the enemy

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. [1]

continuing the topic of my previous post (do not deploy black boxes; allow for measurement, metrics, monitoring), i offer you my two rules of performance optimization (heavily optimized for the enterprise environment [2]).

rule #1: don’t do it

i’ve seen more crimes committed against software in the name of performance optimization than for any other reason. [3]

don’t bother with performance optimization – deliver working software first.

instead of spending time salivating over sexy distributed caching algorithms and debating the merits of lock striping approaches, implement feature by feature, and most likely you will find out that performance is good enough.

a fitting quote from Refactoring:

The secret to fast software, in all but hard real-time contexts, is to write tunable software first and then to tune it for sufficient speed.

i know it is hard to resist the glimmering image of a performance superhero squeezing out a dramatic 100000x speed increase. we are all guilty of dreaming about it. face it – you are not writing hard real-time systems. you’ll just have to buckle up and stick to implementing business functionality without getting to play with those exciting computational problems.

finally, know your requirements upfront – what latency can you actually tolerate, does it even matter? what throughput do you need? make sure you have the actual numbers; both average and worst case scenarios. guess what, in many cases it turns out that you do not need to optimize to begin with.

after all, performance optimization is always about trade offs – leave your options open.

rule #2: don’t do it blindly

measure, then optimize. most of the software spends most of its time in just a fraction of the code – the “hot spot” – find it and optimize it away.

having a well-factored program leads to hotspots that are easier to isolate and optimize.

way too often i see people diving in and tweaking things left and right, just because they think they know where the problems are.

at best you will waste your time, but most likely you will actually make things worse.

if you live in the java platform world, you are in luck – there are so many tools out there for modern JVMs – use them!

notes

rules are meant to be broken, but i’d rather overreact upfront to discourage frivolous optimization.

yes, yes – you have to have basic knowledge so that you do not do stupid things all over the place and end up bleeding to death from a thousand cuts. luckily, in enterprise software most of these things are very basic – a few language rules and a few design rules – and a good software developer will follow them automatically.

why is performance optimization so alluring?

here’s my take on it – enterprise software is boring. you’ve done it a few times, and you do not feel like cranking out the same stuff over and over again.

so you start creating complexity to entertain yourself, to give your mind something to chew on: you fall in love with design patterns, you build beautiful multi-tier distributed designs with transactional semantics all over, and you fiddle with performance optimization on every step.

most of us have gone through it; it is like a childhood disease that you suffer from in order to become immune. most of us survived and gained valuable insight in the process. it does take a bit of self-reflection and experience to realize this though.

(those that did not survive ascended to the stratosphere and became raving zombies – and we all know what to do with zombies).

i think the key is to understand that although enterprise software is boring, the vast majority of software projects still fail. this is where the true complexity is – figuring out how to deliver working software that customers actually use. not to mention doing it on time, on budget, and without burning out your team.

this is much harder and at first sight a lot less sexy, but there are still plenty of technical challenges to work through in order to create well-engineered systems. it is just your definition of “well-engineered” that has to change.

once you re-adjust your focus, the work is cut out for you.

it is tempting to pit “enterprise” software against the opposite swing of the pendulum championed by the pragmatic programmers, rails, and the whole community around them. and of course, it is not the technology but the mindset.

[1] donald knuth paraphrasing hoare

[2] with apologies to m. a. jackson

[3] with apologies to w. a. wulf

quote on scaling up vs. scaling out

Posted by anton
on Monday, October 01, 2007

Only focusing on 50X just gives you faster Elephants, not the revolutionary new breeds of animals that can serve us better.

(by werner vogels)

for fun and profit

Posted by anton
on Thursday, September 27, 2007

if you enjoyed everyone’s favorite upside-down-ternet way of making new friends, this whimsical bit is right up your alley.

it is based on a cross-site request forgery (CSRF) attack.

briefly, these are attacks that trick you into submitting a potentially damaging request to an application you are logged in to. so if you receive an email with a link to http://www.google.com/setprefs?hl=ga and click it, it will set your google language preference to irish.

thus you could try to impress those inquisitive souls looking for things on your site with the following apache config directive:

RedirectMatch \.(php|phtml|phps|php3)$ http://www.google.com/setprefs?hl=xx-klingon

therefore any request to a booby-trapped url on your site (in this case anything that ends in php) would set their google search language to klingon.

(stolen from here)

of course, it does not have to be an explicit server-side redirect – similar behavior can be triggered with javascript, iframes, etc.

how do you protect against it? the app has to use unique tokens in the forms presented to the user (or one can start lugging around those encrypted URLs again – anyone remember IBM’s Net.Commerce?)
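
a minimal sketch of the token approach for a java servlet app (the names are made up; most web frameworks provide this out of the box):

    import java.math.BigInteger;
    import java.security.SecureRandom;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    // a bare-bones version of the per-session token check
    public class CsrfGuard {
        private static final SecureRandom random = new SecureRandom();

        // call when rendering the form; embed the token as a hidden field
        public static String issueToken(HttpSession session) {
            String token = new BigInteger(130, random).toString(32);
            session.setAttribute("csrf.token", token);
            return token;
        }

        // call before acting on the submitted form
        public static boolean isValid(HttpServletRequest request) {
            String expected = (String) request.getSession().getAttribute("csrf.token");
            String actual = request.getParameter("csrf_token");
            return expected != null && expected.equals(actual);
        }
    }

the token from issueToken() goes into a hidden form field; a forged cross-site request will not know it, so isValid() rejects it.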

since i am (somewhat reluctantly and half-asleep) reading gibson’s latest, and since these days i mostly appreciate him for sensing the zeitgeist and popularizing new art forms, i cannot shake off the feeling that there is an art piece lurking in here.

mopping up

Posted by anton
on Friday, August 24, 2007

this is a rant, inspired by working in both developer and admin roles over the years (i strongly believe in “eating your own dogfood” when it comes to building and running the apps, but this is a whole different topic).

my experience is that given a choice of manageability/logging/monitoring vs. extra performance, i will always choose the former. the amount of time spent troubleshooting performance and stability issues on a live application in production trumps any hardware (and sometimes even development) costs.

so instead of satisfying your inner ricer and deploying a highly-performing black box hotrod, spend the time to put the probes in, make it declaratively manageable; if your OS/hardware provides any sort of isolation and partitioning – consider it; take advantage of existing platforms and tools.

take Tibco BusinessWorks, for instance. besides having their own suite of monitoring/management tools, they allow (perhaps serendipitously) individual “worker engines” to be deployed in separate JVMs (which could be on different machines), so you can not only analyze and manage them using the existing ecosystem of Java tools, but also fall back on your regular OS tools – per-user, per-process, per-box.

the benefit of this simplicity becomes obvious once you have worked with apps that insist on packing everything into one JVM – worker engines, daemon-like processes, queuing, etc, etc. management and tuning become a nightmare; on the flip side, it is guaranteed job security and high salaries.

so what can a developer do? besides the obvious, consider an api to talk to your application and tweak it as it is running (look at those ol’ smalltalk dudes), or better yet – a command-line scriptable console that exposes your app’s domain. props to the bea folks here – their flagship server product has for years had a python-based (jython, to be exact) console that allows one to connect to a running cluster and make changes to it on the fly. similar functionality is provided by the rails stack, although technically you only get connectivity to the database, not the actual running application instance. still, it is a big step.

another tip of the hat in bea’s direction – their JVM has had actually usable manageability tools for years; sun was late, and even when they started delivering them, the tools were really clunky (i am still waiting for something from sun similar to bea’s jrcmd tool, which lets me do simple things like collecting thread dumps from a jvm on any platform, including windows, and redirecting them to a given file, since the jvm might be running with stdout sent to /dev/null). bea’s mission control has been around for a while in various forms – i want to be able to attach to my production JVM and look at a GUI representation of memory distribution, object counts, stack traces, and heap info; on top of that, it gives me the ability to explore and act upon exposed JMX beans from both the JVM and the app, and to set up triggers and alerts that start recordings, take memory dumps, send emails, etc. this becomes indispensable, especially for hand-me-down apps or third-party software.
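
exposing your own knobs this way is cheap; here is a minimal sketch of a standard MBean (the names are made up) that jconsole or mission control can then read and tweak on a live JVM:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // the standard MBean pattern: an interface named <Class>MBean...
    interface FeedProcessorMBean {
        int getQueueDepth();
        int getBatchSize();
        void setBatchSize(int batchSize);
    }

    // ...and an implementation registered with the platform MBean server
    public class FeedProcessor implements FeedProcessorMBean {
        private volatile int batchSize = 100;

        public int getQueueDepth() { return 0; /* look up the real queue here */ }
        public int getBatchSize() { return batchSize; }
        public void setBatchSize(int batchSize) { this.batchSize = batchSize; }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(new FeedProcessor(),
                    new ObjectName("myapp:type=FeedProcessor"));
            Thread.sleep(Long.MAX_VALUE);   // keep the app alive so you can attach to it
        }
    }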

this is actually a big change in mentality – gradually people are realizing that they should be able to monitor stuff in production, live, as it is running. hence we have things like the (underappreciated) dtrace, and more and more investment in platforms that support that sort of lightweight runtime dynamic analysis. these days apps are expected to be on 24/7, and the ability to dynamically redeploy, reconfigure, and analyze things is crucial.

finally, i have seen way too many folks who consciously refuse to learn how their code runs – the minimum about the OS, the network, the tools, the tuning. i am willing to live with that, as long as they have people around who do know. sadly, too often this responsibility gets shifted to OS admins who could not care less.

all sorts of disclaimers apply – in many cases the apps are so small that one can pile them together and forget about them. the apps that will benefit most from the manageability stuff mentioned above are the ones that churn through a lot of data and have pretty strict uptime/latency requirements. in addition, it is assumed that there are a lot of people working on them, so tools and approaches should be somewhat uniform.

DSL for Integration

Posted by anton
on Monday, August 20, 2007

this is somewhat of a wide-eyed rant, but i have wanted to get it off my chest for a while.

i have worked with the Tibco BusinessWorks suite of tools quite a bit in the past few years, and i have come to really appreciate it. for those not familiar with it – it is a GUI-based drag-and-drop frontend that lets one use standard components to quickly build integration scenarios – e.g. get data from source x, transform it, then shove it into destination b.

for the longest time this sort of GUI tool was anathema to me – i learned over time that there is no silver bullet, abstractions leak, and for “general” software development these tools did not succeed in addressing complexity.

at the same time Tibco BusinessWorks was remarkably successful – one could knock out an integration scenario in under an hour and deploy it in full enterprise glory – high availability, load balancing, monitoring, etc, etc.

i think one way to explain that is to talk in terms of brooks’ “silver bullet” essay – the winning approach addresses both accidental complexity (very good tools that make a developer more productive) and essential complexity (focusing on a very narrow problem domain). that is besides plain good engineering, of course.

while accidental complexity is a subject for another post, it is interesting to note how essential complexity was addressed by focusing on the problem domain of integration.

generic “embrace and solve the world” tools have an unmanageable problem domain, and it is impossible to get them right for everyone.

in this particular case it comes down to being able to express your problem domain, define it in terms of higher-level abstractions, and then allowing those to leak gracefully as needed.

Tibco BusinessWorks excels at integration, and the domain is very simple – read the data, transform it, and load it elsewhere. as long as you keep business logic to the minimum and use the tool for what it’s good for – it shines.

at this point the GUI is almost nothing to be ashamed of – it is simply a representation of the abstract syntax tree (AST) of your program; it is, in a sense, your integration language represented through GUI abstractions. in theory one could write a domain-specific textual language to work off the same AST, and it would be yet another representation of the same thing.

Fowler’s article on DSLs, which i mentioned before on this blog, was very much responsible for this redeeming outlook on GUI tools.

here’s a good example – take a look at rails. it is a DSL for a well-defined problem domain of small web applications where you build the db and the app from scratch. if needed, it leaks abstractions gracefully, falling back on the power of ruby and metaprogramming. although it can be pushed beyond its intended domain, its strength is in its deliberate limitations.

the emergence of the next big language any time soon is perhaps just a utopia. instead it looks like the next step is a whole bunch of languages on top of a few existing platforms. these “smaller” languages will become more and more domain-specific, getting us closer to the promised bliss of intentional programming. their rapid adoption will build upon the strengths of a few existing platforms.

another related term that has a nice ring to it is neal’s polyglot programming.

it is really exciting to see all the stuff happening in .net and jvm camps as they port dynamic languages to their platforms. one of the things i am really looking forward to is all the existing “enterprise” stuff being augmented, glued, and morphed together using these smaller, expressive languages resulting in more “living” adaptable systems. hopefully, this will also lead to a culture shift (and not just in the form of apple laptops and steadily increasing enumerators prefixed with “web”).

jive software clearspace

Posted by anton
on Tuesday, January 30, 2007

jive software "clearspace" finally went beta, and is available for download.

first i came into contact with these guys through their IM server product. for an open-source tool it was surprisingly polished, had all the "enterprise" features i needed (flexible AD integration with the possibility of using multiple auth providers, TLS, gateways to other IM systems), and was self-contained for an easy trial; it was stable (running on UX for almost a year without restarts) and scaled just fine for hundreds of users (after a gratifyingly viral adoption). there was also that intangible feeling that everything was just done right - the directory structure made sense, startup worked just as expected, it was built on familiar components - over time i came to trust these guys as competent developers who would not go astray into some pseudo-academic delusion, or succumb to ADD and play with some new tech of the day (snipsnap, anyone?).

i have kept an eye on their blog ever since the initial announcement of the "clearspace" product. it seems to address exactly the stuff i have been talking about for years - no stunning revelations, just a simple, clean integration of IM/blogs/wiki/document management/forums/email in an enterprise-friendly format that puts them all in context. this is not a slapped-together CMS monster or a behemoth like sharepoint, but a collaboration tool that pulls together sensible implementations of all these existing forms of communication. this is something i talked about in an old school paper of mine; i also briefly talked about it here and here.

we'll see if the product actually does all this stuff, but i would definitely dedicate some time to playing with it.

why wiki, continued

Posted by anton
on Sunday, December 17, 2006

our team has been running on trac for the past four months, and the results so far have been very encouraging. we organize things under coarse-grained categories, and then use tags to organize content further. tags really help out a lot - i have gotten used to CLI-like searching (/trac/tags/tag1+tag2) and to their indexing capability (the ListTagged() macro).

a few things learned:

  • wikis work best for small homogeneous groups, and even then only a few "expert" people contribute; sadly, it is not the whole group that enthusiastically uses the wiki as the group's collective knowledge base and communication device (perhaps that is the nature of our environment, though)
  • it works really well as a metadata "glue" when it pulls together different sources in one context (a few links to documents in sharepoint, a few links to some internal systems, and some text to describe it all). i would really like to extend this further and start consuming stuff from other apps (syndicated feeds, pulled-in reports, etc.) - for instance, every morning create an entry for the past night's batch run issues from the report we currently have, so that the person on call can start annotating it as issues are worked on
  • i really need an auto-save feature (gmail and the like). since i am trigger-happy on the keyboard, i've lost posts a number of times. it should be trivial to implement. on a side note, i thought i would hate the new spelling support in firefox, but i find it incredibly beneficial
  • tags are great, but the consistency of tag corpus is an issue; self-imposed rules help a bit (nouns, no plurals, lowercase, etc), but a del.icio.us-like drop-down of suggestions as you type would be very helpful
  • i haven't really needed to search through attachments yet using trac, but then we still store most of our binary docs in sharepoint. the funny thing is that people save bigass ms office documents in sharepoint with revision tracking turned on (40M documents are not uncommon), which forced sharepoint admins to turn off versioning across the board, defeating one of the main benefits of sharepoint (apparently our version did not use binary diffs, saving full content every time). on top of that search within documents in our version of sharepoint is pretty much useless anyway. so considering all this i am thinking of just asking people to map a branch of our svn repo as a drive and save documents there; although it might result in a lot of commits, it would be versioned and in addition mapped into our website's namespace
  • need for templates - as we start to store more structured content like technical specifications or high-level interface descriptions that have certain required fields
  • and the final wish, or more of a pipedream - smarter markup that has semantic value that could be harvested/searched/aggregated (something along the lines of a yet-to-be-realized promise of xml-based backend of ms office) with support for intellisense-like autocompletion. as mentioned above, as we start storing interface descriptions that have certain common fields (source system, target system, integration technology, group that owns it, canonical data format used, etc), i want to be able to run queries like "show me all interfaces owned by this group", or "show me all interfaces that use this integration technology", or "show me all interfaces that feed this system". then i want to save these queries and make them dynamic, so essentially they become different views into the data. sharepoint currently supports it with excel-like functionality and views, but the content is strictly tabular. what i want is the ability to use one of these domain-specific markup microformats as i am writing my wiki entry. i can hackily mimic this to an extent with "typed" tags (i.e. interface/source/systema, interface/technology/toolb), but it just feels way too flimsy. jon udell's continuous laments on this subject were very inspiring.

why wiki

Posted by anton
on Monday, September 04, 2006

Over the years I have used various Wiki packages at work and for personal use/freelance. Currently I am using Trac for the freelance stuff, and at work I have tried a number of things, from JSPWiki and Daisy to Confluence. Currently I have been using SnipSnap at work for almost three years, and recently I have installed and configured Trac for our team.

I believe that the biggest benefit of the Wiki is its grass-roots nature, especially when it is used by a small team on a regular basis. I doubt it will ever grow to be "enterprise" in our company for various reasons (even with something like Confluence), but it is indispensable for department- and team-sized work.

Below is a small blog-like post justifying the Wiki for our team.

why wiki

wiki is the simplest possible content management system, collaboratively edited. it works best for creating and evolving documentation. it is not a document storage, but a website where each page is a document.

personally, as i work (do research, jot down ideas, document something), i take notes and evolve these notes using the wiki. i continuously refactor and connect these ideas, thus building content.

most people still do it on their own machines - they edit documents locally, and then share them with others through email attachments. everyone knows how flawed this is - there are numerous versions floating around, no one knows which one is the latest, keeping track of edits is a nightmare if you have more than two people, finding these documents is hard, etc.

the next logical step (which sharepoint took) is to store these documents in a central place. this seems like a good solution on the surface, but it is terribly inconvenient: the documents are opaque, and since they are not cross-linked together they lack context; in order to view them one must launch an office application, removing the context even further.

with wikis the approach is simpler - my documents are web pages; i edit them in place with simple syntax; i cross-link them easily, thus creating context; they are immediately published and available for others to see and edit. in other words, i build on the foundation that made internet happen using tools that make publishing trivial.

  • immediate tangible benefits are live documents - when i pass around the link to the wiki article, it is often self-describing, pointing to the latest version of the article, so i do not have to worry about people looking at the obsolete version of some word document in some email. as an example, when someone asks me a question, instead of putting the answer into an email, i put it in the wiki, then send them a link to it
  • i can easily see the difference between documents, since they are just text
  • it allows multiple people to easily edit documents simultaneously
  • it also allows me to easily build context - cross-link documents and, in this particular tool (Trac), supply them with tags

the idea is to keep content as open as possible, and its editing as simple as possible, lowering the barrier for entry for participants that can help evolve the content.

to summarize, i will quote ray ozzie once again (from his acm interview on social computing):
I think one of the big promises of all of this technology is to make it easy for people to leave trails of artifacts that can be used later when you don't really expect it.

trac

Trac is a Wiki engine tightly integrated with a source control browser and a ticketing/issue system, which together provide a very lightweight (in terms of process) and flexible software project management system. Its main benefit is the ability to put all software project artifacts in context, tying together the source code view, changesets, tickets/issues, and documentation.

A Trac instance is best scoped per project, which somewhat limits its use. With other wiki engines like Confluence, this is not an issue.

Trac's architecture allows for a multitude of extensions; it has a vibrant user community, and it has been around for quite a while with a number of high-profile deployments. It is free.

why not sharepoint

sharepoint is a content management system (CMS). for the simple "live" documentation described above, sharepoint is overkill - all i want is the simplest possible way to create ad-hoc documents, version them, and pass around simple self-describing references to them (not gargantuan sharepoint urls). i want a simple and easy way to build documentation (or artifacts, in a more general sense) as i work.

in the case of documentation, sharepoint is somewhat of an inversion of the problem. instead of keeping my existing word documents, what i want to do is create simple documents in place, evolve them, and version them as i work. thus, when i browse the site, i want to see content, not opaque document containers. this approach bypasses the office suite, and i can see why microsoft has to be careful not to kill its cash cow by introducing wiki features into sharepoint.

if my needs go beyond simple documentation, sharepoint (especially with infopath) is a tool worth looking into. it makes it very easy to create applications on top of documents, provide workflow, complex user interfaces, etc without any need for coding.

the question really becomes how far you can go with the wiki before you need all the features of sharepoint (or another full-blown CMS). the answer is that for most people a wiki would be perfectly sufficient; start with the wiki and grow into a CMS.

bloglines/competition

Posted by anton
on Thursday, October 13, 2005

as expected, bloglines is stepping up to the competition - now they have keyboard shortcuts and show the number of "keep new" items in the feed headers

of course, that is hardly useful for me, since mouse wheel and pg up/down do the job just fine

brightcove / jeremy allaire

Posted by anton
on Saturday, October 08, 2005
brightcove is the new company of jeremy allaire - co-founder of allaire (the company behind cold fusion), which was later gobbled up by macromedia/adobe. brightcove does internet telly.

reader.google.com

Posted by anton
on Saturday, October 08, 2005

still a little buggy (imports, navigation, dealing with large subscriptions), but full of ajaxian goodness

  • has labels (no more bloglines cludgy hierarchies)
  • search (will work at some point i hope)
  • stars to mark interesting content
  • filters
  • quick sorts by relevance/date
  • no mousewheel scrolling (keyboard paging is a bit unnatural); i find myself trying to take the list of feeds and drag it
  • shortcuts (they mimic vi - what a nice twist)
  • it would be nice if i could point to the opml link, as opposed to saving it to a file first and then importing it
  • i miss the social aspect of bloglines - the popularity of the feed by readers and the actual list of readers that subscribe to the feed (not to mention recommendations)
  • when google feedfetcher accesses the feed, it does not show the number of subscriptions in the user-agent string like bloglines does

in the end it will take a while to get used to it; there is a barrier to entry in the way everything is organized - it is a step up from the simplicity of bloglines and others, which had a simple model of entries under individual feeds that could in turn be grouped. most of my feeds are tied to personalities - it is not a uniform sea of posts. on the other hand, there are some long-overdue features that bloglines lacks.

the overall experience at this point could be nicely summarized by "clunky" - the interface and the responsiveness. i have no desire to switch from bloglines at least for a while. hopefully this will spark some healthy competition.


web 2.0 / communities

Posted by anton
on Friday, October 07, 2005

an acquaintance i used to share music with a few years ago recently asked me to recommend something along the lines of gridlock and bad sector, which he really enjoyed back in the day.

anyone who knows these bands would understand why it was not an easy question to answer - the genres represented by the two are too broad; there are hundreds of names that could fit.

i briefly considered rattling off a dozen or so names off the top of my head, but then i remembered that we live in the time of web 2.0 (literally, as the conference is taking place right now), and we have last.fm (formerly known as audioscrobbler).

i can attest to the fact that the resulting "listening clouds" are a good place to start exploring similar names (here and here).

powered by

Posted by anton
on Monday, October 03, 2005

perhaps it is my vanity to blame for having chosen to run the blog software myself; of course this is an opportunity to waste an enormous amount of time tweaking the blog engine and themes, instituting a backup schedule, installing updates, fighting bugs - precisely what I have been doing with typo.

I've chosen typo over wordpress and mt, since I wanted to play with rails (and what blog-aware netizen is not guilty of drinking the kool-aid these days?). obviously it is still very much beta software - I've spent a few hours trying to install it under www.domainname.com/blog, finally throwing in the towel and resorting to a blog.domainname.com setup; the sidebar admin interface is buggy, not remembering values or populating them with defaults like "feed" and "null"; the same goes for the comment posting interface under IE; not to mention that it is a crazy resource pig - fcgi and all, it takes a few seconds to generate a non-cached page (I bet it has something to do with restarting processes and cgi and rails startup slowness in general - but I have not looked at it in detail yet). oh, and despite their oh-so-smug note, do not use trunk - a flurry of migration commits this weekend was a good lesson.

so now it is just a matter of time - either the newness wears off, pragmatism takes over and I switch to a proven platform, or typo matures enough to be usable.

it is a cute toy, buzzword-compliant, and still shiny and new enough to justify spending the time playing with it. I should try to track down a bug or two, to make up for all the bile in the paragraphs above.

i have yet to find out what kind of syntax it supports for posts (it had better not be a subset of straight html), whether there are useful wiki-like macros, how to upload images, etc.