Tuesday, January 3, 2012

Drug research by computer simulation

It's quite clear to me that a great deal of medical innovation can come from computer simulation. If a computer can fully simulate a medium composed of human cells, water, nutrients and other molecules, and also offenders such as viruses and bacteria, we should be able to test new medicines and treatments in a much faster, safer and fairer way than current methods allow. There would be less need for testing on animals and human volunteers, because computer simulations can quickly filter out many faulty alternatives. Hence, the success rate of the initial real-world tests would be greatly improved.

The computer required to do this must be very powerful even by current standards. It needs to simulate molecular or perhaps atomic interactions in three or "four" space dimensions, plus time, taking into account pressure, temperature and momentum. A human cell should contain some 100 trillion atoms. If each atom can be modeled with about 1024 bytes of data (1 KB), then a full human cell would require about 10^17 bytes, or roughly 90 petabytes in binary units, to be simulated. While this sounds prohibitively huge, there are predictions saying that personal devices will have this capacity by 2020. So why can't we start developing the systems and algorithms right now, so that when the metal is ready, we just load the hacks and run them?
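The storage estimate above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: both inputs are the rough figures from the paragraph, and the variable names are made up for illustration.

```python
# Back-of-the-envelope storage estimate for simulating one human cell.
# Both inputs are the post's own rough guesses, not measured values.
ATOMS_PER_CELL = 100 * 10**12   # "some 100 trillion atoms"
BYTES_PER_ATOM = 1024           # "about 1024 bytes" of state per atom

total_bytes = ATOMS_PER_CELL * BYTES_PER_ATOM

PIB = 2**50    # binary petabyte (pebibyte)
PB = 10**15    # decimal petabyte

size_pib = total_bytes / PIB
size_pb = total_bytes / PB

print(f"{total_bytes:.3e} bytes")
print(f"{size_pib:.1f} PiB (binary units)")
print(f"{size_pb:.1f} PB (decimal units)")
```

The figure of about 90 petabytes corresponds to binary units; in decimal units the same amount is a little over 100 petabytes.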

Monday, May 30, 2011

The wiki way of making music

Tell your garage friends that your band is back! And you don't even need to quit your job to make music. That's what the folks at web startup Soumix are saying. The concept is somewhat old: each musician records his/her own track alone, optionally listening to previous tracks from friends while recording. Soumix then mixes all the tracks to form a complete song. The innovation is in simplicity and practicality: everything is managed by the Soumix site, with zero installation and no file management. Before Soumix, each person had to install mixing software, store tracks in files and send/receive files to and from friends. Soumix makes sharing tracks with friends or other people very easy, so strangers can collaborate to form a complete orchestra! You can start recording, sharing and mixing in just a few seconds!

So far, the site is targeted at a Brazilian audience. But there's already an English translation, so expect more buzz about this in the future.

Tuesday, May 3, 2011

It's not just reuse

In software development, there are two kinds of reuse: reference-reuse and fork-reuse.

Reference-reuse is what we usually mean by plain "reuse": if we need some functionality in our software, we look for existing software that provides that feature and is easy to link (works on the same platform, etc).

Fork-reuse is a different kind of reuse: we make a copy of existing code. The copy will have its own life, independent from the original code. Some people argue that fork-reuse is not reuse at all, because "come on, you are duplicating code!". But if we look at it from a business perspective, we conclude that by copying and adapting, we are using an existing thing to avoid an entirely new development. So fork-reuse is definitely a form of reuse.

As software engineers, we are compelled to avoid fork-reuse, because bug fixes and enhancements to the original code need to be manually replicated to the forked code. So fork-reuse is usually worse than reference-reuse, right?

Wrong, wrong, wrong! Fork-reuse is not necessarily worse. In fact, there are many scenarios where fork-reuse can be much more powerful and time-saving than reference-reuse. The benefits of fork-reuse are usually not well explored in classes and development books. So let's enumerate them.

1. Isolation (low impact of changes).

When a component is fork-reused (copied) N times to become part of other components, a change in the original will not affect the others and vice-versa. Now if a component is reference-reused by N others, then every change to it might break N components. This forces some quality procedures such as beta releases and integration tests. The more a component is reference-reused, the greater the probability that a beta version will turn up a bug, and the higher the cost of integration tests. Fork-reuse also reduces the number of releases, deployments and upgrade procedures, so the impact is further reduced.
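As a toy illustration of this isolation, the sketch below uses object state as a stand-in for source code: two clients share one component by reference and a third works on its own copy. The Slugifier class and the client names are hypothetical, invented only for this example.

```python
import copy


class Slugifier:
    """Hypothetical reusable component: turns a title into a URL slug."""

    def __init__(self, separator="-"):
        self.separator = separator

    def slug(self, title):
        return self.separator.join(title.lower().split())


original = Slugifier()

# Reference-reuse: both clients point at the very same component.
blog_ref = original
shop_ref = original

# Fork-reuse: this client takes a private copy with its own life.
wiki_fork = copy.deepcopy(original)

before = (blog_ref.slug("Hello World"), wiki_fork.slug("Hello World"))

# An "upstream" change to the original component.
original.separator = "_"

# Every referencing client sees the change at once...
after_ref = (blog_ref.slug("Hello World"), shop_ref.slug("Hello World"))
# ...while the fork keeps behaving as it did before.
after_fork = wiki_fork.slug("Hello World")

print(before, after_ref, after_fork)
```

The same holds in the other direction: changing wiki_fork would never affect the two referencing clients.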

2. Easy deployment and version management.

Because fork-reused software usually has its sources in the same project as the software that uses it, there are fewer external libraries, and thus packaging becomes easy. Even if there are external libraries, it's preferable to put them in a local directory rather than a shared one, to avoid the so-called "DLL hell" problem, which has more recently been overshadowed by the "multiple JAR versions" problem of the Java platform. All these problems are caused by reference-reuse. OSGi is a popular approach that solves this problem, but it also increases the burden of packaging.

3. Correctness.

Why do new bugs appear? In a broad sense, because software changes. Fork-reuse reduces the amount of change caused by upgrades to the reused component, and thus reduces the probability that new bugs will appear. Of course, an organization can run tests on every upgrade, which increases development costs as explained under Isolation. But no matter how much you test, commercial software is rarely released in a bug-free state. Quality teams don't assert that a release has no bugs - instead they say the quality is acceptable. As time passes, users discover new features and use the software in different ways, and eventually they find bugs. Some of those bugs are caused by upgrades of a reference-reused component, and not infrequently such a component has been rewritten - a scenario where new bugs tend to appear!

4. Robustness.

This is more about architecture, or more specifically, about threads and processes. A parallel system can divide its work into processes or threads. Processes do not share their state, so they are doing fork-reuse (each modifies a local copy of the initial state). Threads do the opposite: they share caches and connections, and often are themselves shared in a worker pool. Guess which architecture is more robust? Of course, the one that divides work into processes. If a process crashes, the operating system ensures that it will die alone and the others will keep working without even knowing. Usually there will be a controller subsystem that spawns another process when needed. Now if a shared thread crashes or a shared state becomes corrupt, the whole node (with all other threads and states) will either crash together or the system will degrade as a whole.
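The process side of this argument can be sketched with nothing but the Python standard library. In this minimal example, which is an illustration and not a real worker pool, a controller runs each job in its own operating system process. The worker script, the job numbers and the choice of job 2 as the one that crashes are all assumptions made up for the demonstration, and the jobs run one after another to keep it short.

```python
import subprocess
import sys

# Hypothetical worker script: job 2 simulates a hard crash.
WORKER = """
import os, sys
job = int(sys.argv[1])
if job == 2:
    os._exit(1)   # die abruptly, as a crashing process would
print(job * job)
"""


def run_job(job):
    """Run one job in its own OS process; return its result, or None on crash."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, str(job)],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return None   # the worker died alone; the controller is unaffected
    return int(proc.stdout.strip())


# The controller keeps going after the crash and finishes the other jobs.
results = {job: run_job(job) for job in range(4)}
print(results)
```

A real controller would typically spawn a replacement process for the failed job instead of just recording the failure.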

5. Flexibility.

Changing a thing that belongs to oneself can be done quickly and with no bureaucracy. When a thing is privately used, it can be adapted or even radically changed to support particular needs. On the other hand, if someone needs a new feature or a special bugfix on a shared component (and at the same time wants to keep it shared), he/she must file a request, hope it will be accepted, wait until the task gets done and then wait for the release. This often takes so much time that the project team chooses to accept the limitation or build a workaround. The client product is released and the request becomes less important, and might never get done. If it gets done after the project, there will be an upgrade whose costs were discussed before. Interestingly enough, the upgrade effort might not even touch the workaround! In some big corporate environments, it's not uncommon for an IT manager to force an upgrade of an in-house developed component, just to keep reference-reuse and get some advertised SOA benefit. Result: the change must be reviewed and approved by the governance committee, and then development and deployment must be done according to schedule and policies, and all this involves a lot of talking and more people than usual. The effect is slower delivery and less robustness, quite contrary to the alleged SOA benefits!

6. Maintainability.

Fork-reused components can become very simple if we strip all the useless things during the fork. The code becomes small, clear, direct and obvious. For example, it's often possible to replace callbacks with direct calls, to add domain-specific variables for some control flow, and to add system-specific logging and exception handling. Reference-reused components are usually full of features, which are exposed as properties, APIs and other kinds of protocol. That's necessary because a reference-reused component is not expected to do exactly the same thing in every place it's linked. For example, text input widgets often have a property that tells whether the field is a password, so that when the user types something, every character is displayed as an asterisk. Each property or API defines new states and behaviors that must be handled accordingly by the component. The source becomes full of flow control tests or has its logic spread across many linked classes and virtual methods that implement some design patterns. Testing all combinations of properties and scenarios becomes virtually impossible. Result: the component becomes difficult to understand and to evolve, and fixing a bug quite commonly creates another one that hopefully will be caught by automated tests so it can also be fixed before release. In the worst case of property bloat, a component becomes so parametrized that it can provide fully distinct behaviors! This happens when developers try to force an unnatural reuse, or when the component becomes so big that no one quite understands its fundamental meaning.
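The password widget example can be sketched as follows. This is a deliberately tiny illustration and not taken from any real toolkit: the TextInput class, its max_length and uppercase properties, and the display_pin function are all hypothetical. Only the password property comes from the example above.

```python
class TextInput:
    """Generic, reference-reused widget: behavior is selected by properties."""

    def __init__(self, password=False, max_length=None, uppercase=False):
        self.password = password
        self.max_length = max_length
        self.uppercase = uppercase

    def display(self, typed):
        # Each property adds a branch that every caller pays for.
        text = typed
        if self.max_length is not None:
            text = text[: self.max_length]
        if self.uppercase:
            text = text.upper()
        if self.password:
            text = "*" * len(text)
        return text


def display_pin(typed):
    """Forked and stripped: this application only needs a masked 4-character PIN."""
    return "*" * min(len(typed), 4)


generic = TextInput(password=True, max_length=4)
print(generic.display("123456"))   # generic widget, configured for the PIN case
print(display_pin("123456"))       # stripped fork, no configuration at all
```

With only three properties the generic widget already has several combinations to test; each additional boolean property doubles that number, while the stripped fork has exactly one behavior.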

7. Performance.

Because the team is free to change a local fork of some component, they can also make it more specific and better suited to the target application. By removing indirections and useless transformations and decisions, the reused component becomes more optimized for the task. The simplicity and direct calls also enable more automatic platform optimizations, further increasing performance. Of course, you only get performance if you invest in some refactoring tasks. But if you use a language such as Java and an IDE such as Eclipse or IntelliJ, refactorings can be applied very quickly.

8. Knowledge.

Fork-reuse may require developers to dig into the reused code. This is an opportunity for the team to learn new APIs, design features, code styles and sometimes even new algorithms. This learning further enhances development productivity and makes the team more adaptable and open to better options. It's quite the contrary of what happens to teams that only do reference-reuse or build-from-scratch: these become confined to a few limited perspectives. Of course, if you must reuse spaghetti code, dealing with it will be painful. But beware that there are many programming styles, and what is spaghetti for some developers can be well-structured and organized for others. This often happens when a developer accustomed to some paradigm (like event handling or MVC) needs to read code that uses another paradigm (like functional programming or inheritance-based extensibility).

Wait, we are adults!

After reading all this, you might think that I'm against reference-reuse. If you do, sorry, but you missed my point. I'm neither against nor in favor of either of them. As software engineers, we must seek balance in every choice. I wrote this because there is hype in many engineering circles around reference-reuse, caused by concepts such as object orientation, SOA, DRY, design patterns and some insane attempts to reduce software development cost. There is a lot of prejudice against fork-reuse, and in some places it's barely considered, while all the problems of reference-reuse seem to be ignored!

However, people and companies are evolving, and sometimes a smart group can pull the majority to a better state of knowledge. Apple, for instance, forked KHTML to create WebKit. Chrome in turn started as a fork of WebKit, but after some versions they merged (perhaps it was better for WebKit than for Chrome - who knows?). LibreOffice is a fork of OpenOffice that showed great improvements already in its early releases. These are examples of big software being fork-reused. There are also examples of smaller software, like the Jetty Server (one of the best servlet engines in my opinion), which was fork-reused by Eclipse.

Of course, fork-reuse also has its problems, and reference-reuse can be very useful in many scenarios. If it weren't for fork-reuse, the GDI+ bug would not have been so hard to fix and the exploits wouldn't have been so long-lived (that could also have been avoided by use of a better language, of course!). The key feature for a component to be safely reference-reused is stability: the more stable a component is, the less likely it is to change, and thus reference-reuse becomes interesting. Unfortunately, in our always-evolving world, only very generic components are actually stable. Many frameworks, and especially in-house developed or custom software, tend to be unstable during their first years of existence.

When I need to reuse new or unstable software, I always consider fork-reuse, especially if the component is small and has a license that allows me to change the sources and package them together with my software. And in most cases I get all the benefits! I prefer reference-reuse for generic and stable software such as language APIs, Apache Commons, parser generators, and old (but still supported) frameworks like Spring MVC and Hibernate. There is no point in forking those things.

Conclusion

Fork-reuse is a decision that can effectively produce good software with reduced association costs, but like many things in software development, it's not a silver bullet. One must weigh the losses and benefits, and also compare with reference-reuse, before making a decision. Sometimes we opt for build-from-scratch just to avoid some reference-reuse problems such as undesired dependencies, bad performance of layers and low robustness. In these cases, fork-reuse can be used to greatly reduce the costs of a build-from-scratch decision.

Wednesday, April 20, 2011

Scala momentum

Look how Scala is gaining strong momentum:
  • Twitter has moved to Scala (old news, but worth mentioning).
  • The European Union is funding Scala.
  • The Guardian news site is moving to Scala.
  • Scala's official page and Wikipedia page are the top ones when you search "scala" on Google (as of today).
  • The Scala tag on stackoverflow.com has almost 3k followers (as of today). That's more than Groovy (1.2k), Jython (<200) and JRuby (<400). Of course Python and Ruby have many more followers, but they're also much older.
  • There is plenty of positive feedback from people who made the switch. You can find interesting articles here and here.
The only thing that still distresses me is the lack of a corporate powerhouse behind it. We know that Red Hat and probably GNU won't endorse Scala for now. I also wouldn't like to see Scala under the control or heavy influence of something like Oracle or Microsoft. What would be really nice is if Eclipse.org or Apache got off the fence and embraced Scala.

And if you are worried about Ceylon, don't be. The improvements of Ceylon over Java are so small that no one sees a reason to switch.