Tuesday, May 3, 2011

It's not just reuse

In software development, there are two kinds of reuse: reference-reuse and fork-reuse.

Reference-reuse is what we usually understand by plain "reuse": if we need some functionality in our software, we look for existing software that provides that feature and is easy to link (works on the same platform, etc.).

Fork-reuse is a different way of reusing: we make a copy of existing code. The copy has its own life, independent of the original code. Some people argue that fork-reuse is not reuse at all, because "come on, you are duplicating code!". But from a business perspective, copying and adapting means using an existing thing to avoid an entirely new development. So fork-reuse is definitely a form of reuse.
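To make the distinction concrete, here is a minimal Java sketch (all class and package names are hypothetical). Under reference-reuse we would write "import org.example.textutil.Slugifier;" and link the library's JAR; under fork-reuse we copy the class into our own source tree and adapt it:

    package com.mycompany.myapp.text; // the copy now lives in OUR source tree

    // Forked from the hypothetical library class org.example.textutil.Slugifier.
    // From now on it evolves with this project, independent of the original.
    public class Slugifier {
        public String slugify(String title) {
            // Simplified during the fork: this project only needs
            // lowercase ASCII letters, digits, and dashes.
            return title.toLowerCase().replaceAll("[^a-z0-9]+", "-");
        }
    }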

As software engineers, we are compelled to avoid fork-reuse, because bugs and enhancements to the original code need to be manually replicated in the forked copy. So fork-reuse is usually worse than reference-reuse, right?

Wrong, wrong, wrong! Fork-reuse is not necessarily worse. In fact, there are many scenarios where fork-reuse can be much more powerful and time-saving than reference-reuse. Its benefits are rarely well explored in classes and development books, so let's enumerate them.

1. Isolation (low impact of changes).

When a component is fork-reused (copied) N times to become part of other components, a change in the original will not affect the copies, and vice versa. But if a component is reference-reused by N others, every change to it might break N components. This forces quality procedures such as beta releases and integration tests: the more a component is reference-reused, the greater the probability that a beta version reveals a bug, and the higher the cost of integration testing. Fork-reuse also reduces the number of releases, deployments, and upgrade procedures, so the impact is reduced even further.

2. Easy deployment and version management.

Because fork-reused software usually has its sources in the same project as the software that uses it, there are fewer external libraries, and packaging becomes easy. Even when there are external libraries, it's preferable to put them in a local directory rather than a shared one, to avoid the notorious "DLL hell" problem, which was recently succeeded by the "multiple JAR versions" problem of the Java platform. All these problems are caused by reference-reuse. OSGi is a popular approach that solves them, but it also increases the burden of packaging.
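As a hedged illustration of this idea (all names here are hypothetical): a forked class relocated into the application's own package can never clash with another version of the original elsewhere on the classpath.

    package com.mycompany.myapp.vendored.json; // relocated into our namespace

    // Forked from the hypothetical library org.example.json, version 1.2.
    // Even if another dependency drags in org.example.json version 2.0,
    // the class loader resolves our calls to this copy and theirs to that
    // one; no "multiple JAR versions" conflict is possible.
    public final class JsonEscaper {
        public static String escape(String s) {
            return s.replace("\\", "\\\\").replace("\"", "\\\"");
        }
    }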

3. Correctness.

Why do new bugs appear? In a broad sense, because software changes. Fork-reuse reduces the amount of change caused by upgrades of the reused component, and thus reduces the probability that new bugs will appear. Of course, an organization can test every upgrade, but that increases development costs, as explained in the previous topic. And no matter how much you test, commercial software is hardly ever released in a bug-free state. Quality teams don't assert that a release has no bugs; instead, they say its quality is acceptable. As time passes, users discover new features and use the software in different ways, and eventually they find bugs. Some of those bugs are caused by upgrades of a reference-reused component, and not rarely such a component has been rewritten, a scenario where new bugs tend to appear!

4. Robustness.

This is more about architecture, or more specifically, about threads and processes. A parallel system can divide its work into processes or threads. Processes do not share their state, so they are doing fork-reuse (each modifies a local copy of the initial state). Threads do the opposite: they share caches and connections, and are often themselves shared in a worker pool. Guess which architecture is more robust? The one that divides work into processes, of course. If a process crashes, the operating system ensures that it dies alone, and the others keep working without even knowing; usually a controller subsystem spawns another process when needed. But if a shared thread crashes or a shared state becomes corrupt, the whole node (with all the other threads and states) either crashes together or degenerates as a whole.
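A minimal Java sketch of this process-based design (worker.jar is a hypothetical, self-contained worker program): the controller launches workers as separate OS processes and respawns any that die.

    import java.io.IOException;

    // Supervises N workers, each a separate OS process that shares no heap,
    // caches, or connections with its siblings. If a worker crashes, the
    // operating system reclaims it and the others never notice.
    public class Controller {
        public static void main(String[] args) {
            for (int i = 0; i < 4; i++) {
                final int workerId = i;
                new Thread(() -> superviseForever(workerId)).start();
            }
        }

        static void superviseForever(int workerId) {
            while (true) {
                try {
                    Process worker = new ProcessBuilder(
                            "java", "-jar", "worker.jar", String.valueOf(workerId))
                            .inheritIO().start();
                    int exit = worker.waitFor(); // blocks until the worker dies
                    System.err.println("worker " + workerId
                            + " exited with " + exit + ", respawning");
                } catch (IOException | InterruptedException e) {
                    System.err.println("supervisor for worker " + workerId
                            + " giving up: " + e);
                    return;
                }
            }
        }
    }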

5. Flexibility.

Changing a thing that belongs to oneself can be done quickly and without bureaucracy. When a thing is privately used, it can be adapted or even radically changed to support particular needs. On the other hand, if someone needs a new feature or a special bugfix in a shared component (while keeping it shared), he or she must file a request, hope it will be accepted, wait until the task gets done, and then wait for the release. This often takes so much time that the project team chooses to accept the limitation or build a workaround. Once the client product is released, the request becomes less important and might never get done; if it gets done after the project, there will be an upgrade, whose costs were discussed before. Interestingly enough, the upgrade effort might not even touch the workaround!

In some big corporate environments, it's not uncommon for an IT manager to force the upgrade of an in-house developed component, just to keep reference-reuse and gain some advertised SOA benefit. The result: the change must be reviewed and approved by the governance committee, and then development and deployment must follow schedule and policies, all of which involves a lot of talking and more people than usual. The effect is slower delivery and less robustness, quite contrary to the alleged SOA benefits!

6. Maintainability.

Fork-reused components can become very simple if we strip all the useless things during the fork. The code becomes small, clear, direct, and obvious. For example, it's often possible to replace callbacks with direct calls, to add domain-specific variables for control flow, and to put in system-specific logging and exception handling.

Reference-reused components, by contrast, are usually full of features, exposed as properties, APIs, and other kinds of protocol. That's necessary because a reference-reused component cannot do exactly the same thing in every place it's linked. For example, text input widgets often have a property that tells whether the field is a password, so that every character the user types is displayed as an asterisk. Each property or API defines new states and behaviors that must be handled accordingly by the component. The source becomes full of flow-control tests, or has its logic spread across many linked classes and virtual methods implementing some design pattern. Testing all combinations of properties and scenarios becomes virtually impossible. The result: the component becomes difficult to understand and to evolve, and fixing one bug quite commonly creates another, which hopefully will be caught by automated tests so it can also be fixed before release. In the worst cases of property bloat, a component becomes so parameterized that it can provide entirely distinct behaviors! This happens when developers try to force an unnatural reuse, or when a component becomes so big that no one quite understands its fundamental purpose.
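The password-field example can make this concrete. A sketch (hypothetical classes) of the shared, property-driven widget next to what a fork of it might look like:

    // The shared, reference-reused widget: each consumer's need became a
    // property, and the rendering logic must branch on all of them.
    class SharedTextInput {
        boolean password;      // mask typed characters with asterisks?
        boolean multiline;     // render as a text area?
        int maxLength = -1;    // -1 means unlimited
        String placeholder;    // hint text shown when the value is empty

        String render(String value) {
            String shown = password ? "*".repeat(value.length()) : value;
            if (maxLength >= 0 && shown.length() > maxLength) {
                shown = shown.substring(0, maxLength);
            }
            if (shown.isEmpty() && placeholder != null) {
                shown = placeholder;
            }
            return (multiline ? "[textarea: " : "[input: ") + shown + "]";
        }
    }

    // The forked copy, stripped for an application that only ever needs a
    // single-line password field: no flags, no dead branches, obvious code.
    class PasswordInput {
        String render(String value) {
            return "[input: " + "*".repeat(value.length()) + "]";
        }
    }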

7. Performance.

Because the team is free to change a local fork of a component, they can make it more specific and better suited to the target application. By removing indirections, useless transformations, and needless decisions, the reused component becomes better optimized for the task. The simplicity and direct calls also enable more automatic platform optimizations, further increasing performance. Of course, you only gain performance if you invest in some refactoring, but with a language such as Java and an IDE such as Eclipse or IntelliJ, refactorings can be applied very quickly.
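As a sketch of the kind of refactoring meant here (Matcher, Order, and both filter classes are hypothetical), compare a generic callback-driven filter with its forked, specialized replacement:

    import java.util.ArrayList;
    import java.util.List;

    // Before the fork: a generic filter pays for one indirect (virtual)
    // call per element, through a callback interface.
    interface Matcher<T> {
        boolean matches(T item);
    }

    class GenericFilter {
        static <T> List<T> filter(List<T> items, Matcher<T> matcher) {
            List<T> out = new ArrayList<>();
            for (T item : items) {
                if (matcher.matches(item)) {  // indirection on every element
                    out.add(item);
                }
            }
            return out;
        }
    }

    // After the fork: the only predicate this application ever uses is
    // inlined, removing the interface, the callback object, and the
    // dispatch; a direct field test that the JIT can optimize freely.
    class Order {
        boolean active;
    }

    class ActiveOrderFilter {
        static List<Order> activeOrders(List<Order> orders) {
            List<Order> out = new ArrayList<>();
            for (Order o : orders) {
                if (o.active) {               // direct call path, no dispatch
                    out.add(o);
                }
            }
            return out;
        }
    }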

8. Knowledge.

Fork-reuse may require developers to dig into the reused code. This is an opportunity for the team to learn new APIs, design features, code styles, and sometimes even new algorithms. This learning further enhances development productivity and makes the team more adaptable and open to better options. It's quite the contrary of what happens to teams that only do reference-reuse or build from scratch: they become confined to a few limited perspectives. Of course, if you must reuse spaghetti code, dealing with it will be painful. But beware: there are many programming styles, and what is spaghetti to some developers can be well structured and organized to others. This often happens when a developer accustomed to one paradigm (like event handling or MVC) needs to read code that uses another (like functional programming or inheritance-based extensibility).

Wait, we are adults!

After reading all this, you might think that I'm against reference-reuse. If you do, sorry, but you missed my foreword. I'm neither against nor in favor of either approach. As software engineers, we must seek balance in every choice. I wrote this because there is hype in many engineering circles around reference-reuse, driven by concepts such as object orientation, SOA, DRY, design patterns, and some insane attempts to reduce software development costs. There is a lot of prejudice against fork-reuse, and in some places it's barely considered, while all the problems of reference-reuse seem to be ignored!

However, people and companies are evolving, and sometimes a smart group can pull the majority to a better state of knowledge. Apple, for instance, forked KHTML to create WebKit. Chrome, in its turn, started as a fork of WebKit, but after some versions they merged (perhaps that was better for WebKit than for Chrome, who knows?). LibreOffice is a fork of OpenOffice that showed great improvements already in its early releases. These are examples of big software being fork-reused. There are also examples with smaller software, like the Jetty server (one of the best servlet engines, in my opinion), which was fork-reused by Eclipse.

Of course, fork-reuse also has its problems, and reference-reuse can be very useful in many scenarios. If it weren't for fork-reuse, the GDI+ bug would not have been so hard to fix, and its exploits would not have been so long-lived (that could also have been avoided by using a better language, of course!). The key property for a component to be safely reference-reused is stability: the more stable a component is, the less likely it is to change, and the more interesting reference-reuse becomes. Unfortunately, in our ever-evolving world, only very generic components are actually stable. Many frameworks, and especially in-house developed or custom software, tend to be unstable during their first years of existence.

When I need to reuse new or unstable software, I always consider fork-reuse, especially if the component is small and has a license that allows me to change the sources and package them with my software. In most cases, I get all the benefits. I prefer reference-reuse for generic and stable software such as language APIs, Apache Commons, parser generators, and old (but still supported) frameworks like Spring MVC and Hibernate. There is no point in forking those things.

Conclusion

Fork-reuse is a decision that can effectively produce good software at reduced cost, but like many things in software development, it's not a silver bullet. One must check and weigh losses and benefits, and compare with reference-reuse, before making a decision. Sometimes we opt for build-from-scratch just to avoid reference-reuse problems such as undesired dependencies, the poor performance of layers, and low robustness. In those cases, fork-reuse can greatly reduce the costs of a build-from-scratch decision.
