Saturday, October 30, 2010

Identifying a language for distributed computing

In an ideal situation one can find a unit for measuring a particular characteristic. When this is not possible, criteria for determining the presence of the characteristic help in making a better decision. In this note I will discuss some guidelines for determining whether a programming language supports distributed computing.

The intuitive notion of distributed computing can be viewed as the ability of programs executing on different nodes to communicate in a cooperative manner for accomplishing their common tasks. This view reflects a major aspect of distributed computing but not its nature. Nonetheless, with this naïve description we see that the support of a language for distributed computing can be studied in terms of its abstractions for communication.

Processes running on a single node use a number of operating system facilities in their communication, including shared memory, message queues and files. These facilities are essential for writing system programs. However, a system programming language such as C cannot anticipate all possible facilities and include linguistic abstractions for them as part of the language. Thus, an engineer engaged in system programming must be aware of such system facilities and, apart from the actual solution or algorithm, must also write the necessary code for establishing communication among the processes or threads involved in the solution.
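To make the point concrete, here is a minimal sketch, assuming a POSIX system, of two processes cooperating through a pipe. The function name sum_via_child_process is hypothetical and chosen only for illustration. Notice that the actual solution is a single addition; everything else is plumbing the engineer writes by hand.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical example: a child process produces 1..count and the parent
// adds the values up. Returns -1 if the pipe or the child cannot be created.
int sum_via_child_process(int count) {
    assert(count >= 0);

    int fds[2];
    if (pipe(fds) != 0) return -1;            // create the channel by hand

    pid_t pid = fork();
    if (pid < 0) {                            // could not start the child
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                           // child: the producer
        close(fds[0]);
        for (int i = 1; i <= count; ++i) {
            if (write(fds[1], &i, sizeof i) != static_cast<ssize_t>(sizeof i)) {
                _exit(1);
            }
        }
        close(fds[1]);
        _exit(0);
    }

    close(fds[1]);                            // parent: the consumer
    int total = 0;
    int value = 0;
    while (read(fds[0], &value, sizeof value) ==
           static_cast<ssize_t>(sizeof value)) {
        total += value;                       // the actual solution
    }
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);                 // reap the child
    return total;
}
```

The language itself knows nothing about the pipe, so the compiler cannot tell whether the two sides agree on what is being exchanged.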

Let us digress to see why C cannot include communication facilities of an operating system as linguistic abstractions. When we start with C to craft an operating system we are dealing with the instruction set of a processor. We are going to create a system that will include a number of facilities that are not known to the designer of the C language. Besides, since we have not yet created those facilities, it makes no sense for the language to include them.

Now, C++ is not usable until the core of the operating system is fleshed out. The notion of class is of no value whatsoever at the level of the core. However, after the operating system becomes operational, C++ would be a better choice for creating system programs such as graphic compilers and debuggers. Since C++ is an extension of C, one can also use C system libraries for taking advantage of operating system facilities.

However, the linguistic facilities of C++ are meager and the language is too crude for building modern applications, even for execution on a single system. Since C++ can only be used after the completion of the operating system, the designer is in complete control of which linguistic facilities to include beyond the notion of class and a few ancillary ideas. It makes one wonder whether C++ has reached the end of its evolution, and whether C++ engineers will be required to be aware of the system facilities as C engineers normally are.

Returning to our discussion, a solution that NATURALLY spans more than one node is necessarily NOT a system program. The nodes involved do not have to be identical, and even if they are, none of the nodes is in complete control of the entire program. For purposes of this article I refer to such programs as (distributed) applications in contrast to system programs. The guidelines we are going to discuss are for a language for developing distributed applications. Thus C and C++ retain their value for system programming.

Exchanging files is a major form of communication among (programs running on different) nodes. Plain text files with predetermined layout work quite well. There is no value in standardizing this method of communication except for a few cases such as Internet documents and Email servers. File exchange in any form requires that an engineer write code for communication in addition to the actual solution. Nonetheless, there are times that transferring a large amount of data is worth the overhead of writing specialized code. For this reason we will not gripe about this form of communication for distributed computing.

The focus of our attention is on forms of communication that directly exchange program objects among nodes. One approach is to extend existing communication forms on a single node to distributed computing. This is a desirable approach for solving a great many problems in the space of distributed computing. However, the abstractions associated with communication on a single system lack the concepts needed for solving more challenging distributed problems. Stated differently, extending facilities such as plain function calls to cross the boundaries of nodes, using file transfer for data communication, and exchanging strings as in the client-server model are not sufficient for tackling all problems of distributed computing.

In order to convey this issue in context let us consider an example. Some ancient edifices are still a challenge to erect in modern days. Now, if we were limited to the technology of those days, instead of cities filled with huge buildings we would see a few buildings erected in the entire world every generation or two. In terms of software this is even worse because code cannot be checked the same way as one checks walls and pipes. Essentially, without proper linguistic abstractions our programs will continue to be written the way ancient buildings were constructed, but with undetectable programming defects. Hence, the question is not merely whether something can be done, but rather at what cost and with what degree of reliability.

One form of desirable abstraction can be illustrated with a team of people engaged in completing a task. All teammates have specific tasks, and each one can ask another teammate for certain adjustments in their actions while continuously operating. In Z++ we refer to this model as Communicating Concurrent Processes (CCS). The importance of having a linguistic abstraction lies in the fact that an engineer can incorporate the notion in his thinking while trying to solve his problem. Without such an abstraction the engineer would instead have to spend time designing and implementing a crude mechanism with the kind of abstractions available to him.

In the CCS model any process can send a request to any other process in the team, whether the processes are executing on the same node or not. The sender (teller) of a request carries on with its operation until it hears back from the recipient (hearer). At that time, the teller sends the data, if any, to the hearer and continues with its operation. However, the entire complexity is hidden behind an intuitive linguistic abstraction, thereby relieving the engineer from having to implement a fairly complex notion that he needs as a tool to solve his problem. The complexity of the data being transferred is of no concern to the engineer. It simply resembles a sequential function call within the same application.
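For contrast, the following is a rough sketch of what an engineer might write by hand in plain C++ to imitate this exchange between two threads on a single node. It is not Z++ syntax, and it makes no attempt to cross node boundaries. The names ExchangeResult and run_exchange are hypothetical; the three promises stand in for the request, the reply from the hearer, and the data.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Hypothetical record of one exchange, for illustration only.
struct ExchangeResult {
    int teller_work_done;   // units of work the teller finished while waiting
    std::string received;   // data the hearer ended up with
};

ExchangeResult run_exchange(const std::string& payload) {
    std::promise<void> request;         // teller -> hearer: the request
    std::promise<void> reply;           // hearer -> teller: "ready for data"
    std::promise<std::string> data;     // teller -> hearer: the data itself

    std::future<void> request_seen = request.get_future();
    std::future<void> reply_seen = reply.get_future();
    std::future<std::string> data_seen = data.get_future();

    std::string received;
    std::thread hearer([&] {
        request_seen.wait();            // hearer notices the request
        reply.set_value();              // and answers when it is ready
        received = data_seen.get();     // then accepts the data
    });

    request.set_value();                // teller sends the request

    int work = 0;
    do {                                // teller carries on with its own work
        ++work;
        std::this_thread::yield();
    } while (reply_seen.wait_for(std::chrono::seconds(0)) !=
             std::future_status::ready);

    data.set_value(payload);            // teller heard back: send the data
    hearer.join();                      // after join, received is safe to read

    assert(work >= 1);
    return ExchangeResult{work, received};
}
```

Even this toy version needs three one-shot channels and careful ordering, and it handles only strings on one machine. That is the kind of crude mechanism the engineer is forced to build when the language offers no abstraction for the model.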

A second guideline is in terms of components. Sometimes we need to ask another person at a different location to get up and do certain things for a while. In terms of programming, in this model the remote process is not executing until invoked. The remote process comes up and performs some duties while communicating with its invoker, and terminates. Remote (or local) invocation of Z++ components provides a simple and intuitive solution for this model.

There is a third scenario. Sometimes Secretary of State Hillary Clinton needs to travel to different countries and talk to her peers because she cannot solve her problems in her office. As she travels from one country to another she retains the results of her discussions and uses them to plan for her next visit. In terms of programming she is an Autonomous Agent. Indeed, Z++ agents travel as live processes from node to node, retaining their complete state.
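Without such an abstraction, the closest one can come in plain C++ is to flatten the state of the traveler into data, ship it, and rebuild it at the next stop. The sketch below assumes a hypothetical AgentState with two fields and uses a string in place of the network. Only data travels here, not a live process, which is exactly the limitation being discussed.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical agent state: how many stops were made and what was gathered.
struct AgentState {
    int next_stop = 0;
    int findings = 0;
};

// Flatten the state into text so it can be sent to another node.
std::string pack(const AgentState& state) {
    std::ostringstream out;
    out << state.next_stop << ' ' << state.findings;
    return out.str();
}

// Rebuild the state on the receiving node.
AgentState unpack(const std::string& wire) {
    std::istringstream in(wire);
    AgentState state;
    in >> state.next_stop >> state.findings;
    assert(!in.fail());                 // the wire format must be intact
    return state;
}

// One visit: the node restores the agent, lets it gather local_value
// (a stand-in for whatever the node contributes), and packs it again.
std::string visit(const std::string& wire, int local_value) {
    AgentState state = unpack(wire);
    state.findings += local_value;
    ++state.next_stop;
    return pack(state);
}
```

Every field added to the agent must be added to pack and unpack as well, and nothing checks that the two stay in step.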

While Z++ equips an engineer with intuitive mechanisms for all three models described above, no other language provides a linguistic abstraction for any of the three notions. Using a language like C++ for distributed computing is tantamount to using ancient tools for building large edifices. It somehow works, but at what cost and how reliably.

The first question that comes to mind in choosing Z++ for crafting distributed applications is how it compares to C++ for other features. Actually, Z++ is a superset of C++ with the ability to link with C++ dynamic libraries on Windows and Linux. Perhaps a few words about the term superset are in order.

Z++ includes the entire C++ language along with enhancements and extensions of C++ features. Basically, an engineer can start out using Z++ like C++. However, C++ is a minor extension of C for inclusion of Simula classes. At the time that C++ was still called C with Classes, Ada and Eiffel already had a number of useful linguistic abstractions that C++ should have incorporated. Only templates were later introduced in C++ in the form of linguistic abstractions as a replacement for C header files. The C++ exception mechanism is expensive and hardly useful.

In the following discussion, one has to keep in mind that there is a need for a system programming language like C++. The argument is that it could and should have been designed much better.

When introducing a new language the designer should incorporate all established mechanisms of software engineering. In other words, a new language must accumulate the available knowledge. The use of C structures for jump tables was known at the time C was designed. Using the mechanism for class virtual tables is a good idea, but not an end. In almost two decades of widespread C++ use, only namespaces were added, and those as a poorly thought-out throw-in. The namespace mechanism could have been designed at least up to the level of the state of the art.

That C++ is a front-end parser to C does not mean C++ can only parse classes and leave everything else as is. It could elevate the treatment of enumerations to the level of Ada and add Eiffel invariants, among many other well-understood and useful abstractions of modern software engineering.
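As an illustration of what is missing, here is a small sketch of how a class invariant has to be maintained by hand in C++. The Account class and its rule (the balance is never negative) are hypothetical. The engineer must remember to call the check at the end of every method that changes state; the language does not do it for him.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical class with the invariant: balance_ is never negative.
class Account {
public:
    explicit Account(long balance) : balance_(balance) { check_invariant(); }

    // Withdraws amount and returns the new balance.
    long withdraw(long amount) {
        if (amount < 0 || amount > balance_) {
            throw std::invalid_argument("invalid withdrawal amount");
        }
        balance_ -= amount;
        check_invariant();              // easy to forget in a new method
        return balance_;
    }

    long balance() const { return balance_; }

private:
    // Hand-written stand-in for a language-level invariant clause.
    void check_invariant() const { assert(balance_ >= 0); }

    long balance_;
};
```

Since the invariant is only a convention, a method added later that omits the call compiles without complaint.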

This digression may explain why C++ constitutes only about 60 per cent of Z++. As a system programming language C++ cannot be elevated to the level of Z++. However, the gap between the two languages could be much less than its current state.

Returning to our discussion, a manufacturing process evolves over time by replacing awkward aspects of the process with more efficient and reliable procedures. In terms of programming this evolution takes the form of building libraries for well-understood routines. However, improved procedures of a manufacturing process do not increase the variety of its products. Analogously, libraries only improve the quality of production and its reliability. The space of solutions extends solely by creating new means for solving challenging, infeasible or even unsolvable problems.

Let us cite one relevant example with regard to library extensions. Consider the fact that even though C++ translates its input for the C parser, it is the C++ parser that disallows the use of private methods in certain contexts, not the C compiler. In the same way, libraries that automate XML translations may somewhat help with data transfer, but the parser has no way of detecting errors associated with the intended abstractions.

In conclusion, we have identified three abstractions for facilitating and enabling the design and implementation of demanding distributed applications. Each of the notions of CCS, remote invocation and autonomous agents tackles a certain category of problems. A modern application will more than likely need to utilize all three abstractions. The language Z++ includes intuitive linguistic constructs for all three notions.
