Friday, February 25, 2005

Removed Text from the Introspector Lightning talk at FOSDEM 2005

Here is the material that did not make it into the Original Speech.

Involving the human mind in the process of introspection


One of the major tasks that I see in this process of understanding code is the involvement of the HUMAN MIND, the user.

I think that by feeding information about the software to the visual cortex via the eyes, or by whatever means might be used by disabled persons, the mind's natural pattern-matching and model-building process will take over. When the mind is then able to pose new questions to the introspector system to gain more information, the viewpoint of the visualization system is focused on the newly selected topic.

The mind will then focus on interesting aspects. The next step is to allow the patterns found to be captured and fed back into the tool. This creates a feedback loop in which the meta-programming tool is guided actively by the mind exploring the software.

A meta-programming tool will then be successful when it allows the programmer to directly, naturally, and efficiently access the data collected from both the software and the context of the software.

The data must support operations in the form of structures, lists, trees, graphs, relational databases, vectors in memory and simple text files. All of these forms of data are needed to allow the programmer to choose the right access method to attack the problem at hand.

Of course GUIs will be of value, and visualization tools that can lay out and filter graphs will be of use. But these tools need to be secondary to the goal of raw access to the data. All of this data needs to be accessed via . I personally think that graph layout algorithms can be applied to data structures to optimize their memory layout.

The conclusion is that the introspector needs to be as slim and as efficient as possible in providing useful information to the programmer. But it also needs to be as open and usable as possible, providing redundant representations of the meta-data so that it can be exploited.

The Context of programming

The idea of context is difficult to define in general for meta-programs, because you have a meta-context! The context of a meta-program is related to all the contexts of the object-programs that it operates on.

Because of the idioms and the style of the programmer, the important data about a program can be encoded in a unique and programmer-dependent style. This style or character of the code embodies the essence of the coder. Because of the seemingly unlimited expressiveness of a programmer, there is no way to dictate how a particular idea will be encoded. Naming and style conventions, coding styles, and documentation contain context-specific information that is needed to understand the code.

To make the problem worse, the dreams and visions of the programmer, conversations between programmers over coffee, unwritten assumptions, and cultural background all play a role in the style of the code written.

Programming is Communication

Writing code is a form of formal communication! When you view code as a message, you can open your eyes to the interpersonal and social aspects of code that aid in its understanding.

The act of writing code has at least four aspects:
  1. Communication of instructions to the compiler (and other meta-programs) and finally to the computer for execution. So, in the first step, you write programs for a computer. Communicating the instructions for how the object-program is to execute is the real job of a programmer. The programmer communicates with the authors of the meta-programming tool via their agent, the meta-program.
  2. Communication of concepts to one's future self. The second step is to write a program so that you might be able to understand and reuse your mental state at the time of writing; the communication of the concepts to yourself.
  3. Communication of concepts to other programmers and third parties who might use or even further develop your code.
  4. Communication of meta-data back to the programmer in the form of feedback; compiler error messages, for example.
Intercepting Communication is one of the main goals of the introspector

The interception of that communication and its decoding by a third party is the next step, when the code is taken out of the context of the original message to the computer chip.

The problem is that an outside person will not easily be able to fully understand a captured message exchanged in a closed context with no external reference information.

So now we have set the scene for meta-programming: people creating tools for their own usage as messages to themselves and a small user group, and others trying to intercept those messages.

A program is a message. Understanding a program involves decoding that message and recoding it into your context. Usage of contextual information outside of the code itself is often needed to decode the message. The introspector allows you to collect this reference data in a central repository and supports the understanding of the message.

Examples and classes of meta programs

Some examples of what I consider to fall into the class of meta-programs are:
  • compilers, translators and interpreters are programs that process and execute other programs
  • Custom User Defined programs that are written by users to process the software
Programs that affect and control the process of creating the software
  • build tools like Autoconf, Make, Automake, Ant that control the compilation and build process
  • I don't consider tools that are merely used in the build process, such as Grep, Bash, Sed, and more trivially Tar, Gzip and the Linux kernel, to be meta-programs, even if they can be used to implement meta-programs, because they do not deal with the software directly. These programs, however, contain important meta-data related to the program and will need to have interceptors installed to collect that data.
  • Tools that deal with software packages, like dpkg, rpm and apt, can also be considered meta-programs because they are providers and consumers of meta-data about the software.
  • Linkers, Assemblers
  • optimization routines of the gcc
User Space Run Time Functionality
  • The reflection mechanisms of Java and the eval function of Perl
  • Dynamic languages such as Lisp, Prolog, Haskell, to some extent Perl and C#, and many other advanced languages that have direct support for meta-programming
Profilers and runtime optimization routines
  • Profilers and Data Collection routines
  • Dynamic Linkers
  • JIT tools and partial specialization routines
  • Process introspection and snapshotting (core dumps included)
  • The GDB debugger
Code Generators
  • Language creation tools such as Yacc, Lex, ANTLR and TreeCC
  • program transformation tools like refactoring browser tools, aspect-oriented browsers, and generic programming tools
Programs that extract information from your code and deliver it to the user
  • code validation and model checking tools such as lint and more advanced model checking tools
  • reverse engineering tools, case tools, program visualization tools
  • intelligent code editors and program browsers that have a limited understanding of the code (Emacs falls into this category in the strictest sense)
  • automatic documentation tools like Doxygen
  • Even IDEs can, strictly speaking, be considered meta-programs, or at least containers for them.
  • Of course, I would consider my pet project, the introspector, a meta-program.

Metaprograms are like mushrooms: they sprout out of the dark, damp and dead parts of existing code

The one thing that I have observed is that very many meta-programming projects just spontaneously sprout out of the ground, each with a similar goal: that of processing programs and making programming easier, meta-programming. Most such programs are not reusable or reused, and they mostly do not provide any well-defined interface to their meta-data.

In Lisp you have a standard Meta Object Protocol (MOP), which is well thought out but also very Lisp-specific; and on the other side, there is a huge amount of meta-data in Lisp that does not have a standard, well-defined interface into it.

The more context-specific a piece of meta-data and a meta-program are, the more effective they are for the context they were created for; the best example is an assembler or compiler optimized for a specific processor. There is a huge number of research and experimental systems that provide various degrees of freedom to the programmer and user.

For the most part, meta-programming tools can be classified into three classes:

1. So context-specific that they cannot be generally reused and are essentially disposable. They sprout out of some concrete problem and are just like mushrooms that grow on some rotting material. The scope of the coverage of the fungus is limited by the scope of the problems in the object-program.

2. So abstract and complex as to be not easily usable, understandable, or practical. The context is artificial, abstract and mathematical. This is a different form of being context-specific: the context is the mind of the author or his limited slice of research. This is a classic example of a message from the programmer to himself, as discussed above, lacking any reference to the outside world.

3. The few rare cases are practical tools that find a safe mix between abstraction and context. The C language has a very small set of abstractions, and GCC has been able to define routines that are reusable between various languages. The problem with these practical tools is that they are in general lacking any of the advanced meta-programming features found in the previous two classes.

Metaprogramming tools normally don't work together, and for the most part they don't work for you

For the average programmer working on an average system, very little is available for their usage. When you sit down to work on a normal programming task, let's say one associated with working on the source of any of the GNU tools, there are basically no standard, integrated and usable meta-programming tools that you can use for all aspects of your work.

There is very little in terms of a standard interface or set of requirements placed on meta-programs in general. This is due to the fact that programming is a form of formal, context-specific communication, as explained above.

Metaprogramming tools are disposable

Meta-programs are tools that are for the most part disposable. Their effects result in bugs being found and fixed, in the case of validators, or in documentation being produced, or in code being generated. The programs themselves interact with the programmer via configuration files, a GUI, or individual commands. The programmer guides and controls the meta-programming process. So in the end, meta-programming tools are only as good as they are usable by a programmer, and only as good as they are applicable to a given problem.

The set of the meta-data for a given program is very large

The compiler is a meta-program that contains a large amount of data about the software at hand, but there is also a large set of programs that make up the build process. Luckily, for most interesting programs the source for all these programs is available. So all of the tools that are used to direct the build of the software can be considered meta-programs that affect the final object-software. If we look at all the data that is contained by all the instances of the meta-programs, then we define a large set of meta-data that describes the software.

All these tools, when considered together, use and process many aspects of the software. So we can say that the total amount of data in memory, at all points of the running of the meta-software, contains a very good picture of the software that is being compiled. Now the question is: how can we get the meta-programs to communicate this data to us?!


Recoding the message into RDF with an explicit context

Now, once a program has been understood, it can be encoded into a context-independent representation, like RDF, with explicit context, relationships and meaning.

RDF means Resource Description Framework. Resources are things of value that are worth identifying and describing. Every single aspect of the software can be represented as a graph. The nodes in the graph are resources or literals. The edges are called predicates; they can represent pointers, containment or basically any binary relationship between nodes. In RDF each type of edge is itself a resource and can be defined in detail.

We can assign a unique resource identifier in the form of a URI to each identifier, each variable, each value and each function call of the software on the static level. By adding in the concept of a program instance, time and computer, we can also assign resources to dynamic things like values in memory, function stacks and frames.
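
As a rough illustration, here is a minimal sketch in C of what emitting such statements could look like. The base URI, the vocabulary terms and the described identifiers are all invented for this example; they are not part of any fixed introspector vocabulary.

    /* Minimal sketch: assign URIs to program elements and print RDF
       statements about them in N-Triples syntax.  The base URI, the
       predicate names and the described identifiers are hypothetical. */
    #include <stdio.h>

    #define BASE  "http://example.org/introspector/myprog.c#"
    #define VOCAB "http://example.org/introspector/vocab#"

    /* Print one triple whose subject and object both live under BASE. */
    static void triple(const char *subj, const char *pred, const char *obj)
    {
        printf("<%s%s> <%s%s> <%s%s> .\n", BASE, subj, VOCAB, pred, BASE, obj);
    }

    int main(void)
    {
        /* Describe a function parse_line with a local variable count of type int. */
        triple("parse_line",       "hasLocal", "parse_line/count");
        triple("parse_line/count", "hasType",  "type/int");
        return 0;
    }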

Once this model of the program has begun to be built, the communication about that program, in the form of documentation, emails, bug reports, feature requests and specifications, can be decoded, because it will reference symbols in the code, or the code will reference symbols in the communication.

Now, the symbols that occur in the source code could be constant integers, constant strings, identifiers in the code, or even sets and sequences of types without names.

So, the first step in decoding a program would be to index the set of all identifiers. Then the relationship between the identifiers and the concepts they stand for needs to be determined. Mapping names onto WordNet resources would be a great start. The relationships between identifiers also need to be discovered.

By transforming the source code into a set of RDF statements that describe it, and also converting the context data into a similar form, a union of the two graphs can be created and relationships between the two can be found.


Application of Meta-Data to the Interceptor Pattern

If the meta-program is changed so that it emits this data in a usable common format, then this data can be put into context and used to piece together a total picture of the context of the software. This is what I call the interception pattern. The message between the programmer and the machine is intercepted and recorded. There needs to be a common API for this interception, and there need to be tools for automating it. That can be done by using the meta-data collected from the compiler and the build tools in the first pass. By decoding the data structures of the build tools, we can semi-automatically create serialization routines. By applying the techniques described here, each program can be trained to communicate its meta-data to the introspector. Each program that is hooked up to this framework increases the knowledge available for the integration task.
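
As a sketch of what such a semi-automatically generated serialization routine might look like, assume a hypothetical struct taken from a build tool; the struct, its field names and the output format below are invented for illustration.

    /* Minimal sketch of a serialization routine of the kind that could be
       generated from a meta-program's own data structures.  The struct,
       field names and predicate names are hypothetical. */
    #include <stdio.h>

    struct build_target {
        const char *name;      /* name of the thing being built       */
        const char *source;    /* the source file it is compiled from */
        int         optimized; /* whether optimization was requested  */
    };

    /* Emit each field of one build_target as a subject/predicate/object line. */
    static void serialize_build_target(const struct build_target *t)
    {
        printf("<target/%s> <hasSource>   \"%s\" .\n", t->name, t->source);
        printf("<target/%s> <isOptimized> \"%d\" .\n", t->name, t->optimized);
    }

    int main(void)
    {
        struct build_target t = { "hello", "hello.c", 1 };
        serialize_build_target(&t);
        return 0;
    }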

The idea of the semantic printf function

The next idea would be to replace the printf routines with a general routine to query and extract the data that is available in the context of that printf. Given that we will have access to a list of all the variables available in any given context, and that we will also be able to know any variable that can be directly or indirectly accessed from those variables, it will be possible to invoke and process user-specified extraction and interception code at the point of the printf. The printf could reference the meta-data at that point, giving each variable to be emitted a very detailed context.
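
A minimal sketch of what such a call could look like in C follows. The macro name, the output format and the example variable are hypothetical, and a real implementation would draw the surrounding context from the compiler's meta-data rather than from the preprocessor.

    /* Minimal sketch of a "semantic printf": the call names the variable it
       exposes, so a meta-program knows exactly which value, from which file
       and line, is being emitted.  The macro and output format are made up. */
    #include <stdio.h>

    #define SEMANTIC_PRINT_INT(var) \
        printf("<%s:%d> <exposes> <var/%s> \"%d\" .\n", \
               __FILE__, __LINE__, #var, (var))

    int main(void)
    {
        int retries = 3;
        SEMANTIC_PRINT_INT(retries); /* emits the value plus its source context */
        return 0;
    }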

The data that we need is there, we just need to get at it

As the user of a meta-program, you often feel that you are a second-class citizen. Yes, well, that is the core problem that I am addressing. Most programs are written to solve a problem for some person; the fact that you are using it is secondary. The GCC compiler itself is a good example of a self-serving program. It represents a huge amount of knowledge that is locked up in a representation that is highly inaccessible. The fact is that much of the information that the user of the compiler needs, and has to enter manually, is available to the compiler developers but is never exposed to the user.

Because of the large number of open source tools, and the fact that all the GNU tools are based on a limited core set of tools, all available in source form, they are a perfect target for the collection of meta-data. Not only are all the source histories available, but also the documentation, the mailing lists, and basically all the contextual information. There is a huge amount of publicly available data about the GNU project.

The adding of meta data to C

The history of C and C-like languages can be seen as an evolution of meta-data and meta-programs. Each new addition to the language gives more meta-data to the meta-program, the compiler. Each language breaks with the previous version for some reason, good or bad, and in the end you are forced to rewrite your code to use these new features. In the end, the process is just the adding of more meta-data to the existing program and then the interpreting of this advanced meta-data by a more advanced meta-program, a better compiler. There is no reason that this meta-information and the validation of it cannot be added by other means, with the processing of it decoupled from the monolithic process. Even the addition of meta-data about the persistence and the network accessibility of software via DCE IDL and CORBA can be specified in the same manner on top of the existing software, without new syntaxes.
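
One hedged sketch of how extra meta-data could be layered onto existing C without a new syntax is an annotation macro that the compiler throws away but a meta-program can harvest. The macro and the "persistent" tag below are purely illustrative.

    /* Minimal sketch: an annotation macro that expands to nothing for the
       compiler, so the code still builds unchanged, while a meta-program
       scanning the source (or the parse tree) can read the tag and its value.
       The macro name and the "persistent" tag are hypothetical. */
    #include <stdio.h>

    #define META(tag, value)  /* compiles away; harvested by a meta-program */

    META(persistent, "stored in the users table")
    struct user {
        long id;
        char name[64];
    };

    int main(void)
    {
        struct user u = { 1, "ada" };
        printf("%ld %s\n", u.id, u.name);
        return 0;
    }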

The reading of introspector-augmented meta-data back into the meta-programs

It is reasonable to consider the idea of reading the instances of the data stored in the meta-programs directly out of the introspector. The API that the introspector provides for intercepting the meta-data can then be used to read the updated data back out, or even in from another source. In this manner, entire programs could be translated from other languages or generated programmatically. The entire set of intermediate files and file formats can be unified into a common data representation and communication mechanism. This is possible because the programs to be modified are free software, and they can be modified to provide this interface. The idea of the kernel module would allow this to be done without changing the software.

The monolith and the network

The fact that GCC is linked in the way that it is, is an organisational, political and sociological decision. It could also be split up into many independent functions. Given a mechanism for intercepting, communicating and introspecting the function frames, any conceivable network of processing can be implemented without using the archaic linking mechanism used by the existing GCC.

The linker and function frame are a data bus that can be intercepted


The linker and the function call frame represent a path of data communication. The compiler produces tight bindings between functions, and the linker copies them into the same executable. Given enough meta-data about a function call, its data can be packed into a neutral data format and the function can be implemented in a completely isolated and separate process.
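
The sketch below illustrates the idea in miniature, with both sides living in one program for simplicity: a call to a hypothetical add() function is packed into a neutral text message that could just as well be shipped to a separate process, where it is unpacked and dispatched to the real implementation.

    /* Minimal sketch of treating a function call as a message.  The wire
       format, the add() function and the dispatcher are all hypothetical;
       in a real system the packing code would be generated from the
       compiler's meta-data about the function's signature. */
    #include <stdio.h>

    static int add(int a, int b) { return a + b; }

    /* Caller side: pack a call to add() into a neutral, line-oriented message. */
    static void pack_call(char *buf, size_t n, int a, int b)
    {
        snprintf(buf, n, "call add a=%d b=%d", a, b);
    }

    /* Callee side: unpack the message and dispatch to the real implementation. */
    static int dispatch(const char *msg)
    {
        int a = 0, b = 0;
        if (sscanf(msg, "call add a=%d b=%d", &a, &b) == 2)
            return add(a, b);
        return -1;
    }

    int main(void)
    {
        char msg[64];
        pack_call(msg, sizeof msg, 2, 3);  /* could cross a process boundary */
        printf("%d\n", dispatch(msg));
        return 0;
    }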

Simplicity and Practicality are the key factors for the success of free software

The great science fiction author Stanislav Lem writes in his article "Metainformationstheorie" [1] (translated from Polish to German) that the evolution of ideas in computer science is a natural selection function that selects ideas by their commercial success and not by the gain in knowledge. He cites the meme idea of Richard Dawkins, who compared pieces of information to genes: self-replicating individuals competing for resources.

We can treat free software as a meme and analyse its attributes.

For free software this success is defined in the following terms:
  1. Replication - How often a piece of software is executed (invoked), copied, downloaded, how often its ideas are copied, how often the software is used! We can see that the invocation of a program is the copying of the software into the core of the processor; in that moment it becomes active. We can measure the success of software as its core share: how often it is copied into the core of the computer, how often it comes alive.
  2. Mutation - How often a piece of software is changed to adapt to its environment. This is a function of how useful the software is and how easily it can be mutated into something more useful. The paradox of free software is that the mutation functions are expensive because of the nature of the protection mechanism: free software needs to protect itself as a meme from being mutated into non-free software.
  3. Resources - The amount of work, time and space that is required to use, understand and mutate the software. This is the cost function that is to be minimized. The meme's success however is
These factors help explain Richard Gabriel's paradoxical phenomenon of "Worse is Better" [2].
(Being from New Jersey, I naturally identify with the New Jersey "worse is better" attitude.) Simplicity, practicality and interactivity are the most important factors in the success of an idea.

I say that interactivity is important because it is simple and practical in reducing the costs of learning and using a piece of software. When people evaluate software, they want to determine within a very short period whether these factors are met.

Free software has the paradoxical feature that the source code of successful free software tools is complex, impractical and not interactive. The situation created is that the resources that need to be invested in learning the context of a free software project are so high that the programmer becomes bound to that context and identifies with it.

How does the GPL prevent the usage of meta-data?

This is going to get hairy here; this is a question that I have been thinking about for many years!
The short answer is: there is nothing stopping any program from reading the meta-data of free software.

Reading the meta-data does not create a derived work; copyright covers the copying of derived works. Of course, if the structure of the meta-data is context-specific, then the meta-data itself is a derived work of the object-program.

The solution to this entire problem can be stated as follows:

Any meta-data about an object-program that is intercepted from inside a meta-program in a foreign program-context can be translated into a user-context without creating a derived work; only the translation routine is derived from the structure of the foreign context.

Because of the amount of data available about free software, open source and even shared source software, they can all be translated in this manner.

The conflict between free software context and the open meta-data

The user is interested in practicality, simplicity and interactivity. Free software as a meme is interested in memetic success: replication, mutation and the control of resources. These two are at odds. Free software tries to protect itself by making access to the meta-data impractical, complex and non-interactive. The introspector has the goal of resolving this conflict and making the meta-data accessible to the user.

Conclusion

Source code is in the end just meta-data that flows in a network of meta-programs. The communication between these meta-programs is handled via primitive mechanisms that inhibit the sharing of data.

Via modification of the meta-programs, a man-in-the-middle attack can be implemented to intercept the messages from the programmer to the computer, augment these messages with contextual information and unify them into a global knowledge base. Given a critical mass of meta-data, the messages and data flows of a program can be understood.

This represents an end to the existing concept that using a function creates a derived work, for the very fact that the compiler and linker can semi-automatically create wrappers, interceptors, serializers and introspection code for any source code that is embedded in a critical mass of meta-data.

This represents a shift in power away from the creators of meta-tools to the users of them and will give more freedom to the users of free software.

[1] Stanislav Lem, "Metainformationstheorie", http://www.heise.de/tp/deutsch/kolumnen/lem/5443/1.html
[2] Richard Gabriel, "Worse is Better", http://www.jwz.org/doc/worse-is-better.html