Thursday, January 06, 2005

n3v [n triple vector]an efficient memory representing of compiler graphs

Happy new year!

I have finally gotten around to start unifying the ideas of ice cubes and rdf.

The ice cube idea was to use a binary matrix representation of the graph as a N*N cube of data.

RDF/ntriples is based on making statements of triples that describe the graph.

I have today, built a new binary representation of the ntriples format. It is based on the idea of representing each uri as an index into a vector. This index should be as compact as possible, so we can exploit the cache of the computer.

This gets into the area of linear algebra, and the tools lapack will be interesting, and ScaLAPACK
provides a distributed processing mechanism for it. I will have to write more about that in the future.

Basically it boils down to creating an vector of uri, and assigning those uris an index. Optimally the index would be a perfect hashing function.

For the introspectors gcc graphs, this index is already there, it is the node id that was assigned during the traversal of the compiler graph, so I just extract out that number encoded in the uri of the node.

For the predicates, an id is assigned as a counter, first come first serve.

The program that does this is done by the n3v_converter.pl program.

It has the following parameters :
  • input_uri the uri (file:foo.ntriples) to parse
  • map_file the map file of predicates to indexs
  • output_file the output file to produce
  • debug_file the debug file to emit
  • PACKFORMAT the format of the binary file

PACKFORMAT are three chars, one for the subject, predicate and object.

It is passed directly to perls pack routine, one page that documents it is here
Here are some useful values, but it occurs to me that a fixed width char format might be interesting as well!

  • C An unsigned char value.
  • S An unsigned short value.
(This 'short' is _exactly_ 16 bits, which may differ from
what a local C compiler calls 'short'.)
  • I An unsigned integer value.
(This 'integer' is _at_least_ 32 bits wide. Its exact
size depends on what a local C compiler calls 'int',
and may even be larger than the 'long' described in
the next item.)
  • L An unsigned long value.
(This 'long' is _exactly_ 32 bits, which may differ from
what a local C compiler calls 'long'.)
The resulting packed input file can be read directly into memory.

Here is an example program that reads the Short/Char/Short triple stucture.
Here is the input file that is it hardwired (in terms of array size) to read.

here is my post to rdfig/swig on the freenode irc chat

more to follow.

mike