SWI-Prolog -- Processing Indexed Files

Documentation
- Reference manual
- Packages
  - SWI-Prolog SGML/XML parser

6 Processing Indexed Files

In some cases applications wish to process small portions of large SGML, XML or RDF files. For example, the OpenDirectory project by Netscape has produced a 90MB RDF file representing the main index. The parser described here can process this document as a unit, but loading takes 85 seconds on a Pentium-II 450 and the resulting term requires about 70MB global stack. One option is to process the entire document and output it as a Prolog fact-base of RDF triplets, but in many cases this is undesirable. Another example is a large SGML file containing online documentation. The application normally wishes to provide only small portions at a time to the user. Loading the entire document into memory is then undesirable.

Using the parse(element) option, we open a file, seek (using seek/4) to the position of the element and read the desired element.

The index can be built using the call-back interface of sgml_parse/2. For example, the following code makes an index of the structure.rdf file of the OpenDirectory project:

:- dynamic
        location/3.                     % Id, File, Offset

rdf_index(File) :-
        retractall(location(_,_)),
        open(File, read, In, [type(binary)]),
        new_sgml_parser(Parser, []),
        set_sgml_parser(Parser, file(File)),
        set_sgml_parser(Parser, dialect(xml)),
        sgml_parse(Parser,
                   [ source(In),
                     call(begin, index_on_begin)
                   ]),
        close(In).

index_on_begin(_Element, Attributes, Parser) :-
        memberchk('r:id'=Id, Attributes),
        get_sgml_parser(Parser, charpos(Offset)),
        get_sgml_parser(Parser, file(File)),
        assert(location(Id, File, Offset)).

The following code extracts the RDF element with required id:

rdf_element(Id, Term) :-
        location(Id, File, Offset),
        load_structure(File, Term,
                       [ dialect(xml),
                         offset(Offset),
                         parse(element)
                       ]).