Note: This description assumes that you have at least partial knowledge of the AIML language. If you don't, I recommend playing with AIML for a short time - many things become obvious once you actually have a look at the files. Then again, some of them don't. (Like the 'category order'. Especially category order.) Alicebot.org is always a good resource.
Ok, here's a quick and dirty description of how PASCALice works. When you start it up, it opens a file called startup.xml and hands it over to the TBotLoader class, which loads the general input substitutions and sentence splitters, and then looks for the first enabled bot. When it finds one, it reads the bot id and uses this as a key, e.g. for saving/loading variables and chat logs. The loader then scans all the bot properties (global constants) and also the list of files it has to load.
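To make the startup stage more concrete, here's a toy sketch in Python (PASCALice itself is Pascal, so this is only an illustration). The element and attribute names in the sample startup.xml below are my assumptions, not the actual schema - the point is just the flow: find the first enabled bot, grab its id, its properties, and its file list.

```python
import xml.etree.ElementTree as ET

# Hypothetical startup.xml - the real element/attribute names may differ.
STARTUP_XML = """
<startup>
  <bots>
    <bot id="TestBot" enabled="false"/>
    <bot id="Alice" enabled="true">
      <property name="name" value="Alice"/>
      <learn>std-basic.aiml</learn>
    </bot>
  </bots>
</startup>
"""

def load_startup(xml_text):
    root = ET.fromstring(xml_text)
    # The first enabled bot wins; its id is later used as the key
    # for saving/loading variables and chat logs.
    bot = next(b for b in root.iter("bot") if b.get("enabled") == "true")
    props = {p.get("name"): p.get("value") for p in bot.iter("property")}
    files = [el.text for el in bot.iter("learn")]
    return bot.get("id"), props, files

bot_id, props, files = load_startup(STARTUP_XML)
```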
Let's suppose we got past the initial stage of reading the startup.xml file, and the program starts parsing AIML files. Most of what happens here is determined by the XML parser PASCALice is using, and the format of the AIML files. The loader (this time it's the TAIMLLoader) reads categories one by one as they appear in the file, parses and stores the contents of the pattern, that and topic elements, and when it reaches the end of the category, calls a method to store them in the TPatternMatcher tree together with the template.
What might be of interest to some is the fact that it does not do any processing of the contents of the template element - just cut & paste, so to say. Instead of storing the templates directly in memory inside the pattern matching tree, they could be left on disk (storing only references to the file positions), or put inside a database of sorts (I'm planning to try both of them in the future).
Before I get to the overview of how the actual matching is done, I'd like to say something about the parsing methods I'm using. Usually XML files are processed either with an event-driven mechanism (like SAX), or a DOM tree is created first and then processed. I prefer a 'compiler like' approach (mainly because AIML -is- a scripting language). What I'm usually doing is calling parser.scan(), which simply advances the parser to the next part of the document (element, element content, processing instruction) and tokenizes it for me. I then use case/switch/if statements to make decisions - sometimes just setting a variable, sometimes calling methods (which can, and this is the nice thing, use the same mechanism). You'll see what I mean when I get to the template processing section - now back to loading categories.
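Here's a minimal Python sketch of that 'compiler like' style. The Scanner class and its token shapes are invented for illustration (the real parser works on actual XML), but the idea is the same: scan() advances to the next token, and the caller dispatches on what it finds, possibly calling methods that use the very same mechanism.

```python
class Scanner:
    """Toy stand-in for the XML parser: scan() advances one token."""
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        self.kind = self.value = None

    def scan(self):
        self.kind, self.value = next(self.tokens, ("eof", None))
        return self.kind

def parse_category(scanner):
    """Collect element contents by repeatedly calling scan() and
    dispatching on the token kind - no DOM tree, no callbacks."""
    parts = {}
    while scanner.scan() != "eof":
        if scanner.kind == "start":
            element = scanner.value
            scanner.scan()                 # advance to the text content
            parts[element] = scanner.value
            scanner.scan()                 # consume the end tag
    return parts

tokens = [("start", "pattern"), ("text", "HELLO"), ("end", "pattern"),
          ("start", "template"), ("text", "Hi there!"), ("end", "template")]
cat = parse_category(Scanner(tokens))
```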
PASCALice, like most current implementations, uses a 'static' approach to category matching. This means that the whole path (with all its components) gets created at load time. This is what happens right before the loader passes the category to the tree: it combines the 3 contexts (pattern, that and topic) into one string, separating the three components (contexts) with special context id strings.
The original graphmaster takes this string, separates it into 'words' and stores each of these as nodes in a tree. You can think of it like a directory structure, where the individual words make up the directory names - hence the name 'path' of the category. The TPatternMatcher still works very much like the old Graphmaster, but it stores the nodes according to their context (for you database people - the nodes are keyed first by context, and then by their value, and not just by their value).
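A rough sketch of building such a path and storing it in a graphmaster-style tree (in Python, with a dict-of-dicts as a simplification of the real node structure - the `<that>`/`<topic>` separator tokens follow the usual AIML convention):

```python
def make_path(pattern, that="*", topic="*"):
    # Combine the 3 contexts into one path, separated by context ids.
    return f"{pattern} <that> {that} <topic> {topic}".split()

def add_category(root, path, template):
    # Walk down the tree, creating nodes (like mkdir -p), and store
    # the template at the leaf.
    node = root
    for word in path:
        node = node.setdefault(word, {})
    node["<template>"] = template   # leaf marker

tree = {}
add_category(tree, make_path("HELLO *"), "Hi there!")
```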
Now, the bot is up and running and awaiting user input. After you press enter, the input has to be pre-processed before it can be matched to a category.
The first thing that has to be done is to apply the input substitutions, using the TPreprocessor class. Their main purpose is to protect sentences and to eliminate common spelling mistakes. Many people use them to define 'synonyms' for words too, but I don't recommend such a practice. Once you use a substitution, the original word is forever lost and can never be restored. When using a srai (unless you're abusing underscores :-), you can always add a more specific pattern.
Substitutions are applied in a rather straightforward manner - for each loaded substitution, find and replace it within the input string. There is room for considerable improvement by using a tree to search for them - but I haven't gotten around to implementing it yet :-)
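In Python terms, the straightforward pass looks roughly like this (the substitution list is illustrative; padding with spaces is a common trick to keep replacements on word boundaries):

```python
# Illustrative substitutions - the real ones come from startup.xml.
SUBSTITUTIONS = [(" dont ", " do not "), (" im ", " i am ")]

def apply_substitutions(text):
    padded = f" {text} "           # pad so boundary matches work
    for find, replace in SUBSTITUTIONS:
        padded = padded.replace(find, replace)
    return padded.strip()

result = apply_substitutions("im sure you dont know")
```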
After this, all redundant whitespace is removed and the input is split into sentences (remember the sentence splitters from loading?). Each sentence is then sent to the pattern matcher, and the reply is processed (for side effects). Only the last reply is shown to the user, however. This IMHO successfully mimics humans - even when we've heard and processed something, we don't necessarily reply to every single sentence.
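The per-sentence loop can be sketched like this - get_reply below is just a stand-in for the real pattern matcher, and the splitter set is illustrative; note that every sentence is processed (so side effects happen), but only the last reply is returned:

```python
import re

SPLITTERS = ".?!"    # illustrative; the real set comes from startup.xml

def respond(user_input, get_reply):
    pattern = "[" + re.escape(SPLITTERS) + "]"
    sentences = [s.strip() for s in re.split(pattern, user_input) if s.strip()]
    last_reply = ""
    for sentence in sentences:
        last_reply = get_reply(sentence)  # processed for side effects
    return last_reply                     # only the last reply is shown

reply = respond("Hello. How are you?", lambda s: f"You said: {s}")
```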
Let's have a closer look at the pattern matching and the template processing.
Very briefly, the matching algorithm is a modified version of the DFS (depth first search). The pattern matcher takes the input and combines it with the current context ('that' and 'topic') into a string and tries to find a path in the tree that 'best' matches it (if we were matching dynamically, the contexts would be retrieved when needed). It starts by tokenizing this 'path to be matched' into the individual words (nodes) and traverses the tree from the root.
Unlike the normal DFS, the order in which the relevant subnodes are traversed -is- important, because (when using wildcards) the number of paths that match any given input is quite large (somewhere between 3^n and 4^n for an input of n words), and we need only one (ideally the best).
First, the underscore wildcards are checked. A wildcard in AIML means 'one or more words', and the only difference between the two kinds (_ and *) is the order in which they get searched. Matching of wildcards is recursive - first we try to match only 1 word of the current context to the wildcard, and the remainder is tried against its child nodes. If there was a successful match, i.e. the whole remaining input matched to a path and the last node was a leaf (meaning it contains a reference to a template), return the template; else try 2 words, then 3, and so on. If we used up all the words in the input and the current node is a leaf, then it's a match; else stop matching the underscore and try to match an exact word.
This is easy - either there is a child node that equals the input word or there isn't. If there is, recurse a level deeper. If that leads to a match, terminate. If there was no match, continue with the star wildcard.
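The whole matching order (_ first, then the exact word, then *) condenses into a short recursive function. This is a simplified Python sketch over the same dict-of-dicts tree as before, not the actual TPatternMatcher code - but it shows why an underscore category beats an exact-word category:

```python
def match(node, words):
    if not words:
        return node.get("<template>")   # leaf marker holds the template
    # Priority order at every node: underscore, exact word, star.
    for key in ("_", words[0], "*"):
        child = node.get(key)
        if child is None:
            continue
        if key in ("_", "*"):
            # Let the wildcard absorb 1 word, then 2, then 3...
            for n in range(1, len(words) + 1):
                result = match(child, words[n:])
                if result is not None:
                    return result
        else:
            result = match(child, words[1:])
            if result is not None:
                return result
    return None

tree = {"HELLO": {"*": {"<template>": "Hi there!"}},
        "_": {"BOT": {"<template>": "Talking about me?"}}}
# The underscore path is tried before the exact word "HELLO":
answer = match(tree, "HELLO NICE BOT".split())
```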
What the TPatternMatcher currently does is to try to match only the current context, instead of the whole input. This helps to keep down the number of recursions on wildcards, and opens up paths for future expansion.
When a match is found, a TMatch object is created. This holds all relevant information about the match, for example the template or the strings that matched the wildcards.
Once we have matched the input and have a response template, we need to process it too.
Here, the usefulness of the parser shows up. The TemplateProcessor creates and maintains the parser for the template, and is responsible solely for the root element, <template>. Each element does basically the same thing: it sequentially reads its contents; if it's character content, append it to a buffer; if it's another element, call ThatElementClass.Process() and append the results instead. The trick here is how to get the relevant element class, which is why there is the ElementFactory. At creation, each TElement descendant registers itself with the (global) ElementFactory instance, supplying the name of the elements it handles (there's also a default handler for unrecognized elements). The ElementFactory class stores the instances in a lookup table... and virtual methods take care of the rest.
To make it flexible enough, the elements also receive information about the last match via a TMatch object.
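The factory idea can be sketched in a few lines of Python. The class names below mirror the ones in the text, but the code itself is illustrative - handlers register under the tag names they handle, lookups fall back to a default, and each handler gets the match information:

```python
class ElementFactory:
    """Lookup table from element name to its handler."""
    def __init__(self):
        self.handlers = {}

    def register(self, name, handler):
        self.handlers[name] = handler

    def get(self, name):
        # Unrecognized elements fall through to the default handler.
        return self.handlers.get(name, default_handler)

def default_handler(match):
    return ""   # unrecognized elements produce no output

factory = ElementFactory()
# e.g. a 'star' handler pulls the wildcard binding out of the match info
factory.register("star", lambda match: match["star"])

match_info = {"star": "NICE BOT"}   # stand-in for a TMatch object
output = factory.get("star")(match_info)
```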
Ok, what's more to say? I know, these 'developer docs' could have been more to the point, with actual code examples. I might have tried a bit more to keep them clear and concise - I'm afraid that some of the explanations are quite a mess. I could have delved more into the file formats. And definitely, more comments in the code would have made this unnecessary... but hey, in the world of free software, this is probably as good as it gets. If you really want to know the gory details, don't be afraid to ask - the developer mailing list is a great place for stuff like that. And there's always the source :-) If anyone has any questions or suggestions, I'll be glad to hear them. You know my mail.

Kim Sullivan, firstname.lastname@example.org