Performance Issue: `thing` and Free Variables

It turns out that lots of MRS representations end up using thing(X) as a predicate, which means “filter X to the set of all things”. This is a large set. Because Prolog doesn’t operate on “sets” but instead backtracks through instances one by one this ended up creating a big performance bottleneck.

Here’s an example:

"what is on the table?"

[ TOP: h0
INDEX: e2
RELS: < 
[ which_q__xhh LBL: h5 ARG0: x3 [ x PERS: 3 NUM: sg ] RSTR: h6 BODY: h7 ]
[ thing__x LBL: h4 ARG0: x3 [ x PERS: 3 NUM: sg ] ]
[ _the_q__xhh LBL: h9 ARG0: x8 [ x PERS: 3 NUM: sg IND: + ] RSTR: h10 BODY: h11 ]
[ _table_n_1__x LBL: h12 ARG0: x8 [ x PERS: 3 NUM: sg IND: + ] ]
[ _on_p_loc__exx LBL: h1 ARG0: e2 [ e SF: ques TENSE: pres MOOD: indicative PROG: - PERF: - ] ARG1: x3 ARG2: x8 ]
>
HCONS: < h0 qeq h1 h6 qeq h4 h10 qeq h12 > ]

                      ┌thing__x:x3
 which_q__xhh:x3,h6,h7┤
                      │                      ┌_table_n_1__x:x8
                      └_the_q__xhh:x8,h10,h11┤
                                             └_on_p_loc__exx:e2,x3,x8
                                             
Logic: which_q__xhh(x3, thing__x(x3), _the_q__xhh(x8, _table_n_1__x(x8), _on_p_loc__exx(e2, x3, x8)))

Let’s say there are 1000 “things” in the world (which is a small number if you consider that properties are things, names are things, etc. The reality is much bigger). Because thing(x3) appears right at the beginning, Prolog will literally backtrack 1000 times through everything after the thing(x3) predicate.

Now notice that all thing__x is doing is handing, one by one, a list of things in the world to _on_p_loc__exx which checks to see if they are “on the table”. Because of the way Prolog works, if you instead run the query _on_p_loc__exx with an unbound variable (i.e. one that hasn’t been set to anything), a well-structured Prolog program will list out the things on the table much more efficiently.

I found this thing(X) pattern to be a bad enough performance problem in many queries that I had to come up with an alternative solution: have the implementation of thing(x) simply leave the variable unbound (i.e. don’t have thing(x) do anything!) and have it work through the magic of Prolog!

But, if it were that simple I wouldn’t have this whole section for it…

The problem occurs with quoting. If, in this example, you asked _on_p_loc__exx to quote itself (perhaps with the phrase “put everything on the table”), it has to come up with a representation for the variable that can be stored to disk easily. Because Prolog can’t round trip a variable and have it come back as the same variable, we need a way to map it:

% Showing what happens if you convert a variable to a string and back
?- term_string(Z, X), writeln(Z), term_string(Y, X), writeln(Y), Y=a, writeln(Z).

% It isn't the same variable anymore!
_18908      % This is what Prolog calls "Z" internally
_18918      % This is what Prolog calls "Y" internally after we create it
      % using the string representation of Z! It isn't the same!
_18908    % This is the value of Z at the very end, clearly "Y=a" didn't set it
      % So Z was not successfully round tripped through a string
X = _18908, % This is the actual string representation of Z, which is expected
Y = a.

Serializing Free Variables

So, the work-around is pretty simple…but a little hacky.

First: when we are asked to serialize a free variable in a triple using the Perplexity Ontology, we store a magic value idVariable(ERGVariable). ERGVariable is the original name of the variable from the MRS such asx3 in the example from above: idVariable(x3).

Then, when we deserialize it, we need a way to map from something like idVariable(x3) back to the real Prolog variable so that Prolog can report the answer correctly. As I showed in the previous section, this isn’t as simple as calling term_string on the variable. Instead, I create a mapping of ERG variable name to Prolog variable for all free variables. I store it on the stack as an alternative to having a global variable around that needs to be managed.

To keep it on the stack, I just use freeVariables/2 (see below) to call any predicates I want to have access to this information. Within that call, any predicate can call getFreeVariable/2 to retrieve the proper variable:

% expects Variables to be of the form: [item(ergName, ActualPrologVariable), ...]
freeVariables(Goal, Variables) :-  
  call(Goal),  
  % The retainVariables/1 call ensures freeVariables/2 remains
  % on the stack and passing the argument ensures Prolog will not 
  % garbage collect the argument.  
  retainVariables(Variables).  
  
retainVariables(_).  
  
% Retrieve the Prolog variable given an erg variable name in "Name"
getFreeVariable(Name, Variable) :-  
  prolog_current_frame(Me),  
  prolog_frame_attribute(Me, parent_goal, freeVariables(_, Variables)),  
  memberchk(item(Name, Variable), Variables).

This means that all Prolog code can ignore this optimization, as long as:

  1. There is a call to freeVariables/2 that surrounds all the predicates
  2. The code that stores triples uses getFreeVariable/2 properly

You can see this in practice in the Executing the Prolog section since that is where the predicates get wrapped with freeVariables/2.