AutoCorrect and AutoGrammar for the ERG

Next: Understanding the Output of the ERG: the Minimal Recursion Semantics Format

Previous: Using ERG for a Natural Language Interface Flow

I’ve found the ERG to be sensitive enough to punctuation, spelling, etc. that I needed to autocorrect some common issues to make Perplexity work properly. The ERG is also very sensitive to missing articles (“open door” vs. “open the door”) and will often fail to parse, or will generate a very unusual interpretation, when articles like “a”, “an” and “the” are missing. I’ve needed to add some simple heuristics for adding articles, and I’ve called this “AutoGrammar”.

AutoGrammar

Actually putting in missing articles is a hard problem to solve reliably. I ended up taking a very simple approach: if a phrase fails to parse at all or fails in the “logical sense” (by returning false), the code will blindly add articles right before any nouns in the phrase that don’t have one and try again. If that succeeds in the “logical sense” (by returning true) then the phrase gets suggested to the user.

Note that this is just a heuristic: there are plenty of places in English where a noun doesn’t need an article or, in fact, shouldn’t have one. The heuristic uses the magic of the ERG to catch these cases: if the generated phrase can’t be parsed by the ERG, we assume it wasn’t a good fix and don’t suggest it.

So for example:

“leave cave” fails since it isn’t proper English. So, “leave a cave” gets tried and ultimately suggested to the user since it works.
“go to some rocks” might fail because we are too far away from the rocks. But, since it failed, the heuristic will check whether the problem was “missing articles”. Since “rock” is a noun, it will try the correction “go to some a rocks”, which will obviously fail and so won’t get suggested.
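
The retry loop described above can be sketched roughly as follows. This is only an illustration, not the prototype's actual code: `is_noun` and `succeeds` are hypothetical callbacks standing in for the ERG's noun detection and for "parses and returns true in the logical sense", and always inserting "a" is an assumption based on the “leave a cave” example.

```python
ARTICLES = {"a", "an", "the"}


def add_missing_articles(phrase, is_noun, article="a"):
    """Blindly insert an article before every noun not already preceded by one."""
    words = phrase.split()
    result = []
    for index, word in enumerate(words):
        previous = words[index - 1].lower() if index > 0 else None
        if is_noun(word) and previous not in ARTICLES:
            result.append(article)
        result.append(word)
    return " ".join(result)


def suggest_with_articles(phrase, is_noun, succeeds):
    """Return a corrected phrase to suggest, or None if there is nothing to suggest.

    `succeeds` is a hypothetical callback: True when the phrase parses and
    also works in the "logical sense".
    """
    if succeeds(phrase):
        return None  # The original phrase worked, so no fix is needed
    candidate = add_missing_articles(phrase, is_noun)
    if candidate != phrase and succeeds(candidate):
        return candidate
    return None  # The "fix" failed too, so it is silently dropped
```

With a toy `is_noun` that knows “cave” and “rocks”, `add_missing_articles("go to some rocks", is_noun)` produces “go to some a rocks”, which `succeeds` rejects, so nothing is suggested.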

AutoCorrect Words

The autocorrect algorithm I’m using is very simple: it will either match a word and replace with another or match a whole phrase and replace with another one. That’s it. But the list of replacements is illustrative of the issues I’ve encountered.

Users consistently typed “pickup” instead of “pick up”, and the ERG generated zero parses for it since “pickup” is not a word. Since there were zero parses, the prototype would respond with “That’s not proper English!” This made people think the system really didn’t understand the phrase “pick up”, since they thought they were spelling it properly. Ditto for contractions typed without the apostrophe, like “wheres”. So, I just started autocorrecting:

"pickup": {"Replace": "pick up", "Type": "word" },
"Pickup": {"Replace": "Pick up", "Type": "word" },
"whats": {"Replace": "what's", "Type": "word" },  
"Whats": {"Replace": "What's", "Type": "word" },  
"theres": {"Replace": "there's", "Type": "word" },  
"Theres": {"Replace": "There's", "Type": "word" },  
"wheres": {"Replace": "where's", "Type": "word" },  
"Wheres": {"Replace": "Where's", "Type": "word" },  
"thats": {"Replace": "that's", "Type": "word" },  
"Thats": {"Replace": "That's", "Type": "word" }
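
As a minimal sketch of how entries of type "word" might be applied (the function name `autocorrect_words` and the use of regular-expression word boundaries are my assumptions, not necessarily how the prototype does it):

```python
import re

# A few entries in the same shape as the table above
AUTOCORRECT_TABLE = {
    "pickup": {"Replace": "pick up", "Type": "word"},
    "Pickup": {"Replace": "Pick up", "Type": "word"},
    "whats": {"Replace": "what's", "Type": "word"},
    "Whats": {"Replace": "What's", "Type": "word"},
}


def autocorrect_words(text, table=AUTOCORRECT_TABLE):
    """Replace every whole word that exactly matches a "word" entry."""
    words = {
        key: entry["Replace"]
        for key, entry in table.items()
        if entry["Type"] == "word"
    }
    if not words:
        return text
    # \b keeps "pickup" from matching inside a longer word such as "pickups"
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in words) + r")\b"
    )
    return pattern.sub(lambda match: words[match.group(1)], text)
```

Matching is case-sensitive here, which is why the table carries both a lowercase and a capitalized key for each word.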

The ERG will generate predicates to indicate proper nouns if the user capitalizes them. Otherwise, they will usually get mapped to the nn_u_unknown__x predicate. I could have handled this case in that predicate but instead found it easier to just capitalize all the names in my scenario automatically if the user forgot to:

"lexi": {"Replace": "Lexi", "Type": "word" },
"plage": {"Replace": "Plage", "Type": "word" },

For some reason it was really difficult to get a consistent parse for greetings the user would type unless they added the final punctuation to it. Some examples:

If they forgot the final punctuation (“hi” instead of “hi!”), the ERG would not be able to tell if it was a command or a question.
Typing “Hi” (with capitalization) generated a parse that assumed it was a proper noun.

I found that if the user typed in any number of greetings, with proper punctuation, it consistently generated a tree like the one below which is what I wanted:

                         ┌greet__ci:good_day,i8
 discourse__ihh:i9,h6,h10┤
                         └unknown__eu:e2,u5

Logic: discourse__ihh(i9, greet__ci(good_day, i8), unknown__eu(e2, u5))

So, I autocorrected the common cases so this would happen:

"hi": {"Replace": "hi!", "Type": "sentence" },
"Hi": {"Replace": "Hi!", "Type": "sentence" },
"hello": {"Replace": "hello!", "Type": "sentence" },
"Hello": {"Replace": "hello!", "Type": "sentence" },
"good day": {"Replace": "good day!", "Type": "sentence" },
"good afternoon": {"Replace": "good afternoon!", "Type": "sentence" },
"good evening": {"Replace": "good evening!", "Type": "sentence" } 
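
Entries of type "sentence" differ from "word" entries in that they should only fire when the key is the entire input. A small sketch of that behavior, assuming (the function name and the whitespace stripping are my own choices) that the user's text is compared as a whole against the table:

```python
# A few entries in the same shape as the table above
GREETING_TABLE = {
    "hi": {"Replace": "hi!", "Type": "sentence"},
    "Hi": {"Replace": "Hi!", "Type": "sentence"},
    "good day": {"Replace": "good day!", "Type": "sentence"},
}


def autocorrect_sentence(text, table=GREETING_TABLE):
    """Replace the input only when all of it matches a "sentence" entry."""
    entry = table.get(text.strip())
    if entry is not None and entry["Type"] == "sentence":
        return entry["Replace"]
    return text  # Anything longer, such as "hi there", is left alone
```

Matching the whole input keeps the rule from adding “!” in the middle of a longer phrase that merely contains a greeting.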

Finally, parsing titles, like the title of a book, was problematic. If the user didn’t include quotes, the ERG would not recognize the text as a title and so would not apply the fw_seq predicate to it. Furthermore, properly capitalizing a title actually generated more complex parses.

So, ideally, I wanted titles to be quoted and lowercase, and here’s how I did it in the autocorrect engine:

% the type "caseInsensitiveWords" matches any casing 
% of the key on the far left and replaces it with a 
% fully lowercase value in "Replace". These two entries
% ensure that a user that typed in the title quoted, but 
% some uppercase, gets fully converted to lowercase 
"'how to escape the cave system'": 
	{"Replace": "'how to escape the cave system'", 
	"Type": "caseInsensitiveWords" },  
"\"how to escape the cave system\"": 
	{"Replace": "\"how to escape the cave system\"", 
	"Type": "caseInsensitiveWords" },  
	
% the type "unquotedCaseInsensitiveWords" matches the
% key on the far left if it is unquoted, and then quotes
% it with the "Replace" value
"how to escape the cave system": 
	{"Replace": "'how to escape the cave system'", 
	"Type": "unquotedCaseInsensitiveWords" }
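
The combined effect of those three entries can be sketched in one function. This is an approximation under my own assumptions: it folds both types into a single regular expression rather than modeling "caseInsensitiveWords" and "unquotedCaseInsensitiveWords" separately, and `normalize_title` is a hypothetical name.

```python
import re

TITLE = "how to escape the cave system"


def normalize_title(text, title=TITLE):
    """Lowercase any casing of the title and make sure it ends up quoted."""
    # Optional opening quote, the title in any casing, optional closing quote
    pattern = re.compile(
        r"""(['"]?)""" + re.escape(title) + r"""(['"]?)""", re.IGNORECASE
    )

    def fix(match):
        opening, closing = match.group(1), match.group(2)
        if opening and opening == closing:
            # Already quoted: keep the user's quote style, lowercase the title
            return opening + title + closing
        # Unquoted (or mismatched quotes): wrap the title in single quotes
        return "'" + title + "'"

    return pattern.sub(fix, text)
```

For example, `normalize_title('read How To Escape The Cave System')` gives `read 'how to escape the cave system'`, while a title that is already in double quotes keeps them and is only lowercased.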

That is all I had to autocorrect in the prototype to make the ERG work well for this scenario. Obviously, handling all the normal autocorrect cases (e.g. replacing “teh” with “the”) would have made the scenario better. However, most of those mistakes just failed in an obvious way that made the prototype return “I didn’t understand X”, which was fine.
