lundi 31 août 2015

Tokenizing a string in Prolog using DCG

Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespaces. For example, the expected result of tokenizing "aa 11" would be [tkSym("aa"), tkNum(11)].

My first attempt was the code below:

whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].

letter(Let)     --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym])   --> letter(Sym).

digit(Dg)        --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg])     --> digit(Dg).

token(tkSym(Token)) --> symbol(Token). 
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.

tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].  

Calling tokenize on "aa bb" leaves me with several possible responses:

 ?- tokenize(X, "aa bb", []).
 X = [tkSym([97|97]), tkSym([98|98])] ;
 X = [tkSym([97|97]), tkSym(98), tkSym(98)] ;
 X = [tkSym(97), tkSym(97), tkSym([98|98])] ;
 X = [tkSym(97), tkSym(97), tkSym(98), tkSym(98)] ;
 false.

In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:

whitespace --> [Space], { char_type(Space, space) }, whitespace.
whitespace --> [].

symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([])      --> [].
letter(Let)     --> [Let], { code_type(Let, alpha) }.

% similarly for numbers

token(tkSym(Token)) --> symbol(Token).

tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whiteSpace, [].

But there is a problem: although the single answer to token called on "aa" is now a nice list, the tokenize predicate ends up in an infinite recursion:

 ?- token(X, "aa", []).
 X = tkSym([97, 97]).

 ?- tokenize(X, "aa", []).
 ERROR: Out of global stack

What am I missing? How is the problem usually solved in Prolog?



via Chebli Mohamed

Aucun commentaire:

Enregistrer un commentaire