A Bayesian non-parametric methodology for inferring grammar complexity

Based on a set of strings from a language, we wish to infer the complexity of the underlying grammar. To this end, we develop a methodology to choose between two classes of formal grammars in the Chomsky hierarchy: simple regular grammars and more complex context-free grammars. To do so, we introduce a probabilistic context-free grammar model in the form of a Hierarchical Dirichlet Process over rules expressed in Greibach Normal Form. In comparison to other representations, this has the advantage of nesting the regular class within the context-free class.
We consider model comparison both by exploiting this nesting, and with Bayes' factors. The model is fit using a Sequential Monte Carlo method, implemented in the Birch probabilistic programming language. We apply this methodology to data collected from primates, for which the complexity of the grammar is a key question. Joint work with Lawrence Murray and Judith Rousseau.