Walkthrough Tutorial: Learning Grammars from Corpora
This walkthrough tutorial accompanies Section 4.2 of the following paper:
Authors. (submitted). PyFCG: Fluid Construction Grammar in Python.
A second use case of FCG concerns the learning of construction grammars from corpora of language use. We take the example of fcg-propbank, an existing FCG subsystem for learning construction grammars from PropBank-annotated corpora. We demonstrate how a pretrained grammar comprising tens of thousands of constructions can be loaded into an agent and used to extract semantic frames from open-domain text, how a new grammar can be learnt from annotated data, and how large grammars can be saved in an efficiently loadable binary format.
[ ]:
# Run this cell if you have not yet installed these packages
! pip install pyfcg
! pip install nltk
[1]:
import pyfcg as fcg
fcg.init()
Loading a pretrained grammar
As always, we start by creating an agent. In this case, the agent is an instance of the fcg.PropBankAgent class, a subclass of the fcg.Agent class provided by the fcg-propbank subsystem. We download a pretrained, precompiled grammar for English and load it into our agent using its load_grammar_image method. The agent now has at its disposal a grammar consisting of 21,052 constructions.
[2]:
propbank_agent_pretrained = fcg.PropBankAgent()
pretrained_grammar_image = fcg.load_resource('pb-en.store')
[3]:
propbank_agent_pretrained.load_grammar_image(pretrained_grammar_image)
propbank_agent_pretrained
[3]:
<Agent: agent (id: agent-23) ~ 21052 constructions>
Our agent can now use its pretrained grammar to comprehend new utterances. Below, we instruct our agent to comprehend the passive utterance “Margaret Thatcher was elected Prime Minister of Britain.”. The resulting meaning representation reveals that the agent identified a single semantic frame that instantiates the elect.01 PropBank roleset (elect someone to an office or position). The agent also understood that the roles of candidate (arg1) and office or position (arg2) in this instance of elect.01 are respectively taken up by “Margaret Thatcher” and “Prime Minister of Britain”.
[4]:
propbank_agent_pretrained.comprehend("Margaret Thatcher was elected Prime Minister of Britain.")
[4]:
[{'roleset': 'elect.01',
'roles': [('v', 'elected', [3]),
('arg1', 'Margaret Thatcher', [0, 1]),
('arg2', 'Prime Minister of Britain', [4, 5, 6, 7])]}]
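The result is a plain list of dictionaries, so ordinary Python is enough to pull out individual roles. The following is a minimal sketch, assuming the structure shown in the output above: a 'roleset' string plus a 'roles' list of (label, text, token indices) triples. The helper roles_by_label is hypothetical and not part of PyFCG, and the frame is copied from the output above rather than produced by a live agent.

```python
# Frame copied verbatim from the comprehension output above.
frames = [{'roleset': 'elect.01',
           'roles': [('v', 'elected', [3]),
                     ('arg1', 'Margaret Thatcher', [0, 1]),
                     ('arg2', 'Prime Minister of Britain', [4, 5, 6, 7])]}]


def roles_by_label(frame):
    """Map each role label to its text (hypothetical helper, not a PyFCG function)."""
    return {label: text for label, text, _indices in frame['roles']}


elect_roles = roles_by_label(frames[0])
print(elect_roles['arg1'])      # Margaret Thatcher
print(elect_roles.get('arg0'))  # None: no arg0 was extracted for this utterance
```

A dictionary keeps only one entry per label, so this sketch would silently drop information if a frame ever contained the same label twice.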
To enhance human readability, we can choose to activate an FCG monitor to trace the comprehension process in the web interface:
[5]:
fcg.start_web_interface()
fcg.activate_monitor('trace-fcg')
propbank_agent_pretrained.comprehend("She especially enjoyed visiting the old historic churches.")
[5]:
[{'roleset': 'enjoy.01',
'roles': [('v', 'enjoyed', [2]),
('arg0', 'She', [0]),
('arg1', 'visiting the old historic churches', [3, 4, 5, 6, 7])]},
{'roleset': 'visit.01',
'roles': [('v', 'visiting', [3]),
('arg0', 'She', [0]),
('arg1', 'the old historic churches', [4, 5, 6, 7])]}]
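The two frames in this output overlap: the arg1 of enjoy.01 covers tokens 3 to 7, which includes the verb of visit.01. The sketch below shows one way to detect such embedding from the index lists alone. It assumes the output structure shown above; verb_index and embedding_roles are hypothetical helpers, not PyFCG functions.

```python
# Frames copied verbatim from the comprehension output above.
frames = [{'roleset': 'enjoy.01',
           'roles': [('v', 'enjoyed', [2]),
                     ('arg0', 'She', [0]),
                     ('arg1', 'visiting the old historic churches', [3, 4, 5, 6, 7])]},
          {'roleset': 'visit.01',
           'roles': [('v', 'visiting', [3]),
                     ('arg0', 'She', [0]),
                     ('arg1', 'the old historic churches', [4, 5, 6, 7])]}]


def verb_index(frame):
    """Return the token index of the frame-evoking verb (the 'v' role)."""
    for label, _text, indices in frame['roles']:
        if label == 'v':
            return indices[0]
    raise ValueError('frame has no v role')


def embedding_roles(outer, inner):
    """Labels of the arguments of `outer` whose token span contains the verb of `inner`."""
    target = verb_index(inner)
    return [label for label, _text, indices in outer['roles']
            if label != 'v' and target in indices]


enjoy, visit = frames
print(embedding_roles(enjoy, visit))  # ['arg1']
print(embedding_roles(visit, enjoy))  # []
```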
In order to better understand the PropBank rolesets that are retrieved by our agent, we define a new function describe_roleset. The function makes use of nltk’s propbank module to look up all roles of a given roleset, together with their descriptions:
[6]:
import nltk
nltk.download('propbank')
from nltk.corpus import propbank
[nltk_data] Downloading package propbank to /Users/paul/nltk_data...
[nltk_data] Package propbank is already up-to-date!
[7]:
def describe_roleset(roleset):
    nltk_roleset = propbank.roleset(roleset)
    print(nltk_roleset.attrib['id'])
    for role in nltk_roleset.findall("roles/role"):
        print(' arg' + role.attrib['n'] + ':', role.attrib['descr'])
describe_roleset('elect.01')
elect.01
arg0: voters
arg1: candidate
arg2: office or position
[8]:
display(propbank_agent_pretrained.comprehend("Attention passengers, the taxi is arriving at Gate 1."))
describe_roleset('arrive.01')
[{'roleset': 'arrive.01',
'roles': [('v', 'arriving', [6]),
('arg1', 'the taxi', [3, 4]),
('arg3', 'at Gate 1', [7, 8, 9])]}]
arrive.01
arg1: entity in motion / 'comer'
arg2: extent -- rare)
arg3: start point -- also rare)
arg4: end point, destination
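The integer lists attached to each role appear to be token positions in the utterance, with punctuation marks counted as separate tokens. The sketch below checks this reading against the output above. The regular-expression tokenizer is an assumption made for illustration; the tokenizer that the agent actually uses is not specified here, and this simple one would likely diverge on contractions or hyphenated words.

```python
import re


def simple_tokenize(utterance):
    """Split into words and single punctuation marks (an assumption, not PyFCG's tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", utterance)


utterance = "Attention passengers, the taxi is arriving at Gate 1."

# Frame copied verbatim from the comprehension output above.
frame = {'roleset': 'arrive.01',
         'roles': [('v', 'arriving', [6]),
                   ('arg1', 'the taxi', [3, 4]),
                   ('arg3', 'at Gate 1', [7, 8, 9])]}

tokens = simple_tokenize(utterance)

# Rebuild each role's text from its indices and compare it with the reported text.
matches = []
for label, text, indices in frame['roles']:
    rebuilt = ' '.join(tokens[i] for i in indices)
    matches.append(rebuilt == text)
    print(label, indices, repr(rebuilt), rebuilt == text)
```

For this utterance, every rebuilt span equals the reported role text, which is consistent with the token-position reading.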
[9]:
display(propbank_agent_pretrained.comprehend("Explain this to me again."))
describe_roleset('explain.01')
[{'roleset': 'explain.01',
'roles': [('v', 'Explain', [0]),
('arg1', 'this', [1]),
('arg2', 'to me', [2, 3])]}]
explain.01
arg0: explainer
arg1: thing explained
arg2: explained to
[10]:
display(propbank_agent_pretrained.comprehend("They enjoy visiting New York."))
describe_roleset('visit.01')
describe_roleset('enjoy.01')
[{'roleset': 'visit.01',
'roles': [('v', 'visiting', [2]),
('arg0', 'They', [0]),
('arg1', 'New York', [3, 4])]},
{'roleset': 'enjoy.01',
'roles': [('v', 'enjoy', [1]),
('arg0', 'They', [0]),
('arg1', 'visiting New York', [2, 3, 4])]}]
visit.01
arg0: one party
arg1: other party
enjoy.01
arg0: enjoyer
arg1: thing enjoyed
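Frames and role descriptions can also be combined into a single readable listing. The sketch below does this without calling nltk: the descriptions are copied by hand from the describe_roleset output above, and the frames from the comprehension output. The helper annotate_frame is hypothetical and not part of PyFCG.

```python
# Role descriptions copied from the describe_roleset output above.
role_descriptions = {
    'visit.01': {'arg0': 'one party', 'arg1': 'other party'},
    'enjoy.01': {'arg0': 'enjoyer', 'arg1': 'thing enjoyed'},
}

# Frames copied verbatim from the comprehension output above.
frames = [{'roleset': 'visit.01',
           'roles': [('v', 'visiting', [2]),
                     ('arg0', 'They', [0]),
                     ('arg1', 'New York', [3, 4])]},
          {'roleset': 'enjoy.01',
           'roles': [('v', 'enjoy', [1]),
                     ('arg0', 'They', [0]),
                     ('arg1', 'visiting New York', [2, 3, 4])]}]


def annotate_frame(frame, descriptions):
    """Pair each argument of a frame with its description (hypothetical helper)."""
    known = descriptions.get(frame['roleset'], {})
    return [(label, known.get(label, 'no description'), text)
            for label, text, _indices in frame['roles']
            if label != 'v']  # the verb itself has no role description


for frame in frames:
    for label, description, text in annotate_frame(frame, role_descriptions):
        print(f"{frame['roleset']} {label} ({description}): {text}")
```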
Training a new grammar
Let us now create a second agent, again as an instance of the fcg.PropBankAgent class, but let it learn a new grammar from corpus data instead of loading a pretrained one. After having downloaded an example CoNLL file, in which a number of English sentences are annotated with PropBank rolesets, we can inspect the first sentence of the file:
[11]:
propbank_agent_learner = fcg.PropBankAgent()
conll_annotations = fcg.load_resource('pb-annotations.conll')
with open(conll_annotations, 'r') as f:
    sentence_end = False
    while not sentence_end:
        line = f.readline()
        if '/' not in line:
            sentence_end = True
        else:
            print(line)
/ 0 0 I / / - - (ARG0*)
/ 0 1 gave / / give give.01 (V*)
/ 0 2 flowers / / - - (ARG1*)
/ 0 3 to / / - - (ARG2*
/ 0 4 my / / - - *
/ 0 5 mother / / - - *)
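To make the annotation format more concrete, the sketch below parses the six lines printed above into the same frame structure that comprehend returns. The column positions (token in the fourth field, roleset in the eighth, role bracket in the ninth) are inferred from this sample only and are an assumption about the file format. The parser handles a single predicate per sentence and no nested brackets, and parse_conll_sentence is a hypothetical helper, not part of PyFCG or fcg-propbank.

```python
# The six annotation lines printed above, copied verbatim.
sample = """\
/ 0 0 I / / - - (ARG0*)
/ 0 1 gave / / give give.01 (V*)
/ 0 2 flowers / / - - (ARG1*)
/ 0 3 to / / - - (ARG2*
/ 0 4 my / / - - *
/ 0 5 mother / / - - *)
"""


def parse_conll_sentence(lines):
    """Turn one annotated sentence into a frame dictionary (hypothetical helper)."""
    tokens, roles = [], []
    roleset, open_label, open_start = None, None, None
    for line in lines:
        fields = line.split()
        token, sense, tag = fields[3], fields[7], fields[8]
        tokens.append(token)
        position = len(tokens) - 1
        if sense != '-':
            roleset = sense  # the line holding the predicate names the roleset
        if tag.startswith('('):
            # An opening bracket starts a new span, e.g. "(ARG2*" -> "arg2".
            open_label = tag[1:tag.index('*')].lower()
            open_start = position
        if tag.endswith(')'):
            # A closing bracket ends the span that is currently open.
            indices = list(range(open_start, position + 1))
            roles.append((open_label, ' '.join(tokens[i] for i in indices), indices))
            open_label, open_start = None, None
    return {'roleset': roleset, 'roles': roles}


parsed = parse_conll_sentence(sample.splitlines())
print(parsed)
```

The roles come out in sentence order here, whereas the agent's output lists the verb first; the content is otherwise organised in the same way.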
We now call the agent’s learn_grammar_from_conll_file method. This call initiates the learning process implemented by the fcg-propbank subsystem and equips the agent with the resulting grammar. In this case, the agent has learnt two lexical constructions (for verbs with the lemmas give and send), two word sense constructions (for the rolesets give.01 and send.01), and two argument structure constructions (a double object construction and a prepositional dative construction).
[12]:
propbank_agent_learner.learn_grammar_from_conll_file(conll_annotations)
propbank_agent_learner
[12]:
<Agent: agent (id: agent-24) ~ 6 constructions>
We can inspect the agent’s grammar in the web interface.
[13]:
propbank_agent_learner.grammar.show_in_web_interface()
We can now instruct our agent to comprehend a previously unseen utterance, using the grammar it just learnt, by calling its comprehend method. While comprehending “The King of the Belgians sent a box of chocolates to Forrest Gump.”, the agent identifies an instance of the send.01 (give) roleset, with “The King of the Belgians” as the sender (arg0), “a box of chocolates” as the thing sent (arg1) and “to Forrest Gump” as the sent-to entity (arg2).
[14]:
display(propbank_agent_learner.comprehend('The King of the Belgians sent a box of chocolates to Forrest Gump.'))
describe_roleset('send.01')
[{'roleset': 'send.01',
'roles': [('v', 'sent', [5]),
('arg0', 'The King of the Belgians', [0, 1, 2, 3, 4]),
('arg1', 'a box of chocolates', [6, 7, 8, 9]),
('arg2', 'to Forrest Gump', [10, 11, 12])]}]
send.01
arg0: sender
arg1: sent
arg2: sent-to
Once learnt, a grammar can be saved by calling the save_grammar_image method of the fcg.Agent class. This method saves the grammar to a file in a compiled, binary format that can later be loaded efficiently using an agent’s load_grammar_image method.
[15]:
propbank_agent_learner.save_grammar_image('usage-based-grammar.store')
[15]:
'/Users/paul/Projects/pyfcg/docs/source/walkthrough_tutorials/usage-based-grammar.store'
[16]:
new_agent = fcg.PropBankAgent()
new_agent.load_grammar_image('usage-based-grammar.store')
new_agent
[16]:
<Agent: agent (id: agent-25) ~ 6 constructions>