{ "cells": [ { "cell_type": "markdown", "id": "2953c54e5b188d5e", "metadata": {}, "source": [ "# Walkthrough Tutorial: Learning Grammars from Corpora\n", "\n", "
\n", "*This walkthrough tutorial accompanies Section 4.2 of the following paper:*\n", "\n", "Authors. (submitted). PyFCG: Fluid Construction Grammar in Python.\n", "\n", "
\n", "\n", "A second use case of FCG concerns the learning of construction grammars from corpora of language use. We take the example of ```fcg-propbank```, an existing FCG subsystem for learning construction grammars from PropBank-annotated corpora. We demonstrate how a pretrained grammar comprising tens of thousands of constructions can be loaded into an agent and used to extract semantic frames from open-domain text, how a new grammar can be learnt from annotated data, and how large grammars can be saved in an efficiently loadable binary format." ] }, { "cell_type": "code", "execution_count": null, "id": "52929f6263d7f3c9", "metadata": {}, "outputs": [], "source": [ "# Run this cell if you have not yet installed these packages\n", "! pip install pyfcg\n", "! pip install nltk" ] }, { "cell_type": "code", "execution_count": 1, "id": "7656b26a", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:21.295715Z", "start_time": "2025-05-19T14:08:19.220456Z" } }, "outputs": [], "source": [ "import pyfcg as fcg\n", "fcg.init()" ] }, { "cell_type": "markdown", "id": "74e7b821", "metadata": {}, "source": [ "## Loading a pretrained grammar" ] }, { "cell_type": "markdown", "id": "94a8646c7263b5a6", "metadata": {}, "source": [ "As always, we start by creating an agent. In this case, the agent is an instance of the ```fcg.PropBankAgent``` class, a subclass of the ```fcg.Agent``` class provided by the ```fcg-propbank``` subsystem. We download a pretrained, precompiled grammar for English and load it into our agent using its ```load_grammar_image``` method. 
The agent now has at its disposal a grammar consisting of 21,052 constructions.\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "4805bc9e", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:25.796416Z", "start_time": "2025-05-19T14:08:21.383118Z" } }, "outputs": [], "source": [ "propbank_agent_pretrained = fcg.PropBankAgent()\n", "pretrained_grammar_image = fcg.load_resource('pb-en.store')" ] }, { "cell_type": "code", "execution_count": 3, "id": "a7930b3fbf30f650", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:40.560434Z", "start_time": "2025-05-19T14:08:25.801688Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "propbank_agent_pretrained.load_grammar_image(pretrained_grammar_image)\n", "propbank_agent_pretrained" ] }, { "cell_type": "markdown", "id": "57b3b5ef", "metadata": {}, "source": [ "Our agent can now use its pretrained grammar to comprehend new utterances. Below, we instruct our agent to comprehend the passive utterance \"Margaret Thatcher was elected Prime Minister of Britain.\". The resulting meaning representation reveals that the agent identified a single semantic frame that instantiates the ```elect.01``` PropBank roleset (*elect someone to an office or position*). The agent also understood that the roles of *candidate* (```arg1```) and *office or position* (```arg2```) in this instance of ```elect.01``` are respectively taken up by \"Margaret Thatcher\" and \"Prime Minister of Britain\"." 
] }, { "cell_type": "code", "execution_count": 4, "id": "487fbead", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:41.372268Z", "start_time": "2025-05-19T14:08:40.567891Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'elect.01',\n", " 'roles': [('v', 'elected', [3]),\n", " ('arg1', 'Margaret Thatcher', [0, 1]),\n", " ('arg2', 'Prime Minister of Britain', [4, 5, 6, 7])]}]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "propbank_agent_pretrained.comprehend(\"Margaret Thatcher was elected Prime Minister of Britain.\")" ] }, { "cell_type": "markdown", "id": "58a618ee042eddb5", "metadata": {}, "source": [ "To enhance human readability, we can choose to activate an FCG monitor to trace the comprehension process in the web interface:\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "1abc9f55", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:42.219222Z", "start_time": "2025-05-19T14:08:41.377623Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'enjoy.01',\n", " 'roles': [('v', 'enjoyed', [2]),\n", " ('arg0', 'She', [0]),\n", " ('arg1', 'visiting the old historic churches', [3, 4, 5, 6, 7])]},\n", " {'roleset': 'visit.01',\n", " 'roles': [('v', 'visiting', [3]),\n", " ('arg0', 'She', [0]),\n", " ('arg1', 'the old historic churches', [4, 5, 6, 7])]}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fcg.start_web_interface()\n", "fcg.activate_monitor('trace-fcg')\n", "propbank_agent_pretrained.comprehend(\"She especially enjoyed visiting the old historic churches.\")" ] }, { "cell_type": "markdown", "id": "93bdd539334bc0f7", "metadata": {}, "source": [ "In order to better understand the PropBank rolesets that are retrieved by our agent, we define a new function ```describe_roleset```. 
The function makes use of ```nltk```'s ```propbank``` module to look up all roles of a given roleset, together with their descriptions:" ] }, { "cell_type": "code", "execution_count": 6, "id": "931008a7", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package propbank to /Users/paul/nltk_data...\n", "[nltk_data] Package propbank is already up-to-date!\n" ] } ], "source": [ "import nltk\n", "nltk.download('propbank')\n", "from nltk.corpus import propbank" ] }, { "cell_type": "code", "execution_count": 7, "id": "9c49c57e", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:42.564776Z", "start_time": "2025-05-19T14:08:42.224907Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "elect.01\n", " arg0: voters\n", " arg1: candidate\n", " arg2: office or position\n" ] } ], "source": [ "def describe_roleset(roleset):\n", "    # Look up the roleset in NLTK's PropBank frame files and print its id\n", "    nltk_roleset = propbank.roleset(roleset)\n", "    print(nltk_roleset.attrib['id'])\n", "    # Print each role of the roleset with its description\n", "    for role in nltk_roleset.findall(\"roles/role\"):\n", "        print(' arg' + role.attrib['n'] + ':', role.attrib['descr'])\n", "\n", "describe_roleset('elect.01')" ] }, { "cell_type": "code", "execution_count": 8, "id": "464d64ee", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:43.018865Z", "start_time": "2025-05-19T14:08:42.571745Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'arrive.01',\n", " 'roles': [('v', 'arriving', [6]),\n", " ('arg1', 'the taxi', [3, 4]),\n", " ('arg3', 'at Gate 1', [7, 8, 9])]}]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "arrive.01\n", " arg1: entity in motion / 'comer'\n", " arg2: extent -- rare)\n", " arg3: start point -- also rare)\n", " arg4: end point, destination\n" ] } ], "source": [ "display(propbank_agent_pretrained.comprehend(\"Attention passengers, the taxi is arriving at Gate 1.\"))\n", "describe_roleset('arrive.01')" ] }, { "cell_type": "code", 
"execution_count": 9, "id": "76ee723f", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:43.854690Z", "start_time": "2025-05-19T14:08:43.024639Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'explain.01',\n", " 'roles': [('v', 'Explain', [0]),\n", " ('arg1', 'this', [1]),\n", " ('arg2', 'to me', [2, 3])]}]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "explain.01\n", " arg0: explainer\n", " arg1: thing explained\n", " arg2: explained to\n" ] } ], "source": [ "display(propbank_agent_pretrained.comprehend(\"Explain this to me again.\"))\n", "describe_roleset('explain.01')" ] }, { "cell_type": "code", "execution_count": 10, "id": "137b6b86", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:44.463673Z", "start_time": "2025-05-19T14:08:43.874346Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'visit.01',\n", " 'roles': [('v', 'visiting', [2]),\n", " ('arg0', 'They', [0]),\n", " ('arg1', 'New York', [3, 4])]},\n", " {'roleset': 'enjoy.01',\n", " 'roles': [('v', 'enjoy', [1]),\n", " ('arg0', 'They', [0]),\n", " ('arg1', 'visiting New York', [2, 3, 4])]}]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "visit.01\n", " arg0: one party\n", " arg1: other party\n", "enjoy.01\n", " arg0: enjoyer\n", " arg1: thing enjoyed\n" ] } ], "source": [ "display(propbank_agent_pretrained.comprehend(\"They enjoy visiting New York.\"))\n", "describe_roleset('visit.01')\n", "describe_roleset('enjoy.01')" ] }, { "cell_type": "markdown", "id": "60e900f2", "metadata": {}, "source": [ "## Training a new grammar" ] }, { "cell_type": "markdown", "id": "9a3f0d763ee7aa95", "metadata": {}, "source": [ "Let us now create a second agent, again as an instance of the ```fcg.PropBankAgent``` class, but let it learn a new grammar from corpus data instead of loading a pretrained one. 
After downloading an example CoNLL file, in which a number of English sentences are annotated with PropBank rolesets, we can inspect the first sentence of the file:\n" ] }, { "cell_type": "code", "execution_count": 11, "id": "bfe788c7", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:44.501648Z", "start_time": "2025-05-19T14:08:44.469675Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/ 0 0 I / / - - (ARG0*) \n", "\n", "/ 0 1 gave / / give give.01 (V*) \n", "\n", "/ 0 2 flowers / / - - (ARG1*) \n", "\n", "/ 0 3 to / / - - (ARG2* \n", "\n", "/ 0 4 my / / - - *\n", "\n", "/ 0 5 mother / / - - *) \n", "\n" ] } ], "source": [ "propbank_agent_learner = fcg.PropBankAgent()\n", "conll_annotations = fcg.load_resource('pb-annotations.conll')\n", "with open(conll_annotations, 'r') as f:\n", "    # Print lines until the first one without a '/', which marks the end of the first sentence\n", "    sentence_end = False\n", "    while not sentence_end:\n", "        line = f.readline()\n", "        if '/' not in line:\n", "            sentence_end = True\n", "        else:\n", "            print(line)" ] }, { "cell_type": "markdown", "id": "386f4236395d65fc", "metadata": {}, "source": [ "We now call the agent's ```learn_grammar_from_conll_file``` method. This call initiates the learning process implemented by the ```fcg-propbank``` subsystem and equips the agent with the resulting grammar. In this case, the agent has learnt two lexical constructions (for verbs with the lemmas *give* and *send*), two word sense constructions (for the rolesets ```give.01``` and ```send.01```), and two argument structure constructions (a double object construction and a prepositional dative construction)." 
] }, { "cell_type": "code", "execution_count": 12, "id": "761624a87d2127ff", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:44.950132Z", "start_time": "2025-05-19T14:08:44.507189Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "propbank_agent_learner.learn_grammar_from_conll_file(conll_annotations)\n", "propbank_agent_learner" ] }, { "cell_type": "markdown", "id": "2a1e9255cc13b31f", "metadata": {}, "source": [ "We can inspect the agent's grammar in the web interface." ] }, { "cell_type": "code", "execution_count": 13, "id": "ff40272e45412f67", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:45.057462Z", "start_time": "2025-05-19T14:08:44.958727Z" } }, "outputs": [], "source": [ "propbank_agent_learner.grammar.show_in_web_interface()\n" ] }, { "cell_type": "markdown", "id": "1115d421", "metadata": {}, "source": [ "We can now instruct our agent to comprehend a previously unseen utterance, using the grammar it just learnt, by calling its ```comprehend``` method. 
While comprehending \"The King of the Belgians sent a box of chocolates to Forrest Gump.\", the agent identifies an instance of the ```send.01``` (*give*) roleset, with \"The King of the Belgians\" as the *sender* (```arg0```), \"a box of chocolates\" as the *thing sent* (```arg1```) and \"to Forrest Gump\" as the *sent-to* entity (```arg2```).\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "90f47627ad404a42", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:45.441078Z", "start_time": "2025-05-19T14:08:45.062052Z" } }, "outputs": [ { "data": { "text/plain": [ "[{'roleset': 'send.01',\n", " 'roles': [('v', 'sent', [5]),\n", " ('arg0', 'The King of the Belgians', [0, 1, 2, 3, 4]),\n", " ('arg1', 'a box of chocolates', [6, 7, 8, 9]),\n", " ('arg2', 'to Forrest Gump', [10, 11, 12])]}]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "send.01\n", " arg0: sender\n", " arg1: sent\n", " arg2: sent-to\n" ] } ], "source": [ "display(propbank_agent_learner.comprehend('The King of the Belgians sent a box of chocolates to Forrest Gump.'))\n", "describe_roleset('send.01')" ] }, { "cell_type": "markdown", "id": "6e7571f297af62ea", "metadata": {}, "source": [ "Once a grammar has been learnt, it can be saved by calling the ```save_grammar_image``` method of the ```fcg.Agent``` class. 
This method saves the grammar to a file in a compiled, binary format that can later be loaded efficiently using an agent's ```load_grammar_image``` method.\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "afcd9901", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:45.456049Z", "start_time": "2025-05-19T14:08:45.447099Z" } }, "outputs": [ { "data": { "text/plain": [ "'/Users/paul/Projects/pyfcg/docs/source/walkthrough_tutorials/usage-based-grammar.store'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "propbank_agent_learner.save_grammar_image('usage-based-grammar.store')" ] }, { "cell_type": "code", "execution_count": 16, "id": "e507ca6b05515014", "metadata": { "ExecuteTime": { "end_time": "2025-05-19T14:08:45.489435Z", "start_time": "2025-05-19T14:08:45.472690Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_agent = fcg.PropBankAgent()\n", "new_agent.load_grammar_image('usage-based-grammar.store')\n", "new_agent" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.10" } }, "nbformat": 4, "nbformat_minor": 5 }