Demo: Creating and inspecting a narrative graph¶
This notebook will serve as a demo and small tour of some of the core functionalities of a NarrativeGraph object.
Data setup¶
For this demo notebook, we will use the News Category Dataset [1, 2], available through kagglehub, because its texts are short, timestamped and categorized.
import time
from kagglehub import KaggleDatasetAdapter
import kagglehub
data = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "rmisra/news-category-dataset",
    "News_Category_Dataset_v3.json",
    pandas_kwargs=dict(lines=True),
)
data.head()
| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |
We will use the following columns as input for our narrative graph:
- Documents: headline + short_description
- IDs: link, with the URL prefix shared by all entries removed
- Timestamps: date
- Categories: category
There are many categories. We will create a subset with just two of them: U.S. NEWS and POLITICS.
# create a sample
sample = data[data["category"].isin(["U.S. NEWS", "POLITICS"])].sample(
    5000, random_state=42
)
docs = sample["headline"] + "\n\n" + sample["short_description"]
# get rid of the URL prefix shared by all links; .str.replace works on substrings
ids = sample["link"].str.replace("https://www.huffpost.com/entry/", "", regex=False)
categories = sample["category"]
timestamps = sample["date"]
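When trimming the shared URL prefix, note that pandas' `Series.replace` matches whole cell values, whereas `Series.str.replace` substitutes substrings. The sketch below illustrates the difference on two made-up links; the slugs are illustrative and not taken from the dataset.

```python
import pandas as pd

PREFIX = "https://www.huffpost.com/entry/"

# Two made-up links for illustration; the slugs are not from the dataset.
links = pd.Series([PREFIX + "example-story-one", PREFIX + "example-story-two"])

# Series.replace compares against the whole cell value, so nothing changes here.
whole_value = links.replace(PREFIX, "")

# Series.str.replace works on substrings; regex=False treats the prefix literally.
trimmed = links.str.replace(PREFIX, "", regex=False)

print(whole_value.tolist())
print(trimmed.tolist())
```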
Creating the model¶
Once we have our list of documents, which is the only required input, and extra metadata in aligned lists, we can create a narrative graph.
from narrativegraphs import NarrativeGraph
model = NarrativeGraph()
model.fit(docs, doc_ids=ids, categories=categories, timestamps=timestamps)
INFO:narrativegraphs.pipeline:Adding 5000 documents to database
INFO:narrativegraphs.pipeline:Extracting triplets
INFO:narrativegraphs.pipeline:Resolving entities and predicates
INFO:narrativegraphs.pipeline:Mapping triplets and tuplets
INFO:narrativegraphs.pipeline:Calculating stats
<narrativegraphs.graphs.NarrativeGraph at 0x177aba7b0>
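Since the metadata is passed as lists aligned with the documents, it can be worth checking that all inputs have the same length before fitting. The helper below is a hypothetical sketch and not part of narrativegraphs; the toy inputs stand in for the real `docs`, `ids` and `categories`.

```python
from typing import Sequence


def check_aligned(docs: Sequence, **metadata: Sequence) -> None:
    """Raise ValueError if any metadata sequence differs in length from docs."""
    for name, values in metadata.items():
        if len(values) != len(docs):
            raise ValueError(
                f"{name} has {len(values)} items, expected {len(docs)}"
            )


# Toy inputs standing in for the real documents and metadata.
toy_docs = ["first text", "second text"]
check_aligned(toy_docs, doc_ids=["a", "b"], categories=["POLITICS", "U.S. NEWS"])
print("inputs are aligned")
```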
Inspecting the model visually¶
One of the key features of the narrativegraphs package is that it lets a user inspect the output interactively in a browser-based visualizer. It is hosted directly on your machine by the Python package – no extra dependencies required. This is achieved with the one line below.
Click the link in the log messages to open in your browser.
# Start the server for viewing in your own browser; this blocks execution of other cells
# model.serve_visualizer()
# Or run it in the background
server = model.serve_visualizer(block=False)
INFO:root:Server started in background on port 8001
server.stop()
INFO:root:Background server stopped
In blocking mode, stop the server by pressing the stop button on the cell in Jupyter Notebook, or by pressing CTRL+C elsewhere.
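When the server runs in the background, a context manager can make sure `stop()` is called even if a later step raises. The sketch below assumes only that the returned object has a `stop()` method, as in the cell above; `stopping` and `DummyServer` are hypothetical names introduced for the example.

```python
from contextlib import contextmanager


@contextmanager
def stopping(server):
    """Yield the server and call its stop() method on exit, even after an error."""
    try:
        yield server
    finally:
        server.stop()


class DummyServer:
    """Stand-in for the object returned by serve_visualizer(block=False)."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False


dummy = DummyServer()
with stopping(dummy) as server:
    # With a real server, this is where you would inspect the graph in the browser.
    was_running = server.running
print(was_running, dummy.running)
```

With a fitted model, the same pattern would read `with stopping(model.serve_visualizer(block=False)): ...`.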
Inspecting and accessing the model programmatically¶
The graph consists of entities as nodes and their relations or cooccurrences as edges. These, along with the data that back them, like documents and extracted semantic triplets, can be retrieved from the model through properties or service attributes.
Attributes¶
We can get the graph as a whole, as a NetworkX graph, through the properties .relation_graph_ and .cooccurrence_graph_.
relation_graph = model.relation_graph_
print(type(relation_graph))
<class 'networkx.classes.digraph.DiGraph'>
print(*list(relation_graph.nodes(data=True))[:3], sep="\n")
(1, {'id': 1, 'label': 'An App', 'frequency': 1, 'focus': False})
(2, {'id': 2, 'label': 'Deportation Agents', 'frequency': 1, 'focus': False})
(3, {'id': 3, 'label': 'that its estimate', 'frequency': 1, 'focus': False})
Similarly, entities, relations and everything else can be accessed as pandas DataFrames through properties.
model.entities_
| | id | label | frequency | doc_frequency | spread | adjusted_tf_idf | first_occurrence | last_occurrence | alt_labels | category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | An App | 1 | 1 | 0.0002 | 0.000000 | 2022-03-11 | 2022-03-11 | [] | [POLITICS] |
| 1 | 2 | Deportation Agents | 1 | 1 | 0.0002 | 0.000000 | 2022-03-11 | 2022-03-11 | [] | [POLITICS] |
| 2 | 3 | that its estimate | 1 | 1 | 0.0002 | 0.000000 | 2017-12-11 | 2017-12-11 | [] | [POLITICS] |
| 3 | 4 | the non-partisan Congressional Budget Office (CBO | 1 | 1 | 0.0002 | 0.000000 | 2017-12-11 | 2017-12-11 | [] | [POLITICS] |
| 4 | 5 | The city council | 2 | 2 | 0.0004 | 1666.666667 | 2016-03-29 | 2016-08-20 | [] | [POLITICS, POLITICS] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3416 | 3417 | His Mind | 1 | 1 | 0.0002 | 0.000000 | 2016-04-13 | 2016-04-13 | [] | [POLITICS] |
| 3417 | 3418 | The DNC Contenders | 1 | 1 | 0.0002 | 0.000000 | 2017-01-19 | 2017-01-19 | [] | [POLITICS] |
| 3418 | 3419 | Interested | 1 | 1 | 0.0002 | 0.000000 | 2017-01-19 | 2017-01-19 | [] | [POLITICS] |
| 3419 | 3420 | the stage | 1 | 1 | 0.0002 | 0.000000 | 2017-12-16 | 2017-12-16 | [] | [POLITICS] |
| 3420 | 3421 | House Science Committee | 1 | 1 | 0.0002 | 0.000000 | 2014-05-22 | 2014-05-22 | [] | [POLITICS] |
3421 rows × 10 columns
The properties (with trailing _) are nice in that they give back the data in well-known formats that one can continue working with, e.g. NetworkX graphs for graph algorithms and DataFrames for statistical analyses.
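As an illustration of continuing with these formats, the sketch below ranks nodes by degree on a small hand-built `DiGraph` whose node attributes mirror those printed above. The edges are invented for the example; with a fitted model, `model.relation_graph_` would take the place of the toy graph.

```python
import networkx as nx

# Toy graph mirroring the node attributes shown above; the edges are invented.
toy = nx.DiGraph()
toy.add_node(1, id=1, label="An App", frequency=1, focus=False)
toy.add_node(2, id=2, label="Deportation Agents", frequency=1, focus=False)
toy.add_node(5, id=5, label="The city council", frequency=2, focus=False)
toy.add_edge(1, 2)
toy.add_edge(5, 2)

# Rank nodes by total degree (in-degree plus out-degree), highest first.
ranked = sorted(toy.degree, key=lambda pair: pair[1], reverse=True)
for node_id, degree in ranked:
    print(toy.nodes[node_id]["label"], degree)
```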
Service attributes¶
However, the service attributes offer more control and are especially handy when the model is large and you do not want everything returned at once.
For instance, you can search for entities with the entities service.
democrats_matches = model.entities.search("democrats")
democrats_matches[:10]
[EntityLabel(id=114, label='Democrats'), EntityLabel(id=1476, label="Democrats' big reform bill")]
democrats_id = democrats_matches[0].id
And you can create a subgraph that expands from a set of focus nodes and only includes those that pass a filter.
from datetime import date
from narrativegraphs import GraphFilter
democrats_graph = model.graph.expand_from_focus_entities(
    [democrats_id],
    "relation",
    graph_filter=GraphFilter(
        categories={'category': ["POLITICS"]},
        earliest_date=date(2014, 1, 1)
    )
)
# strip labels to remove surrounding whitespace
print("NODES")
for node in democrats_graph.nodes:
    print(node.id, node.label.strip())
print("\nEDGES")
for edge in democrats_graph.edges:
    print(edge.subject_label.strip(), '--', edge.label, '->', edge.object_label.strip())
NODES
13 Trump
23 GOP
24 Bill
33 Betsy DeVos
64 Obama
77 State
86 Biden
113 A majority
114 Democrats
143 this week's "Candidate Confessional
183 Jefferson Jackson Dinner
260 Chuck Schumer
325 A Landslide
347 A Run
411 Liberals
428 health care
476 different findings
537 Record Donations
603 an even more ambitious vision
606 Planned Parenthood Shooting
616 A candidate
638 Republican Mike DeWine
678 Tehran
682 Anthony Weiner
701 judicial nominee Steven Menashi
739 This Billionaire Environmental Activist
908 Pennsylvania Republican
953 Republican Lt. Gov. Kim Guadagno
1067 the ballot box
1068 Special Election
1069 Resources
1219 more clarity
1320 world leaders
1322 Eyeing 2018 Senate Takeover
1366 key states
1369 Little Time
1440 Hugh Hewitt
1572 The Longest-Serving Woman
1748 Hemp
1778 ‘Hostage Czar
1858 Nationwide Day
1933 Black candidates
1941 the Call
2026 For Independent Commission
2073 Republican Cory Gardner
2174 Rep. Ruben Kihuen
2219 Renewed Push
2271 Oversight Committee
2387 California Assembly Member Roger Hernandez
2423 Their Nadir
2440 To End 'Corporate Culture
2454 Gubernatorial Primary
2490 A New Committee
2629 Probe
2665 About Her
2862 An Investigation
2867 over an
2988 Rep. Jamie Raskin
3007 Short But Outperform Expectations
3126 Cuomo
3395 their worst electoral position

EDGES
Trump -- puts -> the Call
Trump -- is -> health care
Trump -- is -> A candidate
GOP -- challenging, Stick -> Trump
GOP -- Considering, Wanted -> Obama
GOP -- Moves -> Democrats
Bill -- Survives -> GOP
Bill -- strip -> health care
Bill -- Points -> Trump
Obama -- ripped -> Trump
State -- Explains -> Trump
Biden -- slammed -> Trump
Democrats -- concede -> A majority
Democrats -- discuss -> this week's "Candidate Confessional
Democrats -- Rename -> Jefferson Jackson Dinner
Democrats -- Turns -> A Run
Democrats -- Boycotting, participated, needs -> Trump
Democrats -- advanced, calls -> Bill
Democrats -- Emphasizes -> health care
Democrats -- Flip, Wins -> State
Democrats -- Having -> different findings
Democrats -- Rake -> Record Donations
Democrats -- Lay -> an even more ambitious vision
Democrats -- React -> Planned Parenthood Shooting
Democrats -- Faces -> Republican Mike DeWine
Democrats -- Booted -> Anthony Weiner
Democrats -- Asks -> judicial nominee Steven Menashi
Democrats -- fled, Push -> Obama
Democrats -- Defeated -> Republican Lt. Gov. Kim Guadagno
Democrats -- Wins -> Special Election
Democrats -- dominated -> the ballot box
Democrats -- Pumped -> Resources
Democrats -- Demands -> more clarity
Democrats -- assured -> world leaders
Democrats -- Begins -> Eyeing 2018 Senate Takeover
Democrats -- lead -> key states
Democrats -- Having -> Little Time
Democrats -- Became -> The Longest-Serving Woman
Democrats -- agree -> Hemp
Democrats -- calls -> ‘Hostage Czar
Democrats -- lead -> Nationwide Day
Democrats -- nominate -> Black candidates
Democrats -- Renew -> the Call
Democrats -- Push -> For Independent Commission
Democrats -- lose -> Republican Cory Gardner
Democrats -- calls -> Rep. Ruben Kihuen
Democrats -- Make -> Renewed Push
Democrats -- Asks -> Oversight Committee
Democrats -- needs -> A Landslide
Democrats -- Are -> Their Nadir
Democrats -- Push -> To End 'Corporate Culture
Democrats -- Compete -> Gubernatorial Primary
Democrats -- Wanted -> A New Committee
Democrats -- gave -> Betsy DeVos
Democrats -- calls -> Probe
Democrats -- is -> About Her
Democrats -- Wanted -> An Investigation
Democrats -- presided -> over an
Democrats -- Tap -> Rep. Jamie Raskin
Democrats -- Coming -> Short But Outperform Expectations
Democrats -- calls -> Cuomo
Democrats -- Welcome, Supporting -> Biden
Democrats -- Are -> their worst electoral position
Chuck Schumer -- Warns, Be -> Democrats
Chuck Schumer -- Vows -> Trump
Liberals -- puts -> Democrats
A candidate -- is -> Democrats
Tehran -- seeing -> Democrats
This Billionaire Environmental Activist -- Picks -> Democrats
Pennsylvania Republican -- Wanted -> Democrats
Hugh Hewitt -- backed -> Democrats
California Assembly Member Roger Hernandez -- challenging -> Democrats
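Conceptually, expanding from focus nodes resembles a bounded breadth-first traversal that keeps only edges passing a filter. The following plain-Python sketch illustrates that idea on toy data; it is not the library's implementation, and `expand_from_focus`, its `depth` parameter and the edge tuples (including their dates) are assumptions made for the example.

```python
from collections import deque
from datetime import date


def expand_from_focus(edges, focus, keep, depth=1):
    """Collect nodes and edges within `depth` hops of `focus`, ignoring direction."""
    kept = [edge for edge in edges if keep(edge)]
    seen = set(focus)
    frontier = deque((node, 0) for node in focus)
    selected = []
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested number of hops
        for edge in kept:
            subj, _, obj, _ = edge
            if node not in (subj, obj):
                continue
            if edge not in selected:
                selected.append(edge)
            other = obj if node == subj else subj
            if other not in seen:
                seen.add(other)
                frontier.append((other, dist + 1))
    return seen, selected


# (subject, predicate, object, date); labels echo the output above, dates are invented.
toy_edges = [
    ("Democrats", "Wins", "Special Election", date(2017, 6, 1)),
    ("Chuck Schumer", "Warns", "Democrats", date(2018, 1, 5)),
    ("Obama", "ripped", "Trump", date(2016, 9, 1)),
    ("Democrats", "calls", "Bill", date(2013, 5, 1)),
]
nodes, selected = expand_from_focus(
    toy_edges, ["Democrats"], keep=lambda edge: edge[3] >= date(2014, 1, 1)
)
print(sorted(nodes))
```

In this toy run, the edge to "Bill" is dropped by the date filter, and the "Obama" edge is left out because it does not touch the focus node.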
Saving and loading the model¶
We can save the model for later use, which is especially useful if we have a lot of documents that take a while to process.
model.save_to_file("demo.db", overwrite=True)
And we can load it from that saved file.
model = NarrativeGraph.load("demo.db")
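To avoid re-processing documents on every run, saving and loading can be combined into a simple cache check. The helper below is a hypothetical sketch that relies only on the standard library; `load_or_build` and the stand-in callables are not part of narrativegraphs.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(path: str, load: Callable[[str], T], build: Callable[[], T]) -> T:
    """Return load(path) if the file exists, otherwise build() a fresh object."""
    if Path(path).exists():
        return load(path)
    return build()


# Demonstration with stand-in callables instead of a real model.
result = load_or_build(
    "narrative-demo-missing-file.db",
    load=lambda path: "loaded",
    build=lambda: "built",
)
print(result)
```

With a real model, `load` could be `NarrativeGraph.load` and `build` a function that fits a new model and saves it with `save_to_file`.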
References¶
[1] Misra, Rishabh. "News Category Dataset." arXiv preprint arXiv:2209.11429 (2022).
[2] Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).