NLP Layer
This guide provides an overview of the NLP layer in narrativegraphs/nlp/.
Pipelines
The NLP layer provides two main pipelines that orchestrate the full extraction workflow:
| Pipeline | Purpose | Extracts |
|---|---|---|
| Pipeline | Full narrative graph extraction | Triplets + cooccurrences |
| CooccurrencePipeline | Simpler cooccurrence-only extraction | Entities + cooccurrences |
Both pipelines handle:
- Adding documents to the database
- Extracting and storing annotations
- Mapping surface forms to canonical entities/predicates
- Calculating statistics
Pipeline (Full)
Uses a TripletExtractor to extract subject-predicate-object triplets, then derives entities from those triplets. Also extracts cooccurrences between entities.
Default components:
- Triplet extraction:
DependencyGraphExtractor - Cooccurrence extraction:
ChunkCooccurrenceExtractor - Entity mapping:
SubgramStemmingMapper("noun") - Predicate mapping:
SubgramStemmingMapper("verb")
CooccurrencePipeline
Uses an EntityExtractor directly to find entities, then extracts cooccurrences between them. Skips triplet extraction entirely.
Default components:
- Entity extraction:
SpacyEntityExtractor - Cooccurrence extraction:
ChunkCooccurrenceExtractor - Entity mapping:
SubgramStemmingMapper("noun")
Extraction Components
Entity Extraction (entities/)
| Class | Description |
|---|---|
| EntityExtractor | Abstract base class |
| SpacyEntityExtractor | Uses spaCy NER and/or noun chunks with configurable length filters |
SpacyEntityExtractor features:
- Configurable NER and noun chunk extraction
- Token length filtering (min/max tokens)
- Greedy non-overlapping span selection (NER takes priority)
- Pronoun filtering
- Parallel batch processing
Triplet Extraction (triplets/)
| Class | Description |
|---|---|
| TripletExtractor | Abstract base class |
| DependencyGraphExtractor | Extracts triplets from spaCy dependency parses |
Triplets consist of:
subj: Subject entity (SpanAnnotation)pred: Predicate/verb (SpanAnnotation)obj: Object entity (SpanAnnotation)context: Optional sentence context (AnnotationContext)
Cooccurrence Extraction (tuplets/)
| Class | Description |
|---|---|
| CooccurrenceExtractor | Abstract base class |
| ChunkCooccurrenceExtractor | Sentence-based windowed cooccurrences |
| DocumentCooccurrenceExtractor | All entity pairs within a document |
ChunkCooccurrenceExtractor features:
- Configurable sentence window size
- Custom boundary patterns (regex or callable)
- Sentence-level context capture
Tuplets consist of:
entity_one,entity_two: Entity pair (SpanAnnotation)context: Optional context window (AnnotationContext)
Mapping Components (mapping/)
Mappers normalize surface forms to canonical labels, creating a dict[str, str] mapping.
| Class | Description |
|---|---|
| Mapper | Abstract base class |
| StemmingMapper | Groups by Porter stemmed form |
| SubgramStemmingMapper | Stemming + subgram matching for head words |
SubgramStemmingMapper (default):
- First applies stemming normalization
- Then matches shorter forms to longer ones containing them
- Configurable for nouns or verbs via
head_word_type - Ranking by shortest label or most frequent
Supporting Components
Common Utilities (common/)
| Module | Purpose |
|---|---|
| annotation.py | Data models: SpanAnnotation, AnnotationContext |
| spacy.py | spaCy utilities: model loading, batch size calculation, span filtering |
| transformcategories.py | Normalizes various category input formats |
SpanAnnotation represents a text span with:
text: Surface formstart_char,end_char: Character offsetsnormalized_text: Optional lemma
Filtering (filtering/)
| Class | Description |
|---|---|
| BigramFilter | PMI-based bigram filtering for quality control |
BigramFilter can be used to filter out low-quality multi-word spans based on bigram co-occurrence statistics.
Architecture Diagram
Pipeline / CooccurrencePipeline
│
├── Document ingestion
│
├── Extraction
│ ├── TripletExtractor ──► Triplet (subj, pred, obj)
│ │ └── DependencyGraphExtractor (spaCy)
│ │
│ ├── EntityExtractor ──► SpanAnnotation
│ │ └── SpacyEntityExtractor (NER + noun chunks)
│ │
│ └── CooccurrenceExtractor ──► Tuplet (entity_one, entity_two)
│ ├── ChunkCooccurrenceExtractor (sentence window)
│ └── DocumentCooccurrenceExtractor (all pairs)
│
├── Mapping
│ └── Mapper ──► dict[str, str]
│ ├── StemmingMapper
│ └── SubgramStemmingMapper (default)
│
└── Stats calculation (via service layer)
Extensibility
All extraction and mapping components use abstract base classes, making it easy to implement custom:
- Entity extractors (e.g., domain-specific NER)
- Triplet extractors (e.g., rule-based, LLM-based)
- Cooccurrence extractors (e.g., paragraph-level, custom boundaries)
- Mappers (e.g., embedding-based clustering, knowledge base linking)