#3775 CSV Reader + NER Per Row
#3775 Create a Jupyter Notebook python application to read a CSV with twelve columns (col_1, col_2, col_3, ...) and run Named Entity Recognition against col_5 to report back entities and relations for each row of the CSV
Of course. Here is a Jupyter Notebook Python application that reads a CSV file, performs Named Entity Recognition (NER) on a specific column, and extracts potential relationships between those entities. This solution uses the pandas library for data manipulation and spaCy for the natural language processing tasks.
Jupyter Notebook: Entity and Relation Extraction
This notebook demonstrates how to:
1. Create a sample CSV file for demonstration purposes.
2. Load the CSV file using pandas.
3. Process text from a specific column (col_5) using spaCy.
4. Extract named entities (like persons, organizations, and locations).
5. Infer simple relations between entities by analyzing sentence structure (Subject-Verb-Object triplets).
6. Display the results for each row.
Step 1: Install and Set Up Libraries
First, you'll need to install pandas and spaCy, and then download a pre-trained spaCy model. The en_core_web_sm model is small and efficient, perfect for getting started.
```python
# Install required libraries
!pip install pandas spacy

# Download the small English language model for spaCy
!python -m spacy download en_core_web_sm
```
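Before moving on, it can help to confirm that the packages are actually importable in the current kernel (a notebook's kernel doesn't always match the environment pip installed into). This is a minimal sketch using only the standard library; the helper name `missing_packages` is my own, not part of pandas or spaCy.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Report anything that still needs installing
missing = missing_packages(["pandas", "spacy"])
if missing:
    print(f"Missing packages: {missing} - rerun the install cell above.")
else:
    print("All required packages are available.")
```

If a package shows up as missing even after installing, restarting the kernel usually resolves it.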
Step 2: Create a Sample CSV File
For this example to be self-contained and runnable, let's create a sample CSV file named data.csv. This file will have twelve columns as you specified, with interesting text for analysis in col_5.
```python
import csv

# Define the data for our sample CSV
csv_data = [
    ['header_1', 'header_2', 'header_3', 'header_4', 'header_5', 'header_6', 'header_7', 'header_8', 'header_9', 'header_10', 'header_11', 'header_12'],
    ['A1', 'B1', 'C1', 'D1', 'NASA awarded a significant contract to SpaceX for the new Artemis moon mission.', 'F1', 'G1', 'H1', 'I1', 'J1', 'K1', 'L1'],
    ['A2', 'B2', 'C2', 'D2', 'The new software from Microsoft will be integrated by engineers at Google.', 'F2', 'G2', 'H2', 'I2', 'J2', 'K2', 'L2'],
    ['A3', 'B3', 'C3', 'D3', 'Dr. Evelyn Reed published her findings on quantum computing from Cambridge University.', 'F3', 'G3', 'H3', 'I3', 'J3', 'K3', 'L3'],
    ['A4', 'B4', 'C4', 'D4', 'The report contained no actionable intelligence.', 'F4', 'G4', 'H4', 'I4', 'J4', 'K4', 'L4']
]

# Create column names (col_1 ... col_12) and use them as the header row
column_names = [f'col_{i}' for i in range(1, 13)]
csv_data[0] = column_names

# Write the data to a CSV file
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(csv_data)

print("'data.csv' created successfully.")
```
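As a quick sanity check before any NLP work, you can read a file back with the standard library and confirm every row really has twelve fields. The sketch below writes its own throwaway copy (sanity_check.csv, a filename I've made up) so it doesn't depend on earlier cells:

```python
import csv

# Write a tiny two-row file with the same 12-column layout
rows = [
    [f'col_{i}' for i in range(1, 13)],
    ['A1', 'B1', 'C1', 'D1', 'NASA awarded a contract to SpaceX.',
     'F1', 'G1', 'H1', 'I1', 'J1', 'K1', 'L1'],
]
with open('sanity_check.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Read it back and verify every row has exactly twelve fields
with open('sanity_check.csv', newline='') as f:
    read_back = list(csv.reader(f))

assert all(len(row) == 12 for row in read_back)
print(f"{len(read_back) - 1} data row(s), {len(read_back[0])} columns - OK")
```

The same `len(row) == 12` check applied to data.csv would catch stray commas inside unquoted text fields early, before they surface as confusing pandas errors.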
Step 3: Load Libraries and the NLP Model
Now we'll import the necessary libraries and load the spaCy model into an object we'll call nlp.
```python
import pandas as pd
import spacy

# Load the spaCy model
# For higher accuracy, you can download and use "en_core_web_md" or "en_core_web_lg"
nlp = spacy.load("en_core_web_sm")

print("Libraries and spaCy model loaded.")
```
Step 4: Define the Extraction Function
This function contains the core logic. It takes a piece of text, processes it with spaCy, and then extracts both the named entities and the simple, verb-based relations.
- Named Entities: These are identified directly by the spaCy model (e.g., ORG, PERSON).
- Relations: We infer these by looking for a Subject-Verb-Object pattern. We find verbs and then check for their subjects (nsubj) and direct objects (dobj) in the sentence's dependency parse. This is a common and effective technique for basic relation extraction.
```python
def extract_entities_and_relations(text):
    """
    Processes text to extract named entities and simple (SVO) relations.

    Args:
        text (str): The input string to analyze.

    Returns:
        tuple: A tuple containing a list of entities and a list of relations.
    """
    doc = nlp(text)

    # 1. Extract named entities
    entities = []
    for ent in doc.ents:
        entities.append({
            'text': ent.text,
            'label': ent.label_
        })

    # 2. Extract Subject-Verb-Object (SVO) relations
    relations = []
    for token in doc:
        # Check if the token is a verb
        if token.pos_ == 'VERB':
            subject = None
            obj = None
            # Find subject and object among the verb's children
            for child in token.children:
                if 'nsubj' in child.dep_:
                    subject = child.text
                if 'dobj' in child.dep_:
                    obj = child.text
            if subject and obj:
                # Use the verb's lemma (base form) for consistency
                relations.append({
                    'subject': subject,
                    'verb': token.lemma_,
                    'object': obj
                })
    return entities, relations
```
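The SVO logic needs a loaded spaCy model to run, but the pattern itself is easy to trace on toy data. The sketch below mimics spaCy's token attributes with a small namedtuple (ToyToken is my own stand-in, not a spaCy class) so you can see how a verb's subject and object are picked out of its dependency children:

```python
from collections import namedtuple

# A minimal stand-in for a parsed spaCy token: part of speech,
# dependency label, lemma, and the token's syntactic children.
ToyToken = namedtuple('ToyToken', ['text', 'pos_', 'dep_', 'lemma_', 'children'])

# Hand-built parse of "NASA awarded a contract": 'NASA' is the
# nsubj of 'awarded', and 'contract' is its dobj.
nasa = ToyToken('NASA', 'PROPN', 'nsubj', 'NASA', [])
contract = ToyToken('contract', 'NOUN', 'dobj', 'contract', [])
awarded = ToyToken('awarded', 'VERB', 'ROOT', 'award', [nasa, contract])

# Same walk as in extract_entities_and_relations
relations = []
for token in [nasa, awarded, contract]:
    if token.pos_ == 'VERB':
        subject = obj = None
        for child in token.children:
            if 'nsubj' in child.dep_:
                subject = child.text
            if 'dobj' in child.dep_:
                obj = child.text
        if subject and obj:
            relations.append((subject, token.lemma_, obj))

print(relations)  # [('NASA', 'award', 'contract')]
```

The substring checks (`'nsubj' in child.dep_`) are deliberately loose: they also match labels like nsubjpass, so passive subjects are picked up as well.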
Step 5: Read the CSV and Process Each Row
Finally, we'll read the data.csv file into a pandas DataFrame and iterate through each row. For every row, we'll apply our function to the text in col_5 and print the results in a clean, readable format.
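One practical wrinkle first: in real CSVs, col_5 may contain empty cells, which pandas loads as NaN (a float), and passing a float to nlp() raises a TypeError. A small hedged guard (the helper name safe_text is my own) keeps the loop robust:

```python
def safe_text(value):
    """Return value if it is a string, else an empty string.

    pandas represents missing CSV cells as float('nan'), which would
    crash nlp(); mapping those to '' lets the row be skipped cleanly.
    """
    return value if isinstance(value, str) else ''

# Examples
print(safe_text('NASA awarded a contract.'))  # unchanged
print(repr(safe_text(float('nan'))))          # ''
print(repr(safe_text(None)))                  # ''
```

With this in place, `text_to_analyze = safe_text(row['col_5'])` followed by `if not text_to_analyze: continue` at the top of the loop would skip blank rows instead of crashing on them.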
```python
# Load the CSV into a pandas DataFrame
df = pd.read_csv('data.csv')

# Process each row of the DataFrame
for index, row in df.iterrows():
    text_to_analyze = row['col_5']
    entities, relations = extract_entities_and_relations(text_to_analyze)

    print(f"--- Analysis for Row {index} ---")
    print(f"Text: {text_to_analyze}")

    # Report entities
    if entities:
        print("\nEntities Found:")
        for entity in entities:
            print(f"  - Text: '{entity['text']}', Type: {entity['label']} ({spacy.explain(entity['label'])})")
    else:
        print("\nEntities Found: None")

    # Report relations
    if relations:
        print("\nRelations Found:")
        for relation in relations:
            print(f"  - '{relation['subject']}' -> '{relation['verb']}' -> '{relation['object']}'")
    else:
        print("\nRelations Found: None")

    print("\n" + "=" * 40 + "\n")
```
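Beyond the per-row printout, you may want a corpus-level summary. Assuming you append each row's entities into a single list of dicts with the same shape the extraction function returns, a standard-library Counter gives a quick tally. The all_entities list below is illustrative sample data, not output produced by the code above:

```python
from collections import Counter

# Illustrative entities collected across rows (same dict shape as
# extract_entities_and_relations returns)
all_entities = [
    {'text': 'NASA', 'label': 'ORG'},
    {'text': 'SpaceX', 'label': 'ORG'},
    {'text': 'Evelyn Reed', 'label': 'PERSON'},
    {'text': 'Cambridge University', 'label': 'ORG'},
]

# Count how often each entity type appears
label_counts = Counter(e['label'] for e in all_entities)
print(label_counts.most_common())  # [('ORG', 3), ('PERSON', 1)]
```

The same pattern works for relations, e.g. counting the verb field to see which actions dominate the corpus.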
Visualizing the Dependency Parse (Optional)
To better understand how spaCy finds the relations, you can visualize the dependency parse for a single sentence. This shows the grammatical relationships between words.
```python
from spacy import displacy

# Let's visualize the first sentence from our data
sample_text = df.loc[0, 'col_5']
doc = nlp(sample_text)

# The 'dep' style shows the dependency parse
displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})
```
https://gemini.google.com/app/e38bf561b65fc4f6
From <https://gemini.google.com/app/> Google Gemini 2.5 Pro