#3775 CSV Reader + NER Per Row

#3775 Create a Jupyter Notebook python application to read a CSV with twelve columns (col_1, col_2, col_3, ...) and run Named Entity Recognition against col_5 to report back entities and relations for each row of the CSV

Of course. Here is a Jupyter Notebook Python application that reads a CSV file, performs Named Entity Recognition (NER) on a specific column, and extracts potential relationships between those entities.

This solution uses the pandas library for data manipulation and spaCy for the natural language processing tasks.


Jupyter Notebook: Entity and Relation Extraction

This notebook demonstrates how to:

  1. Create a sample CSV file for demonstration purposes.

  2. Load the CSV file using pandas.

  3. Process text from a specific column (col_5) using spaCy.

  4. Extract Named Entities (like persons, organizations, and locations).

  5. Infer simple relations between entities by analyzing sentence structure (Subject-Verb-Object triplets).

  6. Display the results for each row.


Step 1: Install and Setup Libraries

First, you'll need to install pandas and spaCy, and then download a pre-trained spaCy model. The en_core_web_sm model is small and efficient, perfect for getting started.

# Install required libraries
!pip install pandas spacy

# Download the small English language model for spaCy
!python -m spacy download en_core_web_sm

Step 2: Create a Sample CSV File

For this example to be self-contained and runnable, let's create a sample CSV file named data.csv. This file will have twelve columns as you specified, with interesting text for analysis in col_5.

import csv

# Generate the twelve column names: col_1 through col_12
column_names = [f'col_{i}' for i in range(1, 13)]

# Define the data for our sample CSV (header row first)
csv_data = [
    column_names,
    ['A1', 'B1', 'C1', 'D1', 'NASA awarded a significant contract to SpaceX for the new Artemis moon mission.', 'F1', 'G1', 'H1', 'I1', 'J1', 'K1', 'L1'],
    ['A2', 'B2', 'C2', 'D2', 'The new software from Microsoft will be integrated by engineers at Google.', 'F2', 'G2', 'H2', 'I2', 'J2', 'K2', 'L2'],
    ['A3', 'B3', 'C3', 'D3', 'Dr. Evelyn Reed published her findings on quantum computing from Cambridge University.', 'F3', 'G3', 'H3', 'I3', 'J3', 'K3', 'L3'],
    ['A4', 'B4', 'C4', 'D4', 'The report contained no actionable intelligence.', 'F4', 'G4', 'H4', 'I4', 'J4', 'K4', 'L4']
]

# Write the data to a CSV file
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(csv_data)

print("'data.csv' created successfully.")
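The column-name pattern is easy to verify on its own. This standalone check (independent of the file just written) confirms the generated header matches the required col_1 … col_12 schema:

```python
# Sanity check: the generated column names match the required schema
column_names = [f'col_{i}' for i in range(1, 13)]

assert len(column_names) == 12, "expected exactly twelve columns"
assert column_names[0] == 'col_1' and column_names[-1] == 'col_12'
print(column_names)
```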

Step 3: Load Libraries and the NLP Model

Now we'll import the necessary libraries and load the spaCy model into an object we'll call nlp.

import pandas as pd
import spacy

# Load the spaCy model
# For higher accuracy, you can download and use "en_core_web_md" or "en_core_web_lg"
nlp = spacy.load("en_core_web_sm")

print("Libraries and spaCy model loaded.")

Step 4: Define the Extraction Function

This function will contain the core logic. It takes a piece of text, processes it with spaCy, and then extracts both the named entities and the simple, verb-based relations.

  • Named Entities: These are identified directly by the spaCy model (e.g., ORG, PERSON).

  • Relations: We infer these by looking for a Subject-Verb-Object pattern. We find verbs and then check for their subjects (nsubj) and direct objects (dobj) in the sentence's dependency parse. This is a common and effective technique for basic relation extraction.

def extract_entities_and_relations(text):
    """
    Processes text to extract named entities and simple (SVO) relations.
    
    Args:
        text (str): The input string to analyze.
        
    Returns:
        tuple: A tuple containing a list of entities and a list of relations.
    """
    doc = nlp(text)
    
    # 1. Extract Named Entities
    entities = []
    for ent in doc.ents:
        entities.append({
            'text': ent.text,
            'label': ent.label_
        })
        
    # 2. Extract Subject-Verb-Object (SVO) relations
    relations = []
    for token in doc:
        # Check if the token is a verb
        if token.pos_ == 'VERB':
            subject = None
            obj = None
            # Find subject and object among the verb's children.
            # The substring checks also match passive variants such as 'nsubjpass'.
            for child in token.children:
                if 'nsubj' in child.dep_:
                    subject = child.text
                elif 'dobj' in child.dep_:
                    obj = child.text
            
            if subject and obj:
                # Use the verb's lemma (base form) for consistency
                relations.append({
                    'subject': subject,
                    'verb': token.lemma_,
                    'object': obj
                })
                
    return entities, relations
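For reference, the function returns a pair of lists of dicts. The values below are illustrative only (hand-written stand-ins, not actual model output, which varies by spaCy model and version), but downstream code can rely on the keys:

```python
# Illustrative return shapes -- the values are hypothetical, not real model output
entities = [
    {'text': 'NASA', 'label': 'ORG'},
    {'text': 'SpaceX', 'label': 'ORG'},
]
relations = [
    {'subject': 'NASA', 'verb': 'award', 'object': 'contract'},
]

# Downstream code can rely on these keys:
for ent in entities:
    print(f"{ent['text']} ({ent['label']})")
for rel in relations:
    print(f"{rel['subject']} -{rel['verb']}-> {rel['object']}")
```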

Step 5: Read the CSV and Process Each Row

Finally, we'll read the data.csv file into a pandas DataFrame and iterate through each row. For every row, we'll apply our function to the text in col_5 and print the results in a clean, readable format.

# Load the CSV into a pandas DataFrame
df = pd.read_csv('data.csv')

# Process each row of the DataFrame
for index, row in df.iterrows():
    text_to_analyze = row['col_5']
    
    entities, relations = extract_entities_and_relations(text_to_analyze)
    
    print(f"--- Analysis for Row {index} ---")
    print(f"Text: {text_to_analyze}")
    
    # Report Entities
    if entities:
        print("\nEntities Found:")
        for entity in entities:
            print(f"  - Text: '{entity['text']}', Type: {entity['label']} ({spacy.explain(entity['label'])})")
    else:
        print("\nEntities Found: None")
        
    # Report Relations
    if relations:
        print("\nRelations Found:")
        for relation in relations:
            print(f"  - '{relation['subject']}' -> '{relation['verb']}' -> '{relation['object']}'")
    else:
        print("\nRelations Found: None")
        
    print("\n" + "="*40 + "\n")
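If you want the results as data rather than printed text, one approach (a sketch; the `per_row` dict below is a hypothetical stand-in for accumulated `extract_entities_and_relations` output) is to flatten the entities into one record per row/entity pair and write a summary CSV:

```python
import csv

# Hypothetical per-row output standing in for extract_entities_and_relations results
per_row = {
    0: [{'text': 'NASA', 'label': 'ORG'}, {'text': 'SpaceX', 'label': 'ORG'}],
    3: [],  # rows with no entities still appear, with an empty list
}

# Flatten to one record per entity
records = [
    {'row': idx, 'entity': ent['text'], 'label': ent['label']}
    for idx, ents in per_row.items()
    for ent in ents
]

# Write the flattened records to a summary CSV
with open('entities_summary.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['row', 'entity', 'label'])
    writer.writeheader()
    writer.writerows(records)

print(f"wrote {len(records)} entity records")
```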

Visualizing the Dependency Parse (Optional)

To better understand how spaCy is finding the relations, you can visualize the dependency parse for a single sentence. This shows the grammatical relationships between words.

from spacy import displacy

# Let's visualize the first sentence from our data
sample_text = df.loc[0, 'col_5']
doc = nlp(sample_text)

# The 'dep' style shows the dependency parse
displacy.render(doc, style="dep", jupyter=True, options={'distance': 90})


https://gemini.google.com/app/e38bf561b65fc4f6

From <https://gemini.google.com/app/>  Google Gemini 2.5 Pro
