#4250 Python Workflow for Knowledge Graph in RDF/SPARQL/Pyvis




#4250 I want to develop the following Python workflow: 

  1. CSV -> RDF, 
  2. RDF Schema -> SPARQL Query Builder, 
  3. Tkinter GUI to query RDF -> Pyvis to visualize graph from query results

This is a classic Knowledge Graph engineering stack. To make this work seamlessly, you’ll want to use rdflib as your backbone—it handles the parsing, schema management, and SPARQL execution all in one go.

Here is a conceptual breakdown and a starter implementation to get your workflow running.


1. The Core Engine: CSV to RDF

To convert CSV to RDF, you map columns to URIs and Literals. Using a schema (RDFS) allows you to define classes and properties, ensuring your graph isn't just "flat" data.

import pandas as pd
from rdflib import Graph, Literal, RDF, URIRef, Namespace
from rdflib.namespace import RDFS, FOAF

# Initialize Graph and Namespace
g = Graph()
EX = Namespace("http://example.org/")
g.bind("ex", EX)

def convert_csv_to_rdf(csv_path):
    df = pd.read_csv(csv_path)
    for _, row in df.iterrows():
        # Example: Creating a 'Person' from a CSV row
        subject = EX[str(row['ID'])]
        g.add((subject, RDF.type, EX.Person))
        g.add((subject, RDFS.label, Literal(row['Name'])))
        g.add((subject, EX.worksAt, Literal(row['Company'])))
    return g

2. Tkinter GUI & SPARQL Builder

The GUI acts as the bridge. You can use a Text widget for the query and a Button to trigger the execution.

import tkinter as tk
from tkinter import scrolledtext

def run_query():
    query = query_input.get("1.0", tk.END)
    results = g.query(query)
    visualize_results(results)

root = tk.Tk()
root.title("SPARQL Graph Explorer")

query_input = scrolledtext.ScrolledText(root, width=60, height=10)
query_input.insert(tk.INSERT, "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10")
query_input.pack()

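btn = tk.Button(root, text="Execute & Visualize", command=run_query)
btn.pack()

# Start the event loop last, once visualize_results (section 3) has been defined:
# root.mainloop()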

3. Visualization: Pyvis

Pyvis expects nodes and edges. You’ll need to iterate through your SPARQL result set and add them to a pyvis.network.Network object.

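import os
import webbrowser

from pyvis.network import Network

def visualize_results(results):
    net = Network(notebook=False, directed=True, height="750px", width="100%")

    for row in results:
        # Each row of a "SELECT ?s ?p ?o" query is a tuple (s, p, o)
        s, p, o = str(row[0]), str(row[1]), str(row[2])

        net.add_node(s, label=s, color="#97c2fc")
        net.add_node(o, label=o, color="#ffff00")
        net.add_edge(s, o, label=p)

    # Write the HTML file, then open it in the default browser
    # (net.show() defaults to notebook mode in some Pyvis releases)
    net.save_graph("graph.html")
    webbrowser.open("file://" + os.path.abspath("graph.html"))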

The Complete Workflow Loop

Step        Tool      Responsibility
Ingestion   Pandas    Cleans and iterates through your raw CSV.
Modeling    RDFLib    Assigns URIs and applies your RDF Schema logic.
Interface   Tkinter   Provides a text area for custom SPARQL queries.
Rendering   Pyvis     Converts the (s, p, o) triples into an interactive HTML graph.

Technical Tips:

  • URIs vs Literals: Ensure your CSV IDs are converted to URIRef, while descriptive text should be Literal. Pyvis handles strings well, but RDFLib needs the distinction to query correctly.

  • Schema Enforcement: RDFLib does not apply RDFS entailment on its own. If you want RDFS reasoning (e.g., if A is a Manager and Manager is a subclass of Person, then A is a Person), you need either a reasoner (for example the owlrl package) or a simple expansion step that adds the implied triples to the graph before querying.

Would you like me to write a sample RDF Schema (RDFS) file that defines the relationships for your CSV data?


The schema I need is for NASA Programs: Program, SubProgram, Mission, Vehicle, System, SubSystem, Component, Document

For a NASA technical hierarchy, your schema needs to reflect a strict containment model (e.g., a System is part of a Vehicle) while also allowing for cross-cutting relationships like Documents that might describe multiple levels of the stack.

Below is a Turtle (.ttl) schema designed specifically for your requested entities.

1. NASA Technical Hierarchy Schema (nasa_schema.ttl)

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nasa: <http://nasa.example.org/ontology#> .

### Classes (The Hierarchy) ###

nasa:Program a rdfs:Class ;
    rdfs:label "Program" ;
    rdfs:comment "Top-level NASA effort (e.g., Artemis)" .

nasa:SubProgram a rdfs:Class ;
    rdfs:label "Sub-Program" ;
    rdfs:comment "Specific initiative within a Program" .

nasa:Mission a rdfs:Class ;
    rdfs:label "Mission" ;
    rdfs:comment "A specific flight or deployment (e.g., Artemis I)" .

nasa:Vehicle a rdfs:Class ;
    rdfs:label "Vehicle" ;
    rdfs:comment "The spacecraft or carrier (e.g., Orion, SLS)" .

nasa:System a rdfs:Class ;
    rdfs:label "System" ;
    rdfs:comment "Major functional block (e.g., Propulsion, Life Support)" .

nasa:SubSystem a rdfs:Class ;
    rdfs:label "Sub-System" ;
    rdfs:comment "Dedicated functional unit within a system" .

nasa:Component a rdfs:Class ;
    rdfs:label "Component" ;
    rdfs:comment "Individual hardware or software part" .

nasa:Document a rdfs:Class ;
    rdfs:label "Technical Document" ;
    rdfs:comment "Manuals, specifications, or reports" .

### Properties (Relationships) ###

# Structural "Part Of" relationship
nasa:partOf a rdf:Property ;
    rdfs:label "is part of" ;
    rdfs:comment "General hierarchical relationship" .

# Specific Mapping Example
nasa:belongsToProgram a rdf:Property ;
    rdfs:subPropertyOf nasa:partOf ;
    rdfs:domain nasa:SubProgram ;
    rdfs:range nasa:Program .

nasa:documentedBy a rdf:Property ;
    rdfs:label "is documented by" ;
    rdfs:domain rdfs:Resource ; # Any entity can be documented
    rdfs:range nasa:Document .

nasa:hasStatus a rdf:Property ;
    rdfs:label "current status" ;
    rdfs:range xsd:string .

2. Implementation: Mapping CSV to this Schema

When writing your Python converter, you will map your CSV columns to these URIs. If your CSV looks like this: ComponentID, Name, ParentSystem

Your mapping logic in rdflib would look like:

from rdflib import Graph, Literal, RDF, RDFS, Namespace

NASA = Namespace("http://nasa.example.org/ontology#")

def add_csv_row(g, row):
    # Define the Component
    comp_uri = NASA[row['ComponentID']]
    g.add((comp_uri, RDF.type, NASA.Component))
    g.add((comp_uri, RDFS.label, Literal(row['Name'])))
    
    # Define the Relationship to the System
    system_uri = NASA[row['ParentSystem']]
    g.add((comp_uri, NASA.partOf, system_uri))

3. SPARQL Query for your Tkinter GUI

Since you have a clear hierarchy, a powerful query for your users would be finding all components associated with a specific high-level Program.

Example Query:

PREFIX nasa: <http://nasa.example.org/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

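SELECT ?componentLabel ?systemLabel ?documentTitle
WHERE {
    ?c a nasa:Component ;
       rdfs:label ?componentLabel ;
       nasa:partOf+ ?program .  # '+' follows the 'partOf' chain up to the program

    ?program a nasa:Program ;
             rdfs:label "Artemis" .

    # Bind the System the component sits under (directly or via a SubSystem), if any
    OPTIONAL { ?c nasa:partOf+ ?system . ?system a nasa:System ; rdfs:label ?systemLabel }
    OPTIONAL { ?c nasa:documentedBy ?doc . ?doc rdfs:label ?documentTitle }
}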

Why this works:

  • Transitive Paths: Using nasa:partOf+ in SPARQL allows your users to find everything belonging to a Program without needing to know if there is a SubProgram or Mission in between.

  • Visual Logic: In Pyvis, your "Nodes" will be the Programs/Components and "Edges" will be the partOf relationships, creating a clean "tree" visualization.

Would you like me to refine the Pyvis code to color-code the nodes differently based on whether they are a "Program" vs. a "Component"?


Yes

To make the Pyvis graph intuitive, we can assign specific hex colors to each NASA class. This helps users instantly distinguish between a high-level Program and a granular Component in the visual output.

Here is the updated visualization function. It uses a dictionary to map RDF types to colors and handles the logic for extracting those types from your SPARQL results.

1. The Color-Coded Visualization Function

from pyvis.network import Network
from rdflib import Namespace, RDF

NASA = Namespace("http://nasa.example.org/ontology#")

# Mapping NASA Classes to specific colors
# Using a "Space" inspired palette
COLOR_MAP = {
    str(NASA.Program): "#1a237e",      # Deep Navy
    str(NASA.SubProgram): "#3949ab",   # Indigo
    str(NASA.Mission): "#00838f",      # Teal
    # ---
    str(NASA.Vehicle): "#d32f2f",      # NASA Red
    str(NASA.System): "#f57c00",       # Orange
    str(NASA.SubSystem): "#fbc02d",    # Yellow
    str(NASA.Component): "#7b1fa2",    # Purple
    # ---
    str(NASA.Document): "#455a64",     # Slate/Grey
}

def visualize_nasa_graph(results):
    """
    Expects SPARQL results in the format (subject, predicate, object, subjectType)
    """
    net = Network(notebook=False, directed=True, height="800px", width="100%", bgcolor="#ffffff")
    
    # Force-directed physics: connected nodes pull together into clusters
    net.force_atlas_2based()

    for row in results:
        # Unpack the SPARQL row
        s, p, o, s_type = [str(x) for x in row]
        
        # Get color based on type, default to light grey if unknown
        node_color = COLOR_MAP.get(s_type, "#eeeeee")
        
        # Add Subject Node
        net.add_node(s, label=s.split('#')[-1], title=f"Type: {s_type}", color=node_color)
        
        # Add Object Node (Simplification: assumes object is a string or URI)
        net.add_node(o, label=o.split('#')[-1], color="#90a4ae") 
        
        # Add Edge
        edge_label = p.split('#')[-1]
        net.add_edge(s, o, label=edge_label)

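    # Write the HTML file; open it in a browser to view the graph
    # (net.show() defaults to notebook mode in some Pyvis releases)
    net.save_graph("nasa_hierarchy.html")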

2. The Updated SPARQL Query for the GUI

To make the color-coding work, your SPARQL query must specifically SELECT the type of the subject so the Python script knows which color to apply.

Update your Tkinter default query to this:

PREFIX nasa: <http://nasa.example.org/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?subject ?predicate ?object ?type
WHERE {
    ?subject ?predicate ?object .
    ?subject rdf:type ?type .
    FILTER(?type != <http://www.w3.org/2000/01/rdf-schema#Class>)
}
LIMIT 50

3. Key Enhancements Added:

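  • force_atlas_2based(): This switches Pyvis to its ForceAtlas2-based physics solver, a force-directed layout in which connected nodes attract each other and all nodes repel. It is not a strict tree layout, but it spreads nodes apart and tends to pull each "Program" toward the middle of its cluster.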

  • label=s.split('#')[-1]: This cleans up the URIs in the visualization. Instead of seeing http://nasa.example.org/ontology#Orion, the user just sees Orion.

  • title attribute: In Pyvis, the title shows up as a tooltip when you hover over a node—perfect for showing the full RDF type.
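
The split('#') trick only shortens URIs that use a '#' fragment. If your data also contains slash-style URIs, a slightly more general helper might look like the sketch below; short_label is a hypothetical name, and the URIs are examples only.

```python
def short_label(term):
    """Return the fragment after '#', else the last path segment, else the text itself."""
    text = str(term)
    if "#" in text:
        tail = text.rsplit("#", 1)[-1]
    else:
        tail = text.rstrip("/").rsplit("/", 1)[-1]
    # Fall back to the full text if nothing is left (e.g. a bare namespace URI)
    return tail or text

print(short_label("http://nasa.example.org/ontology#Orion"))
print(short_label("http://example.org/people/alice"))
print(short_label("Fuel Intake Valve"))
```

Note that a literal value containing '/' or '#' would also be shortened, so in practice you may want to apply this only to URI terms and pass literals through unchanged.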

Legend for your Users

Since you are building a GUI, you might want to display a small legend in the Tkinter sidebar:

Level                Color
Program / Sub        Navy / Indigo
Vehicle / Mission    Red / Teal
System / Component   Orange / Purple
Documentation        Grey

Would you like me to integrate this visualization logic into a single, runnable Python script that combines the Tkinter window and the Pyvis output?


Yes

This script integrates your entire workflow: it defines the NASA RDFS, creates a sample CSV in memory, builds the Tkinter GUI for your SPARQL queries, and triggers Pyvis to render the results in your browser.

The Complete NASA RDF Explorer

import pandas as pd
import tkinter as tk
from tkinter import scrolledtext, messagebox
from rdflib import Graph, Literal, RDF, RDFS, Namespace, URIRef
from pyvis.network import Network
import webbrowser
import os
import io

# 1. SETUP NAMESPACES & SCHEMA
NASA = Namespace("http://nasa.example.org/ontology#")

# Color Map for Visualization
COLOR_MAP = {
    str(NASA.Program): "#1a237e",      # Deep Navy
    str(NASA.SubProgram): "#3949ab",   # Indigo
    str(NASA.Mission): "#00838f",      # Teal
    str(NASA.Vehicle): "#d32f2f",      # NASA Red
    str(NASA.System): "#f57c00",       # Orange
    str(NASA.SubSystem): "#fbc02d",    # Yellow
    str(NASA.Component): "#7b1fa2",    # Purple
    str(NASA.Document): "#455a64",     # Slate
}

class NasaRdfApp:
    def __init__(self, root):
        self.root = root
        self.root.title("NASA Knowledge Graph Explorer")
        self.g = Graph()
        self.g.bind("nasa", NASA)
        
        # Initialize Data
        self.setup_graph()
        self.create_widgets()

    def setup_graph(self):
        """Simulates CSV -> RDF conversion with RDFS logic"""
        # Sample CSV Data
        csv_data = """ID,Type,Name,Parent
Prog_Artemis,Program,Artemis,None
Sub_HLS,SubProgram,Human Landing System,Prog_Artemis
Miss_Art3,Mission,Artemis III,Sub_HLS
Veh_Starship,Vehicle,Starship HLS,Miss_Art3
Sys_Propulsion,System,Raptor Engines,Veh_Starship
Comp_Valve,Component,Fuel Intake Valve,Sys_Propulsion
Doc_Spec42,Document,Safety Spec 42,Comp_Valve"""
        
        df = pd.read_csv(io.StringIO(csv_data))
        
        for _, row in df.iterrows():
            subj = NASA[row['ID']]
            obj_type = NASA[row['Type']]
            
            # Add basic triples
            self.g.add((subj, RDF.type, obj_type))
            self.g.add((subj, RDFS.label, Literal(row['Name'])))
            
            # Hierarchy logic (pandas may read "None" as NaN, so check both cases)
            if pd.notna(row['Parent']) and str(row['Parent']).lower() != "none":
                parent = NASA[str(row['Parent'])]
                self.g.add((subj, NASA.partOf, parent))

    def create_widgets(self):
        """GUI Layout"""
        lbl = tk.Label(self.root, text="Enter SPARQL Query:", font=('Arial', 10, 'bold'))
        lbl.pack(pady=5)

        default_query = (
            "PREFIX nasa: <http://nasa.example.org/ontology#>\n"
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\n"
            "SELECT ?s ?p ?o ?type\n"
            "WHERE {\n"
            "  ?s ?p ?o .\n"
            "  ?s rdf:type ?type .\n"
            "  FILTER(?type != rdfs:Class)\n"
            "}"
        )

        self.query_input = scrolledtext.ScrolledText(self.root, width=70, height=12)
        self.query_input.insert(tk.INSERT, default_query)
        self.query_input.pack(padx=10, pady=5)

        btn_frame = tk.Frame(self.root)
        btn_frame.pack(pady=10)

        run_btn = tk.Button(btn_frame, text="Execute & Visualize", command=self.run_query, 
                           bg="#2ecc71", fg="white", padx=10)
        run_btn.pack(side=tk.LEFT, padx=5)

    def run_query(self):
        try:
            query_text = self.query_input.get("1.0", tk.END)
            results = self.g.query(query_text)
            
            if len(results) == 0:
                messagebox.showinfo("Results", "No triples found for this query.")
                return
                
            self.visualize(results)
        except Exception as e:
            messagebox.showerror("Query Error", str(e))

    def visualize(self, results):
        """Pyvis Integration"""
        net = Network(notebook=False, directed=True, height="700px", width="100%", bgcolor="#ffffff")
        net.force_atlas_2based()

        for row in results:
            # Check if query returned enough columns (s, p, o, type)
            s = str(row[0])
            p = str(row[1])
            o = str(row[2])
            s_type = str(row[3]) if len(row) > 3 else "Unknown"

            # Clean labels for visualization
            s_label = s.split('#')[-1]
            o_label = o.split('#')[-1]
            p_label = p.split('#')[-1]

            # Add nodes with colors
            s_color = COLOR_MAP.get(s_type, "#90a4ae")
            net.add_node(s, label=s_label, title=f"Type: {s_type}", color=s_color)
            net.add_node(o, label=o_label, color="#cfd8dc")
            
            net.add_edge(s, o, label=p_label)

        # Output to file and open
        output_file = "nasa_graph.html"
        net.save_graph(output_file)
        webbrowser.open('file://' + os.path.realpath(output_file))

if __name__ == "__main__":
    root = tk.Tk()
    app = NasaRdfApp(root)
    root.mainloop()

How to use this:

  1. Dependencies: Install the required packages with pip install rdflib pandas pyvis.

  2. Run: Execute the script. A window appears with a pre-loaded NASA query.

  3. Graph: Click "Execute & Visualize". It will generate nasa_graph.html and automatically open it in your browser.

  4. Explore: In the browser, you can drag the nodes. Notice how Artemis (Blue) acts as the anchor for the Sub-Programs and Components.

Customizing the CSV

In the setup_graph function, you can replace the pd.read_csv(io.StringIO(csv_data)) call with pd.read_csv("your_actual_file.csv") and drop the csv_data string. Just ensure your column names (ID, Type, Name, Parent) match the mapping logic.

Would you like me to add a feature to the GUI that allows you to browse and select a CSV file from your computer?


Yes

To make this a complete tool, I have added the filedialog module from Tkinter. This allows you to select any CSV from your computer and maps it to the NASA RDFS hierarchy dynamically.

Updated Python Workflow with CSV File Picker

import pandas as pd
import tkinter as tk
from tkinter import scrolledtext, messagebox, filedialog
from rdflib import Graph, Literal, RDF, RDFS, Namespace
from pyvis.network import Network
import webbrowser
import os

# 1. NAMESPACES & VISUAL SETTINGS
NASA = Namespace("http://nasa.example.org/ontology#")

COLOR_MAP = {
    str(NASA.Program): "#1a237e",      # Deep Navy
    str(NASA.SubProgram): "#3949ab",   # Indigo
    str(NASA.Mission): "#00838f",      # Teal
    str(NASA.Vehicle): "#d32f2f",      # NASA Red
    str(NASA.System): "#f57c00",       # Orange
    str(NASA.SubSystem): "#fbc02d",    # Yellow
    str(NASA.Component): "#7b1fa2",    # Purple
    str(NASA.Document): "#455a64",     # Slate
}

class NasaRdfApp:
    def __init__(self, root):
        self.root = root
        self.root.title("NASA Knowledge Graph Builder")
        self.g = Graph()
        self.g.bind("nasa", NASA)
        
        self.create_widgets()

    def create_widgets(self):
        """GUI Layout with File Selection and Query Area"""
        
        # --- File Selection Section ---
        file_frame = tk.LabelFrame(self.root, text="Step 1: Ingest Data", padx=10, pady=10)
        file_frame.pack(fill="x", padx=10, pady=5)

        self.file_label = tk.Label(file_frame, text="No CSV selected", fg="grey")
        self.file_label.pack(side=tk.LEFT, padx=5)

        browse_btn = tk.Button(file_frame, text="Browse CSV", command=self.load_csv)
        browse_btn.pack(side=tk.RIGHT)

        # --- Query Section ---
        query_frame = tk.LabelFrame(self.root, text="Step 2: SPARQL Query Builder", padx=10, pady=10)
        query_frame.pack(fill="both", expand=True, padx=10, pady=5)

        default_query = (
            "PREFIX nasa: <http://nasa.example.org/ontology#>\n"
            "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n\n"
            "SELECT ?subject ?predicate ?object ?type\n"
            "WHERE {\n"
            "  ?subject ?predicate ?object .\n"
            "  ?subject rdf:type ?type .\n"
            "}"
        )

        self.query_input = scrolledtext.ScrolledText(query_frame, width=70, height=12)
        self.query_input.insert(tk.INSERT, default_query)
        self.query_input.pack(fill="both", expand=True)

        # --- Action Section ---
        self.run_btn = tk.Button(self.root, text="Generate Interactive Graph", 
                                command=self.run_query, bg="#0052a5", fg="white", 
                                font=('Arial', 10, 'bold'), pady=10)
        self.run_btn.pack(fill="x", padx=10, pady=10)

    def load_csv(self):
        """Opens file dialog and converts CSV to RDF triples"""
        file_path = filedialog.askopenfilename(filetypes=[("CSV files", "*.csv")])
        if not file_path:
            return

        try:
            df = pd.read_csv(file_path)
            # Basic validation: check for required columns
            required = {'ID', 'Type', 'Name', 'Parent'}
            if not required.issubset(df.columns):
                messagebox.showerror("Error", f"CSV must contain columns: {required}")
                return

            self.g = Graph() # Reset graph on new load
            self.g.bind("nasa", NASA)

            for _, row in df.iterrows():
                subj = NASA[str(row['ID']).replace(" ", "_")]
                obj_type = NASA[str(row['Type'])]
                
                self.g.add((subj, RDF.type, obj_type))
                self.g.add((subj, RDFS.label, Literal(str(row['Name']))))
                
                if pd.notna(row['Parent']) and str(row['Parent']).lower() != "none":
                    parent = NASA[str(row['Parent']).replace(" ", "_")]
                    self.g.add((subj, NASA.partOf, parent))

            self.file_label.config(text=f"Loaded: {os.path.basename(file_path)}", fg="green")
            messagebox.showinfo("Success", f"Converted {len(df)} rows to RDF.")

        except Exception as e:
            messagebox.showerror("Loading Error", str(e))

    def run_query(self):
        """Executes SPARQL and launches Pyvis"""
        if len(self.g) == 0:
            messagebox.showwarning("Warning", "Please load a CSV file first.")
            return

        try:
            query_text = self.query_input.get("1.0", tk.END)
            results = self.g.query(query_text)
            self.visualize(results)
        except Exception as e:
            messagebox.showerror("Query Error", str(e))

    def visualize(self, results):
        net = Network(notebook=False, directed=True, height="750px", width="100%", bgcolor="#ffffff")
        net.force_atlas_2based()

        for row in results:
            # Expected: (subject, predicate, object, type)
            s, p, o = str(row[0]), str(row[1]), str(row[2])
            s_type = str(row[3]) if len(row) > 3 else "Unknown"

            s_label = s.split('#')[-1]
            o_label = o.split('#')[-1]
            p_label = p.split('#')[-1]

            # Use the NASA Color Map
            color = COLOR_MAP.get(s_type, "#97c2fc")

            net.add_node(s, label=s_label, title=f"Type: {s_type}", color=color)
            net.add_node(o, label=o_label, color="#dddddd")
            net.add_edge(s, o, label=p_label)

        path = os.path.abspath("nasa_output.html")
        net.save_graph(path)
        webbrowser.open("file://" + path)

if __name__ == "__main__":
    root = tk.Tk()
    app = NasaRdfApp(root)
    root.mainloop()

Expected CSV Format

To use the "Browse" feature, ensure your CSV file has these exact headers:

ID            Type       Name              Parent
Art_1         Program    Artemis Program   None
Orion_V       Vehicle    Orion Capsule     Art_1
Propulsion_S  System     Main Engine       Orion_V
Doc_01        Document   Safety Manual     Propulsion_S
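
Before loading a file, it can be useful to check more than the headers, for instance that every Parent value refers to an ID that actually exists. The following is a minimal sketch using only the standard library; check_hierarchy_csv is a hypothetical helper, and the sample rows mirror the table above with one deliberately broken parent reference.

```python
import csv
import io

REQUIRED_COLUMNS = {"ID", "Type", "Name", "Parent"}

def check_hierarchy_csv(csv_text):
    """Return a list of problems; an empty list means the file looks usable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    rows = list(reader)
    known_ids = {row["ID"] for row in rows}
    problems = []
    for row in rows:
        parent = (row["Parent"] or "").strip()
        # "None" (any case) and empty cells mean "no parent"
        if parent and parent.lower() != "none" and parent not in known_ids:
            problems.append(f"{row['ID']}: unknown parent {parent!r}")
    return problems

GOOD = """ID,Type,Name,Parent
Art_1,Program,Artemis Program,None
Orion_V,Vehicle,Orion Capsule,Art_1
"""

# Same shape, but the document points at an ID that is not in the file
BAD = GOOD + "Doc_01,Document,Safety Manual,Propulsion_X\n"

print(check_hierarchy_csv(GOOD))
print(check_hierarchy_csv(BAD))
```

A dangling parent would otherwise show up in the graph as an untyped, unlabeled node, which is easy to miss.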

Key Improvements:

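  1. Dynamic ID Handling: It replaces spaces in IDs with underscores before building URIs. Other characters that are not allowed in URIs (such as < > or quotes) are not handled and would still need cleaning.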

  2. Clean Separation: The GUI now separates the "Ingestion" step from the "Querying" step.

  3. Error Handling: It alerts you if the CSV columns are missing or if the SPARQL syntax is incorrect.

  4. Automatic Browser Launch: Once you click generate, the browser opens immediately with the physics-enabled graph.
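
If your IDs may contain characters beyond spaces, one option is to percent-encode them when building the URI. The sketch below shows that idea with the standard library; id_to_uri is a hypothetical helper, and the base URI is the example namespace used throughout this post.

```python
from urllib.parse import quote

BASE = "http://nasa.example.org/ontology#"

def id_to_uri(raw_id, base=BASE):
    """Turn a raw CSV ID into a URI string, percent-encoding anything unsafe."""
    # Trim the ID and collapse runs of whitespace into single underscores
    cleaned = "_".join(str(raw_id).split())
    if not cleaned:
        raise ValueError("empty ID cannot be mapped to a URI")
    # safe="" encodes every reserved character, including '#' and '/'
    return base + quote(cleaned, safe="")

print(id_to_uri("Orion V"))
print(id_to_uri("Main Engine #2"))
```

The resulting string can be wrapped in rdflib's URIRef in place of the NASA[...] lookup. The same function must be applied to both the ID and Parent columns so that the references still line up.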


From <https://gemini.google.com/app/cd645441b4645112>  Google Gemini 3.0 (Fast)
