#4307 Gutenberg Bible EBook
#4307 Tell me about https://www.gutenberg.org/cache/epub/10/pg10.txt and what could be done with it to create a knowledge graph of the Bible.
The link you provided is the Project Gutenberg eBook of the King James Version (KJV) of the Bible (eBook #10). It is a plain-text (.txt) file containing the full text of the Old and New Testaments.
Because it is structured, public domain, and highly relational, it is an ideal "seed" dataset for building a Knowledge Graph (KG).
1. Understanding the Source File
Format: UTF-8 plain text.
Structure: The file uses a standard naming convention: Book Name -> `Chapter:Verse`. For example, `Genesis 1:1 In the beginning...`
Content: It includes 66 books, 1,189 chapters, and over 31,000 verses.
Advantage: The consistent `Chapter:Verse` prefix makes it programmatically easy to parse using Regular Expressions (Regex).
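As a quick illustration, that prefix can be captured with a single regular expression. This is a minimal sketch using made-up sample lines rather than the actual file contents:

```python
import re

# Matches "Chapter:Verse Text" lines as they appear in the KJV plain text.
VERSE_RE = re.compile(r"^(\d+):(\d+)\s+(.*)$")

lines = [
    "1:1 In the beginning God created the heaven and the earth.",
    "1:2 And the earth was without form, and void;",
]

parsed = []
for line in lines:
    m = VERSE_RE.match(line)
    if m:
        chapter, verse, text = int(m.group(1)), int(m.group(2)), m.group(3)
        parsed.append((chapter, verse, text))

print(parsed[0][:2])  # → (1, 1)
```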
2. How to Create a Knowledge Graph from This File
A Knowledge Graph represents data as Nodes (entities) and Edges (relationships). To transform this text file into a KG, you would follow these steps:
Step A: Named Entity Recognition (NER)
You would use Natural Language Processing (NLP) to extract key entities from the text.
People: Adam, Eve, Noah, Moses, Mary, Jesus, Paul.
Places: Eden, Ur, Egypt, Jerusalem, Nazareth, Damascus.
Events: The Flood, The Exodus, The Crucifixion.
Groups/Nations: Israelites, Philistines, Pharisees, Romans.
Step B: Defining Relationships (Edges)
The Bible is dense with specific relationship types that can be mapped:
Genealogical: [Adam] --(father of)--> [Seth], [Abraham] --(husband of)--> [Sarah]
Spatial: [Jerusalem] --(located in)--> [Judea], [Moses] --(traveled to)--> [Mount Sinai]
Affiliation: [Peter] --(disciple of)--> [Jesus], [Paul] --(citizen of)--> [Rome]
Attribution: [Genesis] --(attributed author)--> [Moses]
Step C: Triple Extraction
You would convert verses into "Triples" (Subject-Predicate-Object).
Verse: "And Cain talked with Abel his brother: and... Cain rose up against Abel his brother, and slew him." (Gen 4:8)
Triples:
(Cain, is_brother_of, Abel)
(Cain, killed, Abel)
3. Practical Applications for the Graph
Once the data is in a graph database (like Neo4j) or an RDF format, you could perform advanced queries that are impossible with a simple text search:
Lineage Tracking: "Show the shortest path of ancestry between King David and Jesus."
Geospatial Analysis: "Map all the cities Paul visited during his third missionary journey."
Social Network Analysis: "Identify the 'most central' person in the New Testament based on the number of unique people they interacted with."
Thematic Discovery: "Find all verses where a specific location (e.g., Bethel) is associated with a specific event type (e.g., a dream or a sacrifice)."
4. Technical Workflow Suggestion
If you wanted to build this today, you could:
Clean: Use Python to strip the Project Gutenberg header/footer.
Parse: Use Regex to split the text into a CSV with columns for `Book`, `Chapter`, `Verse`, and `Text`.
Enrich: Cross-reference the names with a resource like the Open Bible Geocoding project to get latitude/longitude for the places.
Load: Import the CSV into Neo4j to visualize the connections.
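The cleanup step can be sketched in a few lines. This assumes the standard Project Gutenberg `*** START OF` / `*** END OF` marker convention; the exact marker wording varies between releases, so the match is prefix-only:

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return only the text between the '*** START OF' and '*** END OF'
    markers, dropping the license and metadata around it."""
    lines = raw.splitlines()
    start = next(i for i, l in enumerate(lines) if l.startswith("*** START OF"))
    end = next(i for i, l in enumerate(lines) if l.startswith("*** END OF"))
    return "\n".join(lines[start + 1:end]).strip()

# Illustrative miniature document, not the real file.
sample = "\n".join([
    "The Project Gutenberg eBook of The King James Bible",
    "*** START OF THE PROJECT GUTENBERG EBOOK THE KING JAMES BIBLE ***",
    "1:1 In the beginning God created the heaven and the earth.",
    "*** END OF THE PROJECT GUTENBERG EBOOK THE KING JAMES BIBLE ***",
    "License text...",
])
```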
List the various relationships and the useful Cypher queries
To build a Bible Knowledge Graph in a database like Neo4j, you need to define the relationships (edges) that connect your entities (nodes).
Below are the key relationships derived from the text structure and biblical narrative, followed by powerful Cypher queries to explore that data.
1. Core Relationships (Schema)
| Relationship Type | Description | Example Triples |
| --- | --- | --- |
| PARENT_OF | Genealogical lineage (father/mother to child). | (Abraham)-[:PARENT_OF]->(Isaac) |
| SPOUSE_OF | Marriage or partnership. | (Abraham)-[:SPOUSE_OF]->(Sarah) |
| LOCATED_IN | Spatial hierarchy (City in Region). | (Jerusalem)-[:LOCATED_IN]->(Judea) |
| APPEARS_IN | Entity mentioned in a specific text unit. | (Moses)-[:APPEARS_IN]->(Exodus 2:1) |
| CROSS_REFERENCES | Thematic or prophetic links between verses. | (Matthew 27:46)-[:CROSS_REFERENCES]->(Psalm 22:1) |
| WROTE | Authorial attribution of a book. | (Paul)-[:WROTE]->(Romans) |
| CONTEMPORARY_WITH | People living or active at the same time. | (Ezra)-[:CONTEMPORARY_WITH]->(Nehemiah) |
2. Useful Cypher Queries
Once you have loaded your data from the Gutenberg text, use these queries to uncover hidden patterns.
A. Genealogical Discovery
Query: Find the lineage from Adam to a specific person (e.g., King David).
This uses a variable-length path (`*1..50`) to trace any number of generations.
```cypher
MATCH path = (adam:Person {name: "Adam"})-[:PARENT_OF*1..50]->(david:Person {name: "David"})
RETURN path
```
B. Thematic Map
Query: Find all locations mentioned in the same verses as a specific person.
This helps visualize the "travel map" of a character.
```cypher
MATCH (p:Person {name: "Paul"})-[:APPEARS_IN]->(v:Verse)<-[:APPEARS_IN]-(l:Location)
WHERE v.book = "Acts"
RETURN l.name, count(v) AS mentions
ORDER BY mentions DESC
```
C. Social Network Centrality
Query: Who is the most "connected" person in the New Testament?
This counts how many unique people appear in the same verses together.
```cypher
MATCH (p1:Person)-[:APPEARS_IN]->(v:Verse)<-[:APPEARS_IN]-(p2:Person)
WHERE v.testament = "New" AND p1 <> p2
RETURN p1.name, count(DISTINCT p2) AS connections
ORDER BY connections DESC
LIMIT 10
```
D. Identifying "Super-Verses"
Query: Find verses that mention the most entities (People + Places).
Useful for finding pivotal narrative moments.
```cypher
MATCH (e)-[:APPEARS_IN]->(v:Verse)
RETURN v.reference AS Verse, count(e) AS entity_count
ORDER BY entity_count DESC
LIMIT 5
```
E. Prophecy Fulfillment
Query: Find Old Testament verses referenced by New Testament authors.
```cypher
MATCH (v2:Verse {testament: "New"})-[:CITES]->(v1:Verse {testament: "Old"})
RETURN v2.reference + " fulfills " + v1.reference AS Connection
```
3. Implementation Tip: The "Verse" Node
When using the Gutenberg text, the Verse should be your "Join Node." Instead of connecting a person directly to a place, connect them both to a Verse node.
Structure: (Person) --[:APPEARS_IN]--> (Verse) <--[:APPEARS_IN]-- (Location)
This preserves the provenance of the data—you aren't just saying "Moses was in Egypt," you are saying "The text in Exodus 2:10 places Moses in Egypt."
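The join-node idea can be prototyped outside the database too. This sketch uses hypothetical appearance records to derive co-occurrence pairs while keeping the verse reference as provenance:

```python
from collections import defaultdict

# Hypothetical (entity, verse_ref) records; in the real pipeline these
# come from the NER pass over the parsed verses.
appearances = [
    ("Moses", "Exodus 2:10"),
    ("Egypt", "Exodus 2:10"),
    ("Pharaoh", "Exodus 1:22"),
    ("Egypt", "Exodus 1:22"),
]

# Group entities under the verse that acts as the "Join Node".
by_verse = defaultdict(list)
for entity, ref in appearances:
    by_verse[ref].append(entity)

# Each co-occurrence pair carries its verse as provenance.
pairs = [
    (a, b, ref)
    for ref, ents in by_verse.items()
    for i, a in enumerate(ents)
    for b in ents[i + 1:]
]
```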
How would "Authorial attribution of a book" and "Thematic or prophetic links between verses" be determined based on the text file provided?
Determining these specific relationships from a raw text file requires a mix of structural pattern matching (for authors) and semantic/intertextual analysis (for links).
Here is how you would programmatically extract them from the pg10.txt file:
1. Determining Authorial Attribution
In the Project Gutenberg file, authorial attribution is often explicitly stated in the Book Headers located just before the verse text begins.
Pattern Matching: You would scan for lines that appear before the `1:1` verse of each book.
Examples found in the text:
"The First Book of Moses: Called Genesis" → Logic: If "Book of [Name]" or "[Name]: Called..." is found, map [Name] -[:WROTE]-> [Book].
"The Epistle of Paul the Apostle to the Romans" → Logic: Use a regex to catch `Epistle of ([A-Za-z]+)`.
The "WROTE" Logic:
Split the file by Book titles.
Extract the name from the title string (e.g., "Paul", "Moses", "John").
Create a
Personnode for the name and aBooknode, then link them with aWROTErelationship.
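The title-parsing logic can be sketched as a small set of regex heuristics. These patterns are illustrative and deliberately rough; real book headers (e.g., "The Second Book of the Kings") need more special-casing:

```python
import re

# Heuristic patterns for the two header shapes quoted above.
TITLE_PATTERNS = [
    re.compile(r"Book of ([A-Z][a-z]+):"),   # "The First Book of Moses: Called Genesis"
    re.compile(r"Epistle of ([A-Z][a-z]+)"), # "The Epistle of Paul the Apostle to the Romans"
]

def attributed_author(title):
    """Return the attributed author's name, or None if no pattern matches."""
    for pat in TITLE_PATTERNS:
        m = pat.search(title)
        if m:
            return m.group(1)
    return None
```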
2. Determining Thematic or Prophetic Links
Since this specific text file is "flat" (it doesn't have built-in footnotes or hyperlinks), you have to generate these links using two methods:
A. Keyword/Entity Overlap (Thematic)
If two verses share rare or significant entities, they are likely thematically linked.
Method: If Verse A and Verse B both contain the unique entities [Melchizedek] and [King], create a THEMATIC_LINK between them.
Cypher Logic:
```cypher
MATCH (e:Entity)-[:APPEARS_IN]->(v1:Verse),
      (e)-[:APPEARS_IN]->(v2:Verse)
WHERE v1 <> v2 AND e.rarity > $threshold
MERGE (v1)-[:THEMATIC_LINK {topic: e.name}]->(v2)
```
B. Fulfillment Language (Prophetic)
In the New Testament, authors frequently use "formulaic citations" to point back to the Old Testament. You can use Natural Language Processing (NLP) to find these specific triggers:
Triggers: Look for phrases like:
"...that it might be fulfilled which was spoken by the prophet..."
"...as it is written..."
"...for it is written..."
Extraction: When one of these phrases is found in a New Testament verse (e.g., Matthew 1:22), use a Named Entity Recognizer to find the "Prophet" named nearby (e.g., "Esaias" / Isaiah). Then, search the Old Testament for the most similar text or a direct quote to create the `FULFILLS` edge.
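The trigger scan itself can be sketched as a simple substring check; the phrase list mirrors the formulas above and would be extended in practice:

```python
# Formulaic citation triggers (lowercased for matching).
TRIGGERS = (
    "that it might be fulfilled",
    "as it is written",
    "for it is written",
)

def has_fulfillment_formula(verse_text):
    """True if the verse contains one of the known citation formulas."""
    lowered = verse_text.lower()
    return any(trigger in lowered for trigger in TRIGGERS)
```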
C. Direct Quotes (Text Reuse Detection)
By running a "fuzzy string match" or a "MinHash" algorithm between the New Testament and the Old Testament sections of the file, you can identify where an author is quoting an earlier text.
Example: If a string of 7+ words in Hebrews 1 matches a string in Psalm 2, you create a `CITES` relationship between those two verse nodes.
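A lightweight stand-in for MinHash is exact n-word shingle overlap, which directly implements the "7+ words" rule. A sketch, with case and punctuation normalized away before comparison:

```python
import re

def word_shingles(text, n=7):
    """All n-word runs in a passage, lowercased and stripped of punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_quote(a, b, n=7):
    """True if the two passages share any run of n identical words."""
    return bool(word_shingles(a, n) & word_shingles(b, n))
```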
Summary of Workflow for the KG:
Author: Parse the unique headers at the start of each of the 66 books.
Thematic: Link verses that share common "low-frequency" nouns (e.g., "Tabernacle", "Covenant").
Prophetic: Use regex to find "fulfillment formulas" and link the quoting verse to the quoted verse.
Provide a project plan to create a knowledge graph from the downloaded text file
This project plan outlines the transition from a single, massive .txt file into a queryable Property Graph (using Neo4j) or RDF Triple Store.
The main challenge with pg10.txt is its size and "flatness," so we will use a Modular Pipeline approach.
Phase 1: Data Pre-processing & Parsing
Goal: Transform the raw text into a structured tabular format (CSV/JSON).
Step 1.1: Header/Footer Cleanup. Strip the Project Gutenberg license and metadata at the beginning and end of the file.
Step 1.2: Book Segmentation. Write a Python script to split the file based on the book titles (e.g., "The First Book of Moses: Called Genesis").
Step 1.3: Verse Extraction. Use the Regex `^(\d+):(\d+)\s+(.*)` to capture the Chapter, Verse, and Text content.
Deliverable: A `verses.csv` file with columns: [Testament, Book, Chapter, Verse, Text].
Phase 2: Entity Extraction (NER)
Goal: Identify the "Nodes" of your graph (People, Places, Groups).
Step 2.1: Rule-Based Extraction. Use a Bible-specific name list (like Hitchcock's Bible Names dictionary) to tag common entities.
Step 2.2: NLP Enhancement. Run a model from spaCy or Hugging Face (ideally one fine-tuned on historical or religious texts) to find entities missed by the rules.
Step 2.3: Entity Resolution (De-duplication). Map "Simon," "Peter," and "Cephas" to a single `Person` node. Map "Saul of Tarsus" to "Paul."
Deliverable: An `entities.csv` file mapping names to the verses where they appear.
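Entity resolution can start as a plain alias table. The entries here are illustrative; a real table would be curated from a name dictionary or with LLM assistance:

```python
# Hypothetical alias table mapping surface names to master node names.
ALIASES = {
    "Abram": "Abraham",
    "Sarai": "Sarah",
    "Simon": "Peter",
    "Cephas": "Peter",
    "Saul of Tarsus": "Paul",
}

def canonical(name):
    """Resolve an extracted mention to its master node name."""
    return ALIASES.get(name, name)
```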
Phase 3: Relationship (Edge) Mining
Goal: Define how nodes connect to one another.
Step 3.1: Genealogical Mining. Extract relationships using "begat" or "son of" patterns (e.g., [Name] begat [Name]).
Step 3.2: Geographical Mapping. Link People to Places mentioned in the same verse ([:VISITED] or [:LIVED_IN]).
Step 3.3: Inter-textual Links (Cross-References). Script a search for the "fulfillment formulas" described earlier to create [:FULFILLS] links between the New and Old Testaments.
Deliverable: A `relationships.csv` file defining [Source_ID, Relationship_Type, Target_ID].
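Step 3.1 can be sketched with the "begat" pattern directly; the sample text is the KJV wording of Matthew 1:2:

```python
import re

# "[Name] begat [Name]" pattern from Step 3.1.
BEGAT_RE = re.compile(r"([A-Z][a-z]+) begat ([A-Z][a-z]+)")

text = ("Abraham begat Isaac; and Isaac begat Jacob; "
        "and Jacob begat Judas and his brethren;")

# Each match becomes a (Source, Relationship_Type, Target) row.
triples = [(parent, "PARENT_OF", child)
           for parent, child in BEGAT_RE.findall(text)]
```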
Phase 4: Graph Construction & Loading
Goal: Import the data into a Graph Database.
Step 4.1: Define the Schema. Labels: Person, Location, Book, Verse, Event. Properties: verse text, coordinates for locations, dates (where applicable).
Step 4.2: Bulk Import. Use the Neo4j Admin Import tool for high-speed loading of your CSVs.
Step 4.3: Indexing. Create indexes on `Person.name` and `Verse.reference` to ensure fast query performance.
Phase 5: Visualization & Analysis
Goal: Make the data useful for research or discovery.
Step 5.1: Visualization. Use Neo4j Bloom or Gephi to visually explore clusters (e.g., the "Twelve Tribes" cluster or the "Pauline Epistles" network).
Step 5.2: Pathfinding. Run algorithms like Shortest Path to find connections between disparate figures (e.g., "How is Ruth related to Jesus?").
Step 5.3: Centrality Analysis. Apply PageRank to identify the most influential locations or figures based on text connectivity.
Suggested Tech Stack
| Component | Tool Recommendation |
| --- | --- |
| Language | Python (Pandas for data, spaCy for NLP) |
| Database | Neo4j (Community Edition) |
| Regex Tool | |
| Orchestration | Jupyter Notebooks (To document the cleaning steps) |
Pro-tip: Don't try to build the whole graph at once. Start by just mapping the Genealogy of Genesis (Chapters 1–50). Once that "family tree" looks right in your database, expand the pipeline to the rest of the text!
Are you planning to run this locally, or are you looking for a cloud-based graph solution?
Provide useful system instructions and prompts that I can use for LLM-assistance
To successfully build a knowledge graph from the KJV Bible text, you should use a Multi-Stage Prompting strategy.
Because LLMs have context limits, do not feed the whole book at once. Instead, process the text in chunks (e.g., 10–20 verses at a time) using the following system instructions and prompts.
1. System Instruction: The "Expert Ontologist"
Set this as the global "System Prompt" to ensure the LLM remains consistent and follows a strict output schema.
System Instruction: You are a specialized Knowledge Graph Engineer and Biblical Historian. Your task is to extract factual entities and relationships from the King James Bible text.
Operational Rules:
Output Format: Respond ONLY in a valid JSON array of triples.
Schema: Each triple must follow the structure: {"subject": "Entity", "predicate": "RELATIONSHIP_TYPE", "object": "Entity"}
Entity Normalization: Use full names (e.g., use "Jesus Christ" instead of "he"), in Title Case, keeping people distinct from locations.
Predicate Style: Use UPPERCASE_SNAKE_CASE for predicates (e.g., PARENT_OF, TRAVELED_TO).
Strict Adherence: Only extract facts explicitly stated or strongly implied by the verse. Do not add outside theological commentary.
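Because LLMs sometimes return malformed JSON even under strict instructions, it helps to validate responses against this schema before loading them. A minimal guard, not tied to any particular LLM client:

```python
import json

# Keep only triples matching the schema and predicate style from the
# system instruction above.
REQUIRED_KEYS = {"subject", "predicate", "object"}

def valid_triples(raw_response):
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [
        t for t in data
        if isinstance(t, dict)
        and REQUIRED_KEYS <= t.keys()
        and isinstance(t["predicate"], str)
        and t["predicate"] == t["predicate"].upper()  # UPPERCASE_SNAKE_CASE
    ]
```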
2. Extraction Prompts (By Layer)
Different "passes" over the text will yield cleaner results. Use these specific user prompts for each stage of your project.
Stage A: Genealogical Extraction (The "Family Tree" Pass)
Use this for books like Genesis, Chronicles, or the Gospels' genealogies.
User Prompt: "Extract all genealogical relationships from the following Bible verses. Focus on parentage, marriage, and tribal affiliation.
Verses to Process: [INSERT VERSE CHUNK HERE]
Target Predicates: PARENT_OF, SPOUSE_OF, CHILD_OF, TRIBE_OF, BORN_IN."
Stage B: Geospatial & Movement (The "Travel Log" Pass)
Use this for Exodus, Acts, or the Gospels' narratives.
User Prompt: "Identify all geographical locations and the movement of people between them in the following text.
Verses to Process: [INSERT VERSE CHUNK HERE]
Target Predicates: LOCATED_IN, TRAVELED_TO, DEPARTED_FROM, DIED_AT, BATTLE_AT."
Stage C: Inter-textual Links (The "Prophecy" Pass)
Use this specifically for the New Testament to find connections to the Old.
User Prompt: "Analyze these New Testament verses for direct quotes or references to Old Testament events, laws, or prophecies.
Verses to Process: [INSERT VERSE CHUNK HERE]
Target Predicates: FULFILLS, CITES, ALLUDES_TO, PREFIGURES."
3. The "Entity Resolution" Prompt
After you have extracted thousands of triples, you will have duplicates (e.g., "Abram" and "Abraham"). Use this prompt to clean your list of nodes before importing to Neo4j.
User Prompt: "I am providing a list of unique entity names extracted from the Bible. Group names that refer to the same individual or location and provide a 'Master Name' for the group.
Entity List: [INSERT LIST OF NAMES]
Output Format:
{"Original Name": "Master Name"}"
4. Troubleshooting & Refining Tips
Temperature Setting: Set your LLM temperature to 0.0 or 0.1. You want deterministic extraction, not creative writing.
Context Overlap: When chunking the pg10.txt file, overlap the chunks by 1–2 verses (e.g., Chunk 1: Verses 1–10; Chunk 2: Verses 9–20). This prevents the LLM from missing relationships that bridge your artificial breaks.
Handling "God": You will need to decide whether "God," "The LORD," and "Spirit" are separate nodes or one. Most graph builders use a single node (Person {name: "God"}) to avoid a fragmented graph.
For chunking and embedding, would splitting by Book + Chapter + verse + text, with no overlap work well? How about by Book + Chapter, with verse and text
Since you are building a Knowledge Graph, your chunking strategy needs to prioritize referential integrity (the ability to link a fact back to a specific coordinate) over the "fuzzy" semantic search typical of standard RAG applications.
Here is an analysis of your two proposed strategies, along with a "Gold Standard" recommendation for your specific use case.
1. Strategy A: Book + Chapter + Verse + Text (No Overlap)
Example Chunk: Genesis 1:1 In the beginning God created the heaven and the earth.
The Verdict: Best for Entity Extraction & Triple Mapping.
Pros:
High Precision: Every extracted relationship (God -[:CREATED]-> Heaven) is tied to a single, immutable reference (Genesis 1:1).
No Duplication: Without overlap, you won't extract the same "begat" relationship twice, saving you massive amounts of data cleaning.
Cons:
Context Loss: If a sentence spans two verses (common in the Epistles), the LLM might miss the subject of the second verse.
Embedding Weakness: Single verses are often too short for meaningful vector embeddings. "And he said unto them" appears hundreds of times; without context, the embedding is useless for search.
2. Strategy B: Book + Chapter (Verse + Text included)
Example Chunk: A single document containing all 31 verses of Genesis 1.
The Verdict: Best for Thematic Embedding & Narrative Flow.
Pros:
Better Embeddings: The vector will capture the "vibe" of the chapter (e.g., The Creation, The Sermon on the Mount).
Cohesion: The LLM can see the "Pronoun Antecedents." It knows that "he" in verse 5 refers to "God" mentioned in verse 1.
Cons:
Extraction Noise: If you ask an LLM to "Extract all triples from Genesis 1," it may hallucinate or skip data because the context window is crowded with 30+ sentences.
Granularity Loss: When a user asks a question, the system points them to the whole chapter rather than the specific verse.
3. The "Gold Standard" Recommendation: The Sliding Window Verse-Group
For a Knowledge Graph, I recommend a sliding window of 3–5 verses, indexed by the Anchor Verse.
The Logic:
Chunk Size: 3 Verses (Current Verse + 1 Before + 1 After).
Metadata: Keep the `Book`, `Chapter`, and `Verse` of the center verse as the primary ID.
Why this works:
Resolution: If verse 2 says "And he went there," the LLM has verse 1 to tell it who "he" is and verse 3 to tell it where "there" is.
Triple Attribution: You instruct the LLM: "Only extract relationships where the Subject appears in the center verse (the Anchor)." This prevents duplicate extractions across chunks.
Embedding Richness: A 3-verse window (approx. 50–100 words) provides enough "semantic weight" for a vector database (like Chroma or Pinecone) to actually differentiate between chapters.
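The sliding window itself is only a few lines of Python. A sketch; `verse_windows` and the sample verses are hypothetical:

```python
def verse_windows(verses, radius=1):
    """Yield (anchor_ref, window_text): each verse plus `radius` verses of
    context on either side, indexed by the anchor verse's reference."""
    for i, (ref, _) in enumerate(verses):
        lo = max(0, i - radius)
        hi = min(len(verses), i + radius + 1)
        window = " ".join(text for _, text in verses[lo:hi])
        yield ref, window

verses = [
    ("Gen 4:8", "And Cain talked with Abel his brother..."),
    ("Gen 4:9", "And the LORD said unto Cain, Where is Abel thy brother?"),
    ("Gen 4:10", "And he said, What hast thou done?"),
]
windows = dict(verse_windows(verses))
```

Edge verses simply get a smaller window, so every verse still has exactly one anchor ID.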
Summary Table for Your Project
| Task | Best Chunking Strategy |
| --- | --- |
| Building the KG (Triples) | Single Verse (to ensure 1:1 mapping of facts to coordinates). |
| Entity Resolution | Book Level (to see all mentions of "Joseph" in one go). |
| Embedding (Search/QA) | Paragraph/Story Level (e.g., 5-verse windows or by "Pericope" / Story Heading). |
Pro-Tip for pg10.txt:
Since you mentioned being a "MAKER" and "solopreneur," you can save a lot of compute by only embedding the Chapter summaries first, then using the Knowledge Graph to navigate the specifics. This "Graph-RAG" approach is significantly more accurate than standard embedding search for ancient, complex texts like the Bible.
Which part of the Bible are you planning to "Mine" first with your Idea Miner? Genesis is the easiest for testing genealogy, while Acts is the best for testing geography!
From <https://gemini.google.com/app/506c954d83b3a992> Google Gemini 3.0 (Fast)