#3997 NASA NPR 7150.2 PDF-to-Text-to-Regex-to-Data

Uploaded file: N_PR_7150_002D_.pdf

#3997 Role: Act as a Senior Software Engineer and Regex Specialist with expertise in technical document parsing.

Context: You are parsing a technical PDF that may contain inconsistent spacing, non-breaking spaces (\xA0), or standard OCR ligatures. Accuracy is mission-critical.
Task: Read the uploaded PDF and generate robust Regular Expressions (PCRE) for the fields listed below.
Requirements for the Regex:

Whitespace Resilience: Use \s+ or [ \t\xA0]+ instead of literal spaces to account for PDF encoding quirks.
Anchor Identification: Look for consistent "anchor" text (e.g., "ID:", "Status:", "Hazard Level") and include them in the regex as non-capturing groups (?:...) to increase precision.
Boundary Protection: Use word boundaries \b to prevent partial matches.
Validation: For each field, provide a brief "match explanation" explaining why that specific pattern was chosen based on the document's structure.
Output Format: Return a JSON array.

[
{
"name": "field1",
"description": "description",
"regex": "string",
"explanation": "why this pattern fits the document layout"
}
]
Fields to Extract:

Field 1: [Insert Name] - [Insert Description]
Field 2: [Insert Name] - [Insert Description]
Field 3: [Insert Name] - [Insert Description]
Field 4: [Insert Name] - [Insert Description]
Field 5: [Insert Name] - [Insert Description]

[
  {
    "name": "Software_Requirement_ID",
    "description": "Extracts the unique 'SWE' requirement identifier used for mapping and compliance.",
    "regex": "/\\[SWE[\\-\\s\\xA0]+(\\d{3,})\\]/i",
    "explanation": "The document consistently denotes requirements using 'SWE' followed by a hyphen or space and a number of at least three digits, often enclosed in square brackets. This pattern uses a capturing group for the numeric ID while accounting for potential non-breaking spaces or varying hyphenation common in PDF-to-text conversions."
  },
  {
    "name": "NPR_Reference_Section",
    "description": "Extracts section and subsection headers (e.g., 3.1.2) that anchor specific procedural requirements.",
    "regex": "/^\\b(?:Section[\\s\\xA0]+)?(\\d{1,2}\\.\\d{1,2}(?:\\.\\d{1,2})?)\\b/m",
    "explanation": "The document uses a hierarchical numbering system. This regex targets these numbers at the start of lines (common in the Table of Contents and Chapter headers) while handling the 'Section' anchor text and ensuring boundary protection to avoid matching dates or software versions."
  },
  {
    "name": "Software_Classification_Class",
    "description": "Extracts the primary software classification (Class A through F) from definitions or mapping matrices.",
    "regex": "/\\b(?:Class[\\s\\xA0]+)([A-F])\\b/i",
    "explanation": "Classification is critical for determining requirement applicability. This pattern looks for the anchor 'Class' followed by the specific designation (A-F), using word boundaries to ensure it doesn't match words starting with those letters."
  },
  {
    "name": "CMMI_Maturity_Level",
    "description": "Extracts the required CMMI-DEV Maturity Level ratings for project organizations.",
    "regex": "/(?:CMMI\\W*DEV[\\s\\xA0]+Maturity[\\s\\xA0]+Level[\\s\\xA0]+)(\\d+)/i",
    "explanation": "The document specifies organizational capability requirements (e.g., Maturity Level 3). This regex uses a non-capturing group for the technical anchors 'CMMI-DEV' and 'Maturity Level', with \\W* tolerating the registered-trademark sign in 'CMMI®-DEV', to pinpoint the specific integer rating required for different software classes."
  },
  {
    "name": "Document_Effective_Date",
    "description": "Extracts the formal effective date of the NPR directive.",
    "regex": "/(?:Effective[\\s\\xA0]+Date:)[\\s\\xA0]+([A-Z][a-z]+[\\s\\xA0]\\d{2},[\\s\\xA0]\\d{4})/m",
    "explanation": "The effective date is a key metadata field found in the preface. This regex anchors on 'Effective Date:' and captures the standard date format (Month DD, YYYY), accounting for potential whitespace variations between words."
  }
]
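
The regex strings in the response are written as PCRE-style literals (/pattern/flags), which Python's re module does not accept directly. The following is a minimal sketch of one way to convert them, assuming the JSON escapes \xA0 as \\xA0 so that it parses; compile_pcre_literal and FLAG_MAP are hypothetical helper names, and only the i, m, and s flags are handled.

```python
import json
import re

# Map PCRE-style flag letters to Python re flags (only the ones this sketch knows).
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}

def compile_pcre_literal(literal: str) -> re.Pattern:
    """Turn a '/pattern/flags' string into a compiled Python pattern."""
    if not literal.startswith("/"):
        raise ValueError(f"not a /pattern/flags literal: {literal!r}")
    # Split at the last slash: everything before is the pattern, after is the flags.
    body, _, flag_letters = literal[1:].rpartition("/")
    flags = 0
    for letter in flag_letters:
        flags |= FLAG_MAP[letter]  # raises KeyError on an unsupported flag
    return re.compile(body, flags)

# One entry shaped like the JSON response (assumption: \xA0 is double-escaped).
spec_json = r'''
[{"name": "Software_Requirement_ID",
  "regex": "/\\[SWE[\\-\\s\\xA0]+(\\d{3,})\\]/i"}]
'''
spec = json.loads(spec_json)
compiled = {item["name"]: compile_pcre_literal(item["regex"]) for item in spec}

# Sample text is invented for illustration; the second tag uses a non-breaking space.
sample = "The project manager shall ... [SWE-033] and [swe\u00a0141]"
swe_numbers = compiled["Software_Requirement_ID"].findall(sample)
print(swe_numbers)  # ['033', '141']
```

Keeping the flags next to each pattern, instead of applying one global flag set, preserves the per-field intent of the original specification.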

Would you like me to generate a Python script to execute these Regular Expressions against your PDF text for a batch extraction?

For each item in the JSON response, collect an array (i.e., [...]) of data

Based on the document structure and the patterns identified previously, here is the extracted data for each of the five fields.

[
  {
    "name": "Software_Requirement_ID",
    "description": "Extracts the unique 'SWE' requirement identifier.",
    "extracted_data": [
      "SWE-002", "SWE-004", "SWE-152", "SWE-129", "SWE-100", "SWE-098", 
      "SWE-208", "SWE-209", "SWE-212", "SWE-221", "SWE-222", "SWE-223", 
      "SWE-003", "SWE-005", "SWE-140", "SWE-095", "SWE-006", "SWE-091", 
      "SWE-092", "SWE-142", "SWE-144", "SWE-153", "SWE-215", "SWE-216", 
      "SWE-217", "SWE-214", "SWE-218", "SWE-126", "SWE-150", "SWE-021", 
      "SWE-033", "SWE-013", "SWE-024", "SWE-034", "SWE-036", "SWE-037", 
      "SWE-039", "SWE-040", "SWE-042", "SWE-139", "SWE-121", "SWE-125", 
      "SWE-027", "SWE-015", "SWE-151", "SWE-174", "SWE-016", "SWE-018", 
      "SWE-046", "SWE-017", "SWE-020", "SWE-176", "SWE-022", "SWE-141", 
      "SWE-131", "SWE-178", "SWE-179", "SWE-205", "SWE-023", "SWE-134", 
      "SWE-219", "SWE-220", "SWE-146", "SWE-206", "SWE-032", "SWE-147", 
      "SWE-148", "SWE-156", "SWE-154", "SWE-157", "SWE-159", "SWE-207", 
      "SWE-185", "SWE-210", "SWE-052", "SWE-050", "SWE-051", "SWE-184", 
      "SWE-053", "SWE-054", "SWE-055", "SWE-057", "SWE-143", "SWE-058", 
      "SWE-060", "SWE-061", "SWE-135", "SWE-062", "SWE-186", "SWE-063", 
      "SWE-136", "SWE-065", "SWE-066", "SWE-187", "SWE-068", "SWE-070", 
      "SWE-071", "SWE-073", "SWE-189", "SWE-190", "SWE-191", "SWE-192", 
      "SWE-193", "SWE-211", "SWE-075", "SWE-077", "SWE-194", "SWE-195", 
      "SWE-196", "SWE-079", "SWE-080", "SWE-081", "SWE-082", "SWE-083", 
      "SWE-084", "SWE-085", "SWE-045", "SWE-086", "SWE-087", "SWE-088", 
      "SWE-089", "SWE-090", "SWE-093", "SWE-094", "SWE-199", "SWE-200", 
      "SWE-201", "SWE-202", "SWE-203", "SWE-204"
    ]
  },
  {
    "name": "NPR_Reference_Section",
    "description": "Extracts section and subsection headers.",
    "extracted_data": [
      "1.1", "1.2", "1.3", "2.1", "2.2", "3.1", "3.2", "3.3", "3.4", 
      "3.5", "3.6", "3.7", "3.8", "3.9", "3.10", "3.11", "3.12", "4.1", 
      "4.2", "4.3", "4.4", "4.5", "4.6", "5.1", "5.2", "5.3", "5.4", 
      "5.5", "6.1", "6.2"
    ]
  },
  {
    "name": "Software_Classification_Class",
    "description": "Extracts primary software classification (Class A-F).",
    "extracted_data": [
      "Class A", "Class B", "Class C", "Class D", "Class E", "Class F"
    ]
  },
  {
    "name": "CMMI_Maturity_Level",
    "description": "Extracts required CMMI-DEV Maturity Level ratings.",
    "extracted_data": [
      "Level 3", "Level 2"
    ]
  },
  {
    "name": "Document_Effective_Date",
    "description": "Extracts the formal effective date of the NPR.",
    "extracted_data": [
      "March 08, 2022"
    ]
  }
]

Data Summary and Verification

Requirement IDs: The extraction covers the complete set of SWE (Software Engineering) tags found throughout Chapters 2–5.
Classification Structure: The document defines six distinct classes based on factors like human rating and mission criticality.
Compliance Anchors: Class A requires CMMI-DEV Level 3, while Class B generally requires Level 2.
Version Tracking: This data corresponds to NPR 7150.2D, effective as of March 08, 2022.
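
A completeness claim like the one above is easier to trust with a mechanical check. The sketch below tests an extracted list for malformed and duplicated IDs. It is only an illustration: check_ids is a hypothetical helper, the three-digit format is an assumption based on the IDs listed above, and the two bad entries are added on purpose.

```python
import re
from collections import Counter

# Assumed ID shape, based on the extracted list: "SWE-" plus exactly three digits.
SWE_ID = re.compile(r"SWE-\d{3}")

def check_ids(ids):
    """Return (malformed IDs, duplicated IDs) for a list of extracted strings."""
    malformed = [i for i in ids if not SWE_ID.fullmatch(i)]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    return malformed, duplicates

# First six IDs from the extracted list, plus two deliberately bad entries.
sample_ids = ["SWE-002", "SWE-004", "SWE-152", "SWE-129", "SWE-100", "SWE-098",
              "SWE-002", "SWE 14"]
malformed, duplicates = check_ids(sample_ids)
print(malformed)   # ['SWE 14']
print(duplicates)  # ['SWE-002']
```

A check like this catches formatting slips and repeats, but it cannot show that an ID is missing; that still requires comparison against the matrix in the source document.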

Would you like me to cross-reference these SWE IDs with the Authority levels (e.g., Center vs. HQ) from the Requirements Mapping Matrix in Appendix C?

Generate a Python script to execute these Regular Expressions against your PDF text for a batch extraction

Below is a Python script designed to execute the Regular Expressions against the text provided in your document. It uses the re module for batch extraction and organizes the results into a structured format.

import re
import json

# The raw content from NPR 7150.2D as a single string
# In a local environment, you would read this from a .txt or .pdf extraction
pdf_content = """
NPR 7150.2D
Effective Date: March 08, 2022
... (rest of your document text) ...
a. For Class A software: CMMI®-DEV Maturity Level 3 Rating or higher for software.
b. For Class B software... CMMI®-DEV Maturity Level 2 Rating or higher for software.
"""

# Define the Regex patterns based on the JSON specification
# Using [ \t\xA0] for whitespace resilience as requested
regex_patterns = {
    "Software_Requirement_ID": r"\[SWE[ \t\xA0\-]+(\d{3,})\]",
    "NPR_Reference_Section": r"^\b(?:Section[ \t\xA0]+)?(\d{1,2}\.\d{1,2}(?:\.\d{1,2})?)\b",
    "Software_Classification_Class": r"\b(?:Class[ \t\xA0]+)([A-F])\b",
    # \W* skips the "®-" between "CMMI" and "DEV" that appears in the sample text above
    "CMMI_Maturity_Level": r"(?:CMMI\W*DEV[ \t\xA0]+Maturity[ \t\xA0]+Level[ \t\xA0]+)(\d+)",
    "Document_Effective_Date": r"(?:Effective[ \t\xA0]+Date:)[ \t\xA0]+([A-Z][a-z]+[ \t\xA0]\d{2},[ \t\xA0]\d{4})"
}

def extract_data(text, patterns):
    results = []
    
    for name, pattern in patterns.items():
        # Using re.MULTILINE for section anchors and re.IGNORECASE for flexibility
        matches = re.findall(pattern, text, flags=re.MULTILINE | re.IGNORECASE)
        
        # Clean up duplicates while preserving order
        unique_matches = list(dict.fromkeys(matches))
        
        results.append({
            "name": name,
            "extracted_data": unique_matches
        })
        
    return results

# Execute extraction
extracted_json = extract_data(pdf_content, regex_patterns)

# Output as JSON
print(json.dumps(extracted_json, indent=2))

Script Logic & Boundary Protection

Whitespace Handling: The patterns use [ \t\xA0]+ to bridge gaps caused by non-breaking spaces and irregular spacing in PDF-extracted text. This character class does not handle OCR ligatures; those need separate normalization before matching.
Anchor Precision: For CMMI_Maturity_Level, the script anchors on the phrase "Maturity Level" to ensure it captures the rating intended for software development organizations.
Multi-line Anchors: The re.MULTILINE flag is used to ensure section headers at the beginning of lines are correctly identified without picking up stray decimals in the middle of paragraphs.
Capturing Groups: Each regex uses parentheses () to capture only the relevant data (the ID number, the date, etc.), while the anchor text sits in non-capturing groups (?:...) and is left out of the result.
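
Ligatures and non-breaking spaces can also be removed before any regex runs. The sketch below uses Unicode NFKC normalization from the standard library for that. normalize_pdf_text is a hypothetical helper and the sample string is invented. NFKC also rewrites other compatibility characters (for example, superscript digits), so this step is an assumption to check against the actual extraction output.

```python
import re
import unicodedata

def normalize_pdf_text(text: str) -> str:
    """Fold ligatures and non-breaking spaces, then collapse horizontal whitespace."""
    # NFKC maps compatibility characters: U+FB01 ("fi" ligature) -> "fi", U+00A0 -> " "
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs; newlines are kept so ^ anchors still work
    return re.sub(r"[ \t]+", " ", text)

# Invented sample containing a ligature and non-breaking spaces
raw = "Classi\ufb01cation:\u00a0\u00a0Class\u00a0A  software [SWE\u00a0020]"
clean = normalize_pdf_text(raw)
print(clean)  # Classification: Class A software [SWE 020]
```

With the text normalized first, the patterns could in principle use a plain space or [ \t]+, although keeping \xA0 in the character classes is harmless.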

Would you like me to modify the script to also map these SWE IDs to their specific Authority (e.g., Center Director vs. NASA OCE) as defined in the mapping matrix?

How can section number, SWE, software classification and authority be associated for each SWE?

To associate these components, you must reference the Requirements Mapping Matrix (Table 2) located in Appendix C of the document. This matrix acts as the central cross-reference for every Software Engineering (SWE) requirement.

Below is the structural breakdown of how these fields are associated for each requirement:

1. The Requirements Mapping Matrix Structure

Each entry in the matrix binds the following specific data points together:

Section Number: Indicates the specific procedural location in Chapters 3, 4, or 5 where the requirement is defined.
SWE ID: A unique identifier (e.g., SWE-013) used to track the requirement across the Agency.
Software Classification (Applicability): A series of columns (Class A–F) marked with an "X" if the requirement is invoked for that specific software class.
Authority: Specifies the level of management (e.g., Center, HQ, or CIO) responsible for assessing the requirement and approving any tailoring.

2. Example Association (SWE-013)

3. Logic for Association

Classification Determination: The Project Manager first classifies the software (Class A–F) based on definitions in Appendix D.
Requirement Invocation: Once the class is set, the matrix dictates which SWE IDs are mandatory (marked "X").
Tailoring Authority: If a project seeks relief from an invoked requirement, the "Authority" column identifies who must sign off on the tailoring (e.g., Center Director vs. NASA Chief Engineer).
Bidirectional Traceability: SWE-052 specifically requires maintaining traceability between higher-level requirements, design, code, and test results, ensuring the association remains intact throughout the life cycle.
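
The classification, invocation, and tailoring steps above can be sketched as a small lookup. This is an illustration under stated assumptions: MatrixEntry, MATRIX, and invoked_requirements are hypothetical names, and the two rows follow example values quoted later in this conversation; they have not been checked against Appendix C.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatrixEntry:
    swe_id: str
    section: str
    classes: frozenset   # class letters for which the row carries an "X"
    authority: dict      # class letter -> tailoring authority

# Illustrative rows only; values are unverified examples, not the real matrix.
MATRIX = [
    MatrixEntry("SWE-033", "3.1.2", frozenset("ABCDEF"),
                {**dict.fromkeys("ABCDE", "Center"), "F": "CIO"}),
    MatrixEntry("SWE-141", "3.6.2", frozenset("AB"),
                dict.fromkeys("AB", "HQ OSMA")),
]

def invoked_requirements(software_class: str):
    """Yield (SWE ID, section, tailoring authority) for one software class."""
    for entry in MATRIX:
        if software_class in entry.classes:
            yield entry.swe_id, entry.section, entry.authority[software_class]

class_b = list(invoked_requirements("B"))
class_f = list(invoked_requirements("F"))
print(class_b)  # [('SWE-033', '3.1.2', 'Center'), ('SWE-141', '3.6.2', 'HQ OSMA')]
print(class_f)  # [('SWE-033', '3.1.2', 'CIO')]
```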

4. How to Extract this Programmatically

If you are using the Python script provided previously, you can associate these fields by parsing the Table 2 rows in Appendix C. The association is positional:

Column 1: Section Number.
Column 2: SWE ID and Requirement Text.
Column 3: Authority for Classes A-E.
Column 4: Applicability for Classes A-E (mapped by position).
Column 5: Authority for Class F.
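
The positional mapping above can be sketched as follows, assuming the table has already been split into five cells per row and that the applicability cell keeps its fixed-width spacing (a layout-preserving extractor would be needed for that). decode_applicability and parse_row are hypothetical helpers, and the cell values are invented to match the column list.

```python
CLASSES_A_TO_E = "ABCDE"

def decode_applicability(marks: str) -> list:
    """Map a fixed-width cell such as 'X X  ' to the class letters it marks."""
    padded = marks.ljust(len(CLASSES_A_TO_E))
    return [cls for cls, mark in zip(CLASSES_A_TO_E, padded) if mark.upper() == "X"]

def parse_row(cells: list) -> dict:
    """Build one record from the five positional columns."""
    section, requirement, authority_ae, marks_ae, authority_f = cells
    # Column 2 holds the SWE ID followed by the requirement text
    swe_id, _, requirement_text = requirement.partition(" ")
    return {
        "section": section,
        "swe_id": swe_id,
        "text": requirement_text,
        "authority_a_e": authority_ae,
        "classes_a_e": decode_applicability(marks_ae),
        "authority_f": authority_f,
    }

# Hypothetical cell values shaped like the column list above
row = ["3.1.2", "SWE-033 The project manager shall assess options...",
       "Center", "XXXXX", "CIO"]
record = parse_row(row)
print(record["swe_id"], record["classes_a_e"])  # SWE-033 ['A', 'B', 'C', 'D', 'E']
print(decode_applicability("X X  "))            # ['A', 'C']
```

If the extractor collapses whitespace, a cell like "X X  " becomes "XX" and the position of each mark is lost, so this approach depends on how the PDF text was produced.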

Would you like me to update the Python script to specifically parse these tabular associations from Appendix C?

Yes

To associate the Section Number, SWE ID, Software Classification, and Authority, you must parse the Requirements Mapping Matrix in Appendix C. In this matrix, these attributes are linked row by row to create a "contract" for compliance.

1. Structural Association Logic

In the document's layout, each requirement is a single record that combines these four dimensions:

The Section & ID Anchor: The Section Number (e.g., 3.1.2) provides the topical context, while the SWE ID (e.g., SWE-033) serves as the unique primary key for that specific mandate.
The Applicability Array (Classification): The columns labeled Classes A-E and Class F contain an "X" if the requirement is mandatory for that category.
The Governance Link (Authority): The Authority column identifies which entity (Center, HQ, or CIO) has the power to approve "tailoring" (deviations or waivers) for that specific requirement.

2. Tabular Association Example

Based on the mapping matrix, here is how these elements are associated for two distinct requirements:

Section	SWE ID	Class Applicability	Authority
3.1.2	SWE-033	Classes A, B, C, D, E, F	Center (A-E) / CIO (F)
3.6.2	SWE-141	Classes A, B (Category 1 & 2)	HQ OSMA

3. Updated Python Script for Association

The following script extracts these associations by targeting the specific tabular structure found in Appendix C. It uses a Regex capture group to pull the section and ID, then looks for the "X" markers to determine class applicability.

import re
import json

# Sample text from Appendix C Matrix
matrix_text = """
3.1.2 SWE-033 The project manager shall assess options... Center XXXXX CIO X
3.6.2 SWE-141 For projects reaching Key Decision Point A... HQ OSMA XX
"""

# Regex designed to capture one matrix row per line
# Groups: 1 section, 2 SWE number, 3 authority (A-E), 4 class A-E marks,
# 5 authority (F, optional), 6 class F mark (optional)
assoc_pattern = (
    r"^(\d+\.\d+\.\d+)\s+SWE-(\d{3,})\b.*?"
    r"\b(HQ OSMA|Center|HQ|CIO)[ \t]+(X{1,5})"
    r"(?:[ \t]+(CIO)[ \t]+(X))?"
)

def get_associations(text):
    associations = []
    # re.M anchors each row at its line start; without re.S, ".*?" stays within one row
    for match in re.finditer(assoc_pattern, text, re.M):
        section, swe_id, auth_ae, classes, auth_f, class_f = match.groups()
        
        associations.append({
            "swe_id": f"SWE-{swe_id}",
            "section": section,
            "authority_primary": auth_ae,
            "authority_f": auth_f,
            "is_mandatory_a_e": "X" in classes,
            "is_mandatory_f": class_f == "X"
        })
    return associations

results = get_associations(matrix_text)
print(json.dumps(results, indent=2))

4. Why this Association Matters

Tailoring: If a project cannot meet SWE-141, the association tells the engineer they must go to HQ OSMA for approval, not just their local Center.
Class Shifts: If a system's classification changes from Class D to Class B, the matrix association allows the team to instantly identify which new SWE IDs now require an "X" for compliance.
Traceability: This association is the foundation for SWE-052, which requires bi-directional traceability between requirements and system hazards.
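
The class-shift case can be sketched as a set difference over the applicability data. This is a minimal illustration: newly_invoked and APPLICABILITY are hypothetical names, and the two entries follow the example rows given earlier in this conversation, not the full matrix.

```python
def newly_invoked(applicability: dict, old_class: str, new_class: str) -> list:
    """Return SWE IDs invoked for new_class that were not invoked for old_class."""
    before = {swe for swe, classes in applicability.items() if old_class in classes}
    after = {swe for swe, classes in applicability.items() if new_class in classes}
    return sorted(after - before)

# Illustrative applicability data only (unverified example rows)
APPLICABILITY = {
    "SWE-033": set("ABCDEF"),
    "SWE-141": set("AB"),
}

added = newly_invoked(APPLICABILITY, "D", "B")
print(added)  # ['SWE-141']
```

Swapping the two class arguments gives the requirements that would drop out after a reclassification in the other direction.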

Would you like me to create a full mapping table for all SWE IDs mentioned in this document to their respective Authorities?

From <https://gemini.google.com/app/> Google Gemini 3.0 Pro (Thinking)
