#3938 Software Hazard Report PDF Parser
Provide a Python application to parse a NASA software hazard report in PDF format, generating a CSV with: hazard report title, cause, control, verification, where each report has several causes, several controls per cause, and one or more verifications per control.

This is a classic data extraction problem involving hierarchical data (1 Report --> Many Causes --> Many Controls --> Many Verifications) embedded in a flat PDF format. Because NASA hazard reports vary in format (for example, text-heavy documents versus tabular forms such as NF 1707), a robust approach is a state machine parser. The script reads the PDF text line by line, detects keywords (like "Cause:", "Control:"), and maintains context to flatten the hierarchy into a CSV.

Prerequisites

You will need the pdfplumber library, which generally preserves text layout better than PyPDF2 and can also extract tables from tabular forms.

pip install pdfplumber pandas ...
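The state machine idea described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the label patterns ("Title:", "Cause 1:", "Control 1.1:", "Verification 1.1.1:") are assumptions about how a report might be laid out, and the names `LABELS`, `match_label`, `parse_lines`, `pdf_lines`, and `write_csv` are hypothetical helpers introduced for this example. Real reports will likely need different patterns, and tabular forms may need table extraction instead of line matching.

```python
import csv
import re

# Hierarchy levels, from outermost to innermost.
ORDER = ["title", "cause", "control", "verification"]

# ASSUMED label patterns; adjust to the wording of the actual report.
# Each allows an optional identifier such as "1" or "1.2" before the colon.
LABELS = {
    "title": re.compile(r"^(?:Hazard\s+(?:Report\s+)?)?Title\s*:\s*(.*)$", re.I),
    "cause": re.compile(r"^Cause(?:\s+[\w.\-]+)?\s*:\s*(.*)$", re.I),
    "control": re.compile(r"^Control(?:\s+[\w.\-]+)?\s*:\s*(.*)$", re.I),
    "verification": re.compile(r"^Verification(?:\s+[\w.\-]+)?\s*:\s*(.*)$", re.I),
}


def match_label(line):
    """Return (field, text) if the line starts a new field, else None."""
    for field, pattern in LABELS.items():
        m = pattern.match(line)
        if m:
            return field, m.group(1).strip()
    return None


def parse_lines(lines):
    """Flatten Title > Cause > Control > Verification into one row per verification."""
    ctx = dict.fromkeys(ORDER, "")  # current value at each hierarchy level
    state = None                    # field that continuation lines belong to
    rows = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        hit = match_label(line)
        if hit is None:
            # Unlabelled line: treat it as a wrapped continuation of the current field.
            if state is not None:
                ctx[state] = (ctx[state] + " " + line).strip()
            continue
        field, text = hit
        # A verification is complete once the next label starts.
        if state == "verification":
            rows.append(dict(ctx))
        # Starting a new item at this level invalidates everything below it.
        for lower in ORDER[ORDER.index(field):]:
            ctx[lower] = ""
        ctx[field] = text
        state = field
    if state == "verification":
        rows.append(dict(ctx))
    return rows


def pdf_lines(pdf_path):
    """Yield text lines from every page of the PDF."""
    import pdfplumber  # imported lazily so parse_lines can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            yield from text.splitlines()


def write_csv(rows, csv_path):
    """Write the flattened rows with one column per hierarchy level."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=ORDER)
        writer.writeheader()
        writer.writerows(rows)
```

With a report saved as, say, `report.pdf` (a placeholder name), the pieces would be combined as `write_csv(parse_lines(pdf_lines("report.pdf")), "hazards.csv")`. Keeping `parse_lines` independent of pdfplumber means the state machine can be exercised on plain lists of strings. Note that in this sketch a control with no verification produces no row; whether such gaps should be reported is a design decision left open here.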