Add PDF data extractor demo #206
Conversation
Pull Request Overview
This PR introduces a complete demo application for extracting structured data from PDF files using JSON schemas and AI models. The demo includes both backend infrastructure and a user-friendly web interface.
Key changes:
- Backend Express server with PDF upload and processing endpoints
- Web interface with configuration, schema definition, and file upload capabilities
- Integration with Docker Model Runner for AI-powered data extraction
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| demos/extractor/server.js | Express server providing API endpoints for model listing and PDF data extraction |
| demos/extractor/package.json | Node.js project configuration with required dependencies |
| demos/extractor/demo.html | Frontend web interface for interacting with the PDF extraction service |
| demos/extractor/README.md | Documentation explaining setup, prerequisites, and usage instructions |
| demos/extractor/.gitignore | Git ignore rules for dependencies, uploads, and temporary files |
Comments suppressed due to low confidence (1)
demos/extractor/demo.html:1
- The referenced sample PDF file 'invoice.pdf' is mentioned but not included in the PR. Either include the file or update the documentation to reflect available samples.
<!DOCTYPE html>
🧙 Sourcery has finished reviewing your pull request!
Hey there - I've reviewed your changes - here's some feedback:
Blocking issues:
- A CSRF middleware was not detected in your express application. Ensure you are either using one such as `csurf` or `csrf` (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies. (link)
General comments:
- Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
- Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
- Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider serving demo.html and its assets via Express's static middleware (instead of opening file://) to avoid CORS/file origin issues and streamline the demo setup.
- Extract the inline CSS and JavaScript in demo.html into separate .css and .js files to keep the markup clean and improve maintainability.
- Make the CLIENT_SERVER_URL (and other endpoints) on the client dynamic—e.g., derive from window.location or a config input—so you don't have to manually edit the HTML when pointing to a different server.
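The last point can be sketched as a small client-side helper; the fallback port here is an assumption, not a value from the PR:

```javascript
// Hypothetical sketch: derive the API base URL from the page's own location
// instead of a hard-coded CLIENT_SERVER_URL constant. The fallback port
// (3000) is an assumption; substitute the demo server's actual port.
function apiBaseUrl(loc, fallbackPort = 3000) {
  if (loc.protocol === 'http:' || loc.protocol === 'https:') {
    // Page is served by the Express app itself: reuse its origin.
    return loc.origin;
  }
  // Opened via file:// (or similar): fall back to a local server.
  return `http://localhost:${fallbackPort}`;
}

// In demo.html: const CLIENT_SERVER_URL = apiBaseUrl(window.location);
```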
## Individual Comments
### Comment 1
<location> `demos/extractor/server.js:16-18` </location>
<code_context>
+app.use(express.json());
+
+// Configure multer for file upload
+const upload = multer({
+ dest: 'uploads/',
+ limits: { fileSize: 10 * 1024 * 1024 } // 10MB limit
+});
+
</code_context>
<issue_to_address>
**suggestion (performance):** Uploaded files are stored in a local 'uploads/' directory, which may accumulate files if errors occur before cleanup.
Consider adding a periodic cleanup process or a mechanism to remove orphaned files if the server crashes before cleanup occurs.
</issue_to_address>
### Comment 2
<location> `demos/extractor/server.js:104-110` </location>
<code_context>
+ };
+
+ // Add optional parameters if provided
+ if (temperature !== undefined && temperature !== '') {
+ extractOptions.temperature = parseFloat(temperature);
+ }
+ if (maxTokens !== undefined && maxTokens !== '') {
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Temperature and maxTokens are parsed without validation for NaN or out-of-range values.
Validate 'temperature' and 'maxTokens' to ensure they are numbers and within acceptable ranges before using them.
```suggestion
// Add optional parameters if provided, with validation
if (temperature !== undefined && temperature !== '') {
const tempValue = parseFloat(temperature);
if (!isNaN(tempValue) && tempValue >= 0 && tempValue <= 2) {
extractOptions.temperature = tempValue;
} else {
console.warn(`Invalid temperature value: ${temperature}. Must be a number between 0 and 2.`);
}
}
if (maxTokens !== undefined && maxTokens !== '') {
const maxTokensValue = parseInt(maxTokens, 10);
if (!isNaN(maxTokensValue) && maxTokensValue > 0 && maxTokensValue <= 4096) {
extractOptions.maxTokens = maxTokensValue;
} else {
console.warn(`Invalid maxTokens value: ${maxTokens}. Must be a positive integer up to 4096.`);
}
}
```
</issue_to_address>
### Comment 3
<location> `demos/extractor/server.js:112-113` </location>
<code_context>
+ console.log(`Extracting data from PDF using model: ${model}`);
+ const result = await extractor.extract(extractOptions);
+
+ // Clean up uploaded file
+ await fs.unlink(pdfPath);
+ pdfPath = null;
+
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Uploaded PDF is deleted after extraction, but errors before this point may leave files behind.
Consider using a 'finally' block or centralized cleanup to guarantee file deletion even if errors occur before the current cleanup step.
```suggestion
let result;
try {
console.log(`Extracting data from PDF using model: ${model}`);
result = await extractor.extract(extractOptions);
} finally {
// Clean up uploaded file
if (pdfPath) {
try {
await fs.unlink(pdfPath);
} catch (cleanupErr) {
console.error(`Failed to delete uploaded PDF: ${pdfPath}`, cleanupErr);
}
pdfPath = null;
}
}
```
</issue_to_address>
### Comment 4
<location> `demos/extractor/server.js:154` </location>
<code_context>
+});
+
+// Create uploads directory if it doesn't exist
+const uploadsDir = path.join(__dirname, 'uploads');
+fs.mkdir(uploadsDir, { recursive: true }).catch(console.error);
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Uploads directory is created asynchronously at startup, which may race with incoming requests.
To prevent multer errors, create the uploads directory synchronously before server startup.
</issue_to_address>
### Comment 5
<location> `demos/extractor/server.js:8` </location>
<code_context>
const app = express();
</code_context>
<issue_to_address>
**security (javascript.express.security.audit.express-check-csurf-middleware-usage):** A CSRF middleware was not detected in your express application. Ensure you are either using one such as `csurf` or `csrf` (see rule references) and/or you are properly doing CSRF validation in your routes with a token or cookies.
*Source: opengrep*
</issue_to_address>
@sourcery-ai dismiss
Automated Sourcery review dismissed.
This pull request introduces a new demo application for extracting structured data from PDF files using JSON schemas and AI models. It adds a backend server, project configuration, documentation, and supporting files for the demo.
Given a PDF and a JSON Schema, it returns the structured `json` produced by the LLM.

Summary by Sourcery:

Add a complete PDF Data Extractor demo featuring a web UI and a Node.js backend to configure AI models, define extraction schemas, upload PDFs, and retrieve structured JSON output.