IPO Lock-in Processor v2.0 - Step 1 Implementation

What This Replicates

This new implementation exactly replicates the following processes from your current project:

✅ Text Extraction (from `pdf_to_text.py`)

- Uses pdf-layout-tool-1.0.0-all.jar (layout-preserving extraction)
- Creates *_java.txt files from PDFs
- Also creates *_pdfplumber.txt files as a backup
- Handles scanned images (warns when text <100 bytes)
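
The scanned-image check above can be sketched as follows. This is a minimal illustration, not the code from `pdf_to_text.py`: the function names are hypothetical, and stripping whitespace before measuring is an assumption. Only the 100-byte threshold comes from this README.

```python
from pathlib import Path

# Threshold taken from this README ("warns when text <100 bytes").
MIN_TEXT_BYTES = 100


def looks_scanned(text: str, min_bytes: int = MIN_TEXT_BYTES) -> bool:
    """Return True when the extracted text is too short to be a real text layer."""
    # Assumption: whitespace-only output should count as empty, so strip first.
    return len(text.strip().encode("utf-8")) < min_bytes


def check_txt_file(txt_path: Path) -> bool:
    """Warn and return True if an extracted TXT file suggests a scanned PDF."""
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    if looks_scanned(text):
        print(f"WARNING: {txt_path.name} has under {MIN_TEXT_BYTES} bytes of text; "
              "the PDF may be a scanned image")
        return True
    return False
```

A file flagged this way would likely need OCR or manual review before parsing.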

✅ PNG Generation

- NSE: Extracts the last page from the lock-in PDF
- BSE: Extracts all pages and stitches them vertically
- Uses PyMuPDF (fitz) + Pillow (PIL)
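
The page-selection and stitching rules can be illustrated without touching the PDF libraries. The sketch below is pure Python with hypothetical function names: it only decides which pages to render and where each rendered page would sit on the stitched canvas. Using the widest page as the canvas width is an assumption.

```python
from typing import List, Tuple


def pages_to_render(exchange: str, page_count: int) -> List[int]:
    """Zero-based page indices to render for a lock-in PDF."""
    if page_count < 1:
        raise ValueError("PDF has no pages")
    exchange = exchange.upper()
    if exchange == "NSE":
        return [page_count - 1]         # last page only
    if exchange == "BSE":
        return list(range(page_count))  # every page, in order
    raise ValueError(f"Unknown exchange: {exchange}")


def stitched_layout(sizes: List[Tuple[int, int]]) -> Tuple[Tuple[int, int], List[int]]:
    """Given (width, height) per rendered page, return the canvas size and
    the y offset at which each page would be pasted."""
    if not sizes:
        raise ValueError("no pages to stitch")
    canvas_width = max(width for width, _ in sizes)  # assumption: widest page wins
    offsets, y = [], 0
    for _, height in sizes:
        offsets.append(y)
        y += height
    return (canvas_width, y), offsets
```

With Pillow, the canvas could then be created with `Image.new("RGB", size, "white")` and each page image pasted at `(0, offset)` with `Image.paste`.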

✅ Folder Structure

downloads/
    nse/pdf/lockin/        *.pdf
    nse/pdf/lockin/txt/    *_java.txt, *_pdfplumber.txt
    nse/pdf/lockin/png/    *.png
    nse/pdf/shp/           *.pdf
    nse/pdf/shp/txt/       *_java.txt, *_pdfplumber.txt
    bse/pdf/lockin/        *.pdf
    bse/pdf/lockin/txt/    *_java.txt, *_pdfplumber.txt
    bse/pdf/lockin/png/    *.png
    bse/pdf/shp/           *.pdf
    bse/pdf/shp/txt/       *_java.txt, *_pdfplumber.txt
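
The layout above maps cleanly onto a few `pathlib` helpers. This is a sketch under the assumption that paths are relative to a `downloads/` root; the helper names are hypothetical and may not match `config_new.py`.

```python
from pathlib import Path

DOWNLOADS = Path("downloads")  # root folder from the layout above


def doc_dir(exchange: str, doc_type: str) -> Path:
    """Folder holding the PDFs for an exchange ('nse'/'bse') and type ('lockin'/'shp')."""
    exchange, doc_type = exchange.lower(), doc_type.lower()
    if exchange not in ("nse", "bse"):
        raise ValueError(f"Unknown exchange: {exchange}")
    if doc_type not in ("lockin", "shp"):
        raise ValueError(f"Unknown document type: {doc_type}")
    return DOWNLOADS / exchange / "pdf" / doc_type


def txt_path(pdf_path: Path, extractor: str) -> Path:
    """Text output path; extractor is 'java' or 'pdfplumber'."""
    return pdf_path.parent / "txt" / f"{pdf_path.stem}_{extractor}.txt"


def png_path(pdf_path: Path) -> Path:
    """PNG output path (the layout shows png/ only under lockin/)."""
    return pdf_path.parent / "png" / f"{pdf_path.stem}.png"
```

For example, `txt_path(doc_dir("bse", "lockin") / "544324-CITICHEM-Annexure-I.pdf", "java")` yields `downloads/bse/pdf/lockin/txt/544324-CITICHEM-Annexure-I_java.txt`.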

✅ Filename Patterns

- NSE: AAKAAR-CML68761.pdf → unique_symbol = NSE:AAKAAR
- BSE: 544324-CITICHEM-Annexure-I.pdf → unique_symbol = BSE:544324
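
A possible way to derive `unique_symbol` from these filenames is sketched below. The function name is hypothetical, and the parsing rules are assumptions inferred from the two examples: for NSE, everything before the last hyphen is the symbol; for BSE, the leading digits before the first hyphen are the scrip code.

```python
import re
from pathlib import Path


def unique_symbol(filename: str, exchange: str) -> str:
    """Build 'NSE:<symbol>' or 'BSE:<code>' from a lock-in PDF filename."""
    stem = Path(filename).stem
    exchange = exchange.upper()
    if exchange == "NSE":
        # Assumption: split on the LAST hyphen so hyphenated symbols survive.
        symbol, sep, _ = stem.rpartition("-")
        if not sep or not symbol:
            raise ValueError(f"Cannot parse NSE filename: {filename}")
        return f"NSE:{symbol}"
    if exchange == "BSE":
        # Assumption: the scrip code is the run of digits before the first hyphen.
        match = re.match(r"(\d+)-", stem)
        if not match:
            raise ValueError(f"Cannot parse BSE filename: {filename}")
        return f"BSE:{match.group(1)}"
    raise ValueError(f"Unknown exchange: {exchange}")
```

Raising `ValueError` on an unparseable name keeps bad files from silently receiving a wrong symbol.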

Installation

Copy .env configuration:

cp .env.example ../.env
# Edit ../.env with your database credentials

Install dependencies:

pip install pymupdf pillow pdfplumber python-dotenv mysql-connector-python

Ensure the Java JAR exists: place pdf-layout-tool-1.0.0-all.jar in the parent directory (ScripUnlockDetails/).

Usage

Single File Processing

# NSE file (dry-run mode)
python app.py AAKAAR-CML68761.pdf --nse --dryrun

# BSE file (actual processing)
python app.py 544324-CITICHEM-Annexure-I.pdf --bse

# Skip database operations
python app.py 544324-CITICHEM-Annexure-I.pdf --bse --nodb

# With GEMINI extraction (requires --GEMAPPROVED flag)
python app.py 544324-CITICHEM-Annexure-I.pdf --bse --GEMAPPROVED
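
For orientation, the flags used in these commands could be declared with `argparse` roughly as below. This is a sketch, not the actual `app.py` parser: treating `--nse` and `--bse` as mutually exclusive and making the PDF argument optional (for batch mode) are assumptions, as are the help strings and `dest` names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the command-line flags shown in this README."""
    parser = argparse.ArgumentParser(prog="app.py", description="IPO Lock-in Processor")
    # Optional positional: omitted in batch mode (assumption).
    parser.add_argument("pdf", nargs="?", help="lock-in PDF filename")
    # Exactly one exchange flag must be given (assumption).
    exchange = parser.add_mutually_exclusive_group(required=True)
    exchange.add_argument("--nse", action="store_const", const="NSE", dest="exchange")
    exchange.add_argument("--bse", action="store_const", const="BSE", dest="exchange")
    parser.add_argument("--dryrun", action="store_true", help="show steps without running them")
    parser.add_argument("--nodb", action="store_true", help="skip database operations")
    parser.add_argument("--GEMAPPROVED", action="store_true", dest="gem_approved",
                        help="allow GEMINI extraction")
    parser.add_argument("--verbose", action="store_true", help="print extra detail")
    return parser
```

Calling `build_parser().parse_args(["AAKAAR-CML68761.pdf", "--nse", "--dryrun"])` would give `exchange == "NSE"` and `dryrun == True`.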

Batch Processing (Coming in Next Step)

# Process all NSE files
python app.py --nse

# Process all BSE files
python app.py --bse --dryrun

What Step 1 Does

✅ Validates file paths
- Checks lock-in PDF exists
- Checks SHP PDF exists
- Parses filename to extract symbol/code
✅ Generates text files
- Creates *_java.txt using pdf-layout-tool JAR
- Creates *_pdfplumber.txt using pdfplumber
- Detects scanned images (warns if text <100 bytes)
✅ Generates PNG
- NSE: Last page only
- BSE: All pages stitched vertically
✅ Connects to database
- Tests connection with retry logic
- Skipped if --nodb flag used
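
The connection test with retry could look something like the sketch below. It is not the code in `db.py`: the function name, the attempt count, and the linear back-off are assumptions, and the `connect` callable stands in for whatever opens the real connection.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T],
                       attempts: int = 3,
                       delay: float = 1.0,
                       nodb: bool = False,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call connect() up to `attempts` times; return None when --nodb is set."""
    if nodb:
        return None  # database step skipped entirely
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # a real implementation would catch the driver's error type
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # assumed linear back-off between attempts
    raise ConnectionError(f"Database unreachable after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.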

Output Example

======================================================================
IPO Lock-in Processor v2.0
======================================================================
Exchange: BSE
File: 544324-CITICHEM-Annexure-I.pdf
Mode: NORMAL
======================================================================

STEPS:
  1. ⚙ Validating file paths
  2. ✓ Lock-in PDF found: downloads/bse/pdf/lockin/544324-CITICHEM-Annexure-I.pdf
  3. ✓ SHP PDF found: downloads/bse/pdf/shp/544324-CITICHEM-Annexure-II.pdf
  4. ✓ Parsed symbol: BSE:544324
  5. ⚙ Extracting Lock-in text with Java...
  6. ✓ Java TXT created: downloads/bse/pdf/lockin/txt/544324-CITICHEM-Annexure-I_java.txt (15432 chars)
  7. ⚙ Extracting Lock-in text with PDFPlumber...
  8. ✓ PDFPlumber TXT created: downloads/bse/pdf/lockin/txt/544324-CITICHEM-Annexure-I_pdfplumber.txt (14987 chars)
  9. ⚙ Extracting SHP text with Java...
  10. ✓ Java TXT created: downloads/bse/pdf/shp/txt/544324-CITICHEM-Annexure-II_java.txt (3421 chars)
  11. ⚙ Extracting SHP text with PDFPlumber...
  12. ✓ PDFPlumber TXT created: downloads/bse/pdf/shp/txt/544324-CITICHEM-Annexure-II_pdfplumber.txt (3198 chars)
  13. ⚙ Generating PNG from PDF (all page(s))...
  14. ✓ PNG generated (stitched 3 pages): downloads/bse/pdf/lockin/png/544324-CITICHEM-Annexure-I.png
  15. ⚙ Connecting to database...
  16. ✓ Database connection established

======================================================================
✅ STEP 1 COMPLETE - Ready for parsing
======================================================================

Files prepared:
  • Lock-in Java TXT:       downloads/bse/pdf/lockin/txt/544324-CITICHEM-Annexure-I_java.txt
  • Lock-in PDFPlumber TXT: downloads/bse/pdf/lockin/txt/544324-CITICHEM-Annexure-I_pdfplumber.txt
  • SHP Java TXT:           downloads/bse/pdf/shp/txt/544324-CITICHEM-Annexure-II_java.txt
  • SHP PDFPlumber TXT:     downloads/bse/pdf/shp/txt/544324-CITICHEM-Annexure-II_pdfplumber.txt
  • Lock-in PNG:            downloads/bse/pdf/lockin/png/544324-CITICHEM-Annexure-I.png

Next: Implement parsing logic (Step 2)

Testing

Test the database connection:

python db.py

Test with dry-run mode (shows what would happen without doing it):

python app.py 544324-CITICHEM-Annexure-I.pdf --bse --dryrun --verbose

Next Steps

- Step 2: Parse lock-in details from *_java.txt files
- Step 3: Parse SHP data from SHP *_java.txt files
- Step 4: Implement validation rules (RULE1-RULE10)
- Step 5: GEMINI extraction (when --GEMAPPROVED)
- Step 6: Save to database
- Step 7: Finalize files (move to finalized/)
- Step 8: Dashboard UI

Files Created

- `app.py` - Main processor (replicates extraction workflow)
- `config_new.py` - Configuration (folder structure, patterns)
- `db.py` - Database utilities (connection pooling, transactions)
- `.env.example` - Environment template
- `README.md` - This file

Key Differences from Current Project

✅ Same functionality, but:

- Single entry point (app.py instead of multiple scripts)
- Clean separation (config, db, processing)
- Better error handling with exit codes
- Transaction support (all-or-nothing)
- Dry-run mode for testing
- Progress tracking with step numbers
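
To illustrate how step-numbered progress and exit codes might fit together, here is a small sketch. The class and function names are hypothetical, the `✗` failure mark is an assumption (the output example only shows `⚙` and `✓`), and so are the exit code values 0 and 1.

```python
from typing import Callable, List, Tuple


class StepLog:
    """Collects numbered progress lines in the style of the output example."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def _add(self, mark: str, message: str) -> None:
        self.lines.append(f"  {len(self.lines) + 1}. {mark} {message}")

    def start(self, message: str) -> None:
        self._add("⚙", message)

    def ok(self, message: str) -> None:
        self._add("✓", message)

    def fail(self, message: str) -> None:
        self._add("✗", message)  # assumed failure mark


def run(steps: List[Tuple[str, Callable[[], str]]], log: StepLog) -> int:
    """Run each (label, action) pair; return 0 on success, 1 on the first failure."""
    for label, action in steps:
        log.start(label)
        try:
            log.ok(action())  # each action returns its success message
        except Exception as exc:
            log.fail(f"{label} failed: {exc}")
            return 1
    return 0
```

The return value of `run` could be handed to `sys.exit`, and `"\n".join(log.lines)` printed to reproduce the numbered STEPS listing.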
