Goals, for future use in the slopless search engine project:
- Extract main content from HTML pages with high accuracy
- Normalize encoding to UTF-8 and unescape HTML entities
- Split content into semantically meaningful chunks
- Detect and handle different languages
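
The encoding-normalization and chunking goals above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual implementation. The function names `normalize_text` and `chunk_paragraphs`, the `declared_encoding` and `max_chars` parameters, and the use of blank-line paragraph boundaries as a stand-in for "semantically meaningful" chunks are all assumptions made for the example. Main-content extraction and language detection are left out, since they would likely need third-party tooling.

```python
import html
import re
import unicodedata


def normalize_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode bytes to str, unescape HTML entities, and normalize to NFC.

    `declared_encoding` is a hypothetical parameter: in practice it would
    come from an HTTP header or a <meta charset> tag.
    """
    try:
        text = raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # Fallback (an assumption): decode as UTF-8, replacing bad bytes.
        text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", html.unescape(text))


def chunk_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds at most `max_chars` characters, except that a single
    paragraph longer than `max_chars` is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The result of `normalize_text` is a Python `str`; calling `.encode("utf-8")` when writing it out would produce the UTF-8 bytes. Paragraph packing is only one possible chunking strategy, and a sentence- or heading-aware splitter could be swapped in behind the same interface.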