Merged
88 changes: 87 additions & 1 deletion .auxiliary/notes/issues.md
@@ -1,3 +1,89 @@
# Librovore Issues and Enhancement Opportunities

No open issues at this time.
## SSL/TLS Certificate Verification Failure

**Date Reported**: 2025-11-19
**Component**: Sphinx inventory processor (urllib-based inventory download)
**Severity**: Medium (blocks testing with some sites)

### Issue Description

When attempting to fetch Sphinx object inventories from certain sites (e.g., `docs.twistedmatrix.com`, `www.dulwich.io`), the inventory processor fails with:

```
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed:
self-signed certificate in certificate chain (_ssl.c:1017)>
```

### Observed Behavior

- ✅ **Detection/probing via httpx**: Successfully connects to sites (HEAD/GET for HTML)
- ❌ **Inventory download via urllib**: Fails SSL verification

### Root Cause

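The certificate chains presented for these documentation sites include a certificate that the verifying client does not trust (reported as self-signed). The two HTTP stacks in use handle TLS differently:
- **httpx** (used for detection): builds its own SSL context and, by default, verifies against the `certifi` CA bundle rather than the system store, which may explain why it connects
- **urllib** (used in Sphinx inventory processor): strict SSL verification against the system CA bundle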
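
To help confirm the urllib side of this diagnosis, the following minimal sketch prints the defaults that the standard `ssl` module applies when no explicit context is supplied. The helper name `describe_default_context` is hypothetical, and the reported CA paths depend on the local Python build and operating system.

```python
import ssl


def describe_default_context() -> dict[str, object]:
    """Summarizes the default client-side TLS settings of the ssl module."""
    context = ssl.create_default_context()
    paths = ssl.get_default_verify_paths()
    return {
        'verify_mode': context.verify_mode,
        'check_hostname': context.check_hostname,
        # Locations the interpreter consults for trusted CA certificates.
        'cafile': paths.cafile,
        'capath': paths.capath,
    }


if __name__ == '__main__':
    for key, value in describe_default_context().items():
        print(f"{key}: {value}")
```

Comparing the reported CA locations with the bundle used by the detection client may show whether the two stacks trust different roots.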

### Impact

- **Structure processors** (including the new Pydoctor processor) cannot be fully tested end-to-end with these sites
- **Inventory processor** cannot fetch inventory files from affected sites
- Does not affect sites with properly signed certificates

### Affected Sites

- https://docs.twistedmatrix.com/en/stable/api/
- https://www.dulwich.io/api/

### Potential Solutions

1. **Fetch inventories via httpx**, reusing the same client that detection already uses
2. **Add SSL verification configuration** to allow disabling verification for specific domains (testing only)
3. **Report to site maintainers** about certificate chain issues
4. **Use different inventory sources** (manual creation, alternative processors)
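
As an illustration of the second option, here is a minimal sketch of per-host verification control using only the standard library. The allowlist `INSECURE_TEST_HOSTS` and the helper `context_for` are hypothetical names, not part of Librovore's configuration or API, and disabling verification is only appropriate for testing.

```python
import ssl
from urllib.parse import urlparse

# Hypothetical allowlist for illustration only; not a Librovore setting.
INSECURE_TEST_HOSTS = frozenset({'docs.twistedmatrix.com', 'www.dulwich.io'})


def context_for(
    url: str, insecure_hosts: frozenset[str] = INSECURE_TEST_HOSTS
) -> ssl.SSLContext:
    """Returns a verifying context unless the host is explicitly exempted."""
    context = ssl.create_default_context()
    host = urlparse(url).hostname or ''
    if host in insecure_hosts:
        # check_hostname must be disabled before verify_mode can be relaxed.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

A caller could then pass the result to `urllib.request.urlopen(url, context=context_for(url))`, which leaves verification strict for every host outside the allowlist.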

### Notes

This issue was discovered during Pydoctor structure processor testing. The structure processor implementation is correct and works properly when inventory objects are available from other sources.

---

## Code Duplication: normalize_base_url

**Date Reported**: 2025-11-19
**Component**: Structure processors (Sphinx, Pydoctor)
**Severity**: Low (technical debt)

### Issue Description

The `normalize_base_url` function is duplicated across structure processor packages:
- `sources/librovore/structures/sphinx/urls.py`
- `sources/librovore/structures/pydoctor/urls.py`

### Current State

Both implementations are identical and handle:
- URL parsing and normalization
- File path to URL conversion
- Scheme validation (http, https, file)
- Path cleanup (trailing slash removal)

### Recommendation

Extract `normalize_base_url` and related URL utilities to a shared location:
- Option 1: `sources/librovore/structures/urls.py` (common module)
- Option 2: `sources/librovore/urls.py` (top-level utility)
- Option 3: Include in base structure processor class
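
For orientation, a shared helper covering the four behaviors listed under Current State might look roughly like the sketch below. This is an assumption-laden reconstruction, not the existing implementation: the real signature, return type, and error type in the two `urls.py` modules may differ.

```python
from pathlib import Path
from urllib.parse import ParseResult, urlparse

SUPPORTED_SCHEMES = frozenset({'http', 'https', 'file'})


def normalize_base_url(source: str) -> ParseResult:
    """Parses a source location and normalizes it into a base URL."""
    url = urlparse(source)
    if not url.scheme:
        # Treat scheme-less input as a local filesystem path.
        url = urlparse(Path(source).resolve().as_uri())
    if url.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported URL scheme: {url.scheme!r}")
    # Path cleanup: drop trailing slashes so derived URLs join predictably.
    return url._replace(path=url.path.rstrip('/'))
```

Keeping one such function in a common module would let both processors import it instead of maintaining two copies.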

### Benefits

- Reduces code duplication
- Ensures consistent URL handling across all structure processors
- Simplifies maintenance and testing
- Reduces risk of divergence between implementations

### Impact

Low priority - current duplication is manageable with only two instances. Should be addressed before adding more structure processors to prevent further duplication.
4 changes: 4 additions & 0 deletions data/configuration/general.toml
@@ -32,6 +32,10 @@ enabled = true
name = "mkdocs"
enabled = true

[[structure-extensions]]
name = "pydoctor"
enabled = true

# External Extension Examples
# Uncomment and modify these examples to add external documentation processors.

26 changes: 26 additions & 0 deletions sources/librovore/structures/pydoctor/__.py
@@ -0,0 +1,26 @@
# vim: set filetype=python fileencoding=utf-8:
# -*- coding: utf-8 -*-

#============================================================================#
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
# You may obtain a copy of the License at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# Unless required by applicable law or agreed to in writing, software #
# distributed under the License is distributed on an "AS IS" BASIS, #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. #
# See the License for the specific language governing permissions and #
# limitations under the License. #
# #
#============================================================================#


''' Pydoctor subpackage import namespace. '''

# ruff: noqa: F403


from ..__ import *
33 changes: 33 additions & 0 deletions sources/librovore/structures/pydoctor/__init__.py
@@ -0,0 +1,33 @@
# vim: set filetype=python fileencoding=utf-8:
# -*- coding: utf-8 -*-

#============================================================================#
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
# You may obtain a copy of the License at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# Unless required by applicable law or agreed to in writing, software #
# distributed under the License is distributed on an "AS IS" BASIS, #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. #
# See the License for the specific language governing permissions and #
# limitations under the License. #
# #
#============================================================================#


''' Pydoctor documentation source detector and processor. '''


from .detection import PydoctorDetection
from .main import PydoctorProcessor

from . import __


def register( arguments: __.cabc.Mapping[ str, __.typx.Any ] ) -> None:
    ''' Registers configured Pydoctor processor instance. '''
    processor = PydoctorProcessor( )
    __.structure_processors[ processor.name ] = processor
84 changes: 84 additions & 0 deletions sources/librovore/structures/pydoctor/conversion.py
@@ -0,0 +1,84 @@
# vim: set filetype=python fileencoding=utf-8:
# -*- coding: utf-8 -*-

#============================================================================#
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
# You may obtain a copy of the License at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# Unless required by applicable law or agreed to in writing, software #
# distributed under the License is distributed on an "AS IS" BASIS, #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. #
# See the License for the specific language governing permissions and #
# limitations under the License. #
# #
#============================================================================#


''' HTML to markdown conversion utilities. '''


from bs4 import BeautifulSoup as _BeautifulSoup

from . import __


class PydoctorMarkdownConverter( __.markdownify.MarkdownConverter ):
    ''' Custom markdownify converter for Pydoctor HTML. '''

    def convert_pre(
        self,
        el: __.typx.Any,
        text: str,
        convert_as_inline: bool,
    ) -> str:
        ''' Converts pre elements with Python code detection. '''
        if self.is_code_block( el ):
            # Pydoctor code blocks are typically Python
            code_text = el.get_text( )
            return f"\n```python\n{code_text}\n```\n"
        return super( ).convert_pre( el, text, convert_as_inline )

    def is_code_block( self, element: __.typx.Any ) -> bool:
        ''' Determines if element is a code block. '''
        # Pydoctor uses <pre> for code blocks
        return element.name == 'pre'


def html_to_markdown( html_text: str ) -> str:
    ''' Converts HTML text to markdown using Pydoctor-specific patterns. '''
    if not html_text.strip( ): return ''
    try: cleaned_html = _preprocess_pydoctor_html( html_text )
    except Exception: return html_text
    try:
        converter = PydoctorMarkdownConverter(
            heading_style = 'ATX',
            strip = [ 'nav', 'header', 'footer', 'script' ],
            escape_underscores = False,
            escape_asterisks = False
        )
        markdown = converter.convert( cleaned_html )
    except Exception: return html_text
    return markdown.strip( )


def _preprocess_pydoctor_html( html_text: str ) -> str:
    ''' Preprocesses Pydoctor HTML before markdown conversion. '''
    soup: __.typx.Any = _BeautifulSoup( html_text, 'lxml' )
    # Remove navigation elements
    for selector in [ '.navbar', '.sidebar', '.mainnavbar' ]:
        for element in soup.select( selector ):
            element.decompose( )
    # Remove search elements
    for selector in [ '#searchBox', '.search' ]:
        for element in soup.select( selector ):
            element.decompose( )
    # Remove Bootstrap scaffolding that doesn't contribute to content.
    # CSS has no class-name wildcard ('.col-md-*' is an invalid selector),
    # so match grid columns with an attribute substring selector instead.
    for selector in [ '.container', '.row', '[class*="col-md-"]' ]:
        for element in soup.select( selector ):
            # Unwrap instead of decompose to keep content
            element.unwrap( )
    return str( soup )
105 changes: 105 additions & 0 deletions sources/librovore/structures/pydoctor/detection.py
@@ -0,0 +1,105 @@
# vim: set filetype=python fileencoding=utf-8:
# -*- coding: utf-8 -*-

#============================================================================#
# #
# Licensed under the Apache License, Version 2.0 (the "License"); #
# you may not use this file except in compliance with the License. #
# You may obtain a copy of the License at #
# #
# http://www.apache.org/licenses/LICENSE-2.0 #
# #
# Unless required by applicable law or agreed to in writing, software #
# distributed under the License is distributed on an "AS IS" BASIS, #
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. #
# See the License for the specific language governing permissions and #
# limitations under the License. #
# #
#============================================================================#


''' Pydoctor detection and metadata extraction. '''


from urllib.parse import ParseResult as _Url

from . import __
from . import extraction as _extraction
from . import urls as _urls


_scribe = __.acquire_scribe( __name__ )


class PydoctorDetection( __.StructureDetection ):
    ''' Detection result for Pydoctor documentation sources. '''

    source: str
    normalized_source: str = ''

    @classmethod
    def get_capabilities( cls ) -> __.StructureProcessorCapabilities:
        ''' Pydoctor processor capabilities. '''
        return __.StructureProcessorCapabilities(
            supported_inventory_types = frozenset( { 'pydoctor' } ),
            content_extraction_features = frozenset( {
                __.ContentExtractionFeatures.Signatures,
                __.ContentExtractionFeatures.Descriptions,
                __.ContentExtractionFeatures.CodeExamples,
            } ),
            confidence_by_inventory_type = __.immut.Dictionary( {
                'pydoctor': 1.0
            } )
        )

    @classmethod
    async def from_source(
        selfclass,
        auxdata: __.ApplicationGlobals,
        processor: __.Processor,
        source: str,
    ) -> __.typx.Self:
        ''' Constructs detection from source location. '''
        detection = await processor.detect( auxdata, source )
        return __.typx.cast( __.typx.Self, detection )

    async def extract_contents(
        self,
        auxdata: __.ApplicationGlobals,
        source: str,
        objects: __.cabc.Sequence[ __.InventoryObject ], /,
    ) -> tuple[ __.ContentDocument, ... ]:
        ''' Extracts documentation content for specified objects. '''
        documents = await _extraction.extract_contents(
            auxdata, source, objects )
        return tuple( documents )


async def detect_pydoctor(
    auxdata: __.ApplicationGlobals, base_url: _Url
) -> float:
    ''' Detects if source is a Pydoctor documentation site. '''
    confidence = 0.0
    # Check for index.html
    index_url = _urls.derive_index_url( base_url )
    try:
        html_content = await __.retrieve_url_as_text(
            auxdata.content_cache,
            index_url, duration_max = 10.0 )
    except Exception as exc:
        _scribe.debug( f"Detection failed for {base_url.geturl( )}: {exc}" )
        return confidence
    html_lower = html_content.lower( )
    # Check for pydoctor meta tag (highest confidence)
    if '<meta name="generator" content="pydoctor' in html_lower:
        confidence = 1.0
    # Check for characteristic CSS files
    elif 'apidocs.css' in html_lower:
        confidence = 0.8
    # Check for Bootstrap-based navigation with pydoctor structure
    elif 'navbar navbar-default mainnavbar' in html_lower:
        confidence += 0.3
    # Check for pydoctor-specific elements
    if 'class="docstring"' in html_lower:
        confidence += 0.2
    return min( confidence, 1.0 )
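
The scoring rules in `detect_pydoctor` can be exercised without network access by isolating them in a pure function. The sketch below mirrors the marker strings and weights shown above; the name `score_pydoctor_html` is hypothetical and does not exist in this change.

```python
def score_pydoctor_html(html: str) -> float:
    """Scores how strongly an HTML page resembles Pydoctor output."""
    html_lower = html.lower()
    confidence = 0.0
    if '<meta name="generator" content="pydoctor' in html_lower:
        # Generator meta tag: strongest signal.
        confidence = 1.0
    elif 'apidocs.css' in html_lower:
        # Characteristic stylesheet name.
        confidence = 0.8
    elif 'navbar navbar-default mainnavbar' in html_lower:
        # Bootstrap navigation alone is only weak evidence.
        confidence += 0.3
    if 'class="docstring"' in html_lower:
        # Docstring containers add a small bonus to any branch.
        confidence += 0.2
    return min(confidence, 1.0)
```

Under these rules, a page with only the navigation bar and docstring markers scores about 0.5, while the generator meta tag alone already reaches the cap of 1.0.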