[formrecognizer] Proposal: child element navigator function #21352
catalinaperalta wants to merge 2 commits into Azure:main from
Conversation
API changes have been detected in this PR:

```diff
+ def azure.ai.formrecognizer.get_document_content_elements(
+     base_element,
+     page,
+     search_elements
+ ) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
```
```python
def get_document_content_elements(base_element, page, search_elements):
    # type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
```
This should technically be something like (note: we don't yet implement the DocumentStructureElement base class):

```diff
- # type: (DocumentLine, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
+ # type: (DocumentStructureElement, DocumentPage, List[str]) -> List[Union[DocumentElement, DocumentWord, DocumentSelectionMark]]
```
```python
class ElementNavigator(object):
    """Provides element navigation methods."""
```
My initial thought is that it might make sense to "pre-compute" everything in the constructor. If someone instantiates this class, it means they are interested in navigating the elements, so I think it's okay to make that assumption and take the hit up front.

By "pre-compute" I mean: can we take, for example, all the words and lines and categorize them by their offset/length? That way we can jump straight to the offset of the span of the element passed in and hopefully just do a few quick calculations to include everything it contains. Untested example of the "pre-computing" I have in mind:
```python
eles = {}
for page in document.pages:
    for word in page.words:
        if word.span.offset not in eles:
            eles[word.span.offset] = {}
        eles[word.span.offset][word.span.length] = word
for page in document.pages:
    for line in page.lines:
        for span in line.spans:
            if span.offset not in eles:
                eles[span.offset] = {}
            eles[span.offset][span.length] = line
```
Maybe we keep words and lines in separate indexes; I guess it depends on whether we ever want to return a heterogeneous collection. But then I think we should be able to enable a scenario like this:
```python
poller = client.begin_analyze_document("prebuilt-document", myfile)
result = poller.result()
nav = ElementNavigator(result)
lines = nav.get_lines(result.documents[0])
```
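To sanity-check the pre-compute idea, here is a minimal, self-contained sketch of what such an `ElementNavigator` could look like. The `Span` and `Line` stand-in classes are hypothetical simplifications of the SDK models, and the containment semantics (a line belongs to an element if the line's span falls entirely inside one of the element's spans) are an assumption, not the SDK's definition:

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical minimal stand-ins for the SDK models, for illustration only.
@dataclass
class Span:
    offset: int
    length: int


@dataclass
class Line:
    content: str
    spans: List[Span]


class ElementNavigator:
    """Sketch: pre-compute lines keyed by span offset/length, then answer
    containment queries by scanning the index."""

    def __init__(self, lines: List[Line]) -> None:
        self._lines: Dict[int, Dict[int, Line]] = {}
        for line in lines:
            for span in line.spans:
                self._lines.setdefault(span.offset, {})[span.length] = line

    def get_lines(self, element_spans: List[Span]) -> List[Line]:
        """Return lines whose spans fall inside any of the element's spans."""
        found = []
        for target in element_spans:
            end = target.offset + target.length
            for offset, by_length in self._lines.items():
                for length, line in by_length.items():
                    if target.offset <= offset and offset + length <= end:
                        found.append(line)
        return found
```

The nested `offset -> length -> element` dict mirrors the `eles` structure above; the final lookup here is still linear, but swapping it for a sorted-offset index would make the jump-to-offset step cheap.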
Please poke holes in this since I know you've spent much more time thinking on this. :)
Also, my (maybe poor) understanding is that you should be able to pass any type that contains `span` or `spans` into the helpers, since the text/elements they are comprised of are accessible through `AnalyzeResult.content`. I see that these types all have spans, but some of them are atomic types (like words), so maybe we should throw if somebody passed one of those:
AnalyzedDocument
DocumentEntity
DocumentField
DocumentKeyValueElement
DocumentLine
DocumentPage
DocumentSelectionMark
DocumentStyle
DocumentTable
DocumentTableCell
DocumentWord
(I'm still thinking about this but I'm just going to hit 'Enter' on the comment for now and come back to it) 😃
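One way to express the "anything with `span` or `spans`, except atomic leaf types" idea is a small normalization helper. This is a hypothetical sketch; the helper name, the duck-typing via `getattr`, and the choice of which type names count as atomic are all assumptions, not part of the SDK:

```python
from typing import List

# Assumption: atomic leaf types have no children; this name list is illustrative.
ATOMIC_TYPE_NAMES = ("DocumentWord", "DocumentSelectionMark")


def get_element_spans(element) -> List:
    """Hypothetical helper: normalize an element's span(s) to a list,
    rejecting atomic leaf types that cannot have children."""
    if type(element).__name__ in ATOMIC_TYPE_NAMES:
        raise ValueError(
            "%s is an atomic element and has no child elements" % type(element).__name__
        )
    # Prefer the plural attribute when present (e.g. lines, tables, documents).
    spans = getattr(element, "spans", None)
    if spans is not None:
        return list(spans)
    span = getattr(element, "span", None)
    if span is not None:
        return [span]
    raise ValueError("element exposes neither 'span' nor 'spans'")
```

With this shape, every type in the list above funnels into the same search code path, and the atomic types fail fast with a clear error.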
I like your idea about pre-computing the elements by their span offset! After my discussion with Johan, I don't think performance is too much of a concern right now, but this is a good idea to keep in our back pocket as a future improvement depending on how the implementation goes. The `eles` dict would take some memory, but it may never grow large enough to actually be a problem.
Closing this, since the work is being done in #21224.
The idea here is to provide a package-level function that lets users pass in a certain type of document element and find its child elements. For instance, in the case of a DocumentStructureElement, users can call the `get_document_content_elements` function to find the child document content elements, such as "words" and "selection_marks". This removes the need to repeat a get-children method on each model, and also removes the need to store a reference to the parent in those models.
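For illustration, here is a self-contained sketch of the naive search this proposal describes: collect the page's content elements whose spans fall inside the base element's spans. The `Span`/`Word`/`Page` classes are hypothetical minimal stand-ins for the SDK models, and the function body is a simplification of the proposed helper, not its actual implementation:

```python
from dataclasses import dataclass
from typing import List


# Hypothetical minimal models; the real SDK types are richer.
@dataclass
class Span:
    offset: int
    length: int


@dataclass
class Word:
    content: str
    span: Span


@dataclass
class Page:
    words: List[Word]


def get_content_elements(base_element_spans: List[Span], page: Page) -> List[Word]:
    """Naive sketch: return the page's words whose spans fall inside
    any of the base element's spans (linear containment search)."""
    children = []
    for target in base_element_spans:
        end = target.offset + target.length
        for word in page.words:
            if target.offset <= word.span.offset and word.span.offset + word.span.length <= end:
                children.append(word)
    return children
```

Because no parent references are stored, the relationship is recovered purely from span containment at query time.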
Currently this is the lazy way of performing the search, which is O(N^3), but it could get down to O(N log N) by switching to binary search, assuming all of the spans and document content elements are already sorted.
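The binary-search step could look something like the sketch below, which uses the standard-library `bisect` module to jump to the first candidate span instead of scanning from the start. This is an untested illustration under the stated assumption that spans are sorted by offset, not the PR's implementation:

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    offset: int
    length: int


def find_contained(sorted_spans: List[Span], target: Span) -> List[Span]:
    """Return the spans from `sorted_spans` that fall entirely inside `target`.

    Assumes `sorted_spans` is sorted by offset, so bisect can locate the
    first candidate in O(log N); the scan then stops as soon as a span
    starts past the end of the target.
    """
    offsets = [s.offset for s in sorted_spans]
    # First span starting at or after the target's offset.
    start = bisect.bisect_left(offsets, target.offset)
    end = target.offset + target.length
    result = []
    for span in sorted_spans[start:]:
        if span.offset >= end:
            break  # sorted order: nothing later can be contained
        if span.offset + span.length <= end:
            result.append(span)
    return result
```

In practice the `offsets` list would be built once per page (or per document) so each lookup pays only the bisect plus the spans it actually returns.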