Skip to content

Add support for "deeper" search of optional property.#5

Merged
nightbloos merged 1 commit intomasterfrom
handle-optional-fields-with-max-tag-deep
Jan 18, 2021
Merged

Add support for "deeper" search of optional property.#5
nightbloos merged 1 commit intomasterfrom
handle-optional-fields-with-max-tag-deep

Conversation

@nightbloos
Copy link
Copy Markdown
Owner

For some ULRs was found that we can't get for strange reason the og:type data.
One of this ULRs - was youtube links.
Was detected that in YouTube they keep metadata in body (and not in head as other normal services).
And because previously the criteria for breaking loop of procession of tokens was "we have Title + description + ogImage and we passed head" - we were not able to process all other optional meta after that we pass head.

Now we are able to control how much tokens we can process before breaking loop (or if we found required optional fields already)

Comment thread goscraper.go Outdated
maxDocumentLength int64
url string
maxRedirect int
maxTokenDeep int
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxTokenDepth and accordingly in other places

Comment thread goscraper.go Outdated
doc.Preview.Name = scraper.Url.Host
// set default icon to web root if <link rel="icon" href="/favicon.ico"> not found
doc.Preview.Icon = fmt.Sprintf("%s://%s%s", scraper.Url.Scheme, scraper.Url.Host, "/favicon.ico")
deepCounter := 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple depth should be enough

@nightbloos nightbloos force-pushed the handle-optional-fields-with-max-tag-deep branch from fd2240b to 176dfae Compare January 15, 2021 14:27
Comment thread goscraper.go
Comment on lines +468 to +473
if scraper.Options.MaxTokenDepth == 0 {
return nil
}
if ogType || depth >= scraper.Options.MaxTokenDepth {
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at this code I was thinking that maybe scrapet.Options.MaxTokenDepth == 0 is redundant, as depth >= scraper.Options.MaxTokenDepth will handle it anyway on the first iteration right? So we can have just second if

and another thing which hit me while iterating over this part is that it'll read much better with a little bit different naming, something like if gotOgType or if hasOgType.

I guess ogImage from above could be changed in the same vein to gotOgImage or hasOgImage?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess ogImage from above could be changed in the same vein to gotOgImage or hasOgImage

In origin lib, as you can see - we have var ogImage bool. I decide to use same "naming".
btw. Will fix naming in both places.

looking at this code I was thinking that maybe scrapet.Options.MaxTokenDepth == 0 is redundant, as depth >= scraper.Options.MaxTokenDepth will handle it anyway on the first iteration right?

You are almost right.
But from BC perspective - I'm trying to have same as soon as possible logic that will behave with default value - same as old code.
The point is that in feature - most probably we will have additional logic/checks after first if statement.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh, that bright future which will come sometime and will force us to change the code! 🙃

For some ULRs was found that we can't get for strange reason the `og:type` data.
One of this ULRs - was youtube links.
Was detected that in YouTube they keep metadata in body (and not in head as other normal services).
And because previously the criteria for breaking loop of procession of tokens was "we have Title + description + ogImage and we passed head" - we were not able to process all other optional meta after that we pass head.

Now we are able to control how much tokens we can process before breaking loop (or if we found required optional fields already)
@nightbloos nightbloos force-pushed the handle-optional-fields-with-max-tag-deep branch from 176dfae to f0a1002 Compare January 15, 2021 15:43
@nightbloos nightbloos merged commit 44a43d8 into master Jan 18, 2021
@nightbloos nightbloos deleted the handle-optional-fields-with-max-tag-deep branch January 18, 2021 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants