fix: just adding stuff from developer.py to synopsis developer#182
michaelneale merged 2 commits into main
tmp_file.write(result)
tmp_text_file_path = tmp_file.name.replace(".html", ".txt")
plain_text = re.sub(
    r"<head.*?>.*?</head>|<script.*?>.*?</script>|<style.*?>.*?</style>|<[^>]+>",
nit: I think we should consider adding the html2text dependency (it's cheap and avoids the Zalgo regex of doom)
it is GPL (v3) so a no go (already looked at that)
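For reference, one way to avoid both the regex and a GPL dependency would be the standard library's `html.parser`. This is only a minimal sketch, not what the PR does: `TextExtractor` and `html_to_text` are hypothetical names, and it assumes reasonably well-formed HTML (an unclosed `<head>` or `<script>` would swallow the rest of the page).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script>, and <style> subtrees."""

    SKIPPED = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside the skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        # Simplification: every text run ends up on its own line.
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Unlike the regex, this does not depend on `re.DOTALL` or on tags being closed on the same line, but it also drops all layout, so inline elements such as `<b>` split a sentence across lines.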
    return path

@tool
def fetch_web_content(self, url: str) -> str:
This might be worth adding @cache, though I'm not quite sure that plays well with the named temp file.
Edit: never mind, I just noticed the temp file stays, so I think memoizing would be great for refetching the same content. That being said, maybe we should add a cleanup for old files?
I think for temp files that is implicit?
@michaelneale the named file is created each time this function is called so it's not cached at all
goose/src/goose/toolkit/developer.py
Line 97 in e19006c
if we add @cache we'd memoize the function call in memory so it wouldn't need to hit disk
the LLM gets the file back, not the content, so it won't call the fetch each time as it knows in its context that it has the file with the content
ah right, that makes sense
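To make the @cache idea from this thread concrete, here is a rough sketch of what memoizing the fetch could look like. It is not the code in the PR: `fake_fetch` is a hypothetical stand-in for `httpx.get(url).text` so the example runs offline, and `fetch_to_file` is a plain function rather than a toolkit method.

```python
import os
import tempfile
from functools import cache

fetch_count = 0  # counts real fetches, for demonstration only


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for httpx.get(url).text so the sketch runs offline."""
    global fetch_count
    fetch_count += 1
    return f"<html><body>content of {url}</body></html>"


@cache
def fetch_to_file(url: str) -> str:
    """Fetch url once per process and return the path of the saved copy."""
    with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=".html") as tmp_file:
        tmp_file.write(fake_fetch(url))
    return tmp_file.name


first = fetch_to_file("https://example.com")
second = fetch_to_file("https://example.com")  # served from the in-memory cache
```

Two caveats if this were applied to the real tool. With `delete=False` the file is not removed when it is closed, so old files stay on disk until something calls `os.remove` on them, and a cached path goes stale if the file is deleted. Also, `@cache` on a method includes `self` in the cache key and keeps the instance alive for the life of the cache.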
Ahh, I didn't realize about the licensing. Is there an automation we can add to audit/check these things? I'm not very well versed in the legalese to know.
On Tue, Oct 22, 2024 at 18:32 Michael Neale wrote:
In src/goose/synopsis/toolkit.py:
> + Args:
+ url (str): url of the site to visit.
+ Returns:
+ (dict): A dictionary with two keys:
+ - 'html_file_path' (str): Path to a html file which has the content of the page. It will be very large so use rg to search it or head in chunks. Will contain meta data and links and markup.
+ - 'text_file_path' (str): Path to a plain text file which has the some of the content of the page. It will be large so use rg to search it or head in chunks. If content isn't there, try the html variant.
+ """ # noqa
+ friendly_name = re.sub(r"[^a-zA-Z0-9]", "_", url)[:50] # Limit length to prevent filenames from being too long
+
+ try:
+ result = httpx.get(url, follow_redirects=True).text
+ with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=f"_{friendly_name}.html") as tmp_file:
+ tmp_file.write(result)
+ tmp_text_file_path = tmp_file.name.replace(".html", ".txt")
+ plain_text = re.sub(
+ r"<head.*?>.*?</head>|<script.*?>.*?</script>|<style.*?>.*?</style>|<[^>]+>",
it is GPL (v3) so a no go (already looked at that)
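On the question of automating the licence check: one possible starting point is to read the licence metadata of the installed packages with the standard library's `importlib.metadata`. This is a sketch under assumptions, not an existing project script: `FLAGGED_MARKERS`, `license_strings`, `is_flagged`, and `flagged_packages` are hypothetical names, the list of flagged markers is a policy decision, and package metadata is self-declared and often incomplete.

```python
from importlib.metadata import distributions

# Assumption: which licence markers count as "flagged" is a project policy decision.
# Note that the substring "GPL" also matches "LGPL" and "AGPL".
FLAGGED_MARKERS = ("GPL",)


def license_strings(dist) -> list[str]:
    """Collect the licence-related metadata fields a distribution declares."""
    meta = dist.metadata
    found = []
    for field in ("License", "License-Expression"):
        value = meta.get(field)
        if value:
            found.append(value)
    classifiers = meta.get_all("Classifier") or []
    found.extend(c for c in classifiers if c.startswith("License ::"))
    return found


def is_flagged(licenses: list[str]) -> bool:
    """Return True if any declared licence string contains a flagged marker."""
    return any(marker in text for text in licenses for marker in FLAGGED_MARKERS)


def flagged_packages() -> dict[str, list[str]]:
    """Map each installed package with a flagged licence to its licence strings."""
    report = {}
    for dist in distributions():
        licenses = license_strings(dist)
        if is_flagged(licenses):
            report[dist.metadata.get("Name", "unknown")] = licenses
    return report
```

Running `flagged_packages()` in CI and failing the build on a non-empty result would catch a case like html2text before review, though a human still has to judge the matches.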
This was stuff that didn't make it over yet