Crawler by weikunwu · Pull Request #2 · yanglianglu/Auto_Dash

weikunwu · 2023-10-23T18:59:46Z

Crawler implemented

Method 1

getDocumentUrls(keyword, n)

Given a keyword, get the URLs of the first n matching documents from Yahoo Finance.
The documents are returned as a list of URL strings:

[
    'https://finance.yahoo.com/news/retail-investors-can-now-bet-on-music-royalties-we-want-people-to-have-access-150221575.html',
    'https://finance.yahoo.com/news/foxconn-faces-tax-audit-land-024521324.html',
    'https://finance.yahoo.com/news/iphone-assembler-hon-hai-dives-010601477.html',
    'https://finance.yahoo.com/m/05189731-ff16-33e1-b077-dfead8ad1cb9/paypal-s-new-boss.html',
]
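The PR description does not show how the URLs are collected. As a minimal sketch only, assuming the search results arrive as HTML and that article pages end in `.html` (as in the sample above), the link-extraction step might look like this. `extract_article_urls`, `_LinkCollector`, and `BASE` are hypothetical names, not taken from the PR.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed site root, used to resolve relative links.
BASE = "https://finance.yahoo.com"


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_article_urls(html, n):
    """Return up to n unique absolute article URLs found in html."""
    collector = _LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        if len(urls) >= n:
            break
        url = urljoin(BASE, href)
        # Assumption: article pages live on the same site and end in ".html".
        if url.startswith(BASE + "/") and url.endswith(".html") and url not in urls:
            urls.append(url)
    return urls
```

Keeping the parsing separate from the network request makes this step testable against a saved page, without hitting the live site.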

Method 2

processUrls(urls)

Given a list of URLs, process each URL and return a list of documents:

[
    {
        'base_url': 'https://finance.yahoo.com/m/05189731-ff16-33e1-b077-dfead8ad1cb9/paypal-s-new-boss.html',
        'title': "PayPal's New Boss",
        'body': "When PayPal Holdings reports third quarter earnings on Nov. 1, it'll give new Chief Executive Alex Chriss his first chance to lay out his turnaround strategy.",
        'urls': []
    },
]
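The document shape above can be produced in many ways, and the PR text does not say which one the crawler uses. The following is a sketch under stated assumptions: the title comes from the page's `<title>` tag, the body is the joined text of its `<p>` tags, and `urls` holds every `<a href>` found. `process_urls`, `_ArticleParser`, and the injected `fetch` callable are hypothetical.

```python
from html.parser import HTMLParser


class _ArticleParser(HTMLParser):
    """Pulls the <title> text, <p> text, and <a href> values from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self.urls = []
        self._in_title = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self.paragraphs.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self.paragraphs[-1] += data


def process_urls(urls, fetch):
    """Return one document dict per URL; fetch(url) must return HTML text."""
    documents = []
    for url in urls:
        parser = _ArticleParser()
        parser.feed(fetch(url))
        documents.append({
            "base_url": url,
            "title": parser.title.strip(),
            "body": " ".join(p.strip() for p in parser.paragraphs if p.strip()),
            "urls": parser.urls,
        })
    return documents
```

Passing `fetch` in as a parameter is a design choice for this sketch: a real HTTP client can be supplied in production and a stub in tests.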

Note that this method does not process the URLs recursively: links collected in a document's 'urls' field are not followed.
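If recursion is needed, a caller could layer it on top of the method. The sketch below is not part of this PR. It assumes a `process_urls` callable that returns one document per input URL, each with a `urls` list, and it walks those links breadth-first up to a hypothetical `max_depth`.

```python
from collections import deque


def crawl(seed_urls, process_urls, max_depth=1):
    """Breadth-first crawl: process the seeds, then the links found in them."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    documents = []
    while queue:
        url, depth = queue.popleft()
        # Assumption: process_urls returns exactly one document per input URL.
        (doc,) = process_urls([url])
        documents.append(doc)
        if depth < max_depth:
            for link in doc["urls"]:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return documents
```

The `seen` set keeps pages that link to each other from being processed twice, and the depth limit bounds the size of the crawl.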

weikunwu added 2 commits October 23, 2023 13:58

Crawler

df3900f

Add urls inside document

7d729fc

yanglianglu approved these changes Oct 29, 2023

yanglianglu merged commit aa3f121 into main Oct 29, 2023

yanglianglu deleted the crawler branch October 29, 2023 22:34
