
update datasets used in lecture cross_section#3

Merged
shlff merged 4 commits into main from update_csdata
May 3, 2023

Conversation

@shlff
Member

@shlff shlff commented Apr 16, 2023

Hi @jstac and @mmcky, this PR updates the datasets used in the cross_section lecture of the Python intro lecture series:

To do:

  • add a README.md file to illustrate the datasets

@jstac
Contributor

jstac commented Apr 16, 2023

Thanks @shlff. Please ping and work with @mmcky when this is ready for review.

@shlff
Member Author

shlff commented Apr 17, 2023

Sure, thanks @jstac.

Hi @mmcky this PR is ready for review. Looking forward to your comments.

@mmcky

mmcky commented Apr 17, 2023

@shlff I would suggest you add a table to the README that includes information about the datasets, such as:

| name | description | approx. size |
| --- | --- | --- |
| /us_cities.txt | A dataset describing cities in the US () | 32.4 MB |

and then include some instructions on how to fetch the data for use in a lecture (or elsewhere).

These will be stored using git-lfs, so can you fetch git-lfs files directly in a lecture? Is that the aim? Perhaps we should have a quick Zoom call about what your requirements are here.

@shlff
Member Author

shlff commented Apr 17, 2023 via email

@mmcky

mmcky commented Apr 17, 2023

@shlff I see. I'm surprised this works, given that git-lfs files are stored as hashed objects, but GitHub must resolve that link to the stored location. That's neat. I wonder what the size limit is.

df_fs = pd.read_csv('https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_csdata/cross_section/forbes-global2000.csv')

It seems this is governed by QuantEcon's git-lfs data allowance and bandwidth usage.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage

We have 50 GB of storage and bandwidth, so this is not binding in the medium term. But it will be interesting to see the usage as people start downloading datasets from our GitHub account using Python.
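
The link above suggests a simple pattern for reaching a git-lfs tracked file over HTTP. The following is a minimal sketch of that pattern, not an official GitHub API: the helper name `lfs_media_url` and the `MEDIA_HOST` constant are hypothetical, and the URL layout is inferred only from the single URL quoted in this thread.

```python
from urllib.parse import quote

# Host and "/media" prefix copied from the URL quoted in this thread.
MEDIA_HOST = "https://media.githubusercontent.com/media"


def lfs_media_url(owner, repo, ref, path):
    """Build a media URL for a git-lfs tracked file.

    Hypothetical helper: the owner/repo/ref/path layout is an assumption
    inferred from the forbes-global2000.csv link above.
    """
    joined = "/".join([owner, repo, ref, path.strip("/")])
    # Percent-encode unusual characters but keep "/" as a separator.
    return MEDIA_HOST + "/" + quote(joined)


url = lfs_media_url(
    "QuantEcon",
    "high_dim_data",
    "update_csdata",
    "cross_section/forbes-global2000.csv",
)
print(url)

# Reading the file needs pandas and network access, so it is left commented:
# import pandas as pd
# df_fs = pd.read_csv(url)
```

Each `pd.read_csv` call on such a URL downloads the file again, so it would count toward the bandwidth allowance discussed above.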

@shlff
Member Author

shlff commented Apr 17, 2023

Thanks for your advice and chat, @mmcky. I've created an issue about the bandwidth usage:

I will keep you posted once I:

  • update the data description,
  • update the instructions,
  • review and modify.

@mmcky

mmcky commented May 3, 2023

Thanks @shlff, if you're happy with this PR please go ahead and merge.

@shlff shlff force-pushed the update_csdata branch from 595d3d8 to 181bd74 Compare May 3, 2023 06:26
@shlff
Member Author

shlff commented May 3, 2023

Thanks @mmcky. I've pushed some modifications to the instructions. The PR is now ready for merging.

I will also create a PR in the lecture-python-intro repo to update the URLs in the related lectures that use these datasets.
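
Because the URL quoted earlier in this thread points at the update_csdata branch, lecture links of that form would need a different ref once the branch is gone. Below is a rough sketch of such a rewrite. The helper `switch_ref` is hypothetical, and it assumes the ref is a single path segment and that the file keeps the same path on main, which this thread does not confirm.

```python
from urllib.parse import urlsplit, urlunsplit


def switch_ref(media_url, new_ref):
    """Replace the branch/ref segment of a GitHub media URL.

    Hypothetical helper. Assumes the path looks like
    /media/<owner>/<repo>/<ref>/<file path> with a single-segment ref.
    """
    parts = urlsplit(media_url)
    segments = parts.path.split("/")
    # segments: ['', 'media', owner, repo, ref, ...file path...]
    if len(segments) < 6 or segments[1] != "media":
        raise ValueError(f"unexpected media URL layout: {media_url}")
    segments[4] = new_ref
    return urlunsplit(parts._replace(path="/".join(segments)))


old_url = (
    "https://media.githubusercontent.com/media/QuantEcon/high_dim_data/"
    "update_csdata/cross_section/forbes-global2000.csv"
)
new_url = switch_ref(old_url, "main")
print(new_url)
```

A branch name containing a slash would break the single-segment assumption, so this should be treated as an illustration rather than a general tool.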

@shlff shlff merged commit f72b949 into main May 3, 2023
@shlff shlff deleted the update_csdata branch May 3, 2023 06:30
3 participants