Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data
Free download from UNC Libraries: search on "Ivezic"
Python library (astroML) and Python code to make the book figures here
Basics
Wireless access for your laptop
Git prep
Linux tutorial
After completing the Linux tutorial, you should make a directory in /afs/cas.unc.edu/users/y/o/yourname/public to hold the rest of your boot camp work. Your initial disk space allocation is likely to be sufficient, but if it runs out, we can arrange additional space. You can work directly on any Linux machine in Hell's Kitchen (the astro computing lab), or if these machines are not powerful enough, feel free to ssh into stardust.astro.unc.edu or blarney.castle.unc.edu.
Linux bonus tracks
vi tutorial
Even if you prefer emacs or another programming editor, you should learn the basics of vi, because you may sometimes find yourself inadvertently dumped into vi when using Git or Linux software. Note that vi is installed by default on Linux/Mac and comes with Git Bash for Windows (see "Git prep" above under Basics).
emacs installation
Emacs is the primary alternative to vi and there are long arguments about which is better. Optionally install emacs for Windows or Mac and run the built-in tutorial in the emacs help menu. (FWIW, your instructor uses emacs.)
- Anaconda installation and basic data analysis tutorial
- Programming tutorial
- Browse Chapter 1 of the AstroML textbook (reading 1.6 more closely) and download/play with the code for Figs. 1.9-1.12
- Read Appendix A of the AstroML textbook and try out the commands
- Browse Chapter 2 and take some time to study the vectorization example on pp. 54-56 to reconstruct why it works
- Debug and speed up this template code or, preferably, this protected version of the same code after consulting these Programming Best Practices; make sure to read the instructions at the bottom of the code. The pdb package isn't necessary for such a short code, but try it anyway to see how it works. When you think you've found everything, discuss with a partner and/or the instructor.
- Optional: Use these jupyter notebook quickstart instructions to examine the example jupyter notebook called `ExploreRESOLVEandECO.ipynb` (found in the current repo; download by clicking "raw" then right-clicking the raw contents and choosing "save as"). You can run this notebook partway through if you also download the `ECO_dr1_subset.csv` input file, also provided in this repo. If you like the idea of being able to work in notebooks like this, then you can get comfortable with them first by finishing the example notebook (this effort will also give you a small taste of the sql database query language), then by creating your own jupyter notebook from scratch. For example, you could use ECO DR1 to plot stellar mass vs. environment distributions and compare them for early type and late type galaxies. Try raiding code from at least one of Figures 1.9-1.12 in the textbook. NOTE: If/when you launch your first jupyter notebook under linux, you'll get a question about a kernel choice -- just click OK and you should be all set!
- Optional: Check out the 10 Minutes to pandas guide to see if you'd like to learn more about this powerful data manipulation package. If you've learned about jupyter notebooks, you can play with some useful pandas commands in this pandas tutorial notebook (again, right-click on raw and "save as" to get the actual notebook file so you can run it yourself; you'll also need this input file).
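The vectorization lesson in Chapter 2 can be previewed with a toy comparison. This is a sketch of the general idea, not the textbook's pp. 54-56 example:

```python
import numpy as np
import timeit

x = np.random.default_rng(42).random(1_000_000)

def loop_sum_of_squares(arr):
    # explicit Python loop: one interpreter round-trip per element
    total = 0.0
    for val in arr:
        total += val * val
    return total

def vectorized_sum_of_squares(arr):
    # NumPy pushes the loop down into compiled code
    return np.sum(arr * arr)

# the two give the same answer (to floating-point precision)...
assert np.isclose(loop_sum_of_squares(x), vectorized_sum_of_squares(x))

# ...but the vectorized version is typically orders of magnitude faster
t_loop = timeit.timeit(lambda: loop_sum_of_squares(x), number=3)
t_vec = timeit.timeit(lambda: vectorized_sum_of_squares(x), number=3)
print(f"loop: {t_loop:.3f} s   vectorized: {t_vec:.3f} s")
```

If you can reconstruct why the compiled path wins here, the textbook example should read much more easily.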
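If you want a quick taste of pandas before the 10 Minutes guide, here is a minimal sketch; the catalog values and column names below are made up for illustration, not those of the ECO files:

```python
import pandas as pd

# toy catalog; real work would start from pd.read_csv("ECO_dr1_subset.csv")
df = pd.DataFrame({
    "name": ["g1", "g2", "g3", "g4"],
    "logmstar": [9.5, 10.2, 8.9, 11.0],        # hypothetical column names
    "morphtype": ["late", "early", "late", "early"],
})

# boolean masks and column selection, the bread and butter of pandas
massive = df[df["logmstar"] > 10.0]
print(massive["name"].tolist())                 # ['g2', 'g4']

# group-wise statistics in one line
print(df.groupby("morphtype")["logmstar"].mean())
```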
Laws of Probability, Probability Distributions, Random Sampling, Uncertainties, and Confidence Intervals
- For background, look at these basic statistics slides, Chapter 3 of the textbook, and this commentary on the p-value crisis
- Complete this introductory Monte Carlo Methods tutorial, including examples involving confidence intervals, determining areas, and inverse transform sampling (as discussed in Section 3.7 and Fig. 3.25 of the textbook)
- Review these additional slides on correlation tests, a special case of hypothesis tests
- Download this code and this input file, then uncomment and run each code block sequentially to compare the Spearman Rank and Pearson Correlation Tests. Add the code necessary to include Kendall's tau in the comparison (solution here).
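As a preview of inverse transform sampling (Section 3.7 and Fig. 3.25 of the textbook), here is a minimal sketch for a distribution whose CDF inverts analytically; the exponential is chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# exponential distribution p(x) = lam*exp(-lam*x) has CDF F(x) = 1 - exp(-lam*x);
# inverting F gives x = -ln(1 - u)/lam with u uniform on [0, 1)
lam = 2.0
u = rng.random(100_000)
samples = -np.log(1.0 - u) / lam

# sanity check against the known mean 1/lam and variance 1/lam**2
print(samples.mean(), samples.var())   # both should be near 0.5 and 0.25
```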
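For orientation, the three correlation tests in the exercise above are one scipy.stats call each; a sketch on fake data (not the provided input file):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.random(200)
y = x**2 + 0.1 * rng.normal(size=200)   # monotonic but nonlinear relation

# Pearson assumes a linear relation; Spearman and Kendall only need monotonicity
r_p, p_p = stats.pearsonr(x, y)
r_s, p_s = stats.spearmanr(x, y)
tau, p_t = stats.kendalltau(x, y)

print(f"Pearson r = {r_p:.3f}, Spearman rho = {r_s:.3f}, Kendall tau = {tau:.3f}")
```

Try changing the relation between x and y (or the noise level) and watch how the three statistics diverge.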
This is a complicated topic (!) and we'll take it one step at a time. Most people are vaguely familiar with frequentist methods for fitting functions to data, but haven't really thought deeply about them. We'll dig into frequentist methods first and come back to Bayesian methods later.
- Review these slides on chi-squared values and maximum likelihood fitting and Sections 4.2 and 4.3 of the textbook
- Scientists often describe fitting as "minimizing Chi-Squared" and use the "reduced Chi-Squared" (the minimized value of Chi-Squared normalized by the number of degrees of freedom) as an estimate of goodness of fit, but to a statistician, Chi-Squared describes a probability distribution. To learn to think about Chi-Squared more deeply, complete this Tutorial on Interpreting Chi-Squared, in which you will generate and fit fake data using the parameter-free function y=1/x.
- Complete this Tutorial on Parameter Estimation by Maximum Likelihood Model Fitting, in which you will generate and fit fake data using the function y = slope*x + intercept.
- Everything above assumed idealized data -- real data is generally more complicated to fit, and the "best fit" is not necessarily the same as the "best prediction". You can get a taste of these issues in this Tutorial on Realistic Line Fitting.
- This Tutorial on Bayesian Parameter Estimation is a sequel to the frequentist Tutorial on Parameter Estimation by Maximum Likelihood Model Fitting above. It returns to idealized data, but the same strategy of modifying the likelihood function to incorporate complex errors and biases (as briefly discussed in the Tutorial on Realistic Line Fitting above) works well for both Bayesian and frequentist modeling.
- Tutorial on Bootstrapping
- Repeated bootstrapping can get computationally demanding -- this Un-Tutorial on Multiprocessing explores how to speed up such an embarrassingly parallel computing task.
- This Frequentist & Bayesian Model Selection Tutorial demonstrates how you would decide between a first-order or second-order polynomial fit in a dicey case. The tutorial instructions are written into the code and sample solutions are here.
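The chi-squared tutorial's setup can be previewed in miniature: generate fake data from the parameter-free model y = 1/x and compare the resulting chi-squared to its expected distribution. This is an illustration of the idea, not the tutorial's code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# fake data drawn from the parameter-free model y = 1/x with known errors
x = np.linspace(1.0, 5.0, 50)
sigma = 0.05
y = 1.0 / x + sigma * rng.normal(size=x.size)

# chi-squared of the (fixed, zero-parameter) model against the data
chisq = np.sum(((y - 1.0 / x) / sigma) ** 2)

# with no fitted parameters the degrees of freedom equal the number of points;
# the p-value asks how often noise alone would give a larger chi-squared
dof = x.size
p_value = stats.chi2.sf(chisq, dof)
print(f"chi2 = {chisq:.1f} for {dof} dof, p = {p_value:.2f}")
```

Rerunning with different seeds shows chi-squared scattering around the number of degrees of freedom, which is the statistician's point: chi-squared is a probability distribution, not just a number to minimize.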
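The maximum likelihood line fit can likewise be sketched in a few lines; for gaussian errors, maximizing the likelihood is equivalent to minimizing chi-squared (again an illustration, not the tutorial's code):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(4)

# fake data from y = slope*x + intercept with gaussian errors
true_slope, true_intercept, sigma = 2.0, 1.0, 0.5
x = np.linspace(0.0, 10.0, 100)
y = true_slope * x + true_intercept + sigma * rng.normal(size=x.size)

def neg_log_likelihood(params):
    # for gaussian errors, -ln L is chi-squared/2 plus a constant
    slope, intercept = params
    resid = y - (slope * x + intercept)
    return 0.5 * np.sum((resid / sigma) ** 2)

result = optimize.minimize(neg_log_likelihood, x0=[1.0, 0.0])
slope_fit, intercept_fit = result.x
print(f"slope = {slope_fit:.2f}, intercept = {intercept_fit:.2f}")
```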
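The core Bayesian move (evaluate a posterior over a grid of parameter values) can be sketched for the simplest case, estimating a mean with known sigma under a flat prior; this is an illustration, not the Bayesian tutorial's actual code:

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(loc=3.0, scale=1.0, size=30)   # fake data, sigma = 1 known

# evaluate the log-likelihood of the single parameter mu on a grid
mu_grid = np.linspace(0.0, 6.0, 601)
log_like = np.array([-0.5 * np.sum((data - mu) ** 2) for mu in mu_grid])

# with a flat prior the posterior is just the normalized likelihood
post = np.exp(log_like - log_like.max())   # subtract max to avoid underflow
post /= post.sum()

mu_mean = np.sum(mu_grid * post)
print(f"posterior mean of mu ≈ {mu_mean:.2f}")   # should land near the true 3.0
```

The tutorials replace this brute-force grid with smarter machinery, but the grid version makes the posterior concrete and easy to plot.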
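The bootstrap idea in miniature (not the tutorial's code): resample the data with replacement and watch the spread of the recomputed statistic:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # fake measurements

# bootstrap: resample with replacement and recompute the statistic each time
n_boot = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# the spread of the replicates estimates the uncertainty of the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```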
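Because bootstrap replicates are independent, they parallelize trivially; a minimal multiprocessing sketch (the dataset and seeds here are invented for illustration):

```python
import multiprocessing as mp
import numpy as np

# fake dataset, created at import time so every worker process can see it
DATA = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=200)

def one_bootstrap_mean(seed):
    # each replicate resamples DATA with its own independent seed
    rng = np.random.default_rng(seed)
    return rng.choice(DATA, size=DATA.size, replace=True).mean()

if __name__ == "__main__":
    # replicates don't depend on each other ("embarrassingly parallel"),
    # so a plain pool.map spreads them across cores with no other changes
    with mp.Pool(processes=4) as pool:
        boot_means = pool.map(one_bootstrap_mean, range(2000))
    print(f"bootstrap std of the mean: {np.std(boot_means):.3f}")
```

The `if __name__ == "__main__":` guard matters: on platforms that spawn rather than fork, each worker re-imports the module, and the guard keeps the pool from being created recursively.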
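The model selection question can be previewed with the Bayesian Information Criterion, one common yardstick that penalizes extra parameters; this sketch illustrates the idea and is not necessarily the criterion used in the tutorial:

```python
import numpy as np

rng = np.random.default_rng(7)

# fake data with genuine curvature, plus noise that could mask it
x = np.linspace(-1.0, 1.0, 40)
sigma = 0.3
y = 1.0 + 0.5 * x + 0.8 * x**2 + sigma * rng.normal(size=x.size)

def bic_for_degree(deg):
    # least-squares polynomial fit; for gaussian errors chi2 = -2 ln L + const
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    chisq = np.sum((resid / sigma) ** 2)
    k = deg + 1                        # number of fitted parameters
    return chisq + k * np.log(x.size)  # BIC: penalize each extra parameter

bic1, bic2 = bic_for_degree(1), bic_for_degree(2)
print(f"BIC(linear) = {bic1:.1f}, BIC(quadratic) = {bic2:.1f}")
# the lower BIC wins; here the quadratic term is real, so degree 2 should win
```

Shrinking the quadratic coefficient or raising the noise makes the case "dicey," which is exactly the regime the tutorial explores.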