Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data
Free download from UNC Libraries: search on "Ivezic"
Python library (astroML) and Python code to make the book figures here
Basics
Wireless access for your laptop
Git prep
Linux tutorial
After completing the Linux tutorial, you should make a directory in /afs/cas.unc.edu/users/y/o/yourname/public to hold the rest of your boot camp work. Your initial disk space allocation is likely to be sufficient, but if it runs out, we can arrange additional space. You can work directly on any Linux machine in Hell's Kitchen (the astro computing lab), or if these machines are not powerful enough, feel free to ssh into stardust.astro.unc.edu or blarney.castle.unc.edu.
Linux bonus tracks
vi tutorial
Even if you prefer emacs or another programming editor, you should learn the basics of vi, because you may sometimes find yourself inadvertently dumped into vi when using Git or Linux software. Note that vi is installed by default on Linux/Mac and comes with Git Bash for Windows (see "Git prep" above under Basics).
emacs installation
Emacs is the primary alternative to vi and there are long arguments about which is better. Optionally install emacs for Windows or Mac and run the built-in tutorial in the emacs help menu. (FWIW, your instructor uses emacs.)
- Anaconda installation and basic data analysis tutorial
- Programming tutorial
- Browse Chapter 1 of the AstroML textbook (reading 1.6 more closely) and download/play with the code for Figs. 1.9-1.12
- Read Appendix A of the AstroML textbook and try out the commands
- Browse Chapter 2 and take some time to study the vectorization example on pp. 54-56 to reconstruct why it works
- Debug and speed up this template code or, preferably, this protected version of the same code after consulting these Programming Best Practices; make sure to read the instructions at the bottom of the code. The pdb package isn't necessary for such a short code, but try it anyway to see how it works. When you think you've found everything, discuss with a partner and/or the instructor.
- Optional: Use these jupyter notebook quickstart instructions to examine the example jupyter notebook called `ExploreRESOLVEandECO.ipynb` (found in the current repo; download by clicking "raw" then right-clicking the raw contents and choosing "save as"). You can run this notebook partway through if you also download the `ECO_dr1_subset.csv` input file, also provided in this repo. If you like the idea of being able to work in notebooks like this, then you can get comfortable with them first by finishing the example notebook (this effort will also give you a small taste of the sql database query language), then by creating your own jupyter notebook from scratch. For example, you could use ECO DR1 to plot stellar mass vs. environment distributions and compare them for early type and late type galaxies. Try raiding code from at least one of Figures 1.9-1.12 in the textbook. NOTE: If/when you launch your first jupyter notebook under linux, you'll get a question about a kernel choice -- just click OK and you should be all set!
- Optional: Check out the 10 Minutes to pandas guide to see if you'd like to learn more about this powerful data manipulation package. If you've learned about jupyter notebooks, you can play with some useful pandas commands in this pandas tutorial notebook (again, right-click on raw and "save as" to get the actual notebook file so you can run it yourself; you'll also need this input file).
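The vectorization lesson in Chapter 2 can be previewed with a toy comparison. This is a sketch of the general idea, not the textbook's pp. 54-56 example:

```python
import numpy as np
import timeit

x = np.random.default_rng(42).random(1_000_000)

def loop_sum_of_squares(arr):
    # explicit Python loop: one interpreter round-trip per element
    total = 0.0
    for val in arr:
        total += val * val
    return total

def vectorized_sum_of_squares(arr):
    # NumPy pushes the loop down into compiled code
    return np.sum(arr * arr)

# the two give the same answer (to floating-point precision)...
assert np.isclose(loop_sum_of_squares(x), vectorized_sum_of_squares(x))

# ...but the vectorized version is typically orders of magnitude faster
t_loop = timeit.timeit(lambda: loop_sum_of_squares(x), number=3)
t_vec = timeit.timeit(lambda: vectorized_sum_of_squares(x), number=3)
print(f"loop: {t_loop:.3f} s   vectorized: {t_vec:.3f} s")
```

If you can reconstruct why the compiled path wins here, the textbook example should read much more easily.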
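If you want a quick taste of pandas before the 10 Minutes guide, here is a minimal sketch; the catalog values and column names below are made up for illustration, not those of the ECO files:

```python
import pandas as pd

# toy catalog; real work would start from pd.read_csv("ECO_dr1_subset.csv")
df = pd.DataFrame({
    "name": ["g1", "g2", "g3", "g4"],
    "logmstar": [9.5, 10.2, 8.9, 11.0],        # hypothetical column names
    "morphtype": ["late", "early", "late", "early"],
})

# boolean masks and column selection, the bread and butter of pandas
massive = df[df["logmstar"] > 10.0]
print(massive["name"].tolist())                 # ['g2', 'g4']

# group-wise statistics in one line
print(df.groupby("morphtype")["logmstar"].mean())
```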
Laws of Probability, Probability Distributions, Random Sampling, Uncertainties, and Confidence Intervals
- For background, look at these basic statistics slides, Chapter 3 of the textbook, and this commentary on the p-value crisis
- Complete this introductory Monte Carlo Methods tutorial, including examples involving confidence intervals, determining areas, and inverse transform sampling (as discussed in Section 3.7 and Fig. 3.25 of the textbook)
- Review these additional slides on correlation tests, a special case of hypothesis tests
- Download this code and this input file, then uncomment and run each code block sequentially to compare the Spearman Rank and Pearson Correlation Tests. Add the code necessary to include Kendall's tau in the comparison (solution here).
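As a preview of inverse transform sampling (Section 3.7 and Fig. 3.25 of the textbook), here is a minimal sketch for a distribution whose CDF inverts analytically; the exponential is chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# exponential distribution p(x) = lam*exp(-lam*x) has CDF F(x) = 1 - exp(-lam*x);
# inverting F gives x = -ln(1 - u)/lam with u uniform on [0, 1)
lam = 2.0
u = rng.random(100_000)
samples = -np.log(1.0 - u) / lam

# sanity check against the known mean 1/lam and variance 1/lam**2
print(samples.mean(), samples.var())   # both should be near 0.5 and 0.25
```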
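For orientation, the three correlation tests in the exercise above are one scipy.stats call each; a sketch on fake data (not the provided input file):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.random(200)
y = x**2 + 0.1 * rng.normal(size=200)   # monotonic but nonlinear relation

# Pearson assumes a linear relation; Spearman and Kendall only need monotonicity
r_p, p_p = stats.pearsonr(x, y)
r_s, p_s = stats.spearmanr(x, y)
tau, p_t = stats.kendalltau(x, y)

print(f"Pearson r = {r_p:.3f}, Spearman rho = {r_s:.3f}, Kendall tau = {tau:.3f}")
```

Try changing the relation between x and y (or the noise level) and watch how the three statistics diverge.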
This is a complicated topic (!) and we'll take it one step at a time. Most people are vaguely familiar with frequentist methods for fitting functions to data, but haven't really thought deeply about them. We'll dig into frequentist methods first and come back to Bayesian methods later.
- Review these slides on chi-squared values and maximum likelihood fitting and Sections 4.2 and 4.3 of the textbook
- Scientists often describe fitting as "minimizing Chi-Squared" and use the "reduced Chi-Squared" (the minimized value of Chi-Squared normalized by the number of degrees of freedom) as an estimate of goodness of fit, but to a statistician, Chi-Squared describes a probability distribution. To learn to think about Chi-Squared more deeply, complete this Tutorial on Interpreting Chi-Squared, in which you will generate and fit fake data using the parameter-free function y=1/x.
- Complete this Tutorial on Parameter Estimation by Maximum Likelihood Model Fitting, in which you will generate and fit fake data using the function y = slope*x + intercept.
- Everything above assumed idealized data -- real data is generally more complicated to fit, and the "best fit" is not necessarily the same as the "best prediction". You can get a taste of these issues in this Tutorial on Realistic Line Fitting.
- This Tutorial on Bayesian Parameter Estimation is a sequel to the frequentist Tutorial on Parameter Estimation by Maximum Likelihood Model Fitting above. It returns to idealized data, but the same strategy of modifying the likelihood function to incorporate complex errors and biases (as briefly discussed in the Tutorial on Realistic Line Fitting above) works well for both Bayesian and frequentist modeling.
- Tutorial on Bootstrapping
- Repeated bootstrapping can get computationally demanding -- this Un-Tutorial on Multiprocessing explores how to speed up such an embarrassingly parallel computing task.
- This Frequentist & Bayesian Model Selection Tutorial demonstrates how you would decide between a first-order or second-order polynomial fit in a dicey case. The tutorial instructions are written into the code and sample solutions are here.
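The chi-squared tutorial's setup can be previewed in miniature: generate fake data from the parameter-free model y = 1/x and compare the resulting chi-squared to its expected distribution. This is an illustration of the idea, not the tutorial's code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# fake data drawn from the parameter-free model y = 1/x with known errors
x = np.linspace(1.0, 5.0, 50)
sigma = 0.05
y = 1.0 / x + sigma * rng.normal(size=x.size)

# chi-squared of the (fixed, zero-parameter) model against the data
chisq = np.sum(((y - 1.0 / x) / sigma) ** 2)

# with no fitted parameters the degrees of freedom equal the number of points;
# the p-value asks how often noise alone would give a larger chi-squared
dof = x.size
p_value = stats.chi2.sf(chisq, dof)
print(f"chi2 = {chisq:.1f} for {dof} dof, p = {p_value:.2f}")
```

Rerunning with different seeds shows chi-squared scattering around the number of degrees of freedom, which is the statistician's point: chi-squared is a probability distribution, not just a number to minimize.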
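The maximum likelihood line fit can likewise be sketched in a few lines; for gaussian errors, maximizing the likelihood is equivalent to minimizing chi-squared (again an illustration, not the tutorial's code):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(4)

# fake data from y = slope*x + intercept with gaussian errors
true_slope, true_intercept, sigma = 2.0, 1.0, 0.5
x = np.linspace(0.0, 10.0, 100)
y = true_slope * x + true_intercept + sigma * rng.normal(size=x.size)

def neg_log_likelihood(params):
    # for gaussian errors, -ln L is chi-squared/2 plus a constant
    slope, intercept = params
    resid = y - (slope * x + intercept)
    return 0.5 * np.sum((resid / sigma) ** 2)

result = optimize.minimize(neg_log_likelihood, x0=[1.0, 0.0])
slope_fit, intercept_fit = result.x
print(f"slope = {slope_fit:.2f}, intercept = {intercept_fit:.2f}")
```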
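The core Bayesian move (evaluate a posterior over a grid of parameter values) can be sketched for the simplest case, estimating a mean with known sigma under a flat prior; this is an illustration, not the Bayesian tutorial's actual code:

```python
import numpy as np

rng = np.random.default_rng(8)
data = rng.normal(loc=3.0, scale=1.0, size=30)   # fake data, sigma = 1 known

# evaluate the log-likelihood of the single parameter mu on a grid
mu_grid = np.linspace(0.0, 6.0, 601)
log_like = np.array([-0.5 * np.sum((data - mu) ** 2) for mu in mu_grid])

# with a flat prior the posterior is just the normalized likelihood
post = np.exp(log_like - log_like.max())   # subtract max to avoid underflow
post /= post.sum()

mu_mean = np.sum(mu_grid * post)
print(f"posterior mean of mu ≈ {mu_mean:.2f}")   # should land near the true 3.0
```

The tutorials replace this brute-force grid with smarter machinery, but the grid version makes the posterior concrete and easy to plot.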
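The bootstrap idea in miniature (not the tutorial's code): resample the data with replacement and watch the spread of the recomputed statistic:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # fake measurements

# bootstrap: resample with replacement and recompute the statistic each time
n_boot = 5000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])

# the spread of the replicates estimates the uncertainty of the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {data.mean():.2f}, 95% CI ≈ [{lo:.2f}, {hi:.2f}]")
```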
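Because bootstrap replicates are independent, they parallelize trivially; a minimal multiprocessing sketch (the dataset and seeds here are invented for illustration):

```python
import multiprocessing as mp
import numpy as np

# fake dataset, created at import time so every worker process can see it
DATA = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=200)

def one_bootstrap_mean(seed):
    # each replicate resamples DATA with its own independent seed
    rng = np.random.default_rng(seed)
    return rng.choice(DATA, size=DATA.size, replace=True).mean()

if __name__ == "__main__":
    # replicates don't depend on each other ("embarrassingly parallel"),
    # so a plain pool.map spreads them across cores with no other changes
    with mp.Pool(processes=4) as pool:
        boot_means = pool.map(one_bootstrap_mean, range(2000))
    print(f"bootstrap std of the mean: {np.std(boot_means):.3f}")
```

The `if __name__ == "__main__":` guard matters: on platforms that spawn rather than fork, each worker re-imports the module, and the guard keeps the pool from being created recursively.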
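The model selection question can be previewed with the Bayesian Information Criterion, one common yardstick that penalizes extra parameters; this sketch illustrates the idea and is not necessarily the criterion used in the tutorial:

```python
import numpy as np

rng = np.random.default_rng(7)

# fake data with genuine curvature, plus noise that could mask it
x = np.linspace(-1.0, 1.0, 40)
sigma = 0.3
y = 1.0 + 0.5 * x + 0.8 * x**2 + sigma * rng.normal(size=x.size)

def bic_for_degree(deg):
    # least-squares polynomial fit; for gaussian errors chi2 = -2 ln L + const
    coeffs = np.polyfit(x, y, deg)
    resid = y - np.polyval(coeffs, x)
    chisq = np.sum((resid / sigma) ** 2)
    k = deg + 1                        # number of fitted parameters
    return chisq + k * np.log(x.size)  # BIC: penalize each extra parameter

bic1, bic2 = bic_for_degree(1), bic_for_degree(2)
print(f"BIC(linear) = {bic1:.1f}, BIC(quadratic) = {bic2:.1f}")
# the lower BIC wins; here the quadratic term is real, so degree 2 should win
```

Shrinking the quadratic coefficient or raising the noise makes the case "dicey," which is exactly the regime the tutorial explores.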