diff --git a/lectures/_static/lecture_specific/cross_section/light_heavy_fig1.png b/lectures/_static/lecture_specific/cross_section/light_heavy_fig1.png new file mode 100644 index 000000000..2035bd181 Binary files /dev/null and b/lectures/_static/lecture_specific/cross_section/light_heavy_fig1.png differ diff --git a/lectures/_static/lecture_specific/cross_section/rank_size_fig1.png b/lectures/_static/lecture_specific/cross_section/rank_size_fig1.png new file mode 100644 index 000000000..872dc6144 Binary files /dev/null and b/lectures/_static/lecture_specific/cross_section/rank_size_fig1.png differ diff --git a/lectures/_toc.yml b/lectures/_toc.yml index 9750900a4..f053246e1 100644 --- a/lectures/_toc.yml +++ b/lectures/_toc.yml @@ -20,6 +20,7 @@ parts: - file: inflation_unemployment - file: inequality - file: distributions + - file: cross_section - caption: Useful Tools numbered: true chapters: diff --git a/lectures/cross_section.md b/lectures/cross_section.md new file mode 100644 index 000000000..fc538afaf --- /dev/null +++ b/lectures/cross_section.md @@ -0,0 +1,885 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.14.5 +kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +# Heavy-Tailed Distributions + +```{contents} Contents +:depth: 2 +``` + +In addition to what's in Anaconda, this lecture will need the following libraries: + +```{code-cell} ipython3 +:tags: [hide-output] + +!pip install --upgrade yfinance quantecon pandas_datareader interpolation +``` + +We run the following code to prepare for the lecture: + +```{code-cell} ipython3 +%matplotlib inline +import matplotlib.pyplot as plt +plt.rcParams["figure.figsize"] = (11, 5) #set default figure size +import numpy as np +import quantecon as qe +import yfinance as yf +import pandas as pd +import pandas_datareader.data as web +import statsmodels.api as sm + +from interpolation import interp +from pandas_datareader import wb +from scipy.stats import norm, cauchy +from pandas.plotting import register_matplotlib_converters +register_matplotlib_converters() +``` + +## Overview + +TODO -- explain that there is a more detailed and more mathematical discussion of heavy tails in https://python.quantecon.org/heavy_tails.html + +In this section we give some motivation for the lecture. + +### Introduction: Light Tails + +Most commonly used probability distributions in classical statistics and +the natural sciences have "light tails." + +To explain this concept, let's look first at examples. + +The classic example is the [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), which has density + +$$ f(x) = \frac{1}{\sqrt{2\pi}\sigma} + \exp\left( -\frac{(x-\mu)^2}{2 \sigma^2} \right) +$$ + +on the real line $\mathbb R = (-\infty, \infty)$. + +The two parameters $\mu$ and $\sigma$ are the mean and standard deviation +respectively. + +As $x$ deviates from $\mu$, the value of $f(x)$ goes to zero extremely +quickly. + +We can see this when we plot the density and show a histogram of observations, +as with the following code (which assumes $\mu=0$ and $\sigma=1$). + +```{code-cell} ipython3 +fig, ax = plt.subplots() +X = norm.rvs(size=1_000_000) +ax.hist(X, bins=40, alpha=0.4, label='histogram', density=True) +x_grid = np.linspace(-4, 4, 400) +ax.plot(x_grid, norm.pdf(x_grid), label='density') +ax.legend() +plt.show() +``` + +Notice how + +* the density's tails converge quickly to zero in both directions and +* even with 1,000,000 draws, we get no very large or very small observations. + +We can see the last point more clearly by executing + +```{code-cell} ipython3 +X.min(), X.max() +``` + +Here's another view of draws from the same distribution: + +```{code-cell} ipython3 +n = 2000 +fig, ax = plt.subplots() +data = norm.rvs(size=n) +ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4) +ax.vlines(list(range(n)), 0, data, lw=0.2) +ax.set_ylim(-15, 15) +ax.set_xlabel('$i$') +ax.set_ylabel('$X_i$', rotation=0) +plt.show() +``` + +We have plotted each individual draw $X_i$ against $i$. + +None are very large or very small. + +In other words, extreme observations are rare and draws tend not to deviate +too much from the mean. + +As a result, many statisticians and econometricians +use rules of thumb such as "outcomes more than four or five +standard deviations from the mean can safely be ignored." + + +### When Are Light Tails Valid? + +Distributions that rarely generate extreme values are called light-tailed. + +For example, human height is light-tailed. + +Yes, it's true that we see some very tall people. + +* For example, basketballer [Sun Mingming](https://en.wikipedia.org/wiki/Sun_Mingming) is 2.32 meters tall + +But have you ever heard of someone who is 20 meters tall? Or 200? Or 2000? + +Have you ever wondered why not? + +After all, there are 8 billion people in the world! + +In essence, the reason we don't see such draws is that the distribution of +human high has very light tails. + +In fact human height is approximately normally distributed. + + +### Returns on Assets + + +But now we have to ask: does economic data always look like this? + +Let's look at some financial data first. + +Our aim is to plot the daily change in the price of Amazon (AMZN) stock for +the period from 1st January 2015 to 1st July 2022. + +This equates to daily returns if we set dividends aside. + +The code below produces the desired plot using Yahoo financial data via the `yfinance` library. + +```{code-cell} ipython3 +s = yf.download('AMZN', '2015-1-1', '2022-7-1')['Adj Close'] +r = s.pct_change() + +fig, ax = plt.subplots() + +ax.plot(r, linestyle='', marker='o', alpha=0.5, ms=4) +ax.vlines(r.index, 0, r.values, lw=0.2) +ax.set_ylabel('returns', fontsize=12) +ax.set_xlabel('date', fontsize=12) + +plt.show() +``` + +This data looks different to the draws from the normal distribution. + +Several of observations are quite extreme. + +We get a similar picture if we look at other assets, such as Bitcoin + +```{code-cell} ipython3 +s = yf.download('BTC-USD', '2015-1-1', '2022-7-1')['Adj Close'] +r = s.pct_change() + +fig, ax = plt.subplots() + +ax.plot(r, linestyle='', marker='o', alpha=0.5, ms=4) +ax.vlines(r.index, 0, r.values, lw=0.2) +ax.set_ylabel('returns', fontsize=12) +ax.set_xlabel('date', fontsize=12) + +plt.show() +``` + +The histogram also looks different to the histogram of the normal +distribution: + +```{code-cell} ipython3 +fig, ax = plt.subplots() +ax.hist(r, bins=60, alpha=0.4, label='bitcoin returns', density=True) +ax.set_xlabel('returns', fontsize=12) +plt.show() +``` + +If we look at higher frequency returns data (e.g., tick-by-tick), we often see +even more extreme observations. + +See, for example, {cite}`mandelbrot1963variation` or {cite}`rachev2003handbook`. + + +### Other Data + +The data we have just seen is said to be "heavy-tailed". + +With heavy-tailed distributions, extreme outcomes occur relatively +frequently. + +(A more careful definition is given below) + +Importantly, there are many examples of heavy-tailed distributions +observed in economic and financial settings include + +For example, the income and the wealth distributions are heavy-tailed +(see, e.g., {cite}`pareto1896cours`, {cite}`benhabib2018skewed`). + +* You can imagine this: most people have low or modest wealth but some people + are extremely rich. + +The firm size distribution is also heavy-tailed +({cite}`axtell2001zipf`, {cite}`gabaix2016power`). + +* You can imagine this too: most firms are small but some firms are enormous. + +The distribution of town and city sizes is heavy-tailed +({cite}`rozenfeld2011area`, {cite}`gabaix2016power`). + +* Most towns and cities are small but some are very large. + + +### Why Should We Care? + +Heavy tails are common in economic data but does that mean they are important? + +The answer to this question is affirmative! + +When distributions are heavy-tailed, we need to think carefully about issues +like + +* diversification and risk +* forecasting +* taxation (across a heavy-tailed income distribution), etc. + +We return to these points below. + + + +## Visual Comparisons + +Let's do some more visual comparisons to help us build intuition on the +difference between light and heavy tails. + + +The figure below shows a simulation. (You will be asked to replicate it in +the exercises.) + +The top two subfigures each show 120 independent draws from the normal +distribution, which is light-tailed. + +The bottom subfigure shows 120 independent draws from [the Cauchy +distribution](https://en.wikipedia.org/wiki/Cauchy_distribution), which is +heavy-tailed. + +(light_heavy_fig1)= +```{figure} /_static/lecture_specific/cross_section/light_heavy_fig1.png + +``` + +In the top subfigure, the standard deviation of the normal distribution is 2, +and the draws are clustered around the mean. + +In the middle subfigure, the standard deviation is increased to 12 and, as +expected, the amount of dispersion rises. + +The bottom subfigure, with the Cauchy draws, shows a different pattern: tight +clustering around the mean for the great majority of observations, combined +with a few sudden large deviations from the mean. + +This is typical of a heavy-tailed distribution. + + +## Heavy Tails in Economic Cross-Sectional Distributions + +TODO + +- Shu please add data and plots. +- Please use empirical CCDF, as in the macro dynamics book + + + +TODO Review exercises below --- are they all too hard for undergrads or should +we keep some of them. + +```{code-cell} ipython3 +def extract_wb(varlist=['NY.GDP.MKTP.CD'], + c='all', + s=1900, + e=2021, + varnames=None): + if c == "all_countries": + # keep countries only (no aggregated regions) + countries = wb.get_countries() + countries_code = countries[countries['region'] != 'Aggregates']['iso3c'].values + + df = wb.download(indicator=varlist, country=countries_code, start=s, end=e).stack().unstack(0).reset_index() + df = df.drop(['level_1'], axis=1).transpose() # set_index(['year']) + if varnames != None: + df.columns = varnames + df = df[1:] + return df +``` + +```{code-cell} ipython3 +def empirical_ccdf(data, + ax, + aw=None, # weights + label=None, + xlabel=None, + add_reg_line=False, + title=None): + """ + Take data vector and return prob values for plotting. + Upgraded empirical_ccdf + """ + y_vals = np.empty_like(data, dtype='float64') + p_vals = np.empty_like(data, dtype='float64') + n = len(data) + if aw is None: + for i, d in enumerate(data): + # record fraction of sample above d + y_vals[i] = np.sum(data >= d) / n + p_vals[i] = np.sum(data == d) / n + else: + fw = np.empty_like(aw, dtype='float64') + for i, a in enumerate(aw): + fw[i] = a / np.sum(aw) + pdf = lambda x: interp(data, fw, x) + data = np.sort(data) + j = 0 + for i, d in enumerate(data): + j += pdf(d) + y_vals[i] = 1- j + + x, y = np.log(data), np.log(y_vals) + + results = sm.OLS(y, sm.add_constant(x)).fit() + b, a = results.params + + kwargs = [('alpha', 0.3)] + if label: + kwargs.append(('label', label)) + kwargs = dict(kwargs) + + ax.scatter(x, y, **kwargs) + if add_reg_line: + ax.plot(x, x * a + b, 'k-', alpha=0.6, label=f"slope = ${a: 1.2f}$") + if not xlabel: + xlabel='log value' + ax.set_xlabel(xlabel, fontsize=12) + ax.set_ylabel("log prob.", fontsize=12) + + if label: + ax.legend(loc='lower left', fontsize=12) + + if title: + ax.set_title(title) + + return np.log(data), y_vals, p_vals +``` + +### GDP + +```{code-cell} ipython3 +# get gdp and gdp per capita for all regions and countries in 2021 + +variable_code = ['NY.GDP.MKTP.CD', 'NY.GDP.PCAP.CD'] +variable_names = ['GDP', 'GDP per capita'] + +df_gdp1 = extract_wb(varlist=variable_code, + c="all_countries", + s="2021", + e="2021", + varnames=variable_names) +``` + +```{code-cell} ipython3 +fig, axes = plt.subplots(1, 2, figsize=(8.8, 3.6)) + +for name, ax in zip(variable_names, axes): + empirical_ccdf(np.asarray(df_gdp1[name]).astype("float64"), ax, add_reg_line=False, label=name) + +plt.show() +``` + +### Firm size + +```{code-cell} ipython3 +df_fs = pd.read_csv('https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_csdata/cross_section/forbes-global2000.csv') +df_fs = df_fs[['Country', 'Sales', 'Profits', 'Assets', 'Market Value']] +``` + +```{code-cell} ipython3 +fig, ax = plt.subplots(figsize=(6.4, 3.5)) + +label="firm size (market value)" + +d = df_fs.sort_values('Market Value', ascending=False) + +empirical_ccdf(np.asarray(d['Market Value'])[0:500], ax, label=label, add_reg_line=True) + +plt.show() +``` + +### City size + +```{code-cell} ipython3 +df_cs_us = pd.read_csv('https://raw.githubusercontent.com/QuantEcon/high_dim_data/update_csdata/cross_section/cities_us.txt', delimiter="\t", header=None) +df_cs_us = df_cs_us[[0, 3]] +df_cs_us.columns = 'rank', 'pop' +x = np.asarray(df_cs_us['pop']) +citysize = [] +for i in x: + i = i.replace(",", "") + citysize.append(int(i)) +df_cs_us['pop'] = citysize +``` + +```{code-cell} ipython3 +df_cs_br = pd.read_csv('https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_csdata/cross_section/cities_brazil.csv', delimiter=",", header=None) +df_cs_br.columns = df_cs_br.iloc[0] +df_cs_br = df_cs_br[1:401] +df_cs_br = df_cs_br.astype({"pop2023": float}) +``` + +```{code-cell} ipython3 +fig, axes = plt.subplots(1, 2, figsize=(8.8, 3.6)) + + +empirical_ccdf(np.asarray(df_cs_us['pop']), axes[0], label="US", add_reg_line=True) +empirical_ccdf(np.asarray(df_cs_br['pop2023']), axes[1], label="Brazil", add_reg_line=True) + +plt.show() +``` + +### Wealth + +```{code-cell} ipython3 +df_w = pd.read_csv('https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_csdata/cross_section/forbes-billionaires.csv') +df_w = df_w[['country', 'realTimeWorth', 'realTimeRank']].dropna() +df_w = df_w.astype({'realTimeRank': int}) +df_w = df_w.sort_values('realTimeRank', ascending=True).copy() +``` + +```{code-cell} ipython3 +countries = ['United States', 'Japan', 'India', 'Italy'] +N = len(countries) + +fig, axs = plt.subplots(2, 2, figsize=(8, 6)) +axs = axs.flatten() + +for i, c in enumerate(countries): + df_w_c = df_w[df_w['country'] == c].reset_index() + z = np.asarray(df_w_c['realTimeWorth']) + # print('number of the global richest 2000 from '+ c, len(z)) + if len(z) <= 500: # cut-off number: top 500 + z = z[0:500] + + empirical_ccdf(z[0:500], axs[i], label=c, xlabel='log wealth', add_reg_line=True) + +fig.tight_layout() + +plt.show() +``` + +## Pareto Tails + +TODO Hi John I added this part with equations you cited below from lecture heavy_tails + +One specific class of heavy-tailed distributions has been found repeatedly in +economic and social phenomena: the class of so-called power laws. + +Specifically, given $\alpha > 0$, a nonnegative random variable $X$ is said to +have a **Pareto tail** with **tail index** $\alpha$ if + +```{math} +:label: plrt + +\lim_{x \to \infty} x^\alpha \, \mathbb P\{X > x\} = c. +``` + +The limit {eq}`plrt` implies the existence of positive constants $b$ and $\bar x$ +such that $\mathbb P\{X > x\} \geq b x^{- \alpha}$ whenever $x \geq \bar x$. + +The implication is that $\mathbb P\{X > x\}$ converges to zero no faster than $x^{-\alpha}$. + +In some sources, a random variable obeying {eq}`plrt` is said to have a **power law tail**. + +One example is the [Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution). + +If $X$ has the Pareto distribution, then there are positive constants $\bar x$ +and $\alpha$ such that + +```{math} +:label: pareto + +\mathbb P\{X > x\} = +\begin{cases} + \left( \bar x/x \right)^{\alpha} + & \text{ if } x \geq \bar x + \\ + 1 + & \text{ if } x < \bar x +\end{cases} +``` + +It is easy to see that $\mathbb P\{X > x\}$ satisfies {eq}`plrt`. + +Thus, in line with the terminology, Pareto distributed random variables have a Pareto tail. + +(rank_size_fig1)= +```{figure} /_static/lecture_specific/cross_section/rank_size_fig1.png + +``` + +## Exercises + +```{exercise} +:label: ht_ex1 + +Replicate {ref}`the figure presented above ` that compares normal and Cauchy draws. + +Use `np.random.seed(11)` to set the seed. +``` + +```{solution-start} ht_ex1 +:class: dropdown +``` + +```{code-cell} ipython3 +n = 120 +np.random.seed(11) + +fig, axes = plt.subplots(3, 1, figsize=(6, 12)) + +for ax in axes: + ax.set_ylim((-120, 120)) + +s_vals = 2, 12 + +for ax, s in zip(axes[:2], s_vals): + data = np.random.randn(n) * s + ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4) + ax.vlines(list(range(n)), 0, data, lw=0.2) + ax.set_title(f"draws from $N(0, \sigma^2)$ with $\sigma = {s}$", fontsize=11) + +ax = axes[2] +distribution = cauchy() +data = distribution.rvs(n) +ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4) +ax.vlines(list(range(n)), 0, data, lw=0.2) +ax.set_title(f"draws from the Cauchy distribution", fontsize=11) + +plt.subplots_adjust(hspace=0.25) + +plt.show() +``` + +```{solution-end} +``` + + +```{exercise} +:label: ht_ex2 + +Prove: If $X$ has a Pareto tail with tail index $\alpha$, then +$\mathbb E[X^r] = \infty$ for all $r \geq \alpha$. +``` + +```{solution-start} ht_ex2 +:class: dropdown +``` + +Let $X$ have a Pareto tail with tail index $\alpha$ and let $F$ be its cdf. + +Fix $r \geq \alpha$. + +As discussed after {eq}`plrt`, we can take positive constants $b$ and $\bar x$ such that + +$$ +\mathbb P\{X > x\} \geq b x^{- \alpha} \text{ whenever } x \geq \bar x +$$ + +But then + +$$ +\mathbb E X^r = r \int_0^\infty x^{r-1} \mathbb P\{ X > x \} x +\geq +r \int_0^{\bar x} x^{r-1} \mathbb P\{ X > x \} x ++ r \int_{\bar x}^\infty x^{r-1} b x^{-\alpha} x. +$$ + +We know that $\int_{\bar x}^\infty x^{r-\alpha-1} x = \infty$ whenever $r - \alpha - 1 \geq -1$. + +Since $r \geq \alpha$, we have $\mathbb E X^r = \infty$. + +```{solution-end} +``` + + +```{exercise} +:label: ht_ex3 + +Repeat exercise 1, but replace the three distributions (two normal, one +Cauchy) with three Pareto distributions using different choices of +$\alpha$. + +For $\alpha$, try 1.15, 1.5 and 1.75. + +Use `np.random.seed(11)` to set the seed. +``` + + +```{solution-start} ht_ex3 +:class: dropdown +``` + +```{code-cell} ipython3 +from scipy.stats import pareto + +np.random.seed(11) + +n = 120 +alphas = [1.15, 1.50, 1.75] + +fig, axes = plt.subplots(3, 1, figsize=(6, 8)) + +for (a, ax) in zip(alphas, axes): + ax.set_ylim((-5, 50)) + data = pareto.rvs(size=n, scale=1, b=a) + ax.plot(list(range(n)), data, linestyle='', marker='o', alpha=0.5, ms=4) + ax.vlines(list(range(n)), 0, data, lw=0.2) + ax.set_title(f"Pareto draws with $\\alpha = {a}$", fontsize=11) + +plt.subplots_adjust(hspace=0.4) + +plt.show() +``` + +```{solution-end} +``` + + +```{exercise} +:label: ht_ex4 + + +Replicate the rank-size plot figure {ref}`presented above `. + +If you like you can use the function `qe.rank_size` from the `quantecon` library to generate the plots. + +Use `np.random.seed(13)` to set the seed. +``` + +```{solution-start} ht_ex4 +:class: dropdown +``` + +First let's generate the data for the plots: + +```{code-cell} ipython3 +sample_size = 1000 +np.random.seed(13) +z = np.random.randn(sample_size) + +data_1 = np.abs(z) +data_2 = np.exp(z) +data_3 = np.exp(np.random.exponential(scale=1.0, size=sample_size)) + +data_list = [data_1, data_2, data_3] +``` + +Now we plot the data: + +```{code-cell} ipython3 +fig, axes = plt.subplots(3, 1, figsize=(6, 8)) +axes = axes.flatten() +labels = ['$|z|$', '$\exp(z)$', 'Pareto with tail index $1.0$'] + +for data, label, ax in zip(data_list, labels, axes): + + rank_data, size_data = qe.rank_size(data) + + ax.loglog(rank_data, size_data, 'o', markersize=3.0, alpha=0.5, label=label) + ax.set_xlabel("log rank") + ax.set_ylabel("log size") + + ax.legend() + +fig.subplots_adjust(hspace=0.4) + +plt.show() +``` + +```{solution-end} +``` + +```{exercise} +:label: ht_ex5 + +There is an ongoing argument about whether the firm size distribution should +be modeled as a Pareto distribution or a lognormal distribution (see, e.g., +{cite}`fujiwara2004pareto`, {cite}`kondo2018us` or {cite}`schluter2019size`). + +This sounds esoteric but has real implications for a variety of economic +phenomena. + +To illustrate this fact in a simple way, let us consider an economy with +100,000 firms, an interest rate of `r = 0.05` and a corporate tax rate of +15%. + +Your task is to estimate the present discounted value of projected corporate +tax revenue over the next 10 years. + +Because we are forecasting, we need a model. + +We will suppose that + +1. the number of firms and the firm size distribution (measured in profits) remain fixed and +1. the firm size distribution is either lognormal or Pareto. + +Present discounted value of tax revenue will be estimated by + +1. generating 100,000 draws of firm profit from the firm size distribution, +1. multiplying by the tax rate, and +1. summing the results with discounting to obtain present value. + +If $X$ has the Pareto distribution, then there are positive constants $\bar x$ +and $\alpha$ such that + +The Pareto distribution is assumed to take the form {eq}`pareto` with $\bar x = 1$ and $\alpha = 1.05$. + +(The value the tail index $\alpha$ is plausible given the data {cite}`gabaix2016power`.) + +To make the lognormal option as similar as possible to the Pareto option, choose +its parameters such that the mean and median of both distributions are the same. + +Note that, for each distribution, your estimate of tax revenue will be random +because it is based on a finite number of draws. + +To take this into account, generate 100 replications (evaluations of tax revenue) +for each of the two distributions and compare the two samples by + +* producing a [violin plot](https://en.wikipedia.org/wiki/Violin_plot) visualizing the two samples side-by-side and +* printing the mean and standard deviation of both samples. + +For the seed use `np.random.seed(1234)`. + +What differences do you observe? + +(Note: a better approach to this problem would be to model firm dynamics and +try to track individual firms given the current distribution. We will discuss +firm dynamics in later lectures.) +``` + +```{solution-start} ht_ex5 +:class: dropdown +``` + +To do the exercise, we need to choose the parameters $\mu$ +and $\sigma$ of the lognormal distribution to match the mean and median +of the Pareto distribution. + +Here we understand the lognormal distribution as that of the random variable +$\exp(\mu + \sigma Z)$ when $Z$ is standard normal. + +The mean and median of the Pareto distribution {eq}`pareto` with +$\bar x = 1$ are + +$$ +\text{mean } = \frac{\alpha}{\alpha - 1} +\quad \text{and} \quad +\text{median } = 2^{1/\alpha} +$$ + +Using the corresponding expressions for the lognormal distribution leads us to +the equations + +$$ +\frac{\alpha}{\alpha - 1} = \exp(\mu + \sigma^2/2) +\quad \text{and} \quad +2^{1/\alpha} = \exp(\mu) +$$ + +which we solve for $\mu$ and $\sigma$ given $\alpha = 1.05$. + +Here is code that generates the two samples, produces the violin plot and +prints the mean and standard deviation of the two samples. + +```{code-cell} ipython3 +num_firms = 100_000 +num_years = 10 +tax_rate = 0.15 +r = 0.05 + +β = 1 / (1 + r) # discount factor + +x_bar = 1.0 +α = 1.05 + +def pareto_rvs(n): + "Uses a standard method to generate Pareto draws." + u = np.random.uniform(size=n) + y = x_bar / (u**(1/α)) + return y +``` + +Let's compute the lognormal parameters: + +```{code-cell} ipython3 +μ = np.log(2) / α +σ_sq = 2 * (np.log(α/(α - 1)) - np.log(2)/α) +σ = np.sqrt(σ_sq) +``` + +Here's a function to compute a single estimate of tax revenue for a particular +choice of distribution `dist`. + +```{code-cell} ipython3 +def tax_rev(dist): + tax_raised = 0 + for t in range(num_years): + if dist == 'pareto': + π = pareto_rvs(num_firms) + else: + π = np.exp(μ + σ * np.random.randn(num_firms)) + tax_raised += β**t * np.sum(π * tax_rate) + return tax_raised +``` + +Now let's generate the violin plot. + +```{code-cell} ipython3 +num_reps = 100 +np.random.seed(1234) + +tax_rev_lognorm = np.empty(num_reps) +tax_rev_pareto = np.empty(num_reps) + +for i in range(num_reps): + tax_rev_pareto[i] = tax_rev('pareto') + tax_rev_lognorm[i] = tax_rev('lognorm') + +fig, ax = plt.subplots() + +data = tax_rev_pareto, tax_rev_lognorm + +ax.violinplot(data) + +plt.show() +``` + +Finally, let's print the means and standard deviations. + +```{code-cell} ipython3 +tax_rev_pareto.mean(), tax_rev_pareto.std() +``` + +```{code-cell} ipython3 +tax_rev_lognorm.mean(), tax_rev_lognorm.std() +``` + +Looking at the output of the code, our main conclusion is that the Pareto +assumption leads to a lower mean and greater dispersion. + +```{solution-end} +```