Commit 83ac83b

Author: Collin Capano (committed)
update inference notebook 5
1 parent 3e796b8 · commit 83ac83b

1 file changed: tutorial/inference_5_results_io/IntroToPyCBCInference.ipynb

Lines changed: 87 additions & 54 deletions
@@ -33,7 +33,7 @@
 "\n",
 "# This is needed to access the executables on sciserver. On a personal machine this should be ignored.\n",
 "path = %env PATH\n",
-"%env PATH=$path:/home/idies/miniconda3/envs/py27/bin "
+"%env PATH=$path:/home/idies/miniconda3/envs/py37/bin "
 ]
 },
 {
@@ -62,23 +62,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We will download a result file from a fully completed analysis, from [here](https://www.atlas.aei.uni-hannover.de/~cdcapano/projects/pycbc_inference/workshop-may2019/bbh_injection). In this analysis, we injected a binary black hole simulation into LIGO data (20 seconds after GW150914). The injection parameters are:\n",
-"```\n",
-"tc = 1126259482.420\n",
-"mass1 = 37\n",
-"mass2 = 32\n",
-"ra = 2.2\n",
-"dec = -1.25\n",
-"inclincation = 2.5\n",
-"coa_phase = 1.5\n",
-"polarization = 1.75\n",
-"distance = 100\n",
-"f_ref = 20\n",
-"f_lower = 18\n",
-"approximant = IMRPhenomPv2\n",
-"taper = start\n",
-"```\n",
-"These are similar to GW150914, although it is about a factor of 5 closer in distance. You can see the config file used and the run script in the [linked directory](https://www.atlas.aei.uni-hannover.de/~cdcapano/projects/pycbc_inference/workshop-may2019/bbh_injection)."
+"We will download a result file from a fully completed analysis, from [here](https://www.atlas.aei.uni-hannover.de/~work-cdcapano/pycbc_workshop_june_2020). This is the result of running `emcee_pt` on GW150914 using the standard prior provided in the online [PyCBC Inference documentation](http://pycbc.org/pycbc/latest/html/inference/examples/gw150914.html). You can see the complete configuration file [here](https://www.atlas.aei.uni-hannover.de/~work-cdcapano/pycbc_workshop_june_2020/inference-emcee_pt.ini).\n",
+"\n",
+"Note that there are two files in this directory. This is because the analysis was run twice with different starting seeds, to accumulate more samples. We'll download one of the files now."
 ]
 },
 {
@@ -87,8 +73,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"if not os.path.exists('bbh_results.hdf'):\n",
-" !wget https://www.atlas.aei.uni-hannover.de/~cdcapano/projects/pycbc_inference/bbh_injection/bbh_results.hdf"
+"if not os.path.exists('H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf'):\n",
+" !wget https://www.atlas.aei.uni-hannover.de/~work-cdcapano/pycbc_workshop_june_2020/H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf"
 ]
 },
 {
@@ -102,7 +88,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The results file `bbh_results.hdf` is an HDF file. We could use the standard python module for reading HDF files [h5py](http://docs.h5py.org/en/stable/) to examine it. However, there are classes in [pycbc.inference.io](http://pycbc.org/pycbc/latest/html/pycbc.inference.io.html) that inherit from [h5py.File](http://docs.h5py.org/en/stable/high/file.html#file-objects) and add several convenience functions that make it easier to read samples from the file. There is one class for each type of sampler. So to load the file, we will use [pycbc.inference.io.loadfile](http://pycbc.org/pycbc/latest/html/pycbc.inference.io.html#pycbc.inference.io.loadfile). This function automatically determines which class to use to read the file, based on what is in the file."
+"The results file `H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf` is an HDF file. We could use the standard python module for reading HDF files [h5py](http://docs.h5py.org/en/stable/) to examine it. However, there are classes in [pycbc.inference.io](http://pycbc.org/pycbc/latest/html/pycbc.inference.io.html) that inherit from [h5py.File](http://docs.h5py.org/en/stable/high/file.html#file-objects) and add several convenience functions that make it easier to read samples from the file. There is one class for each type of sampler. To load the file, we will use [pycbc.inference.io.loadfile](http://pycbc.org/pycbc/latest/html/pycbc.inference.io.html#pycbc.inference.io.loadfile). This function automatically determines which class to use to read the file, based on what is in the file."
 ]
 },
 {
@@ -120,7 +106,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp = loadfile('bbh_results.hdf', 'r')"
+"fp = loadfile('H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf', 'r')"
 ]
 },
 {
@@ -136,7 +122,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp.keys()"
+"list(fp.keys())"
 ]
 },
 {
@@ -152,7 +138,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp.attrs.items()"
+"list(fp.attrs.items())"
 ]
 },
 {
@@ -169,14 +155,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp['sampler_info'].keys()"
+"list(fp['sampler_info'].keys())"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Every group can have their own `.attrs`; let's look at the `sampler_info` group's `attrs`:"
+"We see that for `emcee_pt` (as well as any MCMC sampler), burn-in information and the autocorrelation time (ACT) are stored in the sampler info. We could read these directly from the file. For example:"
 ]
 },
 {
@@ -185,15 +171,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp['sampler_info'].attrs.items()"
+"print(fp['sampler_info/burn_in_iteration'][()])"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Injection data\n",
-"If we are analyzing an injection, as in this case, an `injections` group is added to the file. This group contains all the information about the injection(s) that was (were) performed. Let's take a look:"
+"However, since the ACT and burn-in iteration are such important information, they are promoted to attributes. We can get them by simply doing:"
 ]
 },
 {
@@ -202,23 +187,23 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp['injections'].keys()"
+"print(fp.act, fp.burn_in_iteration)"
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
+"cell_type": "markdown",
 "metadata": {},
-"outputs": [],
 "source": [
-"fp['injections'].attrs.items()"
+"Every group can have its own `.attrs`; let's look at the `sampler_info` group's `attrs`:"
 ]
 },
 {
-"cell_type": "markdown",
+"cell_type": "code",
+"execution_count": null,
 "metadata": {},
+"outputs": [],
 "source": [
-"Note that there was no data sets stored in the `injections` group (`.keys()` returned an empty list). All of the injection info was in the `.attrs`. This was because a single injection was performed. If multiple injections had been done, the parameters that were varied would be stored as datasets."
+"list(fp['sampler_info'].attrs.items())"
 ]
 },
 {
@@ -236,7 +221,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"print(fp['samples'].keys())"
+"list(fp['samples'].keys())"
 ]
 },
 {
@@ -252,7 +237,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp['samples'].attrs.items()"
+"list(fp['samples'].attrs.items())"
 ]
 },
 {
@@ -275,9 +260,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The shape of the dataset is `ntemps x nwalkers x n thinned iteration`. This run used 4 temperatures and 1000 walkers. Due to `max-samples-per-chain` being set to 1000, the data set has been thinned to include 750 samples from each walker and temperature.\n",
+"The shape of the dataset is `ntemps x nwalkers x n thinned iterations`. This run used 20 temperatures and 200 walkers. Due to `max-samples-per-chain` being set to 1024, the data set has been thinned to include 704 samples from each walker and temperature.\n",
 "\n",
-"If this had been an `emcee` run (which uses no temperatures), the samples datasets would have been two dimensional: `nwalkers x niterations`. If it had been a nested sampling run (with either CPNest or Multinest), the datasets would have been 1 dimensional. **The format of the samples data is sampler dependent.** This is why we have separate classes to read the results file."
+"If this had been an `emcee` run (which uses no temperatures), the samples datasets would have been two dimensional: `nwalkers x niterations`. If it had been a nested sampling run (e.g., with `dynesty`), the datasets would have been 1 dimensional. **The format of the samples data is sampler dependent.** This is why we have separate classes to read the results file."
 ]
 },
 {
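As a quick check of the layout described in this hunk, a dataset's shape can be inspected directly with h5py indexing. This is a minimal sketch, assuming `fp` is the handle opened with `loadfile` in the earlier cell and that `mass1` is one of the sampled parameters:

```python
# Minimal sketch: inspect the raw shape of one parameter's samples dataset.
# Assumes `fp` is the open emcee_pt result file from the loadfile cell above.
shape = fp['samples/mass1'].shape
print(shape)  # (ntemps, nwalkers, nthinned); (20, 200, 704) for this run
```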
@@ -288,7 +273,7 @@
 "\n",
 "Now let's load the samples. We can use the `read_samples` function to do this. It takes as its first argument a list of parameters to load. Here, we'll load all of the parameters. It also takes additional keyword arguments. These arguments are sampler-specific. For `emcee_pt`, if we provide no additional keyword arguments, we'll get all temperatures. We just want the coldest temperature `temp=0`, as that is the posterior.\n",
 "\n",
-"If we provide no other arguments, `read_samples` will load all of the independent samples post burn-in. That is, it will get samples from all walkers, starting from the burn in iteration, and thinned by the ACL (it gets this information from the file's `.attrs`; specifically, the `thin_start` and `thin_interval` attributes). The samples are flattened into a 1D array."
+"If we provide no other arguments, `read_samples` will load all of the independent samples post burn-in. That is, it will get samples from all walkers, starting from the burn-in iteration, and thinned by the autocorrelation time. The samples are flattened into a 1D array."
 ]
 },
 {
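A minimal sketch of the call described in this hunk, assuming `fp` is the open result file from above; the temperature keyword (`temp=0`) follows the notebook text, though the exact keyword name may differ between PyCBC versions:

```python
# Minimal sketch: read all sampled parameters from the coldest temperature.
# Assumes `fp` is the open emcee_pt result file; the `temp` keyword name
# follows the description above and may vary with the PyCBC version.
params = list(fp['samples'].keys())
samples = fp.read_samples(params, temp=0)
print(len(samples))  # number of independent post-burn-in samples
```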
@@ -313,7 +298,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"So, we have 7000 independent samples. What sort of object is `samples`?"
+"So, we have 1600 independent samples. What sort of object is `samples`?"
 ]
 },
 {
@@ -558,7 +543,7 @@
 "outputs": [],
 "source": [
 "!pycbc_inference_plot_posterior \\\n",
-" --input-file bbh_results.hdf \\\n",
+" --input-file H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf \\\n",
 " --output-file mass1_mass2.png \\\n",
 " --parameters 'primary_mass(mass1, mass2):mass1' 'secondary_mass(mass1, mass2):mass2' \\\n",
 " --plot-scatter \\\n",
@@ -613,7 +598,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!pycbc_inference_plot_posterior --input-file bbh_results.hdf --file-help"
+"!pycbc_inference_plot_posterior --input-file H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf --file-help"
 ]
 },
 {
@@ -660,7 +645,7 @@
 "outputs": [],
 "source": [
 "!pycbc_inference_extract_samples --verbose \\\n",
-" --input-file bbh_results.hdf \\\n",
+" --input-file H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf \\\n",
 " --output-file mass_posterior.hdf \\\n",
 " --parameters \\\n",
 " 'primary_mass(mass1, mass2):mass1' \\\n",
@@ -693,7 +678,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp.keys()"
+"list(fp.keys())"
 ]
 },
 {
@@ -702,7 +687,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp['samples'].keys()"
+"list(fp['samples'].keys())"
 ]
 },
 {
@@ -720,7 +705,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"fp.attrs.items()"
+"list(fp.attrs.items())"
 ]
 },
 {
@@ -758,7 +743,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"!ls -lh bbh_results.hdf mass_posterior.hdf"
+"!ls -lh H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf mass_posterior.hdf"
 ]
 },
 {
@@ -797,16 +782,64 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Challenge:\n",
-"Use the `--expected-parameters` option to put red lines at the injected values in the above plot. Read the output of `--help` to see the syntax that you need to provide."
+"### Extracting all parameters using `'*'`\n",
+"When we provided specific parameters to extract to `pycbc_inference_extract_samples`, only the parameters we specified were extracted. We can get all additional parameters by passing `'*'` to the `--parameters` option. Let's try it:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"!pycbc_inference_extract_samples --verbose \\\n",
+" --input-file H1L1-INFERENCE_EMCEE_PT_0-1126259200-400.hdf \\\n",
+" --output-file posterior.hdf \\\n",
+" --parameters \\\n",
+" 'primary_mass(mass1, mass2):mass1' \\\n",
+" 'secondary_mass(mass1, mass2):mass2' \\\n",
+" 'mchirp_from_mass1_mass2(mass1, mass2):mchirp' \\\n",
+" 'eta_from_mass1_mass2(mass1, mass2):eta' \\\n",
+" '*' \\\n",
+" --skip-groups data sampler_info --force"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's load the file and check that we have all the parameters in the samples group:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"fp = loadfile('posterior.hdf', 'r')\n",
+"print(sorted(fp['samples'].keys()))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We do! But...\n",
+"\n",
+"**Careful:** When we created this posterior file, we applied the `primary_mass` and `secondary_mass` functions to ensure that `mass1 >= mass2`. But we did not do the same for the spin parameters. This means that `spin1_*` and `spin2_*` are no longer necessarily associated with the correct objects. To ensure that `spin1_*` is associated with the larger object and `spin2_*` the smaller, we can use the [primary_spin](http://pycbc.org/pycbc/latest/html/pycbc.html#pycbc.conversions.primary_spin) and [secondary_spin](http://pycbc.org/pycbc/latest/html/pycbc.html#pycbc.conversions.secondary_spin) functions.\n",
+"\n",
+"### Challenge\n",
+"\n",
+"Recreate the `posterior.hdf` file, but this time use the `primary_spin` and `secondary_spin` functions to ensure that all `spin1_*` and `spin2_*` parameters are associated with the larger and smaller masses, respectively. Then use the resulting posterior file to plot the z-components of the spin of each object. (*Hint*: to plot the z-components, you need to multiply the magnitude of each object's spin (`spin1_a` and `spin2_a`) by the cosine of its polar angle (`spin1_polar` and `spin2_polar`). A sketch of this conversion appears after the diff.)"
 ]
 }
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3.7 (py37)",
 "language": "python",
-"name": "python3"
+"name": "py37"
 },
 "language_info": {
 "codemirror_mode": {
@@ -818,7 +851,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.2"
+"version": "3.7.4"
 }
 },
 "nbformat": 4,
