This notebook is available at arjunsavel/cortecs

Quickstart: fitting with PCA#

Let’s get started fitting opacities with cortecs. There are currently three compression algorithms that can be used: principal components analysis (PCA), polynomials, and neural networks. Each approach has their own strengths and weaknesses for a given opacity dataset. For the dataset we’ll be using in these tutorials, the below table illustrates some relevant quantities. Compression and decompression times are calculated across the entire wavelength range). Note that the accuracy of the different methods heavily depends on the dataset.

Method

Compression factor

Median absolute deviation

Compression time (s)

Decompression time (s)

PCA

2.2

0.13

0.17

0.033

Polynomials

24

5.6

8.0

0.47

Neural network

4.4

0.09

9.5\(\times 10^4\)

0.99

We support compressing the below opacities:

Opacity Format

cortecs Loader Key

PLATON

platon

HELIOS

helios

CHIMERA

chimera

Exo-Transmit

exotransmit

For our first tutorial, we’ll perform compression with PCA. Compressing with PCA essentially allows us to swap an NTEMP x NPRESSURE x NWAV array for an NTEMP x (NCOMPONENT + 1) x NWAV array, achieving approximately NPRESSURE/(NCOMPONENT + 1) compression. To get started compressing opacities, we’ll need to read in an opacity file. For the purposes of this tutorial, we’ll just use a small sample file.

First, let’s import our packages.

Setting up the objects#

[1]:
import numpy as np
import sys
import os

sys.path.insert(0, os.path.abspath("../../src"))

import cortecs
from cortecs.opac.opac import *
from cortecs.fit.fit import *
from cortecs.fit.fit_pca import *
from cortecs.eval.eval import *
from cortecs.fit.metrics import *
/home/docs/checkouts/readthedocs.org/user_builds/cortecs/checkouts/latest/src/cortecs/opac/io.py:11: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from tqdm.autonotebook import tqdm
2024-03-13 00:04:42.793706: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-13 00:04:42.836002: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-13 00:04:42.836043: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-13 00:04:42.837320: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-13 00:04:42.844070: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-13 00:04:42.845243: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-13 00:04:43.763074: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Next, let’s define our filenames. The loader for PLATON-style opacity files is the only one that requires separate temperature, pressure, and wavelength files; no other file type requires the load_kwargs argument.

[2]:
T_filename = "temperatures.npy"
P_filename = "pressures.npy"
wl_filename = "wavelengths.npy"

cross_sec_filename = "absorb_coeffs_C2H4.npy"

load_kwargs = {
    "T_filename": P_filename,
    "P_filename": T_filename,
    "wl_filename": wl_filename,
}

With our file names defined, we can instantiate an Opac object as below.

[3]:
opac_obj = Opac(cross_sec_filename, loader="platon", load_kwargs=load_kwargs)

opac_obj
[3]:
<cortecs.opac.opac.Opac at 0x7fd76d120690>

Now we have an Opac object instantiated. The object stores information on the opacity file’s fields. Let’s investigate a few.

[4]:
opac_obj.wl  # these wavelengths are in meters
[4]:
array([3.00000000e-07, 3.00299510e-07, 3.00599320e-07, ...,
       2.99401875e-05, 2.99700788e-05, 3.00000000e-05])
[5]:
opac_obj.T  # these temperatures are in Kelvin
[5]:
array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
       1.e+04, 1.e+05, 1.e+06, 1.e+07, 1.e+08])
[6]:
opac_obj.P  # these pressures are in pascals
[6]:
array([ 100.,  200.,  300.,  400.,  500.,  600.,  700.,  800.,  900.,
       1000., 1100., 1200., 1300., 1400., 1500., 1600., 1700., 1800.,
       1900., 2000., 2100., 2200., 2300., 2400., 2500., 2600., 2700.,
       2800., 2900., 3000.])
[7]:
opac_obj.cross_section.shape
[7]:
(30, 13, 4616)
[8]:
opac_obj.cross_section = np.moveaxis(opac_obj.cross_section, 0, 1)

The wavelengths, temperatures, and pressures on which these opacities were evaluated are attributes of the Opac object.

Fitting the opacity#

Now, let’s do something interesting with the Opac object: compress it using the principal components algorithm (PCA). This process works by first finding the vectors that best describe the shape of the temperature–pressure dependence of the opacity function at a single wavelength. Then, the code fits these eigenvectors to the temperature–pressure dependence of every other wavelength.

To do so, we’ll instantiate the Fitter object. We pass nc=5 to the Fitter object to tell it to use 5 eigenvectors to fit the opacity. This is a hyperparameter that you can tune to your liking. Increasing nc can provide a better fit, but doing so will take up more RAM. Decreasing nc may yield a worse opacity fit, but it will consume less memory.

We’ll also pass wav_ind=-2 to tell the Fitter object to fit the opacity at the first wavelength. This is the wavelength at which the eigenvectors are calculated.

[9]:
fitter = Fitter(opac_obj, wav_ind=-2, nc=5)
fitter
[9]:
<cortecs.fit.fit.Fitter at 0x7fd76cef8f90>

Again, the temperature–pressure dependence at all wavelengths is fitted as a linear combination of the eigenvectors calculated at wav_ind. The success of this approach can then of course depend on whichwav_ind is chosen by the user. See the optimizing fits notebook for more information on how to choose the best wav_ind.

With our Fitter object set up, we can fit this serially.

[10]:
fitter.fit()
100%|██████████| 4616/4616 [00:00<00:00, 18215.60it/s]

This process should be pretty quick — less than a second, hopefully.

One thing to keep in mind is that the opacity values at the fitted wavelength shouldn’t be constant unless the opacity values at every wavelength is constant. The first step of the fitting function is to find the PCA components that describe the temperature–pressure dependence of the wavelength at wav_ind. Afterward, it fits these components to the opacity values at each wavelength. Therefore, the temperature–pressure dependence at wav_ind should be at least somewhat representative of the temperature–pressure dependence everywhere.

What do we do next? Now that we’ve fit the opacity, we can evaluate it. Hopefully things line up well enough.

We define our last object: an Evaluator. This step should look pretty similar as before.

[11]:
evaluator = Evaluator(opac_obj, fitter)

Saving fits to disk#

Now that we’ve performed our fits, we can save them to disk for later evaluation. We’ll use the fitter object and provide the savename argument for the base of the file names.

[12]:
fitter.save("test_pca")
[13]:
ls
/home/docs/.asdf/installs/python/3.11.6/lib/python3.11/pty.py:89: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  pid, fd = os.forkpty()
Quickstart.ipynb        fitting_polynomial.ipynb  quickstart.rst
absorb_coeffs_C2H4.npy  installation.rst          temperatures.npy
api.rst                 integrations.rst          test_pca.npz
faq.rst                 optimizing.ipynb          versioning.rst
fitting_nn.ipynb        pressures.npy             wavelengths.npy

The two files test_pca_coeffs.npy and test_pca_vectors.npy have now been created. Great!

Loading fits from disk#

Loading the PCA fits is as easy as:

[14]:
pca_results = np.load("test_pca.npz")
vectors = pca_results["vectors"]
pca_coeffs = pca_results["pca_coeffs"]

Evaluating the fits#

If you’re doing the fitting and evaluating on the same machine, it’s convenient to use the Eval object defined above to “decompress” the compressed opacities.

However, a more common workflow involves evaluating the opacities on a different machine (e.g., a supercomputing cluster) or a different Python instance. To “decompress” our opacity, we can use the eval_pca method. It takes in the desired temperature, pressure, and wavelength at which to evaluate the opacity, the temperature, pressure, and wavelength arrays of the original data, and the results of our earlier fitting step.

[15]:
temperature = 500  # Kelvin — same as the original data
pressure = 1  # Pascals — same as the original data
wavelength = 1e-5  # meters — same as the original data

fitting_results = [vectors, pca_coeffs]
cortecs.eval.eval_pca.eval_pca(
    temperature,
    pressure,
    wavelength,
    opac_obj.T,
    opac_obj.P,
    opac_obj.wl,
    fitting_results,
)
[15]:
Array(-29.517664, dtype=float32)
[16]:
temperature = 1  # Kelvin — same as the original data
pressure = 500  # Pascals — same as the original data
wavelength = 1e-5  # meters — same as the original data

fitting_results = [vectors, pca_coeffs]
cortecs.eval.eval_pca.eval_pca(
    temperature,
    pressure,
    wavelength,
    opac_obj.T,
    opac_obj.P,
    opac_obj.wl,
    fitting_results,
)
[16]:
Array(-29.869802, dtype=float32)

Oftentimes, radiative transfer calculations can be parallelized over wavelength. We provide an accelerator-friendly JAX implementation of this at a single wavelength.

[17]:
pca_coeffs_single_wavelength = pca_coeffs[0]
temperature_ind = 1
pressure_ind = 2

eval_pca_ind_wav(temperature_ind, pressure_ind, vectors, pca_coeffs_single_wavelength)
[17]:
Array(-104., dtype=float32)

Checking accuracy and speed#

Saving memory isn’t all that useful if we’re slow and inaccurate. Let’s check whether that’s the case.

First of all, time. Let’s use a lower-level routine that’s a bit more apples-to-apples comparison with array-indexing.

[18]:
temperature_ind = np.where(np.isclose(opac_obj.T, temperature))[0][0]
pressure_ind = np.where(np.isclose(opac_obj.P, pressure))[0][0]
wavelength_ind = np.where(np.isclose(opac_obj.wl, wavelength))[0][0]
pca_vectors, pca_coeffs_all_wl = evaluator.fitter_results
pca_coeffs = pca_coeffs_all_wl[wavelength_ind, :, :]
[19]:
%%timeit
eval_pca_ind_wav(temperature_ind, pressure_ind, pca_vectors, pca_coeffs)
12.4 µs ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Let’s compare this to how long it takes to access an array.

[20]:
%%timeit
opac_obj.cross_section[temperature_ind][pressure_ind][wavelength_ind]
404 ns ± 2.55 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Accessing the array directly is clearly faster. As long as this step does not bottleneck your code, however, you should be fine.

Now, let’s check the accuracy. We’ll compare the log abundances, because this is the quantity to which we are sensitive in exoplanet spectroscopy.

[21]:
AMU = 1.6605390666e-24  # atomic mass unit in cgs. From astropy!
[22]:
array_res = opac_obj.cross_section[temperature_ind][pressure_ind][wavelength_ind]
[23]:
eval_res = np.log10(
    evaluator.eval(temperature, pressure, wavelength)
    * evaluator.load_obj.species_weight
    * AMU
    * 1e-4
)
[24]:
percent_error = 100 * (eval_res - array_res) / array_res
percent_error
[24]:
-0.05525551168889813
[25]:
percent_error = 100 * (eval_res - array_res) / array_res
percent_error
[25]:
-0.05525551168889813

We’re reaching a 0.06% error in the log abundance. That might be good enough for your applications, or it might not. We recommend tuning your algorithm to the level of agreement necessary for, e.g., generating a transmission spectrum, or running an emission spectrum retrieval.

How do we do over the entire opacity function, as opposed to just a single point? Let’s use the metrics module to find out.

We’ll undersample the pressure and temperature axes by a factor of 2 using tp_undersample_factor=2. This is purely for computational reasons to keep the notebook a bit shorter.

[26]:
vals, orig_vals, abs_diffs, percent_diffs = calc_metrics(
    fitter, tp_undersample_factor=2, plot=True
);
100%|██████████| 7/7 [00:12<00:00,  1.77s/it]
../../_images/pages_Quickstart_47_1.png
../../_images/pages_Quickstart_47_2.png
[27]:
vals, orig_vals, abs_diffs, percent_diffs = calc_metrics(
    fitter, tp_undersample_factor=2, plot=True
);
100%|██████████| 7/7 [00:12<00:00,  1.76s/it]
../../_images/pages_Quickstart_48_1.png
../../_images/pages_Quickstart_48_2.png
[28]:
np.median(abs_diffs)
[28]:
-0.004931666227747655
[29]:
opac_obj.cross_section.nbytes / (vectors.nbytes + pca_coeffs.nbytes)
[29]:
6977.674418604651
[30]:
opac_obj.cross_section.nbytes / (vectors.nbytes + pca_coeffs_all_wl.nbytes)
[30]:
4.99750158233119

Note that there’s a distribution of accuracy. Ensure that this level of accuracy is adequate for your use case (e.g., run a test retrieval with and without cortecs) before applying ``cortecs`` to all your work!

[ ]: