.. dataprob documentation master file, created by sphinx-quickstart on Tue Aug 20 16:28:23 2024. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. .. include:: links.rst ====================== dataprob documentation ====================== dataprob was designed to allow scientists to easily fit user-defined models to experimental data. It allows maximum likelihood, bootstrap, and Bayesian analyses with a simple and consistent interface. Design principles ================= + **ease of use:** Users write a python function that describes their model, then load in their experimental data as a dataframe. + **dataframe centric:** Uses a pandas dataframe to specify parameter bounds, guesses, fixedness, and priors. Observed data can be passed in as a dataframe or numpy vector. All outputs are pandas dataframes. + **consistent experience:** Users can run maximum-likelihood, bootstrap resampling, or Bayesian MCMC analyses with an identical interface and nearly identical diagnostic outputs. + **interpretable:** Provides diagnostic plots and runs tests to validate fit results. Simple example ============== The following code generates noisy linear data and uses dataprob to find the maximum likelihood estimate of its slope and intercept. `Run on Google Colab `_. .. code-block:: python import dataprob import numpy as np # Generate "experimental" linear data (slope = 5, intercept = 5.7) that has # random noise on each point. x_array = np.linspace(0,10,25) noise = np.random.normal(loc=0,scale=0.5,size=x_array.shape) y_obs = 5*x_array + 5.7 + noise # 1. Define a linear model def linear_model(m=1,b=1,x=[]): return m*x + b # 2. Set up the analysis. 'method' can be "ml", "mcmc", or "bootstrap" f = dataprob.setup(linear_model, method="ml", non_fit_kwargs={"x":x_array}) # 3. Fit the parameters of linear_model model to y_obs, assuming uncertainty # of 0.5 on each observed point. f.fit(y_obs=y_obs, y_std=0.5) # 4. Access results fig = dataprob.plot_summary(f) fig = dataprob.plot_corner(f) print(f.fit_df) print(f.fit_quality) The plots will be: .. image:: _static/simple-example_plot-summary.svg :align: center :alt: data.plot_summary result :width: 75% .. image:: _static/simple-example_plot-corner.svg :align: center :alt: data.plot_corner result :width: 75% The ``f.fit_df`` dataframe will look something like: +-------+-------+----------+-------+--------+---------+-------+-----------+ | index | name | estimate | std | low_95 | high_95 | ... | prior_std | +=======+=======+==========+=======+========+=========+=======+===========+ | ``m`` | ``m`` | 5.009 | 0.045 | 4.817 | 5.202 | ... | ``NaN`` | +-------+-------+----------+-------+--------+---------+-------+-----------+ | ``b`` | ``b`` | 5.644 | 0.274 | 4.465 | 6.822 | ... | ``NaN`` | +-------+-------+----------+-------+--------+---------+-------+-----------+ The ``f.fit_quality`` dataframe will look something like: +---------------+---------------------------------------------+---------+---------+ | name | description | is_good | value | +===============+=============================================+=========+=========+ | success | fit success status | True | True | +---------------+---------------------------------------------+---------+---------+ | num_obs | number of observations | True | 25.000 | +---------------+---------------------------------------------+---------+---------+ | num_param | number of fit parameters | True | 2.000 | +---------------+---------------------------------------------+---------+---------+ | lnL | log likelihood | True | -18.761 | +---------------+---------------------------------------------+---------+---------+ | chi2 | chi^2 goodness-of-fit | True | 0.241 | +---------------+---------------------------------------------+---------+---------+ | reduced_chi2 | reduced chi^2 | True | 1.192 | +---------------+---------------------------------------------+---------+---------+ | mean0_resid | t-test for residual mean != 0 | True | 1.000 | +---------------+---------------------------------------------+---------+---------+ | durbin-watson | Durbin-Watson test for correlated residuals | True | 2.265 | +---------------+---------------------------------------------+---------+---------+ | ljung-box | Ljung-Box test for correlated residuals | True | 0.943 | +---------------+---------------------------------------------+---------+---------+ Installation ============ We recommend installing dataprob with pip: .. code-block:: shell pip install dataprob To install from source and run tests: .. code-block:: shell git clone https://github.com/harmslab/dataprob.git cd dataprob pip install . # to run test-suite pytest --runslow Examples ======== A good way to learn how to use the library is by working through examples. The following notebooks are included in the `dataprob/examples/` directory. They are self-contained demonstrations in which dataprob is used to analyze various classes of experimental data. The links below launch each notebook in Google Colab: + `api-example.ipynb `_: shows various features of the API when analyzing a linear model + `linear.ipynb `_: fit a linear model to noisy data (2 parameter, linear) + `binding.ipynb `_: a single-site binding interaction (2 parameter, sigmoidal curve) + `michaelis-menten.ipynb `_: Michaelis-Menten model of enzyme kinetics (2 parameter, sigmoidal curve) + `lagged-exponential.ipynb `_: bacterial growth curve with initial lag phase (3 parameter, exponential) + `multi-gaussian.ipynb `_: two overlapping normal distributions (6 parameter, Gaussian) + `periodic.ipynb `_: periodic data (3 parameter, sine) + `polynomial.ipynb `_: nonlinear data with no obvious form (5 parameter, polynomial) + `linear-extrapolation-folding.ipynb `_: protein equilibrium unfolding data (6 parameter, linear embedded in sigmoidal) I. Set up an analysis ===================== ---------------- 1. Write a model ---------------- The first step is to define a function that we think can reproduce our observations given some set of parameters. dataprob will find the values of the `float `_ parameters passed to the function that reproduce our observations. Such a function must meet two criteria: + The function must take parameters to fit. Parameters can be passed in as float keyword arguments to the function or as a numpy array of floats. + The function must return a numpy array the same length as the numpy array of observations. For example, the function ``good_model`` below takes the arguments ``m``, ``b``, and ``x``. If we pass in an array with 10 values for ``x``, the function returns an output array with 10 values. We can thus use it to reproduce the 10 observations in ``y_obs``. .. code-block:: python # define a model def good_model(m,b,x): return m*x + b # Array of x from 0->9 x_input = np.arange(10) x_input.shape # --> (10,) # y_calc has a line with a slope of 1 and an intercept of 2 calculated at # x from 0 to 9 y_calc = good_model(m=1,b=2,x=x_input) y_calc.shape # --> (10,) # y_obs is a line with a slope of 5 and an intercept of 2 observed at x # from 0 to 9 y_obs = x_input*5 + 2 y_obs.shape # --> (10,) ------------------- 2. Set up the model ------------------- We set up the analysis by calling ``dataprob.setup`` on our function to analyze. This returns a ``Fitter`` object. (We will call this object ``f`` in this and all following examples.) ``dataprob.setup`` guesses which parameters should be fittable. By default, it treats the first ``n`` arguments of the input function whose default is a ``float`` or ``None`` as fittable parameters. All remaining arguments are treated as non-fittable. For example, in the following code block ``dataprob.setup`` identifies ``a`` and ``b`` as fit parameters, but not ``square``. .. code-block:: python def some_fcn(a,b=2,square=False): if square is True: return a**2 + b else: return a + b f = dataprob.setup(some_fcn) print(f.param_df["name"]) # -> ["a","b"] One can change this behavior using the ``fit_parameters``, ``non_fit_kwargs`` and ``vector_first_arg`` arguments to ``dataprob.setup``. A few patterns demonstrate how this works. .. code-block:: python # A function we wish to analyze def some_fcn(a=1,b=2,c=3): return a*b*c # a, b, and c are fit parameters with guesses of 1, 2, and 3, taken from # their argument defaults. f = dataprob.setup(some_fcn) # b and c are fit parameters with guesses of 2 and 3; a is a fixed # non-fittable parameter with value 1 f = dataprob.setup(some_fcn, fit_parameters=["b","c"]) # a and c are fit parameters with guesses of 1 and 3; b is a non-fittable # parameter with a value of np.arange(20) f = dataprob.setup(some_fcn, non_fit_kwargs={"b":np.arange(20)}) # a and c are fit parameters with guesses of 14 and 7; b in a non-fittable # parameter with a value of np.arange(20) f = dataprob.setup(some_fcn, fit_parameters={"a":14,"c":7}, non_fit_kwargs={"b":np.arange(20)}) The ``vector_first_arg`` argument enables one to use a function where the first argument is an array of parameters. This allows for computationally efficient models that deal exclusively in numpy arrays. Note that a ``fit_parameters`` argument is required if ``vector_first_arg == True``. Here is an example: .. code-block:: python def some_fcn(a,b=2,c=3): return a*b*c # Treat "a" as a vector rather than as a single parameter. This vector is # built from parameters w, x, y, and z. (Assigned default guesses of 0). b # and c are fixed non-fittable parameters with values of 2 and 3. f = dataprob.setup(some_fcn, vector_first_arg=True, fit_parameters=["w","x","y","z"]) # When running the analysis, dataprob will do the following under the hood. y_calc = some_fcn(a=np.array([w,x,y,z]),b=2,c=3) One can even more finely control the assignment of fittable parameters: see the `Advanced Model Definitions`_ section for details. ------------------------------- 3. Customize the fit parameters ------------------------------- Once we have defined our model and assigned which parameters are fittable, we can control seven attributes of each fittable parameter. These are stored in ``f.param_df`` dataframe. Each row is a parameter; each column is an attribute. .. code-block:: python def some_fcn(a,b=2,c=3): return a*b*c f = dataprob.setup(some_fcn) f.param_df +-------+-------+-----------+-------------+-------------+------------+-----------+ | name | guess | fixed | lower_bound | upper_bound | prior_mean | prior_std | +=======+=======+===========+=============+=============+============+===========+ | ``a`` | 0.0 | ``False`` | ``-inf`` | ``inf`` | ``NaN`` | ``NaN`` | +-------+-------+-----------+-------------+-------------+------------+-----------+ | ``b`` | 2.0 | ``False`` | ``-inf`` | ``inf`` | ``NaN`` | ``NaN`` | +-------+-------+-----------+-------------+-------------+------------+-----------+ | ``c`` | 3.0 | ``False`` | ``-inf`` | ``inf`` | ``NaN`` | ``NaN`` | +-------+-------+-----------+-------------+-------------+------------+-----------+ The ``f.param_df`` dataframe can be accessed and edited using standard `pandas DataFrame `_ commands. The ``name`` column is set when the dataframe is initialized and cannot be changed. The ``name`` column is used as the index for the dataframe, allowing commands like the following, which sets the ``guess`` of parameter ``a`` to 10.0: .. code-block:: python # set the guess of parameter a to 10.0. f.param_df.loc["a","guess"] = 10.0 One can also edit the dataframe en masse and load in directly: .. code-block:: python df = f.param_df.copy() # do lots of edits to dataframe # ... # ... # then: f.param_df = df One can even load ``param_df`` directly from a spreadsheet. .. code-block:: python f.param_df.to_excel("my-parameters.xlsx") # edit my-parameters.xslx in excel. f.param_df = "my-parameters.xslx" The full rules for the parameter dataframe are: +-----------------+---------------------------------------------------------+ | key | value | +=================+=========================================================+ | ``name`` | string name of the parameter. should not be changed | | | by the user once the fitter is initialized. | +-----------------+---------------------------------------------------------+ | ``guess`` | initial guess as single float value (must be non-nan | | | and within bounds if specified) | +-----------------+---------------------------------------------------------+ | ``fixed`` | whether or not parameter can vary during the analysis; | | | ``True`` or ``False``. if ``False`` use the value in | | | ``guess`` for this argument when calling the function. | +-----------------+---------------------------------------------------------+ | ``lower_bound`` | parameter constrained to be ``>= lower_bound``. single | | | float value; ``-np.inf`` allowed; ``None``, ``np.nan``, | | | or ``pd.NA`` interpreted as ``-np.inf``. | +-----------------+---------------------------------------------------------+ | ``upper_bound`` | parameter constrained to be ``<= upper_bound``. single | | | float value; ``np.inf`` allowed; ``None``, ``np.nan``, | | | or ``pd.NA`` interpreted as ``np.inf``. | +-----------------+---------------------------------------------------------+ | ``prior_mean`` | single float value; ``np.nan`` allowed (see note) | +-----------------+---------------------------------------------------------+ | ``prior_std`` | single float value; ``np.nan`` allowed (see note) | +-----------------+---------------------------------------------------------+ .. note:: An MCMC analysis requires prior distributions for each parameter. (See the `Bayesian MCMC`_ section for details). Priors are specified using the ``prior_mean`` and ``prior_std`` columns. Together, these define a Gaussian prior with a mean of ``prior_mean`` and a standard deviation of ``prior_std``. Because they specify a Gaussian distribution, neither value can be ``np.inf`` and ``prior_std`` must be greater than zero. If both a Gaussian prior and bounds are defined, the Gaussian distribution is trimmed and re-normalized so its probability density function sums to one between the bounds. To use uniform priors between the bounds, set both ``prior_mean`` and ``prior_std`` to ``np.nan``. ----------------------------- 4. Set non-fittable arguments ----------------------------- One can specify arguments to the function that should not be used as fit parameters using the ``non_fit_kwargs`` dictionary. This can be passed as an argument to ``dataprob.setup``. All keys must correspond to function arguments, except in the `special case `_ where a function takes ``**kwargs``. The ``non_fit_kwargs`` dictionary can be accessed and edited after initialization; it is exposed as an attribute to ``f``. In the following example, we initially set the value of ``square`` to be ``True``, then update that to be ``False``. .. code-block:: python def some_fcn(a,b=2,square=False): if square is True: return (a*b)**2 else: return (a*b) # Initially set square to True f = dataprob.setup(some_fcn, non_fit_kwargs={"square":True}) # Oops, changed our mind. Set back to False f.non_fit_kwargs["square"] = False II. Run an Analysis =================== ---------------------------- 1. Select an analysis method ---------------------------- dataprob has three analysis methods to estimate parameter values. These are selected via the ``method`` argument to ``dataprob.setup``: :code:`f = setup(some_fcn,method="ml")`. If no method is specified, the ``ml`` method will be used. The available methods are: + **ml** (default). Do a maximum likelihood (i.e. least-squares) fit, regressing model parameters against observed data. It finds the parameters that minimize the weighted residual function. `Details `_. + **bootstrap**: Estimate parameter distributions consistent with observed data by sampling observation uncertainty, then finding maximum likelihood parameter estimates for each pseudo-replicate dataset. `Details `_. + **mcmc**: Use Markov-Chain Monte Carlo sampling to estimate the posterior distributions of model parameters. By default, the analysis uses uniform priors; however, the user can also specify Gaussian priors. `Details `_. -------------------- 2. Load observations -------------------- All analyses require the user specify a vector of observations (``y_obs``) and a vector of standard deviations on the value of each observation (``y_std``). The software will estimate model parameters that reproduce the observations in ``y_obs``, weighted by the confidence in each observation encoded by ``y_std``. ``y_obs`` and ``y_std`` must be non-nan float values. Further, all values in ``y_std`` must be larger than zero. The analysis makes several assumptions about ``y_obs`` and ``y_std``. + The error of the independent variable is negligible. For example, if you were to measure the progress of a chemical reaction over time, this software assumes no error in your time measurement. + The errors in each observation are independent. + The uncertainty for each observation value is normally distributed, with a standard deviation of ``y_std``. .. note:: dataprob assumes the ``y_std`` are `population standard deviations `_, not sample standard deviations. We therefore recommend that users calculate ``y_std`` as: .. math:: \sigma = \sqrt{ \frac{1}{N-1} \sum_{i=0}^{i < N-1} \left ( x_{i} - \mu \right )^2} where :math:`N` is the number of replicates for a given observation and :math:`\mu` is the mean of the replicates. .. note:: A typical way to estimate ``y_std`` is via technical replicates on each observed point. Sometimes this is not possible. In this case, one can declare a global ``y_std`` for all points based on an overall estimate of observation precision. For example, one might take the standard deviation of points from a flat experimental baseline and use that as the value for ``y_std`` on all points. One could even "make up" a value that "seems plausible" given the instrument setup and collected data. If one uses the same ``y_std`` for all points, the chosen value will not alter the parameter estimates, but it will directly impact the final parameter estimate uncertainty. We thus recommend being conservative and assuming your error is on the large side of plausible. **If you underestimate y_std, you overestimate parameter precision!!!** The `reduced chi2 output `_ output from ``f.fit_quality`` can help you decide if you are grossly under- or over-estimating your observation uncertainty. ``y_obs`` and ``y_std`` can be passed to the program in two different ways. The first is via ``f.fit``: .. code-block:: python def some_fcn(a,b=2,c=3): return a*b*c f = dataprob.setup(some_fcn) f.fit(y_obs=y_obs, y_std=y_std) The ``f.fit`` call also allows the user to specify a single, global ``y_std`` to use for all parameters: .. code-block:: python def some_fcn(a,b=2,c=3): return a*b*c f = dataprob.setup(some_fcn) # apply uncertainty of 0.1 to all observations f.fit(y_obs=y_obs, y_std=0.1) In addition to using ``f.fit``, one can set ``y_obs`` and ``y_obs`` from a dataframe via the ``data_df`` attribute. .. code-block:: python import pandas as pd df = pd.DataFrame({"y_obs":y_obs, "y_std":y_std}) f = dataprob.setup(some_fcn) f.data_df = df f.fit() A dataframe passed to ``data_df`` must have a ``y_obs`` and a ``y_std`` column. All other columns are ignored. The input dataframe must be either a pandas ``DataFrame`` or a string pointing to a spreadsheet that can be read by pandas. ------------------- 3. Run the analysis ------------------- As described above, we run the analysis by calling ``f.fit()``. Each method has different options that can be passed to the ``fit`` method. A description of these options can be accessed via ``help``. .. code-block:: python # A function we wish to analyze def some_fcn(a=1,b=2,c=3): return a*b*c f = dataprob.setup(some_fcn, method="mcmc") help(f.fit) .. code-block:: Help on method fit in module dataprob.fitters.bayesian.bayesian_sampler: fit(y_obs=None, y_std=None, num_walkers=100, use_ml_guess=True,num_steps=100, burn_in=0.1, num_threads=1, **emcee_kwargs) method of dataprob.fitters.bayesian.bayesian_sampler.BayesianSampler instance Perform Bayesian MCMC sampling of parameter values. Parameters ---------- y_obs : numpy.ndarray observations in a numpy array of floats that matches the shape of the output of some_function set when initializing the fitter. nan values are not allowed. y_obs must either be specified here or in the data_df dataframe. y_std : numpy.ndarray standard deviation of each observation. nan values are not allowed. y_std must either be specified here or in the data_df dataframe. num_walkers : int, default=100 number of markov chains to use in the analysis use_ml_guess : bool, default=True if true, do a maximum likelihood maximization then sample from the fit parameter covariance matrix to get the initial chain positions num_steps: int, default=100 number of steps to run each markov chain burn_in : float, default = 0.1 fraction of samples to discard from the start of the run num_threads : int number of threads to use. if `0`, use the total number of cpus. [NOT YET IMPLEMENTED] **emcee_kwargs : all remaining keyword arguments are passed to the initialization function of emcee.EnsembleSampler III. Results ============ dataprob generates outputs that allow the users to view and assess their fit results. ------------------------- Parameter values (fit_df) ------------------------- One accesses the parameter estimates via the ``f.fit_df`` pandas dataframe. The ``fit_df`` output from the `Simple example`_ above is: +-------+-------+----------+-------+--------+---------+-------+-----------+ | index | name | estimate | std | low_95 | high_95 | ... | prior_std | +=======+=======+==========+=======+========+=========+=======+===========+ | ``m`` | ``m`` | 5.009 | 0.045 | 4.817 | 5.202 | ... | ``NaN`` | +-------+-------+----------+-------+--------+---------+-------+-----------+ | ``b`` | ``b`` | 5.644 | 0.274 | 4.465 | 6.822 | ... | ``NaN`` | +-------+-------+----------+-------+--------+---------+-------+-----------+ All three methods (ml, bootstrap, and mcmc) will have identical columns. These columns report similar, but not identical, features of each parameter for each method. **ml** + ``estimate``: maximum-likelihood parameter estimate + ``std``: standard deviation of a parameter uncertainty distribution calculated assuming normally distributed error centered on the maximum-likelihood estimate. Distribution width is determined by the local curvature of the likelihood surface. (See the `Maximum likelihood`_ section for details.) + ``low_95`` and ``high_95``: 95% of the distribution described above falls between low_95 and high_95. This defines the 95% `confidence interval `_. **bootstrap** + ``estimate``: the mode of the parameter values seen in the bootstrap pseudo-replicates. + ``std``: standard deviation of this parameter over the pseudo-replicate fits. + ``low_95`` and ``high_95``: 95% of pseudo-replicate fit distribution falls between low_95 and high_95. This is determined numerically, making no assumption about the shape of the distribution. This defines the 95% `confidence interval `_. **mcmc** + ``estimate``: the mode of the parameter values seen in the posterior distribution. + ``std``: standard deviation of this parameter over the posterior distribution. + ``low_95`` and ``high_95``: 95% of posterior distribution falls between low_95 and high_95. This is determined numerically, making no assumption about the shape of the distribution. This defines the 95% `credible interval `_. In addition to the analysis output, the ``fit_df`` column holds the parameter constraints that went into the analysis (``guess``, ``fixed``, ``upper_bound``, ``lower_bound``, ``prior_mean``, and ``prior_std``). One can save this dataframe out to a spreadsheet (``fit_df.to_excel("fit-results.xlsx")``) to preserve both the parameter fit results and parameter inputs in a single file. ----------------------- Model outputs (data_df) ----------------------- After a fit is run, the ``f.data_df`` dataframe--which started with only ``y_obs`` and ``y_calc``--now has three new columns: + ``y_calc``: the model calculated using the parameters in ``f.fit_df["estimate"]``. + ``unweighted_residuals``: the difference between the model and the data (``y_calc - y_obs``) + ``weighted_residuals``: the difference between the model and the data weighted by observation uncertainty (``(y_calc - y_obs)/y_std``). One might call something like the following to make a plot of experimental points in ``y_obs`` with a line showing ``y_calc``. .. code:: python from matplotlib import pyplot as plt plt.plot(f.data_df["y_obs"],"o") plt.plot(f.data_df["y_calc"],"-") ------------ Summary plot ------------ The summary plot allows the user to assess how well the model reproduces the observations with four combined plots. .. code:: python # assuming f is a Fitter object for which f.fit() has been run fig = dataprob.plot_summary(f) fig.savefig("summary-plot.pdf") Here is an example output for a six-parameter model of protein folding (see this `notebook `_ for details). .. image:: _static/sample-comparisons/ml.svg :align: center :alt: example summary plot for a maximum likelihood fit. :width: 75% The **central plot** shows ``y_obs`` (y-axis) in the order they came in via the input array (x-axis). Each point is an observation with its ``y_std`` shown as error bars. The red line shows the output of the model calculated using the parameters in ``f.fit_df["estimate"]``. The cloud of gray lines shows the output of the model calculated with parameters drawn from ``f.samples``. The **lower plot** shows the model residuals as a function of observation number, while the **right plot** shows the model residuals as a function of ``y_obs``. The residual is calculated as ``(y_calc - y_obs)/y_std``, where ``y_calc`` is calculated using the parameters in ``f.fit_df["estimate"]``. The error bars are ``y_std``. The cloud of gray points are residuals when the model is calculated using parameters drawn from ``f.samples``. The red line shows the residual mean. The **bottom right plot** shows a histogram of the residual values, using an x-axis shared by the plot above. For help interpreting these outputs, see the `Assessing fit quality`_ section. One can control the style of the plot with the arguments passed to ``dataprob.plot_summary``. Key arguments include: + ``x_axis`` is a list of values to use for the x-axis. The length of this list must match the length of ``y_obs``. If this is not specified, ``y_obs`` are simply plotted sequentially. + ``plot_unweighted`` allows the user to select whether to plot weighted or unweighted residuals. + ``x_label`` and ``y_label`` allow the user to label the axes. + The ``*_style`` arguments alow the user to control how the plot components are drawn. These should be dictionaries holding arguments to be passed to `matplotlib plot `_ calls. For example, one could use :code:`y_obs_style={"marker":"+"}` to change the marker style for the data points to ``+`` symbols. The dictionaries available are + ``y_obs_style`` for observations, passed to `matplotlib scatter `_. + ``y_std_style`` for observation standard deviations, passed to `matplotlib errorbar `_. + ``y_calc_style`` for model lines in main plot and residuals, passed to `matplotlib plot `_. + ``sample_line_style`` for sample lines in main plot, passed to `matplotlib plot `_. + ``sample_point_style`` for sample points in residuals plots, passed to `matplotlib scatter `_. + ``hist_bar_style``, for bars in residual histogram, passed to `matplotlib fill `_. --------------------- Assessing fit quality --------------------- A model that describes the data well will have three features. These are: 1. The residuals will be normally distributed. 2. The residuals will have a mean of zero. 3. There will be no systematic deviation between the model and data (implying the residuals will be uncorrelated). The following graphic compares a good model (left) to a bad model (right), highlighting the problems in purple. .. image:: _static/good-bad-fit.svg :align: center :alt: comparison of plot summary with a good and bad model :width: 100% Aside from allowing visual assessment, dataprob calculates quality metrics for each fit. These are accessible via the ``f.fit_quality`` dataframe. The dataframe has the name and description of each analysis, whether the model passes the test (``is_good``) and the value of the test result (``value``). The dataframe also has a ``message`` column (omitted below) to help the user interpret the results. The results for the "bad model" fit are shown below: +---------------+---------------------------------------------+---------+---------+ | index | description | is_good | value | +===============+=============================================+=========+=========+ | success | fit success status | True | True | +---------------+---------------------------------------------+---------+---------+ | num_obs | number of observations | True | 30.0 | +---------------+---------------------------------------------+---------+---------+ | num_param | number of floating fit parameters | True | 2.0 | +---------------+---------------------------------------------+---------+---------+ | lnL | log likelihood | True | -652.94 | +---------------+---------------------------------------------+---------+---------+ | chi2 | chi^2 goodness-of-fit | False | 0.0 | +---------------+---------------------------------------------+---------+---------+ | reduced_chi2 | reduced chi^2 | False | 42.748 | +---------------+---------------------------------------------+---------+---------+ | mean0_resid | t-test for residual mean != 0 | False | 0.0018 | +---------------+---------------------------------------------+---------+---------+ | durbin-watson | Durbin-Watson test for correlated residuals | False | 0.036 | +---------------+---------------------------------------------+---------+---------+ | ljung-box | Ljung-Box test for correlated residuals | False | 0.000 | +---------------+---------------------------------------------+---------+---------+ The first four entries describe the model: + Whether or not the fit terminated successfully + The number of floating parameters + The number of observations in ``y_obs`` + The log likelihood of the parameters in ``f.fit_df["estimate"]`` ``num_obs`` and ``lnL`` will always have ``is_good == True``. ``num_param`` will fail if the number of parameters is greater than the number of observations. The remaining entries allow the user to assess model quality. Model goodness-of-fit --------------------- The ``chi2`` and ``reduced chi2`` statistics assess whether the distribution of residuals matches what we would expect for a well-fit model. This corresponds to a test for "non-normal residuals" from the figure above. + **chi2** This runs a :math:`\chi ^{2}` goodness-of-fit analysis. The null hypothesis for this test is that the fit residuals can be described by a :math:`\chi ^{2}` distribution. If the distribution is extreme enough to reject the null hypothesis (p < 0.05), ``is_good`` will be set to ``False`` and the following ``message`` entry will appear: A p-value of [P_VALUE < 0.05] for the a goodness-of-fit Chi^2 test is evidence that the model does not describe the model well (model violation). Look for regions where your model systematically deviates from your data. If this test fails, you may need to use a different model to describe your data. + **reduced chi2**: This calculates the reduced :math:`\chi ^{2}` (:math:`\chi ^{2}/N` where :math:`N` is the number of fit parameters). This ratio should be about 1 for a well-fit model. If the value differs from one, dataprob will set ``is_good`` to ``False`` and return one of the following messages. If the reduced :math:`\chi ^{2}` is greater than 1.25: A reduced chi^2 value [VALUE > 1.25] may mean the model does not describe the data well or that the uncertainty on each point is underestimated. If the model is passing all other tests, you likely underestimated ``y_std``. Try increasing the magnitude of this value. If the model is failing other tests---particularly the Durbin-Watson and or the Ljung-Box tests for residual correlation---you should likely use a different model to describe the data. If the reduced :math:`\chi^{2}` is below 0.75: A reduced chi^2 value [VALUE < 0.75] may mean the model is overfit or that the uncertainty on each point is overestimated. This could occur if you were too conservative in your estimated uncertainty on each observation. Alternatively, this could mean you have more fit parameters than can be constrained by your data (`overfitting `_). One way to check for this is to examine the `Corner plot`_ for parameters that strongly co-vary with one another. Residuals with zero mean ------------------------ + **residual mean is zero** This runs a one-sampled, two-tailed t-test to see whether we can reject the null hypothesis that the mean of your your residuals is zero. If this p-value is less than 0.05, this test will set ``is_good`` to ``False`` and set the following message. A p-value of [P_VALUE < 0.05] for the one-sample t-test is evidence for fit residuals with a non-zero mean. This may mean your model does not describe your data well (model violation). Look for regions where your model systematically deviates from your data. If this test fails, you may need to use a different model to describe your data. Residual correlation -------------------- A well-fit model should have residuals that are uncorrelated along ``y_obs``. Uncorrelated residuals arise from uncorrelated uncertainty in each observation; correlated residuals arise when the model systematically differs from the observational data. dataprob runs two tests for correlated residuals. + **Durbin-Watson** The `Durbin-Watson `_ test looks for correlation in residuals. It is designed so the test statistic :math:`D` ranges from 0 to 4. .. math:: D = \frac{\sum_{i=1}^{i 3] is is evidence for negatively correlated fit residuals. This may mean your model does not describe your data well (model violation). Look for regions where your model systematically deviates from your data. If this test fails, you may need to use a different model to describe your data. + **Ljung-Box** The `Ljung-Box `_ test looks for correlation in residuals. Its test statistic :math:`Q` is expected to follow a :math:`\chi ^{2}` distribution. The test statistic is: .. math:: Q = n(n + 2) \sum_{k=1}^{m} (n - k)^{-1}\hat{r}^{2}_{k} where :math:`n` is the length of the residuals vector, :math:`k` is the lag, :math:`m` is the number of lags compared (:math:`m = n - 1` as implemented in dataprob), and :math:`\hat{r}` is the residuals. :math:`Q` is expected to follow a :math:`\chi^{2}` distribution. If :math:`Q` is more extreme than :math:`\chi^{2}` for a given :math:`\alpha` and number of degrees of freedom, we reject the null hypothesis of uncorrelated residuals. If this p-value is less than 0.05, dataprob will set ``is_good`` to ``False`` and set the following message. A p-value of [P_VALUE < 0.05] for the Ljung-Box test is evidence for correlated fit residuals. This may mean your model does not describe your data well (model violation). Look for regions where your model systematically deviates from your data. If this test fails, you may need to use a different model to describe your data. ----------- Corner plot ----------- The corner plot allows the user to visually assess how well determined each parameter is and the extent to which parameters estimates co-vary with one another. The meaning of the distributions is slightly different for each method; see the `Samples`_ section below for details. .. image:: _static/sample-comparisons/ml_corner.svg :align: center :alt: corner plot result without for a maximum likelihood fit :width: 75% In the example above, we see that the ``dG_unfold`` parameter varies between about -10 and -15 (top left) and that ``m_unfold`` varies between -5.5 and -3.5 (bottom right). These values are not independent of one another. The correlation plot (bottom left) reveals that the estimates of the two parameters strongly co-vary. Knowing about co-variation is helpful for a number of reasons. 1. If the reduced :math:`\chi ^{2}` suggests over fitting, the corner plot can reveal which parameters are unconstrained and should be fixed or removed from the model. 2. If we have two correlated parameters and can find an independent way to estimate one of the parameters -- say, with a different experiment -- we know we can then estimate the other parameter. 3. It might make sense to fold these two parameters into a single parameter. From the example above, we might re-write the model with a single parameter that is the ratio of ``dG_unfold`` and ``m_value`` . 4. When presenting results in a publication, a co-variation plot allows the reader to understand what an uncertainty like :math:`\Delta \hat{G}_{unfold} = -12.5 \pm 2.5` means with respect to the other parameters in the dataset. One feature of ``datprob.plot_corner`` is the ability to filter parameters. The model above has six parameters, many of which are 'nuisance' parameters that are important fo the fit but do not provide information about the (in this case) protein chemistry. We generated the above plot using the following command: .. code:: python dataprob.plot_corner(f,filter_params=["native","denat"])) This removed any parameter that had the text "native" or "denat" in its name. If we run the same call without the filter, we get the full corner plot with all six parameters. .. image:: _static/sample-comparisons/ml_corner-no-filter.svg :align: center :alt: corner plot result without a filter applied to remove nuisance parameters :width: 75% This larger plot is useful--and should generally be checked to make sure nothing crazy is going on with the nuisance parameters--but it also buries the results for the the two parameters we care the most about (``dG_unfold`` and ``m_unfold``). See ``help(dataprob.plot_corner)`` for details on how to do this filtering. ------- Samples ------- After ``f.fit()`` runs, a fitter object will have an attribute ``f.samples``. This is an numpy array with shape ``(num_samples,num_params)`` that holds vectors of parameters sampled from the conditional parameter probability distribution. These samples are used to draw the gray fit lines in a summary plot and to construct the distributions in a corner plot. The samples are calculated in different ways for the three methods. + **ml**: Samples parameter values from the covariance matrix assuming that the parameter uncertainty is normally distributed and centered on the maximum likelihood parameter estimate. See the `Maximum likelihood`_ section for details. + **bootstrap**: Each sample is a the result of a maximum likelihood fit of the model to a pseudo-replicate dataset generated by sampling from observation uncertainty. See the `Bootstrap`_ section for details. + **mcmc**: Records the the locations of Markov chains in parameter space as they traverse the space by Metropolis Monte Carlo. Each sample is a draw from the conditional parameter posterior distribution. See the `Bayesian MCMC`_ section for details. One can also run the ``f.get_sample_df()`` method to get a dataframe holding the sample outputs rather than sample parameter estimates. This will look something like the following, where ``y_calc`` is calculated using the parameters in ``f.fit_df["estimate"]`` and the columns beginning with ``s00`` are the model evaluated with the sampled parameter values. +-------+-------+-------+---------+---------+-----+ | y_obs | y_std | y_calc| s000000 | s000001 | ... | +=======+=======+=======+=========+=========+=====+ | -5 | 1 | -4.9 | -4.8 | -5.2 | ... | +-------+-------+-------+---------+---------+-----+ | ... | ... | ... | ... | ... | ... | +-------+-------+-------+---------+---------+-----+ Analysis method details ======================= ------------------ Maximum likelihood ------------------ Description ----------- The maximum likelihood (``"ml"``) method uses the `scipy.optimize.least_squares `_ function to find parameters (:math:`\vec{x}`) that minimize the weighted residuals between the model (:math:`y_{calc}(\vec{x})`) and our observations (:math:`y_{obs}`). The weight on each observation is set to :math:`1/y_{std}`, meaning that observations with higher uncertainty contribute less in the regression. The residual function for observation :math:`i` is given by: .. math:: r_{i}(\vec{x}) = (y_{calc,i}(\vec{x}) - y_{obs,i})/y_{std,i} By default, dataprob will minimize the square of this function: .. math:: \sum_{i=0}^{i`_, which can be approximated by :math:`(2 J^{T} J)`, where :math:`J` is the `Jacobian matrix `_. (:math:`J` is automatically calculated and returned by ``scipy.optimize.least_squares``.) This approach to parameter uncertainty is discussed `here `_ and in more detail `here `_. We calculate our estimated parameter distributions (:math:`\vec{\mathcal{N}}(\hat{\mu},\hat{\sigma}^{2})`) as: .. math:: \vec{\mathcal{N}}(\hat{\mu},\hat{\sigma}^{2}) = \vec{\mathcal{N}}(\mu = 0,\sigma^{2}=1) \cdot H^{-1} + \vec{\hat{\mu}} where :math:`\vec{\mathcal{N}}(\mu = 0,\sigma^{2}=1)` is an array of normal distributions with a mean of zero and standard deviation of one, :math:`H` is the Hessian matrix, and :math:`\vec{\hat{\mu}}` is an array of the maximum likelihood parameter estimates. This approach is unaware of parameter bounds, so the code truncates any regions of the resulting distribution that fall outside user-specified fit parameter bounds. Outputs ------- + The ``f.fit_df`` dataframe reports the mean, standard deviation, and 95% confidence intervals of the estimated parameter uncertainty. + The ``plot_corner(f)`` function plots histograms of the marginal distribution for each parameter, as well histograms showing the co-variation between pairs of parameter estimates as reflected in the inverse Hessian matrix. + The ``f.samples`` attribute samples values from the covariance matrix assuming that the parameter uncertainty is normally distributed and centered on the maximum likelihood parameter estimate. --------- Bootstrap --------- Description ----------- The ``bootstrap`` method performs maximum likelihood inferences on pseudo-replicate datasets sampled from the observation uncertainty, then returns the resulting distribution of parameter value estimates. The method assumes the uncertainty each observation :math:`i` is normally distributed, with a mean of ``y_obs[i]`` and population standard deviation of ``y_std[i]``. It further assumes that each observation is independent. The ``bootstrap`` method generates a defined number of pseudo-replicate datasets by sampling form these values. For each pseudo-replicate it: 1. Draws a random value for each observation from a normal distribution with that observation's mean and standard deviation. 2. Uses ``scipy.optimize.least_squares`` to find the maximum likelihood estimate of the parameter values for that psuedo-replicate dataset. Because observation uncertainty is captured by the sampling process, the regression for each pseudo-replicate uses the *unweighted* residuals: .. math:: r(\vec{x}) = (y_{calc}(\vec{x}) - y_{obs}) meaning the regression minimizes the following: .. math:: \sum_{i=0}^{i`_ applied to the ``f.samples``. This corresponds approximately to mode of the parameter estimates. + The ``f.fit_df`` dataframe also reports the standard deviation and 95% confidence intervals of the marginal distribution across pseudo-replicates for each parameter. + The ``plot_corner(f)`` function plots histograms of the marginal distribution for each parameter, as well histograms showing the co-variation between pairs of parameter estimates. + The ``f.samples`` attribute has all pseudo-replicate results, and thus samples the full conditional distribution of parameter values. ------------- Bayesian MCMC ------------- Description ----------- In a Bayesian Markov-Chain Monte Carlo analysis, we are interested in the posterior distribution of model parameter values consistent with our data and any prior information about our parameters. To do this, we sample parameter values using `Markov-Chain Monte Carlo `_. In simple terms, we create a large number of walkers that traverse parameter space. Each walker then iterates over the following steps: 1. Take a step (that is, alter one or more parameter values). 2. Determine how this changes the `posterior probability `_ by feeding the new parameters into the model and prior distributions. 3. Either keep or reject the step based on a `Metropolis criterion `_. This criterion keeps all moves that increase the posterior probability and stochastically keeps a subset of moves that decrease the posterior probability. 4. Record the current parameter values of the walker. Each walker will spend the most of its time in high probability regions of parameter space, but will also wander briefly through low probability regions. When run for a sufficient amount of time, the paths taken by the walkers will reveal the entire parameter posterior distribution. The posterior probability for a given parameter value is calculated from the likelihood (:math:`P(y_{obs}|\vec{x})`) and the parameter prior (:math:`P(\vec{x})`). Likelihood function ------------------- The log-likelihood for a set of parameters :math:`\vec{x}` is given by the following. Note that the core of the likelihood function is the weighted square difference between ``y_calc`` and ``y_obs``. .. math:: \text{ln} \left ( \mathcal{L}(\vec{x}) \right ) = - \frac{1}{2} \left ( \sum _{i=0} ^{i < N-1} \frac{ \left ( y _{i,calc}(\vec{x}) - y _{i,obs} \right )^{2} }{y^{2} _{std,i}} + \text{ln} ( 2 \pi y^{2} _{std,i} ) \right ) Prior probabilities ------------------- The parameter `prior probability `_ encodes other information about the parameter that is not included in the model itself. In dataprob, each parameter is assigned its own independent prior. If we know nothing about the parameter, we might use "uniform" priors (also known as "uninformative" priors). This means we believe every value of the parameter is equally likely. When analyzing an experiment, this is rarely the case--we almost always have some idea of what the value of the fit parameter is likely to be. That said, uniform priors are the default if no priors are specified. This is encoded by setting ``prior_mean`` and ``prior_std`` to ``np.nan`` in the ``f.param_df`` dataframe. If we have some bounds on the parameter (say, it must be higher than 0), but know nothing else, we can use uniform priors between bounds. This means that any parameter value between the bounds has the same probability; every parameter value outside the bounds has zero probability. This is encoded by setting ``lower_bound`` and/or ``upper_bound`` in ``f.param_df``, but leaving ``prior_mean`` and ``prior_std`` as ``np.nan``. If we have more information about a parameter, we can use more specific priors. We might have a pretty good guess about a parameter from outside experiments. To make it concrete: imagine we know an equilibrium constant for a reaction is :math:`log(K) \approx -6 \pm 1` based on a low-resolution experiment. We then perform a higher-resolution experiment and want to use dataprob to extract parameter estimates. We can encode the existing information using a Gaussian prior with a mean of -6 and standard deviation of 1. In the MCMC run, parameter values near -6 will have high prior probability; parameter values well away from that will have low prior probability. This would be encoded by setting ``prior_mean`` to -6 and ``prior_std`` to 1 in ``f.param_df``. Finally, one can encode a compound prior that uses both bounds and a Gaussian prior. In this case, any parameter value above or below the specified bounds has a prior probability of zero. Within the bounds, the prior is determined by the Gaussian. Any part of the Gaussian that falls outside the bounds is truncated. The trimmed Gaussian is renormalized to account for the loss of density. Compound priors are specified when a user sets ``prior_mean``, ``prior_std``, and at least one of ``lower_bound`` and ``upper_bound``. .. note:: We **strongly** recommend setting priors on your parameters, at least in the form of upper and lower bounds. It is very rare that one has no idea what a reasonable parameter value is. Defined priors ensures the sampler spends time in reasonable regions of parameter space and thus converges. Implementation -------------- dataprob uses the `emcee `_ `EnsembleSampler `_ class to do the MCMC sampling. Users can exert fine-grained control over this calculation by passing keyword arguments to ``.fit()``. These arguments are passed to the ``__init__`` function of ``EnsembleSampler``. For example, the following code alters the default move used by emcee when walkers traverse the parameter space. .. code:: python import emcee y_obs = np.arange(10)*5 - 14 + np.random.normal(0,1,size=10) def linear(m,b): return m*np.arange(10) + b f = dataprob.setup(linear, method="mcmc") f.fit(y_obs=y_obs, y_std=0.1, moves=emcee.moves.StretchMove()) # <- change emcee move The MCMC `EnsembleSampler `_ automatically assesses paramter estimate convergence by calculating the `parameter autocorrelation times `_. This is the number of steps over which the current value of the parameter no longer predicts the final value of the parameter. Its built-in convergence criterion is :math:`50 \tau`, where :math:`\tau` is the autocorrelation time. If the convergence criterion is not met, dataprob will warn the user and ``f.fit_quality`` will have ``success`` set to ``False``. dataprob can be configured to run MCMC until convergence. It does so by running multiple cycles of MCMC, checking for convergence after each run. To use this feature, set ``max_convergence_cycles`` to more than its default value of ``1`` when calling ``f.fit()``. In empirical tests, we've found a value of ``10`` usually leads to convergence. .. code:: python def linear(m,b): return m*np.arange(10) + b f = dataprob.setup(linear, method="mcmc") f.fit(y_obs=y_obs, y_std=0.1, max_convergence_cycles=10) # <--- .. warning:: Depending on model complexity, convergence can take a long time. This is why we set the default value of ``max_convergence_cyles`` to ``1``. Outputs ------- + The ``f.fit_df["estimate"]`` is determined by finding the maximum of a `Gaussian kernel density estimator `_ applied to the ``f.samples``. This corresponds to the mode parameter values in the posterior distribution. + The ``f.fit_df`` dataframe also reports the standard deviation and 95% credibility interval of the parameter marginal posterior probability distribution in ``f.samples``. + The ``plot_corner(f)`` function plots histograms of the marginal distribution for each parameter, as well histograms showing the co-variation between pairs of parameter estimates. + The final output is the ``f.samples`` array, which holds the parameter values sampled by the MCMC walkers over the course of the run. Advanced Model definitions ========================== -------------- fit_parameters -------------- In addition to defining names, the ``fit_parameters`` argument to ``dataprob.setup`` can be used to declare parameter attributes (``guess``, ``fixed``, ``lower_bound``, ``upper_bound``, ``prior_mean``, and ``prior_std``). To do this, ``fit_parameters`` can be one of five different types: + **list.** Each entry is the name of the parameter as a string (e.g. ``['a','b']``). + **dict with float values.** The keys are the parameter names; the values are the parameter guesses (e.g. ``{'a':5,'b':11}``). + **dict with dict values.** The keys are the parameter names; the values are dictionaries keying parameter attributes to their values. For example: .. code-block:: python fit_parameters = {"a":{"guess":1,"lower_bound":0}, "b":{"upper_bound":20}} This indicates that parameter ``a`` should have a guess of ``1`` and a lower bound of zero. Parameter ``b`` should have an upper bound of ``20``. Note that the dictionary does not need to exhaustively define all parameter attributes. Any attributes that not specified are assigned defaults. + **dataframe**. The dataframe must have a ``name`` column with parameter names (this corresponds directly to the parameter names in a normal ``fit_parameters`` list). Columns then define parameter features, just like the ``param_df`` dataframe. Not all columns must be present; missing attribute columns will be assigned their default values. + **string**: The software will treat this as a filename and will attempt to load it in as a dataframe (``xlsx``, ``csv``, and ``tsv`` are recognized.) ------------------------- Functions with \*\*kwargs ------------------------- Sometimes a python function has a ``**kwargs`` argument. This grabs any keyword argument the user sends in that is not specified in the function definition and passes it into the function. For example, the following function returns the value of ``a`` unless ``b`` is also passed to the function, in which case it returns ``a * b``. .. code-block:: python def some_fcn(a=1,**kwargs): if "b" in kwargs: return a*kwargs["b"] else: return a print(some_function(a=5)) # --> 5 print(some_function(a=5,b=2)) # --> 10 If a function has a ``**kwargs`` parameter, dataprob allows one to specify arguments not in the function definition as ``fit_parameters`` or in ``non_fittable_kwargs``. For example: .. code-block:: python def some_fcn(a=1,**kwargs): # do stuff here return some_1d_numpy_array f = dataprob.setup(some_fcn, fit_parameters=['a','b','c']) # When running the analysis, dataprob will do the following under the hood. y_calc = some_fcn(a=a_value,b=b_value,c=c_value) .. toctree:: :maxdepth: 2 :caption: Contents: