quasinet package

Submodules

quasinet.ciforest module

class quasinet.ciforest.CIForestClassifier(min_samples_split=2, alpha=0.05, selector='mc', max_depth=-1, n_estimators=100, max_feats='sqrt', n_permutations=100, early_stopping=True, muting=True, verbose=0, bootstrap=True, bayes=True, class_weight='balanced', n_jobs=-1, random_state=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Conditional forest classifier

Parameters
  • min_samples_split (int) – Minimum samples required for a split

  • alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees

  • selector (str) – Variable selector for finding strongest association between a feature and the label

  • max_depth (int) – Maximum depth to grow tree

  • max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’

  • n_permutations (int) – Number of permutations during feature selection

  • early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable

  • muting (bool) – Whether to perform variable muting

  • verbose (bool or int) – Controls verbosity of training and testing

  • bootstrap (bool) – Whether to perform bootstrap sampling for each tree

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

  • class_weight (str) – Type of sampling during bootstrap, None for regular bootstrapping, ‘balanced’ for balanced bootstrap sampling, and ‘stratify’ for stratified bootstrap sampling

  • n_jobs (int) – Number of jobs for permutation testing

  • random_state (int) – Sets seed for random number generator
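The interaction between alpha and early_stopping described above can be illustrated with a minimal sketch. It is not the library's implementation: `p_value_for` is a hypothetical callable standing in for a permutation test, and the behavior without early stopping (pick the smallest p-value below alpha) is an assumption.

```python
def select_split_feature(p_value_for, features, alpha=0.05, early_stopping=True):
    """Return the chosen feature index, or None if no feature is significant.

    p_value_for is a hypothetical callable mapping a feature index to the
    p-value of its permutation test; it is not part of quasinet.
    """
    best_feature, best_p = None, 1.0
    for feature in features:
        p = p_value_for(feature)
        if early_stopping and p < alpha:
            # First significant feature wins; the remaining tests are skipped.
            return feature
        if p < best_p:
            best_feature, best_p = feature, p
    # Assumed behavior without early stopping: most significant feature,
    # provided its p-value is below alpha.
    return best_feature if best_p < alpha else None


# Toy p-values for three features (illustrative numbers only).
toy_p_values = {0: 0.50, 1: 0.04, 2: 0.001}
first_hit = select_split_feature(toy_p_values.get, [0, 1, 2], early_stopping=True)
strongest = select_split_feature(toy_p_values.get, [0, 1, 2], early_stopping=False)
```

With early stopping the sketch stops at feature 1, the first one below alpha; without it, every feature is tested and feature 2 is chosen.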

fit(X, y)

Fit conditional forest classifier

Parameters
  • X (2d array-like) – Array of features

  • y (1d array-like) – Array of labels

Returns

self – Instance of CIForestClassifier

Return type

CIForestClassifier

predict(X)

Predicts class labels for feature vectors X

Parameters

X (2d array-like) – Array of features

Returns

y – Array of predicted classes

Return type

1d array-like

predict_proba(X)

Predicts class probabilities for feature vectors X

Parameters

X (2d array-like) – Array of features

Returns

class_probs – Array of predicted class probabilities

Return type

2d array-like

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') quasinet.ciforest.CIForestClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self – The updated object.

Return type

object

quasinet.ciforest.balanced_sampled_idx(random_state, y, bayes, min_class_p)

Indices for balanced bootstrap sampling in classification

Parameters
  • random_state (int) – Sets seed for random number generator

  • y (1d array-like) – Array of labels

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

  • min_class_p (float) – Minimum proportion of class labels

Returns

idx – Balanced sampled indices for each class

Return type

list

quasinet.ciforest.balanced_unsampled_idx(random_state, y, bayes, min_class_p)

Unsampled indices for balanced bootstrap sampling in classification

Parameters
  • random_state (int) – Sets seed for random number generator

  • y (1d array-like) – Array of labels

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

  • min_class_p (float) – Minimum proportion of class labels

Returns

idx – Balanced unsampled indices for each class

Return type

list

quasinet.ciforest.normal_sampled_idx(random_state, n, bayes)

Indices for bootstrap sampling

Parameters
  • random_state (int) – Sets seed for random number generator

  • n (int) – Sample size

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Sampled indices

Return type

list

quasinet.ciforest.normal_unsampled_idx(random_state, n, bayes)

Unsampled indices for bootstrap sampling

Parameters
  • random_state (int) – Sets seed for random number generator

  • n (int) – Sample size

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Unsampled indices

Return type

list
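As a rough illustration of how sampled and unsampled index sets relate in ordinary bootstrap sampling, here is a minimal sketch using only the standard library. The function name and `seed` argument are hypothetical, and the Bayesian variant selected by `bayes` is not covered.

```python
import random


def bootstrap_indices(n, seed=None):
    """Draw n indices with replacement; also return the indices never drawn."""
    rng = random.Random(seed)
    sampled = [rng.randrange(n) for _ in range(n)]
    drawn = set(sampled)
    # Unsampled ("out-of-bag") indices are those that were never drawn.
    unsampled = [i for i in range(n) if i not in drawn]
    return sampled, unsampled


sampled, unsampled = bootstrap_indices(20, seed=0)
```

Every index in range(n) ends up in exactly one of the two groups: drawn at least once, or never drawn.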

quasinet.ciforest.stratify_sampled_idx(random_state, y, bayes)

Indices for stratified bootstrap sampling in classification

Parameters
  • random_state (int) – Sets seed for random number generator

  • y (1d array-like) – Array of labels

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Stratified sampled indices for each class

Return type

list

quasinet.ciforest.stratify_unsampled_idx(random_state, y, bayes)

Unsampled indices for stratified bootstrap sampling in classification

Parameters
  • random_state (int) – Sets seed for random number generator

  • y (1d array-like) – Array of labels

  • bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Stratified unsampled indices for each class

Return type

list
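Stratified bootstrap sampling can be sketched as resampling with replacement inside each class, so that class counts are preserved. This is an assumption about the general technique rather than a description of quasinet's code; the function name and `seed` argument are hypothetical, and the `bayes` option is not modeled.

```python
import random
from collections import defaultdict


def stratified_bootstrap_indices(y, seed=None):
    """Resample indices with replacement separately within each class of y."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    sampled = []
    for members in by_class.values():
        # Draw as many indices as the class has members, from that class only.
        sampled.extend(rng.choice(members) for _ in members)
    return sampled


labels = ["a"] * 6 + ["b"] * 2
idx = stratified_bootstrap_indices(labels, seed=0)
```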

quasinet.citrees module

class quasinet.citrees.CITreeBase(min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)

Bases: object

Base class for conditional inference tree.

Parameters
  • min_samples_split (int) – Minimum samples required for a split

  • alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees

  • max_depth (int) – Maximum depth to grow tree

  • max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’

  • n_permutations (int) – Number of permutations during feature selection

  • early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable

  • muting (bool) – Whether to perform variable muting

  • verbose (bool or int) – Controls verbosity of training and testing

  • random_state (int) – Sets seed for random number generator

fit(X, y=None)

Train model.

X, y must contain only string datatypes.

Parameters
  • X (2d array-like) – Array of categorical features

  • y (1d array-like) – Array of labels

Returns

self – Instance of CITreeBase class

Return type

CITreeBase

predict(*args, **kwargs)

Predicts labels on test data. This method should not be callable from base class.

predict_label(X, tree=None)

Predicts label

Parameters
  • X (1d array-like) – Array of features for single sample

  • tree (CITreeBase) – Trained tree

Returns

label – Predicted label

Return type

str

print_tree(tree=None, indent=' ', child=None)

Prints tree structure

Parameters
  • tree (CITreeBase) – Trained tree model

  • indent (str) – Indent spacing

  • child (Node) – Left or right child node

Returns

Return type

None

class quasinet.citrees.CITreeClassifier(min_samples_split=2, alpha=0.05, selector='chi2', max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)

Bases: quasinet.citrees.CITreeBase, sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Conditional inference tree classifier

NOTE: as of now, the features can only be categorical

Parameters
  • selector (str) – Variable selector for finding strongest association between a feature and the label

  • Remaining parameters are derived from the CITreeBase class; see its constructor for parameter definitions

fit(X, y, labels=None)

Train conditional inference tree classifier

Parameters
  • X (2d array-like) – Array of features

  • y (1d array-like) – Array of labels

  • labels (1d array-like) – Array of unique class labels

Returns

self – Instance of CITreeClassifier class

Return type

CITreeClassifier

predict(X)

Predicts class labels for feature vectors X

Parameters

X (2d array-like) – Array of features

Returns

y – Array of predicted classes

Return type

1d array-like

predict_proba(X)

Predicts class probabilities for feature vectors X

Parameters

X (2d array-like) – Array of features

Returns

class_probs – Array of predicted class probabilities

Return type

2d array-like

set_fit_request(*, labels: Union[bool, None, str] = '$UNCHANGED$') quasinet.citrees.CITreeClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in fit.

Returns

self – The updated object.

Return type

object

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') quasinet.citrees.CITreeClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self – The updated object.

Return type

object

quasinet.citrees.get_feature_importance(citree, normalize=True)

Get the feature importance of the citree.

Parameters
  • citree (CITreeBase) – A conditional inference tree

  • normalize (bool) – Whether to normalize the feature importance or not

Returns

col_to_importance – Mapping from column index to total feature importance

Return type

dict

quasinet.curvature module

quasinet.curvature.compute_curvature(p, delta)

Computes the curvature (scalar curvature) at a given point in the space of Quasinets.

The curvature R is computed as:

\[R = G^{ij} R_{ij}\]

where G^{ij} is the inverse of the metric tensor G_{ij}, and R_{ij} is the Ricci curvature.

Parameters
  • p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’

  • delta (float) – A small number representing a change in each coordinate direction

Returns

The curvature at point p

Return type

float

quasinet.curvature.compute_metric_tensor(p_distrib, delta, progress=False)

Computes the metric tensor at a given point in the space of Quasinets.

The metric tensor G_ij is defined as:

\[G_{ij} = \frac{1}{2} \left( D(p + \delta p_i + \delta p_j, p) - D(p + \delta p_i, p) - D(p + \delta p_j, p) + D(p, p) \right)\]

where D is the distance function, p_i is the i-th unit Quasinet, and delta is a small perturbation.

Parameters
  • p_distrib (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which the metric tensor is calculated

  • delta (float) – A small number representing a change in each coordinate direction

  • progress (bool) – Whether to show a progress bar

Returns

The metric tensor at point p (the quasinet for which p_distrib is calculated)

Return type

ndarray
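The finite-difference formula for G_ij can be checked on a toy space. The sketch below is not quasinet's implementation: points are plain coordinate lists, `add_delta` is a hypothetical stand-in for perturbing a Quasinet along unit directions, and `squared_euclidean` is a stand-in for the distance function D.

```python
def metric_tensor_sketch(p, delta, distance, perturb):
    """Evaluate G_ij = 0.5 * (D(p+d_i+d_j, p) - D(p+d_i, p) - D(p+d_j, p) + D(p, p))."""
    n = len(p)
    d_pp = distance(p, p)
    # D(p + delta * p_i, p) for every coordinate direction i.
    d_single = [distance(perturb(p, [i], delta), p) for i in range(n)]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d_both = distance(perturb(p, [i, j], delta), p)
            G[i][j] = 0.5 * (d_both - d_single[i] - d_single[j] + d_pp)
    return G


def add_delta(p, indices, delta):
    """Toy perturbation: add delta to each listed coordinate (hypothetical)."""
    q = list(p)
    for k in indices:
        q[k] += delta
    return q


def squared_euclidean(p, q):
    """Toy distance function standing in for D (hypothetical)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


G = metric_tensor_sketch([0.2, 0.5, 0.3], 0.01, squared_euclidean, add_delta)
```

For this toy choice of D the formula gives delta squared on the diagonal and zero elsewhere, which is a quick way to sanity-check the signs of the four terms.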

quasinet.curvature.compute_metric_tensor_derivative(p, delta)

Computes the derivative of the metric tensor at a given point in the space of Quasinets.

The derivative of the metric tensor G_ij with respect to the k-th coordinate is computed as:

\[\frac{\partial G_{ij}}{\partial p_k} = \frac{G_{ij}(p + \delta p_k) - G_{ij}(p)}{\delta}\]

Parameters
  • p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which to compute the metric tensor derivative

  • delta (float) – A small number representing a change in each coordinate direction

Returns

The derivative of the metric tensor at point p

Return type

ndarray

quasinet.curvature.compute_ricci_curvature(p, delta)

Computes the Ricci curvature at a given point in the space of Quasinets.

The Ricci curvature R_ij is computed as:

\[R_{ij} = G^{kl} \left( \frac{\partial^2 G_{ij}}{\partial p_k \partial p_l} - \frac{1}{2} \frac{\partial^2 G_{kl}}{\partial p_i \partial p_j} \right)\]

where G^{kl} is the inverse of the metric tensor G_{kl}, and the partial derivatives are computed by taking the limit as delta goes to zero.

Parameters
  • p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’

  • delta (float) – A small number representing a change in each coordinate direction

Returns

The Ricci curvature at point p

Return type

ndarray

quasinet.curvature.delta_pi(qnet_instance, index, delta)

This function modifies the distribution of the given Quasinet instance by scaling it with a scalar value in the direction of the given index.

Parameters
  • qnet_instance (Quasinet) – The Quasinet instance to modify

  • index (int) – The index of the feature direction to scale

  • delta (float) – The scalar to scale the distribution with

Returns

The Quasinet instance with modified distribution

Return type

Quasinet

quasinet.curvature.dist_scalr_mult(D1, a)

Multiply each value in the dictionary with scalar ‘a’ and renormalize to get a valid probability distribution.

Parameters
  • D1 (dict) – Dictionary where each key-value pair represents an item and its probability.

  • a (float) – Scalar to multiply with each value of D1.

Returns

New dictionary with each value scaled and renormalized.

Return type

dict

quasinet.curvature.dist_sum(D1, D2)

Add each corresponding value in D1 and D2, then renormalize to get a valid probability distribution.

Parameters
  • D1 (dict) – First dictionary, where each key-value pair represents an item and its probability.

  • D2 (dict) – Second dictionary, where each key-value pair represents an item and its probability.

Returns

New dictionary with each value being the sum of the corresponding values in D1 and D2, renormalized.

Return type

dict
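The add-then-renormalize step can be sketched in a few lines. This is an illustration, not the library's code: the function name is hypothetical, and treating a key missing from one dictionary as probability zero is an assumption.

```python
def dist_sum_sketch(d1, d2):
    """Add two item->probability dicts key by key, then renormalize to sum to 1."""
    keys = set(d1) | set(d2)
    # Assumption: an item absent from one dictionary contributes zero.
    summed = {k: d1.get(k, 0.0) + d2.get(k, 0.0) for k in keys}
    total = sum(summed.values())
    return {k: v / total for k, v in summed.items()}


mixed = dist_sum_sketch({"A": 0.5, "B": 0.5}, {"A": 1.0})
```

Here the raw sums are 1.5 for "A" and 0.5 for "B"; dividing by their total of 2.0 gives 0.75 and 0.25.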

quasinet.curvature.distance_function(p, q, NULL=None, strtype='U5')

Computes the distance between two Quasinets.

Parameters
  • p (Quasinet) – First Quasinet

  • q (Quasinet) – Second Quasinet

Returns

The distance between p and q

Return type

float

quasinet.curvature.distance_function_distrib(p, q, i)

Computes the distance between two Quasinets, assuming that p and q differ only at the estimator coordinates listed in i.

Parameters
  • p (list[dict(str, float)]) – Output of quasinet.predict_distributions()

  • q (list[dict(str, float)]) – Output of quasinet.predict_distributions()

  • i (1d numpy array) – Indices at which p and q differ

quasinet.curvature.mt_worker(args)
quasinet.curvature.perturb_quasinet(qnet_instance, index, delta)

Perturbs a Quasinet in the direction of the i-th feature.

Parameters
  • qnet_instance (list[dict(str, float)]) – Output of quasinet.predict_distributions()

  • index (int) – The index of the feature direction to perturb in

  • delta (float) – The magnitude of the perturbation

Returns

The perturbed Quasinet

Return type

Quasinet

quasinet.curvature.perturb_quasinet_distrib(p_distrib_, index, delta)

Perturbs a Quasinet in the direction of the i-th feature, using only the distributions at each estimator, which are produced by the predict_distributions function

Parameters
  • p_distrib_ (list[dict(str, float)]) – Output of quasinet.predict_distributions()

  • index (int) – The index of the feature direction to perturb in

  • delta (float) – The magnitude of the perturbation

Returns

The perturbed Quasinet

Return type

Quasinet

quasinet.curvature.scalarmod_predict_distribution(self, column_to_item, column, **kwargs)

Modify the predict_distribution function of the Quasinet object to scale the output probabilities for a specified feature.

Parameters
  • self (Quasinet object) – The Quasinet instance.

  • column_to_item (dict) – A dictionary mapping from column names to specific items.

  • column (str) – The name of the column (feature) to scale.

  • **kwargs (dict) – Additional arguments passed to the predict_distribution function.

Returns

A dictionary of probabilities for each item in the specified column.

Return type

dict

quasinet.curvature.sum_predict_distribution(self, column_to_item, column, **kwargs)

quasinet.export module

class quasinet.export.GraphvizTreeExporter(tree, outfile, response_name, feature_names, text_color='black', edge_color='gray', font_size=10, edge_label_color='deepskyblue4', pen_width=2, background_color='transparent', dpi=200, edge_fontcolor='grey14', rotate=False, add_legend=True, min_size=1, color_alpha=-1.5, labels=None, detailed_output=False)

Bases: object

Export the tree using graphviz.

Parameters
  • qnet (Qnet) – A Qnet instance

  • outfile (str) – Output file to save results to

  • response_name (str) – Name of the y variable that we are predicting

  • feature_names (list) – Names of each of the features

  • text_color (str) – Color to set the text

  • edge_color (str) – Color to set the edges

  • pen_width (int) – Width of pen for drawing boundaries

  • dpi (int) – Image resolution

  • rotate (bool) – If True, rotate the tree

  • add_legend (bool) – If True, add a legend to the tree

  • detailed_output (bool) – If False (default), output the probability of the maximum-likelihood label in each leaf; otherwise output the full probability distribution

  • edge_fontcolor (str) – Color of edge label text

  • min_size (int) – Minimum number of nodes to draw the tree

  • labels (list) – List of all labels, optional

Returns

Return type

None

export()
class quasinet.export.QnetGraphExporter(qnet, outfile, threshold)

Bases: object

Export the qnet as a graph to a dot file format.

Parameters
  • qnet (Qnet) – A Qnet instance

  • outfile (str) – Output file to save results to

  • threshold (float) – Numeric cutoff for edge weights. Edges whose weight exceeds this cutoff are included in the graph.

Returns

Return type

None

export()

quasinet.feature_importance module

quasinet.feature_importance.getShap(model_, num_backgrounds=1, samples=None, num_samples=5, strtype='U5', fast_estimate=False)

Function to compute SHAP values for feature importance analysis.

Parameters
  • model_ (Qnet object) – The Qnet model.

  • num_backgrounds (int) – Number of background samples to generate. Default is 1.

  • num_samples (int) – Number of samples for the SHAP analysis. Default is 5.

  • strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.

  • samples (numpy array) – samples to run shap analysis on. Default is None. If None, generate via qsampling

  • fast_estimate (bool) – If True, use tree explainer with a CatBoostRegressor model for faster estimation. Default is False.

Returns

  • pandas.DataFrame – A dataframe containing the SHAP values for each feature.

  • numpy.array – Array of indices into the model's feature_names, ordered by descending SHAP value

quasinet.feature_importance.qnet_model_func(X)

Function to compute the distance matrix for a given set of sequences.

Parameters

X (numpy.ndarray) – Array of sequences.

Returns

The computed distance matrix.

Return type

numpy.ndarray

quasinet.feature_selectors module

quasinet.feature_selectors.permutation_test_chi2(x, y, B=100, random_state=None, **kwargs)

Permutation test using chi-squared.

This is used when x and y are nominal variables.

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • n_classes (int) – Number of classes

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float
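The idea of a chi-squared permutation test for two nominal variables can be sketched with the standard library alone. This is a generic illustration, not quasinet's implementation: both function names and the `seed` argument are hypothetical, and estimating the p-value as the plain fraction of permutations that reach the observed statistic is an assumption.

```python
import random
from collections import Counter


def chi2_statistic(x, y):
    """Pearson chi-squared statistic of the contingency table of x and y."""
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    observed = Counter(zip(x, y))
    stat = 0.0
    for xv, xc in x_counts.items():
        for yv, yc in y_counts.items():
            expected = xc * yc / n
            stat += (observed[(xv, yv)] - expected) ** 2 / expected
    return stat


def permutation_test_chi2_sketch(x, y, B=100, seed=None):
    """Fraction of B label permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    theta = chi2_statistic(x, y)
    y_perm = list(y)
    exceed = 0
    for _ in range(B):
        rng.shuffle(y_perm)  # break any association between x and y
        if chi2_statistic(x, y_perm) >= theta:
            exceed += 1
    return exceed / B


x_toy = ["a", "b"] * 20
p_dependent = permutation_test_chi2_sketch(x_toy, list(x_toy), B=200, seed=0)
```

When y is an exact copy of x, almost no shuffle reproduces the observed statistic, so the estimated significance level is very small.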

quasinet.feature_selectors.permutation_test_dcor(x, y, B=100, random_state=None)

Permutation test for distance correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_dcor_parallel(x, y, B=100, n_jobs=-1, random_state=None)

Parallel implementation of permutation test for distance correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • B (int) – Number of permutations

  • n_jobs (int) – Number of cpus to use for processing

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_mc(x, y, B=100, n_classes=None, random_state=None)

Permutation test for multiple correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • n_classes (int) – Number of classes

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_mi(x, y, B=100, random_state=None, **kwargs)

Permutation test for mutual information

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • n_classes (int) – Number of classes

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_pcor(x, y, B=100, random_state=None)

Permutation test for Pearson correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_rdc(x, y, B=100, random_state=None)

Permutation test for randomized dependence coefficient

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • B (int) – Number of permutations

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_rdc_parallel(x, y, B=100, n_jobs=-1, k=10, random_state=None)

Parallel implementation of permutation test for randomized dependence coefficient

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • B (int) – Number of permutations

  • n_jobs (int) – Number of cpus to use for processing

  • k (int) – Number of random projections for cca

  • random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.metrics module

quasinet.metrics.convert_lists_to_ctypes(V_list, key_list)
quasinet.metrics.js_divergence(p1, p2, smooth=0.0001)

Compute the Jensen-Shannon divergence of discrete probability distributions.

Parameters
  • p1 (1d array-like) – probability distribution

  • p2 (1d array-like) – probability distribution

  • smooth (float) – amount by which to smooth out the probability distribution of p2. This is intended to deal with categories with zero probability.

Returns

js_div – js divergence

Return type

numeric
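For reference, the Jensen-Shannon divergence averages the Kullback–Leibler divergences of each distribution from their midpoint. The sketch below illustrates that definition only; the function name is hypothetical, and both the smoothing scheme (add `smooth` to every entry of p2, then renormalize) and the use of the natural logarithm are assumptions rather than documented behavior.

```python
import math


def js_divergence_sketch(p1, p2, smooth=0.0001):
    """Jensen-Shannon divergence (natural log) with assumed smoothing of p2."""
    # Assumed smoothing: add a constant to each entry of p2 and renormalize.
    total = sum(v + smooth for v in p2)
    q = [(v + smooth) / total for v in p2]
    # Midpoint distribution between p1 and the smoothed p2.
    m = [(a + b) / 2 for a, b in zip(p1, q)]

    def kl(p, r):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(p, r) if a > 0)

    return 0.5 * kl(p1, m) + 0.5 * kl(q, m)


same = js_divergence_sketch([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence_sketch([1.0, 0.0], [0.0, 1.0], smooth=0.0)
```

Identical distributions give a divergence of zero, while distributions with disjoint support reach the maximum of ln 2 under the natural logarithm.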

quasinet.metrics.kl_divergence(p1, p2)

Compute the Kullback–Leibler divergence of discrete probability distributions.

NOTE: we will not perform error checking in this function because this function is used very frequently. The user should check that p1 and p2 are in fact probability distributions.

Parameters
  • p1 (1d array-like) – probability distribution

  • p2 (1d array-like) – probability distribution

Returns

output – kl divergence

Return type

numeric
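Since the function itself skips error checking, a caller may want to validate inputs first. The sketch below shows one way to do that alongside the textbook definition of the divergence; both function names are hypothetical, and the natural logarithm is an assumption.

```python
import math


def is_probability_distribution(p, tol=1e-9):
    """Check that all entries are non-negative and sum to 1 within tol."""
    return all(v >= 0 for v in p) and abs(sum(p) - 1.0) <= tol


def kl_divergence_sketch(p1, p2):
    """KL(p1 || p2) = sum_i p1[i] * log(p1[i] / p2[i]), natural log assumed."""
    # Entries with p1[i] == 0 contribute nothing; p2[i] must be > 0 elsewhere.
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


p, q = [1.0, 0.0], [0.5, 0.5]
if is_probability_distribution(p) and is_probability_distribution(q):
    kl_pq = kl_divergence_sketch(p, q)
```

Note that the divergence is undefined when p2 is zero somewhere p1 is not; in this sketch that case raises ZeroDivisionError.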

quasinet.metrics.process_dict1_list(dict1_list)
quasinet.metrics.process_dict2_list(dict2_list)
quasinet.metrics.theta(seq1_list, seq2_list)
quasinet.metrics.theta_(dict1_list, dict2_list)
quasinet.metrics.theta_matrix(list_dict1_list, list_dict2_list=None)
quasinet.metrics.theta_matrix_par(list_dict1_list, list_dict2_list)

quasinet.osfix module

quasinet.osfix.osfix(OS=None)

quasinet.qnet module

class quasinet.qnet.Qnet(feature_names, min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, early_stopping=False, verbose=0, random_state=None, n_jobs=1)

Bases: object

Qnet architecture.

Parameters
  • feature_names (list) – List of names describing the features

  • min_samples_split (int) – Minimum samples required for a split

  • alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees

  • max_depth (int) – Maximum depth to grow tree

  • max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’

  • early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable

  • verbose (bool or int) – Controls verbosity of training and testing

  • random_state (int) – Sets seed for random number generator

  • n_jobs (int) – Number of CPUs to use when training

clear_attributes()

Remove the unneeded attributes to save memory.

Parameters

None

Returns

Return type

None

fit(X, index_array=None)

Train the qnet

Examples

>>> from quasinet import qnet
>>> X = load_data()
>>> myqnet = qnet.Qnet(feature_names=feature_names)
>>> myqnet.fit(X)

Parameters
  • X (2d array-like) – Array of features

  • index_array (1d array-like) – Array of indices to generate estimators for. Uses all indices by default or if set to None.

Returns

self – Instance of Qnet class

Return type

Qnet

mix(qnet_2, feature_name_list)

Take columns from qnet_2 and swap their estimators into the current qnet in-place. This makes it possible to simulate the behavior of the current qnet if some of its estimators behaved like those of a second one, and can be used to identify the maximally divergent rules of organization between the two models. Also sets the attribute self.mixed to True.

Parameters
  • qnet_2 (Qnet) – A Qnet instance

  • feature_name_list (list) – A list of variable (feature) names that would be replaced in self from qnet_2

Returns

Return type

None

predict_distribution(column_to_item, column)

Predict the probability distribution for a given column.

A certain column value may not appear in the resulting output. If that happens, the probability of that value is 0.

Parameters
  • column_to_item (dict) – dictionary mapping the column to the values the columns take

  • column (int) – column index

Returns

output – dictionary mapping possible column values to probability values

Return type

dictionary

predict_distributions(seq)

Predict the probability distributions for all the columns.

If you do not want to set a particular value for an index of seq, then set the value at the index to the global nan_value. By default, this value is the empty string.

The length of the input sequence must match the size of feature_names.

Parameters

seq (list) – list of values

Returns

prob_distributions – list of dictionaries of probability distributions, one for each index

Return type

list

viz_trees(tree_path, draw=True, big_enough_threshold=-1, prog='dot', format='pdf', remove_dotfile=True, remove_newline=False, **kwargs)

Generate dot files for individual estimators, and optionally render them to pdf.

Parameters
  • tree_path (string) – path to where dotfiles will be generated. Creates directory if not present

  • draw (bool) – Set to True to render dotfiles (default True)

  • prog (str) – Graphviz program used for rendering (default: dot, other values: neato, fdp, sfdp)

  • format (str) – Format of rendered file (default: pdf, other values png, svg)

  • remove_dotfile (bool) – Delete all dot files if set to True (default: True)

  • remove_newline (bool) – Remove newlines from edge labels in the tree visualization to prettify it

  • **kwargs (dict, optional) – Additional keyword arguments to be passed to export_qnet_tree. Refer to the documentation of export_qnet_tree for details on accepted arguments.

Returns

Return type

None

quasinet.qnet.export_qnet_graph(qnet, threshold, outfile)

Export the qnet as a graph of dependencies. The output will be in the .dot file format for graphs.

Parameters
  • qnet (Qnet) – A Qnet instance

  • threshold (float) – Numeric cutoff for edge weights. Edges whose weight exceeds this cutoff are included in the graph.

  • outfile (str) – File name to save to.

Returns

Return type

None
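The thresholding rule can be illustrated independently of the library. The sketch below filters a dictionary of weighted edges and emits DOT text; the function name, the edge dictionary layout, the directed-graph output, and the label format are all hypothetical choices, not the format quasinet actually writes.

```python
def edges_to_dot(edge_weights, threshold):
    """Return DOT text containing only edges whose weight exceeds threshold."""
    lines = ["digraph qnet_sketch {"]
    for (src, dst), weight in sorted(edge_weights.items()):
        if weight > threshold:
            lines.append(f'    "{src}" -> "{dst}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy edge weights (illustrative numbers only).
dot_text = edges_to_dot({("a", "b"): 0.9, ("b", "c"): 0.1}, threshold=0.5)
```

Raising the threshold yields a sparser graph, since only the strongest dependencies survive the cutoff.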

quasinet.qnet.export_qnet_tree(qnet, index, outfile, outformat='graphviz', detailed_output=False, pen_width=3, edge_color='black', edge_label_color='black', dpi=200, text_color='black', font_size=10, background_color='transparent', rotate=False, edge_fontcolor='grey14', min_size=1, color_alpha=1.5, labels=None, add_legend=False)

Export a tree from qnet. The index determines which tree to export.

Parameters
  • qnet (Qnet) – A Qnet instance

  • index (int) – Index of the tree to export

  • outformat (str) – Can only be graphviz for now. This will output a .dot file, which you can then compile using a command like dot -Tpng file.dot -o file.png

  • detailed_output (bool) – If True return detailed probabilities of output labels in leaf nodes. default: False

  • text_color (str) – Color to set the text

  • edge_color (str) – Color to set the edges

  • pen_width (int) – Width of pen for drawing boundaries

  • dpi (int) – Image resolution

  • rotate (bool) – If True, rotate the tree

  • add_legend (bool) – If True, add a legend to the tree

  • edge_fontcolor (str) – Color of edge label text

  • min_size (int) – Minimum number of nodes to draw the tree

  • color_alpha (float) – Parameter for color brightness

  • labels (list) – List of all labels, optional

Returns

Return type

None

quasinet.qnet.fit_save(df, slice_range=None, n_jobs=10, alpha=0.1, file_prefix='model', strtype='U5', low_mem=True, compress=True)

Fit and save qnet model as gz.

Parameters
  • df (pandas.DataFrame) – Input data with feature names as columns

  • slice_range (numpy 1D array of ints) – Index array used to build the model, passed to index_array in fit

  • alpha (float) – qnet fit significance level

  • file_prefix (str) – filename prefix for model file

  • low_mem (bool) – If True, save the model in low-memory mode (default: True)

  • compress (bool) – True if we want gzipped models (default: True)

  • strtype (str) – string type specification (default: U5)

Returns

qnet

Return type

Qnet

quasinet.qnet.load_qnet(f, gz=False)

Load the qnet from a file.

Parameters
  • f (str) – File name.

  • gz (bool) – Bool to indicate if file is gzipped (default: False)

Returns

qnet

Return type

Qnet

quasinet.qnet.membership_degree(seq, qnet)

Compute the membership degree of a sequence in a qnet.

Parameters
  • seq (1d array-like) – Array of values

  • qnet (Qnet) – the Qnet that seq belongs to

Returns

membership_degree – membership degree

Return type

numeric

quasinet.qnet.qdistance(seq1, seq2, qnet1, qnet2, mismatch=False, FULL_C=False)

Compute the Jensen-Shannon divergence of discrete probability distributions.

Parameters
  • seq1 (1d array-like) – Array of values

  • seq2 (1d array-like) – Array of values

  • qnet1 (Qnet) – the Qnet that seq1 belongs to

  • qnet2 (Qnet) – the Qnet that seq2 belongs to

  • mismatch (bool) – Indicate if there is a mismatch in feature names

Returns

output – qdistance

Return type

numeric
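
As a minimal sketch of the underlying quantity, the Jensen-Shannon divergence between two discrete distributions can be computed as follows. This shows only the divergence itself, using base-2 logarithms as an assumption; how qdistance derives per-feature distributions from the two qnets and combines them is not specified here, and the helper names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])      # identical distributions
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # no shared support
print(same, disjoint)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support).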

quasinet.qnet.qdistance_matrix(seqs1, seqs2, qnet1, qnet2)

Compute a distance matrix with the qdistance metric.

Parameters
  • seqs1 (2d array-like) – Array of values

  • seqs2 (2d array-like) – Array of values

  • qnet1 (Qnet) – the Qnet that seqs1 belongs to

  • qnet2 (Qnet) – the Qnet that seqs2 belongs to

Returns

distance_matrix – distance matrix

Return type

2d array-like

quasinet.qnet.save_qnet(qnet, f, low_mem=True, gz=False)

Save the qnet to a file.

NOTE: The file name must end in .joblib

TODO: using joblib is actually less memory efficient than using pickle. However, I don’t know if this is a general problem or this only happens under certain circumstances.

TODO: we may have to delete and garbage-collect some attributes of the qnet to save memory. For example, .feature_importances_, .available_features_

Parameters
  • qnet (Qnet) – A Qnet instance

  • f (str) – File name

  • low_mem (bool) – If True, save the Qnet with low memory by deleting all data attributes except the tree structure (default: True)

  • gz (bool) – Specification if we want gzipped output (default: False)

Returns

Return type

None

quasinet.qsampling module

quasinet.qsampling.qsample(seq, qnet, steps, baseline_prob=None, force_change=False, alpha=None, random_seed=None)

Perform q-sampling for multiple steps.

Qsampling works as follows: Say you have a sequence and a qnet. Then we randomly pick one of the items in the sequence and then change the value of that item based on the prediction of the qnet.

Parameters
  • seq (1d array-like) – Array of values

  • qnet (Qnet) – The Qnet that seq belongs to

  • steps (int) – Number of steps to run q-sampling

  • baseline_prob (1d array-like) – Baseline probability for sampling which index

  • force_change (bool) – Whether to force the sequence to change when sampling.

  • alpha (float) – Scalar multiple of the qnet object; can be any real number

Returns

seq – q-sampled sequence

Return type

1d array-like
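
The sampling loop described above can be sketched as follows. The predictor function is a hypothetical stand-in for a qnet prediction, and uniform index selection is an assumption; this is not the implementation of qsample, and it ignores baseline_prob, force_change, and alpha.

```python
import random


def qsample_sketch(seq, predict_distribution, steps, rng):
    """Repeatedly pick a random position and resample its value."""
    seq = list(seq)  # work on a copy so the input is left untouched
    for _ in range(steps):
        idx = rng.randrange(len(seq))            # pick one item at random
        distrib = predict_distribution(seq, idx)  # value -> probability
        values = list(distrib)
        weights = [distrib[v] for v in values]
        seq[idx] = rng.choices(values, weights=weights, k=1)[0]
    return seq


def toy_predictor(seq, idx):
    # Hypothetical stand-in: a fixed distribution regardless of context.
    return {"A": 0.7, "C": 0.3}


rng = random.Random(0)
original = ["G", "G", "G", "G"]
sampled = qsample_sketch(original, toy_predictor, steps=10, rng=rng)
print(sampled)
```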

quasinet.qsampling.targeted_qsample(seq1, seq2, qnet, steps, force_change=False)

Perform targeted q-sampling for multiple steps.

seq1 is q-sampled towards seq2.

This is similar to qsample, except that we perform changes to seq1 to try to approach seq2.

Parameters
  • seq1 (1d array-like) – Array of values

  • seq2 (1d array-like) – Array of values.

  • qnet (Qnet) – The Qnet that seq1 belongs to

  • steps (int) – Number of steps to run q-sampling

  • force_change (bool) – Whether to force the sequence to change when sampling.

Returns

seq – q-sampled sequence

Return type

1d array-like

quasinet.qseqtools module

quasinet.qseqtools.list_trained_qnets()

List the possible qnets we can use.

Parameters

None

Returns

Return type

None

quasinet.qseqtools.load_sequence(file)

Load a fasta sequence from file.

Parameters

file (str) – File of a fasta sequence

Returns

seq – A fasta sequence

Return type

1d array-like

quasinet.qseqtools.load_trained_qnet(qnet_type, extra_descriptions)

Load the pre-trained qnet.

Examples

>>> load_trained_qnet('coronavirus', 'bat')
>>> load_trained_qnet('influenza', 'h1n1;na;2009')

Parameters
  • qnet_type (str) – The type of qnet to load

  • extra_descriptions (str) – Extra descriptions for which qnet to load. For influenza, the descriptions must be separated by ;

Returns

trained_qnet – A trained qnet

Return type

qnet.Qnet

quasinet.scorers module

quasinet.scorers.MEAN(z)
quasinet.scorers.approx_wdcor(x, y)

Approximate distance correlation by binning arrays

NOTE: Code ported from R function approx.dcor at:

https://rdrr.io/cran/extracat/src/R/wdcor.R

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.c_dcor(x, y)

Wrapper for C version of distance correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float
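
For orientation, the sample distance correlation of two 1d arrays can be sketched in pure Python as below. This is the textbook O(n²) definition based on double-centered distance matrices; it is an illustrative assumption, not the C or Numba code wrapped by the functions in this module, and the helper names are hypothetical.

```python
import math


def centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    dist = [[abs(a - b) for b in v] for a in v]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def mean_product(a, b):
    """Mean of the elementwise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = mean_product(a, b)  # squared distance covariance
    denom = math.sqrt(mean_product(a, a) * mean_product(b, b))
    if denom == 0.0:  # one of the arrays is constant
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


x = [1.0, 2.0, 3.0, 4.0]
dcor_linear = distance_correlation(x, [2.0 * v for v in x])
dcor_constant = distance_correlation(x, [5.0] * 4)
print(dcor_linear, dcor_constant)
```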

quasinet.scorers.c_wdcor(x, y, weights)

Wrapper for C version of weighted distance correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • weights (1d array-like) – Weight vector that sums to 1

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.cca(X, Y)

Largest canonical correlation

Parameters
  • X (2d array-like) – Array of n elements

  • Y (2d array-like) – Array of n elements

Returns

cor – Largest canonical correlation between X and Y

Return type

float

quasinet.scorers.cca_fast(X, Y)

Largest canonical correlation

Parameters
  • X (2d array-like) – Array of n elements

  • Y (2d array-like) – Array of n elements

Returns

cor – Largest correlation between X and Y

Return type

float

quasinet.scorers.chi2(x, y)

x and y are ordinal representations of categorical variables.

quasinet.scorers.create_chi2_table(x, y)

Create a chi-squared contingency table using x and y
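
A minimal sketch of building a contingency table from two categorical arrays and computing Pearson's chi-squared statistic from it is given below. The list-of-lists table layout and the helper names are assumptions for illustration; the return formats of create_chi2_table and chi2 are not documented here.

```python
def contingency_table(x, y):
    """Count co-occurrences of the observed values of x (rows) and y (columns)."""
    rows = sorted(set(x))
    cols = sorted(set(y))
    row_index = {v: i for i, v in enumerate(rows)}
    col_index = {v: j for j, v in enumerate(cols)}
    table = [[0] * len(cols) for _ in rows]
    for a, b in zip(x, y):
        table[row_index[a]][col_index[b]] += 1
    return table


def chi2_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


table = contingency_table([0, 0, 1, 1], [0, 0, 1, 1])
stat = chi2_statistic(table)
print(table, stat)
```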

quasinet.scorers.gini_index(y, labels)

Gini index for node in tree

Note: Despite being jitted, this function is still slow and a bottleneck in the actual training phase. Sklearn’s Cython version is used to find the best split, and this function is then called on the parent node and the two child nodes to calculate feature importances using the mean decrease impurity formula

Parameters
  • y (1d array-like) – Array of labels

  • labels (1d array-like) – Unique labels

Returns

gini – Gini index

Return type

float
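
The Gini index of a node, and the impurity decrease obtained from a parent node and its two children, can be sketched as follows. The size-weighted decrease shown is one common form of the formula and is an assumption about what the library computes; the helper names are hypothetical and the labels argument is omitted for brevity.

```python
from collections import Counter


def gini_sketch(y):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(y)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(y).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (
        gini_sketch(parent)
        - len(left) / n * gini_sketch(left)
        - len(right) / n * gini_sketch(right)
    )


parent = ["a", "a", "b", "b"]
decrease = impurity_decrease(parent, ["a", "a"], ["b", "b"])
print(gini_sketch(parent), decrease)
```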

quasinet.scorers.mc_fast(x, y, n_classes)

Multiple correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • n_classes (int) – Number of classes

Returns

cor – Multiple correlation coefficient between x and y

Return type

float

quasinet.scorers.mi(x, y)

Mutual information

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

Returns

info – Mutual information between x and y

Return type

float
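
A minimal sketch of mutual information between two discrete arrays, estimated from empirical frequencies, is shown below. The use of natural logarithms (nats) and the helper name are assumptions; the estimator and units used by mi are not stated in this reference.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) of two equal-length arrays."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    info = 0.0
    for (xi, yi), count in count_xy.items():
        p_joint = count / n
        p_indep = (count_x[xi] / n) * (count_y[yi] / n)
        info += p_joint * math.log(p_joint / p_indep)
    return info


mi_dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_dependent, mi_independent)
```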

quasinet.scorers.mse(y)

Mean squared error for node in tree

Parameters

y (1d array-like) – Array of labels

Returns

error – Mean squared error

Return type

float
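
As an illustration, the mean squared error of a node is commonly taken as the average squared deviation of the labels from their mean. The sketch below assumes that definition, which may differ in detail from mse; the helper name is hypothetical.

```python
def node_mse(y):
    """Average squared deviation of the labels from their mean."""
    n = len(y)
    mean = sum(y) / n
    return sum((value - mean) ** 2 for value in y) / n


spread = node_mse([1.0, 2.0, 3.0])
pure = node_mse([5.0, 5.0, 5.0])
print(spread, pure)
```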

quasinet.scorers.pcor(x, y)

Pearson correlation

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

Returns

cor – Pearson correlation

Return type

float

quasinet.scorers.py_dcor(x, y)

Python port of C function for distance correlation

Note: Version is optimized for use with Numba

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.py_wdcor(x, y, weights)

Python port of C function for weighted distance correlation

Note: Version is optimized for use with Numba

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • weights (1d array-like) – Weight vector that sums to 1

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.rdc(X, Y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)

Randomized dependence coefficient

Parameters
  • X (2d array-like) – Array of n elements

  • Y (2d array-like) – Array of n elements

  • k (int) – Number of random projections

  • s (float) – Variance of Gaussian random variables

  • f (function) – Non-linear function

Returns

cor – Randomized dependence coefficient between X and Y

Return type

float

quasinet.scorers.rdc_fast(x, y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)

Randomized dependence coefficient

Parameters
  • x (1d array-like) – Array of n elements

  • y (1d array-like) – Array of n elements

  • k (int) – Number of random projections

  • s (float) – Variance of Gaussian random variables

  • f (function) – Non-linear function

Returns

cor – Randomized dependence coefficient between x and y

Return type

float

quasinet.tree module

class quasinet.tree.Node(col=None, col_pval=None, lthreshold=None, rthreshold=None, impurity=None, value=None, left=None, right=None, label_frequency=None)

Bases: object

Decision node in tree

Parameters
  • col (int) – Integer indexing the location of feature or column

  • col_pval (float) – Probability value from permutation test for feature selection

  • lthreshold (list) – List of items for taking the left edge down the tree

  • rthreshold (list) – List of items for taking the right edge down the tree

  • impurity (float) – Impurity measuring quality of split

  • value (1d array-like or float) – For classification trees, estimate of each class probability. For regression trees, central tendency estimate

  • left (Node) – Another Node

  • right (Node) – Another Node

  • label_frequency (dict) – Dictionary mapping label to its frequency

quasinet.tree.get_nodes(root, get_leaves=True, get_non_leaves=True)

Traverse a tree and get all the nodes.

TODO: may need to change this into an iterator for speed purposes

If get_leaves and get_non_leaves are both True, then we will get all the nodes.

Parameters
  • root (Node) – root node of the tree

  • get_leaves (bool) – If true, we get leaf nodes

  • get_non_leaves (bool) – If true, we get non leaf nodes.

Returns

output – list of Node

Return type

list
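
The traversal and the two filter flags can be sketched as follows. MiniNode is a hypothetical stand-in that carries only left and right children, and the pre-order, left-first visiting order is an assumption; the order returned by get_nodes is not documented here.

```python
class MiniNode:
    """Hypothetical minimal node with optional left and right children."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right


def get_nodes_sketch(root, get_leaves=True, get_non_leaves=True):
    """Collect nodes iteratively, optionally filtering leaves or internal nodes."""
    output = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        is_leaf = node.left is None and node.right is None
        if (is_leaf and get_leaves) or (not is_leaf and get_non_leaves):
            output.append(node)
        # Push right first so that the left child is visited first.
        stack.append(node.right)
        stack.append(node.left)
    return output


tree = MiniNode("root", MiniNode("a"), MiniNode("b", MiniNode("c"), MiniNode("d")))
all_names = [n.name for n in get_nodes_sketch(tree)]
leaf_names = [n.name for n in get_nodes_sketch(tree, get_non_leaves=False)]
inner_names = [n.name for n in get_nodes_sketch(tree, get_leaves=False)]
print(all_names, leaf_names, inner_names)
```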

quasinet.utils module

quasinet.utils.analyze_dot_file(dot_file, fracThreshold=0.0)
quasinet.utils.assert_array_rank(X, rank)

Check if the input is a NumPy array and has a certain rank.

Parameters
  • X (array-like) – Array to check

  • rank (int) – Rank of the tensor to check

Returns

Return type

None

quasinet.utils.assert_string_type(X, name)

Check if the input is of string datatype.

Parameters
  • X (array-like) – Array to check

  • name (str) – Name of the input

Returns

Return type

None

quasinet.utils.auc_score(y_true, y_prob)

ADD

quasinet.utils.bayes_boot_probs(n)

Bayesian bootstrap sampling for case weights

Parameters

n (int) – Number of Bayesian bootstrap samples

Returns

p – Array of sampling probabilities

Return type

1d array-like
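
In the standard Bayesian bootstrap, the n case weights are drawn from a flat Dirichlet distribution, which can be generated as the gaps between n - 1 sorted uniform draws on [0, 1]. The sketch below assumes that construction; whether bayes_boot_probs uses exactly this method is not stated here, and the helper name and rng argument are hypothetical.

```python
import random


def bayes_boot_probs_sketch(n, rng):
    """Draw n non-negative weights that sum to 1 (flat Dirichlet)."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    bounds = [0.0] + cuts + [1.0]
    # Gaps between consecutive cut points form the weights.
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


rng = random.Random(42)
probs = bayes_boot_probs_sketch(5, rng)
print(probs)
```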

quasinet.utils.big_enough(dot_file, big_enough_threshold=- 1)
quasinet.utils.drawtrees(dotfiles, prog='dot', format='pdf', big_enough_threshold=- 1)
quasinet.utils.estimate_margin(y_probs, y_true)

Estimates margin function of forest ensemble

Note: This function is similar to margin in R’s randomForest package

Parameters
  • y_probs (2d array-like) – Predicted probabilities where each row represents predicted class distribution for sample and each column corresponds to estimated class probability

  • y_true (1d array-like) – Array of true class labels

Returns

margin – Estimated margin of forest ensemble

Return type

float
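
In R’s randomForest package, the margin of a sample is the proportion of votes for the true class minus the largest proportion for any other class. The sketch below follows that definition and averages over samples to obtain a single float; the averaging step, the use of integer labels as column indices, and the helper name are assumptions.

```python
def margin_sketch(y_probs, y_true):
    """Mean of (probability of true class - largest probability of another class)."""
    margins = []
    for probs, label in zip(y_probs, y_true):
        others = [p for k, p in enumerate(probs) if k != label]
        margins.append(probs[label] - max(others))
    return sum(margins) / len(margins)


y_probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
y_true = [0, 2]  # assumed: labels index the columns of y_probs
margin = margin_sketch(y_probs, y_true)
print(margin)
```

A positive margin means that, on average, the true class receives more probability mass than any competing class.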

quasinet.utils.find_matching_indices(A, B)
quasinet.utils.generate_seed()

Generates a seed for the random number generator as a function of the current time and thread id. Must be used when a large number of qsamples are drawn in parallel

quasinet.utils.getNull(model, strtype='U5')

Function to generate an array of empty strings of same length as feature names in the model.

Parameters
  • model (Qnet object) – The Qnet model.

  • strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.

Returns

An array of empty strings.

Return type

numpy.ndarray

quasinet.utils.logger(name, message)

Prints messages with style “[NAME] message”

Parameters
  • name (str) – Short title of message, for example, train or test

  • message (str) – Main description to be displayed in terminal

Returns

Return type

None

quasinet.utils.numparameters(qnetmodel)

Computes the total number of parameters in a qnet

Parameters

qnetmodel (Qnet object) – The Qnet model.

Returns

  • int – number of independent parameters.

  • float – number of internal nodes per model column.

quasinet.utils.powerset(s)

Get the power set of a list or set.
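
A power set can be built with the standard itertools recipe, sketched below. Returning a list of tuples is an assumption made for illustration; the return type of powerset is not documented here.

```python
from itertools import chain, combinations


def powerset_sketch(s):
    """All subsets of s, from the empty set up to the full set, as tuples."""
    items = list(s)
    return list(
        chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    )


subsets = powerset_sketch([1, 2, 3])
print(subsets)
```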

quasinet.utils.remove_newline_in_dotfile(file_path)

Remove newlines from edge labels in a dotfile

quasinet.utils.remove_zeros(r, axis)

Remove rows along a certain axis where the value is all zero.

quasinet.utils.sample_from_dict(distrib)

Choose an item from the distribution

Parameters

distrib (dict) – Dictionary mapping keys to their probability values

Returns

item – A chosen key from the dictionary

Return type

key of dict
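
Weighted sampling of a key can be sketched with the standard library as below. The explicit rng argument and the helper name are assumptions for reproducibility of the example; sample_from_dict itself takes only distrib.

```python
import random


def sample_from_dict_sketch(distrib, rng):
    """Pick one key with probability proportional to its value."""
    keys = list(distrib)
    weights = [distrib[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


rng = random.Random(1)
distrib = {"A": 0.2, "C": 0.8}
draws = [sample_from_dict_sketch(distrib, rng) for _ in range(1000)]
print(draws.count("A"), draws.count("C"))
```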

quasinet.utils.scientific_notation(num)

Convert a number into scientific notation

Parameters

num (float) – Any number

Returns

output – String representation of the number

Return type

str

quasinet.zqnet module

quasinet.zqnet.extract_diagonal_blocks(M, L)
quasinet.zqnet.get_description_curl(code)
quasinet.zqnet.remove_suffix(s)
quasinet.zqnet.replace_with_d(S, j, d)
class quasinet.zqnet.zQnet(*args, **kwargs)

Bases: quasinet.qnet.Qnet

Extended Qnet architecture (zQnet).

An extension of the Qnet class with added functionality and attributes. This class introduces risk computation based on a series of metrics and provides a way to set and retrieve a model description.

Qnet : Base class

nullsequences : array-like

Collection of sequences considered null or baseline for risk calculations.

target : str, optional

Target variable or description. Not currently utilized in methods.

description : str

Descriptive notes or commentary about the model.

auc_estimate : array-like

AUCs obtained during optimization of null sequences.

training_index : array-like, optional

Indices used during the training phase. Not currently utilized in methods.

Parameters
  • *args – Variable length argument list inherited from Qnet.

  • **kwargs – Arbitrary keyword arguments inherited from Qnet.

personal_zshap(s, eps=1e-07)

A superfast approximation of SHAP for zQnet for individual samples

Parameters
  • s (numpy array of str) – The sequence around which we are evaluating perturbations.

  • eps (float) – SHAP value cutoff

Returns

Dataframe with SHAP values and index mapped to short descriptions of ICD-10 codes

Return type

pandas.DataFrame

risk(X)

Compute the mean risk value for input X based on its distance from null sequences.

Parameters

X (2d array-like) – Input data whose risk is to be computed.

Returns

Mean risk value for the input X.

Return type

float

risk_max(X)

Compute the maximum risk value for input X based on its distance from null sequences.

Parameters

X (2d array-like) – Input data whose risk is to be computed.

Returns

Maximum risk value for the input X.

Return type

float

risk_median(X)

Compute the median risk value for input X based on its distance from null sequences.

Parameters

X (2d array-like) – Input data whose risk is to be computed.

Returns

Median risk value for the input X.

Return type

float

set_description(markdown_file)

Set the description attribute for the model using content from a markdown file.

Parameters

markdown_file (str) – Path to the markdown file containing the model’s description.

Returns

Content of the markdown file.

Return type

str

zshap(seq=None, m=35)

A superfast approximation of SHAP for zQnet

Parameters
  • seq (numpy array of str) – The sequence around which we are evaluating perturbations. By default it is the array of empty strings, which represents average behavior

  • m (int) – Length of the returned SHAP dataframe

Returns

Dataframe with SHAP values and index mapped to short descriptions of ICD-10 codes

Return type

pandas.DataFrame

Module contents