quasinet package¶

Submodules¶

quasinet.ciforest module¶

class quasinet.ciforest.CIForestClassifier(min_samples_split=2, alpha=0.05, selector='mc', max_depth=-1, n_estimators=100, max_feats='sqrt', n_permutations=100, early_stopping=True, muting=True, verbose=0, bootstrap=True, bayes=True, class_weight='balanced', n_jobs=-1, random_state=None)¶

Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Conditional forest classifier

Parameters

min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
selector (str) – Variable selector for finding strongest association between a feature and the label
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum number of features to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
bootstrap (bool) – Whether to perform bootstrap sampling for each tree
bayes (bool) – If True, performs Bayesian bootstrap sampling
class_weight (str) – Type of sampling during bootstrap, None for regular bootstrapping, ‘balanced’ for balanced bootstrap sampling, and ‘stratify’ for stratified bootstrap sampling
n_jobs (int) – Number of jobs for permutation testing
random_state (int) – Sets seed for random number generator
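
The early_stopping rule described above can be sketched in a few lines. This is a minimal illustration, not quasinet's implementation: select_feature is a hypothetical helper, the p-values are supplied directly instead of being computed by permutation tests, and the rule used when early stopping is off (keep the smallest p-value below alpha) is an assumption.

```python
def select_feature(p_values, alpha=0.05, early_stopping=True):
    """Return the index of the chosen splitting feature, or None.

    p_values: iterable of (feature_index, p_value) pairs, standing in
    for the permutation tests that would be run feature by feature.
    """
    best = None
    for idx, p in p_values:
        if p >= alpha:
            continue  # not significant, never a candidate
        if early_stopping:
            return idx  # first significant feature wins
        if best is None or p < best[1]:
            best = (idx, p)  # assumed rule: keep the smallest p-value
    return None if best is None else best[0]


tests = [(0, 0.40), (1, 0.03), (2, 0.001)]
first_hit = select_feature(tests, early_stopping=True)
strongest = select_feature(tests, early_stopping=False)
no_split = select_feature([(0, 0.40)], early_stopping=True)
```

With early stopping the search ends at feature 1, even though feature 2 has a smaller p-value; that is the trade of selection quality for speed that the parameter controls.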

fit(X, y)¶

Fit conditional forest classifier

Parameters

X (2d array-like) – Array of features
y (1d array-like) – Array of labels

Returns

self – Instance of CIForestClassifier

Return type

CIForestClassifier

predict(X)¶

Predicts class labels for feature vectors X

Parameters: X (2d array-like) – Array of features
Returns: y – Array of predicted classes
Return type: 1d array-like

predict_proba(X)¶

Predicts class probabilities for feature vectors X

Parameters: X (2d array-like) – Array of features
Returns: class_probs – Array of predicted class probabilities
Return type: 2d array-like

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → quasinet.ciforest.CIForestClassifier¶

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

quasinet.ciforest.balanced_sampled_idx(random_state, y, bayes, min_class_p)¶

Indices for balanced bootstrap sampling in classification

Parameters

random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels

Returns

idx – Balanced sampled indices for each class

Return type

list

quasinet.ciforest.balanced_unsampled_idx(random_state, y, bayes, min_class_p)¶

Unsampled indices for balanced bootstrap sampling in classification

Parameters

random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels

Returns

idx – Balanced unsampled indices for each class

Return type

list

quasinet.ciforest.normal_sampled_idx(random_state, n, bayes)¶

Indices for bootstrap sampling

Parameters

random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Sampled indices

Return type

list

quasinet.ciforest.normal_unsampled_idx(random_state, n, bayes)¶

Unsampled indices for bootstrap sampling

Parameters

random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Unsampled indices

Return type

list
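
As a rough sketch of how sampled and unsampled (out-of-bag) indices relate, the block below draws a bootstrap sample and then lists the indices that were never drawn. The function names, the seed argument, and the way the Bayesian bootstrap is realised (normalised Exponential(1) draws used as sampling probabilities) are assumptions for illustration, not the signatures or internals of normal_sampled_idx and normal_unsampled_idx.

```python
import random


def bootstrap_indices(n, bayes=False, seed=None):
    """Draw n indices with replacement from range(n)."""
    rng = random.Random(seed)
    if bayes:
        # Assumed Bayesian bootstrap: normalized Exponential(1) draws are
        # Dirichlet(1, ..., 1) weights, used as sampling probabilities.
        raw = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(raw)
        weights = [w / total for w in raw]
    else:
        weights = [1.0 / n] * n  # classical bootstrap: uniform weights
    return rng.choices(range(n), weights=weights, k=n)


def unsampled_indices(n, sampled):
    """Indices never drawn (the out-of-bag set)."""
    drawn = set(sampled)
    return [i for i in range(n) if i not in drawn]


sampled = bootstrap_indices(10, bayes=True, seed=0)
oob = unsampled_indices(10, sampled)
```

The sampled and unsampled sets are disjoint and together cover every index, which is what makes the unsampled part usable as a held-out set for each tree.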

quasinet.ciforest.stratify_sampled_idx(random_state, y, bayes)¶

Indices for stratified bootstrap sampling in classification

Parameters

random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Stratified sampled indices for each class

Return type

list

quasinet.ciforest.stratify_unsampled_idx(random_state, y, bayes)¶

Unsampled indices for stratified bootstrap sampling in classification

Parameters

random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling

Returns

idx – Stratified unsampled indices for each class

Return type

list
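
The difference between stratified and balanced bootstrap sampling can be illustrated with a small sketch. The definitions used here are assumptions: "stratified" resamples each class at its original size, and "balanced" draws the same number of indices from every class (the size of the smallest class). The helper names are hypothetical, and the bayes and min_class_p options of the library functions are left out.

```python
import random
from collections import defaultdict


def class_indices(y):
    """Group sample positions by class label."""
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(i)
    return groups


def stratified_bootstrap(y, seed=None):
    """Resample within each class, keeping every class at its original size."""
    rng = random.Random(seed)
    idx = []
    for members in class_indices(y).values():
        idx.extend(rng.choices(members, k=len(members)))
    return idx


def balanced_bootstrap(y, seed=None):
    """Resample the same number of indices from every class (assumed:
    the size of the smallest class)."""
    rng = random.Random(seed)
    groups = class_indices(y)
    k = min(len(members) for members in groups.values())
    idx = []
    for members in groups.values():
        idx.extend(rng.choices(members, k=k))
    return idx


y = ["a"] * 6 + ["b"] * 2
strat = stratified_bootstrap(y, seed=1)
bal = balanced_bootstrap(y, seed=1)
```

In this toy data the stratified sample keeps the 6:2 class ratio, while the balanced sample contains two indices from each class.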

quasinet.citrees module¶

class quasinet.citrees.CITreeBase(min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶

Bases: object

Base class for conditional inference tree.

Parameters

min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum number of features to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator

fit(X, y=None)¶

Train model.

X, y must contain only string datatypes.

Parameters

X (2d array-like) – Array of categorical features
y (1d array-like) – Array of labels

Returns

self – Instance of CITreeBase class

Return type

CITreeBase

predict(*args, **kwargs)¶

Predicts labels on test data. This method should not be called on the base class.

predict_label(X, tree=None)¶

Predicts label

Parameters

X (1d array-like) – Array of features for single sample
tree (CITreeBase) – Trained tree

Returns

label – Predicted label

Return type

str

print_tree(tree=None, indent=' ', child=None)¶

Prints tree structure

Parameters

tree (CITreeBase) – Trained tree model
indent (str) – Indent spacing
child (Node) – Left or right child node

Returns

Return type

None

class quasinet.citrees.CITreeClassifier(min_samples_split=2, alpha=0.05, selector='chi2', max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶

Bases: quasinet.citrees.CITreeBase, sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin

Conditional inference tree classifier

NOTE: as of now, the features can only be categorical

Parameters

selector (str) – Variable selector for finding strongest association between a feature and the label
Other parameters are derived from the CITreeBase class; see its constructor for parameter definitions.

fit(X, y, labels=None)¶

Train conditional inference tree classifier

Parameters

X (2d array-like) – Array of features
y (1d array-like) – Array of labels
labels (1d array-like) – Array of unique class labels

Returns

self – Instance of CITreeClassifier class

Return type

CITreeClassifier

predict(X)¶

Predicts class labels for feature vectors X

Parameters: X (2d array-like) – Array of features
Returns: y – Array of predicted classes
Return type: 1d array-like

predict_proba(X)¶

Predicts class probabilities for feature vectors X

Parameters: X (2d array-like) – Array of features
Returns: class_probs – Array of predicted class probabilities
Return type: 2d array-like

set_fit_request(*, labels: Union[bool, None, str] = '$UNCHANGED$') → quasinet.citrees.CITreeClassifier¶

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters: labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in fit.
Returns: self – The updated object.
Return type: object

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') → quasinet.citrees.CITreeClassifier¶

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters: sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
Returns: self – The updated object.
Return type: object

quasinet.citrees.get_feature_importance(citree, normalize=True)¶

Get the feature importance of the citree.

Parameters

citree (CITreeBase) – A conditional inference tree
normalize (bool) – Whether to normalize the feature importance or not

Returns

col_to_importance – Mapping from column index to total feature importance

Return type

dict

quasinet.curvature module¶

quasinet.curvature.compute_curvature(p, delta)¶

Computes the curvature (scalar curvature) at a given point in the space of Quasinets.

The curvature R is computed as:

\[R = G^{ij} R_{ij}\]

where G^{ij} is the inverse of the metric tensor G_{ij}, and R_{ij} is the Ricci curvature.

Parameters

p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction

Returns

The curvature at point p

Return type

float
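
The contraction R = G^{ij} R_{ij} can be checked by hand on a toy case. The sketch below uses hypothetical 2x2 matrices written as nested lists; it only illustrates the final contraction step and assumes the metric tensor and Ricci tensor are already available.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("metric tensor is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def scalar_curvature(metric, ricci):
    """Contract the inverse metric with the Ricci tensor: R = G^{ij} R_{ij}."""
    g_inv = inverse_2x2(metric)
    n = len(metric)
    return sum(g_inv[i][j] * ricci[i][j] for i in range(n) for j in range(n))


G = [[2.0, 0.0], [0.0, 4.0]]      # toy metric tensor
Ric = [[1.0, 0.0], [0.0, 2.0]]    # toy Ricci tensor
R = scalar_curvature(G, Ric)
```

For these diagonal matrices the inverse metric is diag(0.5, 0.25), so the contraction is 0.5 * 1 + 0.25 * 2.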

quasinet.curvature.compute_metric_tensor(p_distrib, delta, progress=False)¶

Computes the metric tensor at a given point in the space of Quasinets.

The metric tensor G_ij is defined as:

\[G_{ij} = \frac{1}{2} \left( D(p + \delta p_i + \delta p_j, p) - D(p + \delta p_i, p) - D(p + \delta p_j, p) + D(p, p) \right)\]

where D is the distance function, p_i is the i-th unit Quasinet, and delta is a small perturbation.

Parameters

p_distrib (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which the metric tensor is calculated
delta (float) – A small number representing a change in each coordinate direction
progress (bool) – Whether to show a progress bar

Returns

The metric tensor at point p (the quasinet for which p_distrib is calculated)

Return type

ndarray
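
To make the four-point formula for G_ij concrete, here is a small sketch that evaluates it with a plug-in distance. It is not the library routine: points are plain lists of floats rather than Quasinets, a coordinate shift stands in for adding delta times a unit Quasinet, and squared Euclidean distance is a toy substitute for D.

```python
def squared_distance(p, q):
    """Toy stand-in for the distance function D."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def shifted(p, delta, *axes):
    """Return a copy of p moved by delta along each listed axis."""
    out = list(p)
    for axis in axes:
        out[axis] += delta
    return out


def metric_tensor(p, delta, dist=squared_distance):
    """Evaluate G_ij from the four-point formula with a plug-in distance."""
    n = len(p)
    base = dist(p, p)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            G[i][j] = 0.5 * (
                dist(shifted(p, delta, i, j), p)
                - dist(shifted(p, delta, i), p)
                - dist(shifted(p, delta, j), p)
                + base
            )
    return G


G = metric_tensor([0.2, 0.5, 0.3], delta=0.01)
```

With this toy distance the result is delta squared on the diagonal and zero elsewhere, which gives a quick sanity check on the sign pattern of the formula.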

quasinet.curvature.compute_metric_tensor_derivative(p, delta)¶

Computes the derivative of the metric tensor at a given point in the space of Quasinets.

The derivative of the metric tensor G_ij with respect to the k-th coordinate is computed as:

\[\frac{\partial G_{ij}}{\partial p_k} = \frac{G_{ij}(p + \delta p_k) - G_{ij}(p)}{\delta}\]

Parameters

p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which to compute the metric tensor derivative
delta (float) – A small number representing a change in each coordinate direction

Returns

The derivative of the metric tensor at point p

Return type

ndarray

quasinet.curvature.compute_ricci_curvature(p, delta)¶

Computes the Ricci curvature at a given point in the space of Quasinets.

The Ricci curvature R_ij is computed as:

\[R_{ij} = G^{kl} \left( \frac{\partial^2 G_{ij}}{\partial p_k \partial p_l} - \frac{1}{2} \frac{\partial^2 G_{kl}}{\partial p_i \partial p_j} \right)\]

where G^{kl} is the inverse of the metric tensor G_{kl}, and the partial derivatives are computed by taking the limit as delta goes to zero.

Parameters

p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction

Returns

The Ricci curvature at point p

Return type

ndarray

quasinet.curvature.delta_pi(qnet_instance, index, delta)¶

This function modifies the distribution of the given Quasinet instance by scaling it with a scalar value in the direction of the given index.

Parameters

qnet_instance (Quasinet) – The Quasinet instance to modify
index (int) – The index of the feature direction to scale
delta (float) – The scalar to scale the distribution with

Returns

The Quasinet instance with modified distribution

Return type

Quasinet

quasinet.curvature.dist_scalr_mult(D1, a)¶

Multiply each value in the dictionary with scalar ‘a’ and renormalize to get a valid probability distribution.

Parameters

D1 (dict) – Dictionary where each key-value pair represents an item and its probability.
a (float) – Scalar to multiply with each value of D1.

Returns

New dictionary with each value scaled and renormalized.

Return type

dict

quasinet.curvature.dist_sum(D1, D2)¶

Add each corresponding value in D1 and D2, then renormalize to get a valid probability distribution.

Parameters

D1 (dict) – Two dictionaries where each key-value pair represents an item and its probability.
D2 (dict) – Two dictionaries where each key-value pair represents an item and its probability.

Returns

New dictionary with each value being the sum of the corresponding values in D1 and D2, renormalized.

Return type

dict
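
The two dictionary operations above can be sketched as follows. The helper names are hypothetical, and treating a key that is missing from one dictionary as probability 0 is an assumption; the library functions may handle that case differently.

```python
def renormalize(dist):
    """Rescale non-negative values so they sum to 1."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("distribution has no positive mass")
    return {key: value / total for key, value in dist.items()}


def scale_and_renormalize(d1, a):
    """Multiply every value by a, then renormalize."""
    return renormalize({key: a * value for key, value in d1.items()})


def sum_and_renormalize(d1, d2):
    """Add matching values (missing keys count as 0), then renormalize."""
    keys = set(d1) | set(d2)
    return renormalize({k: d1.get(k, 0.0) + d2.get(k, 0.0) for k in keys})


mixed = sum_and_renormalize({"A": 0.5, "C": 0.5}, {"A": 1.0})
scaled = scale_and_renormalize({"A": 0.25, "C": 0.75}, 2.0)
```

Under this literal reading, scaling by a positive constant and renormalizing returns the starting distribution, so the library version may do more than this sketch shows.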

quasinet.curvature.distance_function(p, q, NULL=None, strtype='U5')¶

Computes the distance between two Quasinets.

Parameters

p (Quasinet) – First Quasinet
q (Quasinet) – Second Quasinet

Returns

The distance between p and q

Return type

float

quasinet.curvature.distance_function_distrib(p, q, i)¶

Computes the distance between two Quasinets, assuming that p and q differ only at the estimator coordinates listed in i.

Parameters

p (list[dict(str, float)]) – Output of quasinet.predict_distributions()
q (list[dict(str, float)]) – Output of quasinet.predict_distributions()
i (1d numpy array) – Indices at which p and q differ

quasinet.curvature.mt_worker(args)¶

quasinet.curvature.perturb_quasinet(qnet_instance, index, delta)¶

Perturbs a Quasinet in the direction of the i-th feature.

Parameters

qnet_instance (Quasinet) – The Quasinet to perturb
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation

Returns

The perturbed Quasinet

Return type

Quasinet

quasinet.curvature.perturb_quasinet_distrib(p_distrib_, index, delta)¶

Perturbs a Quasinet in the direction of the i-th feature, using only the distributions at each estimator, which are produced by the predict_distributions function

Parameters

p_distrib_ (list[dict(str, float)]) – Output of quasinet.predict_distributions()
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation

Returns

The perturbed Quasinet

Return type

Quasinet

quasinet.curvature.scalarmod_predict_distribution(self, column_to_item, column, **kwargs)¶

Modify the predict_distribution function of the Quasinet object to scale the output probabilities for a specified feature.

Parameters

self (Quasinet object) – The Quasinet instance.
column_to_item (dict) – A dictionary mapping from column names to specific items.
column (str) – The name of the column (feature) to scale.
**kwargs (dict) – Additional arguments passed to the predict_distribution function.

Returns

A dictionary of probabilities for each item in the specified column.

Return type

dict

quasinet.curvature.sum_predict_distribution(self, column_to_item, column, **kwargs)¶

quasinet.export module¶

class quasinet.export.GraphvizTreeExporter(tree, outfile, response_name, feature_names, text_color='black', edge_color='gray', font_size=10, edge_label_color='deepskyblue4', pen_width=2, background_color='transparent', dpi=200, edge_fontcolor='grey14', rotate=False, add_legend=True, min_size=1, color_alpha=-1.5, labels=None, detailed_output=False)¶

Bases: object

Export the tree using graphviz.

Parameters

tree – The tree to export
outfile (str) – Output file to save results to
response_name (str) – Name of the y variable that we are predicting
feature_names (list) – Names of each of the features
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
detailed_output (bool) – If False (default), leaves show the probability of the most likely label; otherwise they show the full probability distribution
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
labels (list) – List of all labels, optional

Returns

Return type

None

export()¶

class quasinet.export.QnetGraphExporter(qnet, outfile, threshold)¶

Bases: object

Export the qnet as a graph to a dot file format.

Parameters

qnet (Qnet) – A Qnet instance
outfile (str) – Output file to save results to
threshold (float) – Numeric cutoff for edge weights. Edges whose weight exceeds this cutoff are included in the graph.

Returns

Return type

None

export()¶

quasinet.feature_importance module¶

quasinet.feature_importance.getShap(model_, num_backgrounds=1, samples=None, num_samples=5, strtype='U5', fast_estimate=False)¶

Function to compute SHAP values for feature importance analysis.

Parameters

model_ (Qnet object) – The Qnet model.
num_backgrounds (int) – Number of background samples to generate. Default is 1.
num_samples (int) – Number of samples for the SHAP analysis. Default is 5.
strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.
samples (numpy array) – samples to run shap analysis on. Default is None. If None, generate via qsampling
fast_estimate (bool) – If True, use tree explainer with a CatBoostRegressor model for faster estimation. Default is False.

Returns

pandas.DataFrame – A dataframe containing the SHAP values for each feature.
numpy.array – numpy array of indices ordered by descending SHAP value of the model feature_names

quasinet.feature_importance.qnet_model_func(X)¶

Function to compute the distance matrix for a given set of sequences.

Parameters: X (numpy.ndarray) – Array of sequences.
Returns: The computed distance matrix.
Return type: numpy.ndarray

quasinet.feature_selectors module¶

quasinet.feature_selectors.permutation_test_chi2(x, y, B=100, random_state=None, **kwargs)¶

Permutation test using chi-squared.

This is used when x and y are nominal variables.

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float
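
A permutation test of this kind can be sketched with the standard library alone. The block below is an illustration of the general technique, not the library's implementation: the function names are hypothetical, and the p-value is estimated as the plain share of shuffles whose statistic is at least the observed one (some implementations add 1 to numerator and denominator).

```python
import random
from collections import Counter


def chi2_statistic(x, y):
    """Pearson chi-squared statistic of the x-by-y contingency table."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts, y_counts = Counter(x), Counter(y)
    stat = 0.0
    for xv, x_n in x_counts.items():
        for yv, y_n in y_counts.items():
            expected = x_n * y_n / n
            stat += (joint[(xv, yv)] - expected) ** 2 / expected
    return stat


def permutation_test(x, y, B=100, seed=None):
    """Share of B label shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = chi2_statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(B):
        rng.shuffle(y_perm)
        if chi2_statistic(x, y_perm) >= observed:
            hits += 1
    return hits / B


x = ["a", "b"] * 20
p_dependent = permutation_test(x, list(x), B=200, seed=0)
p_constant = permutation_test(x, ["k"] * 40, B=50, seed=0)
```

When y copies x the observed statistic is at its maximum, so almost no shuffle matches it and the estimated p-value is close to 0; when y is constant every shuffle ties with the observed value and the estimate is 1.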

quasinet.feature_selectors.permutation_test_dcor(x, y, B=100, random_state=None)¶

Permutation test for distance correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_dcor_parallel(x, y, B=100, n_jobs=-1, random_state=None)¶

Parallel implementation of permutation test for distance correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_mc(x, y, B=100, n_classes=None, random_state=None)¶

Permutation test for multiple correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_mi(x, y, B=100, random_state=None, **kwargs)¶

Permutation test for mutual information

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_pcor(x, y, B=100, random_state=None)¶

Permutation test for Pearson correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_rdc(x, y, B=100, random_state=None)¶

Permutation test for randomized dependence coefficient

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.feature_selectors.permutation_test_rdc_parallel(x, y, B=100, n_jobs=-1, k=10, random_state=None)¶

Parallel implementation of permutation test for randomized dependence coefficient

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
k (int) – Number of random projections for cca
random_state (int) – Sets seed for random number generator

Returns

p – Achieved significance level

Return type

float

quasinet.metrics module¶

quasinet.metrics.convert_lists_to_ctypes(V_list, key_list)¶

quasinet.metrics.js_divergence(p1, p2, smooth=0.0001)¶

Compute the Jensen-Shannon divergence of discrete probability distributions.

Parameters

p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution
smooth (float) – amount by which to smooth out the probability distribution of p2. This is intended to deal with categories with zero probability.

Returns

js_div – js divergence

Return type

numeric

quasinet.metrics.kl_divergence(p1, p2)¶

Compute the Kullback–Leibler divergence of discrete probability distributions.

NOTE: we will not perform error checking in this function because this function is used very frequently. The user should check that p1 and p2 are in fact probability distributions.

Parameters

p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution

Returns

output – kl divergence

Return type

numeric
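
For reference, the textbook definitions of both divergences fit in a few lines. This sketch assumes natural logarithms and omits the smooth parameter of js_divergence, so it fails with a division by zero when p2 has a zero where p1 does not; it is not the library's implementation. Like the library function, it performs no validation of its inputs.

```python
import math


def kl_divergence(p1, p2):
    """KL(p1 || p2) in nats; terms with p1[i] == 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


def js_divergence(p1, p2):
    """Jensen-Shannon divergence via the midpoint distribution."""
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl_divergence(p1, mid) + 0.5 * kl_divergence(p2, mid)


same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
skewed = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

Identical distributions give 0, and two distributions with disjoint support give the maximum value of the Jensen-Shannon divergence, ln 2 in nats.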

quasinet.metrics.process_dict1_list(dict1_list)¶

quasinet.metrics.process_dict2_list(dict2_list)¶

quasinet.metrics.theta(seq1_list, seq2_list)¶

quasinet.metrics.theta_(dict1_list, dict2_list)¶

quasinet.metrics.theta_matrix(list_dict1_list, list_dict2_list=None)¶

quasinet.metrics.theta_matrix_par(list_dict1_list, list_dict2_list)¶

quasinet.osfix module¶

quasinet.osfix.osfix(OS=None)¶

quasinet.qnet module¶

class quasinet.qnet.Qnet(feature_names, min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, early_stopping=False, verbose=0, random_state=None, n_jobs=1)¶

Bases: object

Qnet architecture.

Parameters

feature_names (list) – List of names describing the features
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum number of features to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator
n_jobs (int) – Number of CPUs to use when training

clear_attributes()¶

Remove the unneeded attributes to save memory.

Parameters: None
Return type: None

fit(X, index_array=None)¶

Train the qnet

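Examples

>>> from quasinet import qnet
>>> # load_data() stands in for your own loader; X is a 2d array of features
>>> X = load_data()
>>> # feature_names holds one name per column of X
>>> myqnet = qnet.Qnet(feature_names=feature_names)
>>> myqnet.fit(X)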

Parameters

X (2d array-like) – Array of features
index_array (1d array-like) – Array of indices to generate estimators for. Uses all indices by default or if set to None.

Returns

self – Instance of Qnet class

Return type

Qnet

mix(qnet_2, feature_name_list)¶

Take columns from qnet_2 and swap their estimators into the current qnet in place. This makes it possible to simulate how the current qnet would behave if some of its estimators behaved like those of a second model, and can be used to identify the maximally divergent rules of organization between the two models. Also sets the attribute self.mixed to True.

Parameters

qnet_2 (Qnet) – A Qnet instance
feature_name_list (list) – A list of variable (feature) names that would be replaced in self from qnet_2

Returns

Return type

None

predict_distribution(column_to_item, column)¶

Predict the probability distribution for a given column.

It may be the case that a certain column value does not appear in the resulting output. If that happens, the probability of that value is 0.

Parameters

column_to_item (dict) – dictionary mapping the column to the values the columns take
column (int) – column index

Returns

output – dictionary mapping possible column values to probability values

Return type

dictionary
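
Because values with zero probability may be absent from the returned dictionary, lookups are safest with a default. A minimal sketch, using a made-up distribution rather than real Qnet output and a hypothetical helper name:

```python
def probability_of(distribution, value):
    """Look up a value's probability, treating absent keys as 0."""
    return distribution.get(value, 0.0)


# Hypothetical output for one column; "T" is absent, so its probability is 0.
distribution = {"A": 0.7, "C": 0.2, "G": 0.1}
p_a = probability_of(distribution, "A")
p_t = probability_of(distribution, "T")
```

Indexing the dictionary directly with a missing value would raise a KeyError, which is why the default matters here.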

predict_distributions(seq)¶

Predict the probability distributions for all the columns.

If you do not want to set a particular value for an index of seq, then set the value at the index to the global nan_value. By default, this value is the empty string.

The length of the input sequence must match the size of feature_names.

Parameters: seq (list) – list of values
Returns: prob_distributions – list of dictionaries of probability distributions, one for each index
Return type: list

viz_trees(tree_path, draw=True, big_enough_threshold=-1, prog='dot', format='pdf', remove_dotfile=True, remove_newline=False, addurl=False, base_url='https://zed.createuky.net/', **kwargs)¶

Generate dot files for individual estimators, and optionally render them to pdf.

Parameters

tree_path (string) – path to where dotfiles will be generated. Creates directory if not present
draw (bool) – Set to True to render dotfiles (default True)
prog (str) – Graphviz program used for rendering (default: dot, other values: neato, fdp, sfdp)
format (str) – Format of rendered file (default: pdf, other values png, svg)
remove_dotfile (bool) – Delete all dot files if set to True (default: True)
remove_newline (bool) – Remove newlines from edge labels in the tree visualization to prettify it
addurl (bool) – Add url links to the node labels
base_url (str) – Url base for node url links
**kwargs (dict, optional) – Additional keyword arguments to be passed to export_qnet_tree. Refer to the documentation of export_qnet_tree for details on accepted arguments.

Returns

Return type

None

quasinet.qnet.export_qnet_graph(qnet, threshold, outfile)¶

Export the qnet as a graph of dependencies. The output will be in the .dot file format for graphs.

Parameters

qnet (Qnet) – A Qnet instance
threshold (float) – Numeric cutoff for edge weights. Edges whose weight exceeds this cutoff are included in the graph.
outfile (str) – File name to save to.

Returns

Return type

None
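
The effect of threshold can be pictured with a small sketch that filters a dictionary of edge weights and writes the survivors in dot syntax. Everything here is illustrative: the helper names, the edge weights, the strict comparison, and the exact dot layout are assumptions, not the output format of export_qnet_graph.

```python
def edges_above(weights, threshold):
    """Keep only the edges whose weight is strictly above the cutoff."""
    return {edge: w for edge, w in weights.items() if w > threshold}


def to_dot(edges):
    """Render the kept edges as a minimal directed graph in dot syntax."""
    lines = ["digraph {"]
    for (src, dst), w in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{w:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edge weights between three features.
weights = {("f1", "f2"): 0.40, ("f2", "f3"): 0.05, ("f3", "f1"): 0.20}
kept = edges_above(weights, threshold=0.1)
dot_text = to_dot(kept)
```

Raising the threshold prunes weak dependencies first, so the exported graph becomes sparser as the cutoff grows.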

quasinet.qnet.export_qnet_tree(qnet, index, outfile, outformat='graphviz', detailed_output=False, pen_width=3, edge_color='black', edge_label_color='black', dpi=200, text_color='black', font_size=10, background_color='transparent', rotate=False, edge_fontcolor='grey14', min_size=1, color_alpha=1.5, labels=None, add_legend=False, **kwargs)¶

Export a tree from qnet. The index determines which tree to export.

Parameters

qnet (Qnet) – A Qnet instance
index (int) – Index of the tree to export
outformat (str) – Can only be graphviz for now. This will output a .dot file, which you can then compile using a command like dot -Tpng file.dot -o file.png
detailed_output (bool) – If True return detailed probabilities of output labels in leaf nodes. default: False
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
color_alpha (float) – Parameter for color brightness
labels (list) – List of all labels, optional

Returns

Return type

None

quasinet.qnet.fit_save(df, slice_range=None, n_jobs=10, alpha=0.1, file_prefix='model', strtype='U5', low_mem=True, compress=True)¶

Fit and save qnet model as gz.

Parameters

df (pandas.DataFrame) – Input data as a pandas DataFrame with feature names as columns
slice_range (numpy 1D array of ints) – Index array to use to make model, passed to index_array in fit
alpha (float) – qnet fit significance level
file_prefix (str) – filename prefix for model file
low_mem (bool) – turning on low memory save (default: True)
compress (bool) – True if we want gzipped models (default: True)
strtype (str) – string type specification (default: U5)

Returns

qnet

Return type

Qnet

quasinet.qnet.load_qnet(f, gz=False)¶

Load the qnet from a file.

Parameters

f (str) – File name.
gz (bool) – Bool to indicate if file is gzipped (default: False)

Returns

qnet

Return type

Qnet

quasinet.qnet.membership_degree(seq, qnet)¶

Compute the membership degree of a sequence in a qnet.

Parameters

seq (1d array-like) – Array of values
qnet (Qnet) – the Qnet that seq belongs to

Returns

membership_degree – membership degree

Return type

numeric

quasinet.qnet.qdistance(seq1, seq2, qnet1, qnet2, mismatch=False, FULL_C=False)¶

Compute the qdistance between two sequences, based on the Jensen-Shannon divergence of discrete probability distributions.

Parameters

seq1 (1d array-like) – Array of values
seq2 (1d array-like) – Array of values
qnet1 (Qnet) – the Qnet that seq1 belongs to
qnet2 (Qnet) – the Qnet that seq2 belongs to
mismatch (bool) – Indicate if there is a mismatch in feature names

Returns

output – qdistance

Return type

numeric
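For reference, the Jensen-Shannon divergence between two discrete distributions can be sketched as follows. This is the textbook definition only, using base-2 logarithms so the value lies in [0, 1]; how qdistance combines per-feature divergences into a single number is not stated in this section, and the function names below are illustrative rather than quasinet API.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two distributions over the same support.

    p and q are sequences of probabilities listed in the same order.
    """
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a divergence of 0, and distributions with disjoint support give the maximum of 1 bit.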

quasinet.qnet.qdistance_matrix(seqs1, seqs2, qnet1, qnet2)¶

Compute a distance matrix with the qdistance metric.

Parameters

seqs1 (2d array-like) – Array of values
seqs2 (2d array-like) – Array of values
qnet1 (Qnet) – the Qnet that seqs1 belongs to
qnet2 (Qnet) – the Qnet that seqs2 belongs to

Returns

distance_matrix – distance matrix

Return type

2d array-like

quasinet.qnet.save_qnet(qnet, f, low_mem=True, gz=False)¶

Save the qnet to a file.

TODO: using joblib is actually less memory efficient than using pickle. However, I don’t know if this is a general problem or this only happens under certain circumstances.

TODO: we may have to delete and garbage-collect some attributes in the qnet to save memory. For example, .feature_importances_, .available_features_

Parameters

qnet (Qnet) – A Qnet instance
f (str) – File name
low_mem (bool) – If True, save the Qnet with low memory by deleting all data attributes except the tree structure (default: True)
gz (bool) – Specification if we want gzipped output (default: False)

Returns

Return type

None

quasinet.qsampling module¶

quasinet.qsampling.qsample(seq, qnet, steps, baseline_prob=None, force_change=False, alpha=None, random_seed=None)¶

Perform q-sampling for multiple steps or specified indices.

Qsampling works as follows: Say you have a sequence and a qnet. Then we randomly pick one of the items in the sequence (or a specified index) and change the value of that item based on the prediction of the qnet.

Parameters

seq (1d array-like) – Array of values
qnet (Qnet) – The Qnet that seq belongs to
steps (int or 1d array-like) – If an integer, the number of steps to run q-sampling. If an array, specifies the indices to q-sample in order.
baseline_prob (1d array-like, optional) – Baseline probability for sampling which index. Ignored if steps is an array.
force_change (bool, optional) – Whether to force the sequence to change when sampling.
alpha (float, optional) – Scalar multiple of qnet object, can be any real number.
random_seed (int, optional) – Seed for reproducible randomness.

Returns

seq – q-sampled sequence

Return type

1d array-like
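The procedure described above can be sketched in a few lines for the case where steps is an integer. The `predict_distribution` callable is a hypothetical stand-in for the qnet's prediction at one index and is not quasinet API; options such as baseline_prob, force_change, and alpha are left out of this sketch.

```python
import random


def qsample_sketch(seq, predict_distribution, steps, rng):
    """Repeatedly pick a random index and resample its value.

    predict_distribution(seq, index) is assumed to return a dict mapping
    each candidate value to its probability at that index.
    """
    seq = list(seq)  # work on a copy so the input is left untouched
    for _ in range(steps):
        index = rng.randrange(len(seq))
        distrib = predict_distribution(seq, index)
        values = list(distrib)
        weights = [distrib[v] for v in values]
        seq[index] = rng.choices(values, weights=weights, k=1)[0]
    return seq


def toy_predictor(seq, index):
    # Toy model: always predicts "A" with certainty, regardless of context.
    return {"A": 1.0}


original = ["C", "G", "T", "C"]
sampled = qsample_sketch(original, toy_predictor, steps=3, rng=random.Random(0))
```

With the toy predictor, every position in the result is either unchanged or has been replaced by "A".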

quasinet.qsampling.targeted_qsample(seq1, seq2, qnet, steps, force_change=False)¶

Perform targeted q-sampling for multiple steps.

seq1 is q-sampled towards seq2.

This is similar to qsample, except that we perform changes to seq1 to try to approach seq2.

Parameters

seq1 (1d array-like) – Array of values
seq2 (1d array-like) – Array of values.
qnet (Qnet) – The Qnet that seq1 belongs to
steps (int) – Number of steps to run q-sampling
force_change (bool) – Whether to force the sequence to change when sampling.

Returns

seq – q-sampled sequence

Return type

1d array-like

quasinet.qseqtools module¶

quasinet.qseqtools.list_trained_qnets()¶

List the possible qnets we can use.

Parameters: None –
Returns
Return type: None

quasinet.qseqtools.load_sequence(file)¶

Load a fasta sequence from file.

Parameters: file (str) – File of a fasta sequence
Returns: seq – A fasta sequence
Return type: 1d array-like

quasinet.qseqtools.load_trained_qnet(qnet_type, extra_descriptions)¶

Load the pre-trained qnet.

Examples

>>> load_trained_qnet('coronavirus', 'bat')
>>> load_trained_qnet('influenza', 'h1n1;na;2009')

Parameters

qnet_type (str) – The type of qnet to load
extra_descriptions (str) – Extra descriptions for which qnet to load. The descriptions must be split by ; for influenza.

Returns

trained_qnet – A trained qnet

Return type

qnet.Qnet

quasinet.scorers module¶

quasinet.scorers.MEAN(z)¶

quasinet.scorers.approx_wdcor(x, y)¶

Approximate distance correlation by binning arrays

NOTE: Code ported from R function approx.dcor at: https://rdrr.io/cran/extracat/src/R/wdcor.R

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.c_dcor(x, y)¶

Wrapper for C version of distance correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.c_wdcor(x, y, weights)¶

Wrapper for C version of weighted distance correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.cca(X, Y)¶

Largest canonical correlation

Parameters

X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements

Returns

cor – Largest canonical correlation between X and Y

Return type

float

quasinet.scorers.cca_fast(X, Y)¶

Largest canonical correlation

Parameters

X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements

Returns

cor – Largest correlation between X and Y

Return type

float

quasinet.scorers.chi2(x, y)¶: x and y are ordinal representations of categorical variables.

quasinet.scorers.create_chi2_table(x, y)¶: Create a chi-squared contingency table using x and y
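To make the two entries above concrete, here is a minimal sketch of building a contingency table from two categorical arrays and computing the Pearson chi-squared statistic from it. The function names are illustrative, and whether quasinet's chi2 returns the statistic, a p-value, or something else is not stated here.

```python
from collections import Counter


def contingency_table(x, y):
    """Count co-occurrences of the values in x and y.

    Rows follow the sorted unique values of x, columns those of y.
    """
    rows = sorted(set(x))
    cols = sorted(set(y))
    counts = Counter(zip(x, y))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi2_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


dependent_table = contingency_table([0, 0, 1, 1], [0, 0, 1, 1])
dependent_stat = chi2_statistic(dependent_table)
independent_stat = chi2_statistic(contingency_table([0, 0, 1, 1], [0, 1, 0, 1]))
```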

quasinet.scorers.gini_index(y, labels)¶

Gini index for node in tree

Note: Despite being jitted, this function is still slow and a bottleneck in the actual training phase. Sklearn’s Cython version is used to find the best split, and this function is then called on the parent node and two child nodes to calculate feature importances using the mean decrease impurity formula.

Parameters

y (1d array-like) – Array of labels
labels (1d array-like) – Unique labels

Returns

gini – Gini index

Return type

float
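The Gini index and the impurity decrease mentioned in the note can be sketched as below. This follows the standard definitions (one minus the sum of squared class proportions, and the parent impurity minus the size-weighted child impurities); the function names are illustrative, and any additional weighting quasinet applies is not covered.

```python
def gini(y, labels):
    """Gini impurity of the labels in y, given the unique labels."""
    n = len(y)
    if n == 0:
        return 0.0
    return 1.0 - sum(
        (sum(1 for v in y if v == label) / n) ** 2 for label in labels
    )


def impurity_decrease(parent, left, right, labels):
    """Parent impurity minus the size-weighted impurities of the two children."""
    n = len(parent)
    return (
        gini(parent, labels)
        - (len(left) / n) * gini(left, labels)
        - (len(right) / n) * gini(right, labels)
    )


mixed = gini([0, 0, 1, 1], [0, 1])
pure = gini([1, 1, 1], [0, 1])
decrease = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], [0, 1])
```

A perfectly mixed binary node has impurity 0.5, a pure node has impurity 0, and a split that separates the classes completely recovers the full parent impurity.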

quasinet.scorers.mc_fast(x, y, n_classes)¶

Multiple correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes

Returns

cor – Multiple correlation coefficient between x and y

Return type

float

quasinet.scorers.mi(x, y)¶

Mutual information

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements

Returns

info – Mutual information between x and y

Return type

float
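As a reference point, mutual information between two discrete arrays can be estimated from empirical counts as in the sketch below. It uses natural logarithms; the units and any binning or normalization used by quasinet's mi are not stated here, and the function name is illustrative.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    # p(a, b) * log(p(a, b) / (p(a) * p(b))), written in terms of raw counts.
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

For two identical balanced binary arrays the estimate equals ln 2, and for the independent pair it is 0.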

quasinet.scorers.mse(y)¶

Mean squared error for node in tree

Parameters: y (1d array-like) – Array of labels
Returns: error – Mean squared error
Return type: float
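A minimal sketch of a node-level mean squared error, assuming the error is measured against the mean of the labels in the node (the usual convention for regression trees; this section does not confirm that detail for quasinet):

```python
def node_mse(y):
    """Mean squared deviation of the labels from their own mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


spread = node_mse([1.0, 2.0, 3.0])
constant = node_mse([5.0, 5.0])
```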

quasinet.scorers.pcor(x, y)¶

Pearson correlation

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements

Returns

cor – Pearson correlation

Return type

float
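For comparison, the Pearson correlation coefficient can be written directly from its definition. This is a plain-Python sketch with an illustrative name, not the library's implementation; it raises ZeroDivisionError if either input is constant.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


increasing = pearson([1, 2, 3], [2, 4, 6])
decreasing = pearson([1, 2, 3], [3, 2, 1])
```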

quasinet.scorers.py_dcor(x, y)¶

Python port of C function for distance correlation

Note: Version is optimized for use with Numba

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.py_wdcor(x, y, weights)¶

Python port of C function for weighted distance correlation

Note: Version is optimized for use with Numba

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1

Returns

dcor – Distance correlation

Return type

float

quasinet.scorers.rdc(X, Y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶

Randomized dependence coefficient

Parameters

X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function

Returns

cor – Randomized dependence coefficient between X and Y

Return type

float

quasinet.scorers.rdc_fast(x, y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶

Randomized dependence coefficient

Parameters

x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function

Returns

cor – Randomized dependence coefficient between x and y

Return type

float

quasinet.tree module¶

class quasinet.tree.Node(col=None, col_pval=None, lthreshold=None, rthreshold=None, impurity=None, value=None, left=None, right=None, label_frequency=None)¶

Bases: object

Decision node in tree

Parameters

col (int) – Integer indexing the location of feature or column
col_pval (float) – Probability value from permutation test for feature selection
lthreshold (list) – List of items for taking the left edge down the tree
rthreshold (list) – List of items for taking the right edge down the tree
impurity (float) – Impurity measuring quality of split
value (1d array-like or float) – For classification trees, estimate of each class probability. For regression trees, central tendency estimate
left (Node) – Another Node
right (Node) – Another Node
label_frequency (dict) – Dictionary mapping label to its frequency

quasinet.tree.get_nodes(root, get_leaves=True, get_non_leaves=True)¶

Traverse a tree and get all the nodes.

TODO: may need to change this into an iterator for speed purposes

If get_leaves and get_non_leaves are both True, then we will get all the nodes.

Parameters

root (Node) – root node of the tree
get_leaves (bool) – If true, we get leaf nodes
get_non_leaves (bool) – If true, we get non leaf nodes.

Returns

output – list of Node

Return type

list
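The traversal can be sketched with an explicit stack, as below. `SimpleNode` is a hypothetical stand-in that only carries the left and right attributes of quasinet.tree.Node, and treating a node with no children as a leaf is an assumption of this sketch.

```python
class SimpleNode:
    """Minimal stand-in for a tree node with left and right children."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right


def collect_nodes(root, get_leaves=True, get_non_leaves=True):
    """Return nodes in pre-order, optionally filtering leaves or internal nodes."""
    output = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        is_leaf = node.left is None and node.right is None
        if (is_leaf and get_leaves) or (not is_leaf and get_non_leaves):
            output.append(node)
        # Push right first so that the left subtree is visited first.
        stack.append(node.right)
        stack.append(node.left)
    return output


tree = SimpleNode("root", SimpleNode("a"), SimpleNode("b", SimpleNode("d"), SimpleNode("e")))
all_names = [n.name for n in collect_nodes(tree)]
leaf_names = [n.name for n in collect_nodes(tree, get_non_leaves=False)]
inner_names = [n.name for n in collect_nodes(tree, get_leaves=False)]
```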

quasinet.utils module¶

quasinet.utils.analyze_dot_file(dot_file, fracThreshold=0.0)¶

quasinet.utils.assert_array_rank(X, rank)¶

Check if the input is a numpy array and has a certain rank.

Parameters

X (array-like) – Array to check
rank (int) – Rank of the tensor to check

Returns

Return type

None

quasinet.utils.assert_string_type(X, name)¶

Check if the input is of string datatype.

Parameters

X (array-like) – Array to check
name (str) – Name of the input

Returns

Return type

None

quasinet.utils.auc_score(y_true, y_prob)¶: ADD

quasinet.utils.bayes_boot_probs(n)¶

Bayesian bootstrap sampling for case weights

Parameters: n (int) – Number of Bayesian bootstrap samples
Returns: p – Array of sampling probabilities
Return type: 1d array-like
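In the standard Bayesian bootstrap, case weights are drawn from a flat Dirichlet distribution, which can be generated by normalizing independent exponential draws. The sketch below shows that construction; whether bayes_boot_probs uses exactly this scheme is an assumption, and the function name is illustrative.

```python
import random


def bayes_boot_probs_sketch(n, rng):
    """Draw n sampling probabilities from a Dirichlet(1, ..., 1) distribution."""
    # Normalized Exponential(1) draws are distributed as a flat Dirichlet.
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


probs = bayes_boot_probs_sketch(5, random.Random(42))
```

The returned probabilities are non-negative and sum to 1, so they can be used directly as weights when resampling rows.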

quasinet.utils.big_enough(dot_file, big_enough_threshold=- 1)¶

quasinet.utils.dot4svg(dot_file, output_file, output_svg_file, directory='trees1/', dpi=70, draw=True, base_url='https://zed.createuky.net/')¶

quasinet.utils.drawtrees(dotfiles, prog='dot', format='pdf', big_enough_threshold=- 1)¶

quasinet.utils.estimate_margin(y_probs, y_true)¶

Estimates margin function of forest ensemble

Note: This function is similar to margin in R’s randomForest package

Parameters

y_probs (2d array-like) – Predicted probabilities where each row represents predicted class distribution for sample and each column corresponds to estimated class probability
y_true (1d array-like) – Array of true class labels

Returns

margin – Estimated margin of forest ensemble

Return type

float
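In the random forest literature, the margin of a sample is the predicted probability of its true class minus the largest predicted probability among the other classes. The sketch below computes that quantity per sample and then averages it; treating y_true as column indices into y_probs, and reducing to a mean, are assumptions made for illustration rather than documented behavior.

```python
def sample_margins(y_probs, y_true):
    """Per-sample margin: P(true class) minus the best competing probability.

    Assumes at least two classes and that y_true holds column indices.
    """
    margins = []
    for probs, true_idx in zip(y_probs, y_true):
        others = [p for j, p in enumerate(probs) if j != true_idx]
        margins.append(probs[true_idx] - max(others))
    return margins


example_probs = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]]
example_true = [0, 0]
margins = sample_margins(example_probs, example_true)
mean_margin = sum(margins) / len(margins)
```

A positive margin means the sample is classified correctly with room to spare; a negative margin means another class received more support than the true one.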

quasinet.utils.find_matching_indices(A, B)¶

quasinet.utils.generate_seed()¶: Generate a seed for the random number generator as a function of the current time and thread id. Must be used when a large number of qsamples are drawn in parallel.

quasinet.utils.getNull(model, strtype='U5')¶

Function to generate an array of empty strings of same length as feature names in the model.

Parameters

model (Qnet object) – The Qnet model.
strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.

Returns

An array of empty strings.

Return type

numpy.ndarray

quasinet.utils.logger(name, message)¶

Prints messages with style “[NAME] message”

Parameters

name (str) – Short title of message, for example, train or test
message (str) – Main description to be displayed in terminal

Returns

Return type

None

quasinet.utils.numparameters(qnetmodel)¶

Computes the total number of parameters in the qnet

Parameters

qnetmodel (Qnet object) – The Qnet model.

Returns

int – number of independent parameters.
float – number of internal nodes per model column.

quasinet.utils.powerset(s)¶: Get the power set of a list or set.
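A power set can be built with the well-known itertools recipe shown below. This is a sketch with an illustrative name; the return type and ordering of quasinet's powerset are not stated here.

```python
from itertools import chain, combinations


def powerset_sketch(s):
    """Return all subsets of s as tuples, ordered by increasing size."""
    items = list(s)
    return list(
        chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    )


subsets = powerset_sketch([1, 2, 3])
```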

quasinet.utils.remove_newline_in_dotfile(file_path)¶: remove newlines from edge labels in dotfile

quasinet.utils.remove_zeros(r, axis)¶: Remove rows along a certain axis where the value is all zero.

quasinet.utils.sample_from_dict(distrib)¶

Choose an item from the distribution

Parameters: distrib (dict) – Dictionary mapping keys to its probability values
Returns: item – A chosen key from the dictionary
Return type: key of dict
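Sampling a key in proportion to its probability can be sketched with random.choices from the standard library. The function name and the explicit rng argument are illustrative choices, not the library's signature.

```python
import random


def sample_from_dict_sketch(distrib, rng):
    """Pick one key, with probability proportional to its value."""
    keys = list(distrib)
    weights = [distrib[k] for k in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


rng = random.Random(0)
certain = sample_from_dict_sketch({"A": 1.0, "C": 0.0}, rng)
draws = [sample_from_dict_sketch({"A": 0.5, "C": 0.5}, rng) for _ in range(20)]
```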

quasinet.utils.scientific_notation(num)¶

Convert a number into scientific notation

Parameters: num (float) – Any number
Returns: output – String representation of the number
Return type: str
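Python's format specification already provides scientific notation, so a conversion of this kind can be sketched as below. The two-digit precision and the exact output format are assumptions; the format produced by quasinet's scientific_notation is not specified in this section.

```python
def scientific_notation_sketch(num, digits=2):
    """Format a number with the 'e' presentation type, e.g. 1.23e+04."""
    return f"{num:.{digits}e}"


large = scientific_notation_sketch(12345.678)
small = scientific_notation_sketch(0.00012)
```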

quasinet.zqnet module¶

quasinet.zqnet.extract_diagonal_blocks(M, L)¶

quasinet.zqnet.get_description_curl(code)¶

quasinet.zqnet.remove_suffix(s)¶

quasinet.zqnet.replace_with_d(S, j, d)¶

class quasinet.zqnet.zQnet(*args, **kwargs)¶

Bases: quasinet.qnet.Qnet

Extended Qnet architecture (zQnet).

An extension of the Qnet class with added functionality and attributes. This class introduces risk computation based on a series of metrics and provides a way to set and retrieve a model description.

Qnet : Base class

nullsequences (array-like) – Collection of sequences considered null or baseline for risk calculations.
target (str, optional) – Target variable or description. Not currently utilized in methods.
description (str) – Descriptive notes or commentary about the model.
auc_estimate (array-like) – AUCs obtained during optimization of null sequences.
training_index (array-like, optional) – Indices used during the training phase. Not currently utilized in methods.

Parameters

*args – Variable length argument list inherited from Qnet.
**kwargs – Arbitrary keyword arguments inherited from Qnet.

personal_zshap(s, eps=1e-07)¶

A superfast approximation of SHAP for zQnet for individual samples

Parameters

s (numpy array of str) – The sequence around which we are evaluating perturbations.
eps (float) – SHAP value cutoff

Returns

DataFrame with SHAP values and an index mapped to short descriptions of ICD-10 codes

Return type

pandas.DataFrame

risk(X)¶

Compute the mean risk value for input X based on its distance from null sequences.

Parameters: X (2d array-like) – Input data whose risk is to be computed.
Returns: Mean risk value for the input X.
Return type: float

risk_max(X)¶

Compute the maximum risk value for input X based on its distance from null sequences.

Parameters: X (2d array-like) – Input data whose risk is to be computed.
Returns: Maximum risk value for the input X.
Return type: float

risk_median(X)¶

Compute the median risk value for input X based on its distance from null sequences.

Parameters: X (2d array-like) – Input data whose risk is to be computed.
Returns: Median risk value for the input X.
Return type: float

set_description(markdown_file)¶

Set the description attribute for the model using content from a markdown file.

Parameters: markdown_file (str) – Path to the markdown file containing the model’s description.
Returns: Content of the markdown file.
Return type: str

zshap(seq=None, m=35)¶

A superfast approximation of SHAP for zQnet

Parameters

seq (numpy array of str) – The sequence around which we are evaluating perturbations. By default it is the array of empty strings, which represents average behavior
m (int) – Length of the returned SHAP dataframe

Returns

DataFrame with SHAP values and an index mapped to short descriptions of ICD-10 codes

Return type

pandas.DataFrame
