quasinet package¶
Submodules¶
quasinet.ciforest module¶
- class quasinet.ciforest.CIForestClassifier(min_samples_split=2, alpha=0.05, selector='mc', max_depth=-1, n_estimators=100, max_feats='sqrt', n_permutations=100, early_stopping=True, muting=True, verbose=0, bootstrap=True, bayes=True, class_weight='balanced', n_jobs=-1, random_state=None)¶
Bases: sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Conditional forest classifier
- Parameters
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
selector (str) – Variable selector for finding strongest association between a feature and the label
max_depth (int) – Maximum depth to grow tree
n_estimators (int) – Number of trees in the forest
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
bootstrap (bool) – Whether to perform bootstrap sampling for each tree
bayes (bool) – If True, performs Bayesian bootstrap sampling
class_weight (str) – Type of sampling during bootstrap, None for regular bootstrapping, ‘balanced’ for balanced bootstrap sampling, and ‘stratify’ for stratified bootstrap sampling
n_jobs (int) – Number of jobs for permutation testing
random_state (int) – Sets seed for random number generator
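The early_stopping behaviour described in the parameter list can be sketched as follows. This is a minimal illustration, not quasinet's implementation: `select_feature` and the `p_value_of` callback are hypothetical names, and the fallback used when early stopping is off (keep the smallest p-value, provided it is below alpha) is an assumption.

```python
def select_feature(features, p_value_of, alpha=0.05, early_stopping=True):
    """Return the index of the splitting feature, or None if no test passes.

    `p_value_of` is a hypothetical callback that runs a permutation test
    for one feature and returns its p-value.
    """
    best_idx, best_p = None, 1.0
    for idx in features:
        p = p_value_of(idx)
        if early_stopping and p < alpha:
            # Stop at the first feature whose test is significant.
            return idx
        if p < best_p:
            best_idx, best_p = idx, p
    # Without early stopping, keep the most significant feature (assumed rule).
    return best_idx if best_p < alpha else None


# Toy p-values standing in for real permutation tests.
toy_p = {0: 0.40, 1: 0.03, 2: 0.001}
first_hit = select_feature([0, 1, 2], toy_p.get, early_stopping=True)
strongest = select_feature([0, 1, 2], toy_p.get, early_stopping=False)
```

With early stopping the sketch returns feature 1, the first one to pass the test. Without it, every feature is tested and feature 2, the most significant, is chosen.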
- fit(X, y)¶
Fit conditional forest classifier
- Parameters
X (2d array-like) – Array of features
y (1d array-like) – Array of labels
- Returns
self – Instance of CIForestClassifier
- Return type
CIForestClassifier
- predict(X)¶
Predicts class labels for feature vectors X
- Parameters
X (2d array-like) – Array of features
- Returns
y – Array of predicted classes
- Return type
1d array-like
- predict_proba(X)¶
Predicts class probabilities for feature vectors X
- Parameters
X (2d array-like) – Array of features
- Returns
class_probs – Array of predicted class probabilities
- Return type
2d array-like
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') quasinet.ciforest.CIForestClassifier ¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns
self – The updated object.
- Return type
object
- quasinet.ciforest.balanced_sampled_idx(random_state, y, bayes, min_class_p)¶
Indices for balanced bootstrap sampling in classification
- Parameters
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels
- Returns
idx – Balanced sampled indices for each class
- Return type
list
- quasinet.ciforest.balanced_unsampled_idx(random_state, y, bayes, min_class_p)¶
Unsampled indices for balanced bootstrap sampling in classification
- Parameters
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels
- Returns
idx – Balanced unsampled indices for each class
- Return type
list
- quasinet.ciforest.normal_sampled_idx(random_state, n, bayes)¶
Indices for bootstrap sampling
- Parameters
random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns
idx – Sampled indices
- Return type
list
- quasinet.ciforest.normal_unsampled_idx(random_state, n, bayes)¶
Unsampled indices for bootstrap sampling
- Parameters
random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns
idx – Unsampled indices
- Return type
list
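The relationship between sampled and unsampled (out-of-bag) indices can be illustrated with a short sketch. It is not the library's code: the helper names are hypothetical, the Bayesian bootstrap is assumed to mean drawing indices with Dirichlet(1, ..., 1) weights, and the unsampled indices are derived here from an explicit sample rather than from a seed.

```python
import random


def sampled_idx(rng, n, bayes):
    """Draw n bootstrap indices from range(n), with replacement."""
    if bayes:
        # Bayesian bootstrap (assumed form): Dirichlet(1, ..., 1) weights,
        # obtained by normalizing independent Exponential(1) draws.
        gaps = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(gaps)
        weights = [g / total for g in gaps]
    else:
        weights = [1.0 / n] * n
    return rng.choices(range(n), weights=weights, k=n)


def unsampled_idx(n, sampled):
    """Indices never drawn, i.e. the out-of-bag rows."""
    drawn = set(sampled)
    return [i for i in range(n) if i not in drawn]


rng = random.Random(0)
in_bag = sampled_idx(rng, 10, bayes=True)
out_of_bag = unsampled_idx(10, in_bag)
```

The two index lists are disjoint and together cover every row, which is what makes the unsampled rows usable for out-of-bag evaluation.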
- quasinet.ciforest.stratify_sampled_idx(random_state, y, bayes)¶
Indices for stratified bootstrap sampling in classification
- Parameters
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns
idx – Stratified sampled indices for each class
- Return type
list
- quasinet.ciforest.stratify_unsampled_idx(random_state, y, bayes)¶
Unsampled indices for stratified bootstrap sampling in classification
- Parameters
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns
idx – Stratified unsampled indices for each class
- Return type
list
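Stratified bootstrap sampling can be sketched as resampling within each class so that class counts are preserved. The function below is a hypothetical illustration under that assumption. It ignores the bayes option and is not taken from quasinet.

```python
import random
from collections import defaultdict


def stratified_sampled_idx(rng, y):
    """Bootstrap within each class so class counts match the original y."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    idx = []
    for label in sorted(by_class):
        members = by_class[label]
        # Resample each class with replacement, keeping its original size.
        idx.extend(rng.choices(members, k=len(members)))
    return idx


y = ["a", "a", "a", "b", "b", "c"]
idx = stratified_sampled_idx(random.Random(0), y)
resampled = [y[i] for i in idx]
```

Because each class is resampled to its own size, `resampled` always has the same class counts as `y`, even though individual rows may repeat or drop out.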
quasinet.citrees module¶
- class quasinet.citrees.CITreeBase(min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶
Bases: object
Base class for conditional inference tree.
- Parameters
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator
- fit(X, y=None)¶
Train model.
X, y must contain only string datatypes.
- Parameters
X (2d array-like) – Array of categorical features
y (1d array-like) – Array of labels
- Returns
self – Instance of CITreeBase class
- Return type
CITreeBase
- predict(*args, **kwargs)¶
Predicts labels on test data. This method should not be callable from base class.
- predict_label(X, tree=None)¶
Predicts label
- Parameters
X (1d array-like) – Array of features for single sample
tree (CITreeBase) – Trained tree
- Returns
label – Predicted label
- Return type
str
- print_tree(tree=None, indent=' ', child=None)¶
Prints tree structure
- Parameters
tree (CITreeBase) – Trained tree model
indent (str) – Indent spacing
child (Node) – Left or right child node
- Returns
- Return type
None
- class quasinet.citrees.CITreeClassifier(min_samples_split=2, alpha=0.05, selector='chi2', max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶
Bases: quasinet.citrees.CITreeBase, sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin
Conditional inference tree classifier
NOTE: as of now, the features can only be categorical
- Parameters
selector (str) – Variable selector for finding strongest association between a feature and the label
Other parameters are derived from the CITreeBase class; see its constructor for parameter definitions.
- fit(X, y, labels=None)¶
Train conditional inference tree classifier
- Parameters
X (2d array-like) – Array of features
y (1d array-like) – Array of labels
labels (1d array-like) – Array of unique class labels
- Returns
self – Instance of CITreeClassifier class
- Return type
CITreeClassifier
- predict(X)¶
Predicts class labels for feature vectors X
- Parameters
X (2d array-like) – Array of features
- Returns
y – Array of predicted classes
- Return type
1d array-like
- predict_proba(X)¶
Predicts class probabilities for feature vectors X
- Parameters
X (2d array-like) – Array of features
- Returns
class_probs – Array of predicted class probabilities
- Return type
2d array-like
- set_fit_request(*, labels: Union[bool, None, str] = '$UNCHANGED$') quasinet.citrees.CITreeClassifier ¶
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in fit.
- Returns
self – The updated object.
- Return type
object
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') quasinet.citrees.CITreeClassifier ¶
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns
self – The updated object.
- Return type
object
- quasinet.citrees.get_feature_importance(citree, normalize=True)¶
Get the feature importance of the citree.
- Parameters
citree (CITreeBase) – A conditional inference tree
normalize (bool) – Whether to normalize the feature importance or not
- Returns
col_to_importance – Mapping from column index to total feature importance
- Return type
dict
quasinet.curvature module¶
- quasinet.curvature.compute_curvature(p, delta)¶
Computes the curvature (scalar curvature) at a given point in the space of Quasinets.
The curvature R is computed as:
\[R = G^{ij} R_{ij}\]
where G^{ij} is the inverse of the metric tensor G_{ij}, and R_{ij} is the Ricci curvature.
- Parameters
p (list[dict[str, float]]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction
- Returns
The curvature at point p
- Return type
float
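The contraction R = G^{ij} R_{ij} can be checked by hand on a small case. The sketch below uses hypothetical 2x2 metric and Ricci matrices and plain Python lists. It only illustrates the index contraction and is not quasinet's implementation.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("metric tensor is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def scalar_curvature(metric, ricci):
    """R = G^{ij} R_{ij}: contract the inverse metric with the Ricci tensor."""
    g_inv = inverse_2x2(metric)
    n = len(metric)
    return sum(g_inv[i][j] * ricci[i][j] for i in range(n) for j in range(n))


# Hypothetical inputs chosen so the result is easy to verify by hand.
metric = [[2.0, 0.0], [0.0, 4.0]]
ricci = [[1.0, 0.0], [0.0, 1.0]]
R = scalar_curvature(metric, ricci)
```

Here the inverse metric is diag(1/2, 1/4), so R = 1/2 + 1/4 = 0.75.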
- quasinet.curvature.compute_metric_tensor(p_distrib, delta, progress=False)¶
Computes the metric tensor at a given point in the space of Quasinets.
The metric tensor G_ij is defined as:
\[G_{ij} = \frac{1}{2} \left( D(p + \delta p_i + \delta p_j, p) - D(p + \delta p_i, p) - D(p + \delta p_j, p) + D(p, p) \right)\]
where D is the distance function, p_i is the i-th unit Quasinet, and delta is a small perturbation.
- Parameters
p_distrib (list[dict[str, float]]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which the metric tensor is calculated
delta (float) – A small number representing a change in each coordinate direction
progress (bool) – If True, show a progress bar
- Returns
The metric tensor at point p (the quasinet for which p_distrib is calculated)
- Return type
ndarray
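The finite-difference formula for G_ij can be tried on a toy space. In this sketch the point p is a plain list of floats and D is the squared Euclidean distance. Both are assumptions made for illustration, whereas quasinet perturbs per-estimator probability distributions and uses its own distance function.

```python
def squared_distance(p, q):
    """Toy stand-in for the distance function D (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def shifted(p, delta, *axes):
    """Return p moved by delta along each listed coordinate axis."""
    out = list(p)
    for axis in axes:
        out[axis] += delta
    return out


def metric_tensor(p, delta, dist=squared_distance):
    """G_ij = 0.5 * (D(p+d_i+d_j, p) - D(p+d_i, p) - D(p+d_j, p) + D(p, p))."""
    n = len(p)
    base = dist(p, p)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            g[i][j] = 0.5 * (
                dist(shifted(p, delta, i, j), p)
                - dist(shifted(p, delta, i), p)
                - dist(shifted(p, delta, j), p)
                + base
            )
    return g


G = metric_tensor([0.2, 0.5, 0.3], delta=0.1)
```

For this toy distance the result is delta squared times the identity matrix: the diagonal entries are 0.01 and the off-diagonal entries vanish.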
- quasinet.curvature.compute_metric_tensor_derivative(p, delta)¶
Computes the derivative of the metric tensor at a given point in the space of Quasinets.
The derivative of the metric tensor G_ij with respect to the k-th coordinate is computed as:
\[\frac{\partial G_{ij}}{\partial p_k} = \frac{G_{ij}(p + \delta p_k) - G_{ij}(p)}{\delta}\]
- Parameters
p (list[dict[str, float]]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which to compute the metric tensor derivative
delta (float) – A small number representing a change in each coordinate direction
- Returns
The derivative of the metric tensor at point p
- Return type
ndarray
- quasinet.curvature.compute_ricci_curvature(p, delta)¶
Computes the Ricci curvature at a given point in the space of Quasinets.
The Ricci curvature R_ij is computed as:
\[R_{ij} = G^{kl} \left( \frac{\partial^2 G_{ij}}{\partial p_k \partial p_l} - \frac{1}{2} \frac{\partial^2 G_{kl}}{\partial p_i \partial p_j} \right)\]
where G^{kl} is the inverse of the metric tensor G_{kl}, and the partial derivatives are computed by taking the limit as delta goes to zero.
- Parameters
p (list[dict[str, float]]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction
- Returns
The Ricci curvature at point p
- Return type
ndarray
- quasinet.curvature.delta_pi(qnet_instance, index, delta)¶
This function modifies the distribution of the given Quasinet instance by scaling it with a scalar value in the direction of the given index.
- Parameters
qnet_instance (Quasinet) – The Quasinet instance to modify
index (int) – The index of the feature direction to scale
delta (float) – The scalar to scale the distribution with
- Returns
The Quasinet instance with modified distribution
- Return type
Quasinet
- quasinet.curvature.dist_scalr_mult(D1, a)¶
Multiply each value in the dictionary with scalar ‘a’ and renormalize to get a valid probability distribution.
- Parameters
D1 (dict) – Dictionary where each key-value pair represents an item and its probability.
a (float) – Scalar to multiply with each value of D1.
- Returns
New dictionary with each value scaled and renormalized.
- Return type
dict
- quasinet.curvature.dist_sum(D1, D2)¶
Add each corresponding value in D1 and D2, then renormalize to get a valid probability distribution.
- Parameters
D1 (dict) – First dictionary, where each key-value pair represents an item and its probability.
D2 (dict) – Second dictionary, where each key-value pair represents an item and its probability.
- Returns
New dictionary with each value being the sum of the corresponding values in D1 and D2, renormalized.
- Return type
dict
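The two dictionary operations described above (scale then renormalize, and sum then renormalize) can be sketched in a few lines. The helper names are hypothetical, and treating a key missing from one dictionary as probability 0 is an assumption. The library's own handling may differ.

```python
def normalize(d):
    """Rescale the values of d so that they sum to 1."""
    total = sum(d.values())
    if total == 0:
        raise ValueError("cannot normalize an all-zero distribution")
    return {k: v / total for k, v in d.items()}


def scale_and_renormalize(d1, a):
    """Multiply every value by a, then renormalize."""
    return normalize({k: a * v for k, v in d1.items()})


def sum_and_renormalize(d1, d2):
    """Add corresponding values, then renormalize."""
    # Keys missing from one dictionary are treated as probability 0 (assumed).
    keys = set(d1) | set(d2)
    return normalize({k: d1.get(k, 0.0) + d2.get(k, 0.0) for k in keys})


mixed = sum_and_renormalize({"A": 0.5, "C": 0.5}, {"A": 1.0})
scaled = scale_and_renormalize({"A": 0.25, "C": 0.75}, 3.0)
```

In this example `mixed` is {"A": 0.75, "C": 0.25}. Read literally, a uniform positive scale factor cancels on renormalization, so `scaled` equals its input distribution.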
- quasinet.curvature.distance_function(p, q, NULL=None, strtype='U5')¶
Computes the distance between two Quasinets.
- Parameters
p (Quasinet) – First Quasinet
q (Quasinet) – Second Quasinet
- Returns
The distance between p and q
- Return type
float
- quasinet.curvature.distance_function_distrib(p, q, i)¶
Compute the distance between two quasinets, assuming that p and q differ only at the estimator coordinates listed in i.
- Parameters
p (list[dict[str, float]]) – Output of quasinet.predict_distributions()
q (list[dict[str, float]]) – Output of quasinet.predict_distributions()
i (1d numpy array) – Indices at which p and q differ
- quasinet.curvature.mt_worker(args)¶
- quasinet.curvature.perturb_quasinet(qnet_instance, index, delta)¶
Perturbs a Quasinet in the direction of the i-th feature.
- Parameters
p (list[dict[str, float]]) – Output of quasinet.predict_distributions()
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation
- Returns
The perturbed Quasinet
- Return type
Quasinet
- quasinet.curvature.perturb_quasinet_distrib(p_distrib_, index, delta)¶
Perturbs a Quasinet in the direction of the i-th feature, using only the distributions at each estimator, which are produced by the predict_distributions function
- Parameters
p_distrib_ (list[dict[str, float]]) – Output of quasinet.predict_distributions()
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation
- Returns
The perturbed Quasinet
- Return type
Quasinet
- quasinet.curvature.scalarmod_predict_distribution(self, column_to_item, column, **kwargs)¶
Modify the predict_distribution function of the Quasinet object to scale the output probabilities for a specified feature.
- Parameters
self (Quasinet object) – The Quasinet instance.
column_to_item (dict) – A dictionary mapping from column names to specific items.
column (str) – The name of the column (feature) to scale.
**kwargs (dict) – Additional arguments passed to the predict_distribution function.
- Returns
A dictionary of probabilities for each item in the specified column.
- Return type
dict
- quasinet.curvature.sum_predict_distribution(self, column_to_item, column, **kwargs)¶
quasinet.export module¶
- class quasinet.export.GraphvizTreeExporter(tree, outfile, response_name, feature_names, text_color='black', edge_color='gray', font_size=10, edge_label_color='deepskyblue4', pen_width=2, background_color='transparent', dpi=200, edge_fontcolor='grey14', rotate=False, add_legend=True, min_size=1, color_alpha=-1.5, labels=None, detailed_output=False)¶
Bases: object
Export the tree using graphviz.
- Parameters
qnet (Qnet) – A Qnet instance
outfile (str) – Output file to save results to
response_name (str) – Name of the y variable that we are predicting
feature_names (list) – Names of each of the features
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
detailed_output (bool) – If False (default), leaves show the probability of the most likely label; otherwise they show the full probability distribution
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
labels (list) – List of all labels, optional
- Returns
- Return type
None
- export()¶
- class quasinet.export.QnetGraphExporter(qnet, outfile, threshold)¶
Bases: object
Export the qnet as a graph to a dot file format.
- Parameters
qnet (Qnet) – A Qnet instance
outfile (str) – Output file to save results to
threshold (float) – Numeric cutoff for edge weights. If the edge weights exceed this cutoff, then we include it into the graph.
- Returns
- Return type
None
- export()¶
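The role of threshold can be illustrated with a small sketch that keeps only edges whose weight exceeds the cutoff and writes them in DOT syntax. The edge weights, the helper name, and the choice of a directed graph are all hypothetical. This does not reproduce QnetGraphExporter's actual output.

```python
def edges_to_dot(edge_weights, threshold):
    """Render edges whose weight exceeds threshold as a DOT digraph string."""
    lines = ["digraph qnet {"]
    for (src, dst), weight in sorted(edge_weights.items()):
        if weight > threshold:
            lines.append(f'    "{src}" -> "{dst}" [weight={weight}];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical dependency weights between three features.
weights = {("f1", "f2"): 0.8, ("f2", "f3"): 0.1, ("f3", "f1"): 0.4}
dot_text = edges_to_dot(weights, threshold=0.3)
```

With a threshold of 0.3, the f2 to f3 edge (weight 0.1) is dropped and the other two edges are kept.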
quasinet.feature_importance module¶
- quasinet.feature_importance.getShap(model_, num_backgrounds=1, samples=None, num_samples=5, strtype='U5', fast_estimate=False)¶
Function to compute SHAP values for feature importance analysis.
- Parameters
model_ (Qnet object) – The Qnet model.
num_backgrounds (int) – Number of background samples to generate. Default is 1.
num_samples (int) – Number of samples for the SHAP analysis. Default is 5.
strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.
samples (numpy array) – Samples to run the SHAP analysis on. Default is None. If None, samples are generated via qsampling.
fast_estimate (bool) – If True, use tree explainer with a CatBoostRegressor model for faster estimation. Default is False.
- Returns
pandas.DataFrame – A dataframe containing the SHAP values for each feature.
numpy.array – Indices of the model's feature_names, ordered by descending SHAP value
- quasinet.feature_importance.qnet_model_func(X)¶
Function to compute the distance matrix for a given set of sequences.
- Parameters
X (numpy.ndarray) – Array of sequences.
- Returns
The computed distance matrix.
- Return type
numpy.ndarray
quasinet.feature_selectors module¶
- quasinet.feature_selectors.permutation_test_chi2(x, y, B=100, random_state=None, **kwargs)¶
Permutation test using chi-squared.
This is used when x and y are nominal variables.
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
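A permutation test of this kind can be sketched in pure Python: compute the chi-squared statistic on the observed data, then count how often shuffling y produces a statistic at least as large. The function names and the p-value convention (hits divided by B) are assumptions. The library's implementation may differ in both.

```python
import random
from collections import Counter


def chi2_stat(x, y):
    """Pearson chi-squared statistic of the x-by-y contingency table."""
    n = len(x)
    joint = Counter(zip(x, y))
    x_counts, y_counts = Counter(x), Counter(y)
    stat = 0.0
    for xv, xc in x_counts.items():
        for yv, yc in y_counts.items():
            expected = xc * yc / n
            stat += (joint[(xv, yv)] - expected) ** 2 / expected
    return stat


def permutation_test(x, y, B=100, seed=None):
    """Share of B label shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = chi2_stat(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(B):
        rng.shuffle(y_perm)
        if chi2_stat(x, y_perm) >= observed:
            hits += 1
    return hits / B


x = ["a", "a", "b", "b"] * 10
p_dependent = permutation_test(x, x, B=200, seed=0)  # y identical to x
p_constant = permutation_test(x, ["k"] * 40, B=200, seed=0)  # y carries no signal
```

When y duplicates x the observed statistic is almost never matched by a shuffle, so the p-value is small. When y is constant every shuffle ties the observed statistic and the p-value is 1.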
- quasinet.feature_selectors.permutation_test_dcor(x, y, B=100, random_state=None)¶
Permutation test for distance correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_dcor_parallel(x, y, B=100, n_jobs=-1, random_state=None)¶
Parallel implementation of permutation test for distance correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_mc(x, y, B=100, n_classes=None, random_state=None)¶
Permutation test for multiple correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_mi(x, y, B=100, random_state=None, **kwargs)¶
Permutation test for mutual information
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_pcor(x, y, B=100, random_state=None)¶
Permutation test for Pearson correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_rdc(x, y, B=100, random_state=None)¶
Permutation test for randomized dependence coefficient
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
- quasinet.feature_selectors.permutation_test_rdc_parallel(x, y, B=100, n_jobs=-1, k=10, random_state=None)¶
Parallel implementation of permutation test for randomized dependence coefficient
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
k (int) – Number of random projections for cca
random_state (int) – Sets seed for random number generator
- Returns
p – Achieved significance level
- Return type
float
quasinet.metrics module¶
- quasinet.metrics.convert_lists_to_ctypes(V_list, key_list)¶
- quasinet.metrics.js_divergence(p1, p2, smooth=0.0001)¶
Compute the Jensen-Shannon divergence of discrete probability distributions.
- Parameters
p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution
smooth (float) – amount by which to smooth out the probability distribution of p2. This is intended to deal with categories with zero probability.
- Returns
js_div – js divergence
- Return type
numeric
- quasinet.metrics.kl_divergence(p1, p2)¶
Compute the Kullback–Leibler divergence of discrete probability distributions.
NOTE: we will not perform error checking in this function because this function is used very frequently. The user should check that p1 and p2 are in fact probability distributions.
- Parameters
p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution
- Returns
output – kl divergence
- Return type
numeric
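For reference, the two divergences can be written directly from their textbook definitions. This sketch uses natural logarithms and omits the smooth parameter. It is not quasinet's implementation, and like the documented kl_divergence it performs no input validation.

```python
import math


def kl_divergence(p1, p2):
    """KL(p1 || p2) in nats; terms with p1[i] == 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


def js_divergence(p1, p2):
    """Average KL divergence of p1 and p2 from their midpoint mixture."""
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl_divergence(p1, mid) + 0.5 * kl_divergence(p2, mid)


same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give 0, and distributions with disjoint support give the maximum value, log 2. The plain KL divergence is undefined when p2 is zero where p1 is positive, which is the situation smoothing is meant to handle.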
- quasinet.metrics.process_dict1_list(dict1_list)¶
- quasinet.metrics.process_dict2_list(dict2_list)¶
- quasinet.metrics.theta(seq1_list, seq2_list)¶
- quasinet.metrics.theta_(dict1_list, dict2_list)¶
- quasinet.metrics.theta_matrix(list_dict1_list, list_dict2_list=None)¶
- quasinet.metrics.theta_matrix_par(list_dict1_list, list_dict2_list)¶
quasinet.qnet module¶
- class quasinet.qnet.Qnet(feature_names, min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, early_stopping=False, verbose=0, random_state=None, n_jobs=1)¶
Bases: object
Qnet architecture.
- Parameters
feature_names (list) – List of names describing the features
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator
n_jobs (int) – Number of CPUs to use when training
- clear_attributes()¶
Remove the unneeded attributes to save memory.
- Parameters
None –
- Returns
- Return type
None
- fit(X, index_array=None)¶
Train the qnet
Examples
>>> from quasinet import qnet
>>> X = load_data()
>>> myqnet = qnet.Qnet(feature_names=feature_names)
>>> myqnet.fit(X)
- Parameters
X (2d array-like) – Array of features
index_array (1d array-like) – Array of indices to generate estimators for. Uses all indices by default or if set to None.
- Returns
self – Instance of Qnet class
- Return type
Qnet
- mix(qnet_2, feature_name_list)¶
Take the estimators for the given columns from qnet_2 and swap them into the current qnet in place. This makes it possible to simulate how the current qnet would behave if some of its estimators behaved like those of a second one, and can be used to identify the maximally divergent rules of organization between the two models. Also sets the attribute self.mixed to True.
- Parameters
qnet_2 (Qnet) – A Qnet instance
feature_name_list (list) – A list of variable (feature) names that would be replaced in self from qnet_2
- Returns
- Return type
None
- predict_distribution(column_to_item, column)¶
Predict the probability distribution for a given column.
A given column value may not appear in the resulting output. If so, the probability of that value is 0.
- Parameters
column_to_item (dict) – dictionary mapping the column to the values the columns take
column (int) – column index
- Returns
output – dictionary mapping possible column values to probability values
- Return type
dictionary
- predict_distributions(seq)¶
Predict the probability distributions for all the columns.
If you do not want to set a particular value for an index of seq, then set the value at the index to the global nan_value. By default, this value is the empty string.
The length of the input sequence must match the size of feature_names.
- Parameters
seq (list) – list of values
- Returns
prob_distributions – list of dictionaries of probability distributions, one for each index
- Return type
list
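Because values with zero probability may be absent from a returned dictionary, lookups should default to 0 rather than index directly. The sketch below uses a hand-written list shaped like the documented return value. The data and the helper name are hypothetical.

```python
def probability_of(distribution, value):
    """Look up a value's probability, treating absent values as 0."""
    return distribution.get(value, 0.0)


# Hypothetical output shaped like predict_distributions: one dict per column.
prob_distributions = [
    {"A": 0.7, "G": 0.3},
    {"C": 1.0},
]
p_seen = probability_of(prob_distributions[0], "G")
p_absent = probability_of(prob_distributions[1], "T")
```

Here `p_seen` is 0.3, and `p_absent` is 0.0 because "T" never appears in the second column's distribution.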
- viz_trees(tree_path, draw=True, big_enough_threshold=-1, prog='dot', format='pdf', remove_dotfile=True, remove_newline=False, addurl=False, base_url='https://zed.createuky.net/', **kwargs)¶
Generate dot files for individual estimators, and optionally render them to pdf.
- Parameters
tree_path (string) – path to where dotfiles will be generated. Creates directory if not present
draw (bool) – Set to True to render dotfiles (default True)
prog (str) – Graphviz program used for rendering (default: dot, other values: neato, fdp, sfdp)
format (str) – Format of rendered file (default: pdf, other values png, svg)
remove_dotfile (bool) – Delete all dot files if set to True (default: True)
remove_newline (bool) – Remove newlines from edge labels in the tree visualization to prettify it
addurl (bool) – Add url links to the node labels
base_url (str) – Url base for node url links
**kwargs (dict, optional) – Additional keyword arguments to be passed to export_qnet_tree. Refer to the documentation of export_qnet_tree for details on accepted arguments.
- Returns
- Return type
None
- quasinet.qnet.export_qnet_graph(qnet, threshold, outfile)¶
Export the qnet as a graph of dependencies. The output will be in the .dot file format for graphs.
- Parameters
qnet (Qnet) – A Qnet instance
threshold (float) – Numeric cutoff for edge weights. If the edge weights exceed this cutoff, then we include it into the graph.
outfile (str) – File name to save to.
- Returns
- Return type
None
- quasinet.qnet.export_qnet_tree(qnet, index, outfile, outformat='graphviz', detailed_output=False, pen_width=3, edge_color='black', edge_label_color='black', dpi=200, text_color='black', font_size=10, background_color='transparent', rotate=False, edge_fontcolor='grey14', min_size=1, color_alpha=1.5, labels=None, add_legend=False)¶
Export a tree from qnet. The index determines which tree to export.
- Parameters
qnet (Qnet) – A Qnet instance
index (int) – Index of the tree to export
outformat (str) – Can only be graphviz for now. This will output a .dot file, which you can then compile using a command like dot -Tpng file.dot -o file.png
detailed_output (bool) – If True return detailed probabilities of output labels in leaf nodes. default: False
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
color_alpha (float) – Parameter for color brightness
labels (list) – List of all labels, optional
- Returns
- Return type
None
- quasinet.qnet.fit_save(df, slice_range=None, n_jobs=10, alpha=0.1, file_prefix='model', strtype='U5', low_mem=True, compress=True)¶
Fit and save qnet model as gz.
- Parameters
df (pandas.DataFrame) – pandas dataframe input data with featurenames as columns
slice_range (numpy 1D array of ints) – Index array used to build the model, passed to index_array in fit
alpha (float) – qnet fit significance level
file_prefix (str) – filename prefix for model file
low_mem (bool) – turning on low memory save (default: True)
compress (bool) – True if we want gzipped models (default: True)
strtype (str) – string type specification (default: U5)
- Returns
qnet
- Return type
- quasinet.qnet.load_qnet(f, gz=False)¶
Load the qnet from a file.
- Parameters
f (str) – File name.
gz (bool) – Bool to indicate if file is gzipped (default: False)
- Returns
qnet
- Return type
- quasinet.qnet.membership_degree(seq, qnet)¶
Compute the membership degree of a sequence in a qnet.
- Parameters
seq (1d array-like) – Array of values
qnet (Qnet) – the Qnet that seq belongs to
- Returns
membership_degree – membership degree
- Return type
numeric
- quasinet.qnet.qdistance(seq1, seq2, qnet1, qnet2, mismatch=False, FULL_C=False)¶
Compute the Jensen-Shannon divergence of discrete probability distributions.
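For reference, the Jensen-Shannon divergence between two discrete distributions can be sketched in plain Python as below. This shows only the textbook formula with base-2 logarithms. How qdistance applies it to the predictions of qnet1 and qnet2, and how it aggregates over features, is not described here, so the helper names are illustrative only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support).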
- quasinet.qnet.qdistance_matrix(seqs1, seqs2, qnet1, qnet2)¶
Compute a distance matrix with the qdistance metric.
- quasinet.qnet.save_qnet(qnet, f, low_mem=True, gz=False)¶
Save the qnet to a file.
TODO: using joblib is actually less memory efficient than using pickle. However, it is unclear whether this is a general problem or only happens under certain circumstances.
TODO: we may have to delete and garbage-collect some attributes of the qnet to save memory, for example .feature_importances_ and .available_features_
- Parameters
qnet (Qnet) – A Qnet instance
f (str) – File name
low_mem (bool) – If True, save the Qnet with low memory by deleting all data attributes except the tree structure (default: True)
gz (bool) – Specification if we want gzipped output (default: False)
- Returns
- Return type
None
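The effect of the gz flag can be illustrated with a generic round trip through the standard library. This is a sketch only: it uses pickle rather than joblib, a plain dictionary stands in for a Qnet, and `save_object` and `load_object` are hypothetical helper names, not part of quasinet.

```python
import gzip
import os
import pickle
import tempfile


def save_object(obj, path, gz=False):
    """Pickle obj to path, gzip-compressing the file when gz is True."""
    opener = gzip.open if gz else open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path, gz=False):
    """Load a pickled object; gz must match the flag used when saving."""
    opener = gzip.open if gz else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


# A plain dictionary standing in for a fitted model.
model = {"feature_names": ["a", "b"], "trees": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl.gz")
    save_object(model, path, gz=True)
    restored = load_object(path, gz=True)
print(restored == model)
```

As with load_qnet, the reader has to be told whether the file is gzipped, so the same flag value should be used on both sides.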
quasinet.qsampling module¶
- quasinet.qsampling.qsample(seq, qnet, steps, baseline_prob=None, force_change=False, alpha=None, random_seed=None)¶
Perform q-sampling for multiple steps or specified indices.
Qsampling works as follows: Say you have a sequence and a qnet. Then we randomly pick one of the items in the sequence (or a specified index) and change the value of that item based on the prediction of the qnet.
- Parameters
seq (1d array-like) – Array of values
qnet (Qnet) – The Qnet that seq belongs to
steps (int or 1d array-like) – If an integer, the number of steps to run q-sampling. If an array, specifies the indices to q-sample in order.
baseline_prob (1d array-like, optional) – Baseline probability for sampling which index. Ignored if steps is an array.
force_change (bool, optional) – Whether to force the sequence to change when sampling.
alpha (float, optional) – Scalar multiple of qnet object, can be any real number.
random_seed (int, optional) – Seed for reproducible randomness.
- Returns
seq – q-sampled sequence
- Return type
1d array-like
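The sampling loop described above can be sketched without the library. In this illustration `predict_distribution` is a hypothetical stand-in for the qnet's prediction at one index; baseline_prob, force_change, and alpha are not modeled.

```python
import random


def qsample_sketch(seq, predict_distribution, steps, rng):
    """Repeatedly pick a random index and resample its value from a predicted distribution."""
    seq = list(seq)  # work on a copy so the input is left untouched
    for _ in range(steps):
        index = rng.randrange(len(seq))
        distrib = predict_distribution(seq, index)  # dict: value -> probability
        values = list(distrib)
        probs = [distrib[v] for v in values]
        seq[index] = rng.choices(values, weights=probs, k=1)[0]
    return seq


def toy_predictor(seq, index):
    # Hypothetical predictor: ignores context and always returns the same distribution.
    return {"A": 0.7, "C": 0.3}


original = ["A", "C", "G", "T"]
sampled = qsample_sketch(original, toy_predictor, steps=10, rng=random.Random(0))
print(sampled)
```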
- quasinet.qsampling.targeted_qsample(seq1, seq2, qnet, steps, force_change=False)¶
Perform targeted q-sampling for multiple steps.
seq1 is q-sampled towards seq2.
This is similar to qsample, except that we perform changes to seq1 to try to approach seq2.
- Parameters
seq1 (1d array-like) – Array of values
seq2 (1d array-like) – Array of values.
qnet (Qnet) – The Qnet that seq1 belongs to
steps (int) – Number of steps to run q-sampling
force_change (bool) – Whether to force the sequence to change when sampling.
- Returns
seq – q-sampled sequence
- Return type
1d array-like
quasinet.qseqtools module¶
- quasinet.qseqtools.list_trained_qnets()¶
List the possible qnets we can use.
- Parameters
None –
- Returns
- Return type
None
- quasinet.qseqtools.load_sequence(file)¶
Load a fasta sequence from file.
- Parameters
file (str) – File of a fasta sequence
- Returns
seq – A fasta sequence
- Return type
1d array-like
- quasinet.qseqtools.load_trained_qnet(qnet_type, extra_descriptions)¶
Load the pre-trained qnet.
Examples
>>> load_trained_qnet('coronavirus', 'bat')
>>> load_trained_qnet('influenza', 'h1n1;na;2009')
- Parameters
qnet_type (str) – The type of qnet to load
extra_descriptions (str) – Extra descriptions for which qnet to load. The descriptions must be split by ; for influenza.
- Returns
trained_qnet – A trained qnet
- Return type
quasinet.scorers module¶
- quasinet.scorers.MEAN(z)¶
- quasinet.scorers.approx_wdcor(x, y)¶
Approximate distance correlation by binning arrays
- NOTE: Code ported from R function approx.dcor at:
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns
dcor – Distance correlation
- Return type
float
- quasinet.scorers.c_dcor(x, y)¶
Wrapper for C version of distance correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns
dcor – Distance correlation
- Return type
float
- quasinet.scorers.c_wdcor(x, y, weights)¶
Wrapper for C version of weighted distance correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1
- Returns
dcor – Distance correlation
- Return type
float
- quasinet.scorers.cca(X, Y)¶
Largest canonical correlation
- Parameters
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
- Returns
cor – Largest canonical correlation between X and Y
- Return type
float
- quasinet.scorers.cca_fast(X, Y)¶
Largest canonical correlation
- Parameters
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
- Returns
cor – Largest correlation between X and Y
- Return type
float
- quasinet.scorers.chi2(x, y)¶
x and y are ordinal representations of categorical variables.
- quasinet.scorers.create_chi2_table(x, y)¶
Create a chi-squared contingency table using x and y
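As an illustration of what such a table holds, the sketch below builds a contingency table from two categorical arrays and computes the Pearson chi-squared statistic from it. The helper names are hypothetical, and the library's own table layout and return values may differ.

```python
from collections import Counter


def contingency_table(x, y):
    """Rows are the sorted unique values of x, columns the sorted unique values of y."""
    rows = sorted(set(x))
    cols = sorted(set(y))
    counts = Counter(zip(x, y))
    return [[counts[(r, c)] for c in cols] for r in rows]


def chi2_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


table = contingency_table([0, 0, 1, 1], [0, 0, 1, 1])
print(table, chi2_statistic(table))
```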
- quasinet.scorers.gini_index(y, labels)¶
Gini index for node in tree
- Note: Despite being jitted, this function is still slow and a bottleneck in the actual training phase. Sklearn’s Cython version is used to find the best split, and this function is then called on the parent node and the two child nodes to calculate feature importances using the mean decrease impurity formula.
- Parameters
y (1d array-like) – Array of labels
labels (1d array-like) – Unique labels
- Returns
gini – Gini index
- Return type
float
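The Gini index and the mean decrease impurity calculation mentioned in the note can be sketched in plain Python. This is the textbook formula, not the jitted implementation; unlike gini_index, the sketch infers the unique labels from y instead of taking them as an argument.

```python
from collections import Counter


def gini_impurity(y):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(y)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(y).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurities of the two children."""
    n = len(parent)
    return (
        gini_impurity(parent)
        - len(left) / n * gini_impurity(left)
        - len(right) / n * gini_impurity(right)
    )


print(gini_impurity(["a", "b"]))
print(impurity_decrease(["a", "a", "b", "b"], ["a", "a"], ["b", "b"]))
```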
- quasinet.scorers.mc_fast(x, y, n_classes)¶
Multiple correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
- Returns
cor – Multiple correlation coefficient between x and y
- Return type
float
- quasinet.scorers.mi(x, y)¶
Mutual information
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns
info – Mutual information between x and y
- Return type
float
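For discrete arrays, mutual information can be estimated from empirical frequencies as in the sketch below. It uses natural logarithms; the logarithm base and any binning used by the library function are not stated here, so treat this as an illustration of the quantity rather than of mi itself.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

Identical binary sequences give log 2, and independent ones give 0.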
- quasinet.scorers.mse(y)¶
Mean squared error for node in tree
- Parameters
y (1d array-like) – Array of labels
- Returns
error – Mean squared error
- Return type
float
- quasinet.scorers.pcor(x, y)¶
Pearson correlation
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns
cor – Pearson correlation
- Return type
float
- quasinet.scorers.py_dcor(x, y)¶
Python port of C function for distance correlation
Note: Version is optimized for use with Numba
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns
dcor – Distance correlation
- Return type
float
- quasinet.scorers.py_wdcor(x, y, weights)¶
Python port of C function for weighted distance correlation
Note: Version is optimized for use with Numba
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1
- Returns
dcor – Distance correlation
- Return type
float
- quasinet.scorers.rdc(X, Y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶
Randomized dependence coefficient
- Parameters
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function
- Returns
cor – Randomized dependence coefficient between X and Y
- Return type
float
- quasinet.scorers.rdc_fast(x, y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶
Randomized dependence coefficient
- Parameters
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function
- Returns
cor – Randomized dependence coefficient between x and y
- Return type
float
quasinet.tree module¶
- class quasinet.tree.Node(col=None, col_pval=None, lthreshold=None, rthreshold=None, impurity=None, value=None, left=None, right=None, label_frequency=None)¶
Bases:
object
Decision node in tree
- Parameters
col (int) – Integer indexing the location of feature or column
col_pval (float) – Probability value from permutation test for feature selection
lthreshold (list) – List of items for taking the left edge down the tree
rthreshold (list) – List of items for taking the right edge down the tree
impurity (float) – Impurity measuring quality of split
value (1d array-like or float) – For classification trees, estimate of each class probability. For regression trees, central tendency estimate
left (Node) – Another Node
right (Node) – Another Node
label_frequency (dict) – Dictionary mapping label to its frequency
- quasinet.tree.get_nodes(root, get_leaves=True, get_non_leaves=True)¶
Traverse a tree and get all the nodes.
TODO: may need to change this into an iterator for speed purposes
If get_leaves and get_non_leaves are both True, then we will get all the nodes.
- Parameters
root (Node) – root node of the tree
get_leaves (bool) – If true, we get leaf nodes
get_non_leaves (bool) – If true, we get non leaf nodes.
- Returns
output – list of Node
- Return type
list
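The filtering behaviour of get_leaves and get_non_leaves can be sketched with a small stand-in tree. `ToyNode` and `collect_nodes` are hypothetical names; the sketch assumes a node is a leaf when it has no children, and the traversal order of the real function may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Minimal stand-in for a tree node with optional left and right children."""
    value: str
    left: Optional["ToyNode"] = None
    right: Optional["ToyNode"] = None


def collect_nodes(root, get_leaves=True, get_non_leaves=True):
    """Iterative pre-order traversal that keeps leaves and/or internal nodes."""
    output = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        is_leaf = node.left is None and node.right is None
        if (is_leaf and get_leaves) or (not is_leaf and get_non_leaves):
            output.append(node)
        # Push right first so the left child is visited first.
        stack.append(node.right)
        stack.append(node.left)
    return output


tree = ToyNode("root", ToyNode("l"), ToyNode("r", ToyNode("rl"), ToyNode("rr")))
print([n.value for n in collect_nodes(tree, get_non_leaves=False)])
```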
quasinet.utils module¶
- quasinet.utils.analyze_dot_file(dot_file, fracThreshold=0.0)¶
- quasinet.utils.assert_array_rank(X, rank)¶
Check if the input is a numpy array and has a certain rank.
- Parameters
X (array-like) – Array to check
rank (int) – Rank of the tensor to check
- Returns
- Return type
None
- quasinet.utils.assert_string_type(X, name)¶
Check if the input is of string datatype.
- Parameters
X (array-like) – Array to check
name (str) – Name of the input
- Returns
- Return type
None
- quasinet.utils.auc_score(y_true, y_prob)¶
ADD
- quasinet.utils.bayes_boot_probs(n)¶
Bayesian bootstrap sampling for case weights
- Parameters
n (int) – Number of Bayesian bootstrap samples
- Returns
p – Array of sampling probabilities
- Return type
1d array-like
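Bayesian bootstrap weights are commonly drawn from a flat Dirichlet distribution, which can be generated by normalizing independent exponential draws. The sketch below assumes that construction; whether bayes_boot_probs uses exactly this method is not stated here, and the helper name is hypothetical.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Draw n sampling probabilities from Dirichlet(1, ..., 1) via normalized Exp(1) draws."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


weights = bayesian_bootstrap_weights(5, random.Random(42))
print(weights)
```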
- quasinet.utils.big_enough(dot_file, big_enough_threshold=- 1)¶
- quasinet.utils.dot4svg(dot_file, output_file, output_svg_file, directory='trees1/', dpi=70, draw=True, base_url='https://zed.createuky.net/')¶
- quasinet.utils.drawtrees(dotfiles, prog='dot', format='pdf', big_enough_threshold=- 1)¶
- quasinet.utils.estimate_margin(y_probs, y_true)¶
Estimates margin function of forest ensemble
Note : This function is similar to margin in R’s randomForest package
- Parameters
y_probs (2d array-like) – Predicted probabilities where each row represents predicted class distribution for sample and each column corresponds to estimated class probability
y_true (1d array-like) – Array of true class labels
- Returns
margin – Estimated margin of forest ensemble
- Return type
float
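In the random-forest literature, the margin of a sample is the estimated probability of its true class minus the largest probability among the other classes. The sketch below averages that quantity over samples. It assumes y_true holds integer column indices into y_probs and that the returned float is a mean; both are assumptions, not confirmed details of estimate_margin.

```python
def margin_sketch(y_probs, y_true):
    """Mean of (probability of true class - largest probability among other classes)."""
    margins = []
    for probs, label in zip(y_probs, y_true):
        true_prob = probs[label]
        best_other = max(p for k, p in enumerate(probs) if k != label)
        margins.append(true_prob - best_other)
    return sum(margins) / len(margins)


print(margin_sketch([[0.8, 0.2], [0.3, 0.7]], [0, 1]))
```

A positive value means the true class tends to win, and a negative value indicates misclassification on average.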
- quasinet.utils.find_matching_indices(A, B)¶
- quasinet.utils.generate_seed()¶
Generates a seed as a function of the current time and thread ID for seeding the random number generator. Must be used when a large number of q-samples are drawn in parallel.
- quasinet.utils.getNull(model, strtype='U5')¶
Function to generate an array of empty strings of same length as feature names in the model.
- Parameters
model (Qnet object) – The Qnet model.
strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.
- Returns
An array of empty strings.
- Return type
numpy.ndarray
- quasinet.utils.logger(name, message)¶
Prints messages with style “[NAME] message”
- Parameters
name (str) – Short title of message, for example, train or test
message (str) – Main description to be displayed in terminal
- Returns
- Return type
None
- quasinet.utils.numparameters(qnetmodel)¶
Computes the total number of parameters in a qnet.
- Parameters
qnetmodel (Qnet object) – The Qnet model.
- Returns
int – number of independent parameters.
float – number of internal nodes per model column.
- quasinet.utils.powerset(s)¶
Get the power set of a list or set.
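A power set can be produced with the standard itertools recipe shown below. The return type here, a list of tuples ordered by subset size, is a choice made for this sketch and may not match what powerset returns.

```python
from itertools import chain, combinations


def powerset_sketch(s):
    """All subsets of s as tuples, from the empty tuple up to the full collection."""
    items = list(s)
    return list(
        chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    )


print(powerset_sketch([1, 2]))
```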
- quasinet.utils.remove_newline_in_dotfile(file_path)¶
Remove newlines from edge labels in a dot file.
- quasinet.utils.remove_zeros(r, axis)¶
Remove rows along a certain axis where the value is all zero.
- quasinet.utils.sample_from_dict(distrib)¶
Choose an item from the distribution
- Parameters
distrib (dict) – Dictionary mapping keys to their probability values
- Returns
item – A chosen key from the dictionary
- Return type
key of dict
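Sampling a key in proportion to its probability can be sketched with random.choices from the standard library. The `sample_key` helper and its explicit rng argument are illustrative assumptions; sample_from_dict itself takes only the dictionary.

```python
import random


def sample_key(distrib, rng):
    """Pick one key of distrib with probability proportional to its value."""
    keys = list(distrib)
    return rng.choices(keys, weights=[distrib[k] for k in keys], k=1)[0]


rng = random.Random(0)
print(sample_key({"A": 0.7, "C": 0.2, "G": 0.1}, rng))
```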
- quasinet.utils.scientific_notation(num)¶
Convert a number into scientific notation
- Parameters
num (float) – Any number
- Returns
output – String representation of the number
- Return type
str
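Python's format specification already supports scientific notation, as the sketch below shows. The exact string layout produced by scientific_notation is not documented here, so the two-digit mantissa and the `to_scientific` name are assumptions.

```python
def to_scientific(num, digits=2):
    """Format num with the given number of digits after the decimal point."""
    return f"{num:.{digits}e}"


print(to_scientific(12345.678))  # 1.23e+04
print(to_scientific(0.00012))    # 1.20e-04
```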
quasinet.zqnet module¶
- quasinet.zqnet.extract_diagonal_blocks(M, L)¶
- quasinet.zqnet.get_description_curl(code)¶
- quasinet.zqnet.remove_suffix(s)¶
- quasinet.zqnet.replace_with_d(S, j, d)¶
- class quasinet.zqnet.zQnet(*args, **kwargs)¶
Bases:
quasinet.qnet.Qnet
Extended Qnet architecture (zQnet).
An extension of the Qnet class with added functionality and attributes. This class introduces risk computation based on a series of metrics and provides a way to set and retrieve a model description.
Qnet : Base class
- Attributes
nullsequences (array-like) – Collection of sequences considered null or baseline for risk calculations.
target (str, optional) – Target variable or description. Not currently utilized in methods.
description (str) – Descriptive notes or commentary about the model.
auc_estimate (array-like) – AUCs obtained during optimization of null sequences.
training_index (array-like, optional) – Indices used during the training phase. Not currently utilized in methods.
- Parameters
*args – Variable length argument list inherited from Qnet.
**kwargs – Arbitrary keyword arguments inherited from Qnet.
- personal_zshap(s, eps=1e-07)¶
A superfast approximation of SHAP for zQnet for individual samples
- Parameters
s (numpy array of str) – The sequence around which we are evaluating perturbations.
eps (float) – shap value cutoff
- Returns
DataFrame of SHAP values, with the index mapped to short descriptions of ICD-10 codes
- Return type
pandas.DataFrame
- risk(X)¶
Compute the mean risk value for input X based on its distance from null sequences.
- Parameters
X (2d array-like) – Input data whose risk is to be computed.
- Returns
Mean risk value for the input X.
- Return type
float
- risk_max(X)¶
Compute the maximum risk value for input X based on its distance from null sequences.
- Parameters
X (2d array-like) – Input data whose risk is to be computed.
- Returns
Maximum risk value for the input X.
- Return type
float
- risk_median(X)¶
Compute the median risk value for input X based on its distance from null sequences.
- Parameters
X (2d array-like) – Input data whose risk is to be computed.
- Returns
Median risk value for the input X.
- Return type
float
- set_description(markdown_file)¶
Set the description attribute for the model using content from a markdown file.
- Parameters
markdown_file (str) – Path to the markdown file containing the model’s description.
- Returns
Content of the markdown file.
- Return type
str
- zshap(seq=None, m=35)¶
A superfast approximation of SHAP for zQnet
- Parameters
seq (numpy array of str) – The sequence around which we are evaluating perturbations. By default it is the array of empty strings, which represents average behavior
m (int) – Length of shap return dataframe
- Returns
DataFrame of SHAP values, with the index mapped to short descriptions of ICD-10 codes
- Return type
pandas.DataFrame