quasinet package
Submodules
quasinet.ciforest module
- class quasinet.ciforest.CIForestClassifier(min_samples_split=2, alpha=0.05, selector='mc', max_depth=-1, n_estimators=100, max_feats='sqrt', n_permutations=100, early_stopping=True, muting=True, verbose=0, bootstrap=True, bayes=True, class_weight='balanced', n_jobs=-1, random_state=None)¶
Bases:
BaseEstimator, ClassifierMixin
Conditional forest classifier
- Parameters:
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
selector (str) – Variable selector for finding strongest association between a feature and the label
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
bootstrap (bool) – Whether to perform bootstrap sampling for each tree
bayes (bool) – If True, performs Bayesian bootstrap sampling
class_weight (str) – Type of sampling during bootstrap, None for regular bootstrapping, ‘balanced’ for balanced bootstrap sampling, and ‘stratify’ for stratified bootstrap sampling
n_jobs (int) – Number of jobs for permutation testing
random_state (int) – Sets seed for random number generator
- fit(X, y)¶
Fit conditional forest classifier
- Parameters:
X (2d array-like) – Array of features
y (1d array-like) – Array of labels
- Returns:
self – Instance of CIForestClassifier
- Return type:
- predict(X)¶
Predicts class labels for feature vectors X
- Parameters:
X (2d array-like) – Array of features
- Returns:
y – Array of predicted classes
- Return type:
1d array-like
- predict_proba(X)¶
Predicts class probabilities for feature vectors X
- Parameters:
X (2d array-like) – Array of features
- Returns:
class_probs – Array of predicted class probabilities
- Return type:
2d array-like
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CIForestClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- quasinet.ciforest.balanced_sampled_idx(random_state, y, bayes, min_class_p)¶
Indices for balanced bootstrap sampling in classification
- Parameters:
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels
- Returns:
idx – Balanced sampled indices for each class
- Return type:
list
- quasinet.ciforest.balanced_unsampled_idx(random_state, y, bayes, min_class_p)¶
Unsampled indices for balanced bootstrap sampling in classification
- Parameters:
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
min_class_p (float) – Minimum proportion of class labels
- Returns:
idx – Balanced unsampled indices for each class
- Return type:
list
- quasinet.ciforest.normal_sampled_idx(random_state, n, bayes)¶
Indices for bootstrap sampling
- Parameters:
random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns:
idx – Sampled indices
- Return type:
list
- quasinet.ciforest.normal_unsampled_idx(random_state, n, bayes)¶
Unsampled indices for bootstrap sampling
- Parameters:
random_state (int) – Sets seed for random number generator
n (int) – Sample size
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns:
idx – Unsampled indices
- Return type:
list
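The sampled and unsampled index helpers above can be pictured with a small standard-library sketch. This is an illustration under stated assumptions, not the library's implementation: the names sampled_idx_sketch and unsampled_idx_sketch are hypothetical, and the Bayesian bootstrap is assumed to mean drawing indices with Dirichlet(1, ..., 1) weights.

```python
import random


def sampled_idx_sketch(seed, n, bayes):
    """Draw n indices from range(n) with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    if bayes:
        # Assumed Bayesian bootstrap: Dirichlet(1, ..., 1) weights, obtained
        # by normalizing independent Exponential(1) draws.
        gaps = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(gaps)
        weights = [g / total for g in gaps]
        return rng.choices(range(n), weights=weights, k=n)
    # Regular bootstrap: every index is equally likely on each draw.
    return rng.choices(range(n), k=n)


def unsampled_idx_sketch(seed, n, bayes):
    """Indices that the same seed never draws (the out-of-bag set)."""
    drawn = set(sampled_idx_sketch(seed, n, bayes))
    return [i for i in range(n) if i not in drawn]


in_bag = sampled_idx_sketch(0, 10, bayes=True)
out_of_bag = unsampled_idx_sketch(0, 10, bayes=True)
```

Reusing the same seed in both helpers is what keeps the two index sets complementary.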
- quasinet.ciforest.stratify_sampled_idx(random_state, y, bayes)¶
Indices for stratified bootstrap sampling in classification
- Parameters:
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns:
idx – Stratified sampled indices for each class
- Return type:
list
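As a rough picture of stratified bootstrap sampling, the sketch below resamples with replacement inside each class so that class counts are preserved. It assumes the textbook definition of a stratified bootstrap; stratified_bootstrap_idx is a hypothetical name, and the library's own routine (including its bayes option) may differ.

```python
import random
from collections import Counter, defaultdict


def stratified_bootstrap_idx(seed, y):
    """Resample indices with replacement within each class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    idx = []
    for label in sorted(by_class):
        members = by_class[label]
        # Draw as many indices as the class has members, so counts are kept.
        idx.extend(rng.choices(members, k=len(members)))
    return idx


y = ["a", "a", "a", "b", "b", "c"]
idx = stratified_bootstrap_idx(1, y)
resampled_counts = Counter(y[i] for i in idx)
```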
- quasinet.ciforest.stratify_unsampled_idx(random_state, y, bayes)¶
Unsampled indices for stratified bootstrap sampling in classification
- Parameters:
random_state (int) – Sets seed for random number generator
y (1d array-like) – Array of labels
bayes (bool) – If True, performs Bayesian bootstrap sampling
- Returns:
idx – Stratified unsampled indices for each class
- Return type:
list
quasinet.citrees module
- class quasinet.citrees.CITreeBase(min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶
Bases:
object
Base class for conditional inference tree.
- Parameters:
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
n_permutations (int) – Number of permutations during feature selection
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
muting (bool) – Whether to perform variable muting
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator
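The interaction between alpha and early_stopping can be sketched as follows. This is a simplified illustration, not the library's code: select_feature is a hypothetical helper, the p-values are supplied directly instead of being computed by permutation tests, and picking the smallest significant p-value when early_stopping is False is an assumption.

```python
def select_feature(p_values, alpha=0.05, early_stopping=False):
    """Pick a splitting feature from (feature_index, p_value) pairs.

    Returns None when no feature is significant at level alpha.
    """
    best = None
    for feature, p in p_values:
        if p >= alpha:
            continue  # not significant, never eligible for a split
        if early_stopping:
            return feature  # first significant feature wins, later tests are skipped
        if best is None or p < best[1]:
            best = (feature, p)
    return None if best is None else best[0]


candidates = [(0, 0.20), (1, 0.04), (2, 0.001)]
first_hit = select_feature(iter(candidates), alpha=0.05, early_stopping=True)
strongest = select_feature(iter(candidates), alpha=0.05, early_stopping=False)
```

With early stopping the loop returns feature 1 without looking at feature 2, which is where the saving in permutation tests would come from.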
- fit(X, y=None)¶
Train model.
X, y must contain only string datatypes.
- Parameters:
X (2d array-like) – Array of categorical features
y (1d array-like) – Array of labels
- Returns:
self – Instance of CITreeBase class
- Return type:
- predict(*args, **kwargs)¶
Predicts labels on test data. This method should not be callable from base class.
- predict_label(X, tree=None)¶
Predicts label
- Parameters:
X (1d array-like) – Array of features for single sample
tree (CITreeBase) – Trained tree
- Returns:
label – Predicted label
- Return type:
str
- print_tree(tree=None, indent=' ', child=None)¶
Prints tree structure
- Parameters:
tree (CITreeBase) – Trained tree model
indent (str) – Indent spacing
child (Node) – Left or right child node
- Return type:
None
- class quasinet.citrees.CITreeClassifier(min_samples_split=2, alpha=0.05, selector='chi2', max_depth=-1, max_feats=-1, n_permutations=100, early_stopping=False, muting=True, verbose=0, random_state=None)¶
Bases:
CITreeBase, BaseEstimator, ClassifierMixin
Conditional inference tree classifier
NOTE: as of now, the features can only be categorical
- Parameters:
selector (str) – Variable selector for finding strongest association between a feature and the label
The remaining parameters are derived from the CITreeBase class; see its constructor for parameter definitions.
- fit(X, y, labels=None)¶
Train conditional inference tree classifier
- Parameters:
X (2d array-like) – Array of features
y (1d array-like) – Array of labels
labels (1d array-like) – Array of unique class labels
- Returns:
self – Instance of CITreeClassifier class
- Return type:
- predict(X)¶
Predicts class labels for feature vectors X
- Parameters:
X (2d array-like) – Array of features
- Returns:
y – Array of predicted classes
- Return type:
1d array-like
- predict_proba(X)¶
Predicts class probabilities for feature vectors X
- Parameters:
X (2d array-like) – Array of features
- Returns:
class_probs – Array of predicted class probabilities
- Return type:
2d array-like
- set_fit_request(*, labels: bool | None | str = '$UNCHANGED$') CITreeClassifier¶
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the labels parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') CITreeClassifier¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the sample_weight parameter in score.
- Returns:
self – The updated object.
- Return type:
object
- quasinet.citrees.get_feature_importance(citree, normalize=True)¶
Get the feature importance of the citree.
- Parameters:
citree (CITreeBase) – A conditional inference tree
normalize (bool) – Whether to normalize the feature importance or not
- Returns:
col_to_importance – Mapping from column index to total feature importance
- Return type:
dict
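One plausible reading of normalize=True is that the importances are rescaled to sum to 1. The sketch below shows that reading only; normalize_importance is a hypothetical helper and the library may normalize differently.

```python
def normalize_importance(col_to_importance):
    """Rescale a {column index: importance} mapping so the values sum to 1."""
    total = sum(col_to_importance.values())
    if total == 0:
        # Nothing to rescale; return an unchanged copy.
        return dict(col_to_importance)
    return {col: value / total for col, value in col_to_importance.items()}


raw = {0: 2.0, 3: 6.0}
normalized = normalize_importance(raw)
```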
quasinet.curvature module
- quasinet.curvature.compute_curvature(p, delta)¶
Computes the curvature (scalar curvature) at a given point in the space of Quasinets.
The curvature R is computed as:
\[R = G^{ij} R_{ij}\]
where G^{ij} is the inverse of the metric tensor G_{ij}, and R_{ij} is the Ricci curvature.
- Parameters:
p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction
- Returns:
The curvature at point p
- Return type:
float
- quasinet.curvature.compute_metric_tensor(p_distrib, delta, progress=False)¶
Computes the metric tensor at a given point in the space of Quasinets.
The metric tensor G_ij is defined as:
\[G_{ij} = \frac{1}{2} \left( D(p + \delta p_i + \delta p_j, p) - D(p + \delta p_i, p) - D(p + \delta p_j, p) + D(p, p) \right)\]
where D is the distance function, p_i is the i-th unit Quasinet, and delta is a small perturbation.
- Parameters:
p_distrib (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which the metric tensor is calculated
delta (float) – A small number representing a change in each coordinate direction
progress (bool) – Show a progress bar
- Returns:
The metric tensor at point p (the quasinet for which p_distrib is calculated)
- Return type:
ndarray
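The finite-difference formula for G_ij can be tried on a toy space. In the sketch below, points are plain coordinate lists and D is the squared Euclidean distance, both of which are assumptions made for illustration; the library applies the same pattern to Quasinet distributions with its own distance function.

```python
def squared_distance(a, b):
    """Toy stand-in for the distance function D."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shifted(p, delta, *directions):
    """Return p moved by delta along each listed coordinate direction."""
    q = list(p)
    for k in directions:
        q[k] += delta
    return q


def metric_tensor_sketch(p, delta, dist=squared_distance):
    """G_ij = 1/2 (D(p+d_i+d_j, p) - D(p+d_i, p) - D(p+d_j, p) + D(p, p))."""
    n = len(p)
    base = dist(p, p)
    single = [dist(shifted(p, delta, i), p) for i in range(n)]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            both = dist(shifted(p, delta, i, j), p)
            G[i][j] = 0.5 * (both - single[i] - single[j] + base)
    return G


G = metric_tensor_sketch([0.2, 0.5, 0.3], delta=0.1)
```

For this toy distance the result is delta squared on the diagonal and zero elsewhere, which gives a quick sanity check of the formula.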
- quasinet.curvature.compute_metric_tensor_derivative(p, delta)¶
Computes the derivative of the metric tensor at a given point in the space of Quasinets.
The derivative of the metric tensor G_ij with respect to the k-th coordinate is computed as:
\[\frac{\partial G_{ij}}{\partial p_k} = \frac{G_{ij}(p + \delta p_k) - G_{ij}(p)}{\delta}\]
- Parameters:
p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’ at which to compute the metric tensor derivative
delta (float) – A small number representing a change in each coordinate direction
- Returns:
The derivative of the metric tensor at point p
- Return type:
ndarray
- quasinet.curvature.compute_ricci_curvature(p, delta)¶
Computes the Ricci curvature at a given point in the space of Quasinets.
The Ricci curvature R_ij is computed as:
\[R_{ij} = G^{kl} \left( \frac{\partial^2 G_{ij}}{\partial p_k \partial p_l} - \frac{1}{2} \frac{\partial^2 G_{kl}}{\partial p_i \partial p_j} \right)\]
where G^{kl} is the inverse of the metric tensor G_{kl}, and the partial derivatives are computed by taking the limit as delta goes to zero.
- Parameters:
p (list[dict(str, float)]) – Output of quasinet.predict_distributions() for the quasinet ‘p’
delta (float) – A small number representing a change in each coordinate direction
- Returns:
The Ricci curvature at point p
- Return type:
ndarray
- quasinet.curvature.delta_pi(qnet_instance, index, delta)¶
This function modifies the distribution of the given Quasinet instance by scaling it with a scalar value in the direction of the given index.
- Parameters:
qnet_instance (Quasinet) – The Quasinet instance to modify
index (int) – The index of the feature direction to scale
delta (float) – The scalar to scale the distribution with
- Returns:
The Quasinet instance with modified distribution
- Return type:
Quasinet
- quasinet.curvature.dist_scalr_mult(D1, a)¶
Multiply each value in the dictionary with scalar ‘a’ and renormalize to get a valid probability distribution.
- Parameters:
D1 (dict) – Dictionary where each key-value pair represents an item and its probability.
a (float) – Scalar to multiply with each value of D1.
- Returns:
New dictionary with each value scaled and renormalized.
- Return type:
dict
- quasinet.curvature.dist_sum(D1, D2)¶
Add each corresponding value in D1 and D2, then renormalize to get a valid probability distribution.
- Parameters:
D1 (dict) – Two dictionaries where each key-value pair represents an item and its probability.
D2 (dict) – Two dictionaries where each key-value pair represents an item and its probability.
- Returns:
New dictionary with each value being the sum of the corresponding values in D1 and D2, renormalized.
- Return type:
dict
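The two dictionary operations described above, scalar multiplication and summation followed by renormalization, can be sketched in a few lines. The function names carry a _sketch suffix because they are illustrations rather than the library's code, and treating a key missing from one dictionary as probability 0 is an assumption.

```python
def renormalize(d):
    """Divide every value by the total so the values sum to 1."""
    total = sum(d.values())
    return {key: value / total for key, value in d.items()}


def dist_scalar_mult_sketch(d1, a):
    """Scale each probability by a, then renormalize."""
    return renormalize({key: a * value for key, value in d1.items()})


def dist_sum_sketch(d1, d2):
    """Add matching probabilities (missing keys count as 0), then renormalize."""
    keys = set(d1) | set(d2)
    return renormalize({key: d1.get(key, 0.0) + d2.get(key, 0.0) for key in keys})


mixed = dist_sum_sketch({"A": 0.5, "B": 0.5}, {"A": 1.0})
scaled = dist_scalar_mult_sketch({"A": 0.25, "B": 0.75}, 2.0)
```

Read literally, scaling every value by the same positive scalar and renormalizing returns the input distribution, so in this sketch the scalar only matters when the scaled values are combined with something else before renormalizing. Whether the library behaves the same way is not stated in the description above.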
- quasinet.curvature.distance_function(p, q, NULL=None, strtype='U5')¶
Computes the distance between two Quasinets.
- Parameters:
p, q (Quasinet) – The Quasinets to compute the distance between
- Returns:
The distance between p and q
- Return type:
float
- quasinet.curvature.distance_function_distrib(p, q, i)¶
Computes the distance between two Quasinets, assuming that p and q differ only at the estimator coordinates listed in i.
- Parameters:
p (list[dict(str, float)]) – Output of quasinet.predict_distributions()
q (list[dict(str, float)]) – Output of quasinet.predict_distributions()
i (1d numpy array) – Indices at which p and q differ
- quasinet.curvature.mt_worker(args)¶
- quasinet.curvature.perturb_quasinet(qnet_instance, index, delta)¶
Perturbs a Quasinet in the direction of the i-th feature.
- Parameters:
p (list[dict(str, float)]) – Output of quasinet.predict_distributions()
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation
- Returns:
The perturbed Quasinet
- Return type:
Quasinet
- quasinet.curvature.perturb_quasinet_distrib(p_distrib_, index, delta)¶
Perturbs a Quasinet in the direction of the i-th feature, using only the distributions at each estimator, which are produced by the predict_distributions function
- Parameters:
p_distrib_ (list[dict(str, float)]) – Output of quasinet.predict_distributions()
index (int) – The index of the feature direction to perturb in
delta (float) – The magnitude of the perturbation
- Returns:
The perturbed Quasinet
- Return type:
Quasinet
- quasinet.curvature.scalarmod_predict_distribution(self, column_to_item, column, **kwargs)¶
Modify the predict_distribution function of the Quasinet object to scale the output probabilities for a specified feature.
- Parameters:
self (Quasinet object) – The Quasinet instance.
column_to_item (dict) – A dictionary mapping from column names to specific items.
column (str) – The name of the column (feature) to scale.
**kwargs (dict) – Additional arguments passed to the predict_distribution function.
- Returns:
A dictionary of probabilities for each item in the specified column.
- Return type:
dict
- quasinet.curvature.sum_predict_distribution(self, column_to_item, column, **kwargs)¶
quasinet.export module
- class quasinet.export.GraphvizTreeExporter(tree, outfile, response_name, feature_names, text_color='black', edge_color='gray', font_size=10, edge_label_color='deepskyblue4', pen_width=2, background_color='transparent', dpi=200, edge_fontcolor='grey14', rotate=False, add_legend=True, min_size=1, color_alpha=-1.5, labels=None, detailed_output=False)¶
Bases:
object
Export the tree using graphviz.
- Parameters:
qnet (Qnet) – A Qnet instance
outfile (str) – Output file to save results to
response_name (str) – Name of the y variable that we are predicting
feature_names (list) – Names of each of the features
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
detailed_output (bool) – If False (default), output the probability of the most likely label in each leaf; otherwise output the full probability distribution
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
labels (list) – List of all labels, optional
- Return type:
None
- export()¶
- class quasinet.export.QnetGraphExporter(qnet, outfile, threshold)¶
Bases:
object
Export the qnet as a graph to a dot file format.
- Parameters:
qnet (Qnet) – A Qnet instance
outfile (str) – Output file to save results to
threshold (float) – Numeric cutoff for edge weights. If the edge weights exceed this cutoff, then we include it into the graph.
- Return type:
None
- export()¶
quasinet.feature_importance module
quasinet.feature_selectors module
- quasinet.feature_selectors.permutation_test_chi2(x, y, B=100, random_state=None, **kwargs)¶
Permutation test using chi-squared.
This is used when x and y are nominal variables.
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_dcor(x, y, B=100, random_state=None)¶
Permutation test for distance correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_dcor_parallel(x, y, B=100, n_jobs=-1, random_state=None)¶
Parallel implementation of permutation test for distance correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_mc(x, y, B=100, n_classes=None, random_state=None)¶
Permutation test for multiple correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_mi(x, y, B=100, random_state=None, **kwargs)¶
Permutation test for mutual information
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_pcor(x, y, B=100, random_state=None)¶
Permutation test for Pearson correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
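The permutation tests in this module share one pattern: compute a statistic on (x, y), recompute it on B shuffles of y, and report how often the shuffled statistic is at least as extreme. The sketch below applies that pattern to Pearson correlation using only the standard library. It is an illustration, not the library's implementation; in particular, the p-value convention (hits / B) is an assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, B=100, seed=None):
    """Share of B shuffles whose |correlation| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(B):
        rng.shuffle(y_perm)  # break any association between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    return hits / B


x = list(range(20))
y = [2 * v + 1 for v in x]
p = permutation_p_value(x, y, B=200, seed=0)
```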
- quasinet.feature_selectors.permutation_test_rdc(x, y, B=100, random_state=None)¶
Permutation test for randomized dependence coefficient
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
- quasinet.feature_selectors.permutation_test_rdc_parallel(x, y, B=100, n_jobs=-1, k=10, random_state=None)¶
Parallel implementation of permutation test for randomized dependence coefficient
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
B (int) – Number of permutations
n_jobs (int) – Number of cpus to use for processing
k (int) – Number of random projections for cca
random_state (int) – Sets seed for random number generator
- Returns:
p – Achieved significance level
- Return type:
float
quasinet.metrics module
- quasinet.metrics.convert_lists_to_ctypes(V_list, key_list)¶
- quasinet.metrics.js_divergence(p1, p2, smooth=0.0001)¶
Compute the Jensen-Shannon divergence of discrete probability distributions.
- Parameters:
p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution
smooth (float) – amount by which to smooth out the probability distribution of p2. This is intended to deal with categories with zero probability.
- Returns:
js_div – js divergence
- Return type:
numeric
- quasinet.metrics.kl_divergence(p1, p2)¶
Compute the Kullback–Leibler divergence of discrete probability distributions.
NOTE: we will not perform error checking in this function because this function is used very frequently. The user should check that p1 and p2 are in fact probability distributions.
- Parameters:
p1 (1d array-like) – probability distribution
p2 (1d array-like) – probability distribution
smooth (float) – amount by which to smooth out the probability distribution of p2. This is intended to deal with categories with zero probability.
- Returns:
output – kl divergence
- Return type:
numeric
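For reference, the two divergences can be written out with the standard library. This is a sketch of the textbook definitions, not the library's code: the function names are hypothetical, additive smoothing of p2 is one possible reading of the smooth parameter, and the Jensen-Shannon sketch omits smoothing because the mixture is positive wherever either input is.

```python
import math


def kl_sketch(p1, p2, smooth=0.0):
    """Kullback-Leibler divergence KL(p1 || p2) with optional smoothing of p2."""
    if smooth > 0:
        # Assumed smoothing: add a constant to every entry of p2, then renormalize.
        shifted = [v + smooth for v in p2]
        total = sum(shifted)
        p2 = [v / total for v in shifted]
    # Terms with p1 == 0 contribute nothing by convention.
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)


def js_sketch(p1, p2):
    """Jensen-Shannon divergence: mean KL of each input to their mixture."""
    m = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl_sketch(p1, m) + 0.5 * kl_sketch(p2, m)


same = kl_sketch([0.5, 0.5], [0.5, 0.5])
disjoint = js_sketch([1.0, 0.0], [0.0, 1.0])
smoothed = kl_sketch([0.5, 0.5], [1.0, 0.0], smooth=1e-4)
```

With natural logarithms the Jensen-Shannon divergence is bounded by log 2, which is reached when the two distributions have disjoint support.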
- quasinet.metrics.process_dict1_list(dict1_list)¶
- quasinet.metrics.process_dict2_list(dict2_list)¶
- quasinet.metrics.theta(seq1_list, seq2_list)¶
- quasinet.metrics.theta_(dict1_list, dict2_list)¶
- quasinet.metrics.theta_matrix(list_dict1_list, list_dict2_list=None)¶
- quasinet.metrics.theta_matrix_par(list_dict1_list, list_dict2_list)¶
quasinet.osfix module
- quasinet.osfix.osfix(OS=None)¶
quasinet.qnet module
- class quasinet.qnet.Qnet(feature_names, min_samples_split=2, alpha=0.05, max_depth=-1, max_feats=-1, early_stopping=False, verbose=0, random_state=None, n_jobs=1)¶
Bases:
object
Qnet architecture.
- Parameters:
feature_names (list) – List of names describing the features
min_samples_split (int) – Minimum samples required for a split
alpha (float) – Threshold value for selecting feature with permutation tests. Smaller values correspond to shallower trees
max_depth (int) – Maximum depth to grow tree
max_feats (str or int) – Maximum feats to select at each split. String arguments include ‘sqrt’, ‘log’, and ‘all’
early_stopping (bool) – Whether to implement early stopping during feature selection. If True, then as soon as the first permutation test returns a p-value less than alpha, this feature will be chosen as the splitting variable
verbose (bool or int) – Controls verbosity of training and testing
random_state (int) – Sets seed for random number generator
n_jobs (int) – Number of CPUs to use when training
- average_fidelity(s=None, eps=1e-12)¶
Compute the average fidelity of a model’s predicted distributions.
For each feature, the model predicts a probability distribution over possible values. Given the observed value for that feature, this function computes the ratio of:
the predicted probability assigned to the observed value, to
the maximum predicted probability across all possible values for that feature.
If the observed value is not present in the distribution, its probability is imputed as the maximum probability of that distribution. Ratios are thus always in the interval [0, 1].
The overall fidelity is the mean of these ratios across all features.
- Parameters:
model (object) – A model instance providing:
predict_distributions(s): returns a list of dictionaries, each mapping feature values (keys) to predicted probabilities (values).
feature_names: used to determine the number of features.
s (list, optional) – Observed values for each feature. If None, a default set of values is generated by nulldist(model).
eps (float, optional) – Small constant to avoid division by zero if a distribution is empty or has maximum probability 0. Default is 1e-12.
- Returns:
mean_fidelity (float) – The average ratio across all features, i.e., the overall fidelity.
ratios (numpy.ndarray) – Array of per-feature ratios (observed probability / max probability).
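The fidelity computation described above can be sketched on plain dictionaries. This is an illustration that takes the predicted distributions as input instead of a model; average_fidelity_sketch is a hypothetical name, it returns a list where the library returns a numpy array, and the way eps guards the division is an assumption.

```python
def average_fidelity_sketch(distributions, observed, eps=1e-12):
    """Mean of P(observed value) / max P over features, plus the per-feature ratios."""
    ratios = []
    for dist, value in zip(distributions, observed):
        p_max = max(dist.values(), default=0.0)
        # A value missing from the distribution is imputed as the maximum.
        p_obs = dist.get(value, p_max)
        ratios.append(p_obs / max(p_max, eps))
    mean_fidelity = sum(ratios) / len(ratios) if ratios else 0.0
    return mean_fidelity, ratios


distributions = [
    {"A": 0.5, "B": 0.25, "C": 0.25},
    {"X": 0.9, "Y": 0.1},
    {"U": 0.6, "V": 0.4},
]
observed = ["B", "X", "Z"]  # "Z" is absent from the third distribution
mean_fidelity, ratios = average_fidelity_sketch(distributions, observed)
```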
- clear_attributes()¶
Remove the unneeded attributes to save memory.
- Parameters:
None –
- Return type:
None
- fit(X, index_array=None)¶
Train the qnet
Examples
>>> from quasinet import qnet
>>> X = load_data()
>>> myqnet = qnet.Qnet(feature_names=feature_names)
>>> myqnet.fit(X)
- Parameters:
X (2d array-like) – Array of features
index_array (1d array-like) – Array of indices to generate estimators for. Uses all indices by default or if set to None.
- Returns:
self – Instance of Qnet class
- Return type:
- mix(qnet_2, feature_name_list)¶
Takes the columns listed in feature_name_list from qnet_2 and swaps their estimators into the current qnet in place. This makes it possible to simulate how the current qnet would behave if some of its estimators behaved like those of a second qnet, and can be used to identify the maximally divergent rules of organization between the two models. Also sets the attribute self.mixed to True.
- Parameters:
qnet_2 (Qnet) – A Qnet instance
feature_name_list (list) – A list of variable (feature) names that would be replaced in self from qnet_2
- Return type:
None
- predict_distribution(column_to_item, column)¶
Predict the probability distribution for a given column.
A particular column value may not appear in the resulting output. If that happens, the probability of that value is 0.
- Parameters:
column_to_item (dict) – dictionary mapping the column to the values the columns take
column (int) – column index
- Returns:
output – dictionary mapping possible column values to probability values
- Return type:
dictionary
- predict_distributions(seq)¶
Predict the probability distributions for all the columns.
If you do not want to set a particular value for an index of seq, then set the value at the index to the global nan_value. By default, this value is the empty string.
The length of the input sequence must match the size of feature_names.
- Parameters:
seq (list) – list of values
- Returns:
prob_distributions – list of dictionaries of probability distributions, one for each index
- Return type:
list
- viz_trees(tree_path, draw=True, big_enough_threshold=-1, prog='dot', format='pdf', remove_dotfile=True, remove_newline=False, addurl=False, base_url='https://zed.createuky.net/', **kwargs)¶
Generate dot files for individual estimators, and optionally render them to pdf.
- Parameters:
tree_path (string) – path to where dotfiles will be generated. Creates directory if not present
draw (bool) – Set to True to render dotfiles (default True)
prog (str) – Graphviz program used for rendering (default: dot, other values: neato, fdp, sfdp)
format (str) – Format of rendered file (default: pdf, other values png, svg)
remove_dotfile (bool) – Delete all dot files if set to True (default: True)
remove_newline (bool) – Remove newlines from edge labels in tree visualization to prettify
addurl (bool) – Add url links to the node labels
base_url (str) – Url base for node url links
**kwargs (dict, optional) – Additional keyword arguments to be passed to export_qnet_tree. Refer to the documentation of export_qnet_tree for details on accepted arguments.
- Return type:
None
- quasinet.qnet.export_qnet_graph(qnet, threshold, outfile)¶
Export the qnet as a graph of dependencies. The output will be in the .dot file format for graphs.
- Parameters:
qnet (Qnet) – A Qnet instance
threshold (float) – Numeric cutoff for edge weights. If the edge weights exceed this cutoff, then we include it into the graph.
outfile (str) – File name to save to.
- Return type:
None
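The role of threshold in the graph export can be pictured as a filter over edge weights before the DOT text is written. The sketch below is hypothetical throughout: edges_to_dot, the edge_weights mapping, and the edge direction are assumptions made for illustration, and the library derives its weights from the fitted qnet.

```python
def edges_to_dot(edge_weights, threshold):
    """Render edges whose weight exceeds threshold as DOT text."""
    lines = ["digraph qnet {"]
    for (src, dst), weight in sorted(edge_weights.items()):
        if weight > threshold:  # keep only edges above the cutoff
            lines.append(f'    "{src}" -> "{dst}" [label="{weight:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


edge_weights = {("a", "b"): 0.8, ("b", "c"): 0.1}
dot_text = edges_to_dot(edge_weights, threshold=0.5)
```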
- quasinet.qnet.export_qnet_tree(qnet, index, outfile, outformat='graphviz', detailed_output=False, pen_width=3, edge_color='black', edge_label_color='black', dpi=200, text_color='black', font_size=10, background_color='transparent', rotate=False, edge_fontcolor='grey14', min_size=1, color_alpha=1.5, labels=None, add_legend=False, **kwargs)¶
Export a tree from qnet. The index determines which tree to export.
- Parameters:
qnet (Qnet) – A Qnet instance
index (int) – Index of the tree to export
outformat (str) – Can only be graphviz for now. This will output a .dot file, which you can then compile using a command like dot -Tpng file.dot -o file.png
detailed_output (bool) – If True return detailed probabilities of output labels in leaf nodes. default: False
text_color (str) – Color to set the text
edge_color (str) – Color to set the edges
pen_width (int) – Width of pen for drawing boundaries
dpi (int) – Image resolution
rotate (bool) – If True, rotate the tree
add_legend (bool) – If True, add a legend to the tree
edge_fontcolor (str) – Color of edge label text
min_size (int) – Minimum number of nodes to draw the tree
color_alpha (float) – Parameter for color brightness
labels (list) – List of all labels, optional
- Return type:
None
- quasinet.qnet.fit_save(df, slice_range=None, n_jobs=10, alpha=0.1, file_prefix='model', strtype='U5', low_mem=True, compress=True)¶
Fit and save qnet model as gz.
- Parameters:
df (pandas.DataFrame) – pandas dataframe input data with featurenames as columns
slice_range (numpy 1D array of ints) – Index array used to build the model, passed to index_array in fit
alpha (float) – qnet fit significance level
file_prefix (str) – filename prefix for model file
low_mem (bool) – turning on low memory save (default: True)
compress (bool) – True if we want gzipped models (default: True)
strtype (str) – string type specification (default: U5)
- Returns:
qnet
- Return type:
- quasinet.qnet.load_qnet(f, gz=False)¶
Load the qnet from a file.
- Parameters:
f (str) – File name.
gz (bool) – Bool to indicate if file is gzipped (default: False)
- Returns:
qnet
- Return type:
- quasinet.qnet.membership_degree(seq, qnet)¶
Compute the membership degree of a sequence in a qnet.
- Parameters:
seq (1d array-like) – Array of values
qnet (Qnet) – the Qnet that seq belongs to
- Returns:
membership_degree – membership degree
- Return type:
numeric
- quasinet.qnet.qdistance(seq1, seq2, qnet1, qnet2, mismatch=False, FULL_C=False)¶
Compute the distance between two sequences, based on the Jensen-Shannon divergence of discrete probability distributions.
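For reference, the Jensen-Shannon divergence between two discrete distributions can be sketched as below. This is a textbook formulation in base 2, not the library's code; how qdistance aggregates the divergence across the features of the two qnets is not shown in this entry, and the helper names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (distributions with disjoint support).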
- quasinet.qnet.qdistance_matrix(seqs1, seqs2, qnet1, qnet2)¶
Compute a distance matrix with the qdistance metric.
- quasinet.qnet.save_qnet(qnet, f, low_mem=True, gz=False)¶
Save the qnet to a file.
TODO: using joblib is actually less memory efficient than using pickle. However, I don’t know if this is a general problem or this only happens under certain circumstances.
TODO: we may have to delete and garbage-collect some attributes in the qnet to save memory, for example .feature_importances_ and .available_features_
- Parameters:
qnet (Qnet) – A Qnet instance
f (str) – File name
low_mem (bool) – If True, save the Qnet with low memory by deleting all data attributes except the tree structure (default: True)
gz (bool) – Specification if we want gzipped output (default: False)
- Return type:
None
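The gz flag on save_qnet and load_qnet has to agree between saving and loading. The round trip below sketches that idea with the standard library only; it assumes plain pickle serialization, which may differ from what the library does internally (the TODO above mentions joblib), and `save_object`, `load_object`, and `model_stub` are hypothetical names.

```python
import gzip
import os
import pickle
import tempfile


def save_object(obj, path, gz=False):
    """Pickle obj to path, optionally through gzip compression."""
    opener = gzip.open if gz else open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path, gz=False):
    """Load a pickled object; gz must match the value used when saving."""
    opener = gzip.open if gz else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


# A plain dict stands in for a fitted model in this illustration.
model_stub = {"feature_names": ["f0", "f1"], "alpha": 0.1}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl.gz")
    save_object(model_stub, path, gz=True)
    restored = load_object(path, gz=True)
```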
quasinet.qsampling module¶
- quasinet.qsampling.qsample(seq, qnet, steps, baseline_prob=None, force_change=False, alpha=None, random_seed=None)¶
Perform q-sampling for multiple steps or specified indices.
Qsampling works as follows: Say you have a sequence and a qnet. Then we randomly pick one of the items in the sequence (or a specified index) and change the value of that item based on the prediction of the qnet.
- Parameters:
seq (1d array-like) – Array of values
qnet (Qnet) – The Qnet that seq belongs to
steps (int or 1d array-like) – If an integer, the number of steps to run q-sampling. If an array, specifies the indices to q-sample in order.
baseline_prob (1d array-like, optional) – Baseline probability for sampling which index. Ignored if steps is an array.
force_change (bool, optional) – Whether to force the sequence to change when sampling.
alpha (float, optional) – Scalar multiple of qnet object, can be any real number.
random_seed (int, optional) – Seed for reproducible randomness.
- Returns:
seq – q-sampled sequence
- Return type:
1d array-like
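The sampling loop described above can be sketched without the library. In this illustration `predict_distribution` is a hypothetical stand-in for the qnet's per-index prediction, and uniform index selection is assumed (no baseline_prob, force_change, or alpha handling).

```python
import random


def qsample_sketch(seq, predict_distribution, steps, seed=None):
    """Resample one randomly chosen position per step from a predicted distribution."""
    rng = random.Random(seed)
    seq = list(seq)  # work on a copy so the input is not modified
    for _ in range(steps):
        index = rng.randrange(len(seq))
        distribution = predict_distribution(seq, index)
        values = list(distribution)
        weights = [distribution[value] for value in values]
        seq[index] = rng.choices(values, weights=weights, k=1)[0]
    return seq


def toy_distribution(seq, index):
    # Hypothetical stand-in for a qnet prediction: favors "A" at every position.
    return {"A": 0.9, "C": 0.1}


sampled = qsample_sketch(["G", "T", "C"], toy_distribution, steps=5, seed=0)
```

Passing an explicit seed makes the run reproducible, which mirrors the purpose of the random_seed parameter.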
- quasinet.qsampling.targeted_qsample(seq1, seq2, qnet, steps, force_change=False)¶
Perform targeted q-sampling for multiple steps.
seq1 is q-sampled towards seq2.
This is similar to qsample, except that we perform changes to seq1 to try to approach seq2.
- Parameters:
seq1 (1d array-like) – Array of values
seq2 (1d array-like) – Array of values.
qnet (Qnet) – The Qnet that seq1 belongs to
steps (int) – Number of steps to run q-sampling
force_change (bool) – Whether to force the sequence to change when sampling.
- Returns:
seq – q-sampled sequence
- Return type:
1d array-like
quasinet.qseqtools module¶
quasinet.scorers module¶
- quasinet.scorers.MEAN(z)¶
- quasinet.scorers.approx_wdcor(x, y)¶
Approximate distance correlation by binning arrays
- NOTE: Code ported from R function approx.dcor at:
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns:
dcor – Distance correlation
- Return type:
float
- quasinet.scorers.c_dcor(x, y)¶
Wrapper for C version of distance correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns:
dcor – Distance correlation
- Return type:
float
- quasinet.scorers.c_wdcor(x, y, weights)¶
Wrapper for C version of weighted distance correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1
- Returns:
dcor – Distance correlation
- Return type:
float
- quasinet.scorers.cca(X, Y)¶
Largest canonical correlation
- Parameters:
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
- Returns:
cor – Largest canonical correlation between X and Y
- Return type:
float
- quasinet.scorers.cca_fast(X, Y)¶
Largest canonical correlation
- Parameters:
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
- Returns:
cor – Largest canonical correlation between X and Y
- Return type:
float
- quasinet.scorers.chi2(x, y)¶
x and y are ordinal representations of categorical variables.
- quasinet.scorers.create_chi2_table(x, y)¶
Create a chi-squared contingency table using x and y
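As a reference for the two entries above, a contingency table and the Pearson chi-squared statistic can be sketched in plain Python. The function names are hypothetical and the return format of the library functions is not documented here, so this shows only the standard computation.

```python
def contingency_table(x, y):
    """Count co-occurrences of the categories in x (rows) and y (columns)."""
    row_values = sorted(set(x))
    col_values = sorted(set(y))
    row_index = {value: i for i, value in enumerate(row_values)}
    col_index = {value: j for j, value in enumerate(col_values)}
    table = [[0] * len(col_values) for _ in row_values]
    for a, b in zip(x, y):
        table[row_index[a]][col_index[b]] += 1
    return table


def chi2_statistic(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


table = contingency_table([0, 0, 1, 1], [0, 0, 1, 1])
statistic = chi2_statistic(table)
```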
- quasinet.scorers.gini_index(y, labels)¶
Gini index for node in tree
- Note: Despite being jitted, this function is still slow and a bottleneck in the actual training phase. Sklearn’s Cython version is used to find the best split, and this function is then called on the parent node and two child nodes to calculate feature importances using the mean decrease impurity formula.
- Parameters:
y (1d array-like) – Array of labels
labels (1d array-like) – Unique labels
- Returns:
gini – Gini index
- Return type:
float
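The Gini index and the mean decrease impurity formula mentioned in the note can be sketched as follows. This is the standard definition, written without the library; the function names are hypothetical and the sketch takes only the labels, unlike gini_index(y, labels).

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


pure = gini_impurity(["a", "a"])
mixed = gini_impurity(["a", "b"])
gain = impurity_decrease(["a", "a", "b", "b"], ["a", "a"], ["b", "b"])
```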
- quasinet.scorers.mc_fast(x, y, n_classes)¶
Multiple correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
n_classes (int) – Number of classes
- Returns:
cor – Multiple correlation coefficient between x and y
- Return type:
float
- quasinet.scorers.mi(x, y)¶
Mutual information
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns:
info – Mutual information between x and y
- Return type:
float
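For discrete inputs, mutual information can be estimated from empirical frequencies as in the sketch below. It uses natural logarithms and the plug-in estimator; the base, estimator, and any normalization used by the library are not stated in this entry, so treat those choices as assumptions.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    n = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    total = 0.0
    for (xi, yi), count in joint.items():
        p_xy = count / n
        p_x = count_x[xi] / n
        p_y = count_y[yi] / n
        total += p_xy * math.log(p_xy / (p_x * p_y))
    return total


dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

When y is a copy of a balanced binary x, the estimate equals log 2; when the two are empirically independent, it is zero.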
- quasinet.scorers.mse(y)¶
Mean squared error for node in tree
- Parameters:
y (1d array-like) – Array of labels
- Returns:
error – Mean squared error
- Return type:
float
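A minimal sketch of a node-level mean squared error, assuming the usual regression-tree convention that the error is measured around the mean of the labels in the node (the entry above does not state the reference value). The name `node_mse` is hypothetical.

```python
def node_mse(y):
    """Mean squared deviation of the labels from their own mean."""
    n = len(y)
    if n == 0:
        return 0.0
    mean = sum(y) / n
    return sum((value - mean) ** 2 for value in y) / n


constant_node = node_mse([1.0, 1.0, 1.0])
spread_node = node_mse([0.0, 2.0])
```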
- quasinet.scorers.pcor(x, y)¶
Pearson correlation
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns:
cor – Pearson correlation
- Return type:
float
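The Pearson correlation itself is standard and can be sketched in plain Python. Returning 0.0 for a constant input is a choice made for this illustration, not documented library behavior, and `pearson` is a hypothetical name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # assumption: treat a constant input as uncorrelated
    return covariance / (spread_x * spread_y)


increasing = pearson([1, 2, 3], [2, 4, 6])
decreasing = pearson([1, 2, 3], [3, 2, 1])
```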
- quasinet.scorers.py_dcor(x, y)¶
Python port of C function for distance correlation
Note: Version is optimized for use with Numba
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
- Returns:
dcor – Distance correlation
- Return type:
float
- quasinet.scorers.py_wdcor(x, y, weights)¶
Python port of C function for weighted distance correlation
Note: Version is optimized for use with Numba
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
weights (1d array-like) – Weight vector that sums to 1
- Returns:
dcor – Distance correlation
- Return type:
float
- quasinet.scorers.rdc(X, Y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶
Randomized dependence coefficient
- Parameters:
X (2d array-like) – Array of n elements
Y (2d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function
- Returns:
cor – Randomized dependence coefficient between X and Y
- Return type:
float
- quasinet.scorers.rdc_fast(x, y, k=10, s=0.16666666666666666, f=<ufunc 'sin'>)¶
Randomized dependence coefficient
- Parameters:
x (1d array-like) – Array of n elements
y (1d array-like) – Array of n elements
k (int) – Number of random projections
s (float) – Variance of Gaussian random variables
f (function) – Non-linear function
- Returns:
cor – Randomized dependence coefficient between x and y
- Return type:
float
quasinet.tree module¶
- class quasinet.tree.Node(col=None, col_pval=None, lthreshold=None, rthreshold=None, impurity=None, value=None, left=None, right=None, label_frequency=None)¶
Bases:
object
Decision node in tree
- Parameters:
col (int) – Integer indexing the location of feature or column
col_pval (float) – Probability value from permutation test for feature selection
lthreshold (list) – List of items for taking the left edge down the tree
rthreshold (list) – List of items for taking the right edge down the tree
impurity (float) – Impurity measuring quality of split
value (1d array-like or float) – For classification trees, the estimate of each class probability; for regression trees, the central tendency estimate
left (Node) – Another Node
right (Node) – Another Node
label_frequency (dict) – Dictionary mapping label to its frequency
- quasinet.tree.get_nodes(root, get_leaves=True, get_non_leaves=True)¶
Traverse a tree and get all the nodes.
TODO: may need to change this into an iterator for speed purposes
If get_leaves and get_non_leaves are both True, then we will get all the nodes.
- Parameters:
root (Node) – root node of the tree
get_leaves (bool) – If true, we get leaf nodes
get_non_leaves (bool) – If true, we get non leaf nodes.
- Returns:
output – list of Node
- Return type:
list
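The filtering behavior of get_leaves and get_non_leaves can be sketched with an iterative traversal. `ToyNode` is a hypothetical stand-in that keeps only the left and right attributes of Node, and the pre-order visiting order is an assumption; the entry above does not specify the order of the returned list.

```python
class ToyNode:
    """Minimal stand-in for a tree node with left and right children."""

    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right


def collect_nodes(root, get_leaves=True, get_non_leaves=True):
    """Collect nodes in pre-order, optionally filtering leaves or internal nodes."""
    output = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        is_leaf = node.left is None and node.right is None
        if (is_leaf and get_leaves) or (not is_leaf and get_non_leaves):
            output.append(node)
        # Push right first so the left child is visited first.
        stack.append(node.right)
        stack.append(node.left)
    return output


tree = ToyNode("root", ToyNode("l"), ToyNode("r", ToyNode("rl"), ToyNode("rr")))
all_names = [node.name for node in collect_nodes(tree)]
leaf_names = [node.name for node in collect_nodes(tree, get_non_leaves=False)]
inner_names = [node.name for node in collect_nodes(tree, get_leaves=False)]
```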
quasinet.utils module¶
- quasinet.utils.analyze_dot_file(dot_file, fracThreshold=0.0)¶
- quasinet.utils.assert_array_rank(X, rank)¶
Check if the input is a NumPy array and has a certain rank.
- Parameters:
X (array-like) – Array to check
rank (int) – Rank of the tensor to check
- Return type:
None
- quasinet.utils.assert_string_type(X, name)¶
Check if the input is of string datatype.
- Parameters:
X (array-like) – Array to check
name (str) – Name of the input
- Return type:
None
- quasinet.utils.auc_score(y_true, y_prob)¶
ADD
- quasinet.utils.bayes_boot_probs(n)¶
Bayesian bootstrap sampling for case weights
- Parameters:
n (int) – Number of Bayesian bootstrap samples
- Returns:
p – Array of sampling probabilities
- Return type:
1d array-like
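In the standard Bayesian bootstrap, case weights are drawn from a flat Dirichlet distribution. The sketch below produces such weights by normalizing independent exponential draws; whether the library samples in exactly this way is an assumption, and `bayes_boot_probs_sketch` is a hypothetical name.

```python
import random


def bayes_boot_probs_sketch(n, seed=None):
    """Draw n weights from Dirichlet(1, ..., 1) via normalized exponentials."""
    rng = random.Random(seed)
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [draw / total for draw in draws]


weights = bayes_boot_probs_sketch(5, seed=0)
```

The weights are non-negative and sum to 1, so they can be used directly as sampling probabilities.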
- quasinet.utils.big_enough(dot_file, big_enough_threshold=-1)¶
- quasinet.utils.dot4svg(dot_file, output_file, output_svg_file, directory='trees1/', dpi=70, draw=True, base_url='https://zed.createuky.net/')¶
- quasinet.utils.drawtrees(dotfiles, prog='dot', format='pdf', big_enough_threshold=-1)¶
- quasinet.utils.estimate_margin(y_probs, y_true)¶
Estimates margin function of forest ensemble
Note: This function is similar to margin in R’s randomForest package
- Parameters:
y_probs (2d array-like) – Predicted probabilities where each row represents predicted class distribution for sample and each column corresponds to estimated class probability
y_true (1d array-like) – Array of true class labels
- Returns:
margin – Estimated margin of forest ensemble
- Return type:
float
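In R's randomForest, the margin of a sample is the proportion of votes for the true class minus the largest proportion for any other class. The sketch below averages that quantity over samples, which is an assumption based on the float return type; it also assumes integer labels that index the columns of y_probs and at least two classes. The name `margin_sketch` is hypothetical.

```python
def margin_sketch(y_probs, y_true):
    """Mean of (probability of true class - best probability among other classes)."""
    margins = []
    for probs, label in zip(y_probs, y_true):
        true_prob = probs[label]
        best_other = max(p for j, p in enumerate(probs) if j != label)
        margins.append(true_prob - best_other)
    return sum(margins) / len(margins)


mean_margin = margin_sketch([[0.8, 0.2], [0.3, 0.7]], [0, 1])
```

A positive margin means the true class received the highest probability for that sample; a negative one means the sample is misclassified.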
- quasinet.utils.find_matching_indices(A, B)¶
- quasinet.utils.generate_seed()¶
Generate a seed for the random number generator as a function of the current time and thread id. Must be used when a large number of qsamples are drawn in parallel.
- quasinet.utils.getNull(model, strtype='U5')¶
Function to generate an array of empty strings of same length as feature names in the model.
- Parameters:
model (Qnet object) – The Qnet model.
strtype (str) – String type to be used for the generated numpy array. Default is ‘U5’.
- Returns:
An array of empty strings.
- Return type:
numpy.ndarray
- quasinet.utils.logger(name, message)¶
Prints messages with style “[NAME] message”
- Parameters:
name (str) – Short title of message, for example, train or test
message (str) – Main description to be displayed in terminal
- Return type:
None
- quasinet.utils.numparameters(qnetmodel)¶
Compute the total number of parameters in the qnet.
- Parameters:
qnetmodel (Qnet object) – The Qnet model.
- Returns:
int – number of independent parameters.
float – number of internal nodes per model column.
- quasinet.utils.powerset(s)¶
Get the power set of a list or set.
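A power set can be built with the well-known itertools recipe, sketched below. The return type of the library function (list, generator, or set of tuples) is not stated above, so the list of tuples here is an assumption, as is the name `powerset_sketch`.

```python
from itertools import chain, combinations


def powerset_sketch(items):
    """Return all subsets of items as tuples, ordered by subset size."""
    items = list(items)
    return list(chain.from_iterable(combinations(items, r) for r in range(len(items) + 1)))


subsets = powerset_sketch([1, 2])
```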
- quasinet.utils.remove_newline_in_dotfile(file_path)¶
Remove newlines from edge labels in a .dot file.
- quasinet.utils.remove_zeros(r, axis)¶
Remove rows along a certain axis where the value is all zero.
- quasinet.utils.sample_from_dict(distrib)¶
Choose an item from the distribution
- Parameters:
distrib (dict) – Dictionary mapping keys to their probability values
- Returns:
item – A chosen key from the dictionary
- Return type:
key of dict
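Weighted sampling from such a dictionary can be sketched with random.choices from the standard library. This is an illustration of the idea, not the library's implementation; `sample_from_dict_sketch` and its rng argument are hypothetical.

```python
import random


def sample_from_dict_sketch(distrib, rng=random):
    """Pick one key with probability proportional to its value."""
    keys = list(distrib)
    weights = [distrib[key] for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


certain = sample_from_dict_sketch({"a": 1.0, "b": 0.0})
drawn = sample_from_dict_sketch({"x": 0.25, "y": 0.75}, rng=random.Random(0))
```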
- quasinet.utils.scientific_notation(num)¶
Convert a number into scientific notation
- Parameters:
num (float) – Any number
- Returns:
output – String representation of the number
- Return type:
str
quasinet.zqnet module¶
- quasinet.zqnet.extract_diagonal_blocks(M, L)¶
- quasinet.zqnet.get_description_curl(code)¶
- quasinet.zqnet.remove_suffix(s)¶
- quasinet.zqnet.replace_with_d(S, j, d)¶
- class quasinet.zqnet.zQnet(*args, **kwargs)¶
Bases:
Qnet
Extended Qnet architecture (zQnet).
An extension of the Qnet class with added functionality and attributes. This class introduces risk computation based on a series of metrics and provides a way to set and retrieve a model description.
Inherits from¶
Qnet : Base class
New Attributes¶
nullsequences (array-like) – Collection of sequences considered null or baseline for risk calculations.
target (str, optional) – Target variable or description. Not currently utilized in methods.
description (str) – Descriptive notes or commentary about the model.
auc_estimate (array-like) – AUCs obtained during optimization of null sequences.
training_index (array-like, optional) – Indices used during the training phase. Not currently utilized in methods.
- Parameters:
*args – Variable length argument list inherited from Qnet.
**kwargs – Arbitrary keyword arguments inherited from Qnet.
- personal_zshap(s, eps=1e-07)¶
A superfast approximation of SHAP for zQnet for individual samples
- Parameters:
s (numpy array of str) – The sequence around which we are evaluating perturbations.
eps (float) – shap value cutoff
- Returns:
Dataframe with SHAP values and index mapped to short descriptions of ICD-10 codes
- Return type:
pandas.DataFrame
- risk(X)¶
Compute the mean risk value for input X based on its distance from null sequences.
- Parameters:
X (2d array-like) – Input data whose risk is to be computed.
- Returns:
Mean risk value for the input X.
- Return type:
float
- risk_max(X)¶
Compute the maximum risk value for input X based on its distance from null sequences.
- Parameters:
X (2d array-like) – Input data whose risk is to be computed.
- Returns:
Maximum risk value for the input X.
- Return type:
float
- risk_median(X)¶
Compute the median risk value for input X based on its distance from null sequences.
- Parameters:
X (2d array-like) – Input data whose risk is to be computed.
- Returns:
Median risk value for the input X.
- Return type:
float
- set_description(markdown_file)¶
Set the description attribute for the model using content from a markdown file.
- Parameters:
markdown_file (str) – Path to the markdown file containing the model’s description.
- Returns:
Content of the markdown file.
- Return type:
str
- zshap(seq=None, m=35)¶
A superfast approximation of SHAP for zQnet
- Parameters:
seq (numpy array of str) – The sequence around which we are evaluating perturbations. By default it is the array of empty strings, which represents average behavior
m (int) – Length of shap return dataframe
- Returns:
Dataframe with SHAP values and index mapped to short descriptions of ICD-10 codes
- Return type:
pandas.DataFrame