pyVIA core

Initialize a StaVia (VIA) object to make full use of the plotting functions. See Basic workflow for step-by-step instructions.

class VIA.core.VIA(data, true_label=None, edgepruning_clustering_resolution_local=1, edgepruning_clustering_resolution=0.15, labels=None, keep_all_local_dist='auto', too_big_factor=0.4, resolution_parameter=1.0, partition_type='ModularityVP', small_pop=10, jac_weighted_edges=True, knn=30, n_iter_leiden=5, random_seed=42, num_threads=-1, distance='l2', time_smallpop=15, super_cluster_labels=False, super_node_degree_list=False, super_terminal_cells=False, x_lazy=0.99, alpha_teleport=0.99, root_user=None, preserve_disconnected=True, dataset='', super_terminal_clusters=[], is_coarse=True, csr_full_graph='', csr_array_locally_pruned='', ig_full_graph='', full_neighbor_array='', full_distance_array='', embedding=None, df_annot=None, preserve_disconnected_after_pruning=False, secondary_annotations=None, pseudotime_threshold_TS=30, cluster_graph_pruning=0.15, visual_cluster_graph_pruning=0.15, neighboring_terminal_states_threshold=3, num_mcmc_simulations=1300, piegraph_arrow_head_width=0.1, piegraph_edgeweight_scalingfactor=1.5, max_visual_outgoing_edges=2, via_coarse=None, velocity_matrix=None, gene_matrix=None, velo_weight=0.5, edgebundle_pruning=None, A_velo=None, CSM=None, edgebundle_pruning_twice=False, pca_loadings=None, time_series=False, time_series_labels=None, knn_sequential=10, knn_sequential_reverse=0, t_diff_step=1, single_cell_transition_matrix=None, embedding_type='via-mds', do_compute_embedding=False, color_dict=None, user_defined_terminal_cell=[], user_defined_terminal_group=[], do_gaussian_kernel_edgeweights=False, RW2_mode=False, working_dir_fp='/home/', memory=5, viagraph_decay=0.9, p_memory=1, graph_init_pos=None, spatial_coords=None, do_spatial_knn=False, do_spatial_layout=False, spatial_knn=15, spatial_aux=[])[source]

A class to represent the VIA analysis

Parameters:

data (ndarray) – input matrix of size n_cells x n_dims. Expects the PCs or features that will be used in the TI computation, e.g. adata.obsm[‘X_pca’][:, 0:20]
true_label (list) – list of str/int that correspond to the ground truth or reference annotations. Can also be None when no labels are available
labels (ndarray (nsamples, )) – (default None) When None, PARC clusters are used for the viagraph. Alternatively, provide a list of integer (not string) cluster memberships to construct the viagraph from another clustering method or from available annotations
edgepruning_clustering_resolution_local (float) – (default = 1) local level of pruning for the PARC graph clustering stage. Range (0.1, 3); higher numbers mean more edge retention. For large datasets it is usually enough to tune only edgepruning_clustering_resolution
edgepruning_clustering_resolution (float) – (optional, default = 0.15, can also be set to ‘median’) graph pruning for the PARC clustering stage, expressed as the number of standard deviations below the network’s mean Jaccard-weighted edge weight. A higher value means less pruning (more edges retained) and results in fewer clusters; a smaller value removes more edges and results in more clusters. Values of 0.1-1 provide reasonable pruning. E.g. a value of 0.15 means that all edges above mean(edge weights) - 0.15*std(edge weights) are retained. Both 0.15 and ‘median’ are good starting points and prune away ~50-60% of edges
keep_all_local_dist (bool, str) – (default ‘auto’) ‘auto’ means that local pruning is done prior to clustering for smaller datasets, but is turned off for large datasets for speed. Can also be set to True or False
too_big_factor (float) – (optional, default=0.4). Forces clusters > 0.4*n_cells to be re-clustered
resolution_parameter (float) – (default =1) larger value means more and smaller clusters
partition_type (str) – (default “ModularityVP”) Options
small_pop (int) – (default 10) Via attempts to merge Clusters with a population < 10 cells with larger clusters. If you have a very small dataset (e.g. few hundred cells), then consider lowering to e.g. 5
jac_weighted_edges (bool) – (default = True) Use weighted edges in the PARC clustering step
knn (int) – (optional, default = 30) number of K-Nearest Neighbors for HNSWlib KNN graph. Larger knn means more graph connectivity. Lower knn means more loosely connected clusters/cells
n_iter_leiden (int) –
random_seed (int) – Random seed to pass to clustering
num_threads –
distance (str) – (default ‘l2’) Euclidean distance ‘l2’ by default; other options ‘ip’ and ‘cosine’ for graph construction and similarity
visual_cluster_graph_pruning (float) – (optional, default = 0.15) This only comes into play if the user deliberately chooses not to use the default edge-bundling method of visualizing edges (draw_piechart_graph()) and instead calls draw_piechart_graph_nobundle(). It is often set to the same value as the PARC clustering level of edgepruning_clustering_resolution. It controls the number of edges plotted for visual effect and does not impact the computation of terminal states, pseudotime or lineage likelihoods
cluster_graph_pruning (float) – (optional, default = 0.15) Pruning level of the cluster graph. It only impacts the connectivity of the clustergraph, not the number of clusters. Often set to the same value as the PARC clustering level of edgepruning_clustering_resolution. Reasonable range [0.1, 1]. To retain more connectivity in the clustergraph underlying the trajectory computations, increase the value
time_smallpop – (default = 15) maximum time allowed for handling singletons
x_lazy (float) – (default = 0.99) 1-x = probability of staying in the same node (lazy). Values between 0.9-0.99 are reasonable
alpha_teleport (float) – (default = 0.99) 1-alpha is the probability of jumping. Values between 0.95-0.99 are reasonable unless there is prior knowledge of teleportation
root_user (list, None) – (default None) can be a list of str, a list of int, or None. If root_user is None and an RNA velocity matrix is available, a root is computed automatically. If root_user is None and no velocity matrix is provided, an arbitrary root is selected. If root_user is [‘celltype_earlystage’], where the str corresponds to an item in true_label, a suitable starting point is selected from this group. If root_user is [678], where 678 is the index of the cell chosen as the start cell, this cell is the designated starting cell. When there is more than one trajectory, a list of root indices or groups can be given, e.g. [120, 699] or [‘traj1_earlystage’, ‘traj2_earlystage’]
preserve_disconnected (bool) – (default = True) If you believe there may be disconnected trajectories then set this to False
dataset (str) – Can be set to ‘group’ or ‘’ (default). This refers to the type of root label (group-level root or single-cell index) you are going to provide. If your true_label has a sensible group of cells for a root, set dataset to ‘group’ and set the root parameter to [‘labelname_root_cell_type’]. If your root corresponds to one particular cell, set dataset = ‘’ (default)
embedding (ndarray) – (optional, default = None) embedding (e.g. precomputed tsne, umap, phate, via-umap) for plotting data. Size n_cells x 2. If an embedding is provided when running VIA, a scatterplot colored by pseudotime and highlighting terminal fates is plotted
velo_weight (float) – (optional, default = 0.5) float in [0, 1]. The weight assigned to directionality and connectivity derived from scRNA-velocity
neighboring_terminal_states_threshold (int) – (default = 3) Candidates for terminal states that are neighbors of each other may be removed from the list if they have this number or more of terminal states as neighbors
knn_sequential (int) – (default =10) number of knn in the adjacent time-point for time-series data (t_i and t_i+1)
knn_sequential_reverse (int) – (default = 0) number of knn enforced from current to previous time point
t_diff_step (int) – (default = 1) Number of permitted temporal intervals between connected nodes. If time data is labeled as [0,25,50,75,100,..] then t_diff_step=1 corresponds to an interval of ‘25’, and only edges within t_diff_step intervals are retained
is_coarse (bool) – (default = True) If running VIA in two iterations where you wish to link the second fine-grained iteration with the initial iteration, then you set to False
via_coarse (VIA) – (default = None) If instantiating a second iteration of VIA that needs to be linked to a previous iteration (e.g. via0), then set via_coarse to the previous via0 object
df_annot (DataFrame) – (default None) used for the Mouse Organ data
preserve_disconnected_after_pruning (bool) – (default = False) If you believe there are disconnected trajectories then set this to True and test your hypothesis
A_velo (ndarray) – Cluster Graph Transition matrix based on rna velocity [n_clus x n_clus]
velocity_matrix (matrix) – (default None) matrix of size [n_samples x n_genes]. This is the velocity matrix computed by scVelo (or a similar package) and stored in adata.layers[‘velocity’]. The genes used for computing velocity should correspond to those used in gene_matrix. Requires gene_matrix to be provided too.
gene_matrix (matrix) – (default None) Only used if velocity_matrix is available. Matrix of size [n_samples x n_genes]. We recommend using a subset like HVGs rather than the full set of genes. The input needs to be densified if taken from adata (e.g. adata.X.todense())
time_series (bool) – (default False) if the data has time-series labels then set to True
time_series_labels (list) – (default None) list of integer values of temporal annotations corresponding to e.g. hours (post fertilization), days, or sequential ordering
pca_loadings (array) – (default None) the loadings of the pcs used to project the cells (to projected euclidean location based on velocity). n_cells x n_pcs
secondary_annotations (None) – (default None)
edgebundle_pruning (float) – (default=None) will by default be set to the same as the cluster_graph_pruning and influences the visualized level of pruning of edges. Typical values can be between [0,1] with higher numbers retaining more edges
edgebundle_pruning_twice (bool) – (default = False) When True, the edge bundling is applied to a further visually pruned graph (visual_cluster_graph_pruning), which can sometimes simplify the visualization for very cluttered graphs. It does not impact the pseudotime and lineage computations
piegraph_arrow_head_width (float) – (default = 0.1) size of arrow heads in the via cluster graph
piegraph_edgeweight_scalingfactor – (default = 1.5) scaling factor for edge thickness in the via cluster graph
max_visual_outgoing_edges (int) – (default =2) Only allows max_visual_outgoing_edges to come out of any given node. Used in differentiation_flow()
pseudotime_threshold_TS (int) – (default = 30) a state is only considered a candidate terminal cell fate if it lies at 30% or later of the computed pseudotime range
num_mcmc_simulations (int) – (default = 1300) number of random walk simulations conducted
embedding_type (str) – (default = ‘via-mds’) other options are ‘via-atlas’ and ‘via-force’
do_compute_embedding (bool) – (default = False) If you want an embedding (n_samples x2) to be computed on the basis of the via sc graph then set this to True
do_gaussian_kernel_edgeweights (bool) – (default = False) Type of edgeweighting on the graph edges
memory – (default = 5) the value q, where 1/q scales the edge weight to a next node that is not a neighbor of the previous node. A larger value means more memory and a more retrospective (inward-looking) random walk; a small value < 1 means more exploration. memory = 2 means run using the non-memory Via 1.0 mode
viagraph_decay (float) – (default = 0.9) increasing decay causes more edges to merge
p_memory – (default = 1) the value p, where 1/p scales the edge weight to a next node that is the previous node. A large value means more exploration
graph_init_pos (matrix or list of lists) – (default None) positions used to initialize the viagraph
spatial_coords (np.ndarray) – (default None) array of size n_cells x 2 giving the x, y coordinates of each spot/cell
do_spatial_knn (bool) – (default = False) whether to use the spatial mode of StaVia for graph augmentation
do_spatial_layout (bool) – (default = False) whether to use spatial_coords for the layout of the clustergraph
spatial_knn (int) – (default = 15) number of nearest neighbors added based on the spatial proximity indicated by spatial_coords
spatial_aux (list) – (default = []) a list of slice IDs so that only cells/spots on the same slice are considered when building the spatial_knn graph
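
The pruning rule described for edgepruning_clustering_resolution can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper name prune_edges and the toy edge list are hypothetical, the population standard deviation is assumed, and ‘median’ is assumed to keep edges above the median weight.

```python
import statistics

def prune_edges(edges, resolution=0.15):
    """Keep edges whose weight lies above the pruning threshold.

    edges: list of (source, target, weight) tuples (hypothetical toy format).
    resolution: float k, giving threshold = mean - k * std, or the string 'median'.
    """
    weights = [w for _, _, w in edges]
    if resolution == "median":
        # Assumption: 'median' keeps edges strictly above the median weight
        threshold = statistics.median(weights)
    else:
        # Assumption: population standard deviation of the edge weights
        threshold = statistics.mean(weights) - resolution * statistics.pstdev(weights)
    return [e for e in edges if e[2] > threshold]

# Toy graph with five weighted edges
edges = [(0, 1, 0.9), (0, 2, 0.1), (1, 2, 0.5), (2, 3, 0.7), (3, 4, 0.3)]

kept_default = prune_edges(edges, 0.15)     # moderate pruning
kept_loose = prune_edges(edges, 1.0)        # higher value, more edges retained
kept_median = prune_edges(edges, "median")  # median-based threshold
```

On this toy graph a higher resolution value retains more edges, which matches the description that larger values mean less pruning.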

labels

pre-determined, user-defined cluster labels; ndarray of length (n_samples, )

Type:: array

single_cell_pt_markov

length n_samples of pseudotime

Type:: list

single_cell_bp

[n_lineages x n_samples] array of single cell branching probabilities towards each lineage (lineage normalized). Each column corresponds to a terminal state, in the order presented by the terminal_clusters attribute

Type:: ndarray

single_cell_bp_rownormed

[n_lineages x n_samples] array of single cell branching probabilities towards each lineage (cell normalized). Each column corresponds to a terminal state, in the order presented by the terminal_clusters attribute

Type:: ndarray

terminal_clusters

list of clusters that are cell fates/ unique lineages

Type:: list

cluster_bp

[n_clusters x n_terminal_states]. Lineage probability of cluster towards a particular terminal cluster state

Type:: ndarray
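
As a rough illustration of how cluster_bp and terminal_clusters relate, the sketch below picks the most likely fate for each cluster. The values are made up, plain lists stand in for the ndarray, and it assumes that the columns of cluster_bp follow the order of terminal_clusters; the helper most_likely_fate is hypothetical.

```python
# Hypothetical toy values; in practice these would be read from a fitted VIA object
terminal_clusters = [4, 7]   # cluster IDs of the two terminal states
cluster_bp = [               # shape [n_clusters x n_terminal_states]
    [0.8, 0.2],
    [0.5, 0.5],
    [0.1, 0.9],
]

def most_likely_fate(bp_row, terminal_clusters):
    """Return (terminal cluster ID, probability) with the highest lineage probability."""
    # Assumption: column j of cluster_bp corresponds to terminal_clusters[j]
    best = max(range(len(bp_row)), key=lambda j: bp_row[j])
    return terminal_clusters[best], bp_row[best]

fates = [most_likely_fate(row, terminal_clusters) for row in cluster_bp]
for cluster_id, (fate, prob) in enumerate(fates):
    print(f"cluster {cluster_id}: most likely terminal cluster {fate} (p={prob})")
```

Ties are resolved in favor of the first terminal cluster in the list, which is a choice of this sketch only.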

CSM

[n_cluster x n_clusters] array of cosine similarity used to weight the cluster graph transition matrix by velocity

Type:: ndarray
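
A cluster-level cosine similarity matrix of this shape can be sketched as follows. The choice of what is compared (each cluster's mean velocity against the displacement toward every other cluster) is an assumption made for illustration, and the toy coordinates and variable names are hypothetical; the source does not specify how CSM is computed.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

# Hypothetical toy data: mean position and mean velocity of three clusters
cluster_means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cluster_velocity = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

n = len(cluster_means)
csm = [[0.0] * n for _ in range(n)]  # [n_clusters x n_clusters]
for i in range(n):
    for j in range(n):
        if i == j:
            continue  # leave the diagonal at 0.0
        # Assumption: compare cluster i's velocity with the direction from i to j
        displacement = [b - a for a, b in zip(cluster_means[i], cluster_means[j])]
        csm[i][j] = cosine_similarity(cluster_velocity[i], displacement)
```

Entries close to 1 mark transitions that agree with the velocity direction, while negative entries mark transitions that oppose it.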

single_cell_transition_matrix

[n_samples x n_samples]

Type:: ndarray

terminal_clusters

(default None) list of terminal clusters

Type:: list

csr_full_graph

csr matrix of the single-cell graph (augmented with sequential data when time_series information is provided)

Type:: csr matrix

csr_array_locally_pruned

Type:: csr matrix

ig_full_graph

full_neighbor_array

user_defined_terminal_cell

Type:: list (default [])

user_defined_terminal_group

Type:: list (default [])

n_milestones

(default None) Number of milestones in the via-mds computation. Anything more than 10,000 can be computationally heavy and time consuming. Typically auto-determined within the via-mds function

Type:: int

embedding

[n_cells x 2] provided by user or autocomputed with via-mds or via-umap

Type:: ndarray

sc_transition_matrix(smooth_transition, b=10, use_sequentially_augmented=False)[source]

Computes the single-cell level transition directions, based on changes at the single-cell level in genes and on single-cell level velocity. These are later used to calculate the velocity of the embedding.

Parameters:

smooth_transition –
b – slope of logistic function
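
To give a feel for what the slope b of a logistic function controls, here is a generic sketch. It is not taken from the library: the function name logistic and the midpoint x0 are hypothetical, and how sc_transition_matrix applies the function internally is not documented here.

```python
import math

def logistic(x, b=10.0, x0=0.0):
    """Standard logistic curve with slope b and midpoint x0 (x0 is a hypothetical parameter)."""
    return 1.0 / (1.0 + math.exp(-b * (x - x0)))

# A larger b makes the transition around the midpoint sharper
gentle = [round(logistic(x, b=1.0), 3) for x in (-0.5, 0.0, 0.5)]
steep = [round(logistic(x, b=10.0), 3) for x in (-0.5, 0.0, 0.5)]
```

With b=10 the curve is already close to 0 or 1 at x = ±0.5, whereas with b=1 it stays near 0.5 over the same range.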

Returns: