pyVIA core

Initialize StaVia to fully utilize functions from plotting. See Basic workflow for step-by-step instruction.

class VIA.core.VIA(data, true_label=None, edgepruning_clustering_resolution_local=1, edgepruning_clustering_resolution=0.15, labels=None, keep_all_local_dist='auto', too_big_factor=0.4, resolution_parameter=1.0, partition_type='ModularityVP', small_pop=10, jac_weighted_edges=True, knn=30, n_iter_leiden=5, random_seed=42, num_threads=-1, distance='l2', time_smallpop=15, super_cluster_labels=False, super_node_degree_list=False, super_terminal_cells=False, x_lazy=0.99, alpha_teleport=0.99, root_user=None, preserve_disconnected=True, dataset='', super_terminal_clusters=[], is_coarse=True, csr_full_graph='', csr_array_locally_pruned='', ig_full_graph='', full_neighbor_array='', full_distance_array='', embedding=None, df_annot=None, preserve_disconnected_after_pruning=False, secondary_annotations=None, pseudotime_threshold_TS=30, cluster_graph_pruning=0.15, visual_cluster_graph_pruning=0.15, neighboring_terminal_states_threshold=3, num_mcmc_simulations=1300, piegraph_arrow_head_width=0.1, piegraph_edgeweight_scalingfactor=1.5, max_visual_outgoing_edges=2, via_coarse=None, velocity_matrix=None, gene_matrix=None, velo_weight=0.5, edgebundle_pruning=None, A_velo=None, CSM=None, edgebundle_pruning_twice=False, pca_loadings=None, time_series=False, time_series_labels=None, knn_sequential=10, knn_sequential_reverse=0, t_diff_step=1, single_cell_transition_matrix=None, embedding_type='via-mds', do_compute_embedding=False, color_dict=None, user_defined_terminal_cell=[], user_defined_terminal_group=[], do_gaussian_kernel_edgeweights=False, RW2_mode=False, working_dir_fp='/home/', memory=5, viagraph_decay=0.9, p_memory=1, graph_init_pos=None, spatial_coords=None, do_spatial_knn=False, do_spatial_layout=False, spatial_knn=15, spatial_aux=[])

A class to represent the VIA analysis
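
As a rough orientation, constructing the class with a handful of the keyword arguments from the signature above might look like the sketch below. The import path `pyVIA.core`, the helper name `build_via` and the synthetic input matrix are assumptions made for illustration only; the keyword names and default values are copied from the signature.

```python
import numpy as np

# Keyword arguments copied from the signature above, set to their documented defaults.
via_kwargs = dict(
    knn=30,
    edgepruning_clustering_resolution=0.15,
    cluster_graph_pruning=0.15,
    too_big_factor=0.4,
    random_seed=42,
    root_user=None,
)


def build_via(data, true_label=None, **overrides):
    """Hypothetical helper: construct a VIA object from defaults plus overrides."""
    # Assumption: the package is installed and importable under this path.
    import pyVIA.core as via

    params = {**via_kwargs, **overrides}
    return via.VIA(data, true_label=true_label, **params)


# Synthetic stand-in for e.g. the first 20 principal components (n_cells x n_dims).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20))

# Uncomment when pyVIA is installed. root_user=[0] designates cell 0 as the start cell.
# v0 = build_via(data, root_user=[0])
# v0.run_VIA()  # assumption: method name as used in the pyVIA tutorials
```

The import is kept inside the helper so that the snippet can be read and run without the package present.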

Parameters:
  • data (ndarray) – input matrix of size n_cells x n_dims. Expects the PCs or features that will be used in the TI computation, e.g. adata.obsm[‘X_pca’][:, 0:20]

  • true_label (list) – list of str/int that correspond to the ground truth or reference annotations. Can also be None when no labels are available

  • labels (ndarray (nsamples, )) – (default None) If None, PARC clusters are used for the viagraph. Alternatively, provide a list of cluster memberships as integer values (not strings) to construct the viagraph from another clustering method or from available annotations

  • edgepruning_clustering_resolution_local (float) – (default = 1) local level of pruning for the PARC graph clustering stage. Range (0.1, 3); higher numbers mean more edge retention. For large datasets, tuning only edgepruning_clustering_resolution is usually enough

  • edgepruning_clustering_resolution (float) – (optional, default = 0.15, can also be set to ‘median’) graph pruning for the PARC clustering stage, expressed as the number of standard deviations below the network’s mean Jaccard-weighted edge weight. A higher value means less pruning (more edges retained) and results in fewer clusters; a smaller value removes more edges and results in more clusters. Values of 0.1-1 provide reasonable pruning. For example, a value of 0.15 means all edges above mean(edge weights) - 0.15*std(edge weights) are retained. We find both 0.15 and ‘median’ to be good starting points, pruning away ~50-60% of edges

  • keep_all_local_dist (bool, str) – default value of ‘auto’ means that for smaller datasets local-pruning is done prior to clustering, but for large datasets local pruning is set to False for speed. can also set to be bool of True or False

  • too_big_factor (float) – (optional, default=0.4). Forces clusters > 0.4*n_cells to be re-clustered

  • resolution_parameter (float) – (default =1) larger value means more and smaller clusters

  • partition_type (str) – (default “ModularityVP”) Options

  • small_pop (int) – (default 10) Via attempts to merge Clusters with a population < 10 cells with larger clusters. If you have a very small dataset (e.g. few hundred cells), then consider lowering to e.g. 5

  • jac_weighted_edges (bool) – (default = True) Use weighted edges in the PARC clustering step

  • knn (int) – (optional, default = 30) number of K-Nearest Neighbors for HNSWlib KNN graph. Larger knn means more graph connectivity. Lower knn means more loosely connected clusters/cells

  • n_iter_leiden (int) –

  • random_seed (int) – Random seed to pass to clustering

  • num_threads

  • distance (str) – (default ‘l2’) Euclidean distance ‘l2’ by default; other options ‘ip’ and ‘cosine’ for graph construction and similarity

  • visual_cluster_graph_pruning (float) – (optional, default = 0.15) Only comes into play if the user deliberately chooses not to use the default edge-bundling method of visualizing edges (draw_piechart_graph()) and instead calls draw_piechart_graph_nobundle(). It is often set to the same value as the PARC clustering level of edgepruning_clustering_resolution. It controls the number of edges plotted for visual effect and does not impact the computation of terminal states, pseudotime or lineage likelihoods

  • cluster_graph_pruning (float) – (optional, default = 0.15) Pruning level of the cluster graph. It only impacts the connectivity of the clustergraph, not the number of clusters. Often set to the same value as the PARC clustering level of edgepruning_clustering_resolution. Reasonable range [0.1, 1]. To retain more connectivity in the clustergraph underlying the trajectory computations, increase the value

  • time_smallpop – (default = 15) maximum time allowed for handling singletons

  • x_lazy (float) – (default = 0.99) 1-x = probability of staying in the same node (lazy). Values between 0.9-0.99 are reasonable

  • alpha_teleport (float) – (default = 0.99) 1-alpha is the probability of jumping. Values between 0.95-0.99 are reasonable unless there is prior knowledge of teleportation

  • root_user (list, None) – (default None) can be a list of str, a list of int, or None. If root_user is None and an RNA velocity matrix is available, a root is computed automatically. If root_user is None and no velocity matrix is provided, an arbitrary root is selected. If root_user is [‘celltype_earlystage’], where the str corresponds to an item in true_label, a suitable starting point is selected from this group. If root_user is [678], where 678 is the index of the cell chosen as the start cell, that cell is the designated starting cell. When there is more than one trajectory, a list of root indices or groups can be given, e.g. [120, 699] or [‘traj1_earlystage’, ‘traj2_earlystage’]

  • preserve_disconnected (bool) – (default = True) If you believe there may be disconnected trajectories then set this to False

  • dataset (str) – Can be set to ‘group’ or ‘’ (default). This refers to the type of root label (group-level root or single-cell index) you are going to provide. If your true_label has a sensible group of cells for a root, set dataset to ‘group’ and make the root parameter [‘labelname_root_cell_type’]. If your root corresponds to one particular cell, set dataset = ‘’ (default)

  • embedding (ndarray) – (optional, default = None) embedding (e.g. precomputed tsne, umap, phate, via-umap) for plotting data. Size n_cells x 2. If an embedding is provided when running VIA, then a scatterplot colored by pseudotime and highlighting terminal fates is plotted

  • velo_weight (float) – (optional, default = 0.5) float in [0, 1]. The weight assigned to directionality and connectivity derived from scRNA-velocity

  • neighboring_terminal_states_threshold (int) – (default = 3). Candidates for terminal states that are neighbors of each other may be removed from the list if they have this number or more of terminal states as neighbors

  • knn_sequential (int) – (default =10) number of knn in the adjacent time-point for time-series data (t_i and t_i+1)

  • knn_sequential_reverse (int) – (default = 0) number of knn enforced from current to previous time point

  • t_diff_step (int) – (default = 1) Number of permitted temporal intervals between connected nodes. If time data is labeled as [0,25,50,75,100,..] then t_diff_step=1 corresponds to ‘25’, and only edges within t_diff_step intervals are retained

  • is_coarse (bool) – (default = True) If running VIA in two iterations where you wish to link the second fine-grained iteration with the initial iteration, then you set to False

  • via_coarse (VIA) – (default = None) If instantiating a second iteration of VIA that needs to be linked to a previous iteration (e.g. via0), then set via_coarse to the previous via0 object

  • df_annot (DataFrame) – (default None) used for the Mouse Organ data

  • preserve_disconnected_after_pruning (bool) – (default = False) If you believe there are disconnected trajectories then set this to True and test your hypothesis

  • A_velo (ndarray) – Cluster Graph Transition matrix based on rna velocity [n_clus x n_clus]

  • velocity_matrix (matrix) – (default None) matrix of size [n_samples x n_genes]. This is the velocity matrix computed by scVelo (or a similar package) and stored in adata.layers[‘velocity’]. The genes used for computing velocity should correspond to those used in gene_matrix. Requires gene_matrix to be provided too.

  • gene_matrix (matrix) – (default None) Only used if velocity_matrix is available. Matrix of size [n_samples x n_genes]. We recommend using a subset like HVGs rather than the full set of genes. The input needs to be densified if taken from adata, e.g. adata.X.todense()

  • time_series (bool) – (default False) if the data has time-series labels then set to True

  • time_series_labels (list) – (default None) list of integer values of temporal annotations corresponding to e.g. hours (post fert), days, or sequential ordering

  • pca_loadings (array) – (default None) the loadings of the pcs used to project the cells (to projected euclidean location based on velocity). n_cells x n_pcs

  • secondary_annotations (None) – (default None)

  • edgebundle_pruning (float) – (default=None) will by default be set to the same as the cluster_graph_pruning and influences the visualized level of pruning of edges. Typical values can be between [0,1] with higher numbers retaining more edges

  • edgebundle_pruning_twice (bool) – (default = False) When True, edge bundling is applied to a further visually pruned graph (visual_cluster_graph_pruning), which can sometimes simplify the visualization of very cluttered graphs. It does not impact the pseudotime and lineage computations

  • piegraph_arrow_head_width (float) – (default = 0.1) size of arrow heads in the via cluster graph

  • piegraph_edgeweight_scalingfactor (float) – (default = 1.5) scaling factor for edge thickness in the via cluster graph

  • max_visual_outgoing_edges (int) – (default =2) Only allows max_visual_outgoing_edges to come out of any given node. Used in differentiation_flow()

  • pseudotime_threshold_TS (int) – (default = 30) criterion for a state to be considered a candidate terminal cell fate: it must lie at 30% or later of the computed pseudotime range

  • num_mcmc_simulations (int) – (default = 1300) number of random walk simulations conducted

  • embedding_type (str) – (default = ‘via-mds’) other options are ‘via-atlas’ and ‘via-force’

  • do_compute_embedding (bool) – (default = False) If you want an embedding (n_samples x2) to be computed on the basis of the via sc graph then set this to True

  • do_gaussian_kernel_edgeweights (bool) – (default = False) Type of edgeweighting on the graph edges

  • memory – (default = 5) 1/q * edge weight to a next node that is not a neighbor of the previous node. A higher value means more memory and a more retrospective, inward-looking random walk; a small value < 1 means more exploration. memory = 2 means run using the non-memory Via 1.0 mode

  • viagraph_decay (float) – (default = 0.9) increasing decay causes more edges to merge

  • p_memory – (default = 1) 1/p * edge weight to a next node that is the previous node. A large value means more exploration

  • graph_init_pos (matrix or list of lists) – (default None) used to initialize the viagraph

  • spatial_coords (np.ndarray) – (default None) array of size n_cells x 2 giving the x, y coordinates of each spot/cell

  • do_spatial_knn (bool) – (default False) whether to use the spatial mode of StaVia for graph augmentation

  • do_spatial_layout (bool) – (default False) whether to use spatial_coords for the layout of the clustergraph

  • spatial_knn (int) – (default = 15) number of nearest neighbors added based on the spatial proximity indicated by spatial_coords

  • spatial_aux (list) – (default []) a list of slice IDs so that only cells/spots on the same slice are considered when building the spatial_knn graph
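
The pruning rule described for edgepruning_clustering_resolution (retain edges above mean(edge weights) - k*std(edge weights), or above the median) can be illustrated with a small NumPy sketch. This is not the library's implementation: the function name `prune_edges_by_std` and the synthetic beta-distributed weights are assumptions chosen only to show how the threshold moves with the parameter.

```python
import numpy as np


def prune_edges_by_std(weights, resolution=0.15):
    """Return (keep_mask, threshold) for a global edge-pruning rule.

    resolution is either a float k (threshold = mean - k * std)
    or the string 'median' (threshold = median of the weights).
    """
    weights = np.asarray(weights, dtype=float)
    if isinstance(resolution, str) and resolution == "median":
        threshold = np.median(weights)
    else:
        threshold = weights.mean() - resolution * weights.std()
    # Edges strictly above the threshold are retained.
    return weights > threshold, threshold


# Synthetic stand-in for Jaccard edge weights in (0, 1).
rng = np.random.default_rng(42)
jaccard_weights = rng.beta(2, 5, size=10_000)

for k in (0.15, 1.0, "median"):
    keep, thr = prune_edges_by_std(jaccard_weights, k)
    print(f"resolution={k}: threshold={thr:.3f}, retained={keep.mean():.1%}")
```

A larger k lowers the threshold, so more edges survive, which matches the statement that higher values mean less pruning. The retained fraction on real data depends on the actual weight distribution.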

labels

array of length (n_samples, ) holding predetermined, user-defined cluster labels, flattened as in np.asarray(pre_labels).flatten()

Type:

array

single_cell_pt_markov

length n_samples of pseudotime

Type:

list

single_cell_bp

[n_lineages x n_samples] array of single cell branching probabilities towards each lineage (lineage normalized). Each column corresponds to a terminal state, in the order presented by the terminal_clusters attribute

Type:

ndarray

single_cell_bp_rownormed

[n_lineages x n_samples] array of single cell branching probabilities towards each lineage (cell normalized). Each column corresponds to a terminal state, in the order presented by the terminal_clusters attribute

Type:

ndarray
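
To make "cell normalized" concrete, the sketch below rescales a small branching-probability array so that each cell's probabilities across lineages sum to 1. It assumes cells are stored as rows and terminal states as columns (following "each column corresponds to a terminal state"); the function name `cell_normalize` and the toy values are hypothetical, and the library may compute this differently.

```python
import numpy as np


def cell_normalize(bp):
    """Scale each row (one cell) so its lineage probabilities sum to 1."""
    bp = np.asarray(bp, dtype=float)
    row_sums = bp.sum(axis=1, keepdims=True)
    # Avoid division by zero: rows that are all zero stay all zero.
    row_sums[row_sums == 0] = 1.0
    return bp / row_sums


# Toy example: 3 cells (rows) x 3 terminal states (columns).
raw_bp = np.array([
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0],
])
rownormed = cell_normalize(raw_bp)
print(rownormed)
```

After this step a row can be read as one cell's relative affinity for each lineage.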

terminal_clusters

list of clusters that are cell fates/ unique lineages

Type:

list

cluster_bp

[n_clusters x n_terminal_states]. Lineage probability of cluster towards a particular terminal cluster state

Type:

ndarray

CSM

[n_cluster x n_clusters] array of cosine similarity used to weight the cluster graph transition matrix by velocity

Type:

ndarray

single_cell_transition_matrix

[n_samples x n_samples]

Type:

ndarray

terminal_clusters

(default None) list of terminal clusters

Type:

list

csr_full_graph

single-cell graph (augmented with sequential data when providing time_series information)

Type:

csr matrix

csr_array_locally_pruned
Type:

csr matrix

ig_full_graph
full_neighbor_array
user_defined_terminal_cell
Type:

list=[]

user_defined_terminal_group
Type:

list=[]

n_milestones

(default None) Number of milestones in the via-mds computation. Anything more than 10,000 can be computationally heavy and time consuming. Typically auto-determined within the via-mds function

Type:

int

embedding

[n_cells x 2] provided by user or autocomputed with via-mds or via-umap

Type:

ndarray

sc_transition_matrix(smooth_transition, b=10, use_sequentially_augmented=False)

Computes the single-cell level transition directions that are later used to calculate the velocity of the embedding, based on single-cell level changes in genes and single-cell level velocity

Parameters:
  • smooth_transition

  • b – slope of logistic function
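
For intuition about b, a generic logistic function with slope b is sketched below. How sc_transition_matrix applies the logistic internally is not stated here, so the function name `logistic` and the input range are purely illustrative assumptions.

```python
import numpy as np


def logistic(x, b=10.0):
    """Generic logistic curve 1 / (1 + exp(-b * x)); b controls the steepness."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-b * x))


x = np.linspace(-1.0, 1.0, 5)
for b in (1.0, 10.0):
    print(f"b={b}:", np.round(logistic(x, b), 3))
```

With a larger b the curve moves from 0 to 1 over a narrower range around x = 0, so small positive and negative inputs are separated more sharply.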

Returns: