Examples for installation checking

The examples below show how to run VIA on generic connected and disconnected data using wrapper functions and serve as a check for your installation. For more detailed guidelines on running VIA and plotting the results, please use the Notebooks. We also highlight a few difference in calling VIA when using Windows versus Linux. The data for the Jupyter Notebooks and Examples are available in the Datasets folder (smaller files) with larger datasets here

A test script is available for some of the different datasets, please change the foldername accordingly to the folder containing relevant data files

  • 1.a Toy Data (multifurcation) Multifurcation Jupyter NB

  • 1.b Toy Data (disconnected) Disconnected Jupyter NB

  • 2.a General input data formatting and wrapper function

  • 2.b General disconnected trajectories wrapper function

  • 3.a Human Embryoid Bodies (wrapper function for testing VIA)

  • 3.b Human Embryoid Bodies (Configuring VIA) EB Jupyter NB

1.a/b Toy data (Multifurcation and Disconnected)

Two examples toy datasets with annotations generated using DynToy are provided. For the step-by-step code within these wrappers, please see the corresponding Jupyter NBs.

1.a/b Run on Linux

All examples are shown according to Linux OS, small modifications are required to run on a Windows OS (see below):

import pyVIA.core as via
# ensure the data and label files are in csv format when you download/save them
# multifurcation
# the root is automatically set to  root_user = 'M1'
via.main_Toy(ncomps=10, knn=30,dataset='Toy3', random_seed=2,foldername = ".../Trajectory/Datasets/") #multifurcation
# disconnected trajectory
# In the wrapper for Toy, the root is automatically set as a list root_user = ['T1_M1', 'T2_M1'] # e.g. T2_M3 is a cell belonging to the 3rd Milestone (M3) of the   second Trajectory (T2)
via.main_Toy(ncomps=10, knn=30,dataset='Toy4',random_seed=2,foldername =".../Trajectory/Datasets/") #2 disconnected trajectories

1.a/b Run on Windows

Windows may require minor modifications in calling the code due to the way multiprocessing works in Windows compared to Linux:

#when running from an IDE you need to call the function in the following way to ensure the parallel processing works:
import os
import pyVIA.core as via
f= os.path.join(r'C:\Users\...\Documents'+'\\')
def main():
    via.main_Toy(ncomps=10, knn=30,dataset='Toy3', random_seed=2,foldername= f)
if __name__ =='__main__':
    main()

#when running directly from terminal:
import os
import pyVIA.core as via
f= os.path.join(r'C:\Users\...\Documents'+'\\')
via.main_Toy(ncomps=10, knn=30,dataset='Toy3', random_seed=2,foldername= f)

**Multifurcating toy dataset 1.a ** *(click to open interactive graph)*

Disconnected toy dataset 1.b (click to open interactive graph)

2.a General input format and wrapper function (uses example of pre-B cell differentiation)

These wrapper functions are a good start but we highly recommend you look at the tutorials as you will be afforded a much higher degree of control without much added complexity. The below wrappers operate in the 2-iteration format (a coarse followed by a fine-grained), but this is not always needed and you will have more intuitive for the behaviour of your data by following the steps in the Tutorials. Nonetheless, the following wrappers are a great way to start to familiarize yourself with the various outputs from VIA.

Datasets and labels used in this example are provided in Datasets

# Read the two files:
# 1) The first file contains 200PCs of the Bcell filtered and normalized data for the first 5000 HVG.
# 2) The second file contains raw count data for marker genes

data = pd.read_csv('./Bcell_200PCs.csv')
data_genes = pd.read_csv('./Bcell_markergenes.csv')
data_genes = data_genes.drop(['cell'], axis=1)
true_label = data['time_hour']
data = data.drop(['cell', 'time_hour'], axis=1)
adata = sc.AnnData(data_genes)
adata.obsm['X_pca'] = data.values

# use UMAP or PHate to obtain embedding that is used for single-cell level visualization
embedding = umap.UMAP(random_state=42, n_neighbors=15, init='random').fit_transform(data.values[:, 0:5])

# list marker genes or genes of interest if known in advance. otherwise marker_genes = []
marker_genes = ['Igll1', 'Myc', 'Slc7a5', 'Ldha', 'Foxo1', 'Lig4', 'Sp7']  # irf4 down-up
# call VIA. We identify an early (suitable) start cell root = [42]. Can also set an arbitrary value
via.via_wrapper(adata, true_label, embedding, knn=10, ncomps=20, jac_std_global=0.15, root=[42], dataset='',
            random_seed=1,v0_toobig=0.3, v1_toobig=0.1, marker_genes=marker_genes)

2.b VIA wrapper for generic disconnected trajectory

A slightly different wrapper is called for the disconnected scenario. Refer to the Jupytern NB for a step-by-step tutorial.:

import scanpy as sc
import pandas as pd

#foldername corresponds to the location where you have saved the Toy Disconnected data (shown in example 2)
#Read in the data and labels
df_counts = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000.csv", 'rt', delimiter=",")
df_ids = pd.read_csv(foldername + "toy_disconnected_M9_n1000d1000_ids.csv", 'rt', delimiter=",")

# Make AnnData object for wrapper function to read-in data and do PCA
df_ids['cell_id_num'] = [int(s[1::]) for s in df_ids['cell_id']]
df_counts = df_counts.drop('Unnamed: 0', 1)
df_ids = df_ids.sort_values(by=['cell_id_num'])
df_ids = df_ids.reset_index(drop=True)
true_label = df_ids['group_id']
adata_counts = sc.AnnData(df_counts, obs=df_ids)
sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=10)

#Since there are 2 disconnected trajectories, we provide 2 arbitrary roots (start cells).If there are more disconnected paths, then VIA arbitrarily selects roots. #The root can also just be arbitrarily set as [1] and VIA can detect how many additional roots it must add
# The root can also be provided as a cell type level label corresponding to the groups present in "true_label", in this case the dataset must be set to 'group'
via.via_wrapper_disconnected(adata_counts, true_label, embedding=adata_counts.obsm['X_pca'][:, 0:2], root=[1,1], preserve_disconnected=True, knn=30, ncomps=10,cluster_graph_pruning_std = 1)

#in the case of connected data (i.e. only 1 graph component. e.g. Toy Data Multifurcating) then the wrapper function from example 3.a can be used:
via.via_wrapper(adata_counts, true_label, embedding=  adata_counts.obsm['X_pca'][:,0:2], root=[1], knn=30, ncomps=10,cluster_graph_pruning_std = 1)