API¶

Import vireoSNP as:

import vireoSNP

Commands¶

vireo: see manual
GTbarcode: see manual

Read / Load¶

vireoSNP.read_cellSNP(dir_name, layers=['AD', 'DP'])[source]¶

Read data from the cellSNP output directory

Parameters:	dir_name – directory full path name for cellSNP output
Returns:
Return type:	A disctionary containing AD, DP, cells and variants

vireoSNP.read_vartrix(alt_mtx, ref_mtx, cell_file, vcf_file=None)[source]¶

Read data from VarTrix

Parameters:	alt_mtx – sparse matrix file for alternative alleles ref_mtx – sparse matrix file for reference alleles cell_file – file for cell barcodes, each per line vcf_file – the vcf file used for fetch variants in VarTrix
Returns:
Return type:	A disctionary containing AD, DP, cells and optionally variants

VCF processing¶

Load VCF to matrices¶

vireoSNP.vcf.load_VCF(vcf_file, biallelic_only=False, load_sample=True, sparse=True, format_list=None)¶

Initially designed to load VCF from cellSNP output, requiring

all variants have the same format list;
a line starting with “#CHROM”, with sample ids.

If these two requirements are satisfied, this function also supports general VCF files, e.g., genotype for multiple samples.

Note, it may take a large memory, please filter the VCF with bcftools first.

Examples

Load VCF file, e.g., from cellsnp-lite output:

>>> import vireoSNP
>>> import numpy as np
>>> vcf_dat = vireoSNP.vcf.load_VCF("cellSNP.cells.vcf.gz", sparse=False,
>>>     biallelic_only=False, format_list=['GT', 'AD', 'DP', 'ALL'])
>>> var_ids = np.array(vcf_dat['variants'])
>>> samples = np.array(vcf_dat['samples'])
>>> GT_mat = np.array(vcf_dat['GenoINFO']['GT'])
>>> AD_mat = np.array(vcf_dat['GenoINFO']['AD']).astype(float)
>>> DP_mat = np.array(vcf_dat['GenoINFO']['DP']).astype(float)
>>> ALL_bases_mat = np.array(vcf_dat['GenoINFO']['ALL'])

Parse genotype probablity to tenseor¶

vireoSNP.vcf.parse_donor_GPb(GT_dat, tag='GT', min_prob=0.0)¶

Parse the donor genotype probability tag: GT, GP, or PL

Examples

>>> GProb_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat['GenoINFO']['GT'], 'GT')

vireoSNP.vcf.match_VCF_samples(VCF_file1, VCF_file2, GT_tag1, GT_tag2)¶

Match donors in two VCF files. Please subset the VCF with bcftools first, as it is more computationally efficient:

bcftools view large_file.vcf.gz -R small_file.vcf.gz -Oz -o sub.vcf.gz

Parameters:	VCF_file1 (str) – the full path of first VCF file, in plain text or gzip / bgzip VCF_file2 (str) – the full path of second VCF file, in plain text or gzip / bgzip GT_tag1 (str) – the tag for extracting the genotype probability in VCF1: GT, GP, PL GT_tag2 (str) – the tag for extracting the genotype probability in VCF2: GT, GP, PL

Plotting¶

Heatmap plot¶

vireoSNP.plot.heat_matrix(X, yticks=None, xticks=None, rotation=45, cmap='BuGn', alpha=0.6, display_value=True, row_sort=False, aspect='auto', interpolation='none', **kwargs)[source]¶

Plot heatmap of distance matrix

Parameters:	X (numpy.array or matrix) – The matrix to plot in heatmap yticks (list) – The ticks ids for y axis xticks (list) – The ticks ids for x axis ratation (scalar) – The ratation angel for xticks cmap (str) – The colormap for the heatmap, more options: https://matplotlib.org/stable/tutorials/colors/colormaps.html alpha (scalar) – The transparency, value between 0 and 1 display_value (bool) – If True, dispaly the values in the heatmap raw_sort (bool) – If True, sort the rows with row index as row_idx = np.argsort(np.dot(X, 2np.arange(X.shape[1]))) aspect** (str) – aspect in plt.imshow interpolation (str) – interpolation in plt.imshow *kwargs (keywords & values) – *kwargs for plt.imshow
Returns:
Return type:	The return from plt.imshow

Examples

>>> from vireoSNP.plot import heat_matrix
>>> import numpy as np
>>> np.random.seed(1)
>>> X = np.random.rand(5, 7)
>>> heat_matrix(X)

(Source code, png)

Annotated heatmap plot¶

vireoSNP.plot.anno_heat(X, row_anno=None, col_anno=None, row_order_ids=None, col_order_ids=None, xticklabels=False, yticklabels=False, row_cluster=False, col_cluster=False, **kwargs)[source]¶

Heatmap with column or row annotations. Based on seaborn.clustermap() Row or column will be ordered by the annotation group.

Note, haven’t tested if input both row_anno and col_anno.

Vireo Object¶

Objects of type Vireo allow clustering cells by allelic ratio

class vireoSNP.Vireo(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]¶

Viroe model: Variational Inference for reconstruction of ensemble origin

The prior can be set via set_prior() before fitting the model.

beta_mu: numpy array (1, n_GT) or (n_var, n_GT): Beta mean parameter of theta’s posterior
beta_sum: numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu: Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor): Posterior cell assignment probability to each donor
GT_prob: numpy array (n_var, n_donor, n_GT): Posterior genotype probability per variant per donor

__init__(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]¶

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:

n_cell (int.) – Number of cells
n_var (int.) – Number of variants
n_donor (int.) – Number of donors
n_GT (int.) – Number of genotype categories
learn_GT (bool.) – Whether updating GT_prob; otherwise using the initial
ASE_mode (bool.) – Whether setting allelic ratio theta to be variant specific
fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
beta_mu_init (numpy array (1, n_GT) or (n_var, n_GT)) – Initial value of beta_mu, the mean parameter of theta
beta_sum_init (numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu) – Initial value of beta_sum, the concetration parameter of theta
ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor
GT_prob_init (numpy array (n_var, n_donor, n_GT)) – Initial value of GT_prob, genotype probability per variant and donor

fit(AD, DP, max_iter=200, min_iter=5, epsilon_conv=0.01, delay_fit_theta=0, verbose=True, n_inits=50, nproc=1)[source]¶

Fit Vireo model with coordinate ascent

Parameters:

AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
max_iter (int) – Maximum number of iterations
min_iter – Minimum number of iterations
epsilon_conv (float) – Threshold for detecting convergence
delay_fit_theta (int) – Number of steps to delay updating theta. This can be very useful for common genetics when there is good prior on allelic ratio.
verbose (bool) – Whether print out log info

set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]¶: Set initial values

set_prior(GT_prior=None, ID_prior=None, beta_mu_prior=None, beta_sum_prior=None, min_GP=1e-05)[source]¶

Set prior for key variables: theta, GT_prob and ID_prob. The priors are in the same shape as its according variables.

min_GP: float. Minimun genotype probability in GT_prior.

BinomMixtureVB Object¶

Objects of type BinomMixtureVB for clustering with binomial mixture model

class vireoSNP.BinomMixtureVB(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]¶

Binomial mixture model with variational inference

The prior can be set via set_prior() before fitting the model.

beta_mu: numpy array (n_var, n_donor): Beta mean parameter of theta’s posterior
beta_sum: numpy array (n_var, n_donor): Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor): Posterior cell assignment probability to each donor

__init__(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]¶

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:

n_cell (int.) – Number of cells
n_var (int.) – Number of variants
n_donor (int.) – Number of donors
fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
beta_mu_init (numpy array (n_var, n_donor)) – Initial value of beta_mu, the mean parameter of theta
beta_sum_init (numpy array (n_var, n_donor)) – Initial value of beta_sum, the concetration parameter of theta
ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor

fit(AD, DP, n_init=10, max_iter=200, max_iter_pre=100, random_seed=None, **kwargs)[source]¶

Fit VB with multiple initializations

Parameters:

AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
n_inits (int) – Number of random initialisations to use
max_iter (int) – Maximum number of iterations for _fit_BV() in best initial
max_iter_pre (int) – Maximum number of iterations for _fit_BV() in multiple initials
min_iter – Minimum number of iterations for _fit_BV()
epsilon_conv (float) – Threshold for detecting convergence for _fit_BV()
verbose (bool) – Whether print out log info for _fit_BV()
random_seed (None or int) – Random seed in numpy.random for multiple initializations

set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]¶: Random initialization

set_prior(ID_prior=None, beta_mu_prior=None, beta_sum_prior=None)[source]¶: Set prior for key variables: theta and ID_prob. The priors are in the same shape as its according variables.