API

Import vireoSNP as:

import vireoSNP

Commands

vireo: see manual
GTbarcode: see manual

Read / Load

vireoSNP.read_cellSNP(dir_name, layers=['AD', 'DP'])[source]

Read data from the cellSNP output directory

Parameters:: dir_name – directory full path name for cellSNP output
Return type:: A disctionary containing AD, DP, cells and variants

vireoSNP.read_vartrix(alt_mtx, ref_mtx, cell_file, vcf_file=None)[source]

Read data from VarTrix

Parameters:

alt_mtx – sparse matrix file for alternative alleles
ref_mtx – sparse matrix file for reference alleles
cell_file – file for cell barcodes, each per line
vcf_file – the vcf file used for fetch variants in VarTrix

Return type:

A disctionary containing AD, DP, cells and optionally variants

VCF processing

Load VCF to matrices

vireoSNP.vcf.load_VCF(vcf_file, biallelic_only=False, load_sample=True, sparse=True, format_list=None)

Load whole VCF file

Initially designed to load VCF from cellSNP output, requiring

all variants have the same format list;
a line starting with “#CHROM”, with sample ids.

If these two requirements are satisfied, this function also supports general VCF files, e.g., genotype for multiple samples.

Note, it may take a large memory, please filter the VCF with bcftools first.

Examples

Load VCF file, e.g., from cellsnp-lite output:

>>> import vireoSNP
>>> import numpy as np
>>> vcf_dat = vireoSNP.vcf.load_VCF("cellSNP.cells.vcf.gz", sparse=False,
>>>     biallelic_only=False, format_list=['GT', 'AD', 'DP', 'ALL'])
>>> var_ids = np.array(vcf_dat['variants'])
>>> samples = np.array(vcf_dat['samples'])
>>> GT_mat = np.array(vcf_dat['GenoINFO']['GT'])
>>> AD_mat = np.array(vcf_dat['GenoINFO']['AD']).astype(float)
>>> DP_mat = np.array(vcf_dat['GenoINFO']['DP']).astype(float)
>>> ALL_bases_mat = np.array(vcf_dat['GenoINFO']['ALL'])

Parse genotype probablity to tenseor

vireoSNP.vcf.parse_donor_GPb(GT_dat, tag='GT', min_prob=0.0)

Parse the donor genotype probability tag: GT, GP, or PL

Examples

>>> GProb_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat['GenoINFO']['GT'], 'GT')

vireoSNP.vcf.match_VCF_samples(VCF_file1, VCF_file2, GT_tag1, GT_tag2)

Match donors in two VCF files. Please subset the VCF with bcftools first, as it is more computationally efficient:

bcftools view large_file.vcf.gz -R small_file.vcf.gz -Oz -o sub.vcf.gz

Parameters:

VCF_file1 (str) – the full path of first VCF file, in plain text or gzip / bgzip
VCF_file2 (str) – the full path of second VCF file, in plain text or gzip / bgzip
GT_tag1 (str) – the tag for extracting the genotype probability in VCF1: GT, GP, PL
GT_tag2 (str) – the tag for extracting the genotype probability in VCF2: GT, GP, PL

Plotting

Heatmap plot

vireoSNP.plot.heat_matrix(X, yticks=None, xticks=None, rotation=45, cmap='BuGn', alpha=0.6, display_value=True, row_sort=False, aspect='auto', interpolation='none', **kwargs)[source]

Plot heatmap of distance matrix

Parameters:

X (numpy.array or matrix) – The matrix to plot in heatmap
yticks (list) – The ticks ids for y axis
xticks (list) – The ticks ids for x axis
ratation (scalar) – The ratation angel for xticks
cmap (str) – The colormap for the heatmap, more options: https://matplotlib.org/stable/tutorials/colors/colormaps.html
alpha (scalar) – The transparency, value between 0 and 1
display_value (bool) – If True, dispaly the values in the heatmap
raw_sort (bool) – If True, sort the rows with row index as row_idx = np.argsort(np.dot(X, 2**np.arange(X.shape[1])))
aspect (str) – aspect in plt.imshow
interpolation (str) – interpolation in plt.imshow
**kwargs (keywords & values) – **kwargs for plt.imshow

Return type:

The return from plt.imshow

Examples

>>> from vireoSNP.plot import heat_matrix
>>> import numpy as np
>>> np.random.seed(1)
>>> X = np.random.rand(5, 7)
>>> heat_matrix(X)

(Source code, png)

Annotated heatmap plot

vireoSNP.plot.anno_heat(X, row_anno=None, col_anno=None, row_order_ids=None, col_order_ids=None, xticklabels=False, yticklabels=False, row_cluster=False, col_cluster=False, **kwargs)[source]

Heatmap with column or row annotations. Based on seaborn.clustermap() Row or column will be ordered by the annotation group.

Note, haven’t tested if input both row_anno and col_anno.

Vireo Object

Objects of type Vireo allow clustering cells by allelic ratio

class vireoSNP.Vireo(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]

Viroe model: Variational Inference for reconstruction of ensemble origin

The prior can be set via set_prior() before fitting the model.

Key properties

beta_mu: numpy array (1, n_GT) or (n_var, n_GT): Beta mean parameter of theta’s posterior
beta_sum: numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu: Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor): Posterior cell assignment probability to each donor
GT_prob: numpy array (n_var, n_donor, n_GT): Posterior genotype probability per variant per donor

__init__(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:

n_cell (int.) – Number of cells
n_var (int.) – Number of variants
n_donor (int.) – Number of donors
n_GT (int.) – Number of genotype categories
learn_GT (bool.) – Whether updating GT_prob; otherwise using the initial
ASE_mode (bool.) – Whether setting allelic ratio theta to be variant specific
fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
beta_mu_init (numpy array (1, n_GT) or (n_var, n_GT)) – Initial value of beta_mu, the mean parameter of theta
beta_sum_init (numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu) – Initial value of beta_sum, the concetration parameter of theta
ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor
GT_prob_init (numpy array (n_var, n_donor, n_GT)) – Initial value of GT_prob, genotype probability per variant and donor

fit(AD, DP, max_iter=200, min_iter=5, epsilon_conv=0.01, delay_fit_theta=0, verbose=True, n_inits=50, nproc=1)[source]

Fit Vireo model with coordinate ascent

Parameters:

AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
max_iter (int) – Maximum number of iterations
min_iter – Minimum number of iterations
epsilon_conv (float) – Threshold for detecting convergence
delay_fit_theta (int) – Number of steps to delay updating theta. This can be very useful for common genetics when there is good prior on allelic ratio.
verbose (bool) – Whether print out log info

set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]: Set initial values

set_prior(GT_prior=None, ID_prior=None, beta_mu_prior=None, beta_sum_prior=None, min_GP=1e-05)[source]

Set prior for key variables: theta, GT_prob and ID_prob. The priors are in the same shape as its according variables.

min_GP: float. Minimun genotype probability in GT_prior.

BinomMixtureVB Object

Objects of type BinomMixtureVB for clustering with binomial mixture model

class vireoSNP.BinomMixtureVB(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]

Binomial mixture model with variational inference

The prior can be set via set_prior() before fitting the model.

Key properties

beta_mu: numpy array (n_var, n_donor): Beta mean parameter of theta’s posterior
beta_sum: numpy array (n_var, n_donor): Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor): Posterior cell assignment probability to each donor

__init__(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:

n_cell (int.) – Number of cells
n_var (int.) – Number of variants
n_donor (int.) – Number of donors
fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
beta_mu_init (numpy array (n_var, n_donor)) – Initial value of beta_mu, the mean parameter of theta
beta_sum_init (numpy array (n_var, n_donor)) – Initial value of beta_sum, the concetration parameter of theta
ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor

fit(AD, DP, n_init=10, max_iter=200, max_iter_pre=100, random_seed=None, **kwargs)[source]

Fit VB with multiple initializations

Parameters:

AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
n_inits (int) – Number of random initialisations to use
max_iter (int) – Maximum number of iterations for _fit_BV() in best initial
max_iter_pre (int) – Maximum number of iterations for _fit_BV() in multiple initials
min_iter – Minimum number of iterations for _fit_BV()
epsilon_conv (float) – Threshold for detecting convergence for _fit_BV()
verbose (bool) – Whether print out log info for _fit_BV()
random_seed (None or int) – Random seed in numpy.random for multiple initializations

set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]: Random initialization

set_prior(ID_prior=None, beta_mu_prior=None, beta_sum_prior=None)[source]: Set prior for key variables: theta and ID_prob. The priors are in the same shape as its according variables.

VireoBulk Object

Objects of type VireoBulk for clustering with binomial mixture model

class vireoSNP.VireoBulk(n_donor, n_GT=3, psi_init=None, theta_init=[0.01, 0.5, 0.99])[source]

Estimate of donor abundance in a multipexed bulk sample

Varibale to infer

psi: numpy.array (n_donor, ): The fractional abundance of each donor in the mixture
theta: numpy.array (n_GT, ): The alternative allele rate in each genotype category

param n_GT:
type n_GT:: int, number of genotype categories
param n_donor:
type n_donor:: int, number of donors in the mixture

LR_test(**kwargs)[source]

Likelihood ratio test for psi vector in a null hypothesis.

Use **kwargs for psi_null, AD, DP, GT_prob, log in vireoSNP.LikRatio_test() function. Note, AD, DP, GT_prob the same as the self.fit() function.

__init__(n_donor, n_GT=3, psi_init=None, theta_init=[0.01, 0.5, 0.99])[source]

fit(AD, DP, GT_prob, max_iter=200, min_iter=5, epsilon_conv=0.001, learn_theta=True, delay_fit_theta=0, model='EM', verbose=False)[source]

Fit the unknown variable psi and theta with EM algorithm

Parameters:

AD (numpy.array, (n_variant, ), int) – The count vector for alternative allele in all variants
DP (numpy.array (n_variant, ), int) – The count vector for depths in all variants (i.e., two alleles)
GT_prob (numpy.array, (n_variants, n_donor, n_GT)) – The probability tensor for each genotype in each donor
learn_theta (bool) – Whether learn theta, otherwise use theta_init
delay_fit_theta (int) – The number of steps to delay in updating theta
max_iter (int) – Maximum number of iterations
min_iter – Minimum number of iterations
epsilon_conv (float) – Threshold for detecting convergence
model (string) – The algorithm used to fit the model. Only “EM” is supported for Expectation-Maximumization algorithm
verbose (bool) – Whether print out log info

LikRatio_test

vireoSNP.LikRatio_test(psi, psi_null, AD, DP, GT_prob, theta, log=False)[source]

Likelihood ratio test for psi vector in a null hypothesis. Please use the same AD, DP, and GT_prob as the fit() function.

Parameters:

psi (numpy.array (n_donor, )) – The fractional abundance of each donor in the mixture for alternative hypothesis
psi_null (numpy.array (n_donor, )) – The psi vector in a null hypothesis
AD (numpy.array, (n_variant, ), int) – The count vector for alternative allele in all variants
DP (numpy.array (n_variant, ), int) – The count vector for depths in all variants (i.e., two alleles)
GT_prob (numpy.array, (n_variants, n_donor, n_GT)) – The probability tensor for each genotype in each donor
theta (numpy.array (n_GT, )) – The alternative allele rate in each genotype category
log (bool) – If return p value in logarithm scale

Returns:

statistic (float) – The calculated chi2-statistic.
pvalue (float) – The single-tailed p-value.