API

Import vireoSNP as:

import vireoSNP

Commands

Read / Load

vireoSNP.read_cellSNP(dir_name, layers=['AD', 'DP'])[source]

Read data from the cellSNP output directory

Parameters:dir_name – directory full path name for cellSNP output
Returns:
Return type:A disctionary containing AD, DP, cells and variants
vireoSNP.read_vartrix(alt_mtx, ref_mtx, cell_file, vcf_file=None)[source]

Read data from VarTrix

Parameters:
  • alt_mtx – sparse matrix file for alternative alleles
  • ref_mtx – sparse matrix file for reference alleles
  • cell_file – file for cell barcodes, each per line
  • vcf_file – the vcf file used for fetch variants in VarTrix
Returns:

Return type:

A disctionary containing AD, DP, cells and optionally variants

VCF processing

Load VCF to matrices

vireoSNP.vcf.load_VCF(vcf_file, biallelic_only=False, load_sample=True, sparse=True, format_list=None)

Initially designed to load VCF from cellSNP output, requiring

  1. all variants have the same format list;
  2. a line starting with “#CHROM”, with sample ids.

If these two requirements are satisfied, this function also supports general VCF files, e.g., genotype for multiple samples.

Note, it may take a large memory, please filter the VCF with bcftools first.

Examples

  • Load VCF file, e.g., from cellsnp-lite output:
>>> import vireoSNP
>>> import numpy as np
>>> vcf_dat = vireoSNP.vcf.load_VCF("cellSNP.cells.vcf.gz", sparse=False,
>>>     biallelic_only=False, format_list=['GT', 'AD', 'DP', 'ALL'])
>>> var_ids = np.array(vcf_dat['variants'])
>>> samples = np.array(vcf_dat['samples'])
>>> GT_mat = np.array(vcf_dat['GenoINFO']['GT'])
>>> AD_mat = np.array(vcf_dat['GenoINFO']['AD']).astype(float)
>>> DP_mat = np.array(vcf_dat['GenoINFO']['DP']).astype(float)
>>> ALL_bases_mat = np.array(vcf_dat['GenoINFO']['ALL'])

Parse genotype probablity to tenseor

vireoSNP.vcf.parse_donor_GPb(GT_dat, tag='GT', min_prob=0.0)

Parse the donor genotype probability tag: GT, GP, or PL

Examples

>>> GProb_tensor = vireoSNP.vcf.parse_donor_GPb(vcf_dat['GenoINFO']['GT'], 'GT')
vireoSNP.vcf.match_VCF_samples(VCF_file1, VCF_file2, GT_tag1, GT_tag2)

Match donors in two VCF files. Please subset the VCF with bcftools first, as it is more computationally efficient:

bcftools view large_file.vcf.gz -R small_file.vcf.gz -Oz -o sub.vcf.gz

Parameters:
  • VCF_file1 (str) – the full path of first VCF file, in plain text or gzip / bgzip
  • VCF_file2 (str) – the full path of second VCF file, in plain text or gzip / bgzip
  • GT_tag1 (str) – the tag for extracting the genotype probability in VCF1: GT, GP, PL
  • GT_tag2 (str) – the tag for extracting the genotype probability in VCF2: GT, GP, PL

Plotting

Heatmap plot

vireoSNP.plot.heat_matrix(X, yticks=None, xticks=None, rotation=45, cmap='BuGn', alpha=0.6, display_value=True, row_sort=False, aspect='auto', interpolation='none', **kwargs)[source]

Plot heatmap of distance matrix

Parameters:
  • X (numpy.array or matrix) – The matrix to plot in heatmap
  • yticks (list) – The ticks ids for y axis
  • xticks (list) – The ticks ids for x axis
  • ratation (scalar) – The ratation angel for xticks
  • cmap (str) – The colormap for the heatmap, more options: https://matplotlib.org/stable/tutorials/colors/colormaps.html
  • alpha (scalar) – The transparency, value between 0 and 1
  • display_value (bool) – If True, dispaly the values in the heatmap
  • raw_sort (bool) – If True, sort the rows with row index as row_idx = np.argsort(np.dot(X, 2**np.arange(X.shape[1])))
  • aspect (str) – aspect in plt.imshow
  • interpolation (str) – interpolation in plt.imshow
  • **kwargs (keywords & values) – **kwargs for plt.imshow
Returns:

Return type:

The return from plt.imshow

Examples

>>> from vireoSNP.plot import heat_matrix
>>> import numpy as np
>>> np.random.seed(1)
>>> X = np.random.rand(5, 7)
>>> heat_matrix(X)

(Source code, png)

_images/API-1.png

Annotated heatmap plot

vireoSNP.plot.anno_heat(X, row_anno=None, col_anno=None, row_order_ids=None, col_order_ids=None, xticklabels=False, yticklabels=False, row_cluster=False, col_cluster=False, **kwargs)[source]

Heatmap with column or row annotations. Based on seaborn.clustermap() Row or column will be ordered by the annotation group.

Note, haven’t tested if input both row_anno and col_anno.

Vireo Object

Objects of type Vireo allow clustering cells by allelic ratio

class vireoSNP.Vireo(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]

Viroe model: Variational Inference for reconstruction of ensemble origin

The prior can be set via set_prior() before fitting the model.

beta_mu: numpy array (1, n_GT) or (n_var, n_GT)
Beta mean parameter of theta’s posterior
beta_sum: numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu
Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor)
Posterior cell assignment probability to each donor
GT_prob: numpy array (n_var, n_donor, n_GT)
Posterior genotype probability per variant per donor
__init__(n_cell, n_var, n_donor, n_GT=3, learn_GT=True, learn_theta=True, ASE_mode=False, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:
  • n_cell (int.) – Number of cells
  • n_var (int.) – Number of variants
  • n_donor (int.) – Number of donors
  • n_GT (int.) – Number of genotype categories
  • learn_GT (bool.) – Whether updating GT_prob; otherwise using the initial
  • ASE_mode (bool.) – Whether setting allelic ratio theta to be variant specific
  • fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
  • beta_mu_init (numpy array (1, n_GT) or (n_var, n_GT)) – Initial value of beta_mu, the mean parameter of theta
  • beta_sum_init (numpy array (1, n_GT) or (n_var, n_GT), same as beta_mu) – Initial value of beta_sum, the concetration parameter of theta
  • ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor
  • GT_prob_init (numpy array (n_var, n_donor, n_GT)) – Initial value of GT_prob, genotype probability per variant and donor
fit(AD, DP, max_iter=200, min_iter=5, epsilon_conv=0.01, delay_fit_theta=0, verbose=True, n_inits=50, nproc=1)[source]

Fit Vireo model with coordinate ascent

Parameters:
  • AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
  • DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
  • max_iter (int) – Maximum number of iterations
  • min_iter – Minimum number of iterations
  • epsilon_conv (float) – Threshold for detecting convergence
  • delay_fit_theta (int) – Number of steps to delay updating theta. This can be very useful for common genetics when there is good prior on allelic ratio.
  • verbose (bool) – Whether print out log info
set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None, GT_prob_init=None)[source]

Set initial values

set_prior(GT_prior=None, ID_prior=None, beta_mu_prior=None, beta_sum_prior=None, min_GP=1e-05)[source]

Set prior for key variables: theta, GT_prob and ID_prob. The priors are in the same shape as its according variables.

min_GP: float. Minimun genotype probability in GT_prior.

BinomMixtureVB Object

Objects of type BinomMixtureVB for clustering with binomial mixture model

class vireoSNP.BinomMixtureVB(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]

Binomial mixture model with variational inference

The prior can be set via set_prior() before fitting the model.

beta_mu: numpy array (n_var, n_donor)
Beta mean parameter of theta’s posterior
beta_sum: numpy array (n_var, n_donor)
Beta concetration parameter of theta’s posterior
ID_prob: numpy array (n_cell, n_donor)
Posterior cell assignment probability to each donor
__init__(n_cell, n_var, n_donor, fix_beta_sum=False, beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]

Initialise Vireo model

Note, multiple initializations are highly recomended to avoid local optima.

Parameters:
  • n_cell (int.) – Number of cells
  • n_var (int.) – Number of variants
  • n_donor (int.) – Number of donors
  • fix_beta_sum (bool.) – Whether fixing the concetration parameter of theta’s posterior
  • beta_mu_init (numpy array (n_var, n_donor)) – Initial value of beta_mu, the mean parameter of theta
  • beta_sum_init (numpy array (n_var, n_donor)) – Initial value of beta_sum, the concetration parameter of theta
  • ID_prob_init (numpy array (n_cell, n_donor)) – Initial value of ID_prob, cell assignment probability to each donor
fit(AD, DP, n_init=10, max_iter=200, max_iter_pre=100, random_seed=None, **kwargs)[source]

Fit VB with multiple initializations

Parameters:
  • AD (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for alternative allele
  • DP (scipy.sparse.csc_matrix (n_var, n_cell)) – Sparse count matrix for depths, alternative + refeerence alleles
  • n_inits (int) – Number of random initialisations to use
  • max_iter (int) – Maximum number of iterations for _fit_BV() in best initial
  • max_iter_pre (int) – Maximum number of iterations for _fit_BV() in multiple initials
  • min_iter – Minimum number of iterations for _fit_BV()
  • epsilon_conv (float) – Threshold for detecting convergence for _fit_BV()
  • verbose (bool) – Whether print out log info for _fit_BV()
  • random_seed (None or int) – Random seed in numpy.random for multiple initializations
set_initial(beta_mu_init=None, beta_sum_init=None, ID_prob_init=None)[source]

Random initialization

set_prior(ID_prior=None, beta_mu_prior=None, beta_sum_prior=None)[source]

Set prior for key variables: theta and ID_prob. The priors are in the same shape as its according variables.