This manuscript (permalink) was automatically generated from related-sciences/gwas-analysis-manuscript@b72722d on July 6, 2020.
John Doe
XXXX-XXXX-XXXX-XXXX
·
johndoe
·
johndoe
Department of Something, University of Whatever
· Funded by Grant XXXXXXXX
Jane Roe
XXXX-XXXX-XXXX-XXXX
·
janeroe
Department of Something, University of Whatever; Department of Whatever, University of Something
In memory EDA for genotyping data
Supports vcf, plink,
Has vcf_to_{npz,zarr,recarray} methods
Represents genotype calls as 3D uint8 arrays
GenotypeArray > This class represents data on discrete genotype calls as a 3-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the samples genotyped, and the third dimension corresponds to the ploidy of the samples.
Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, … and -1 (or any other negative integer) is a missing allele call. A single byte integer dtype (int8) can represent up to 127 distinct alleles, which is usually sufficient. The actual alleles (i.e., the alternate nucleotide sequences) and the physical positions of the variants within the genome of an organism are stored in separate arrays, discussed elsewhere.
In many cases the number of distinct alleles for each variant is small, e.g., less than 10, or even 2 (all variants are biallelic). In these cases a genotype array is not the most compact way of storing genotype data in memory. This class defines functions for bit-packing diploid genotype calls into single bytes, and for transforming genotype arrays into sparse matrices, which can assist in cases where memory usage needs to be minimised. Note however that these more compact representations do not allow the same flexibility in terms of using numpy universal functions to access and manipulate data.
Supports bitpacking by collapsing the ploidy dimension (axis=2) into a single byte using 4 bits for each uint8
Phasing is supported by assuming the ordering in the ploidy dimension has meaning: > If the genotype calls are unphased then the ordering of alleles along the third (ploidy) dimension is arbitrary
http://alimanfoo.github.io/2016/06/10/scikit-allel-tour.html > The scikit-allel genotype array convention is flexible, allowing for multiallelic and polyploid genotype calls. However, it is not very compact, requiring 2 bytes of memory for each call. A set of calls for 10,000,000 SNPs in 1,000 samples thus requires 20G of memory.
One option to work with large arrays is to use bit-packing, i.e., to pack two or more items of data into a single byte. E.g., this is what the plink BED format does. If you have have diploid calls that are only ever biallelic, then it is possible to fit 4 genotype calls into a single byte. This is 8 times smaller than the NumPy unpacked representation.
However, coding against bit-packed data is not very convenient. Also, there are several libraries available for Python which allow N-dimensional arrays to be stored using compression: h5py, bcolz and zarr. Genotype data is usually extremely compressible due to sparsity - most calls are homozygous ref, i.e., (0, 0), so there are a lot of zeros.
Dependent Projects:
--read-freq
is the opportune paramEstimator | LD | Pop Structure | Recent Admixture |
---|---|---|---|
GRM (Hail/PLINK) | |||
IBD (Hail/PLINK) | |||
KING | X | X | |
PC-Relate (Hail) | X | X | X |
--make-rel
is realized relationship matrix (presumably same as Hail RRM) and cites GCTA: A Tool for Genome-wide Complex Trait Analysis as the reference implementation (this paper has the formula in the Hail docs)--genome
is kinship coefficient estimatorstruct {
GT: call,
AD: array<int32>,
DP: int32,
GQ: int32,
PL: array<int32>,
PGT: call,
PID: str
}
ldd /opt/conda/envs/hail/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
> libcblas.so.3 => /opt/conda/envs/hail/lib/python3.7/site-packages/numpy/core/../../../../libcblas.so.3 (0x00007f0b737e1000)
readlink -e /opt/conda/envs/hail/lib/python3.7/site-packages/numpy/core/../../../../libcblas.so.3
> /opt/conda/envs/hail/lib/libopenblasp-r0.3.7.so
blas=*=openblas
and conda-forge::numpy
to environment.yamlldd /opt/conda/envs/hail/lib/python3.7/site-packages/scipy/linalg/_fblas.cpython-37m-x86_64-linux-gnu.so
>> libopenblasp-r0-2ecf47d5.3.7.dev.so => /opt/conda/envs/hail/lib/python3.7/site-packages/scipy/linalg/../.libs/libopenblasp-r0-2ecf47d5.3.7.dev.so (0x00007f13d0f56000)