singlecellmultiomics.molecule package¶
Submodules¶
singlecellmultiomics.molecule.chic module¶
-
class
singlecellmultiomics.molecule.chic.
AnnotatedCHICMolecule
(fragment, features, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.chic.CHICMolecule
,singlecellmultiomics.molecule.featureannotatedmolecule.FeatureAnnotatedMolecule
Chic Molecule which is annotated with features (genes/exons/introns, .. )
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- features (singlecellmultiomics.features.FeatureContainer) – container to use to obtain features from
- **kwargs – extra args
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
-
class
singlecellmultiomics.molecule.chic.
CHICMolecule
(fragment, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
CHIC Molecule class
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- **kwargs – extra args
-
get_cut_site
()[source]¶ For restriction based protocol data, obtain genomic location of cut site
Returns: None if site is not available chromosome (str) position (int) strand (bool)
-
get_fragment_span_sequence
(reference=None)[source]¶ Obtain the sequence between the start and end of the molecule
Parameters: reference (pysam.FastaFile) – reference to use. If not specified self.reference is used Returns: sequence (str)
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
-
update_ligation_motif
()[source]¶ Extract lh tag from associated reads and set most common one to the RZ tag
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
-
class
singlecellmultiomics.molecule.chic.
CHICNLAMolecule
(fragment, reference, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
CHIC NLA Molecule class
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- **kwargs – extra args
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
singlecellmultiomics.molecule.consensus module¶
-
singlecellmultiomics.molecule.consensus.
base_calling_matrix_to_df
(x, ref_info=None, NUC_RADIUS=1, USE_RT=True)[source]¶ Convert numpy base calling feature matrix to pandas dataframe with annotated columns
Parameters: - x (np.array) – feature matrix
- ref_info (list) – reference position annotations (will be used as index)
- NUC_RADIUS (int) – generate kmer features target nucleotide
- USE_RT (bool) – use RT reaction features
Returns: df (pd.DataFrame)
-
singlecellmultiomics.molecule.consensus.
calculate_consensus
(molecule, consensus_model, molecular_identifier, out, **model_kwargs)[source]¶ Create consensus read for molecule
Parameters: - molecule (singlecellmultiomics.molecule.Molecule) –
- consensus_model –
- molecular_identifier (str) – identier for this molecule, will be suffixed to the reference_id
- out (pysam.AlingmentFile) – target bam file
- **model_kwargs – arguments passed to the consensus model
-
singlecellmultiomics.molecule.consensus.
get_consensus_training_data
(molecule_iterator, mask_variants=None, n_train=100000, skip_already_covered_bases=True, **feature_matrix_args)[source]¶ Create a tensor/matrix containing alignment and base calling information, which can be used for consensus calling. This function also creates a vector containing the corresponding reference bases, which can be used for training a consensus model.
Parameters: - molecule_iterator – generator which generates molecules from which base calling feature matrices are extracted
- mask_variants (pysam.VariantFile) – variant locations which should be excluded from the matrix
- n_train (int) – amount of rows in the matrix
- skip_already_covered_bases (bool) – when True every reference position is at most a single row in the output matrix, this prevents overfitting
- **feature_matrix_args – Arguments to pass to the feature matrix function of the molecules.
singlecellmultiomics.molecule.featureannotatedmolecule module¶
-
class
singlecellmultiomics.molecule.featureannotatedmolecule.
FeatureAnnotatedMolecule
(fragment, features, stranded=None, auto_set_intron_exon_features=False, capture_locations=False, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
Molecule which is annotated with features (genes/exons/introns, .. )
-
annotate
(method=0)[source]¶ Parameters: method (int) – 0, obtain blocks and then obtain features. 1, try to obtain features for every aligned base
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.featureannotatedmolecule.
TranscriptMolecule
(fragment, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
singlecellmultiomics.molecule.filter module¶
-
singlecellmultiomics.molecule.filter.
molecule_iterator_filter
(molecule_iterator, min_fragments=None, max_fragments=None, min_mapping_qual=None, max_mapping_qual=None, min_ivt_duplicates=None, max_ivt_duplicates=None, both_pairs_mapped=None, min_span=None, max_span=None)[source]¶ Filter iterable with molecules
molecule_iterator (iterable) : molecules to filter from min_fragments (int) : minimum amount of fragments associated to molecule max_fragments (int) : maximum amount of fragments associated to molecule min_mapping_qual(int) : minimum maximum mapping quality for a single associated fragment max_mapping_qual(int) : maximum maximum mapping quality for a single associated fragment min_ivt_duplicates(int) : minimum amount of in vitro transcription copies max_ivt_duplicates(int) : maximum amount of in vitro transcription copies both_pairs_mapped(bool) : molecule should have at least one fragment with both pairs mapped min_span(int) : minimum amount of bases aligned with reference max_span : maximum amount of bases aligned with reference
singlecellmultiomics.molecule.fourthiouridine module¶
-
class
singlecellmultiomics.molecule.fourthiouridine.
FourThiouridine
(fragments=None, classifier=None, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
-
singlecellmultiomics.molecule.iterator module¶
-
class
singlecellmultiomics.molecule.iterator.
MoleculeIterator
(alignments, molecule_class=<class 'singlecellmultiomics.molecule.molecule.Molecule'>, fragment_class=<class 'singlecellmultiomics.fragment.fragment.Fragment'>, check_eject_every=10000, molecule_class_args={}, fragment_class_args={}, perform_qflag=True, pooling_method=1, yield_invalid=False, yield_overflow=True, query_name_flagger=None, ignore_collisions=True, every_fragment_as_molecule=False, yield_secondary=False, yield_supplementary=False, max_buffer_size=None, iterator_class=<class 'pysamiterators.iterators.iterators.MatePairIterator'>, skip_contigs=None, progress_callback_function=None, min_mapping_qual=None, perform_allele_clustering=False, **pysamArgs)[source]¶ Bases:
object
Iterate over molecules in pysam.AlignmentFile or reads from a generator or list
Example
>>> !wget https://github.com/BuysDB/SingleCellMultiOmics/blob/master/data/mini_nla_test.bam?raw=true -O mini_nla_test.bam >>> !wget https://github.com/BuysDB/SingleCellMultiOmics/blob/master/data/mini_nla_test.bam.bai?raw=true -O mini_nla_test.bam.bai >>> import pysam >>> from singlecellmultiomics.molecule import NlaIIIMolecule, MoleculeIterator >>> from singlecellmultiomics.fragment import NlaIIIFragment >>> import pysamiterators >>> alignments = pysam.AlignmentFile('mini_nla_test.bam') >>> for molecule in MoleculeIterator( >>> alignments=alignments, >>> molecule_class=singlecellmultiomics.molecule.NlaIIIMolecule, >>> fragment_class=singlecellmultiomics.fragment.NlaIIIFragment, >>> ): >>> break >>> molecule NlaIIIMolecule with 1 assinged fragments Allele :No allele assigned Fragment: sample:APKS1P25-NLAP2L2_57 umi:CCG span:chr1 164834728-164834868 strand:+ has R1: yes has R2: no randomer trimmed: no DS:164834865 RS:0 RZ:CAT Restriction site:('chr1', 164834865)
It is also possible to supply and iterator instead of a SAM/BAM file handle .. rubric:: Example
>>> from singlecellmultiomics.molecule import MoleculeIterator >>> from singlecellmultiomics.fragment import Fragment >>> import pysam >>> # Create SAM file to write some example reads to: >>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000]) >>> read_A = pysam.AlignedSegment(test_sam.header) >>> read_A.set_tag('SM','CELL_1') >>> read_A.set_tag('RX','CAT') >>> read_A.reference_name = 'chr1' >>> read_A.reference_start = 100 >>> read_A.query_sequence = 'ATCGGG' >>> read_A.cigarstring = '6M' >>> read_A.mapping_quality = 60 >>> # Create a second read which is a duplicate of the previous >>> read_B = pysam.AlignedSegment(test_sam.header) >>> read_B.set_tag('SM','CELL_1') >>> read_B.set_tag('RX','CAT') >>> read_B.reference_name = 'chr1' >>> read_B.reference_start = 100 >>> read_B.query_sequence = 'ATCGG' >>> read_B.cigarstring = '5M' >>> read_B.mapping_quality = 60 >>> # Create a thids read which is belonging to another cell >>> read_C = pysam.AlignedSegment(test_sam.header) >>> read_C.set_tag('SM','CELL_2') >>> read_C.set_tag('RX','CAT') >>> read_C.reference_name = 'chr1' >>> read_C.reference_start = 100 >>> read_C.query_sequence = 'ATCGG' >>> read_C.cigarstring = '5M' >>> read_C.mapping_quality = 60 >>> # Set up an iterable containing the reads: >>> reads = [ read_A,read_B,read_C ] >>> molecules = [] >>> for molecule in MoleculeIterator( reads ): >>> print(molecule) Molecule with 2 assinged fragments Allele :No allele assigned Fragment: sample:CELL_1 umi:CAT span:chr1 100-106 strand:+ has R1: yes has R2: no randomer trimmed: no Fragment: sample:CELL_1 umi:CAT span:chr1 100-105 strand:+ has R1: yes has R2: no randomer trimmed: no Molecule with 1 assinged fragments Allele :No allele assigned Fragment: sample:CELL_2 umi:CAT span:chr1 100-105 strand:+ has R1: yes has R2: no randomer trimmed: no
In the next example the molecules overlapping with a single location on chromosome ‘1’ position 420000 are extracted Don’t forget to supply check_eject_every = None, this allows non-sorted data to be passed to the MoleculeIterator.
Example
>>> from singlecellmultiomics.bamProcessing import mate_pileup >>> from singlecellmultiomics.molecule import MoleculeIterator >>> with pysam.AlignmentFile('example.bam') as alignments: >>> for molecule in MoleculeIterator( >>> mate_pileup(alignments, contig='1', position=420000, check_eject_every=None) >>> ): >>> pass
Warning
Make sure the reads being supplied to the MoleculeIterator sorted by genomic coordinate! If the reads are not sorted set check_eject_every=None
singlecellmultiomics.molecule.molecule module¶
-
class
singlecellmultiomics.molecule.molecule.
Molecule
(fragments: Optional[Iterable[T_co]] = None, cache_size: int = 10000, reference: Union[pysam.libcfaidx.FastaFile, pysamiterators.iterators.iterators.CachedFasta] = None, min_max_mapping_quality: Optional[int] = None, mapability_reader: Optional[singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader] = None, allele_resolver: Optional[singlecellmultiomics.alleleTools.alleleTools.AlleleResolver] = None, max_associated_fragments=None, allele_assingment_method=1, **kwargs)[source]¶ Bases:
object
Molecule class, contains one or more associated fragments
-
fragments
¶ associated fragments
Type: list
-
spanStart
¶ starting coordinate of molecule, None if not available
Type: int
-
spanEnd
¶ ending coordinate of molecule, None if not available
Type: int
-
chromosome
¶ mapping chromosome of molecule, None if not available
Type: str
-
cache_size
¶ radius of molecule assignment cache
Type: int
-
reference
¶ reference file, used to obtain base contexts and correct aligned_pairs iteration when the MD tag is not correct
Type: pysam.FastaFile
-
strand
¶ mapping strand. True when strand is REVERSE False when strand is FORWARD None when strand is not determined
Type: bool
-
__getitem__
(index)[source]¶ Obtain a fragment belonging to this molecule.
Parameters: index (int) – index of the fragment [0 ,1 , 2 ..] Returns: fragment (singlecellmultiomics.fragment.Fragment)
-
__iter__
()[source]¶ Iterate over all associated fragments
Yields: singlecellmultiomics.fragment.Fragment
-
add_fragment
(fragment, use_hash=True)[source]¶ Associate a fragment with this Molecule
Parameters: fragment (singlecellmultiomics.fragment.Fragment) – Fragment to associate Returns: Returns False when the fragments which have already been associated to the molecule refuse the fragment Return type: has_been_added (bool) Raises: OverflowError
– when too many fragments have been associated already control this with .max_associated_fragments attribute
-
add_molecule
(other)[source]¶ Merge other molecule into this molecule. Merges by assigning all fragments in other to this molecule.
-
aligned_length
¶
-
allele
¶
-
allele_confidence
¶ Returns confidence(int) : a phred scalled confidence value for the allele assignment, returns zero when no allele is associated to the molecule
-
allele_informative_base_probabilities
¶
-
allele_likelihoods
¶ Per allele likelihood
Returns: {allele_name : likelihood} Return type: likelihoods (dict)
-
allele_probabilities
¶ Per allele probability
Returns: {allele_name : prob} Return type: likelihoods (dict)
-
base_confidences
¶
-
base_likelihoods
¶
-
base_probabilities
¶
-
calculate_consensus
(consensus_model, molecular_identifier, out, **model_kwargs)[source]¶ Create consensus read for molecule
Parameters: - consensus_model –
- molecular_identifier (str) – identier for this molecule, will be suffixed to the reference_id
- out (pysam.AlingmentFile) – target bam file
- **model_kwargs – arguments passed to the consensus model
-
can_be_split_into_allele_molecules
¶
-
can_be_yielded
(chromosome, position)[source]¶ Check if the molecule is far enough away from the supplied location to be ejected from a buffer.
Parameters: - chromosome (str) – chromosome / contig of location to test
- position (int) – genomic location of location to test
Returns: True when the molecule is far enough away from the supplied location to be ejected from a buffer.
Return type: can_be_yielded (bool)
-
check_variants
(variants, exclude_other_calls=True)[source]¶ Verify variants in molecule
Parameters: variants (pysam.VariantFile) – Variant file handle to extract variants from Returns: { (chrom,pos) : ( call (str) ): observations (int) } Return type: dict (defaultdict( Counter ))
-
contains_valid_fragment
()[source]¶ Check if an associated fragment exists which returns True for is_valid()
Returns: True when any associated fragment is_valid() Return type: contains_valid_fragment (bool)
-
deduplicate_to_single
(target_bam, read_name, classifier, reference=None)[source]¶ Deduplicate all reads associated to this molecule to a single pseudoread
Parameters: - target_bam (pysam.AlignmentFile) – file to associate the read with
- read_name (str) – name of the pseudoread
- classifier (sklearn classifier) – classifier for consensus prediction
Returns: Pseudo-read containing aggregated information
Return type: read (pysam.AlignedSegment)
-
deduplicate_to_single_CIGAR_spaced
(target_bam, read_name, classifier=None, max_N_span=300, reference=None, **feature_matrix_args)[source]¶ Deduplicate all associated reads to a single pseudoread, when the span is larger than max_N_span the read is split up in multi-segments. Uncovered locations are spaced using N’s in the CIGAR.
Parameters: - target_bam (pysam.AlignmentFile) – file to associate the read with
- read_name (str) – name of the pseudoread
- classifier (sklearn classifier) – classifier for consensus prediction
Returns: reads( list [ pysam.AlignedSegment ] )
-
deduplicate_to_single_CIGAR_spaced_from_dict
(target_bam, read_name, base_call_dict, max_N_span=300)[source]¶ Deduplicate all associated reads to a single pseudoread, when the span is larger than max_N_span the read is split up in multi-segments. Uncovered locations are spaced using N’s in the CIGAR.
Parameters: - target_bam (pysam.AlignmentFile) – file to associate the read with
- read_name (str) – name of the pseudoread
- classifier (sklearn classifier) – classifier for consensus prediction
Returns: reads( list [ pysam.AlignedSegment ] )
-
estimated_max_length
¶ Obtain the estimated size of the fragment, returns None when estimation is not possible Takes into account removed bases (R2) Assumes inwards sequencing orientation
-
get_CIGAR
(reference=None)[source]¶ Get alignment of all associated reads
Returns: reference bases CIGAR : alignment of feature matrix to reference tuples (operation, count) reference(pysam.FastaFile) : reference to fetch reference bases from, if not supplied the MD tag is used Return type: y
-
get_aligned_blocks
()[source]¶ get all consecutive blocks of aligned reference positions
Returns: [ (start, end), (start, end) ] Return type: sorted list of aligned blocks (list)
-
get_aligned_reference_bases_dict
()[source]¶ Get dictionary containing all reference bases to which this molecule aligns :returns: { (chrom,pos) : ‘A’, (chrom,pos):’T’, .. } :rtype: aligned_reference_positions (dict)
-
get_alignment_stats
()[source]¶ Get dictionary containing mean amount clip/insert/deletion/matches per base
Returns: - dictionary {
- clips_per_bp(int),
deletions_per_bp(int),
matches_per_bp(int),
inserts_per_bp(int),
alternative_hits_per_read(int),
}
Return type: cigar_stats(dict)
-
get_alignment_tensor
(max_reads, window_radius=20, centroid=None, mask_centroid=False, refence_backed=False, skip_missing_reads=False)[source]¶ Obtain a tensor representation of the molecule alignment around the given centroid
Parameters: - max_reads (int) – maximum amount of reads returned in the tensor, this will be the amount of rows/4 of the returned feature matrix
- window_radius (int) – radius of bp around centroid
- centroid (int) – center of extracted window, when not specified the cut location of the molecule is used
- mask_centroid (bool) – when True, mask reference base at centroid with N
- refence_backed (bool) – when True the molecules reference is used to emit reference bases instead of the MD tag
Returns: (4*window_radius*2*max_reads) dimensional feature matrix
Return type: tensor_repr(np.array)
-
get_allele
(allele_resolver=None, return_allele_informative_base_dict=False)[source]¶ Obtain the allele(s) this molecule maps to
Parameters: - allele_resolver (singlecellmultiomics.alleleTools.AlleleResolver) – resolver used
- return_allele_informative_base_dict (bool) – return dictionary containing the bases used for allele determination
- defaultdict(list, –
- {'allele1' – [(‘chr18’, 410937, ‘T’), (‘chr18’, 410943, ‘G’), (‘chr18’, 410996, ‘G’), (‘chr18’, 411068, ‘A’)]})
Returns: Set of strings containing associated alleles
Return type: alleles(set( str ))
-
get_allele_likelihoods
()[source]¶ Obtain the allele(s) this molecule maps to
Parameters: - allele_resolver (singlecellmultiomics.alleleTools.AlleleResolver) – resolver used
- return_allele_informative_base_dict (bool) – return dictionary containing the bases used for allele determination
- defaultdict(list, –
- {'allele1' –
[ (‘chr18’, 410937, ‘T’), (‘chr18’, 410943, ‘G’), (‘chr18’, 410996, ‘G’), (‘chr18’, 411068, ‘A’)]})
Returns: likelihood, ‘allele_b’:likelihood }
Return type: { ‘allele_a’
-
get_barcode_sequences
()[source]¶ Obtain (Cell) barcode sequences associated to molecule
Returns: barcode sequence(s) Return type: barcode sequences (set)
-
get_base_calling_feature_matrix
(return_ref_info=False, start=None, end=None, reference=None, NUC_RADIUS=1, USE_RT=True, select_read_groups=None)[source]¶ Obtain feature matrix for base calling
Parameters: - return_ref_info (bool) – return both X and array with feature information
- start (int) – start of range, genomic position
- end (int) – end of range (inclusive), genomic position
- reference (pysam.FastaFile) – reference to fetch reference bases from, if not supplied the MD tag is used
- NUC_RADIUS (int) – generate kmer features target nucleotide
- USE_RT (bool) – use RT reaction features
- select_read_groups (set) – only use reads from these read groups to generate features
-
get_base_calling_feature_matrix_spaced
[source]¶ Obtain a base-calling feature matrix for all reference aligned bases.
Returns: feature matrix y : reference bases CIGAR : alignment of feature matrix to reference tuples (operation, count) reference(pysam.FastaFile) : reference to fetch reference bases from, if not supplied the MD tag is used Return type: X
-
get_base_calling_training_data
(mask_variants=None, might_be_variant_function=None, reference=None, **feature_matrix_args)[source]¶
-
get_base_confidence_dict
()[source]¶ Get dictionary containing base calls per position and the corresponding confidences
Returns: (contig (str), position (int) ) : base (str) : prob correct (list) Return type: obs (dict)
-
get_base_observation_dict
(return_refbases=False, allow_N=False, allow_unsafe=True, one_call_per_frag=False, min_cycle_r1=None, max_cycle_r1=None, min_cycle_r2=None, max_cycle_r2=None, use_cache=True, min_bq=None)[source]¶ Obtain observed bases at reference aligned locations
Parameters: - return_refbases (bool) – return both observed bases and reference bases
- allow_N (bool) – Keep N base calls in observations
- min_cycle_r1 (int) – Exclude read 1 base calls with a cycle smaller than this value (excludes bases which are trimmed before mapping)
- max_cycle_r1 (int) – Exclude read 1 base calls with a cycle larger than this value (excludes bases which are trimmed before mapping)
- min_cycle_r2 (int) – Exclude read 2 base calls with a cycle smaller than this value (excludes bases which are trimmed before mapping)
- max_cycle_r2 (int) – Exclude read 2 base calls with a cycle larger than this value (excludes bases which are trimmed before mapping)
Returns: base (string) : obs (int) } and { genome_location (tuple) : base (string) if return_refbases is True }
Return type: { genome_location (tuple)
-
get_base_observation_dict_NOREF
(allow_N=False)[source]¶ identical to get_base_observation_dict but does not obtain reference bases, has to be used when no MD tag is present :param return_refbases: return both observed bases and reference bases :type return_refbases: bool
Returns: base (string) : obs (int) } and { genome_location (tuple) : base (string) if return_refbases is True } Return type: { genome_location (tuple)
-
get_consensus
(dove_safe: bool = False, only_include_refbase: str = None, allow_N=False, with_probs_and_obs=False, **get_consensus_dictionaries_kwargs)[source]¶ Obtain consensus base-calls for the molecule
-
get_consensus_base
(contig, position, classifier=None)[source]¶ Obtain base call at single position of the molecule
Parameters: - contig (str) – contig to extract base call from
- position (int) – position to extract base call from (zero based)
- classifier (obj) – base calling classifier
Returns: base call, or None when no base call could be made
Return type: base_call (str)
-
get_consensus_base_frequencies
(allow_N=False)[source]¶ Obtain the frequency of bases in the molecule consensus sequence
Returns: Counter containing base frequecies, for example: { ‘A’:10,’T’:3, C:4 } Return type: base_frequencies (Counter)
-
get_consensus_gc_ratio
()[source]¶ Obtain the GC ratio of the molecule consensus sequence
Returns: GC ratio Return type: gc_ratio(float)
-
get_consensus_old
(base_obs=None, classifier=None, store_consensus=True, reuse_cached_consensus=True, allow_unsafe=False, allow_N=False)[source]¶ Get dictionary containing consensus calls in respect to reference. By default mayority voting is used to determine the consensus base. If a classifier is supplied the classifier is used to determine the consensus base.
Parameters: - base_obs (defaultdict(Counter)) – { genome_location (tuple) : base (string) : obs (int) }
- classifier – fitted classifier to use for consensus calling. When no classifier is provided the consensus is determined by majority voting
- store_consensus (bool) – Store the generated consensus for re-use
Returns: {location : base}
Return type: consensus (dict)
-
get_consensus_read
(target_file, read_name, consensus=None, phred_scores=None, cigarstring=None, mdstring=None, start=None, supplementary=False)[source]¶ get pysam.AlignedSegment containing aggregated molecule information
Parameters: - target_file (pysam.AlignmentFile) – File to create the read for
- read_name (str) – name of the read to write
Returns: read(pysam.AlignedSegment)
-
get_cut_site
()[source]¶ For restriction based protocol data, obtain genomic location of cut site
Returns: None if site is not available chromosome (str) position (int) strand (bool)
-
get_feature_vector
()[source]¶ Obtain a feature vector representation of the molecule
Returns: feature_vector(np.array)
-
get_html
(reference=None, consensus=None, show_reference_sequence=True, show_consensus_sequence=True, reference_bases=None)[source]¶ Get html representation of the molecule :returns: Html representation of the molecule :rtype: html_rep(str)
-
get_match_mismatch_frequency
(ignore_locations=None)[source]¶ - Get amount of base-calls matching and mismatching the reference sequence,
- mismatches in every read are counted
Parameters: ignore_locations (iterable(tuple([chrom(str),pos(int)]))) – Locations not to take into account for the match and mismatch frequency Returns: matches(int), mismatches(int)
-
get_max_mapping_qual
()[source]¶ Get max mapping quality of the molecule :returns: max_mapping_qual (float)
-
get_mean_base_quality
(chromosome, position, base=None, not_base=None)[source]¶ Get the mean phred score at the supplied coordinate and base-call
Parameters: - chromosome (str) –
- position (int) –
- base (str) – select only reads with this base
- not_base (str) – select only reads without this base
Returns: mean_phred_score (float)
-
get_mean_cycle
(chromosome, position, base=None, not_base=None)[source]¶ Get the mean cycle at the supplied coordinate and base-call
Parameters: - chromosome (str) –
- position (int) –
- base (str) – select only reads with this base
- not_base (str) – select only reads without this base
Returns: mean cycle for R1 and R2
Return type: mean_cycles (tuple)
-
get_mean_mapping_qual
()[source]¶ Get mean mapping quality of the molecule
Returns: mean_mapping_qual (float)
-
get_mean_rt_fragment_size
()[source]¶ Obtain the mean RT reaction fragment size
Returns: mean_rt_size (float)
-
get_methylated_count
(context=3)[source]¶ Get the total amount of methylated bases
Parameters: context (int) – 3 or 4 base context Returns: sum of methylated bases in contexts Return type: r (Counter)
-
get_methylation_dict
()[source]¶ Obtain methylation dictionary
Returns: (read.reference_name, rpos) : times seen methylated - methylated_state (dict):
- {(read.reference_name, rpos) : 1/0/-1 } 1 for methylated 0 for unmethylated -1 for unknown
Return type: methylated_positions (Counter)
-
get_min_mapping_qual
() → float[source]¶ Get min mapping quality of the molecule :returns: min_mapping_qual (float)
-
get_raw_barcode_sequences
()[source]¶ Obtain (Cell) barcode sequences associated to molecule, not hamming corrected
Returns: barcode sequence(s) Return type: barcode sequences (set)
-
get_rt_reaction_fragment_sizes
(max_N_distance=1)[source]¶ Obtain all RT reaction fragment sizes :returns: rt_sizes (list of ints)
-
get_rt_reactions
() → dict[source]¶ Obtain RT reaction dictionary
Returns: {(primer,pos) : [fragment, fragment..] } Return type: rt_dict (dict)
-
get_safely_aligned_length
()[source]¶ Get the amount of safely aligned bases (excludes primers) :returns:
- Amount of safely aligned bases
- or None when this cannot be determined because both mates are not mapped
Return type: aligned_bases (int)
-
get_sample
()[source]¶ Obtain sample
Returns: Sample associated with the molecule. Usually extracted from SM tag. Calls fragment.get_sample() to obtain the sample Return type: sample (str)
-
get_span_sequence
(reference=None)[source]¶ Obtain the sequence between the start and end of the molecule :param reference: reference to use.
If not specified self.reference is usedReturns: sequence (str)
-
get_strand
()[source]¶ Obtain mapping strand of molecule
Returns: - True,False,None
- True when strand is REVERSE False when strand is FORWARD None when strand is not determined
Return type: strand
-
get_strand_repr
(unknown='?')[source]¶ Get string representation of mapping strand
Parameters: unknown (str) – set what character/string to return when the strand is not available Returns: - forward, - reverse, ? unknown
Return type: strand_repr (str)
-
get_tag_counter
()[source]¶ Obtain a dictionary with tag -> value -> frequency
Returns: { tag(str) : { value(int/str): frequency:(int) } Return type: tag_obs (defaultdict(Counter))
-
get_umi
()[source]¶ Obtain umi of molecule
Returns: return main umi associated with this molecule Return type: umi (str)
-
get_umi_error_rate
()[source]¶ Obtain fraction of fragments that are associated to the molecule with a exact matching UMI vs total amount of associated fragments :returns: exact_matching_fraction (float)
-
has_valid_span
()[source]¶ Check if the span of the molecule is determined
Returns: has_valid_span (bool)
-
is_completely_matching
¶ checks if all cigar operations are M, Returns True when all cigar operations are M, False otherwise
Type: Checks if all associated reads are completely mapped
-
is_multimapped
()[source]¶ Check if the molecule is multimapping
Returns: True when multimapping Return type: is_multimapped (bool)
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
-
library
¶ Associated library
Returns: Library associated with the first read of this molecule Return type: library (str)
-
set_meta
(tag, value)[source]¶ Set meta information to all fragments
Parameters: - tag (str) – 2 letter tag
- value – value to set
Set methylation call tags given a methylation dictionary
This method sets multiple tags in every read associated to the molecule. The tags being set are the bismark_call_tag, every aligned base is annotated with a zZxXhH or “.”, and a tag for both the total methylated C’s and unmethylated C’s
Parameters: - call_dict (dict) – Dictionary containing bismark calls (chrom,pos) : {‘context’:letter,’reference_base’: letter , ‘consensus’: letter, optiona: ‘qual’: pred_score (int) }
- bismark_call_tag (str) – tag to write bismark call string
- total_methylated_tag (str) – tag to write total methylated bases
- total_unmethylated_tag (str) – tag to write total unmethylated bases
- reads (iterable) – reads to write the tags to, when not supplied, the tags are written to all associated reads
Returns: can_be_yielded (bool)
-
set_rejection_reason
(reason, set_qcfail=False)[source]¶ Add rejection reason to all fragments associated to this molecule
Parameters: - reason (str) – rejection reason to set
- set_qcfail (bool) – set qcfail bit to True for all associated reads
-
span_len
¶
-
split_into_allele_molecules
()[source]¶ Split this molecule into multiple molecules, associated to multiple alleles :returns: list :rtype: list_of_molecules
-
update_mapability
(set_mq_zero=False)[source]¶ Update mapability of this molecule. mapping qualities are set to 0 if the mapability_reader returns False for site_is_mapable
The mapability_reader can be set when initiating the molecule, or added later.
Parameters: - set_mq_zero (bool) – set mapping quality of associated reads to 0 when the
- reader returns a bad verdict (mappability) –
Tip
Use createMapabilityIndex.py to create an index to feed to the mapability_reader
-
update_umi
()[source]¶ Set UMI sets self.umi (str) sets the most common umi associated to the molecule
-
write_allele_phasing_information_tag
(allele_resolver=None, tag='ap', reads=None)[source]¶ Write allele phasing information to ap tag
For every associated read a tag wil be written containing: chromosome,postion,base,allele_name|chromosome,postion,base,allele_name|… for all variants found by the AlleleResolver
-
write_pysam
(target_file, consensus=False, no_source_reads=False, consensus_name=None, consensus_read_callback=None, consensus_read_callback_kwargs=None)[source]¶ Write all associated reads to the target file
Parameters: - target_file (pysam.AlignmentFile) – Target file
- consensus (bool) – write consensus
- no_source_reads (bool) – while in consensus mode, don’t write original reads
- consensus_read_callback (function) – this function is called with every consensus read as an arguments
- consensus_read_callback_kwargs (dict) – arguments to pass to the callback function
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
singlecellmultiomics.molecule.molecule.
consensii_default_vector
()[source]¶ a numpy vector with 5 elements initialsed as zeros
-
singlecellmultiomics.molecule.molecule.
detect_alleles
(molecules, contig, position, min_cell_obs=3, base_confidence_threshold=None, classifier=None)[source]¶ Detect the alleles (variable bases) present at the selected location
Parameters: - molecules – generator to extract molecules from
- variant_location (tuple) – (contig, position) zero based location of the location to test
- min_cell_obs (int) – minimum amount of cells containing the allele to be emitted
- confidence_threshold (float) – minimum confidence of concensus base-call to be taken into account
- classifier (obj) – classifier used for consensus call, when no classifier is supplied a mayority vote is used
-
singlecellmultiomics.molecule.molecule.
get_variant_phase
(molecules, contig, position, variant_base, allele_resolver, phasing_ratio_threshold=None)[source]¶
singlecellmultiomics.molecule.nlaIII module¶
-
class
singlecellmultiomics.molecule.nlaIII.
AnnotatedNLAIIIMolecule
(fragment, features, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.featureannotatedmolecule.FeatureAnnotatedMolecule
,singlecellmultiomics.molecule.nlaIII.NlaIIIMolecule
Nla III based Molecule which is annotated with features (genes/exons/introns, .. )
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- features (singlecellmultiomics.features.FeatureContainer) – container to use to obtain features from
- **kwargs – extra args
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
class
singlecellmultiomics.molecule.nlaIII.
NlaIIIMolecule
(fragment, site_has_to_be_mapped=False, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
Nla III based Molecule class
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- **kwargs – extra args
-
get_fragment_span_sequence
(reference=None)[source]¶ Obtain the sequence between the start and end of the molecule :param reference: reference to use.
If not specified self.reference is usedReturns: sequence (str)
-
get_undigested_site_count
(reference=None)[source]¶ Obtain the amount of undigested sites in the span of the molecule
Parameters: reference (pysam.FastaFile) – reference handle Returns: int amount of undigested cut sites in the mapping span of the molecule Return type: undigested_site_count Raises: ValueError
– when the span of the molecule is not properly defined
-
is_valid
(set_rejection_reasons=False, reference=None)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
singlecellmultiomics.molecule.rna module¶
-
class
singlecellmultiomics.molecule.rna.
VASA
(fragment, features, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.featureannotatedmolecule.FeatureAnnotatedMolecule
VASA seq molecule
Parameters: - fragments (singlecellmultiomics.fragment.Fragment) – Fragments to associate to the molecule
- features (singlecellmultiomics.features.FeatureContainer) – container to use to obtain features from
- **kwargs – extra args
singlecellmultiomics.molecule.taps module¶
-
class
singlecellmultiomics.molecule.taps.
AnnotatedTAPSCHICMolecule
(fragments=None, features=None, taps_strand='R', taps=None, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.chic.AnnotatedCHICMolecule
,singlecellmultiomics.molecule.taps.TAPSMolecule
Molecule class for combined TAPS, CHIC and transcriptome
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.taps.
AnnotatedTAPSNlaIIIMolecule
(fragments=None, features=None, taps=None, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.nlaIII.AnnotatedNLAIIIMolecule
,singlecellmultiomics.molecule.taps.TAPSMolecule
Molecule class for combined TAPS, NLAIII and transcriptome
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.taps.
TAPS
(reference=None, reference_variants=None, taps_strand='F', **kwargs)[source]¶ Bases:
object
-
molecule_to_context_call_dict
(molecule)[source]¶ Extract bismark call_string dictionary from a molecule
Parameters: molecule (singlecellmultiomics.molecule.Molecule) – molecule to extract calls from Returns: {(chrom,pos):bismark call letter} Return type: context_call_dict(dict)
-
overlap_tag
= None¶ z unmethylated C in CpG context (CG) Z methylated C in CpG context (CG) x unmethylated C in CHG context ( C[ACT]G ) X methylated C in CHG context ( C[ACT]G ) h unmethylated C in CHH context ( C[ACT][ACT] ) H methylated C in CHH context ( C[ACT][ACT] )
-
position_to_context
(chromosome, position, ref_base, observed_base='N', strand=None, reference=None)[source]¶ Extract bismark call letter from a chromosomal location given the observed base
Parameters: - chromosome (str) – chromosome / contig of location to test
- position (int) – genomic location of location to test
- observed_base (str) – base observed in the read
- strand (int) – mapping strand, False: forward, True: Reverse
Returns: 3 basepair context bismark_letter(str) : bismark call
Return type: context(str)
-
-
class
singlecellmultiomics.molecule.taps.
TAPSCHICMolecule
(fragments=None, taps=None, taps_strand='R', **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.chic.CHICMolecule
,singlecellmultiomics.molecule.taps.TAPSMolecule
Molecule class for combined TAPS and CHIC
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.taps.
TAPSMolecule
(fragments=None, taps=None, classifier=None, taps_strand='F', allow_unsafe_base_calls=False, methylation_consensus_kwargs=None, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.molecule.Molecule
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
-
obtain_methylation_calls
(**get_consensus_dictionaries_kwargs)[source]¶ This methods returns a methylation call dictionary
Returns: (chrom,pos) : {‘consensus’: consensusBase, ‘reference’: referenceBase, ‘call’: call} Return type: mcalls(dict)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.taps.
TAPSNlaIIIMolecule
(fragments=None, taps=None, taps_strand='R', **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.nlaIII.NlaIIIMolecule
,singlecellmultiomics.molecule.taps.TAPSMolecule
Molecule class for combined TAPS and NLAIII
-
is_valid
(set_rejection_reasons=False)[source]¶ Check if the molecule is valid All of the following requirements should be met: - no multimapping - no low mapping mapping_quality (Change molecule.min_max_mapping_quality to set the threshold) - molecule is associated with at least one valid fragment
Parameters: - set_rejection_reasons (bool) – When set to True, all reads get a
- reason (rejection) –
Returns: True when all requirements are met, False otherwise
Return type: is_valid (bool)
Write BAM tags to all reads associated to this molecule
- This function sets the following tags:
- mI : most common umi
- DA : allele
- af : amount of associated fragments
- rt : rt_reaction_index
- rd : rt_duplicate_index
- TR : Total RT reactions
- ap : phasing information (if allele_resolver is set)
- TF : total fragments
- ms : size of the molecule (largest fragment)
Write molecule information to the supplied reads as BAM tags
-
-
class
singlecellmultiomics.molecule.taps.
TAPSPTaggedMolecule
(fragments=None, features=None, taps=None, **kwargs)[source]¶ Bases:
singlecellmultiomics.molecule.taps.AnnotatedTAPSNlaIIIMolecule
Write molecule information to the supplied reads as BAM tags