singlecellmultiomics.fragment package¶

Submodules¶

singlecellmultiomics.fragment.chic module¶

class singlecellmultiomics.fragment.chic.CHICFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=0, umi_hamming_distance=1, invert_strand=False, no_umi_cigar_processing=False, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

get_site_location()[source]¶

identify_site()[source]¶

is_valid()[source]¶

set_site(site_chrom, site_pos, site_strand=None, is_trimmed=False)[source]¶

singlecellmultiomics.fragment.fragment module¶

class singlecellmultiomics.fragment.fragment.FeatureCountsFullLengthFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=10000, umi_hamming_distance=1, invert_strand=False, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.FeatureCountsSingleEndFragment

Class for fragments annotated with featureCounts, with multiple reads covering a gene

Extracts annotated gene from the XT tag. Deduplicates using XT tag and UMI Reads without XT tag are flagged as invalid

class singlecellmultiomics.fragment.fragment.FeatureCountsSingleEndFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=100000, umi_hamming_distance=1, invert_strand=False, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

Class for fragments annotated with featureCounts

Extracts annotated gene from the XT tag. Deduplicates using XT tag and UMI Reads without XT tag are flagged as invalid

identify_gene()[source]¶

is_valid()[source]¶

class singlecellmultiomics.fragment.fragment.Fragment(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]¶

Bases: object

This class holds 1 or more reads which are derived from the same cluster

Example

Generate a Fragment with a single associated read:

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> read = pysam.AlignedSegment()
>>> read.reference_start = 30
>>> read.query_name = 'R1'
>>> read.mapping_quality = 30
>>> read.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> read.query_sequence = "CATGTATCCGGGCTTAA"
>>> read.query_qualities = [30] * len(read.query_sequence)
>>> read.cigarstring = f'{len(read.query_sequence)}M'
>>> Fragment([read])
Fragment:
    sample:CELL_1
    umi:CAT
    span:None 30-47
    strand:+
    has R1: yes
    has R2: no
    randomer trimmed: no

Warning

Make sure the RX and SM tags of the read are set! If these are encoded in the read name, use singlecellmultiomics.universalBamTagger.customreads for conversion.

R1¶

R2¶

__eq__(other)[source]¶

Check equivalence between two Fragments or Fragment and Molecule.

Parameters:	other (Fragment or Molecule) – object to compare against

Returns: is_eq (bool) : True when the other object is (likely) derived from the same molecule

Example

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> # Create sam file to write some reads to:
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_A.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> # By default the molecule assignment is done based on the mapping location of read 1:
>>> read_A.reference_name = 'chr1'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'ATCGGG'
>>> read_A.cigarstring = '6M'

>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.set_tag('SM','CELL_1')
>>> read_B.set_tag('RX','CAT')
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGG'
>>> read_B.cigarstring = '5M'

>>> frag_A = Fragment([read_A],umi_hamming_distance=0)
>>> frag_A
    Fragment:
        sample:CELL_1
        umi:CAT
        span:chr1 100-106
        strand:+
        has R1: yes
        has R2: no
        randomer trimmed: no
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A == frag_B
True
# Fragment A and fragment B belong to the same molecule,
# the UMI is identical, the starting position of R1 is identical and
# the sample name matches

When we move one of the reads, the Fragments are not equivalent any more

Example

>>> read_B.reference_start = 150
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A == frag_B
False

Except if the difference <= the assignment_radius

Example

>>> read_B.reference_start = 150
>>> read_A.reference_start = 100
>>> frag_B = Fragment([read_B],assignment_radius=300)
>>> frag_A = Fragment([read_A],assignment_radius=300)
>>> frag_A == frag_B
True

When the UMI’s are too far apart, the eq function returns False

Example

>>> read_B.reference_start = 100
>>> read_A.reference_start = 100
>>> read_A.set_tag('RX','GGG')
>>> frag_B = Fragment([read_B])
>>> frag_A = Fragment([read_A])
>>> frag_A == frag_B
False

When the sample of the Fragments are not identical, the eq function returns False

Example

>>> read_B.reference_start = 100
>>> read_A.reference_start = 100
>>> read_A.set_tag('RX','AAA')
>>> read_B.set_tag('RX','AAA')
>>> read_B.set_tag('SM', 'CELL_2' )
>>> frag_B = Fragment([read_B])
>>> frag_A = Fragment([read_A])
>>> frag_A == frag_B
False

__getitem__(index)[source]¶

Get a read from the fragment

Parameters:	index (int) – 0 : Read 1 1: Read 2 ..

__len__()[source]¶

Obtain the amount of associated reads to the fragment

Returns:	assocoiated reads (int)

aligned_length¶

estimated_length¶: Obtain the estimated size of the fragment, returns None when estimation is not possible Takes into account removed bases (R2_primer_length attribute) Assumes inwards sequencing orientation, except when self.single_end is set

get_R1()[source]¶

Obtain the AlignedSegment of read 1 of the fragment

Returns:	Read 1 of the fragment, returns None when R1 is not mapped
Return type:	R1 (pysam.AlignedSegment)

get_R2()[source]¶

Obtain the AlignedSegment of read 2 of the fragment

Returns:	Read 2 of the fragment, returns None when R2 is not mapped
Return type:	R1 (pysam.AlignedSegment)

get_consensus(only_include_refbase: str = None, dove_safe: bool = False, **get_consensus_dictionaries_kwargs) → dict[source]¶

a dictionary of (reference_pos) : (qbase, quality, reference_base) tuples

Parameters:	only_include_refbase (str) – Only report bases aligned to this reference base, uppercase only dove_safe (bool) – Only report bases supported within R1 and R2 start and end coordinates
Returns:	{reference_position: (qbase, quality)
Return type:	consensus(dict)

get_fragment_size()[source]¶

get_html(chromosome=None, span_start=None, span_end=None, show_read1=None, show_read2=None)[source]¶

Get HTML representation of the fragment

Parameters:	chromosome (str) – chromosome to view span_start (int) – first base to show span_end (int) – last base to show show_read1 (bool) – show read1 show_read2 (bool) – show read2
Returns:	html representation of the fragment
Return type:	html(string)

get_random_primer_hash()[source]¶

Obtain hash describing the random primer this assumes the random primer is on the end of R2 and has a length of self.R2_primer_length When the rS tag is set, the value of this tag is used as random primer sequence Returns None,None when the random primer cannot be described

Returns:	reference_name (str) or None reference_start (int) : Int or None sequence (str) : Int or None

get_read_group(with_attr_dict=False) → str[source]¶

get_sample() → str[source]¶

Obtain the sample name associated with the fragment The sample name is extracted from the SM tag of any of the associated reads.

Returns:	sample name (str)

get_span() → tuple[source]¶

get_strand()[source]¶

Obtain strand

Returns:	False for Forward, True for reverse
Return type:	strand (bool)

get_strand_repr()[source]¶

get_umi()[source]¶

Get the UMI sequence associated to this fragment

Returns:	umi(str)

has_R1()[source]¶

Check if the fragment has an associated read 1

Returns:	has_r1 (bool)

has_R2()[source]¶

Check if the fragment has an associated read 2

Returns:	has_r2 (bool)

has_valid_span() → bool[source]¶

identify_strand()[source]¶

is_valid()[source]¶

remove_meta(key)[source]¶

set_duplicate(is_duplicate)[source]¶: Define this fragment as duplicate, sets the corresponding bam bit flag :param value: is_duplicate :type value: bool

set_meta(key, value, as_set=False)[source]¶

set_recognized_sequence(seq)[source]¶

set_rejection_reason(reason, set_qcfail=False)[source]¶

set_sample(sample: str = None, library_name: str = None)[source]¶: Force sample name or obtain sample name from associated reads

set_strand(strand: bool)[source]¶

Set mapping strand

Parameters:	strand (bool) – False for Forward, True for reverse

umi_eq(other)[source]¶

Hamming distance measurement to another Fragment or Molecule,

Returns :: is_close (bool) : returns True when the hamming distance between the two objects <= umi_hamming_distance

Example

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> # Create reads (read_A, and read_B), they both belong to the same
>>> # cell and have 1 hamming distance between their UMI's
>>> read_A = pysam.AlignedSegment()
>>> read_A.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_A.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> read_B = pysam.AlignedSegment()
>>> read_B.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_B.set_tag('RX','CAG') # The UMI is extracted from the RX tag

>>> # Create fragment objects for read_A and B:
>>> frag_A = Fragment([read_A],umi_hamming_distance=0)
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A.umi_eq(frag_B) # This returns False, the distance is 1, which is higher than 0 (umi_hamming_distance)
False

>>> frag_A = Fragment([read_A],umi_hamming_distance=1)
>>> frag_B = Fragment([read_B],umi_hamming_distance=1)
>>> frag_A.umi_eq(frag_B) # This returns True, the distance is 1, which is the (umi_hamming_distance)
True

update_span()[source]¶

Update the span (the location the fragment maps to) stored in Fragment

The span is a single stretch of coordinates on a single contig. The contig is determined by the reference_name assocated to the first mapped read in self.reads

This calculation assumes the reads are sequenced inwards and dove-tails of the molecule cannot be trusted

update_umi()[source]¶: Extract umi from ‘RX’ tag and store the UMI in the Fragment object

write_pysam(pysam_handle)[source]¶

Write all associated reads to the target file

Parameters:	target_file (pysam.AlignmentFile) – Target file

write_tags()[source]¶

write_tensor(chromosome=None, span_start=None, span_end=None, height=30, index_start=0, base_content_table=None, base_mismatches_table=None, base_indel_table=None, base_qual_table=None, base_clip_table=None, mask_reference_bases=None, reference=None, skip_missing_reads=False)[source]¶

Write tensor representation of the fragment to supplied 2d arrays

Parameters:

chromosome (str) – chromosome to view
span_start (int) – first base to show
span_end (int) – last base to show
height (int) – height of the tensor (reads)
index_start (int) – start writing tensor at this row
base_content_table (np.array) – 2d array to write base contents to
base_indel_table (np.array) – 2d array to write indel information to
base_content_table – 2d array to write base contents to
base_content_table – 2d array to write base contents to
mask_reference_bases (set) – mask reference bases in this set with a N set( (chrom,pos), … )
reference (pysam.FastaFile) – Handle to reference file to use instead of MD tag. If None: MD tag is used.
skip_missing_reads (bool) – when enabled only existing (non None) reads are added to the tensor. Use this option when mapping single-end

Returns:

None

class singlecellmultiomics.fragment.fragment.FragmentStartPosition(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

Fragment without a specific location on a contig

get_site_location()[source]¶

class singlecellmultiomics.fragment.fragment.FragmentWithoutPosition(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

Fragment without a specific location on a contig

get_site_location()[source]¶

class singlecellmultiomics.fragment.fragment.FragmentWithoutUMI(reads, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

Use this class when no UMI information is available

set_umi(**kwargs)[source]¶

class singlecellmultiomics.fragment.fragment.SingleEndTranscriptFragment(reads, features, assignment_radius=0, stranded=None, capture_locations=False, auto_set_intron_exon_features=True, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment, singlecellmultiomics.features.features.FeatureAnnotatedObject

annotate()[source]¶

get_site_location()[source]¶

is_valid()[source]¶

write_tags()[source]¶

singlecellmultiomics.fragment.nlaIII module¶

class singlecellmultiomics.fragment.nlaIII.NlaIIIFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=1000, umi_hamming_distance=1, invert_strand=False, check_motif=True, no_overhang=False, cut_location_offset=-4, reference=None, allow_cycle_shift=False, use_allele_tag=False, no_umi_cigar_processing=False, **kwargs)[source]¶

Bases: singlecellmultiomics.fragment.fragment.Fragment

get_safe_span(allow_unsafe=True)[source]¶

get_site_location()[source]¶

get_undigested_site_count(reference_handle)[source]¶

Obtain the amount of undigested sites in the span of the fragment

Parameters:	reference (pysam.FastaFile or similiar) –
Returns:	undigested_site_count (int) – amount of undigested cut sites in the mapping span of the fragment Raises ——- ValueError (when the span of the molecule is not properly defined)

identify_site()[source]¶

is_valid()[source]¶

set_site(site_chrom, site_pos, site_strand=None, valid=True)[source]¶

singlecellmultiomics.fragment package¶

Submodules¶

singlecellmultiomics.fragment.chic module¶

singlecellmultiomics.fragment.fragment module¶

singlecellmultiomics.fragment.nlaIII module¶

Module contents¶