singlecellmultiomics.fragment package

Submodules

singlecellmultiomics.fragment.chic module

class singlecellmultiomics.fragment.chic.CHICFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=0, umi_hamming_distance=1, invert_strand=False, no_umi_cigar_processing=False, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

get_site_location()[source]
identify_site()[source]
is_valid()[source]
set_site(site_chrom, site_pos, site_strand=None, is_trimmed=False)[source]

singlecellmultiomics.fragment.fragment module

class singlecellmultiomics.fragment.fragment.FeatureCountsFullLengthFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=10000, umi_hamming_distance=1, invert_strand=False, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.FeatureCountsSingleEndFragment

Class for fragments annotated with featureCounts, with multiple reads covering a gene

Extracts annotated gene from the XT tag. Deduplicates using XT tag and UMI Reads without XT tag are flagged as invalid

class singlecellmultiomics.fragment.fragment.FeatureCountsSingleEndFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=100000, umi_hamming_distance=1, invert_strand=False, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

Class for fragments annotated with featureCounts

Extracts annotated gene from the XT tag. Deduplicates using XT tag and UMI Reads without XT tag are flagged as invalid

identify_gene()[source]
is_valid()[source]
class singlecellmultiomics.fragment.fragment.Fragment(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]

Bases: object

This class holds 1 or more reads which are derived from the same cluster

Example

Generate a Fragment with a single associated read:

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> read = pysam.AlignedSegment()
>>> read.reference_start = 30
>>> read.query_name = 'R1'
>>> read.mapping_quality = 30
>>> read.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> read.query_sequence = "CATGTATCCGGGCTTAA"
>>> read.query_qualities = [30] * len(read.query_sequence)
>>> read.cigarstring = f'{len(read.query_sequence)}M'
>>> Fragment([read])
Fragment:
    sample:CELL_1
    umi:CAT
    span:None 30-47
    strand:+
    has R1: yes
    has R2: no
    randomer trimmed: no

Warning

Make sure the RX and SM tags of the read are set! If these are encoded in the read name, use singlecellmultiomics.universalBamTagger.customreads for conversion.

R1
R2
__eq__(other)[source]

Check equivalence between two Fragments or Fragment and Molecule.

Parameters:other (Fragment or Molecule) – object to compare against
Returns
is_eq (bool) : True when the other object is (likely) derived from the same molecule

Example

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> # Create sam file to write some reads to:
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_A.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> # By default the molecule assignment is done based on the mapping location of read 1:
>>> read_A.reference_name = 'chr1'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'ATCGGG'
>>> read_A.cigarstring = '6M'
>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.set_tag('SM','CELL_1')
>>> read_B.set_tag('RX','CAT')
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGG'
>>> read_B.cigarstring = '5M'
>>> frag_A = Fragment([read_A],umi_hamming_distance=0)
>>> frag_A
    Fragment:
        sample:CELL_1
        umi:CAT
        span:chr1 100-106
        strand:+
        has R1: yes
        has R2: no
        randomer trimmed: no
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A == frag_B
True
# Fragment A and fragment B belong to the same molecule,
# the UMI is identical, the starting position of R1 is identical and
# the sample name matches

When we move one of the reads, the Fragments are not equivalent any more

Example

>>> read_B.reference_start = 150
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A == frag_B
False

Except if the difference <= the assignment_radius

Example

>>> read_B.reference_start = 150
>>> read_A.reference_start = 100
>>> frag_B = Fragment([read_B],assignment_radius=300)
>>> frag_A = Fragment([read_A],assignment_radius=300)
>>> frag_A == frag_B
True
When the UMI’s are too far apart, the eq function returns False

Example

>>> read_B.reference_start = 100
>>> read_A.reference_start = 100
>>> read_A.set_tag('RX','GGG')
>>> frag_B = Fragment([read_B])
>>> frag_A = Fragment([read_A])
>>> frag_A == frag_B
False

When the sample of the Fragments are not identical, the eq function returns False

Example

>>> read_B.reference_start = 100
>>> read_A.reference_start = 100
>>> read_A.set_tag('RX','AAA')
>>> read_B.set_tag('RX','AAA')
>>> read_B.set_tag('SM', 'CELL_2' )
>>> frag_B = Fragment([read_B])
>>> frag_A = Fragment([read_A])
>>> frag_A == frag_B
False
__getitem__(index)[source]

Get a read from the fragment

Parameters:index (int) – 0 : Read 1 1: Read 2 ..
__len__()[source]

Obtain the amount of associated reads to the fragment

Returns:assocoiated reads (int)
aligned_length
estimated_length

Obtain the estimated size of the fragment, returns None when estimation is not possible Takes into account removed bases (R2_primer_length attribute) Assumes inwards sequencing orientation, except when self.single_end is set

get_R1()[source]

Obtain the AlignedSegment of read 1 of the fragment

Returns:
Read 1 of the fragment, returns None
when R1 is not mapped
Return type:R1 (pysam.AlignedSegment)
get_R2()[source]

Obtain the AlignedSegment of read 2 of the fragment

Returns:
Read 2 of the fragment, returns None
when R2 is not mapped
Return type:R1 (pysam.AlignedSegment)
get_consensus(only_include_refbase: str = None, dove_safe: bool = False, **get_consensus_dictionaries_kwargs) → dict[source]

a dictionary of (reference_pos) : (qbase, quality, reference_base) tuples

Parameters:
  • only_include_refbase (str) – Only report bases aligned to this reference base, uppercase only
  • dove_safe (bool) – Only report bases supported within R1 and R2 start and end coordinates
Returns:

{reference_position: (qbase, quality)

Return type:

consensus(dict)

get_fragment_size()[source]
get_html(chromosome=None, span_start=None, span_end=None, show_read1=None, show_read2=None)[source]

Get HTML representation of the fragment

Parameters:
  • chromosome (str) – chromosome to view
  • span_start (int) – first base to show
  • span_end (int) – last base to show
  • show_read1 (bool) – show read1
  • show_read2 (bool) – show read2
Returns:

html representation of the fragment

Return type:

html(string)

get_random_primer_hash()[source]

Obtain hash describing the random primer this assumes the random primer is on the end of R2 and has a length of self.R2_primer_length When the rS tag is set, the value of this tag is used as random primer sequence Returns None,None when the random primer cannot be described

Returns:reference_name (str) or None reference_start (int) : Int or None sequence (str) : Int or None
get_read_group(with_attr_dict=False) → str[source]
get_sample() → str[source]

Obtain the sample name associated with the fragment The sample name is extracted from the SM tag of any of the associated reads.

Returns:sample name (str)
get_span() → tuple[source]
get_strand()[source]

Obtain strand

Returns:False for Forward, True for reverse
Return type:strand (bool)
get_strand_repr()[source]
get_umi()[source]

Get the UMI sequence associated to this fragment

Returns:umi(str)
has_R1()[source]

Check if the fragment has an associated read 1

Returns:has_r1 (bool)
has_R2()[source]

Check if the fragment has an associated read 2

Returns:has_r2 (bool)
has_valid_span() → bool[source]
identify_strand()[source]
is_valid()[source]
remove_meta(key)[source]
set_duplicate(is_duplicate)[source]

Define this fragment as duplicate, sets the corresponding bam bit flag :param value: is_duplicate :type value: bool

set_meta(key, value, as_set=False)[source]
set_recognized_sequence(seq)[source]
set_rejection_reason(reason, set_qcfail=False)[source]
set_sample(sample: str = None, library_name: str = None)[source]

Force sample name or obtain sample name from associated reads

set_strand(strand: bool)[source]

Set mapping strand

Parameters:strand (bool) – False for Forward, True for reverse
umi_eq(other)[source]

Hamming distance measurement to another Fragment or Molecule,

Returns :
is_close (bool) : returns True when the hamming distance between the two objects <= umi_hamming_distance

Example

>>> from singlecellmultiomics.molecule import Molecule
>>> from singlecellmultiomics.fragment import Fragment
>>> import pysam
>>> # Create reads (read_A, and read_B), they both belong to the same
>>> # cell and have 1 hamming distance between their UMI's
>>> read_A = pysam.AlignedSegment()
>>> read_A.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_A.set_tag('RX','CAT') # The UMI is extracted from the RX tag
>>> read_B = pysam.AlignedSegment()
>>> read_B.set_tag('SM','CELL_1') # The sample to which the sample belongs is extracted from the SM tag
>>> read_B.set_tag('RX','CAG') # The UMI is extracted from the RX tag
>>> # Create fragment objects for read_A and B:
>>> frag_A = Fragment([read_A],umi_hamming_distance=0)
>>> frag_B = Fragment([read_B],umi_hamming_distance=0)
>>> frag_A.umi_eq(frag_B) # This returns False, the distance is 1, which is higher than 0 (umi_hamming_distance)
False
>>> frag_A = Fragment([read_A],umi_hamming_distance=1)
>>> frag_B = Fragment([read_B],umi_hamming_distance=1)
>>> frag_A.umi_eq(frag_B) # This returns True, the distance is 1, which is the (umi_hamming_distance)
True
update_span()[source]

Update the span (the location the fragment maps to) stored in Fragment

The span is a single stretch of coordinates on a single contig. The contig is determined by the reference_name assocated to the first mapped read in self.reads

This calculation assumes the reads are sequenced inwards and dove-tails of the molecule cannot be trusted

update_umi()[source]

Extract umi from ‘RX’ tag and store the UMI in the Fragment object

write_pysam(pysam_handle)[source]

Write all associated reads to the target file

Parameters:target_file (pysam.AlignmentFile) – Target file
write_tags()[source]
write_tensor(chromosome=None, span_start=None, span_end=None, height=30, index_start=0, base_content_table=None, base_mismatches_table=None, base_indel_table=None, base_qual_table=None, base_clip_table=None, mask_reference_bases=None, reference=None, skip_missing_reads=False)[source]

Write tensor representation of the fragment to supplied 2d arrays

Parameters:
  • chromosome (str) – chromosome to view
  • span_start (int) – first base to show
  • span_end (int) – last base to show
  • height (int) – height of the tensor (reads)
  • index_start (int) – start writing tensor at this row
  • base_content_table (np.array) – 2d array to write base contents to
  • base_indel_table (np.array) – 2d array to write indel information to
  • base_content_table – 2d array to write base contents to
  • base_content_table – 2d array to write base contents to
  • mask_reference_bases (set) – mask reference bases in this set with a N set( (chrom,pos), … )
  • reference (pysam.FastaFile) – Handle to reference file to use instead of MD tag. If None: MD tag is used.
  • skip_missing_reads (bool) – when enabled only existing (non None) reads are added to the tensor. Use this option when mapping single-end
Returns:

None

class singlecellmultiomics.fragment.fragment.FragmentStartPosition(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

Fragment without a specific location on a contig

get_site_location()[source]
class singlecellmultiomics.fragment.fragment.FragmentWithoutPosition(reads, assignment_radius: int = 0, umi_hamming_distance: int = 1, R1_primer_length: int = 0, R2_primer_length: int = 6, tag_definitions: list = None, max_fragment_size: int = None, mapping_dir=(False, True), max_NUC_stretch: int = None, read_group_format: int = 0, library_name: str = None, single_end: bool = False)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

Fragment without a specific location on a contig

get_site_location()[source]
class singlecellmultiomics.fragment.fragment.FragmentWithoutUMI(reads, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

Use this class when no UMI information is available

set_umi(**kwargs)[source]
class singlecellmultiomics.fragment.fragment.SingleEndTranscriptFragment(reads, features, assignment_radius=0, stranded=None, capture_locations=False, auto_set_intron_exon_features=True, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment, singlecellmultiomics.features.features.FeatureAnnotatedObject

annotate()[source]
get_site_location()[source]
is_valid()[source]
write_tags()[source]

singlecellmultiomics.fragment.nlaIII module

class singlecellmultiomics.fragment.nlaIII.NlaIIIFragment(reads, R1_primer_length=4, R2_primer_length=6, assignment_radius=1000, umi_hamming_distance=1, invert_strand=False, check_motif=True, no_overhang=False, cut_location_offset=-4, reference=None, allow_cycle_shift=False, use_allele_tag=False, no_umi_cigar_processing=False, **kwargs)[source]

Bases: singlecellmultiomics.fragment.fragment.Fragment

get_safe_span(allow_unsafe=True)[source]
get_site_location()[source]
get_undigested_site_count(reference_handle)[source]

Obtain the amount of undigested sites in the span of the fragment

Parameters:reference (pysam.FastaFile or similiar) –
Returns:
  • undigested_site_count (int) – amount of undigested cut sites in the mapping span of the fragment
  • Raises
  • ——-
  • ValueError (when the span of the molecule is not properly defined)
identify_site()[source]
is_valid()[source]
set_site(site_chrom, site_pos, site_strand=None, valid=True)[source]

Module contents