singlecellmultiomics.bamProcessing package

Submodules

singlecellmultiomics.bamProcessing.alignment_view module

singlecellmultiomics.bamProcessing.alignment_view.cstr(s, color='black', weight=300)[source]
singlecellmultiomics.bamProcessing.alignment_view.visualise_molecule(molecule, reference=None, show_reference=False, margin_bases=0, start_span=None, end_span=None, show_quals=True, R1PrimerLength=4, R2PrimerLength=6, highlight={})[source]

singlecellmultiomics.bamProcessing.bamDuprate module

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.get_random_primer_histogram(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]

This method counts the frequencies of the random primers which present in the supplied molecule_source.

Parameters:
  • molecule_source – iterable yielding Molecules
  • min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
  • max_size (int) – maximum fragment size
  • size_bin_size (int) – bin size in basepairs of the histogram
  • head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.bam_to_histogram(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_feature_start(chromosome, lookup_coordinate, feature_container, lookup_strand=None)[source]
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_hit(lookup_coordinate, hit_start, hit_end, hit_strand, lookup_strand=None, distance_from_begin=True)[source]

singlecellmultiomics.bamProcessing.bamFilter module

singlecellmultiomics.bamProcessing.bamFunctions module

singlecellmultiomics.bamProcessing.bamFunctions.GATK_indel_realign(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]

Re-align a specified region in a bam file using GenomeAnalysisTK

origin_bam (str) : path to extract region from to re-align

target_bam(str) : path to write re-aligned reads to

contig (str) : contig of selected region

region_start (int) : start coordinate of region to re align (1 based)

region_end (int) :end coordiante of selected region (1 based)

known_variants_vcf_path (str) : path to vcf containing reference variants

interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator

interval_write_path (str) : when interval_path is not supplied, write the interval file here

java_cmd (str) : Command to open java

gatk_path (str) : path to GenomeAnalysisTK.jar

reference (str) : path to reference Fasta file

class singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader(mapability_safe_file_path, read_all=False, dont_open=True)[source]

Bases: singlecellmultiomics.utils.prefetch.Prefetcher

instance(arg_update=None)[source]
prefetch(contig, start, end)[source]
site_is_mapable(contig, ds, strand)[source]
singlecellmultiomics.bamProcessing.bamFunctions.add_blacklisted_region(input_header, contig=None, start=None, end=None, source='auto_blacklisted')[source]
singlecellmultiomics.bamProcessing.bamFunctions.add_readgroups_to_header(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]

Add the readgroups in the set readgroups to the header of origin_bam_path.

This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.

Parameters:
  • origin_bam_path (str) – path to bam file to which to add readgroups to header
  • readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
  • target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file
singlecellmultiomics.bamProcessing.bamFunctions.bam_is_processed_by_program(alignments, program='bamtagmultiome')[source]

Check if bam file has been processed by the supplied program

This function checks if there is an entry available in the ‘PG’ block in header of the bam file where PN matches the program supplied.

Parameters:
  • alignments (pysam.AlignmentFile) – handle to bam file
  • program (str) – program to look for
Returns:

the program was used to process the bam file

Return type:

program_present(bool)

singlecellmultiomics.bamProcessing.bamFunctions.get_blacklisted_regions_from_bam(bam_path)[source]
singlecellmultiomics.bamProcessing.bamFunctions.get_contig_size(bam, contig)[source]

Extract the length of a contig from a bam file

Parameters:
  • bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file
  • contig (str) –
Returns:

length (int)

singlecellmultiomics.bamProcessing.bamFunctions.get_contig_sizes(bam)[source]

Extract lengths of all contigs from a bam file

Parameters:bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file
Returns:dict (contig:length (int) )
Return type:contig_lengths
singlecellmultiomics.bamProcessing.bamFunctions.get_contigs(bam: str) → Generator[T_co, T_contra, V_co][source]

Get all contigs listed in the bam file header

Parameters:bam_path (str) – path to bam file or pysam handle
Returns:contigs(str) list
singlecellmultiomics.bamProcessing.bamFunctions.get_contigs_with_reads(bam_path: str, with_length: bool = False) → Generator[T_co, T_contra, V_co][source]

Get all contigs with reads mapped to them

Parameters:
  • bam_path (str) – path to bam file
  • with_length (bool) – also yield the length of the contig
Yields:

contig(str)

singlecellmultiomics.bamProcessing.bamFunctions.get_index_path(bam_path: str)[source]

Obtain path to bam index

Returns:path to the index file, None if not available
Return type:path_to_index(str)
singlecellmultiomics.bamProcessing.bamFunctions.get_r1_counts_per_cell(bam_path, prefix_with_bam=False)[source]

Obtain the amount of unique read1 reads per cell

Parameters:
  • bam_path – str
  • prefix_with_bam (bool) – add bam name as prefix of cell name
Returns:

{sampleA:n_molecules, sampleB:m_molecules, …}

Return type:

cell_obs (Counter)

singlecellmultiomics.bamProcessing.bamFunctions.get_random_locations(bam, n)[source]

Select random locations in the supplied bam file

bam(str or pysam.AlignmentFile)

n(int) : amount of locations to generate

returns: generator of (contig,position) tuples

singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_format(bam)[source]

Obtain read group format

Parameters:bam (str) – path to bam file
Returns:read group format
Return type:type (int)
Raises:ValueError – when read group format cannot be determined
singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_from_read(read, format, with_attr_dict=False)[source]
singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_to_sample_dict(bam)[source]

Obtain a dictionary containing {‘read_group’ : ‘sample’ , …} :param bam_file: :type bam_file: pysam.AlignmentFile

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_from_pysam_alignmentFile(pysam_AlignmentFile, ignore_missing=False)[source]

Extract path to reference from pysam handle

Parameters:
  • pysam_AlignmentFile (pysam.AlignmentFile) –
  • ignore_missing (bool) – Check if the file exists, if not return None
Returns:

path to bam file (if exists or ignore_missing is supplied) or None

Return type:

path

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_path_from_bam(bam, ignore_missing=False)[source]

Extract path to reference from bam file

Parameters:
  • bam (str) – path to bam file
  • ignore_missing (bool) – Check if the file exists, if not return None
Returns:

path to bam file (if exists or ignore_missing is supplied) or None

Return type:

path

singlecellmultiomics.bamProcessing.bamFunctions.get_sample_to_read_group_dict(bam)[source]

Obtain a dictionary containing {‘sample name’ : [‘read groupA’, ‘read group B’], …} :param bam_file: :type bam_file: pysam.AlignmentFile

singlecellmultiomics.bamProcessing.bamFunctions.get_samples_from_bam(bam)[source]

Get a list of samples present in the bam_file

Parameters:bam_file (str) – path to bam file or pysam object
Returns:set containing all sample names
Return type:samples (set)
singlecellmultiomics.bamProcessing.bamFunctions.mate_iter(alignments, **kwargs)[source]
singlecellmultiomics.bamProcessing.bamFunctions.merge_bams(bams: list, output_path: str, threads: int = 4)[source]

Merge bamfiles to output_path

When a single bam file is supplied, the bam file is moved to output_path All input bam files are removed

Parameters:
  • bams – list or tuple containing paths to bam files to merge
  • output_path (str) – target path
Returns:

output_path (str)

singlecellmultiomics.bamProcessing.bamFunctions.random_sample_bam(bam, n, **sample_location_args)[source]

Sample a bam file at random locations

Parameters:
  • bam (pysam.AlignmentFile) – bam to sample from
  • n (int) – Amount of locations to sample
  • *sample_location_args – arguments to pass to sample_location
Returns:

dictionary with amount of reads at the selected locations

Return type:

samples (dict)

singlecellmultiomics.bamProcessing.bamFunctions.replace_bam_header(origin_bam_path, header, target_bam_path=None, header_write_mode='auto')[source]
singlecellmultiomics.bamProcessing.bamFunctions.sam_to_bam(sam_in, bam_out, threads=4)[source]

Convert sam file to sorted bam file

Parameters:
  • sam_in (str) – input sam file path
  • bam_out (str) – output bam file path
singlecellmultiomics.bamProcessing.bamFunctions.sample_location(handle, contig, pos, dedup=True, qc=True)[source]

Obtain dictionary containing the coverage for every sample

Parameters:
  • handle (pysam.AlignmentFile) – File to obtain reads from
  • contig (str) – contig to sample
  • pos (int) – coordinate to sample (zero based)
  • dedup (bool) – ignore duplicated reads
  • qc (bool) – ignore qc failed reads
Returns:

dictionary with amount of reads at the selected locations

Return type:

samples (dict)

singlecellmultiomics.bamProcessing.bamFunctions.sort_and_index(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True, fast_compression=False, prefix='TMP')[source]

Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool

Raises:SamtoolsError when sorting or indexing fails
singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True, input_is_sorted=False, mode='wb', fast_compression=False, temp_prefix='SCMO', **kwargs)[source]

Get writing handle to a sorted bam file

Parameters:
  • write_path (str) – write to a bam file at this path
  • origin_bam (pysam.AlignmentFile) – bam file to copy header for or
  • header (dict) – header for the bam file to write
  • read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
  • local_temp_sort (bool) – create temporary files in current directory
  • input_is_sorted (bool) – Assume the input is sorted, no sorting will be applied
  • mode (str) – Output mode, use wbu for uncompressed writing.
  • **kwargs – arguments to pass to the new pysam.AlignmentFile output handle

Example

>>> # This example assumes molecules are generated by `molecule_iterator`
>>> read_groups = set() # Store unique read groups in this set
>>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out:
>>>     for molecule in molecule_iterator:
>>>         molecule.write_tags()
>>>         molecule.write_pysam(out)
>>>         for fragment in molecule:
>>>             read_groups.add(fragment.get_read_group())
test_output.bam will be written, with read groups defined, sorted and indexed.

Write some pysam reads to a sorted bam file .. rubric:: Example

>>> import pysam
>>> from singlecellmultiomics.bamProcessing import sorted_bam_file
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.reference_name = 'chr2'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'TTGCA'
>>> read_A.query_name= 'READ_A'
>>> read_A.cigarstring = '5M'
>>> read_A.query_qualities = [30] * len(read_A.query_sequence)
>>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1')
>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.reference_name = 'chr1'
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGGG'
>>> read_B.cigarstring = '6M'
>>> read_B.query_name= 'READ_B'
>>> read_B.query_qualities = [30] * len(read_B.query_sequence)
>>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2')
>>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1'))
>>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out:
>>>     out.write(read_A)
>>>     out.write(read_B)
Results in the bam file:
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:1000
@SQ     SN:chr2 LN:1000
@RG     ID:HVKCCBGXB.4.MYLIBRARY_2      SM:MYLIBRARY_2  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_2      PL:ILLUMINA
@RG     ID:HVKCCBGXB.4.MYLIBRARY_1      SM:MYLIBRARY_1  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_1      PL:ILLUMINA
READ_B  0       chr1    101     0       6M      *       0       0       ATCGGG  ??????  RG:Z:HVKCCBGXB.4.MYLIBRARY_2
READ_A  0       chr2    101     0       5M      *       0       0       TTGCA   ?????   RG:Z:HVKCCBGXB.4.MYLIBRARY_1
singlecellmultiomics.bamProcessing.bamFunctions.verify_and_fix_bam(bam_path)[source]

Check if the bam file is not truncated and indexed. Also regenerates index when its older than the bam file If not, apply index

Parameters:bam_path (str) – path to bam file
Raises:ValueError – when the file is corrupted and a fix could not be applied
singlecellmultiomics.bamProcessing.bamFunctions.write_program_tag(input_header, program_name, command_line, description, version)[source]

Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str

singlecellmultiomics.bamProcessing.bamMappingRate module

singlecellmultiomics.bamProcessing.bamPlateVisualisation module

singlecellmultiomics.bamProcessing.bamPlotRTstats module

singlecellmultiomics.bamProcessing.bamPlotRTstats.nlaIII_molecule_acceptance_function(molecule)[source]

singlecellmultiomics.bamProcessing.bamTabulator module

singlecellmultiomics.bamProcessing.bamToCountTable module

singlecellmultiomics.bamProcessing.bamToCountTable.assignReads(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[], blacklist_dic=None)[source]
singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_bins(point, bin_size, sliding_increment)[source]

Convert a single value to a list of overlapping bins

Parameters:
  • point (int) – coordinate to look up
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:

list

Return type:

[(bin_start,bin_end), . ]

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]

Convert a single value to a list of overlapping bins

Parameters:
  • point (int) – coordinate to look up
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:

  • start (int) – the start coordinate of the first overlapping bin
  • end (int) – the end of the last overlapping bin
  • start_id (int) – the index of the first overlapping bin
  • end_id (int) – the index of the last overlapping bin

singlecellmultiomics.bamProcessing.bamToCountTable.create_count_table(args, return_df=False)[source]
singlecellmultiomics.bamProcessing.bamToCountTable.readTag(read, tag, defective='None')[source]
singlecellmultiomics.bamProcessing.bamToCountTable.read_has_alternative_hits_to_non_alts(read)[source]
singlecellmultiomics.bamProcessing.bamToCountTable.read_should_be_counted(read, args, blacklist_dic=None)[source]

Check if a read should be counted given the filter arguments

Parameters:read (pysam.AlignedSegment or None) – read to check if it should be counted
Returns:
Return type:bool
singlecellmultiomics.bamProcessing.bamToCountTable.tagToHumanName(tag, TagDefinitions)[source]

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber.obtain_approximate_reference_cut_position(site, contig, alt_spans)[source]

singlecellmultiomics.bamProcessing.bamToRNACounts module

singlecellmultiomics.bamProcessing.structureTensor module

singlecellmultiomics.bamProcessing.variantStats module

singlecellmultiomics.bamProcessing.variantStats.obtain_variant_statistics(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]

Obtain statistics from known gsnv-phased variant location

Parameters:
  • alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
  • ( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
  • ( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
  • cell_call_data (collections.defaultdict(dict)) –
  • haplotype_scores (dict) –
  • reference (pysamiterators.CachedFasta) –
  • chromosome (str) –
  • ssnv_position (int) – zero based ssnv position
  • gsnv_position (int) – zero based gsnv position
  • WINDOW_RADIUS (int) –
  • out (pysam.AlignmentFile) –
  • min_read_obs (int) –
  • read_groups (set) –
  • umi_hamming_distance (int) –
  • args

Module contents