singlecellmultiomics.bamProcessing package¶

Submodules¶

singlecellmultiomics.bamProcessing.alignment_view module¶

singlecellmultiomics.bamProcessing.alignment_view.cstr(s, color='black', weight=300)[source]¶

singlecellmultiomics.bamProcessing.alignment_view.visualise_molecule(molecule, reference=None, show_reference=False, margin_bases=0, start_span=None, end_span=None, show_quals=True, R1PrimerLength=4, R2PrimerLength=6, highlight={})[source]¶

singlecellmultiomics.bamProcessing.bamDuprate module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.get_random_primer_histogram(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]¶

This method counts the frequencies of the random primers which present in the supplied molecule_source.

Parameters:

molecule_source – iterable yielding Molecules
min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
max_size (int) – maximum fragment size
size_bin_size (int) – bin size in basepairs of the histogram
head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.bam_to_histogram(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_feature_start(chromosome, lookup_coordinate, feature_container, lookup_strand=None)[source]¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_hit(lookup_coordinate, hit_start, hit_end, hit_strand, lookup_strand=None, distance_from_begin=True)[source]¶

singlecellmultiomics.bamProcessing.bamFilter module¶

singlecellmultiomics.bamProcessing.bamFunctions module¶

singlecellmultiomics.bamProcessing.bamFunctions.GATK_indel_realign(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]¶

Re-align a specified region in a bam file using GenomeAnalysisTK

origin_bam (str) : path to extract region from to re-align

target_bam(str) : path to write re-aligned reads to

contig (str) : contig of selected region

region_start (int) : start coordinate of region to re align (1 based)

region_end (int) :end coordiante of selected region (1 based)

known_variants_vcf_path (str) : path to vcf containing reference variants

interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator

interval_write_path (str) : when interval_path is not supplied, write the interval file here

java_cmd (str) : Command to open java

gatk_path (str) : path to GenomeAnalysisTK.jar

reference (str) : path to reference Fasta file

class singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader(mapability_safe_file_path, read_all=False, dont_open=True)[source]¶

Bases: singlecellmultiomics.utils.prefetch.Prefetcher

instance(arg_update=None)[source]¶

prefetch(contig, start, end)[source]¶

site_is_mapable(contig, ds, strand)[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.add_blacklisted_region(input_header, contig=None, start=None, end=None, source='auto_blacklisted')[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.add_readgroups_to_header(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]¶

Add the readgroups in the set readgroups to the header of origin_bam_path.

This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.

Parameters:

origin_bam_path (str) – path to bam file to which to add readgroups to header
readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file

singlecellmultiomics.bamProcessing.bamFunctions.bam_is_processed_by_program(alignments, program='bamtagmultiome')[source]¶

Check if bam file has been processed by the supplied program

This function checks if there is an entry available in the ‘PG’ block in header of the bam file where PN matches the program supplied.

Parameters:	alignments (pysam.AlignmentFile) – handle to bam file program (str) – program to look for
Returns:	the program was used to process the bam file
Return type:	program_present(bool)

singlecellmultiomics.bamProcessing.bamFunctions.get_blacklisted_regions_from_bam(bam_path)[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.get_contig_size(bam, contig)[source]¶

Extract the length of a contig from a bam file

Parameters:	bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file contig (str) –
Returns:	length (int)

singlecellmultiomics.bamProcessing.bamFunctions.get_contig_sizes(bam)[source]¶

Extract lengths of all contigs from a bam file

Parameters:	bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file
Returns:	dict (contig:length (int) )
Return type:	contig_lengths

singlecellmultiomics.bamProcessing.bamFunctions.get_contigs(bam: str) → Generator[T_co, T_contra, V_co][source]¶

Get all contigs listed in the bam file header

Parameters:	bam_path (str) – path to bam file or pysam handle
Returns:	contigs(str) list

singlecellmultiomics.bamProcessing.bamFunctions.get_contigs_with_reads(bam_path: str, with_length: bool = False) → Generator[T_co, T_contra, V_co][source]¶

Get all contigs with reads mapped to them

Parameters:	bam_path (str) – path to bam file with_length (bool) – also yield the length of the contig
Yields:	contig(str)

singlecellmultiomics.bamProcessing.bamFunctions.get_index_path(bam_path: str)[source]¶

Obtain path to bam index

Returns:	path to the index file, None if not available
Return type:	path_to_index(str)

singlecellmultiomics.bamProcessing.bamFunctions.get_r1_counts_per_cell(bam_path, prefix_with_bam=False)[source]¶

Obtain the amount of unique read1 reads per cell

Parameters:	bam_path – str prefix_with_bam (bool) – add bam name as prefix of cell name
Returns:	{sampleA:n_molecules, sampleB:m_molecules, …}
Return type:	cell_obs (Counter)

singlecellmultiomics.bamProcessing.bamFunctions.get_random_locations(bam, n)[source]¶

Select random locations in the supplied bam file

bam(str or pysam.AlignmentFile)

n(int) : amount of locations to generate

returns: generator of (contig,position) tuples

singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_format(bam)[source]¶

Obtain read group format

Parameters:	bam (str) – path to bam file
Returns:	read group format
Return type:	type (int)
Raises:	`ValueError` – when read group format cannot be determined

singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_from_read(read, format, with_attr_dict=False)[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.get_read_group_to_sample_dict(bam)[source]¶: Obtain a dictionary containing {‘read_group’ : ‘sample’ , …} :param bam_file: :type bam_file: pysam.AlignmentFile

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_from_pysam_alignmentFile(pysam_AlignmentFile, ignore_missing=False)[source]¶

Extract path to reference from pysam handle

Parameters:	pysam_AlignmentFile (pysam.AlignmentFile) – ignore_missing (bool) – Check if the file exists, if not return None
Returns:	path to bam file (if exists or ignore_missing is supplied) or None
Return type:	path

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_path_from_bam(bam, ignore_missing=False)[source]¶

Extract path to reference from bam file

Parameters:	bam (str) – path to bam file ignore_missing (bool) – Check if the file exists, if not return None
Returns:	path to bam file (if exists or ignore_missing is supplied) or None
Return type:	path

singlecellmultiomics.bamProcessing.bamFunctions.get_sample_to_read_group_dict(bam)[source]¶: Obtain a dictionary containing {‘sample name’ : [‘read groupA’, ‘read group B’], …} :param bam_file: :type bam_file: pysam.AlignmentFile

singlecellmultiomics.bamProcessing.bamFunctions.get_samples_from_bam(bam)[source]¶

Get a list of samples present in the bam_file

Parameters:	bam_file (str) – path to bam file or pysam object
Returns:	set containing all sample names
Return type:	samples (set)

singlecellmultiomics.bamProcessing.bamFunctions.mate_iter(alignments, **kwargs)[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.merge_bams(bams: list, output_path: str, threads: int = 4)[source]¶

Merge bamfiles to output_path

When a single bam file is supplied, the bam file is moved to output_path All input bam files are removed

Parameters:	bams – list or tuple containing paths to bam files to merge output_path (str) – target path
Returns:	output_path (str)

singlecellmultiomics.bamProcessing.bamFunctions.random_sample_bam(bam, n, **sample_location_args)[source]¶

Sample a bam file at random locations

Parameters:	bam (pysam.AlignmentFile) – bam to sample from n (int) – Amount of locations to sample *sample_location_args – arguments to pass to sample_location
Returns:	dictionary with amount of reads at the selected locations
Return type:	samples (dict)

singlecellmultiomics.bamProcessing.bamFunctions.replace_bam_header(origin_bam_path, header, target_bam_path=None, header_write_mode='auto')[source]¶

singlecellmultiomics.bamProcessing.bamFunctions.sam_to_bam(sam_in, bam_out, threads=4)[source]¶

Convert sam file to sorted bam file

Parameters:	sam_in (str) – input sam file path bam_out (str) – output bam file path

singlecellmultiomics.bamProcessing.bamFunctions.sample_location(handle, contig, pos, dedup=True, qc=True)[source]¶

Obtain dictionary containing the coverage for every sample

Parameters:	handle (pysam.AlignmentFile) – File to obtain reads from contig (str) – contig to sample pos (int) – coordinate to sample (zero based) dedup (bool) – ignore duplicated reads qc (bool) – ignore qc failed reads
Returns:	dictionary with amount of reads at the selected locations
Return type:	samples (dict)

singlecellmultiomics.bamProcessing.bamFunctions.sort_and_index(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True, fast_compression=False, prefix='TMP')[source]¶

Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool

Raises:	SamtoolsError when sorting or indexing fails

singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True, input_is_sorted=False, mode='wb', fast_compression=False, temp_prefix='SCMO', **kwargs)[source]¶

Get writing handle to a sorted bam file

Parameters:

write_path (str) – write to a bam file at this path
origin_bam (pysam.AlignmentFile) – bam file to copy header for or
header (dict) – header for the bam file to write
read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
local_temp_sort (bool) – create temporary files in current directory
input_is_sorted (bool) – Assume the input is sorted, no sorting will be applied
mode (str) – Output mode, use wbu for uncompressed writing.
**kwargs – arguments to pass to the new pysam.AlignmentFile output handle

Example

>>> # This example assumes molecules are generated by `molecule_iterator`
>>> read_groups = set() # Store unique read groups in this set
>>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out:
>>>     for molecule in molecule_iterator:
>>>         molecule.write_tags()
>>>         molecule.write_pysam(out)
>>>         for fragment in molecule:
>>>             read_groups.add(fragment.get_read_group())
test_output.bam will be written, with read groups defined, sorted and indexed.

Write some pysam reads to a sorted bam file .. rubric:: Example

>>> import pysam
>>> from singlecellmultiomics.bamProcessing import sorted_bam_file
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.reference_name = 'chr2'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'TTGCA'
>>> read_A.query_name= 'READ_A'
>>> read_A.cigarstring = '5M'
>>> read_A.query_qualities = [30] * len(read_A.query_sequence)
>>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1')
>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.reference_name = 'chr1'
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGGG'
>>> read_B.cigarstring = '6M'
>>> read_B.query_name= 'READ_B'
>>> read_B.query_qualities = [30] * len(read_B.query_sequence)
>>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2')
>>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1'))
>>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out:
>>>     out.write(read_A)
>>>     out.write(read_B)
Results in the bam file:
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:1000
@SQ     SN:chr2 LN:1000
@RG     ID:HVKCCBGXB.4.MYLIBRARY_2      SM:MYLIBRARY_2  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_2      PL:ILLUMINA
@RG     ID:HVKCCBGXB.4.MYLIBRARY_1      SM:MYLIBRARY_1  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_1      PL:ILLUMINA
READ_B  0       chr1    101     0       6M      *       0       0       ATCGGG  ??????  RG:Z:HVKCCBGXB.4.MYLIBRARY_2
READ_A  0       chr2    101     0       5M      *       0       0       TTGCA   ?????   RG:Z:HVKCCBGXB.4.MYLIBRARY_1

singlecellmultiomics.bamProcessing.bamFunctions.verify_and_fix_bam(bam_path)[source]¶

Check if the bam file is not truncated and indexed. Also regenerates index when its older than the bam file If not, apply index

Parameters:	bam_path (str) – path to bam file
Raises:	`ValueError` – when the file is corrupted and a fix could not be applied

singlecellmultiomics.bamProcessing.bamFunctions.write_program_tag(input_header, program_name, command_line, description, version)[source]¶: Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str

singlecellmultiomics.bamProcessing.bamMappingRate module¶

singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats.nlaIII_molecule_acceptance_function(molecule)[source]¶

singlecellmultiomics.bamProcessing.bamTabulator module¶

singlecellmultiomics.bamProcessing.bamToCountTable module¶

singlecellmultiomics.bamProcessing.bamToCountTable.assignReads(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[], blacklist_dic=None)[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_bins(point, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:	point (int) – coordinate to look up bin_size (int) – bin size sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:	list
Return type:	[(bin_start,bin_end), . ]

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:

point (int) – coordinate to look up
bin_size (int) – bin size
sliding_increment (int) – sliding window offset, this is the increment between bins

Returns:

start (int) – the start coordinate of the first overlapping bin
end (int) – the end of the last overlapping bin
start_id (int) – the index of the first overlapping bin
end_id (int) – the index of the last overlapping bin

singlecellmultiomics.bamProcessing.bamToCountTable.create_count_table(args, return_df=False)[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.readTag(read, tag, defective='None')[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.read_has_alternative_hits_to_non_alts(read)[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.read_should_be_counted(read, args, blacklist_dic=None)[source]¶

Check if a read should be counted given the filter arguments

Parameters:	read (pysam.AlignedSegment or None) – read to check if it should be counted
Returns:
Return type:	bool

singlecellmultiomics.bamProcessing.bamToCountTable.tagToHumanName(tag, TagDefinitions)[source]¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber.obtain_approximate_reference_cut_position(site, contig, alt_spans)[source]¶

singlecellmultiomics.bamProcessing.bamToRNACounts module¶

singlecellmultiomics.bamProcessing.structureTensor module¶

singlecellmultiomics.bamProcessing.variantStats module¶

singlecellmultiomics.bamProcessing.variantStats.obtain_variant_statistics(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]¶

Obtain statistics from known gsnv-phased variant location

Parameters:

alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
cell_call_data (collections.defaultdict(dict)) –
haplotype_scores (dict) –
reference (pysamiterators.CachedFasta) –
chromosome (str) –
ssnv_position (int) – zero based ssnv position
gsnv_position (int) – zero based gsnv position
WINDOW_RADIUS (int) –
out (pysam.AlignmentFile) –
min_read_obs (int) –
read_groups (set) –
umi_hamming_distance (int) –
args –

singlecellmultiomics.bamProcessing package¶

Submodules¶

singlecellmultiomics.bamProcessing.alignment_view module¶

singlecellmultiomics.bamProcessing.bamDuprate module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶

singlecellmultiomics.bamProcessing.bamFilter module¶

singlecellmultiomics.bamProcessing.bamFunctions module¶

singlecellmultiomics.bamProcessing.bamMappingRate module¶

singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats module¶

singlecellmultiomics.bamProcessing.bamTabulator module¶

singlecellmultiomics.bamProcessing.bamToCountTable module¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶

singlecellmultiomics.bamProcessing.bamToRNACounts module¶

singlecellmultiomics.bamProcessing.structureTensor module¶

singlecellmultiomics.bamProcessing.variantStats module¶

Module contents¶