singlecellmultiomics.bamProcessing package¶

Submodules¶

singlecellmultiomics.bamProcessing.alignment_view module¶

singlecellmultiomics.bamProcessing.alignment_view.cstr(s, color='black', weight=300)[source]¶

singlecellmultiomics.bamProcessing.alignment_view.visualise_molecule(molecule, reference=None, show_reference=False, margin_bases=0, start_span=None, end_span=None, show_quals=True, R1PrimerLength=4, R2PrimerLength=6, highlight={})[source]¶

singlecellmultiomics.bamProcessing.bamDuprate module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.get_random_primer_histogram(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]¶

This method counts the frequencies of the random primers which present in the supplied molecule_source.

Parameters:

molecule_source – iterable yielding Molecules
min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
max_size (int) – maximum fragment size
size_bin_size (int) – bin size in basepairs of the histogram
head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.bam_to_histogram(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_feature_start(chromosome, lookup_coordinate, feature_container, lookup_strand=None)[source]¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_hit(lookup_coordinate, hit_start, hit_end, hit_strand, lookup_strand=None, distance_from_begin=True)[source]¶

singlecellmultiomics.bamProcessing.bamFilter module¶

singlecellmultiomics.bamProcessing.bamFunctions module¶

singlecellmultiomics.bamProcessing.bamFunctions.GATK_indel_realign(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]¶

Re-align a specified region in a bam file using GenomeAnalysisTK

origin_bam (str) : path to extract region from to re-align

target_bam(str) : path to write re-aligned reads to

contig (str) : contig of selected region

region_start (int) : start coordinate of region to re align (1 based)

region_end (int) :end coordiante of selected region (1 based)

known_variants_vcf_path (str) : path to vcf containing reference variants

interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator

interval_write_path (str) : when interval_path is not supplied, write the interval file here

java_cmd (str) : Command to open java

gatk_path (str) : path to GenomeAnalysisTK.jar

reference (str) : path to reference Fasta file

class singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader(mapability_safe_file_path)[source]¶

Bases: object

site_is_mapable(contig, ds, strand)[source]¶

Obtain if a restriction site is mapable or not :param contig: contig of site to look up :type contig: str :param ds: zero based coordinate of site to look up :type ds: int :param strand: strand of site to look up (False: FWD, True: REV) :type strand: bool

Returns:	True when the site is uniquely mapable, False otherwise
Return type:	site_is_mapable (bool)

singlecellmultiomics.bamProcessing.bamFunctions.add_readgroups_to_header(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]¶

Add the readgroups in the set readgroups to the header of origin_bam_path.

This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.

Parameters:

origin_bam_path (str) – path to bam file to which to add readgroups to header
readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_from_pysam_alignmentFile(pysam_AlignmentFile, ignore_missing=False)[source]¶

Extract path to reference from pysam handle

Parameters:	pysam_AlignmentFile (pysam.AlignmentFile) – ignore_missing (bool) – Check if the file exists, if not return None
Returns:	path to bam file (if exists or ignore_missing is supplied) or None
Return type:	path

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_path_from_bam(bam, ignore_missing=False)[source]¶

Extract path to reference from bam file

Parameters:	bam (str) – path to bam file ignore_missing (bool) – Check if the file exists, if not return None
Returns:	path to bam file (if exists or ignore_missing is supplied) or None
Return type:	path

singlecellmultiomics.bamProcessing.bamFunctions.sort_and_index(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True)[source]¶

Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool

Raises:	SamtoolsError when sorting or indexing fails

singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True)[source]¶

Get writing handle to a sorted bam file

Parameters:

write_path (str) – write to a bam file at this path
origin_bam (pysam.AlignmentFile) – bam file to copy header for or
header (dict) – header for the bam file to write
read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
local_temp_sort (bool) – create temporary files in current directory

Example

>>> # This example assumes molecules are generated by `molecule_iterator`
>>> read_groups = set() # Store unique read groups in this set
>>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out:
>>>     for molecule in molecule_iterator:
>>>         molecule.write_tags()
>>>         molecule.write_pysam(out)
>>>         for fragment in molecule:
>>>             read_groups.add(fragment.get_read_group())
test_output.bam will be written, with read groups defined, sorted and indexed.

Write some pysam reads to a sorted bam file .. rubric:: Example

>>> import pysam
>>> from singlecellmultiomics.bamProcessing import sorted_bam_file
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.reference_name = 'chr2'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'TTGCA'
>>> read_A.query_name= 'READ_A'
>>> read_A.cigarstring = '5M'
>>> read_A.query_qualities = [30] * len(read_A.query_sequence)
>>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1')
>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.reference_name = 'chr1'
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGGG'
>>> read_B.cigarstring = '6M'
>>> read_B.query_name= 'READ_B'
>>> read_B.query_qualities = [30] * len(read_B.query_sequence)
>>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2')
>>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1'))
>>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out:
>>>     out.write(read_A)
>>>     out.write(read_B)
Results in the bam file:
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:1000
@SQ     SN:chr2 LN:1000
@RG     ID:HVKCCBGXB.4.MYLIBRARY_2      SM:MYLIBRARY_2  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_2      PL:ILLUMINA
@RG     ID:HVKCCBGXB.4.MYLIBRARY_1      SM:MYLIBRARY_1  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_1      PL:ILLUMINA
READ_B  0       chr1    101     0       6M      *       0       0       ATCGGG  ??????  RG:Z:HVKCCBGXB.4.MYLIBRARY_2
READ_A  0       chr2    101     0       5M      *       0       0       TTGCA   ?????   RG:Z:HVKCCBGXB.4.MYLIBRARY_1

singlecellmultiomics.bamProcessing.bamFunctions.verify_and_fix_bam(bam_path)[source]¶

Check if the bam file is not truncated and indexed. If not, apply index

Parameters:	bam_path (str) – path to bam file
Raises:	`ValueError` – when the file is corrupted and a fix could not be applied

singlecellmultiomics.bamProcessing.bamFunctions.write_program_tag(input_header, program_name, command_line, description, version)[source]¶: Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str

singlecellmultiomics.bamProcessing.bamMappingRate module¶

singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats.nlaIII_molecule_acceptance_function(molecule)[source]¶

singlecellmultiomics.bamProcessing.bamTabulator module¶

singlecellmultiomics.bamProcessing.bamToCountTable module¶

singlecellmultiomics.bamProcessing.bamToCountTable.assignReads(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[])[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_bins(point, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:	point (int) – coordinate to look up bin_size (int) – bin size sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:	list
Return type:	[(bin_start,bin_end), . ]

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:

point (int) – coordinate to look up
bin_size (int) – bin size
sliding_increment (int) – sliding window offset, this is the increment between bins

Returns:

start (int) – the start coordinate of the first overlapping bin
end (int) – the end of the last overlapping bin
start_id (int) – the index of the first overlapping bin
end_id (int) – the index of the last overlapping bin

singlecellmultiomics.bamProcessing.bamToCountTable.create_count_table(args, return_df=False)[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.readTag(read, tag, defective='None')[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.read_has_alternative_hits_to_non_alts(read)[source]¶

singlecellmultiomics.bamProcessing.bamToCountTable.read_should_be_counted(read, args)[source]¶

Check if a read should be counted given the filter arguments

Parameters:	read (pysam.AlignedSegment or None) – read to check if it should be counted
Returns:
Return type:	bool

singlecellmultiomics.bamProcessing.bamToCountTable.tagToHumanName(tag, TagDefinitions)[source]¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber.obtain_approximate_reference_cut_position(site, contig, alt_spans)[source]¶

singlecellmultiomics.bamProcessing.bamToRNACounts module¶

singlecellmultiomics.bamProcessing.structureTensor module¶

singlecellmultiomics.bamProcessing.variantStats module¶

singlecellmultiomics.bamProcessing.variantStats.obtain_variant_statistics(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]¶

Obtain statistics from known gsnv-phased variant location

Parameters:

alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
cell_call_data (collections.defaultdict(dict)) –
haplotype_scores (dict) –
reference (pysamiterators.CachedFasta) –
chromosome (str) –
ssnv_position (int) – zero based ssnv position
gsnv_position (int) – zero based gsnv position
WINDOW_RADIUS (int) –
out (pysam.AlignmentFile) –
min_read_obs (int) –
read_groups (set) –
umi_hamming_distance (int) –
args –

singlecellmultiomics.bamProcessing package¶

Submodules¶

singlecellmultiomics.bamProcessing.alignment_view module¶

singlecellmultiomics.bamProcessing.bamDuprate module¶

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶

singlecellmultiomics.bamProcessing.bamFilter module¶

singlecellmultiomics.bamProcessing.bamFunctions module¶

singlecellmultiomics.bamProcessing.bamMappingRate module¶

singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶

singlecellmultiomics.bamProcessing.bamPlotRTstats module¶

singlecellmultiomics.bamProcessing.bamTabulator module¶

singlecellmultiomics.bamProcessing.bamToCountTable module¶

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶

singlecellmultiomics.bamProcessing.bamToRNACounts module¶

singlecellmultiomics.bamProcessing.structureTensor module¶

singlecellmultiomics.bamProcessing.variantStats module¶

Module contents¶