singlecellmultiomics.bamProcessing package

Submodules

singlecellmultiomics.bamProcessing.alignment_view module

singlecellmultiomics.bamProcessing.alignment_view.cstr(s, color='black', weight=300)[source]
singlecellmultiomics.bamProcessing.alignment_view.visualise_molecule(molecule, reference=None, show_reference=False, margin_bases=0, start_span=None, end_span=None, show_quals=True, R1PrimerLength=4, R2PrimerLength=6, highlight={})[source]

singlecellmultiomics.bamProcessing.bamDuprate module

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module

singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.get_random_primer_histogram(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]

This method counts the frequencies of the random primers which present in the supplied molecule_source.

Parameters:
  • molecule_source – iterable yielding Molecules
  • min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
  • max_size (int) – maximum fragment size
  • size_bin_size (int) – bin size in basepairs of the histogram
  • head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module

singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.bam_to_histogram(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_feature_start(chromosome, lookup_coordinate, feature_container, lookup_strand=None)[source]
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.distance_to_hit(lookup_coordinate, hit_start, hit_end, hit_strand, lookup_strand=None, distance_from_begin=True)[source]

singlecellmultiomics.bamProcessing.bamFilter module

singlecellmultiomics.bamProcessing.bamFunctions module

singlecellmultiomics.bamProcessing.bamFunctions.GATK_indel_realign(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]

Re-align a specified region in a bam file using GenomeAnalysisTK

origin_bam (str) : path to extract region from to re-align

target_bam(str) : path to write re-aligned reads to

contig (str) : contig of selected region

region_start (int) : start coordinate of region to re align (1 based)

region_end (int) :end coordiante of selected region (1 based)

known_variants_vcf_path (str) : path to vcf containing reference variants

interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator

interval_write_path (str) : when interval_path is not supplied, write the interval file here

java_cmd (str) : Command to open java

gatk_path (str) : path to GenomeAnalysisTK.jar

reference (str) : path to reference Fasta file

class singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader(mapability_safe_file_path)[source]

Bases: object

site_is_mapable(contig, ds, strand)[source]

Obtain if a restriction site is mapable or not :param contig: contig of site to look up :type contig: str :param ds: zero based coordinate of site to look up :type ds: int :param strand: strand of site to look up (False: FWD, True: REV) :type strand: bool

Returns:True when the site is uniquely mapable, False otherwise
Return type:site_is_mapable (bool)
singlecellmultiomics.bamProcessing.bamFunctions.add_readgroups_to_header(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]

Add the readgroups in the set readgroups to the header of origin_bam_path.

This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.

Parameters:
  • origin_bam_path (str) – path to bam file to which to add readgroups to header
  • readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
  • target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file
singlecellmultiomics.bamProcessing.bamFunctions.get_reference_from_pysam_alignmentFile(pysam_AlignmentFile, ignore_missing=False)[source]

Extract path to reference from pysam handle

Parameters:
  • pysam_AlignmentFile (pysam.AlignmentFile) –
  • ignore_missing (bool) – Check if the file exists, if not return None
Returns:

path to bam file (if exists or ignore_missing is supplied) or None

Return type:

path

singlecellmultiomics.bamProcessing.bamFunctions.get_reference_path_from_bam(bam, ignore_missing=False)[source]

Extract path to reference from bam file

Parameters:
  • bam (str) – path to bam file
  • ignore_missing (bool) – Check if the file exists, if not return None
Returns:

path to bam file (if exists or ignore_missing is supplied) or None

Return type:

path

singlecellmultiomics.bamProcessing.bamFunctions.sort_and_index(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True)[source]

Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool

Raises:SamtoolsError when sorting or indexing fails
singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True)[source]

Get writing handle to a sorted bam file

Parameters:
  • write_path (str) – write to a bam file at this path
  • origin_bam (pysam.AlignmentFile) – bam file to copy header for or
  • header (dict) – header for the bam file to write
  • read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
  • local_temp_sort (bool) – create temporary files in current directory

Example

>>> # This example assumes molecules are generated by `molecule_iterator`
>>> read_groups = set() # Store unique read groups in this set
>>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out:
>>>     for molecule in molecule_iterator:
>>>         molecule.write_tags()
>>>         molecule.write_pysam(out)
>>>         for fragment in molecule:
>>>             read_groups.add(fragment.get_read_group())
test_output.bam will be written, with read groups defined, sorted and indexed.

Write some pysam reads to a sorted bam file .. rubric:: Example

>>> import pysam
>>> from singlecellmultiomics.bamProcessing import sorted_bam_file
>>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000])
>>> read_A = pysam.AlignedSegment(test_sam.header)
>>> read_A.reference_name = 'chr2'
>>> read_A.reference_start = 100
>>> read_A.query_sequence = 'TTGCA'
>>> read_A.query_name= 'READ_A'
>>> read_A.cigarstring = '5M'
>>> read_A.query_qualities = [30] * len(read_A.query_sequence)
>>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1')
>>> read_B = pysam.AlignedSegment(test_sam.header)
>>> read_B.reference_name = 'chr1'
>>> read_B.reference_start = 100
>>> read_B.query_sequence = 'ATCGGG'
>>> read_B.cigarstring = '6M'
>>> read_B.query_name= 'READ_B'
>>> read_B.query_qualities = [30] * len(read_B.query_sequence)
>>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2')
>>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1'))
>>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out:
>>>     out.write(read_A)
>>>     out.write(read_B)
Results in the bam file:
@HD     VN:1.6  SO:coordinate
@SQ     SN:chr1 LN:1000
@SQ     SN:chr2 LN:1000
@RG     ID:HVKCCBGXB.4.MYLIBRARY_2      SM:MYLIBRARY_2  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_2      PL:ILLUMINA
@RG     ID:HVKCCBGXB.4.MYLIBRARY_1      SM:MYLIBRARY_1  LB:MYLIBRARY    PU:HVKCCBGXB.4.MYLIBRARY_1      PL:ILLUMINA
READ_B  0       chr1    101     0       6M      *       0       0       ATCGGG  ??????  RG:Z:HVKCCBGXB.4.MYLIBRARY_2
READ_A  0       chr2    101     0       5M      *       0       0       TTGCA   ?????   RG:Z:HVKCCBGXB.4.MYLIBRARY_1
singlecellmultiomics.bamProcessing.bamFunctions.verify_and_fix_bam(bam_path)[source]

Check if the bam file is not truncated and indexed. If not, apply index

Parameters:bam_path (str) – path to bam file
Raises:ValueError – when the file is corrupted and a fix could not be applied
singlecellmultiomics.bamProcessing.bamFunctions.write_program_tag(input_header, program_name, command_line, description, version)[source]

Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str

singlecellmultiomics.bamProcessing.bamMappingRate module

singlecellmultiomics.bamProcessing.bamPlateVisualisation module

singlecellmultiomics.bamProcessing.bamPlotRTstats module

singlecellmultiomics.bamProcessing.bamPlotRTstats.nlaIII_molecule_acceptance_function(molecule)[source]

singlecellmultiomics.bamProcessing.bamTabulator module

singlecellmultiomics.bamProcessing.bamToCountTable module

singlecellmultiomics.bamProcessing.bamToCountTable.assignReads(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[])[source]
singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_bins(point, bin_size, sliding_increment)[source]

Convert a single value to a list of overlapping bins

Parameters:
  • point (int) – coordinate to look up
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:

list

Return type:

[(bin_start,bin_end), . ]

singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]

Convert a single value to a list of overlapping bins

Parameters:
  • point (int) – coordinate to look up
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:

  • start (int) – the start coordinate of the first overlapping bin
  • end (int) – the end of the last overlapping bin
  • start_id (int) – the index of the first overlapping bin
  • end_id (int) – the index of the last overlapping bin

singlecellmultiomics.bamProcessing.bamToCountTable.create_count_table(args, return_df=False)[source]
singlecellmultiomics.bamProcessing.bamToCountTable.readTag(read, tag, defective='None')[source]
singlecellmultiomics.bamProcessing.bamToCountTable.read_has_alternative_hits_to_non_alts(read)[source]
singlecellmultiomics.bamProcessing.bamToCountTable.read_should_be_counted(read, args)[source]

Check if a read should be counted given the filter arguments

Parameters:read (pysam.AlignedSegment or None) – read to check if it should be counted
Returns:
Return type:bool
singlecellmultiomics.bamProcessing.bamToCountTable.tagToHumanName(tag, TagDefinitions)[source]

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module

singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber.obtain_approximate_reference_cut_position(site, contig, alt_spans)[source]

singlecellmultiomics.bamProcessing.bamToRNACounts module

singlecellmultiomics.bamProcessing.structureTensor module

singlecellmultiomics.bamProcessing.variantStats module

singlecellmultiomics.bamProcessing.variantStats.obtain_variant_statistics(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]

Obtain statistics from known gsnv-phased variant location

Parameters:
  • alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
  • ( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
  • ( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
  • cell_call_data (collections.defaultdict(dict)) –
  • haplotype_scores (dict) –
  • reference (pysamiterators.CachedFasta) –
  • chromosome (str) –
  • ssnv_position (int) – zero based ssnv position
  • gsnv_position (int) – zero based gsnv position
  • WINDOW_RADIUS (int) –
  • out (pysam.AlignmentFile) –
  • min_read_obs (int) –
  • read_groups (set) –
  • umi_hamming_distance (int) –
  • args

Module contents