singlecellmultiomics.bamProcessing package¶
Submodules¶
singlecellmultiomics.bamProcessing.alignment_view module¶
singlecellmultiomics.bamProcessing.bamDuprate module¶
singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶
-
singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.get_random_primer_histogram(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]¶ This method counts the frequencies of the random primers which present in the supplied molecule_source.
Parameters: - molecule_source – iterable yielding Molecules
- min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
- max_size (int) – maximum fragment size
- size_bin_size (int) – bin size in basepairs of the histogram
- head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶
-
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.bam_to_histogram(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]¶
singlecellmultiomics.bamProcessing.bamFilter module¶
singlecellmultiomics.bamProcessing.bamFunctions module¶
-
singlecellmultiomics.bamProcessing.bamFunctions.GATK_indel_realign(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]¶ Re-align a specified region in a bam file using GenomeAnalysisTK
origin_bam (str) : path to extract region from to re-align
target_bam(str) : path to write re-aligned reads to
contig (str) : contig of selected region
region_start (int) : start coordinate of region to re align (1 based)
region_end (int) :end coordiante of selected region (1 based)
known_variants_vcf_path (str) : path to vcf containing reference variants
interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator
interval_write_path (str) : when interval_path is not supplied, write the interval file here
java_cmd (str) : Command to open java
gatk_path (str) : path to GenomeAnalysisTK.jar
reference (str) : path to reference Fasta file
-
class
singlecellmultiomics.bamProcessing.bamFunctions.MapabilityReader(mapability_safe_file_path)[source]¶ Bases:
object-
site_is_mapable(contig, ds, strand)[source]¶ Obtain if a restriction site is mapable or not :param contig: contig of site to look up :type contig: str :param ds: zero based coordinate of site to look up :type ds: int :param strand: strand of site to look up (False: FWD, True: REV) :type strand: bool
Returns: True when the site is uniquely mapable, False otherwise Return type: site_is_mapable (bool)
-
-
singlecellmultiomics.bamProcessing.bamFunctions.add_readgroups_to_header(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]¶ Add the readgroups in the set readgroups to the header of origin_bam_path.
This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.
Parameters: - origin_bam_path (str) – path to bam file to which to add readgroups to header
- readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
- target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file
-
singlecellmultiomics.bamProcessing.bamFunctions.get_reference_from_pysam_alignmentFile(pysam_AlignmentFile, ignore_missing=False)[source]¶ Extract path to reference from pysam handle
Parameters: - pysam_AlignmentFile (pysam.AlignmentFile) –
- ignore_missing (bool) – Check if the file exists, if not return None
Returns: path to bam file (if exists or ignore_missing is supplied) or None
Return type: path
-
singlecellmultiomics.bamProcessing.bamFunctions.get_reference_path_from_bam(bam, ignore_missing=False)[source]¶ Extract path to reference from bam file
Parameters: - bam (str) – path to bam file
- ignore_missing (bool) – Check if the file exists, if not return None
Returns: path to bam file (if exists or ignore_missing is supplied) or None
Return type: path
-
singlecellmultiomics.bamProcessing.bamFunctions.sort_and_index(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True)[source]¶ Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool
Raises: SamtoolsError when sorting or indexing fails
-
singlecellmultiomics.bamProcessing.bamFunctions.sorted_bam_file(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True)[source]¶ Get writing handle to a sorted bam file
Parameters: - write_path (str) – write to a bam file at this path
- origin_bam (pysam.AlignmentFile) – bam file to copy header for or
- header (dict) – header for the bam file to write
- read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
- local_temp_sort (bool) – create temporary files in current directory
Example
>>> # This example assumes molecules are generated by `molecule_iterator` >>> read_groups = set() # Store unique read groups in this set >>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out: >>> for molecule in molecule_iterator: >>> molecule.write_tags() >>> molecule.write_pysam(out) >>> for fragment in molecule: >>> read_groups.add(fragment.get_read_group()) test_output.bam will be written, with read groups defined, sorted and indexed.
Write some pysam reads to a sorted bam file .. rubric:: Example
>>> import pysam >>> from singlecellmultiomics.bamProcessing import sorted_bam_file >>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000]) >>> read_A = pysam.AlignedSegment(test_sam.header) >>> read_A.reference_name = 'chr2' >>> read_A.reference_start = 100 >>> read_A.query_sequence = 'TTGCA' >>> read_A.query_name= 'READ_A' >>> read_A.cigarstring = '5M' >>> read_A.query_qualities = [30] * len(read_A.query_sequence) >>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1') >>> read_B = pysam.AlignedSegment(test_sam.header) >>> read_B.reference_name = 'chr1' >>> read_B.reference_start = 100 >>> read_B.query_sequence = 'ATCGGG' >>> read_B.cigarstring = '6M' >>> read_B.query_name= 'READ_B' >>> read_B.query_qualities = [30] * len(read_B.query_sequence) >>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2') >>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1')) >>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out: >>> out.write(read_A) >>> out.write(read_B) Results in the bam file: @HD VN:1.6 SO:coordinate @SQ SN:chr1 LN:1000 @SQ SN:chr2 LN:1000 @RG ID:HVKCCBGXB.4.MYLIBRARY_2 SM:MYLIBRARY_2 LB:MYLIBRARY PU:HVKCCBGXB.4.MYLIBRARY_2 PL:ILLUMINA @RG ID:HVKCCBGXB.4.MYLIBRARY_1 SM:MYLIBRARY_1 LB:MYLIBRARY PU:HVKCCBGXB.4.MYLIBRARY_1 PL:ILLUMINA READ_B 0 chr1 101 0 6M * 0 0 ATCGGG ?????? RG:Z:HVKCCBGXB.4.MYLIBRARY_2 READ_A 0 chr2 101 0 5M * 0 0 TTGCA ????? RG:Z:HVKCCBGXB.4.MYLIBRARY_1
-
singlecellmultiomics.bamProcessing.bamFunctions.verify_and_fix_bam(bam_path)[source]¶ Check if the bam file is not truncated and indexed. If not, apply index
Parameters: bam_path (str) – path to bam file Raises: ValueError– when the file is corrupted and a fix could not be applied
-
singlecellmultiomics.bamProcessing.bamFunctions.write_program_tag(input_header, program_name, command_line, description, version)[source]¶ Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str
singlecellmultiomics.bamProcessing.bamMappingRate module¶
singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶
singlecellmultiomics.bamProcessing.bamPlotRTstats module¶
singlecellmultiomics.bamProcessing.bamTabulator module¶
singlecellmultiomics.bamProcessing.bamToCountTable module¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.assignReads(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[])[source]¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_bins(point, bin_size, sliding_increment)[source]¶ Convert a single value to a list of overlapping bins
Parameters: - point (int) – coordinate to look up
- bin_size (int) – bin size
- sliding_increment (int) – sliding window offset, this is the increment between bins
Returns: list
Return type: [(bin_start,bin_end), . ]
-
singlecellmultiomics.bamProcessing.bamToCountTable.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]¶ Convert a single value to a list of overlapping bins
Parameters: - point (int) – coordinate to look up
- bin_size (int) – bin size
- sliding_increment (int) – sliding window offset, this is the increment between bins
Returns: - start (int) – the start coordinate of the first overlapping bin
- end (int) – the end of the last overlapping bin
- start_id (int) – the index of the first overlapping bin
- end_id (int) – the index of the last overlapping bin
-
singlecellmultiomics.bamProcessing.bamToCountTable.create_count_table(args, return_df=False)[source]¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.read_has_alternative_hits_to_non_alts(read)[source]¶
singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶
singlecellmultiomics.bamProcessing.bamToRNACounts module¶
singlecellmultiomics.bamProcessing.structureTensor module¶
singlecellmultiomics.bamProcessing.variantStats module¶
-
singlecellmultiomics.bamProcessing.variantStats.obtain_variant_statistics(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]¶ Obtain statistics from known gsnv-phased variant location
Parameters: - alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
- ( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
- ( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
- cell_call_data (collections.defaultdict(dict)) –
- haplotype_scores (dict) –
- reference (pysamiterators.CachedFasta) –
- chromosome (str) –
- ssnv_position (int) – zero based ssnv position
- gsnv_position (int) – zero based gsnv position
- WINDOW_RADIUS (int) –
- out (pysam.AlignmentFile) –
- min_read_obs (int) –
- read_groups (set) –
- umi_hamming_distance (int) –
- args –