singlecellmultiomics.bamProcessing package¶
Submodules¶
singlecellmultiomics.bamProcessing.alignment_view module¶
singlecellmultiomics.bamProcessing.bamDuprate module¶
singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats module¶
-
singlecellmultiomics.bamProcessing.bamExtractRandomPrimerStats.
get_random_primer_histogram
(molecule_source, min_mq, max_size, size_bin_size, head=None)[source]¶ This method counts the frequencies of the random primers which present in the supplied molecule_source.
Parameters: - molecule_source – iterable yielding Molecules
- min_mq (int) – minimum mean mapping quality of a fragment to be taken into account
- max_size (int) – maximum fragment size
- size_bin_size (int) – bin size in basepairs of the histogram
- head (int) – amount of random sequences to process. When not supplied all random primer sequences in the iterable are counted (expect for the ones with a mapping quality which is too low).
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation module¶
-
singlecellmultiomics.bamProcessing.bamFeatureDensityVisualisation.
bam_to_histogram
(bam_path, add_to, feature_container, site_mode=False, bin_size=25, min_mq=30, max_distance=15000, head=None, quick=True)[source]¶
singlecellmultiomics.bamProcessing.bamFilter module¶
singlecellmultiomics.bamProcessing.bamFunctions module¶
-
singlecellmultiomics.bamProcessing.bamFunctions.
GATK_indel_realign
(origin_bam, target_bam, contig, region_start, region_end, known_variants_vcf_path, gatk_path='GenomeAnalysisTK.jar', interval_path=None, java_cmd='java -jar -Xmx40G -Djava.io.tmpdir=./gatk_tmp', reference=None, interval_write_path=None)[source]¶ Re-align a specified region in a bam file using GenomeAnalysisTK
origin_bam (str) : path to extract region from to re-align
target_bam(str) : path to write re-aligned reads to
contig (str) : contig of selected region
region_start (int) : start coordinate of region to re align (1 based)
region_end (int) :end coordiante of selected region (1 based)
known_variants_vcf_path (str) : path to vcf containing reference variants
interval_path (str) : Use this intervals to perform realignment, when not specified intervals are generated using RealignerTargetCreator
interval_write_path (str) : when interval_path is not supplied, write the interval file here
java_cmd (str) : Command to open java
gatk_path (str) : path to GenomeAnalysisTK.jar
reference (str) : path to reference Fasta file
-
class
singlecellmultiomics.bamProcessing.bamFunctions.
MapabilityReader
(mapability_safe_file_path, read_all=False, dont_open=True)[source]¶ Bases:
singlecellmultiomics.utils.prefetch.Prefetcher
-
singlecellmultiomics.bamProcessing.bamFunctions.
add_blacklisted_region
(input_header, contig=None, start=None, end=None, source='auto_blacklisted')[source]¶
-
singlecellmultiomics.bamProcessing.bamFunctions.
add_readgroups_to_header
(origin_bam_path, readgroups_in, target_bam_path=None, header_write_mode='auto')[source]¶ Add the readgroups in the set readgroups to the header of origin_bam_path.
This function first loads the header of the origin to memory. The supplied readgroups are added to this header. The new header is then exported to a SAM file. The SAM file is then concatenated to the original bam file.
Parameters: - origin_bam_path (str) – path to bam file to which to add readgroups to header
- readgroups_in (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
- target_bam_path (str) – path to write bam file including the readgrouped header to. When not supplied the output is written to the input bam file
-
singlecellmultiomics.bamProcessing.bamFunctions.
bam_is_processed_by_program
(alignments, program='bamtagmultiome')[source]¶ Check if bam file has been processed by the supplied program
This function checks if there is an entry available in the ‘PG’ block in header of the bam file where PN matches the program supplied.
Parameters: - alignments (pysam.AlignmentFile) – handle to bam file
- program (str) – program to look for
Returns: the program was used to process the bam file
Return type: program_present(bool)
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_contig_size
(bam, contig)[source]¶ Extract the length of a contig from a bam file
Parameters: - bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file
- contig (str) –
Returns: length (int)
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_contig_sizes
(bam)[source]¶ Extract lengths of all contigs from a bam file
Parameters: bam (str or pysam.AlignmentFile) – handle to bam file or path to bam file Returns: dict (contig:length (int) ) Return type: contig_lengths
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_contigs
(bam: str) → Generator[T_co, T_contra, V_co][source]¶ Get all contigs listed in the bam file header
Parameters: bam_path (str) – path to bam file or pysam handle Returns: contigs(str) list
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_contigs_with_reads
(bam_path: str, with_length: bool = False) → Generator[T_co, T_contra, V_co][source]¶ Get all contigs with reads mapped to them
Parameters: - bam_path (str) – path to bam file
- with_length (bool) – also yield the length of the contig
Yields: contig(str)
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_index_path
(bam_path: str)[source]¶ Obtain path to bam index
Returns: path to the index file, None if not available Return type: path_to_index(str)
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_r1_counts_per_cell
(bam_path, prefix_with_bam=False)[source]¶ Obtain the amount of unique read1 reads per cell
Parameters: - bam_path – str
- prefix_with_bam (bool) – add bam name as prefix of cell name
Returns: {sampleA:n_molecules, sampleB:m_molecules, …}
Return type: cell_obs (Counter)
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_random_locations
(bam, n)[source]¶ Select random locations in the supplied bam file
bam(str or pysam.AlignmentFile)
n(int) : amount of locations to generate
returns: generator of (contig,position) tuples
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_read_group_format
(bam)[source]¶ Obtain read group format
Parameters: bam (str) – path to bam file Returns: read group format Return type: type (int) Raises: ValueError
– when read group format cannot be determined
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_read_group_from_read
(read, format, with_attr_dict=False)[source]¶
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_read_group_to_sample_dict
(bam)[source]¶ Obtain a dictionary containing {‘read_group’ : ‘sample’ , …} :param bam_file: :type bam_file: pysam.AlignmentFile
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_reference_from_pysam_alignmentFile
(pysam_AlignmentFile, ignore_missing=False)[source]¶ Extract path to reference from pysam handle
Parameters: - pysam_AlignmentFile (pysam.AlignmentFile) –
- ignore_missing (bool) – Check if the file exists, if not return None
Returns: path to bam file (if exists or ignore_missing is supplied) or None
Return type: path
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_reference_path_from_bam
(bam, ignore_missing=False)[source]¶ Extract path to reference from bam file
Parameters: - bam (str) – path to bam file
- ignore_missing (bool) – Check if the file exists, if not return None
Returns: path to bam file (if exists or ignore_missing is supplied) or None
Return type: path
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_sample_to_read_group_dict
(bam)[source]¶ Obtain a dictionary containing {‘sample name’ : [‘read groupA’, ‘read group B’], …} :param bam_file: :type bam_file: pysam.AlignmentFile
-
singlecellmultiomics.bamProcessing.bamFunctions.
get_samples_from_bam
(bam)[source]¶ Get a list of samples present in the bam_file
Parameters: bam_file (str) – path to bam file or pysam object Returns: set containing all sample names Return type: samples (set)
-
singlecellmultiomics.bamProcessing.bamFunctions.
merge_bams
(bams: list, output_path: str, threads: int = 4)[source]¶ Merge bamfiles to output_path
When a single bam file is supplied, the bam file is moved to output_path All input bam files are removed
Parameters: - bams – list or tuple containing paths to bam files to merge
- output_path (str) – target path
Returns: output_path (str)
-
singlecellmultiomics.bamProcessing.bamFunctions.
random_sample_bam
(bam, n, **sample_location_args)[source]¶ Sample a bam file at random locations
Parameters: - bam (pysam.AlignmentFile) – bam to sample from
- n (int) – Amount of locations to sample
- *sample_location_args – arguments to pass to sample_location
Returns: dictionary with amount of reads at the selected locations
Return type: samples (dict)
-
singlecellmultiomics.bamProcessing.bamFunctions.
replace_bam_header
(origin_bam_path, header, target_bam_path=None, header_write_mode='auto')[source]¶
-
singlecellmultiomics.bamProcessing.bamFunctions.
sam_to_bam
(sam_in, bam_out, threads=4)[source]¶ Convert sam file to sorted bam file
Parameters: - sam_in (str) – input sam file path
- bam_out (str) – output bam file path
-
singlecellmultiomics.bamProcessing.bamFunctions.
sample_location
(handle, contig, pos, dedup=True, qc=True)[source]¶ Obtain dictionary containing the coverage for every sample
Parameters: - handle (pysam.AlignmentFile) – File to obtain reads from
- contig (str) – contig to sample
- pos (int) – coordinate to sample (zero based)
- dedup (bool) – ignore duplicated reads
- qc (bool) – ignore qc failed reads
Returns: dictionary with amount of reads at the selected locations
Return type: samples (dict)
-
singlecellmultiomics.bamProcessing.bamFunctions.
sort_and_index
(unsorted_path, sorted_path, remove_unsorted=False, local_temp_sort=True, fast_compression=False, prefix='TMP')[source]¶ Sort and index a bam file :param unsorted_path: path to unsorted bam file :type unsorted_path: str :param sorted_path: write sorted file here :type sorted_path: str :param remove_unsorted: remove the unsorted file :type remove_unsorted: bool :param local_temp_sort: create temporary files in target directory :type local_temp_sort: bool
Raises: SamtoolsError when sorting or indexing fails
-
singlecellmultiomics.bamProcessing.bamFunctions.
sorted_bam_file
(write_path, origin_bam=None, header=None, read_groups=None, local_temp_sort=True, input_is_sorted=False, mode='wb', fast_compression=False, temp_prefix='SCMO', **kwargs)[source]¶ Get writing handle to a sorted bam file
Parameters: - write_path (str) – write to a bam file at this path
- origin_bam (pysam.AlignmentFile) – bam file to copy header for or
- header (dict) – header for the bam file to write
- read_groups (set/dict) – set or dictionary which contains read groups. The dictionary should have the format { read_group_id (str) { ‘ID’: ID, ‘LB’:library, ‘PL’:platform, ‘SM’:sampleLib, ‘PU’:readGroup }
- local_temp_sort (bool) – create temporary files in current directory
- input_is_sorted (bool) – Assume the input is sorted, no sorting will be applied
- mode (str) – Output mode, use wbu for uncompressed writing.
- **kwargs – arguments to pass to the new pysam.AlignmentFile output handle
Example
>>> # This example assumes molecules are generated by `molecule_iterator` >>> read_groups = set() # Store unique read groups in this set >>> with sorted_bam_file('test_output.bam', header=input_header,read_groups=read_groups) as out: >>> for molecule in molecule_iterator: >>> molecule.write_tags() >>> molecule.write_pysam(out) >>> for fragment in molecule: >>> read_groups.add(fragment.get_read_group()) test_output.bam will be written, with read groups defined, sorted and indexed.
Write some pysam reads to a sorted bam file .. rubric:: Example
>>> import pysam >>> from singlecellmultiomics.bamProcessing import sorted_bam_file >>> test_sam = pysam.AlignmentFile('test.sam','w',reference_names=['chr1','chr2'],reference_lengths=[1000,1000]) >>> read_A = pysam.AlignedSegment(test_sam.header) >>> read_A.reference_name = 'chr2' >>> read_A.reference_start = 100 >>> read_A.query_sequence = 'TTGCA' >>> read_A.query_name= 'READ_A' >>> read_A.cigarstring = '5M' >>> read_A.query_qualities = [30] * len(read_A.query_sequence) >>> read_A.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_1') >>> read_B = pysam.AlignedSegment(test_sam.header) >>> read_B.reference_name = 'chr1' >>> read_B.reference_start = 100 >>> read_B.query_sequence = 'ATCGGG' >>> read_B.cigarstring = '6M' >>> read_B.query_name= 'READ_B' >>> read_B.query_qualities = [30] * len(read_B.query_sequence) >>> read_B.set_tag('RG','HVKCCBGXB.4.MYLIBRARY_2') >>> read_groups = set(( 'HVKCCBGXB.4.MYLIBRARY_2','HVKCCBGXB.4.MYLIBRARY_1')) >>> with sorted_bam_file('out.bam', header=test_sam.header,read_groups=read_groups) as out: >>> out.write(read_A) >>> out.write(read_B) Results in the bam file: @HD VN:1.6 SO:coordinate @SQ SN:chr1 LN:1000 @SQ SN:chr2 LN:1000 @RG ID:HVKCCBGXB.4.MYLIBRARY_2 SM:MYLIBRARY_2 LB:MYLIBRARY PU:HVKCCBGXB.4.MYLIBRARY_2 PL:ILLUMINA @RG ID:HVKCCBGXB.4.MYLIBRARY_1 SM:MYLIBRARY_1 LB:MYLIBRARY PU:HVKCCBGXB.4.MYLIBRARY_1 PL:ILLUMINA READ_B 0 chr1 101 0 6M * 0 0 ATCGGG ?????? RG:Z:HVKCCBGXB.4.MYLIBRARY_2 READ_A 0 chr2 101 0 5M * 0 0 TTGCA ????? RG:Z:HVKCCBGXB.4.MYLIBRARY_1
-
singlecellmultiomics.bamProcessing.bamFunctions.
verify_and_fix_bam
(bam_path)[source]¶ Check if the bam file is not truncated and indexed. Also regenerates index when its older than the bam file If not, apply index
Parameters: bam_path (str) – path to bam file Raises: ValueError
– when the file is corrupted and a fix could not be applied
-
singlecellmultiomics.bamProcessing.bamFunctions.
write_program_tag
(input_header, program_name, command_line, description, version)[source]¶ Write Program Tag to bam file header :param input_header: header to write PG tag to :type input_header: dict :param program_name: value to write to PN tag :type program_name: str :param command_line: value to write to CL tag :type command_line: str :param version: value to write to VN tag :type version: str :param description: value to write to DS tag :type description: str
singlecellmultiomics.bamProcessing.bamMappingRate module¶
singlecellmultiomics.bamProcessing.bamPlateVisualisation module¶
singlecellmultiomics.bamProcessing.bamPlotRTstats module¶
singlecellmultiomics.bamProcessing.bamTabulator module¶
singlecellmultiomics.bamProcessing.bamToCountTable module¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.
assignReads
(read, countTable, args, joinFeatures, featureTags, sampleTags, more_args=[], blacklist_dic=None)[source]¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.
coordinate_to_bins
(point, bin_size, sliding_increment)[source]¶ Convert a single value to a list of overlapping bins
Parameters: - point (int) – coordinate to look up
- bin_size (int) – bin size
- sliding_increment (int) – sliding window offset, this is the increment between bins
Returns: list
Return type: [(bin_start,bin_end), . ]
-
singlecellmultiomics.bamProcessing.bamToCountTable.
coordinate_to_sliding_bin_locations
(dp, bin_size, sliding_increment)[source]¶ Convert a single value to a list of overlapping bins
Parameters: - point (int) – coordinate to look up
- bin_size (int) – bin size
- sliding_increment (int) – sliding window offset, this is the increment between bins
Returns: - start (int) – the start coordinate of the first overlapping bin
- end (int) – the end of the last overlapping bin
- start_id (int) – the index of the first overlapping bin
- end_id (int) – the index of the last overlapping bin
-
singlecellmultiomics.bamProcessing.bamToCountTable.
create_count_table
(args, return_df=False)[source]¶
-
singlecellmultiomics.bamProcessing.bamToCountTable.
read_has_alternative_hits_to_non_alts
(read)[source]¶
singlecellmultiomics.bamProcessing.bamToMethylationAndCopyNumber module¶
singlecellmultiomics.bamProcessing.bamToRNACounts module¶
singlecellmultiomics.bamProcessing.structureTensor module¶
singlecellmultiomics.bamProcessing.variantStats module¶
-
singlecellmultiomics.bamProcessing.variantStats.
obtain_variant_statistics
(alignment_file_paths, cell_obs, statistics, cell_call_data, reference, chromosome, ssnv_position, gsnv_position, haplotype_scores, WINDOW_RADIUS, out, min_read_obs, read_groups, umi_hamming_distance, args)[source]¶ Obtain statistics from known gsnv-phased variant location
Parameters: - alignment_file_paths (list) – List of handles to Pysam.Alignment files from which to extract molecules
- ( collections.defaultdict(lambda (statistics) – collections.defaultdict( collections.Counter) ) )
- ( collections.defaultdict(lambda – collections.defaultdict( collections.Counter) ) )
- cell_call_data (collections.defaultdict(dict)) –
- haplotype_scores (dict) –
- reference (pysamiterators.CachedFasta) –
- chromosome (str) –
- ssnv_position (int) – zero based ssnv position
- gsnv_position (int) – zero based gsnv position
- WINDOW_RADIUS (int) –
- out (pysam.AlignmentFile) –
- min_read_obs (int) –
- read_groups (set) –
- umi_hamming_distance (int) –
- args –