singlecellmultiomics.utils package¶

Submodules¶

singlecellmultiomics.utils.binning module¶

singlecellmultiomics.utils.binning.bp_chunked(job_generator, bp_per_job)[source]¶

Chunk an iterator of coordinate-sorted tasks into chunks with a total size of roughly bp_per_job

Parameters:	job_generator – iterable of commands, format (contig, start, end, task)
	bp_per_job (int) – number of bp per chunk of jobs/tasks
Yields:	chunk(list) – [(contig, start, end, task),(contig, start, end, task),..]

@todo: contig is not used; this function expects that only bins on a single contig are supplied
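
The chunking behaviour described above can be approximated as follows. This is a minimal sketch, not the library's implementation: the name `chunk_by_bp` is hypothetical, and it assumes a chunk is closed as soon as its accumulated `end - start` reaches `bp_per_job`. Like the documented function, it does not look at `contig`.

```python
def chunk_by_bp(jobs, bp_per_job):
    """Group coordinate-sorted (contig, start, end, task) tuples into chunks.

    Hypothetical helper: a chunk is emitted once the summed span of its
    tasks reaches bp_per_job, so chunk sizes are only roughly equal.
    """
    chunk, chunk_bp = [], 0
    for contig, start, end, task in jobs:
        chunk.append((contig, start, end, task))
        chunk_bp += end - start
        if chunk_bp >= bp_per_job:
            yield chunk
            chunk, chunk_bp = [], 0
    # Emit the remaining tasks that did not fill a complete chunk
    if chunk:
        yield chunk
```

Because the function is a generator, chunks are produced lazily and the input iterator is consumed only once.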

singlecellmultiomics.utils.binning.coordinate_to_bins(point, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:	point (int) – coordinate to look up
	bin_size (int) – bin size
	sliding_increment (int) – sliding window offset; this is the increment between bins
Returns:	[(bin_start, bin_end), ...]
Return type:	list
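
To illustrate what "overlapping bins" means here, the sketch below lists every sliding bin that contains a coordinate. It is not the library's code: the name `overlapping_bins` is hypothetical, and it assumes that bins start at non-negative multiples of `sliding_increment` and are half-open intervals `[start, start + bin_size)`.

```python
def overlapping_bins(point, bin_size, sliding_increment):
    """Return all (bin_start, bin_end) bins that contain point.

    Assumption: bins start at 0, sliding_increment, 2 * sliding_increment, ...
    and each bin covers the half-open interval [start, start + bin_size).
    """
    # The last bin that can contain the point starts at or just before it
    start = (point // sliding_increment) * sliding_increment
    bins = []
    # Walk backwards while the bin still covers the point
    while start >= 0 and start + bin_size > point:
        bins.append((start, start + bin_size))
        start -= sliding_increment
    return bins[::-1]
```

With `bin_size=100` and `sliding_increment=50`, a coordinate away from the start of the contig falls into two bins under these assumptions.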

singlecellmultiomics.utils.binning.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]¶

Convert a single value to a list of overlapping bins

Parameters:	dp (int) – coordinate to look up
	bin_size (int) – bin size
	sliding_increment (int) – sliding window offset; this is the increment between bins
Returns:	start (int) – the start coordinate of the first overlapping bin
	end (int) – the end of the last overlapping bin
	start_id (int) – the index of the first overlapping bin
	end_id (int) – the index of the last overlapping bin
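
The four return values can be derived arithmetically. The sketch below is an assumption-laden illustration rather than the library's implementation: `sliding_bin_span` is a hypothetical name, bin `i` is assumed to cover the half-open interval `[i * sliding_increment, i * sliding_increment + bin_size)`, and indices are assumed to start at 0.

```python
def sliding_bin_span(point, bin_size, sliding_increment):
    """Return (start, end, start_id, end_id) of the bins overlapping point.

    Assumption: bin i covers [i * sliding_increment,
    i * sliding_increment + bin_size), with i >= 0.
    """
    # Last bin whose start is at or before the point
    end_id = point // sliding_increment
    # First bin whose end lies strictly beyond the point
    start_id = max(0, (point - bin_size) // sliding_increment + 1)
    start = start_id * sliding_increment
    end = end_id * sliding_increment + bin_size
    return start, end, start_id, end_id
```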

singlecellmultiomics.utils.blockzip module¶

class singlecellmultiomics.utils.blockzip.BlockZip(path, mode='r', read_all=False)[source]¶

Bases: object

__getitem__(contig_position_strand)[source]¶

Obtain data at the supplied contig, position and strand

Parameters:	contig_position_strand (tuple) – tuple of (contig (str), position (int), strand (bool))
Returns:	data stored for the genomic location; None when no data is available
Return type:	result (str)

__iter__()[source]¶

Get iterator going over all lines in the file

read_contig_to_cache(contig, region_start=None, region_end=None)[source]¶

read_file_line(line)[source]¶

verify()[source]¶

write(contig, position, strand, data)[source]¶

Write information for location contig/position/strand

!! Write the data per contig; random mixing of contigs will result in a corrupted file

Parameters:	contig (str)
	position (int)
	strand (bool)
	data (str)
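
Since interleaving contigs corrupts the file, calling code may want to check the write order itself. The class below is a hypothetical helper, not part of `BlockZip`: it only tracks which contigs have been seen and raises when a contig reappears after another one was started.

```python
class ContigOrderGuard:
    """Reject a contig that reappears after a different contig was written.

    Hypothetical helper for calling code; it does not touch any file.
    """

    def __init__(self):
        self.finished = set()  # contigs that have been completed
        self.current = None    # contig currently being written

    def check(self, contig):
        if contig == self.current:
            return
        if contig in self.finished:
            raise ValueError(
                f"contig {contig!r} was already written; "
                "writes must be grouped per contig"
            )
        if self.current is not None:
            self.finished.add(self.current)
        self.current = contig
```

Calling `guard.check(contig)` before each `write(contig, position, strand, data)` would surface an ordering mistake immediately instead of producing a corrupted file.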

singlecellmultiomics.utils.html module¶

singlecellmultiomics.utils.html.style_str(s, color='black', weight=300)[source]¶

Style the supplied string with HTML tags

Parameters:	s (str) – string to format
	color (str) – color to show the string in
	weight (int) – how thick the string will be displayed
Returns:	html representation of the string
Return type:	html(string)
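
One plausible way to produce such markup is an inline-styled `<span>`. The sketch below is an assumption about the output format, not the library's implementation: `style_html` is a hypothetical name, and escaping the input with `html.escape` is an addition made here for safety.

```python
import html


def style_html(s, color="black", weight=300):
    """Wrap s in a <span> with inline CSS for color and font weight.

    Hypothetical stand-in for style_str; the exact markup the library
    emits is not documented here.
    """
    return (
        f'<span style="color:{color};font-weight:{weight}">'
        f"{html.escape(s)}</span>"
    )
```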

singlecellmultiomics.utils.iteration module¶

singlecellmultiomics.utils.iteration.find_ranges(iterable)[source]¶

Yield range of consecutive numbers.
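
A common way to find runs of consecutive integers uses `itertools.groupby`. The sketch below is illustrative only: `consecutive_ranges` is a hypothetical name, the input is assumed to be sorted, and yielding `(first, last)` tuples is an assumption about the output shape.

```python
from itertools import groupby


def consecutive_ranges(numbers):
    """Yield (first, last) for each run of consecutive integers.

    Assumes numbers is sorted in ascending order without duplicates.
    """
    # Within a run of consecutive integers, value - index stays constant
    for _, group in groupby(enumerate(numbers), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        yield run[0], run[-1]
```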

singlecellmultiomics.utils.sequtils module¶

class singlecellmultiomics.utils.sequtils.Reference[source]¶

Bases: singlecellmultiomics.utils.prefetch.Prefetcher

This is a picklable wrapper to pass reference handles

instance(arg_update)[source]¶

prefetch(contig, start, end)[source]¶

singlecellmultiomics.utils.sequtils.base_probabilities_to_likelihood(probs: dict)[source]¶

singlecellmultiomics.utils.sequtils.complement(seq)[source]¶

Obtain complement of seq

Returns:	complement (str)

singlecellmultiomics.utils.sequtils.create_MD_tag(reference_seq, query_seq)[source]¶

Create MD tag

Parameters:	reference_seq (str) – reference sequence of alignment
	query_seq (str) – query bases of alignment

Returns:	md description of the alignment
Return type:	md_tag(str)
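
For an alignment without insertions or deletions, an MD string alternates match counts with the reference base at each mismatch. The sketch below shows that case only; `md_tag_no_indels` is a hypothetical name, and how `create_MD_tag` itself treats indels or lowercase input is not stated here.

```python
def md_tag_no_indels(reference_seq, query_seq):
    """Build an MD string for two aligned sequences of equal length.

    Simplified sketch: deletions (the '^' syntax) are not handled.
    """
    if len(reference_seq) != len(query_seq):
        raise ValueError("sequences must have equal length (no indels supported)")
    parts = []
    matches = 0
    for ref_base, query_base in zip(reference_seq.upper(), query_seq.upper()):
        if ref_base == query_base:
            matches += 1
        else:
            # Emit the number of matches so far, then the reference base
            parts.append(f"{matches}{ref_base}")
            matches = 0
    # An MD string always ends with a match count, possibly 0
    parts.append(str(matches))
    return "".join(parts)
```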

singlecellmultiomics.utils.sequtils.create_fasta_dict_file(refpath: str, skip_if_exists=True)[source]¶

Create index dict file for the reference fasta at refpath

Parameters:	refpath – path to fasta file
	skip_if_exists – do not generate the index if it exists
Returns:	path to the dict index file
Return type:	dpath (str)

singlecellmultiomics.utils.sequtils.get_chromosome_number(chrom: str) → int[source]¶

Get chromosome number (index) of the supplied chromosome: '1' -> 1, 'chr1' -> 1. Returns -1 when not available, e.g. 'chrM' -> -1
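
The mapping described above can be sketched in a few lines. The name `chromosome_number` is hypothetical, and stripping a case-insensitive `chr` prefix is an assumption; the library may recognise other naming schemes.

```python
def chromosome_number(chrom):
    """Return the numeric index of a chromosome name, or -1 if it has none.

    Assumption: an optional 'chr' prefix is the only decoration to strip.
    """
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isascii() and name.isdigit():
        return int(name)
    return -1
```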

singlecellmultiomics.utils.sequtils.get_consensus_dictionaries(R1, R2, only_include_refbase=None, dove_safe=False, min_phred_score=None, skip_first_n_cycles_R1=None, skip_last_n_cycles_R1=None, skip_first_n_cycles_R2=None, skip_last_n_cycles_R2=None, dove_R2_distance=0, dove_R1_distance=0)[source]¶

singlecellmultiomics.utils.sequtils.get_context(contig: str, position: int, reference: pysam.libcfaidx.FastaFile, ibase: str = None, k_rad: int = 1)[source]¶

Parameters:	contig – contig of the location to extract context
	position – zero based position
	reference – pysam.FastaFile handle or similar object which supports .fetch()
	ibase – single base to inject into the middle of the context
	k_rad – radius to extract
Returns:	extracted context with length k_rad*2 + 1
Return type:	context(str)
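
The windowing logic can be shown on a plain string. In this sketch `context_from_sequence` is a hypothetical name and the reference is a Python `str`; with a real handle the slice would presumably come from `reference.fetch(contig, start, end)` instead. Raising on windows that run off the sequence is a choice made here, not documented behaviour.

```python
def context_from_sequence(sequence, position, ibase=None, k_rad=1):
    """Return the k_rad*2 + 1 bases centred on a zero-based position.

    Hypothetical stand-in operating on a str instead of a FastaFile handle.
    """
    start, end = position - k_rad, position + k_rad + 1
    if start < 0 or end > len(sequence):
        raise ValueError("context window extends beyond the sequence")
    context = sequence[start:end]
    if ibase is not None:
        # Replace the central base with the injected base
        context = context[:k_rad] + ibase + context[k_rad + 1:]
    return context
```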

singlecellmultiomics.utils.sequtils.get_contig_lengths_from_resource(resource) → dict[source]¶

Extract contig lengths from the supplied resource (Fasta file or Bam/Cram/Sam)

Returns:	lengths (dict)

singlecellmultiomics.utils.sequtils.get_contig_list_from_fasta(fasta_path: str, with_length: bool = False) → list[source]¶

Obtain list of contigs from a fasta file; all alternative contigs are pooled into the string MISC_ALT_CONTIGS_SCMO

Parameters:	fasta_path (str or pysam.FastaFile) – path or handle to fasta file
	with_length (bool) – return list of lengths
Returns:	List of contigs + ['MISC_ALT_CONTIGS_SCMO'] if any alt contig is present in the fasta file
Return type:	contig_list (list)

singlecellmultiomics.utils.sequtils.get_file_type(s: str)[source]¶

Guess the file type of the input string; returns None when the file type cannot be determined

singlecellmultiomics.utils.sequtils.hamming_distance(a, b)[source]¶

singlecellmultiomics.utils.sequtils.invert_strand_f(s)[source]¶

singlecellmultiomics.utils.sequtils.is_autosome(chrom: str) → bool[source]¶

Returns True when the chromosome is an autosomal chromosome, not an alternative allele, mitochondrial or sex chromosome

Parameters:	chrom (str) – chromosome name
Returns:	True when the chromosome is an autosome
Return type:	is_main(bool)

singlecellmultiomics.utils.sequtils.is_main_chromosome(chrom: str, exclude_mt=False) → bool[source]¶

Returns True when the chromosome is a main chromosome, not an alternative locus, scaffold, decoy or spike-in

Parameters:	chrom (str) – chromosome name
Returns:	True when the chromosome is a main chromosome
Return type:	is_main(bool)

singlecellmultiomics.utils.sequtils.likelihood_to_prob(likelihoods)[source]¶

singlecellmultiomics.utils.sequtils.phred_to_prob(phred)[source]¶

Convert a phred score (ASCII) or integer to a numeric probability

Parameters:	phred (str/int) – score to convert

Returns:	probability(float)

singlecellmultiomics.utils.sequtils.phredscores_to_base_call(probs: dict)[source]¶

Perform base calling on an observation dictionary. Returns N when there are multiple options with the same likelihood

Parameters:	probs (dict) – dictionary with confidence scores, e.g. probs = {'A': [0.95, 0.99, 0.9], 'T': [0.1]}
Returns:	base (str) – called base
	phred (float) – probability of the call being correct

singlecellmultiomics.utils.sequtils.pick_best_base_call(*calls) → tuple[source]¶

Pick the best base-call from a list of base calls

Example

>>> pick_best_base_call( ('A',32), ('C',22) )
('A', 32)

>>> pick_best_base_call( ('A',32), ('C',32) )
('N', 0)

Parameters:	calls (generator) – generator/list containing tuples
Returns:	tuple (best_base, best_q) or (‘N’,0) when there is a tie
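
The selection rule can be sketched as a single pass over the calls. `best_call` is a hypothetical name; the sketch follows the Returns description and yields `('N', 0)` on a tie, and it also treats an empty input as a tie, which is an assumption.

```python
def best_call(*calls):
    """Return the (base, quality) call with the highest quality.

    Returns ('N', 0) when two different bases share the top quality,
    or when no calls are supplied (the latter is an assumption).
    """
    best_base, best_q, tie = None, -1, False
    for base, q in calls:
        if q > best_q:
            best_base, best_q, tie = base, q, False
        elif q == best_q and base != best_base:
            tie = True
    if tie or best_base is None:
        return ("N", 0)
    return (best_base, best_q)
```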

singlecellmultiomics.utils.sequtils.prob_to_phred(prob: float)[source]¶

Convert the probability of a base call being correct into a phred score. Values are clipped to stay within the 0 to 60 phred range

Parameters:	prob (float) – probability of base call being correct
Returns:	phred_score (byte)
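
The standard Phred relation is Q = -10 * log10(P_error). The sketch below applies it in both directions under stated assumptions: the function names are hypothetical, ASCII scores are assumed to use the Phred+33 offset, rounding to the nearest integer is a choice made here, and the reverse conversion is assumed to return the probability that the call is correct.

```python
import math


def prob_correct_to_phred(prob, max_phred=60):
    """Convert P(call is correct) to an integer Phred score in [0, max_phred]."""
    error = 1.0 - prob
    if error <= 0:
        # A certain call would give an infinite score; clip it
        return max_phred
    phred = -10.0 * math.log10(error)
    return int(min(max(round(phred), 0), max_phred))


def phred_to_prob_correct(phred):
    """Convert a Phred score (int, or a Phred+33 ASCII character) to P(correct)."""
    if isinstance(phred, str):
        phred = ord(phred) - 33  # assumption: Phred+33 encoding
    return 1.0 - 10 ** (-phred / 10.0)
```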

singlecellmultiomics.utils.sequtils.read_to_consensus_dict(read, start: int = None, end: int = None, only_include_refbase: str = None, skip_first_n_cycles: int = None, skip_last_n_cycles: int = None, min_phred_score: int = None)[source]¶

Obtain consensus calls for read, between start and end

singlecellmultiomics.utils.sequtils.reverse_complement(seq)[source]¶

Obtain reverse complement of seq

Returns:	reverse complement (str)
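
Complementing and reverse-complementing a DNA string is commonly done with a translation table. The sketch below uses hypothetical names and handles only A, C, G, T and N in either case; how the library treats other IUPAC codes is not stated here.

```python
# Translation table for the bases handled by this sketch
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement_seq(seq):
    """Return the complement of seq, preserving case and order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement_seq(seq):
    """Return the complement of seq in reversed order."""
    return complement_seq(seq)[::-1]
```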

singlecellmultiomics.utils.sequtils.split_nth(seq, separator, n)[source]¶

Split sequence at the n-th occurrence of separator

Parameters:	seq (str) – sequence to split
	separator (str) – separator to split on
	n (int) – split at the n-th occurrence
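
A split at the n-th separator can be built from `str.split` and `str.join`. In this sketch `split_at_nth` is a hypothetical name; returning a `(head, tail)` tuple, counting `n` from 1, and raising when there are too few separators are all assumptions, since the return value is not documented above.

```python
def split_at_nth(seq, separator, n):
    """Split seq into (head, tail) at the n-th occurrence of separator.

    Assumption: n counts from 1, and the separator itself is dropped.
    """
    parts = seq.split(separator)
    if n < 1 or n >= len(parts):
        raise ValueError("separator occurs fewer than n times")
    return separator.join(parts[:n]), separator.join(parts[n:])
```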

singlecellmultiomics.utils.submission module¶

singlecellmultiomics.utils.submission.create_job_file_paths(target_directory, job_alias=None, prefix=None, job_file_name=None)[source]¶

singlecellmultiomics.utils.submission.generate_job_script(scheduler, jobfile, stderr, stdout, job_name, memory_gb, working_directory, time_h, threads_n, email, mail_when_finished=False, copy_env=True, slurm_scratch_space_size=None)[source]¶

singlecellmultiomics.utils.submission.generate_submission_command(jobfile, hold, scheduler='sge')[source]¶

singlecellmultiomics.utils.submission.submit_job(command, target_directory, working_directory, threads_n=1, memory_gb=8, time_h=8, scheduler='sge', copy_env=True, email=None, job_alias=None, mail_when_finished=False, hold=None, submit=True, prefix=None, job_file_name=None, job_name=None, silent=False, slurm_scratch_space_size=None)[source]¶

Submit a job

Parameters:	threads_n (int) – number of requested threads
	memory_gb (int) – amount of requested memory
	scheduler (str) – sge/slurm/local
	hold (list) – list of job dependencies
	submit (bool) – perform the actual submission; when set to False only the submission script is written
Returns:	id of submitted job
Return type:	job_id(str)
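
To show how `scheduler` and `hold` might translate into a command line, the sketch below builds one for each scheduler. It is not the library's implementation: `build_submission_command` is a hypothetical name, running local jobs with `sh` is an assumption, and the choice of `afterany` for Slurm dependencies is an assumption as well. The `qsub -hold_jid` and `sbatch --dependency` options themselves are standard SGE and Slurm options.

```python
import shlex


def build_submission_command(jobfile, hold=None, scheduler="sge"):
    """Return a shell command that submits jobfile to the given scheduler.

    Hypothetical helper; hold is an optional list of job ids to wait for.
    """
    hold = [str(job_id) for job_id in (hold or [])]
    if scheduler == "sge":
        command = ["qsub"]
        if hold:
            command += ["-hold_jid", ",".join(hold)]
        command.append(jobfile)
    elif scheduler == "slurm":
        command = ["sbatch"]
        if hold:
            # Assumption: start once the held jobs end, whatever their status
            command.append("--dependency=afterany:" + ":".join(hold))
        command.append(jobfile)
    elif scheduler == "local":
        command = ["sh", jobfile]
    else:
        raise ValueError(f"unknown scheduler: {scheduler!r}")
    # Quote each part so paths with spaces survive the shell
    return " ".join(shlex.quote(part) for part in command)
```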

singlecellmultiomics.utils.submission.write_cmd_to_submission_file(cmd, job_data, jobfile, scheduler='sge')[source]¶