singlecellmultiomics.utils package

Submodules

singlecellmultiomics.utils.binning module

singlecellmultiomics.utils.binning.bp_chunked(job_generator, bp_per_job)[source]

Chunk an iterator of coordinate-sorted tasks into chunks with a total size of roughly bp_per_job

Parameters:
  • job_generator – iterable of commands, format (contig, start, end, *task)
  • bp_per_job (int) – Amount of bp per chunk of jobs/tasks
Yields:

chunk(list) – [(contig, start, end, *task),(contig, start, end, *task),..]

@todo: contig is not used, this function expects that only bins on a single contig are supplied
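The chunking behaviour described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the name bp_chunked_sketch is hypothetical, and it assumes a chunk is closed as soon as the summed job lengths (end - start) reach bp_per_job.

```python
def bp_chunked_sketch(job_generator, bp_per_job):
    """Group coordinate-sorted (contig, start, end, *task) jobs into chunks
    that each cover roughly bp_per_job base pairs (hypothetical sketch)."""
    chunk, bp_in_chunk = [], 0
    for job in job_generator:
        # The contig (job[0]) is ignored, mirroring the @todo note above
        start, end = job[1], job[2]
        chunk.append(job)
        bp_in_chunk += abs(end - start)
        if bp_in_chunk >= bp_per_job:
            yield chunk
            chunk, bp_in_chunk = [], 0
    if chunk:
        # Emit the remaining jobs that did not fill a whole chunk
        yield chunk


jobs = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300)]
chunks = list(bp_chunked_sketch(jobs, bp_per_job=200))
```

With three 100 bp jobs and bp_per_job=200, the first two jobs end up in one chunk and the third in a trailing chunk.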

singlecellmultiomics.utils.binning.coordinate_to_bins(point, bin_size, sliding_increment)[source]

Convert a single value to a list of overlapping bins

Parameters:
  • point (int) – coordinate to look up
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:

list

Return type:

[(bin_start, bin_end), ..]

singlecellmultiomics.utils.binning.coordinate_to_sliding_bin_locations(dp, bin_size, sliding_increment)[source]

Convert a single value to the coordinates and indices of its overlapping bins

Parameters:
  • point (int) – coordinate to look up (named dp in the signature)
  • bin_size (int) – bin size
  • sliding_increment (int) – sliding window offset, this is the increment between bins
Returns:
  • start (int) – the start coordinate of the first overlapping bin
  • end (int) – the end of the last overlapping bin
  • start_id (int) – the index of the first overlapping bin
  • end_id (int) – the index of the last overlapping bin
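The two sliding-window helpers above can be illustrated together. The sketch below is not taken from the library source: the function names are hypothetical, and it assumes half-open bins [i * sliding_increment, i * sliding_increment + bin_size) with a non-negative bin index i.

```python
def sliding_bin_locations_sketch(point, bin_size, sliding_increment):
    """Return (start, end, start_id, end_id) of the bins overlapping point.

    Assumption: bin i spans [i * sliding_increment, i * sliding_increment + bin_size).
    """
    # Bin i contains point when i * inc <= point < i * inc + bin_size
    start_id = max(0, (point - bin_size) // sliding_increment + 1)
    end_id = point // sliding_increment
    start = start_id * sliding_increment
    end = end_id * sliding_increment + bin_size
    return start, end, start_id, end_id


def coordinate_to_bins_sketch(point, bin_size, sliding_increment):
    """List every (bin_start, bin_end) that overlaps point."""
    _, _, start_id, end_id = sliding_bin_locations_sketch(
        point, bin_size, sliding_increment)
    return [(i * sliding_increment, i * sliding_increment + bin_size)
            for i in range(start_id, end_id + 1)]


locations = sliding_bin_locations_sketch(250, bin_size=200, sliding_increment=100)
bins = coordinate_to_bins_sketch(250, bin_size=200, sliding_increment=100)
```

Under these assumptions, position 250 falls into the bins (100, 300) and (200, 400), which have indices 1 and 2.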

singlecellmultiomics.utils.blockzip module

class singlecellmultiomics.utils.blockzip.BlockZip(path, mode='r', read_all=False)[source]

Bases: object

__getitem__(contig_position_strand)[source]

Obtain data at the supplied contig, position and strand

Parameters:contig_position_strand (tuple) – tuple of (contig (str), position (int), strand (bool))
Returns:data stored for the genomic location, or None when no data is available
Return type:result (str)
__iter__()[source]

Get iterator going over all lines in the file

read_contig_to_cache(contig, region_start=None, region_end=None)[source]
read_file_line(line)[source]
verify()[source]
write(contig, position, strand, data)[source]

Write information for location contig/position/strand.

Warning: write the data contig by contig; randomly mixing contigs will result in a corrupted file.

Parameters:
  • contig (str)
  • position (int)
  • strand (bool)
  • data (str)

singlecellmultiomics.utils.html module

singlecellmultiomics.utils.html.style_str(s, color='black', weight=300)[source]

Style the supplied string with HTML tags

Parameters:
  • s (str) – string to format
  • color (str) – color to show the string in
  • weight (int) – how thick the string will be displayed
Returns:

html representation of the string

Return type:

html(string)
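As an illustration of the kind of markup such a helper could produce, here is a minimal sketch. The exact tags emitted by the library are not shown on this page, so the use of a span element with inline CSS is an assumption, as is the name style_str_sketch; escaping the text is an addition of this sketch.

```python
from html import escape


def style_str_sketch(s, color="black", weight=300):
    """Wrap s in a <span> with inline colour and font weight (assumed markup)."""
    # escape() keeps characters such as '<' from being read as HTML
    return f'<span style="color:{color};font-weight:{weight}">{escape(s)}</span>'


styled = style_str_sketch("hi", color="red", weight=700)
```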

singlecellmultiomics.utils.iteration module

singlecellmultiomics.utils.iteration.find_ranges(iterable)[source]

Yield range of consecutive numbers.
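A common way to find runs of consecutive integers is to group values by the difference between value and index. The sketch below uses that technique; the name find_ranges_sketch and the (first, last) tuple output format are assumptions, since the page does not state what the library yields.

```python
from itertools import groupby


def find_ranges_sketch(iterable):
    """Yield (first, last) for every run of consecutive integers."""
    # value - index is constant within a run of consecutive numbers
    for _, group in groupby(enumerate(iterable), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in group]
        yield run[0], run[-1]


ranges = list(find_ranges_sketch([1, 2, 3, 7, 8, 10]))
```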

singlecellmultiomics.utils.sequtils module

class singlecellmultiomics.utils.sequtils.Reference[source]

Bases: singlecellmultiomics.utils.prefetch.Prefetcher

This is a picklable wrapper to pass reference handles

instance(arg_update)[source]
prefetch(contig, start, end)[source]
singlecellmultiomics.utils.sequtils.base_probabilities_to_likelihood(probs: dict)[source]
singlecellmultiomics.utils.sequtils.complement(seq)[source]

Obtain complement of seq

Returns:complement (str)
singlecellmultiomics.utils.sequtils.create_MD_tag(reference_seq, query_seq)[source]

Create MD tag

Parameters:
  • reference_seq (str) – reference sequence of alignment
  • query_seq (str) – query bases of alignment

Returns:md description of the alignment
Return type:md_tag(str)
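To show what an MD description looks like, here is a minimal sketch for an ungapped alignment. It is not the library's implementation: the name create_md_tag_sketch is hypothetical, both sequences are assumed to have equal length, and deletions (the ^ syntax of the SAM specification) are not handled.

```python
def create_md_tag_sketch(reference_seq, query_seq):
    """Build an MD string for two equally long, ungapped sequences."""
    if len(reference_seq) != len(query_seq):
        raise ValueError("sketch only supports ungapped, equal-length sequences")
    parts = []
    matches = 0
    for ref_base, query_base in zip(reference_seq.upper(), query_seq.upper()):
        if ref_base == query_base:
            matches += 1
        else:
            # Emit the preceding match count followed by the reference base
            parts.append(f"{matches}{ref_base}")
            matches = 0
    # An MD string always ends with a match count, even when it is 0
    parts.append(str(matches))
    return "".join(parts)


md = create_md_tag_sketch("ACGTACGT", "ACGAACGT")
```

Here the single mismatch at the fourth base, where the reference has T, gives the string 3T4.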
singlecellmultiomics.utils.sequtils.create_fasta_dict_file(refpath: str, skip_if_exists=True)[source]

Create index dict file for the reference fasta at refpath

Parameters:
  • refpath – path to fasta file
  • skip_if_exists – do not generate the index if it exists
Returns:

path to the dict index file

Return type:

dpath (str)

singlecellmultiomics.utils.sequtils.get_chromosome_number(chrom: str) → int[source]
Get chromosome number (index) of the supplied chromosome:
‘1’ -> 1, chr1 -> 1, returns -1 when not available, chrM -> -1
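The mapping listed above can be reproduced with a few lines. This is a sketch under the assumption that only an optional "chr" prefix needs to be stripped; the name get_chromosome_number_sketch is hypothetical.

```python
def get_chromosome_number_sketch(chrom):
    """Return the chromosome number, or -1 for non-numeric names."""
    # Strip an optional "chr" prefix, e.g. "chr1" becomes "1"
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    return int(name) if name.isdecimal() else -1


numbers = [get_chromosome_number_sketch(c) for c in ("1", "chr1", "chrM", "chrX")]
```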
singlecellmultiomics.utils.sequtils.get_consensus_dictionaries(R1, R2, only_include_refbase=None, dove_safe=False, min_phred_score=None, skip_first_n_cycles_R1=None, skip_last_n_cycles_R1=None, skip_first_n_cycles_R2=None, skip_last_n_cycles_R2=None, dove_R2_distance=0, dove_R1_distance=0)[source]
singlecellmultiomics.utils.sequtils.get_context(contig: str, position: int, reference: pysam.libcfaidx.FastaFile, ibase: str = None, k_rad: int = 1)[source]
Parameters:
  • contig – contig of the location to extract context
  • position – zero based position
  • reference – pysam.FastaFile handle or similar object which supports .fetch()
  • ibase – single base to inject into the middle of the context
  • k_rad – radius to extract
Returns:

extracted context with length k_rad*2 + 1

Return type:

context(str)
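The parameters above suggest the following logic, sketched here with a stand-in reference object so the example runs without pysam. FakeReference and get_context_sketch are hypothetical names, and the sketch does not handle positions closer than k_rad to a contig end.

```python
class FakeReference:
    """Stand-in for an object that supports .fetch(contig, start, end)."""

    def __init__(self, sequences):
        self.sequences = sequences

    def fetch(self, contig, start, end):
        # Zero-based, half-open interval
        return self.sequences[contig][start:end]


def get_context_sketch(contig, position, reference, ibase=None, k_rad=1):
    """Return the k_rad*2 + 1 bases centred on a zero-based position."""
    context = reference.fetch(contig, position - k_rad, position + k_rad + 1).upper()
    if ibase is not None:
        # Replace the middle base with the injected base
        context = context[:k_rad] + ibase + context[k_rad + 1:]
    return context


ref = FakeReference({"chr1": "ACGTACGT"})
plain = get_context_sketch("chr1", 3, ref)
injected = get_context_sketch("chr1", 3, ref, ibase="C")
```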

singlecellmultiomics.utils.sequtils.get_contig_lengths_from_resource(resource) → dict[source]

Extract contig lengths from the supplied resource (Fasta file or Bam/Cram/Sam)

Returns:lengths (dict)

singlecellmultiomics.utils.sequtils.get_contig_list_from_fasta(fasta_path: str, with_length: bool = False) → list[source]
Obtain list of contigs from a fasta file,
all alternative contigs are pooled into the string MISC_ALT_CONTIGS_SCMO
Parameters:
  • fasta_path (str or pysam.FastaFile) – Path or handle to fasta file
  • with_length (bool) – return list of lengths
Returns:

List of contigs + [‘MISC_ALT_CONTIGS_SCMO’] if any alt contig is present in the fasta file

Return type:

contig_list (list )

singlecellmultiomics.utils.sequtils.get_file_type(s: str)[source]

Guess the file type of the input string, returns None when the file type can not be determined

singlecellmultiomics.utils.sequtils.hamming_distance(a, b)[source]
singlecellmultiomics.utils.sequtils.invert_strand_f(s)[source]
singlecellmultiomics.utils.sequtils.is_autosome(chrom: str) → bool[source]

Returns True when the chromosome is an autosomal chromosome, not an alternative allele, mitochondrial or sex chromosome

Parameters:chrom (str) – chromosome name
Returns:True when the chromosome is an autosome
Return type:is_main(bool)
singlecellmultiomics.utils.sequtils.is_main_chromosome(chrom: str, exclude_mt=False) → bool[source]

Returns True when the chromosome is a main chromosome, not an alternative locus, scaffold, decoy or spike-in

Parameters:chrom (str) – chromosome name
Returns:True when the chromosome is a main chromosome
Return type:is_main(bool)
singlecellmultiomics.utils.sequtils.likelihood_to_prob(likelihoods)[source]
singlecellmultiomics.utils.sequtils.phred_to_prob(phred)[source]

Convert a phred score (ASCII) or integer to a numeric probability

Parameters:phred (str/int) – score to convert

Returns:probability(float)
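For orientation, the standard Phred definition relates a score Q to an error probability of 10^(-Q/10). The sketch below applies that definition; it assumes Phred+33 encoding for ASCII input, and whether the library returns the error probability or its complement is not stated on this page. The name phred_to_error_prob_sketch is hypothetical.

```python
def phred_to_error_prob_sketch(phred):
    """Convert a Phred score (int or Phred+33 character) to an error probability."""
    # Accept either an integer score or a single ASCII quality character
    q = phred if isinstance(phred, int) else ord(phred) - 33
    return 10 ** (-q / 10)


from_int = phred_to_error_prob_sketch(20)
from_char = phred_to_error_prob_sketch("I")  # 'I' encodes Q40 in Phred+33
```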
singlecellmultiomics.utils.sequtils.phredscores_to_base_call(probs: dict)[source]

Perform base calling on an observation dictionary. Returns N when there are multiple options with the same likelihood

Parameters:
  • probs – dictionary with confidence scores, probs = { 'A':[0.95,0.99,0.9], 'T':[0.1], }
Returns:
  • base (str) – called base
  • phred (float) – probability of the call to be correct

singlecellmultiomics.utils.sequtils.pick_best_base_call(*calls) → tuple[source]

Pick the best base-call from a list of base calls

Example

>>> pick_best_base_call( ('A',32), ('C',22) )
('A', 32)
>>> pick_best_base_call( ('A',32), ('C',32) )
('N', 0)
Parameters:calls (generator) – generator/list containing tuples
Returns:tuple (best_base, best_q) or (‘N’,0) when there is a tie
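A minimal sketch of this selection logic, following the Returns description (a tie yields ('N', 0)). It is an illustration rather than the library's code, and the name pick_best_base_call_sketch is hypothetical.

```python
def pick_best_base_call_sketch(*calls):
    """Return the (base, quality) tuple with the highest quality,
    or ('N', 0) when two different bases share the top quality."""
    best_base, best_q, tie = None, -1, False
    for base, q in calls:
        if q > best_q:
            best_base, best_q, tie = base, q, False
        elif q == best_q and base != best_base:
            tie = True
    if tie or best_base is None:
        return ("N", 0)
    return best_base, best_q


clear_winner = pick_best_base_call_sketch(("A", 32), ("C", 22))
tied = pick_best_base_call_sketch(("A", 32), ("C", 32))
```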
singlecellmultiomics.utils.sequtils.prob_to_phred(prob: float)[source]

Convert probability of base call being correct into phred score. Values are clipped to stay within the 0 to 60 phred range.

Parameters:prob (float) – probability of base call being correct
Returns:phred_score (byte)
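The conversion described above can be sketched with the standard relation Q = -10 * log10(1 - p), where p is the probability that the call is correct. Rounding to the nearest integer is an assumption of this sketch, and the name prob_to_phred_sketch is hypothetical.

```python
import math


def prob_to_phred_sketch(prob):
    """Convert the probability of a correct call to a Phred score in [0, 60]."""
    error = 1.0 - prob
    if error <= 0:
        # A probability of 1 would give an infinite score, so clip to the maximum
        return 60
    phred = round(-10 * math.log10(error))
    return max(0, min(60, phred))


scores = [prob_to_phred_sketch(p) for p in (0.0, 0.5, 0.99, 0.999, 1.0)]
```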
singlecellmultiomics.utils.sequtils.read_to_consensus_dict(read, start: int = None, end: int = None, only_include_refbase: str = None, skip_first_n_cycles: int = None, skip_last_n_cycles: int = None, min_phred_score: int = None)[source]

Obtain consensus calls for read, between start and end

singlecellmultiomics.utils.sequtils.reverse_complement(seq)[source]

Obtain reverse complement of seq

Returns:reverse complement (str)
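Complementing and reverse complementing a DNA string can be sketched with a translation table. The function names are hypothetical, and the sketch assumes only the characters A, C, G, T and N (upper or lower case) need to be mapped; any other character is passed through unchanged.

```python
# Translation table mapping each base to its complement
_COMPLEMENT_TABLE = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement_sketch(seq):
    """Return the complement of seq, keeping the original orientation."""
    return seq.translate(_COMPLEMENT_TABLE)


def reverse_complement_sketch(seq):
    """Return the complement of seq read in the opposite direction."""
    return complement_sketch(seq)[::-1]


comp = complement_sketch("AACG")
revcomp = reverse_complement_sketch("AACG")
```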
singlecellmultiomics.utils.sequtils.split_nth(seq, separator, n)[source]

Split sequence at the n-th occurrence of separator

Parameters:
  • seq (str) – sequence to split
  • separator (str) – separator to split on
  • n (int) – split at the n-th occurrence
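One way to split at the n-th separator is shown below. The return format is not documented on this page, so this sketch assumes a 1-based n and a (before, after) tuple that excludes the separator itself; the name split_nth_sketch is hypothetical.

```python
def split_nth_sketch(seq, separator, n):
    """Split seq into (before, after) at the n-th occurrence of separator."""
    parts = seq.split(separator)
    if n < 1 or n >= len(parts):
        raise ValueError("separator occurs fewer than n times")
    # Re-join the pieces on either side of the n-th separator
    return separator.join(parts[:n]), separator.join(parts[n:])


before, after = split_nth_sketch("a_b_c_d", "_", 2)
```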

singlecellmultiomics.utils.submission module

singlecellmultiomics.utils.submission.create_job_file_paths(target_directory, job_alias=None, prefix=None, job_file_name=None)[source]
singlecellmultiomics.utils.submission.generate_job_script(scheduler, jobfile, stderr, stdout, job_name, memory_gb, working_directory, time_h, threads_n, email, mail_when_finished=False, copy_env=True, slurm_scratch_space_size=None)[source]
singlecellmultiomics.utils.submission.generate_submission_command(jobfile, hold, scheduler='sge')[source]
singlecellmultiomics.utils.submission.submit_job(command, target_directory, working_directory, threads_n=1, memory_gb=8, time_h=8, scheduler='sge', copy_env=True, email=None, job_alias=None, mail_when_finished=False, hold=None, submit=True, prefix=None, job_file_name=None, job_name=None, silent=False, slurm_scratch_space_size=None)[source]

Submit a job

Parameters:
  • threads_n (int) – number of requested threads
  • memory_gb (int) – amount of requested memory in GB
  • scheduler (str) – sge/slurm/local
  • hold (list) – list of job dependencies
  • submit (bool) – perform the actual submission, when set to False only the submission script is written
Returns:

id of submitted job

Return type:

job_id(str)

singlecellmultiomics.utils.submission.write_cmd_to_submission_file(cmd, job_data, jobfile, scheduler='sge')[source]

Module contents