Skip to content

Comments

initial upload#2

Merged
ambrosejcarr merged 1 commit intomasterfrom
upload
Nov 15, 2017
Merged

initial upload#2
ambrosejcarr merged 1 commit intomasterfrom
upload

Conversation

@ambrosejcarr
Copy link
Member

Initial upload of the scsequtils class, converted and uploaded to humancellatlas as sctools

I put together a basic readme of the main package classes and their use, as well as the command line argument(s) (currently only one) that I'm using for optimus.

The main functions are in the sctools package, and tests for each module are in sctools/test

There are quite a few testing files whose size I have tried to keep small; there is no need to review any code in sctools/test/data/

setup.py Outdated
CLASSIFIERS = [
"Development Status :: 4 - Beta",
"Natural Language :: English",
"License :: OSI Approved :: GNU General Public License v2 (GPLv2)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The license checked into the repo is MIT. Other Mint repos are BSD-3 so would prefer to use that for consistency, unless there's a good reason not to.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. I'll switch it.


def tag(self, output_bam_name, tag_generators):
"""
given a bam file and tag generators in the same order, adds tags to the .bam file,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean by "in the same order" here? same order as something in the bam file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. All files need equivalent sort order. I'll make the doc string clearer.

outbam = pysam.AlignmentFile(output_bam_name, 'wb', header=inbam.header)
try:
# zip up all the iterators
for *tag_sets, sam_record in zip(*tag_generators, inbam):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with pysam. I'm assuming this does not pull all reads in the bam into memory at once, although naively that's that it looks like. Can you confirm?

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed; it's an iterator, and pysam is the python wrapper for htslib.

This line is zipping up three iterators that are lazily read. You get 2 fastq records (8 lines) and one sam line per iteration.


:return Iterator: iterator over tuples of records, one from each passed Reader object.
"""
# iterators = [iter(r) for r in readers]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this commented out line needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, will remove.



def zip_readers(*readers, indices=None):
"""zip together multiple fastq objects, yielding records simultaneously.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a more general function, right? It doesn't have to be a fastq object, just a Reader.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll edit the docstring!

setup.py Outdated
install_requires=[
'pysam',
'numpy',
'google-cloud',
Copy link
Contributor

@dshiga dshiga Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings in a huge web of dependencies. We were doing the same in secondary-analysis but changed it to install a more targeted set of google libraries. (We ran into an issue with one of the many packages that google-cloud installs, which it turned out we didn't actually need.) This also makes it a lot faster to install.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very cool. Do you think it would be adequate to pull the same set of libraries you're using in the secondary-analysis repo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, depends which actual google modules you're using. Are you using it just for reading from buckets? In that case you probably just need google-cloud-storage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll verify, but I think that's all.

return the length of the Reader object. this should typically not be used with sys.stdin,
as it will consume the input
"""
return sum(1 for _ in self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be helpful to clarify that this gives the sum of line counts for all files in Reader and requires fully reading all files.

for file_ in self._files:

# set correct mode
if self._mode == 'r' and any(file_.endswith(s) for s in ['.gz', '.bz2']):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we treat this as a special case? We already set mode = 'r' when self._mode == 'r' by default in the else. In this special case we set it to 'rt' instead. And according to this, r and rt are the same thing:
https://stackoverflow.com/questions/23051062/open-files-in-rt-and-wt-modes

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the gzip and bzip2 libraries do make a distinction between r and rt; r means bytes, by default for these packages, which is annoying/confusing. This line fixes it so r means str and rb means bytes across the reader, which is, I think, normally what the user expects.

https://docs.python.org/3/library/gzip.html#gzip.open

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... if they have proper extensions. I will look into a better way of detecting compressing type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, in that case I think what you have is perfectly reasonable here. Maybe just add a comment explaining that rt really does differ from r in this case.

"""return the collective size of all files being read in bytes"""
return sum(os.stat(f).st_size for f in self._files)

def select_indices(self, indices):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices is a set of line numbers that we want to filter down to, right? Or is it more general than that? If it's line numbers it might be clearer to call this select_lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select_record_indices maybe? It's not pulling out lines/records themselves, just indices to them.

to create records specific to exons, transcripts, and genes
"""

__slots__ = ['_fields', '_attribute']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be _attributes plural? Looks like it's a dict of k/v pairs below.

"""
fields = record.strip(';\n').split('\t')
self._fields = fields[:8]
self._attribute = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth a comment explaining the parsing that's happening here, and/or a short example of what fields[8] looks like.

:param key: item to access
:return str: value of item
"""
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could replace 94-97 with return self._attribute.get(key)

def size(self):
size = self.end - self.start
if size < 0:
raise ValueError('invalid record: negative size %d (start > end)' % size)
Copy link
Contributor

@dshiga dshiga Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not calculate this in __init__ and throw an error there? That would protect against creation of invalid records. Or is there value in being able to parse invalid ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's more about limiting unnecessary processing of records; I wanted to maximize speed here and so do no parsing of the records until the data is specifically requested.

Often only a few fields of the record are of interest, or a particular type of record. For example, often times on gene records are wanted, and therefore there is no value in processing the much more frequent exon or transcript records.

I grudgingly added the variable field parsing to init (field[8:], which you correctly wanted an explanation for) because it contains gene id information that is normally of interest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explanation in docstring.

super().__init__(files, mode, header_comment_char) # has different default args from super

def __iter__(self):
for line in super().__iter__():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would for line in super(): accomplish the same thing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I take that back, I like being explicit here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left unchanged.

:param [str|bytes] record: list of four strings or bytes objects defining a
single fastq record
"""
self._data = list(record)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a special reason why we are making a new list rather than using record as is, or are you just being careful? If there's a special reason, would be worth a comment.

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My docstring in line 21 is inaccurate; it should say "iterable" instead of list. This will often be read as a tuple. However, I require the ability to change the record, and so need it to be mutable. Hence the list conversion. Good catch.


@name.setter
def name(self, value):
if not isinstance(value, (bytes, str)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently you can create a record with an invalid name type but setting name to an invalid type after the fact throws an error. Would it be worth having __init__ call the setters so that this error is also thrown at record creation time? Or is there a reason why you would want to allow invalid records to be constructed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a value for invalid records. I don't think calling the setters at __init__ time will introduce a significant overhead, so I'll switch to this. Thanks for the suggestion. 👍


@name2.setter
def name2(self, value):
if not isinstance(value, (bytes, str)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to have these constraints. In addition to restricting types, are there other constraints that would be worth adding? Do you frequently encounter fields with mangled format that you would want to reject for example?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea; technically they are supposed to start with '@'. I can add this in.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I want to be careful to balance validation with speed; all quality strings should contain only ASCII letters and the name field has a specific structure defined by the sequencer that generates the fastq file. However, these are expensive to validate and might be better served by external validation tools (or expanded later, if we find we want to apply tools that require the structures of those fields to be intact/correct"

self._data[3] = value

def __bytes__(self):
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you expect _data to usually be in bytes, and/or do you prefer to optimize speed for the bytes case, rather than for string? That appears to be the choice you're making here, just want to confirm.

Copy link
Contributor

@dshiga dshiga Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth moving implementation of __bytes__ and __str__ into the BytesRecord and StrRecord classes, so you don't have to try catch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect it to almost always be bytes (and it will always be bytes for our pipelines), as
it's faster to convert from bytes to 2-bit encodings; which you will see later in the package.

I used to have these inside the respective classes as you're suggesting. The reason I moved away from it was essentially just to reduce the amount of code in the package. My logic was that anyone using strings no longer cares about speed, and therefore it's OK to add small inefficiencies for string records.

I am very willing to move it back if you feel that would be better!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. I will also make a note to be clear bytes is the preferred implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's such a small amount of code and you have to have it somewhere - seems a little clearer to put it in BytesRecord and StrRecord.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted to make bytes and str both 1st class. StrRecord now subclasses Record, which operates on bytes. This should also indicate to the user (subtly) that bytes is the preferred implementation.


def average_quality(self):
"""return the average quality of this record"""
return sum(c for c in self.quality[:-1]) / (len(self.quality[:-1]) - 1) - 33
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why minus 33? A comment might be helpful here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely correct. It's because of an old solexa/illumina conversion.

:return int: mean
:return float: standard deviation
"""
pass # implement
Copy link
Contributor

@dshiga dshiga Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to return NotImplementedError here like you do elsewhere. Or in this case, maybe just take out this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it. Thanks.

chromosome = str(chromosome) # try to convert
if chromosome not in valid_chromosomes:
raise ValueError('chromsome %s not valid. Must be one of %r' %
(chromosome, valid_chromosomes))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to explicitly limit this to GRCh37/38? It will break for hg19 and b37.

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; I should expand it to accept chr19 type chromosomes. Thanks for the comment 👍

for i, record in enumerate(fin):

if not record.is_unmapped: # record is mapped
if chromosome in record.reference_name and specific < n_specific:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a chance of a if "1" in "15" or something like that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only used this for chromosomes 19 and 21; I think you're right. I'll adjust it to use ==.

I was, ironically, trying to support the chr19 vs 19 problem you identify above but there are ways to get that without this bug :)

if chromosome in record.reference_name and specific < n_specific:
chromosome_indices.append(i)
specific += 1
elif nonspecific < include_other:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like other_indices is meant to point to reads that aren't aligned to chromosome, but here it can also include reads aligned to chromosome if there are more than n_specific such reads. If that's intentional, it might be helpful to update the documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not intentional, thanks for pointing this out.


def __init__(self, barcode_set, barcode_length):
"""
:param Counter barcode_set: dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name barcode_set could make people think it should be a Set rather than a Mapping.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True; I'll make this.. barcodes

'median': distances[int(len(distances) * .5)],
'average': sum(distances) / len(distances),
'75th percentile': distances[int(len(distances) * .75)],
'maximum': distances[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use numpy.percentile for these? I think we've already got numpy, and that'll return correct values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid using numpy in the package but eventually capitulated for the entropy calculations. I'll switch it in.

except KeyError:
if byte != 78:
raise
return random.randint(0, 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can return a 4. I think you want randrange.

Also, there are lots of ways to indicate ambiguous bases other than "N".

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In python 3 randint's upper bound is exclusive.

You're right about the other codes. I guess there is no reason not to include the full IUPAC list; I have only ever seen N ambiguous nucleotides in fastq sequence, which was the main purpose here.

👍 for generalizability, however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm committed to python 2.7 forever, so I'm not an expert on this, but I think randint is inclusive in python3: https://docs.python.org/3/library/random.html#random.randint

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'm sorry, i was assuming I was using numpy here, which is exclusive. I even went and tested the numpy version >.<

You're completely correct. I'll change this.

class ThreeBitEncodingMap:

# C: 1, A: 2, G: 3, T: 4, N: 6; # note, not using 0
map_ = {65: 2, 97: 2, 67: 1, 99: 1, 71: 3, 103: 3, 84: 4, 116: 4, 78: 6, 110: 6}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it maybe be clearer to say ord("A") rather than 65?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; that's wonderful. I've been struggling with how to write this clearly.

"""
# iterators = [iter(r) for r in readers]
if indices:
iterators = zip(*[r.select_indices(indices) for r in readers])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to use the generator syntax here? I'm not actually sure if it makes a difference.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean iterators = zip(*(r.select_indices(indices) for r in readers))?

I'll make sure it doesn't break, but yea; that would probably be more memory efficient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I mean it probably has no real effect, but you're so careful about using iterators everywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think this could really matter at scale. This is a good catch. 👍

"""

@staticmethod
def record_grouper(iterable):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this function better after talking in person. A comment about its purpose would probably be helpful, here or maybe in the class docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docstring to function.

"""
:param FastqRecord record: record to extract from
:param Tag tag: defines tag to extract
:return tuple:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a more detailed comment about what's returned



# this could easily be expanded beyond fastq generated tags, we just don't have any use cases yet.
Tag = namedtuple('Tag', ['start', 'end', 'sequence_tag', 'quality_tag'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mostly understand now, but it was not clear to me at first what a fastq tag is and what TagGenerator was doing. More descriptive comments about what Tag and TagGenerator are for would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there is one too many levels of abstraction in this set of functions; this is what I was most concerned about being unclear. I'll put some time into improving the documentation and also maybe the code.

class ThreeBitEncodingMap:

# C: 1, A: 2, G: 3, T: 4, N: 6; # note, not using 0
map_ = {65: 2, 97: 2, 67: 1, 99: 1, 71: 3, 103: 3, 84: 4, 116: 4, 78: 6, 110: 6}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, unless it would bog down performance, what about creating the map_ this way?
{ord('A'): 2, ord('a'): 2, ... }
That way you are documenting the mapping in code rather than a comment, and less chance for a typo to introduce an error.
Or you could have raw_map = {'A': 2, 'a': 2, ... } and then process that to create map_ = {65: 2, ... }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Markus made the same (really good) suggestion. I'm going to change this; much much clearer the way you're both suggesting.

except KeyError:
if byte != 78:
raise
return random.randint(0, 4)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns 0 to 4 inclusive, but I think you meant 0-3. random.randrange(0, 4) or random.randint(0, 3)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct; I converted from numpy but forgot that numpy.randint and random.randint are different.

for i in reversed(range(self._barcode_length)):
binary_base_representations, counts = np.unique(keys & 3, return_counts=True)
if weighted: # todo weighted not working, values multiplication does not work
base_counts_by_position[i, binary_base_representations] = counts # * values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should throw a NotImplementedError, or else take out this option altogether for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should throw a NotImplementedError or I can extract it to another open branch and add an issue. I think this will be useful, but we can drop it from this upload as I haven't figured out how to do this efficiently, yet.


def __init__(self, barcode_set, barcode_length):
"""
:param Counter barcode_set: dictionary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be helpful to indicate that this is a dict of barcodes to counts. Do you need it to be a collections.Counter specifically, or any dict of barcodes to counts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any mapping to counts is probably Ok; i'll verify and make explicit.

def from_whitelist(cls, file_, barcode_length):
"""Creates a barcode set from a whitelist file

:param str file_: location of the whitelist file. Should be formatted one barcode per line.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the barcodes are expected to be strings like 'ACGT', not two bit or three bit encoded, etc. Would be helpful to specify that or give an example barcode string, since elsewhere in this package we expect two bit or three bit encoded strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment in file_ docstring.

def from_iterable_strings(cls, iterable, barcode_length):
"""construct an ObservedBarcodeSet from an iterable of string barcodes"""
tbe = TwoBit(barcode_length)
return cls(Counter(tbe.encode(b) for b in iterable))
Copy link
Contributor

@dshiga dshiga Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to include barcode_length as a param to cls? Same for line 128.

if not isinstance(barcode_set, Mapping):
raise TypeError('barcode set must be a dict-like object mapping barcodes to counts')
self._data = barcode_set
self._barcode_length = barcode_length
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an error if barcode_length is None?

return base4_entropy(self.base_frequency(weighted=weighted))


class PriorBarcodeSet(BarcodeBase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why two classes are needed here, rather than three factory methods that create instances of BarcodeBase from a file, from an iterable of strings, and an iterable of bit encoded strings. Once you've constructed them, they share all of their code in BarcodeBase.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I might have some questions about implementation here, but I'll make this change.

'using picard FastqToSam')
parser.add_argument('-o', '--output-bamfile', required=True,
help='filename for tagged bam')
args = vars(parser.parse_args())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why convert to a dict here? Isn't it easier to use args.u2 etc below instead of args['u2']?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing; it's hard to create namespace objects but easy to create keyword dictionaries; I'll make it clear that args can be passed that way in the docstring.

}

@classmethod
def get_tag(cls, sequencing_read):
Copy link
Contributor

@dshiga dshiga Nov 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be get_tags plural? Returns a tuple of tags. Or get_tag_set.

"""
tag_generators = []
for k, v in files_with_tags.items():
tag_generators.append(fastq.TagGenerator(cls.get_tag(k), files=v))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring says the key is the filename and the value is the tag, opposite of what is happening here.

self._file = alignment_file
self._open_mode = open_mode

def indices_by_chromosome(self, n_specific, chromosome, include_other=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call it include_unaligned instead of include_other. Would it make sense to interpret -1 to mean include all unaligned reads? Is that a useful thing to be able to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, this is not just unaligned reads but any read that does not match the chromosome we asked for.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll find a way to clarify; sounds like it produced some confusion.

Copy link
Member Author

@ambrosejcarr ambrosejcarr Nov 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted docstring:

    """
    Return the list of first n_specific indices of reads aligned to selected chromosome.

    If desired, will also return non-specific indices in a second list (can serve as negative
    control reads).

    :param int n_specific: number of aligned reads to return indices for
    :param str chromosome: only reads from this chromosome are considered valid
    :param int include_other: optional, (default=0), the number of reads to include that are
      NOT aligned to chromosome (could be aligned or unaligned read)
    :return [int]: list of indices to reads aligning to chromosome
    :return [int]: list of indices to reads NOT aligning to chromosome, only returned if
      include_other is not 0.
    """

:param [fastq.TagGenerator] tag_generators: generators that yield Tag objects
(see fastq.Tag)
"""
inbam = pysam.AlignmentFile(self.bam_file, 'rb', check_sq=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using with pysam.AlignmentFile the way you do on line 57?


with pysam.AlignmentFile(self._file, self._open_mode) as fin:
specific, nonspecific = 0, 0 # counters
chromosome = str(chromosome)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already converted to string on 52.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit different; here I'm being lenient in case the user passed an int as chromosome. On 52 I was building up the set of allowable chromosomes.

chromosome_indices = []
other_indices = []

for i, record in enumerate(fin):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like some of the contents here, especially the if else block would be worth moving out into its own function that could be unit tested on its own.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the code. Since each list hashes its length, I don't need the counters; I can just test the length of chromosome_indices and other_indices directly.

However, removing this into its own function would only serve to separate the type/value checking from the index extraction; I think this belongs where it is, but hope that the code simplification makes it clearer and reduces your desire to extract the code.

Copy link
Contributor

@dshiga dshiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how you've written this, very clear and concise and well documented! I made a bunch of comments on particular lines of code. Once you've made new commits, let me know and I'll take another look.

@ambrosejcarr
Copy link
Member Author

Thanks very much @dshiga. I'll begin addressing your comments shortly!

Copy link
Contributor

@dshiga dshiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

"""

# acceptable chromosomes
valid_chromosomes = [str(i) for i in range(1, 23)] + ['M', 'X', 'Y']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe "MT" is also used to represent mitochrondia too.


if len(chromosome_indices) < n_specific or len(other_indices) < include_other:
warnings.warn('Only %d unaligned and %d reads aligned to chromosome %s were found in'
'%s' % (len(chromosome_indices), len(other_indices),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've swapped chromosome_indices and other_indices.

try:
return self.map_[byte]
except KeyError:
if byte != 78:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep everything consistent, you might want byte in (ord("N"), ord("n"))

def __getitem__(self, item):
return self._data[item]

def summarize_hamming_distances(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a little test of this method?

Copy link
Contributor

@mckinsel mckinsel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@ambrosejcarr
Copy link
Member Author

ambrosejcarr commented Nov 14, 2017

@dshiga @mckinsel

Marcus' most recent comment revealed some non-functional, untested code. I ran a coverage report and added tests to everything I thought important to test. Most of the things that are not currently tested are errors that are supposed to be thrown when the classes receive incorrect input.

If anyone's interested, please take a look at the attached report and let me know if you'd like tests added for anything that's currently not covered. If not, I'll merge this before the demo tomorrow at around 1~10:45a.

cov_html.zip

@ambrosejcarr ambrosejcarr merged commit a011a61 into master Nov 15, 2017
@ambrosejcarr ambrosejcarr deleted the upload branch November 15, 2017 15:07
benjamincarlin added a commit that referenced this pull request Feb 19, 2019
benjamincarlin added a commit that referenced this pull request Mar 11, 2019
* first draft for generalizing attaching barcodes

* updated generalized class

* Fixed PR #1

* styling comment changes

* styling comment changes

* styling comment changes

* barcode end postion bug

* Fixed PR #2

* Fixed PR #2, made arg validation more human readable

* add comments/documentation to the class

* updated comments/documentation style for return values

* Fixed PR #57 comments

* Updated doc strings, error handling

* provided descriptions for input files and purpose of class

* updated class description
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants