Skip to content

Conversation

@jcdyer
Copy link
Contributor

@jcdyer jcdyer commented Sep 19, 2017

OC-3086

Add a model to track completion of individual blocks.

Reviewers

Testing instructions

  1. Test creating and modifying BlockCompletion objects in the django shell:
    $ ./manage.py lms --settings=devstack shell
    ...
    >>> from lms.djangoapps.completion.models import BlockCompletion
    >>> from django.contrib.auth.models import User
    >>> from opaque_keys.edx.keys import CourseKey, UsageKey
    >>> BlockCompletion.objects.submit_completion(user=User.objects.get(...), course_key=CourseKey.from_string('...'), block_key=UsageKey.from_string('...'), completed=0.5)
    >>>
    
  2. Ensure that values have to be between 0.0 and 1.0, and that duplicate completions aren't created for a given user/course/block.
  3. Test migrations up and down.
    $ ./manage.py lms --settings=devstack migrate completion
    $ ./manage.py lms --settings=devstack migrate completion zero
    $ ./manage.py lms --settings=devstack migrate completion

@openedx-webhooks
Copy link

Thanks for the pull request, @jcdyer! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request.

Create an OSPR issue for this pull request.

Copy link
Contributor

@bradenmacdonald bradenmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcdyer Looks good and the migration works - had a few comments for you though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_or_create is actually a standard method on the default Manager already, and this is shadowing that base class method. Can you just use that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has two customizations over the standard create_or_update, one of which is important, and the other which is just a convenience:

Important: It only updates the entry when the existing entry has changed its completion value. This prevents the modified timestamp from getting updated when it shouldn't, which will prove important for determining when to perform a recalculation of aggregators.

Convenience: The caller is not required to pass a block type, because it can be inferred from the block key.

I agree it's not great to shadow create_or_update. I could go one of two ways with this:

  1. Eliminate the convenience feature, so this has the same behavior as the super() version, but still performs the check that values have changed before updating.
  2. Rename this (maybe submit_completion) and keep the convenience function. I'm tempted to go with 2, to make it clear that we're working with a custom API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for renaming - I think it's better not to shadow methods and change the behavior :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It currently allows True and False through without error - does that matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This is only going to be applied to FloatFields. None is a possible value in a float field, but if you pass store true or false in a FloatField, they will just get stored as 1.0 and 0.0, (because bool is a subclass of float). In fact, there's really no reason to test 'hi'. A (slightly) more interesting question would be how the validator handles float('inf') and float('nan').

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we actually move this into openedx.core.djangoapps per the README notes at https://github.com/edx/edx-platform/tree/master/openedx ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the new openedx.features directory? I think that might be a good fit for it (though it's existence is probably newer than the last time the README was updated. @robrap might have an opinion on that. I believe he's been involved in some work on improving the way the edx-platform code base is organized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason when running migrations and just running the devstack LMS in #16112 I'm seeing:

/edx/app/edxapp/edx-platform/lms/djangoapps/completion/models.py:94: RemovedInDjango19Warning: Model class lms.djangoapps.completion.models.BlockCompletion doesn't declare an explicit app_label and either isn't in an application in INSTALLED_APPS or else was imported before its application was loaded. This will no longer be supported in Django 1.9.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be unrelated, but John Eskew passed on the following details around setting up an AppConfig while working on his Django upgrade work. It came up in the context of importing signals. But, the following quote can be found in these docs:

New applications should avoid default_app_config. Instead they should require the dotted path to the appropriate AppConfig subclass to be configured explicitly in INSTALLED_APPS.

John wrote:

Here's the commit from my branch:
edx@04d662c

and here's the Django docs:
https://docs.djangoproject.com/en/1.11/ref/applications/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robrap, @bradenmacdonald: I added an AppConfig, and squashed my commits. Assuming tests pass, this should be ready to merge, but I no longer have merge permissions on edx-platform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also include a composite index (index_together) on user, course_key? I imagine that retrieving all the data for a particular user in a particular course will be a common use case. But we can also wait to optimize/index later based on actual usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need. MySQL can use the prefix of an index for indexing. One of my favorite resources on how indexes work is this five question quiz. After you finish the quiz, it gives very clear explanations for why the answers are right or wrong, which makes it a great learning resource.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think we will need to get the indexes right before this goes to production, because this table is going to get very large, and once it does, unless I'm mistaken about how MySQL does it, adding an index will be a prohibitively slow operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove block_type from this unique_together, since it's derived from block_key, and if we ever tried to save two entries that differed only in their block_type value, we'd want that to raise an error since that's always an invalid case. With this constraint as written, such a situation would be accepted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, perhaps only user, block_key should be the unique constraint, since the course_key is usually derived from the block_key as well. (Not sure exactly how that works with old/draft mongo though.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that I should keep the unique index precise. It will need to be (user, course_key, block_key) to handle old mongo. The other indexes will not need to be unique (but could be, in some cases).

Copy link
Contributor Author

@jcdyer jcdyer Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Block type still needs to be included in the composite indexes. Even though it's derived from block_key, it will be needed to create filtered separately, in order to create efficient aggregations that filter by block case (to meet edX's watched-videos need).`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will work as an index to retrieve any of the following:

  • all completions in a course
  • all completions of a particular type in a course
  • all completions of a particular type for a given user in a course

Copy link
Contributor Author

@jcdyer jcdyer Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will work as an index on:

  • all completions for a given user
  • all completions in a course for a given user
  • the specific completion for a particular block for a particular user.

Things that aren't indexed:

  • All completions for a particular block across all users
  • All completions modified since a given time at any granularity. For aggregating smartly, do we need to add a (user, course, modified) key? What about (course, modified)? I think we might need the former, but not the latter, because I think we will always need to compare the timestamp against the specific user's aggregations to know whether to update them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be a naive question, but are block key's globally unique, and would one need to know to use the course key to make use of this index for the block key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe block keys are globally unique for split mongo, but they drop the run value from the course key in old-mongo, so if the same block exists in two runs of the same course (highly likely), then they need to be disambiguated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to include an index for recently modified values and to rename the validator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we agreed that we would use "percent" rather than "ratio" in the spec? Is there a strong case for the need for inconsistency?

One additional argument for "percent", which I know you don't love, is that it is precisely the fact that this ratio represents a percent that provides the rules for what are valid and invalid ratios.

@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 22, 2017

jenkins run all

@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 22, 2017

@robrap Do you (or the Taming the Monolith team) have an opinion on where this should live?

  • lms/djangoapps/completion
  • openedx/core/djangoapps/completion
  • openedx/features/completion
  • other

@robrap
Copy link
Contributor

robrap commented Sep 22, 2017

@jcdyer: At this point, it probably should stay where it is in lms/djangoapps/completion.

@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 22, 2017

@robrap: It has never seen the light of the master branch, so "at this point" is probably the best time to move it, if it should be moved. (Unless you mean "at this point" in the monolith taming process)

Copy link
Contributor

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some import comments. Sorry I comment on something different each time in. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we try to import more explicitly, like AutoCreatedField, but maybe not.

Copy link
Contributor Author

@jcdyer jcdyer Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the migrations file. I think it's better to leave that in its autogenerated state, so that you can regenerate it as needed without having to reformat every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's excluded from linting for that reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Sorry. My brain misfired.

@robrap
Copy link
Contributor

robrap commented Sep 22, 2017

Yes. I mean at this point in the taming process.

@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 22, 2017

@bradenmacdonald This is ready for another round of review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very minor, but in the API design, the name of this model, and some methods, we use the term completion, though here and as the final argument to submit_completion you're using completed. Is it worth making those consistent now while we have the chance? Or is there a distinction I may have missed that explains the difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks for renaming - I think it's better not to shadow methods and change the behavior :)

Copy link
Contributor

@bradenmacdonald bradenmacdonald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 @jcdyer Thanks for those changes! I have one nitty comment which is optional. This looks great to me overall. The indexes that we choose to use may change a bit as this evolves, but I think this is a great start.

  • I tested this: verified that the migrations work
  • I read through the code
  • I checked for accessibility issues: n/a
  • Includes documentation: docstrings.

@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 25, 2017

Thanks @bradenmacdonald!

@robrap This is ready for another round of review.

Copy link
Contributor

@robrap robrap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments and questions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed. I added a comment on slack around naming conventions in case you are interested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was block_type intentionally or unintentionally left out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally. The information is already visible to a human reader in block_key.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine your comments about the indices living in the code. Thoughts?

Copy link
Contributor Author

@jcdyer jcdyer Sep 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful information that I think might not be as widely known as it should be, but it's fundamental to how indexes work, and is by no means unique to this particular model (the indexes on the grades models were designed with this in mind, for certain. I would think a wiki page, or other training documentation would be a better place for it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a note somewhere about the purpose of block_type in the model and its relation to upcoming "aggregation models"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Possible repeat comment, but I don't see what Github did with mine.]
Checking if lack of block_type is intentional or unintentional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I responded to the other version of this comment. Intentional.

@jcdyer jcdyer force-pushed the cliff/block-completion-model branch from dcdcc0f to 0c96659 Compare September 28, 2017 10:42
@jcdyer
Copy link
Contributor Author

jcdyer commented Sep 28, 2017

@bradenmacdonald @robrap

I know this PR has already been approved, but I'm thinking about data storage, since this is going to be a big table. Since we have to have the course key for old mongo, and we're including the block type for indexing purposes, does it make sense not to store the full block key separately, and only store the block id? Then, instead of duplicating the data, we're just decomposing the block key into its constituent parts. And we can add a property to return a reconstructed block key whenever needed.

I've been up since 4am, so I could easily be missing something, but I don't think this would cause any problems in terms of retrieval or indexing, as long as we update the indexes appropriately. The signature of submit_completion() would stay the same.

@bradenmacdonald
Copy link
Contributor

@jcdyer

does it make sense not to store the full block key separately, and only store the block id?

Tempting, but that seems to go against the principle of treating opaque keys as actually opaque. It closes the door to future types of keys that have different or additional fields.

If the concern is data size, I would also be tempted to suggest normalizing the data by creating a separate BlockReference table with columns (course_key, block_key, block_type). Then this model would have two foreign keys (one to User, one to BlockReference) and a float, so the data size would be much smaller. However, that would remove the ability to create indexes that involve any of those columns together with the user ID (among other issues), so I suspect that we need to stick with the denormalized version for performance.

@jcdyer jcdyer force-pushed the cliff/block-completion-model branch 3 times, most recently from 2e28925 to 5d9ddfe Compare October 3, 2017 15:53
@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 3, 2017

@robrap This is squashed and ready to merge. I no longer have merge permission, so I'll leave that to you.

@jcdyer jcdyer mentioned this pull request Oct 3, 2017
4 tasks
@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 10, 2017

@robrap @bradenmacdonald Can one of you merge this? It's got two approvals, and I believe it is ready to go.

@robrap
Copy link
Contributor

robrap commented Oct 10, 2017

I'm going to let @bradenmacdonald manage all of these PRs if that's ok. I also might be replaced as reviewer at some point, but I'll let you know if that is the case.

@bradenmacdonald
Copy link
Contributor

@jcdyer @robrap I just had one last thought before we merge: the "Everything About Database Migrations" doc mentions to ask: "Should the primary key be a bigint (more than 4b rows)?"

I don't know what usage numbers are realistic for the coming few years, but if we imagine 10 million learners with 3 courses each and 100 blocks per course, that's 3 billion - do you think we should consider a bigint primary key now? Let me know your thoughts and then I'll merge.

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 11, 2017

@bradenmacdonald Yep. I think that's a good idea. I'll add a commit.

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 11, 2017

It looks like edX has never actually created a BigInt primary key field, and it wasn't quite as straightforward as models.BigIntegerField(primary_key=True), on account of the autoincrementing. I based my solution on this: https://stackoverflow.com/a/17035822/131084 and the implementation that is coming down the pipe in django 1.10 (https://docs.djangoproject.com/en/1.10/_modules/django/db/models/fields/#BigAutoField).

@bradenmacdonald
Copy link
Contributor

@jcdyer There is at least one existing use of a bigint: https://github.com/edx/edx-platform/blob/9c4869c1d59899a1e4a0ed2bf9cf92c77acc4447/lms/djangoapps/coursewarehistoryextended/models.py#L36
I suppose you could reuse the custom field type it uses, though your approach that's compatible with the upcoming BigAutoField (the key difference is that it's signed) is probably better.

I don't see the new code yet so I assume you're still working on it - let me know when it's ready for a spin.

@feanil Quick check: any concerns or advice from devops re merging code with a new DB table that uses a bigint primary key?

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 12, 2017

Oh good find. I did a grep for BigInteger (but apparently not BigInt) and only found one use (which wasn't a key), and then went looking for general purpose model-utility submodules. I think you're right though. Makes sense to stick with the future-compatible version.

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 12, 2017

Hmm. I thought I committed it. Maybe I was on the wrong branch. I'll investigate in the morning.

@robrap
Copy link
Contributor

robrap commented Oct 12, 2017

@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 12, 2017

@bradenmacdonald Ready for review.

@robrap
Copy link
Contributor

robrap commented Oct 12, 2017

FYI: @doctoryes: Thought you might be interested in the BigAutoField discussion and code as you work on the Django upgrades.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jcdyer ! Questions:

  • Will this work with sqlite3, which is typically used in unittests?
  • Is the Django version unsigned or signed? Unsigned makes more sense here, of course.
  • Post-Django-upgrade, the initial migration will still point to this file's def. Do you advise just adding an import of BigAutoField to the top of this file post-upgrade?
  • What's the back-of-envelope calculation for when edx.org will get > 4.2 billion of these completions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doctoryes Good point. BIGINT is allowed for sqlite, but the current combination is causing the python test suite to fail with this sqlite error:

OperationalError: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY

The django BigAutoField is signed. It uses bigint AUTO_INCREMENT on MySQL and integer AUTOINCREMENT on sqlite which works around the above issue, since sqlite uses 64-bit primary keys in any case and allows up to 8 byte ints in any column anyways regardless of its declared type.

@jcdyer I'll give a final review once the tests are fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like @doctoryes has asked most of the questions I would care about.

Copy link
Contributor Author

@jcdyer jcdyer Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @doctoryes!

The back-of-napkin calculation that @bradenmacdonald came up with was 10M users taking 3 courses eaches with 100 completable blocks per course = 3B entries. It will probably take longer than that to get there for a couple reasons:

  1. We won't (and can't) backfill this information.
  2. We aren't currently planning to fill empty records for all blocks when a user enrolls in a course so never-visited blocks are non-records (but this could change).

So we aren't likely to actually hit the cap for significantly longer if at all, but it's close enough to the right ballpark that I think a better-safe-than-sorry policy is warranted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on the app configuration! Is this pass needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope.

@jcdyer jcdyer force-pushed the cliff/block-completion-model branch from 26a3c9b to 8149a26 Compare October 13, 2017 14:10
@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 13, 2017

@bradenmacdonald @robrap This is ready for a (hopefully) final look before I squash commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make this import in the migration the same as the one in models.py? Otherwise this will presumably generate a spurious and invalid migration when we upgrade django since the model will no longer match the migration history.

try:
    from django.models import BigAutoField  # New in django 1.10
except ImportError:
    from openedx.core.djangolib.fields import BigAutoField

@bradenmacdonald
Copy link
Contributor

Had one comment about the migration field import, but otherwise still 👍 from me.

  • I tested this: followed the test instructions with the latest version (had to change completed argument to completion for submit_completion), verified that MySQL is using bigint datatype for the completion_blockcompletion id column.
  • I read through the code
  • I checked for accessibility issues: n/a
  • Includes documentation: docstrings.

@bradenmacdonald
Copy link
Contributor

@jcdyer Please let me know your thoughts on that import in the migration and squash this, then I'll merge :)

@jcdyer jcdyer mentioned this pull request Oct 17, 2017
4 tasks
@jcdyer jcdyer force-pushed the cliff/block-completion-model branch from cd9b226 to 6f56da5 Compare October 17, 2017 18:32
@jcdyer
Copy link
Contributor Author

jcdyer commented Oct 17, 2017

@bradenmacdonald Migration updated and commits squashed. The only change is to the imports in migrations/0001_initial.py

* Includes custom manager.
* Includes percent validation.
* Includes useful indices.
* Subclasses TimeStampedModel

OC-3086
@jcdyer jcdyer force-pushed the cliff/block-completion-model branch from 6f56da5 to 2207ef7 Compare October 17, 2017 18:36
@jcdyer jcdyer force-pushed the cliff/block-completion-model branch from 2207ef7 to 4ab64f7 Compare October 17, 2017 18:46
@bradenmacdonald
Copy link
Contributor

jenkins run bokchoy

@bradenmacdonald bradenmacdonald merged commit d64e0b9 into openedx:master Oct 17, 2017
@bradenmacdonald bradenmacdonald deleted the cliff/block-completion-model branch October 17, 2017 22:06
@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production on Thursday, October 19, 2017.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been deployed to the production environment.

@edx-pipeline-bot
Copy link
Contributor

EdX Release Notice: This PR has been rolled back from the production environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants