Skip to content

Conversation

@mtyaka
Copy link
Member

@mtyaka mtyaka commented Feb 21, 2017

Normally the value of deprecated Answer.course_id will be copied into the new Answer.course_key field by a migration, but if your answer table is very large, you might want to use this management command to perform the copying in batches instead.

The new management command added by this patch lets you copy the column in batches, with a configurable batch size and sleep time between each batch.

Discussion: https://github.com/edx/edx-platform/pull/14327#issuecomment-280043022

Testing instructions:

  1. Install this branch of problem builder. Add some Answer type blocks to a unit in studio and fill them in the LMS to populate some entries in the problem_builder_answer table.
  2. Current release of problem builder does not allow NULL values in the course_key field, so you'll have to temporarily modify the table to be able to nullify the values via mysql console:
ALTER TABLE problem_builder_answer MODIFY course_key VARCHAR(255);
UPDATE problem_builder_answer SET course_id=course_key;
UPDATE problem_builder_answer SET course_key=NULL;
  1. Make sure course_key field of all rows in the table is empty (SELECT * FROM problem_builder_answer).
  2. Run the management command as the edxapp user:
python manage.py lms copy_deprecated_course_id --settings=devstack 
  1. Make sure that the values of course_id have been copied into course_key (SELECT * FROM problem_builder_answer).
  2. Repeat steps 2 - 5 while playing with different --batch-size and --sleep values.
  3. Put your answer table back into the correct state:
ALTER TABLE problem_builder_answer MODIFY course_key VARCHAR(255) NOT NULL;

@mtyaka mtyaka force-pushed the mtyaka/copy-course-ids-management branch 2 times, most recently from 0ca3142 to 768c171 Compare February 21, 2017 14:10
batch_size = options['batch_size']
sleep_time = options['sleep']
queryset = Answer.objects.filter(course_key__isnull=True)
batch_count = (queryset.count() / batch_size) + 1
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This should do a floating point division with a ceiling operation instead - otherwise cases where the queryset length is divisible by the batch_size will result in one additional batch count than necessary. E.g.:

batch_count = int(math.ceil(queryset.count() / float(batch_size)))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bdero Thanks, done!

@bdero
Copy link

bdero commented Feb 23, 2017

This looks good @mtyaka - I just had one nit about the batch_count calculation, which you can choose to ignore if you want since I verified it won't cause an error either way. 👍

  • I tested this by creating a bunch of problem builder problems with this version checked out, moved all of the course_key fields to course_id + nulled out the course_key field, and repeated for several different combinations of --batch-size and --sleep.
  • I read through the code
  • I checked for accessibility issues N/A.
  • Includes documentation Not necessary. This is a one-time tool to help edX migrate problem builder. The doc string is informative enough.

Normally the value of deprecated Answer.course_id will be copied into
the new Answer.course_key field by a migration, but if your answer table
is very large, you might want to use this management command to perform
the copying in batches instead.
@mtyaka
Copy link
Member Author

mtyaka commented Feb 23, 2017

Rebased from 768c171

@mtyaka mtyaka force-pushed the mtyaka/copy-course-ids-management branch from 768c171 to 6b358f6 Compare February 23, 2017 06:57
batch_size = options['batch_size']
sleep_time = options['sleep']
queryset = Answer.objects.filter(course_key__isnull=True)
batch_count = int(math.ceil(queryset.count() / float(batch_size)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this actually do the equivalent of SELECT count(*)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, queryset.count() performs a SELECT count(*).

Copy link

@maxrothman maxrothman Feb 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's going to be a performance issue for large tables, isn't it? And large tables is how we got down this road in the first place.

@mtyaka @bdero FYI until this is fixed, edX will not be able to use this management command.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrothman This is only executed once at the beginning of the command to be able to report progress. I don't know how large the table in edX production is, but I was assuming the count query won't take more than a minute or two and that it wasn't going to be a problem since it's only executed once at the beginning of command. However I don't have much experience in this area so I may be underestimating the time the count query takes or I am missing something else.

We can remove the count query and keep fetching batches until the query returns no results - in that case we will not be able to report "Processed batch i of N", but can only print "Processed batch i".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mtyaka Can we just avoid counting how many batches there are, and run the batches one by one in a while loop until the batch happens to be empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bradenmacdonald Sure, I opened a PR to do that at #145, but I am away until March 5th so if this needs to land before then somebody else will have to take care of merging & releasing this.

@maxrothman
Copy link

Thanks! This looks good.

@mtyaka mtyaka merged commit 239a6f1 into master Feb 24, 2017
@mtyaka mtyaka deleted the mtyaka/copy-course-ids-management branch February 24, 2017 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants