-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Iahmad/migrate transcripts s3 #17605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Your PR has finished running tests. |
2 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
mushtaqak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did initial review. Things are looking good.
Miner comments on logging and single quotes across py file.
| $ ./manage.py cms migrate_transcripts --all-courses --settings=devstack_docker | ||
| $ ./manage.py cms migrate_transcripts 'edX/DemoX/Demo_Course' --settings=devstack_docker | ||
| """ | ||
| args = '<course_id course_id ...>' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not have a --commit argument that will actually execute the command. If --commit is not provided the command will draft run and would print out the expected transcripts migrated?
|
|
||
|
|
||
| VIDEO_DICT_STAR = dict( | ||
| client_video_id="TWINKLE TWINKLE", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer single quotes here
| data={'data': video_sample_xml} | ||
| ) | ||
|
|
||
| save_to_store(SRT_FILEDATA, "subs_grmtran1.srt", 'text/srt', self.video_descriptor.location) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
single quotes
| **kwargs | ||
| ) | ||
| except InvalidKeyError as exc: | ||
| raise CommandError(u'Invalid Course Key: ' + unicode(exc)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invalid course key:
| kwargs[key] = options[key] | ||
|
|
||
| try: | ||
| enqueue_async_migrate_transcripts_tasks( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not check if the course exists with this key ? e.g
if not modulestore().get_course(course_key):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the check in def async_migrate_transcript
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this check in def async_migrate_transcript
| revision=ModuleStoreEnum.RevisionOption.published_only, include_orphans=False): | ||
| all_videos.append(video) | ||
|
|
||
| for video in store.get_items(course_key, qualifiers={'category': 'video'}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be unneccessary if we do not provide revision argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The all revision option does not seem to work ...
cms/djangoapps/contentstore/tasks.py
Outdated
| return all_videos | ||
|
|
||
|
|
||
| def is_transcript_content_srt(transcript_content): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring is missing
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| srt_subs_obj = SubRipFile.from_string(transcript_content.data.decode('utf-8-sig')) | ||
| if len(srt_subs_obj) > 0: | ||
| LOGGER.info("SRT file format detected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log this?
cms/djangoapps/contentstore/tasks.py
Outdated
| transcript_name = args[2] | ||
| force_update = args[3] | ||
| result = None | ||
| LOGGER.info("Start migrating %s transcript", language_code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets use better text for this log as well
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| transcript_content = Transcript.asset(video.location, transcript_name, language_code) | ||
| if video.edx_video_id: | ||
| LOGGER.info("Found edx_video_id= %s via first fetch asset method", video.edx_video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add log prefix so as to make debugging easier
|
Your PR has finished running tests. |
201bf55 to
049ef73
Compare
|
Your PR has finished running tests. |
1 similar comment
|
Your PR has finished running tests. |
muhammad-ammar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@irfanuddinahmad I am done with first pass of the review. Let me know once feedback is addressed.
cms/djangoapps/contentstore/tasks.py
Outdated
| video = args[0] | ||
| language_code = args[1] | ||
| transcript_name = args[2] | ||
| force_update = args[3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do something like
video, language_code, transcript_name, force_update = args
cms/djangoapps/contentstore/tasks.py
Outdated
| )) | ||
| else: | ||
| LOGGER.info("video.sub is empty") | ||
| if any(other_lang_transcripts): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need if other_lang_transcripts:. for loop will not run if other_lang_transcripts is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code under if english_transcript: and for lang, name in other_lang_transcripts.items(): can be merged like below. As per my understanding we don't need separate handling for sub and transcripts in this case.
all_language_transcripts = dict({'en': video.sub}, **other_lang_transcripts)
for lang, name in all_language_transcriptss.items():
...
cms/djangoapps/contentstore/tasks.py
Outdated
| try: | ||
| transcript_content, transcript_name, Transcript_mime_type = get_transcript_from_contentstore( | ||
| video, language_code, Transcript.SJSON) | ||
| if video.edx_video_id: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can replace the whole if...else... with something like below
if not clean_video_id(video.edx_video_id):
video.edx_video_id = create_external_video('external-video')
video.save_with_metadata(user=User.objects.get(username='staff'))
result = push_to_s3(
video.edx_video_id,
language_code,
transcript_content,
Transcript.SJSON,
force_update
)
cms/djangoapps/contentstore/tasks.py
Outdated
|
|
||
|
|
||
| @task(base=PersistOnFailureTask) | ||
| def async_migrate_transcript(*args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about def async_migrate_transcript(course_key, **kwargs):?
cms/djangoapps/contentstore/tasks.py
Outdated
| kwargs = { | ||
| 'force_update': force_update, | ||
| 'commit': commit | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is bit off
| self.assertEqual(len(languages), 0) | ||
|
|
||
| # now call migrate_transcripts command and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should verify the output of command in this case to check if the correct message is returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command output is checked by the following test
def test_migrate_transcripts_logging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we should check to have been migrated transcripts here.
| save_to_store(SRT_FILEDATA, 'subs_grmtran1.srt', 'text/srt', self.video_descriptor.location) | ||
| save_to_store(CRO_SRT_FILEDATA, 'subs_croatian1.srt', 'text/srt', self.video_descriptor.location) | ||
|
|
||
| def test_migrated_transcripts_count(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add docstring to explain what we are testing/verifying in this test.
| languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id) | ||
| self.assertEqual(len(languages), 2) | ||
|
|
||
| def test_migrated_transcripts_without_commit(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command output is checked by the following test
def test_migrate_transcripts_logging
| """ | ||
| Test migrating transcripts | ||
| """ | ||
| translations = self.video_descriptor.available_translations(self.video_descriptor.get_transcripts_info()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to pass include_val_transcripts to get_transcripts_info and available_translations?
@Qubad786 what do you say?
| call_command('migrate_transcripts', unicode(self.course.id), '--force-update', '--commit') | ||
|
|
||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we do the above in a loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the command arguments change for subsequent calls ... a loop will decrease readability
|
@Qubad786 You also need to look into this PR. |
|
Your PR has finished running tests. |
3 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
a59ff2a to
e6d5216
Compare
|
Your PR has finished running tests. |
cms/djangoapps/contentstore/tasks.py
Outdated
| )) | ||
| LOGGER.info("[Transcript migration] process for video %s ended", video.location) | ||
| callback = task_status_callback.s() | ||
| status = chord(sub_tasks)(callback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nasthagiri I have used a celery task for each video inside a celery task for each course. Do you think this could create too many celery tasks, which could potentially swamp the task queue? The benefit of a 'task per video' could be a more granular failure tracking and retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@irfanuddinahmad That should be fine given our other recent usage of celery tasks:
- we create a celery task for each learner that we send an email to via ACE
- we create a celery task for each problem submission by a learner
However, you should let devops know so they are aware in case they see spikes in tasks at certain times. Also, the team will want to monitor this the first time you roll it out.
Things to consider:
-
Is there a time_limit placed on the task? If your task is calling the network, you'll want to timeout appropriately (and exponentially backoff) in case of a persistent network failure.
-
Are you using our celery-utils, which provides functionality for logging and persistent retries? You may find this useful if you want to persist failures (after celery's retries give up) and automatically retry them via a Jenkins job.
You can see some of this in action in the grades tasks code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nasthagiri Thanks. Very useful feedback (as usual! :) )
I am already using the PersistOnFailureTask ... let me try the
celeryutils.chordable_django_backend and LoggedPersistOnFailureTask also
mushtaqak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@irfanuddinahmad Done with second pass of the review. This is looking good.
| u'Without this flag, the command will return the transcripts discovered for migration ', | ||
| ) | ||
|
|
||
| def handle(self, *args, **options): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method docstring?
| raise CommandError('At least one course or --all-courses must be specified.') | ||
|
|
||
| kwargs = {} | ||
| for key in ('all_courses', 'force_update', 'commit'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be we can write this as:
kwargs = {key: options[key] for key in ['all_courses', 'force_update', 'commit'] if options.get(key)}
|
|
||
| 1 | ||
| 00:00:02,720 --> 00:00:05,430 | ||
| Ja, ich spreche Deutsch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets also add one more line so as to check unicode characters.
2
00:00:6,500 --> 00:00:08,600
可以用“我不太懂艺术 但我知道我喜欢什么”做比喻.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| client_video_id='TWINKLE TWINKLE', | ||
| duration=42.0, | ||
| edx_video_id='test_edx_video_id', | ||
| status="upload", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: single quotes
|
|
||
| def test_invalid_course(self): | ||
| with self.assertRaises(CommandError): | ||
| call_command('migrate_transcripts', "invalid-course") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 'invalid-course'
| self.assertEqual(len(languages), 0) | ||
|
|
||
| # now call migrate_transcripts command and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we should check to have been migrated transcripts here.
| languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id) | ||
| self.assertEqual(len(languages), 2) | ||
|
|
||
| def test_migrated_transcripts_without_commit(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing docstring
| # now call migrate_transcripts command again and check the transcript availability | ||
| call_command('migrate_transcripts', unicode(self.course.id), '--commit') | ||
|
|
||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also check the count before and after the command run.
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) | ||
|
|
||
| def test_migrate_transcripts_logging(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We haven't checked for any exception logs. We need to check them as well in this test or a separate test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| call_command('migrate_transcripts', "invalid-course") | ||
|
|
||
|
|
||
| class MigrateTranscripts(ModuleStoreTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MigrateTranscriptsTest
|
Your PR has finished running tests. |
5 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
8a5cf1f to
239b4ad
Compare
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
|
@nasthagiri Kindly advise regarding the trigger for our video transcript management command. Some options are discussed in the comments of the following issue: |
|
Your PR has finished running tests. |
1 similar comment
|
Your PR has finished running tests. |
|
@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number? |
|
"@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number?" |
|
@irfanuddinahmad ah. Thanks. I agree with Fred that there are valid reasons to not have Jenkins jobs be fully parameterized. To Backfill Grades, we created a management command for the following reasons:
We then created a Jenkins Job that would be manually triggered to run the management command. Following devOps practices, we configured the management command via Django Admin. Now, for video migrations, you can do the same thing we did for Grades: Or, you can go with Option #1 specified in the ticket: |
|
@nasthagiri Thanks. We will have a Jenkins job configured via Django Admin. |
|
Your PR has finished running tests. |
|
@irfanuddinahmad Ok. Please inform your devops contact person so s/he is aware of your path forward. |
|
Your PR has finished running tests. |
2 similar comments
|
Your PR has finished running tests. |
|
Your PR has finished running tests. |
| default=DEFAULT_ALL_COURSES, | ||
| help=u'Migrates transcripts to S3 for all courses.' | ||
| ) | ||
| parser.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are these settings?
| """ | ||
| force_update = BooleanField(default=False) | ||
| commit = BooleanField(default=False) | ||
| course_ids = TextField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need all_courses option here?
|
|
||
| force_update = BooleanField( | ||
| default=False, | ||
| help_text="Flag to update transcripts in django storage even if already present." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flag to force migrate transcripts for the requested courses, overwrite if already present.
| '--course-id', '--course_id', | ||
| dest='course_ids', | ||
| action='append', | ||
| help=u'Migrates transcripts for list of courses.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Migrates transcripts for the list of courses
| action='store_true', | ||
| default=DEFAULT_COMMIT, | ||
| help=u'Commits the discovered video transcripts to django storage. ' | ||
| u'Without this flag, the command will return the transcripts discovered for migration ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Remove extra space in transcripts discovered for migration probably add a dot 😃
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr')) | ||
| self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge')) | ||
|
|
||
| def test_migrate_transcripts_logging(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was missed.
cms/djangoapps/contentstore/tasks.py
Outdated
| def async_migrate_transcript_subtask(self, *args, **kwargs): | ||
| #pylint: disable=unused-argument | ||
| """ | ||
| Migrates a transcript of a given video in a course as a new celery task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is a bit off here
cms/djangoapps/contentstore/tasks.py
Outdated
| dict({'file_format': file_format}), | ||
| ContentFile(transcript_content) | ||
| ) | ||
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create_or_update method", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: LOGGER.info("[Transcript migration] save_transcript_to_storage: transcript video %s is %s over-written", '' if result else 'not', edx_video_id)
cms/djangoapps/contentstore/tasks.py
Outdated
| file_format, | ||
| ContentFile(transcript_content) | ||
| ) | ||
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need to change to save_transcript_to_storage
cms/djangoapps/contentstore/tasks.py
Outdated
| LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id) | ||
| return result | ||
| except ValCannotCreateError as err: | ||
| LOGGER.exception("[Transcript migration] Push_to_storage_failed: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need to change to save_transcript_to_storage
|
Your PR has finished running tests. |
cms/djangoapps/contentstore/tasks.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
above two variables are unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are from the branched off master code ...
cms/djangoapps/contentstore/tasks.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do clean_video_id only once like this.
edx_video_id = clean_video_id(video.edx_video_id)
if not edx_video_id:
video.edx_video_id = create_external_video('external-video')
video.save_with_metadata(user=User.objects.get(username='staff'))
if edx_video_id:
result = save_transcript_to_storage(
edx_video_id,
language_code,
transcript_content,
Transcript.SJSON,
force_update
)And we need to remove clean_video_id from save_transcript_to_storage as we not making any decision on edx_video_id in it.
mushtaqak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just one outstanding question remaining though about all_courses option in model.
After that's addressed. Let's rebase and merge. Good work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we not have all_courses option here?
|
Your PR has finished running tests. |
muhammad-ammar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
7bad68d to
fabdb1d
Compare
|
Your PR has finished running tests. |
fabdb1d to
3e426ac
Compare
|
Your PR has finished running tests. |
|
EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production on Friday, April 13, 2018. |
|
EdX Release Notice: This PR has been deployed to the production environment. |
|
EdX Release Notice: This PR has been rolled back from the production environment. |
Implementation of the design at:
https://openedx.atlassian.net/wiki/spaces/EDUCATOR/pages/162172272/Migration+from+contentstore+to+s3