@irfanuddinahmad
Contributor

@edx-status-bot
Your PR has finished running tests.

Contributor

@mushtaqak left a comment

Did initial review. Things are looking good.

Minor comments on logging and single quotes across the py file.

$ ./manage.py cms migrate_transcripts --all-courses --settings=devstack_docker
$ ./manage.py cms migrate_transcripts 'edX/DemoX/Demo_Course' --settings=devstack_docker
"""
args = '<course_id course_id ...>'
Contributor

Should we not have a --commit argument that actually executes the command? If --commit is not provided, the command would do a dry run and print out the transcripts it expects to migrate.
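The dry-run idea in this comment can be sketched with plain argparse. This is only an illustration under assumptions: the helper names build_parser and plan are hypothetical and are not part of the PR, and the real command is a Django management command rather than a standalone parser.

```python
import argparse


def build_parser():
    """Build a parser where a dry run is the default and --commit opts in to writes."""
    parser = argparse.ArgumentParser(prog='migrate_transcripts')
    parser.add_argument('course_ids', nargs='*', help='Course keys to process.')
    parser.add_argument(
        '--commit',
        action='store_true',
        default=False,
        help='Actually migrate; without this flag only report what would be migrated.',
    )
    return parser


def plan(argv):
    """Return 'commit' when --commit was passed, otherwise 'dry-run' (hypothetical helper)."""
    options = build_parser().parse_args(argv)
    return 'commit' if options.commit else 'dry-run'
```

With this shape, running the command without the flag is always safe, and the destructive path has to be requested explicitly.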



VIDEO_DICT_STAR = dict(
client_video_id="TWINKLE TWINKLE",
Contributor

I would prefer single quotes here

data={'data': video_sample_xml}
)

save_to_store(SRT_FILEDATA, "subs_grmtran1.srt", 'text/srt', self.video_descriptor.location)
Contributor

single quotes

**kwargs
)
except InvalidKeyError as exc:
raise CommandError(u'Invalid Course Key: ' + unicode(exc))
Contributor

Invalid course key:

kwargs[key] = options[key]

try:
enqueue_async_migrate_transcripts_tasks(
Contributor

Should we not check whether a course exists for this key? E.g.:
if not modulestore().get_course(course_key):

Contributor Author

Added the check in def async_migrate_transcript

Contributor Author

Added this check in def async_migrate_transcript

revision=ModuleStoreEnum.RevisionOption.published_only, include_orphans=False):
all_videos.append(video)

for video in store.get_items(course_key, qualifiers={'category': 'video'},
Contributor

This will be unnecessary if we do not provide the revision argument.

Contributor Author

The all revision option does not seem to work ...

return all_videos


def is_transcript_content_srt(transcript_content):
Contributor

docstring is missing

try:
    srt_subs_obj = SubRipFile.from_string(transcript_content.data.decode('utf-8-sig'))
    if len(srt_subs_obj) > 0:
        LOGGER.info("SRT file format detected")
Contributor

Should we log this?

transcript_name = args[2]
force_update = args[3]
result = None
LOGGER.info("Start migrating %s transcript", language_code)
Contributor

Let's use better text for this log as well.

try:
transcript_content = Transcript.asset(video.location, transcript_name, language_code)
if video.edx_video_id:
LOGGER.info("Found edx_video_id= %s via first fetch asset method", video.edx_video_id)
Contributor

Add log prefix so as to make debugging easier

@irfanuddinahmad irfanuddinahmad force-pushed the iahmad/migrate_transcripts_S3 branch from 201bf55 to 049ef73 Compare March 9, 2018 14:12

Contributor

@muhammad-ammar left a comment

@irfanuddinahmad I am done with first pass of the review. Let me know once feedback is addressed.

video = args[0]
language_code = args[1]
transcript_name = args[2]
force_update = args[3]
Contributor

we can do something like

video, language_code, transcript_name, force_update = args

))
else:
LOGGER.info("video.sub is empty")
if any(other_lang_transcripts):
Contributor

I think we don't need the if other_lang_transcripts: check. The for loop will not run if other_lang_transcripts is empty.

Contributor

I think the code under if english_transcript: and for lang, name in other_lang_transcripts.items(): can be merged as shown below. As I understand it, we don't need separate handling for sub and transcripts in this case.

all_language_transcripts = dict({'en': video.sub}, **other_lang_transcripts)
for lang, name in all_language_transcripts.items():
...
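A minimal sketch of the merge suggested above, written for Python 3. The function name collect_transcripts is hypothetical, and filtering out empty names is an assumption added here so that an empty video.sub does not produce an 'en' entry; the PR may handle that case differently.

```python
def collect_transcripts(sub, other_lang_transcripts):
    """Merge the English sub with other-language transcripts into one mapping.

    Entries with an empty file name are dropped (an assumption of this sketch).
    If other_lang_transcripts already contains 'en', it overrides sub.
    """
    merged = dict({'en': sub}, **other_lang_transcripts)
    return {lang: name for lang, name in merged.items() if name}


def languages_to_migrate(sub, other_lang_transcripts):
    """Iterate the merged mapping; the loop body simply does not run when it is empty."""
    visited = []
    for lang, _name in collect_transcripts(sub, other_lang_transcripts).items():
        visited.append(lang)
    return sorted(visited)
```

Because iterating an empty dict performs zero iterations, no separate emptiness guard is needed before the loop.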

try:
transcript_content, transcript_name, Transcript_mime_type = get_transcript_from_contentstore(
video, language_code, Transcript.SJSON)
if video.edx_video_id:
Contributor

I think we can replace the whole if...else... with something like below

if not clean_video_id(video.edx_video_id):
    video.edx_video_id = create_external_video('external-video')
    video.save_with_metadata(user=User.objects.get(username='staff'))

result = push_to_s3(
    video.edx_video_id,
    language_code,
    transcript_content,
    Transcript.SJSON,
    force_update
)



@task(base=PersistOnFailureTask)
def async_migrate_transcript(*args, **kwargs):
Contributor

what about def async_migrate_transcript(course_key, **kwargs):?

kwargs = {
'force_update': force_update,
'commit': commit
}
Contributor

Indentation is a bit off.

self.assertEqual(len(languages), 0)

# now call migrate_transcripts command and check the transcript availability
call_command('migrate_transcripts', unicode(self.course.id))
Contributor

I think we should verify the output of command in this case to check if the correct message is returned.

Contributor Author

The command output is checked by the following test
def test_migrate_transcripts_logging

Contributor

Yes, we should check here that the transcripts have been migrated.

save_to_store(SRT_FILEDATA, 'subs_grmtran1.srt', 'text/srt', self.video_descriptor.location)
save_to_store(CRO_SRT_FILEDATA, 'subs_croatian1.srt', 'text/srt', self.video_descriptor.location)

def test_migrated_transcripts_count(self):
Contributor

please add docstring to explain what we are testing/verifying in this test.

languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id)
self.assertEqual(len(languages), 2)

def test_migrated_transcripts_without_commit(self):
Contributor

same comment as above.

Contributor Author

The command output is checked by the following test
def test_migrate_transcripts_logging

"""
Test migrating transcripts
"""
translations = self.video_descriptor.available_translations(self.video_descriptor.get_transcripts_info())
Contributor

Don't we need to pass include_val_transcripts to get_transcripts_info and available_translations?

@Qubad786 what do you say?

call_command('migrate_transcripts', unicode(self.course.id), '--force-update', '--commit')

self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr'))
self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge'))
Contributor

can we do the above in a loop?

Contributor Author

the command arguments change for subsequent calls ... a loop will decrease readability

@muhammad-ammar
Contributor

@Qubad786 You also need to look into this PR.

@irfanuddinahmad irfanuddinahmad force-pushed the iahmad/migrate_transcripts_S3 branch from a59ff2a to e6d5216 Compare March 19, 2018 09:32

))
LOGGER.info("[Transcript migration] process for video %s ended", video.location)
callback = task_status_callback.s()
status = chord(sub_tasks)(callback)
Contributor Author

@irfanuddinahmad Mar 21, 2018

@nasthagiri I have used a celery task for each video inside a celery task for each course. Do you think this could create too many celery tasks, which could potentially swamp the task queue? The benefit of a 'task per video' could be a more granular failure tracking and retry.

Contributor

@nasthagiri Mar 21, 2018

@irfanuddinahmad That should be fine given our other recent usage of celery tasks:

  • we create a celery task for each learner that we send an email to via ACE
  • we create a celery task for each problem submission by a learner

However, you should let devops know so they are aware in case they see spikes in tasks at certain times. Also, the team will want to monitor this the first time you roll it out.

Things to consider:

  • Is there a time_limit placed on the task? If your task calls the network, you'll want to time out appropriately (and back off exponentially) in case of a persistent network failure.

  • Are you using our celery-utils, which provides functionality for logging and persistent retries? You may find this useful if you want to persist failures (after celery's retries give up) and automatically retry them via a Jenkins job.

You can see some of this in action in the grades tasks code.
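To make the "back off exponentially" advice concrete, here is a small pure-Python sketch of a capped retry schedule. The function name backoff_delays and its parameters are hypothetical; this does not use celery or celery-utils, and the resulting delays would still need to be wired into whatever retry mechanism the task uses.

```python
def backoff_delays(base_seconds, max_retries, cap_seconds):
    """Return the wait before each retry: base * 2**attempt, capped at cap_seconds."""
    return [
        min(cap_seconds, base_seconds * 2 ** attempt)
        for attempt in range(max_retries)
    ]
```

For example, backoff_delays(1, 5, 10) gives waits of 1, 2, 4, 8 and 10 seconds, so a persistent network failure costs a bounded total wait instead of hammering the remote service.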

Contributor Author

@nasthagiri Thanks. Very useful feedback (as usual! :) )
I am already using the PersistOnFailureTask. Let me also try celeryutils.chordable_django_backend and LoggedPersistOnFailureTask.

Contributor

@mushtaqak left a comment

@irfanuddinahmad Done with second pass of the review. This is looking good.

u'Without this flag, the command will return the transcripts discovered for migration ',
)

def handle(self, *args, **options):
Contributor

Method docstring?

raise CommandError('At least one course or --all-courses must be specified.')

kwargs = {}
for key in ('all_courses', 'force_update', 'commit'):
Contributor

Maybe we can write this as:

kwargs = {key: options[key] for key in ['all_courses', 'force_update', 'commit'] if options.get(key)}
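One thing worth noting about the comprehension above is that the options.get(key) filter drops keys whose value is falsy, so a flag that is present but False never reaches kwargs. The sketch below shows that behavior in isolation; select_options is a hypothetical helper name, and the options dict is made-up sample data.

```python
def select_options(options, keys=('all_courses', 'force_update', 'commit')):
    """Copy only the listed keys whose values are truthy (mirrors the suggested comprehension)."""
    return {key: options[key] for key in keys if options.get(key)}
```

If the downstream task needs to distinguish "flag explicitly False" from "flag absent", the truthiness filter would have to be replaced with a membership test such as key in options.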


1
00:00:02,720 --> 00:00:05,430
Ja, ich spreche Deutsch
Contributor

Let's also add one more cue so as to check Unicode characters.

2
00:00:06,500 --> 00:00:08,600
可以用“我不太懂艺术 但我知道我喜欢什么”做比喻.
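The quoted diff context decodes transcript bytes with 'utf-8-sig', so a fixture with non-ASCII text is a reasonable way to exercise that path. The following stdlib-only sketch checks that such a cue survives an encode/decode round trip; SRT_TEXT and roundtrip are names invented for this illustration and are not the PR's test fixtures.

```python
# Sample SRT content with one ASCII cue and one cue containing Chinese text.
SRT_TEXT = (
    u'1\n'
    u'00:00:02,720 --> 00:00:05,430\n'
    u'Ja, ich spreche Deutsch\n'
    u'\n'
    u'2\n'
    u'00:00:06,500 --> 00:00:08,600\n'
    u'可以用“我不太懂艺术 但我知道我喜欢什么”做比喻.\n'
)


def roundtrip(text):
    """Encode with a UTF-8 BOM, then decode with 'utf-8-sig', which strips the BOM."""
    raw = text.encode('utf-8-sig')
    return raw, raw.decode('utf-8-sig')
```

Decoding with plain 'utf-8' instead would leave a leading U+FEFF character in the text, which is the kind of issue a Unicode fixture helps catch.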

Contributor Author

Done

client_video_id='TWINKLE TWINKLE',
duration=42.0,
edx_video_id='test_edx_video_id',
status="upload",
Contributor

nit: single quotes


def test_invalid_course(self):
    with self.assertRaises(CommandError):
        call_command('migrate_transcripts', "invalid-course")
Contributor

nit: 'invalid-course'

languages = api.get_available_transcript_languages(self.video_descriptor.edx_video_id)
self.assertEqual(len(languages), 2)

def test_migrated_transcripts_without_commit(self):
Contributor

Missing docstring

# now call migrate_transcripts command again and check the transcript availability
call_command('migrate_transcripts', unicode(self.course.id), '--commit')

self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr'))
Contributor

we should also check the count before and after the command run.

self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'hr'))
self.assertTrue(api.is_transcript_available(self.video_descriptor.edx_video_id, 'ge'))

def test_migrate_transcripts_logging(self):
Contributor

We haven't checked for any exception logs. We need to check them as well in this test or a separate test

Contributor

I think this was missed.

Contributor Author

Done

call_command('migrate_transcripts', "invalid-course")


class MigrateTranscripts(ModuleStoreTestCase):
Contributor

MigrateTranscriptsTest

@irfanuddinahmad irfanuddinahmad force-pushed the iahmad/migrate_transcripts_S3 branch from 8a5cf1f to 239b4ad Compare March 27, 2018 14:32

@irfanuddinahmad
Contributor Author

@nasthagiri Kindly advise regarding the trigger for our video transcript management command. Some options are discussed in the comments of the following issue:
https://openedx.atlassian.net/browse/ENT-641
Devops seem to have a certain preference for Django admin UI. However, Jenkins and DSL could be another option.

@nasthagiri
Contributor

@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number?

@irfanuddinahmad
Contributor Author

"@irfanuddinahmad https://openedx.atlassian.net/browse/ENT-641 takes me to a SuccessFactors ticket. Did you intend another ticket number?"
@nasthagiri Kindly have a look at Fred's comments. I mentioned this ticket since they also had a management command that was to be run on Production and Stage.

@nasthagiri
Contributor

@irfanuddinahmad ah. Thanks.

I agree with Fred that there are valid reasons to not have Jenkins jobs be fully parameterized. To Backfill Grades, we created a management command for the following reasons:

  1. We wanted the Open edX community to also have the ability to backfill their own grades.
  2. We needed this support (to recompute all grades in a course) even in the long-run (post-migration).

We then created a Jenkins Job that would be manually triggered to run the management command.

Following devOps practices, we configured the management command via Django Admin.

Now, for video migrations, you can do the same thing we did for Grades:
Have a Jenkins job, but configured via Django Admin.

Or, you can go with Option #1 specified in the ticket:
Create a new Django Admin action-button, also configured via Django Admin.

@irfanuddinahmad
Contributor Author

@nasthagiri Thanks. We will have a Jenkins job configured via Django Admin.

@nasthagiri
Contributor

@irfanuddinahmad Ok. Please inform your devops contact person so s/he is aware of your path forward.

default=DEFAULT_ALL_COURSES,
help=u'Migrates transcripts to S3 for all courses.'
)
parser.add_argument(
Contributor

What are these settings?

"""
force_update = BooleanField(default=False)
commit = BooleanField(default=False)
course_ids = TextField(
Contributor

Don't we need all_courses option here?


force_update = BooleanField(
default=False,
help_text="Flag to update transcripts in django storage even if already present."
Contributor

Flag to force migrate transcripts for the requested courses, overwrite if already present.

'--course-id', '--course_id',
dest='course_ids',
action='append',
help=u'Migrates transcripts for list of courses.'
Contributor

nit: Migrates transcripts for the list of courses

action='store_true',
default=DEFAULT_COMMIT,
help=u'Commits the discovered video transcripts to django storage. '
u'Without this flag, the command will return the transcripts discovered for migration '
Contributor

nit: Remove the extra space after transcripts discovered for migration, and probably add a dot 😃

def async_migrate_transcript_subtask(self, *args, **kwargs):
#pylint: disable=unused-argument
"""
Migrates a transcript of a given video in a course as a new celery task.
Contributor

indentation is a bit off here

dict({'file_format': file_format}),
ContentFile(transcript_content)
)
LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create_or_update method",
Contributor

nit: LOGGER.info("[Transcript migration] save_transcript_to_storage: transcript video %s is %s over-written", '' if result else 'not', edx_video_id)

file_format,
ContentFile(transcript_content)
)
LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id)
Contributor

might need to change to save_transcript_to_storage

LOGGER.info("[Transcript migration] Push_to_storage %s for %s with create method", result, edx_video_id)
return result
except ValCannotCreateError as err:
LOGGER.exception("[Transcript migration] Push_to_storage_failed: %s", err)
Contributor

might need to change to save_transcript_to_storage

Contributor

above two variables are unused?

Contributor Author

these are from the branched off master code ...

Contributor

I think we can do clean_video_id only once like this.

edx_video_id = clean_video_id(video.edx_video_id)

if not edx_video_id:
    video.edx_video_id = create_external_video('external-video')
    video.save_with_metadata(user=User.objects.get(username='staff'))

if edx_video_id:
    result = save_transcript_to_storage(
        edx_video_id,
        language_code,
        transcript_content,
        Transcript.SJSON,
        force_update
    )

And we need to remove clean_video_id from save_transcript_to_storage, as we are not making any decision on edx_video_id in it.
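A sketch of the "clean once, then reuse" flow, using stand-ins rather than the real edx-platform helpers. Everything here is hypothetical: resolve_video_id is an invented name, the strip-based cleaning is only an assumed approximation of clean_video_id, and create_external_video is passed in as a plain callable. Note that the local variable is reassigned when a new id is created, so the later save step sees the fresh id.

```python
def resolve_video_id(raw_id, create_external_video):
    """Return a usable video id, creating an external one only when the cleaned id is empty.

    raw_id: the id stored on the video (may be None or whitespace).
    create_external_video: callable taking a display name and returning a new id
    (a stand-in for the real helper; its actual signature is not confirmed here).
    """
    # Stand-in for clean_video_id: treat None and whitespace-only values as empty.
    edx_video_id = (raw_id or '').strip()
    if not edx_video_id:
        # Reassign the local so callers get the newly created id.
        edx_video_id = create_external_video('external-video')
    return edx_video_id
```

Returning the resolved id from one place means the storage function can trust its argument and does not need to clean it again.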

Contributor

@mushtaqak left a comment

Looks good to me; just one outstanding question remains, about the all_courses option in the model.

After that's addressed, let's rebase and merge. Good work 👍

Contributor

Should we not have all_courses option here?

Contributor

@muhammad-ammar left a comment

👍

@irfanuddinahmad irfanuddinahmad force-pushed the iahmad/migrate_transcripts_S3 branch from 7bad68d to fabdb1d Compare April 12, 2018 11:05
@irfanuddinahmad irfanuddinahmad force-pushed the iahmad/migrate_transcripts_S3 branch from fabdb1d to 3e426ac Compare April 12, 2018 11:59
@irfanuddinahmad irfanuddinahmad merged commit e424e04 into master Apr 12, 2018
@irfanuddinahmad irfanuddinahmad deleted the iahmad/migrate_transcripts_S3 branch April 12, 2018 12:46
@edx-pipeline-bot
Contributor

EdX Release Notice: This PR has been deployed to the staging environment in preparation for a release to production on Friday, April 13, 2018.

@edx-pipeline-bot
Contributor

EdX Release Notice: This PR has been deployed to the production environment.

@edx-pipeline-bot
Contributor

EdX Release Notice: This PR has been rolled back from the production environment.
