@janimo janimo commented Apr 19, 2021

This will look for .srt files downloaded by youtube-dl along with the videos and will upload them after each video is uploaded.
I have only tested this with a devstack, checking that the transcript_uploads/ endpoint is called successfully.

I've also added tests for argument parsing to increase CI coverage and have the checks pass.
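
A minimal sketch of the lookup described above, assuming the transcripts sit next to the video and are named `<video stem>.<language>.srt`, which is the pattern youtube-dl typically produces. The function name `find_transcripts` is hypothetical and is not taken from this pull request.

```python
from pathlib import Path


def find_transcripts(video_path):
    """Return {language_code: Path} for .srt files that sit next to the video.

    Assumes transcript names of the form "<video stem>.<language>.srt".
    """
    video_path = Path(video_path)
    prefix = video_path.stem + "."
    transcripts = {}
    for candidate in sorted(video_path.parent.glob("*.srt")):
        if not candidate.name.startswith(prefix):
            continue
        # Whatever sits between the video stem and ".srt" is taken as the language code.
        language = candidate.name[len(prefix):-len(".srt")]
        # Skip empty codes and names that more likely belong to another video.
        if language and "." not in language:
            transcripts[language] = candidate
    return transcripts
```

Matching on the video stem keeps the association simple: each transcript can be uploaded right after its video, with the language code read from the file name.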

@openedx-webhooks

openedx-webhooks commented Apr 19, 2021

Thanks for the pull request, @janimo! I've created BLENDED-828 to keep track of it in Jira. More details are on the BD-27 project page.

When this pull request is ready, tag your edX technical lead.

@openedx-webhooks openedx-webhooks added the labels "blended" (PR is managed through 2U's blended development program) and "waiting on author" (PR author needs to resolve review requests, answer questions, fix tests, etc.) Apr 19, 2021
@janimo janimo force-pushed the jani/video-upload-transcripts branch 5 times, most recently from 181bf45 to b9a68ea Compare May 17, 2021 12:56
@janimo janimo changed the title from "[BD-27] WIP Upload downloaded transcripts" to "[BD-27] Upload transcripts downloaded by the video_download script" May 17, 2021
@openedx-webhooks openedx-webhooks added the label "needs triage" and removed the label "waiting on author" (PR author needs to resolve review requests, answer questions, fix tests, etc.) May 17, 2021
@alangsto

@janimo this looks good, but I have a few questions.

  1. Before merging, it would be great if this could be tested on stage/edge.
  2. Could we change the file names of the test fixtures? I think they too closely resemble actual video names from a course.

@janimo

janimo commented May 21, 2021

  1. @alangsto it would be great if I could test on either stage or edge, but I'm afraid I do not have an account there. Would I need to go through the process described here to get API credentials?

https://course-catalog-api-guide.readthedocs.io/en/latest/authentication/#getting-a-client-id-and-client-secret

  2. Do you mean the names of the .srt files added under tests/fixtures_data/video_files/01___Intro_to_Knowledge_Based_AI/ ? If so, I have used names that mirror the video file name but with a different extension, since this is how I saw youtube-dl save transcripts for videos, and it makes associating them relatively straightforward.

@alangsto

@janimo

Yes, you would need to get the API credentials to test there. My understanding was that this was being tested this week, potentially by @kaizoku. Is that correct?

And I see, then the .srt filenames are fine.

@janimo

janimo commented May 21, 2021

@alangsto ok, @kaizoku has an account, so probably we can test using API credentials associated with that.

@kaizoku

kaizoku commented May 22, 2021

@alangsto, I was attempting to test on my devstack first, but underestimated the complexity of setting up the video pipeline.

However, I get a 404 on both edge and staging for the API admin page: https://edge.edx.org/api-admin/
Is there somewhere else to generate the API credentials for edge and staging?

@alangsto

@kaizoku, no worries about testing on edge or stage then, we can take that on as part of our testing of this tool.

@janimo one last request: could you please update the documentation for the video + transcript tooling in this file (https://github.com/edx/cc2olx/blob/master/src/cc2olx/tools/README.rst)? Thanks!

@janimo

janimo commented May 24, 2021

@alangsto I've updated the README to mention transcripts. One thing I am not sure about is what other changes (if any) are needed in the cc2olx tool to fully support transcripts. The documentation I used at https://edx.readthedocs.io/projects/edx-open-learning-xml/en/latest/components/video-components.html doesn't explain much about transcripts in video blocks, and I did not find online examples of video blocks with transcript support enabled. While uploading a transcript will associate the video with it via the edx id tag, I am still not sure whether the video xblock should explicitly list the transcripts itself.

@alangsto

@janimo have you tried adding a fake transcript to a video component in Studio and looking at the resulting OLX from the course export? That would be my first suggestion for understanding how xblocks and transcripts work together. I am not familiar with this myself, but had thought it would be included in the work for this ticket.

@kaizoku

kaizoku commented May 25, 2021

@janimo, yes, video blocks do need to specify the transcripts associated with the video.
I exported the example video block that is created when adding a new video block to a course, and this is the resulting OLX:

<video url_name="a8f5779265854276b67407e6c24d544f" sub="" transcripts="{&quot;en&quot;: &quot;ac0e4345-102e-4990-9b80-97acf49bf678-en.srt&quot;}" display_name="Video" edx_video_id="ac0e4345-102e-4990-9b80-97acf49bf678" html5_sources="[]" youtube_id_1_0="3_yD_cEKoCk">
  <video_asset client_video_id="external video" duration="0.0" image="">
    <transcripts>
      <transcript language_code="en" file_format="srt" provider="Custom"/>
    </transcripts>
  </video_asset>
  <transcript language="en" src="ac0e4345-102e-4990-9b80-97acf49bf678-en.srt"/>
</video>

We'll need to add OLX generation to link the transcripts to the associated video here.
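
As a rough illustration of that OLX generation, the sketch below builds a `<video>` element with a `transcripts` attribute and one `<transcript>` child per language, following the exported example above. The function name `build_video_element` is hypothetical, the file naming `<edx_video_id>-<language>.srt` is copied from the export, and the `url_name` attribute and `<video_asset>` child are left out. Whether those are required is not established in this thread.

```python
import json
import xml.etree.ElementTree as ET


def build_video_element(edx_video_id, youtube_id, language_codes):
    """Build a <video> element that lists one .srt transcript per language."""
    # Mirror the naming seen in the export: "<edx_video_id>-<language>.srt".
    files = {lang: f"{edx_video_id}-{lang}.srt" for lang in sorted(language_codes)}
    video = ET.Element(
        "video",
        {
            "display_name": "Video",
            "edx_video_id": edx_video_id,
            "youtube_id_1_0": youtube_id,
            # The attribute holds a JSON object; ElementTree escapes the quotes on output.
            "transcripts": json.dumps(files),
        },
    )
    for lang, filename in files.items():
        ET.SubElement(video, "transcript", {"language": lang, "src": filename})
    return video


element = build_video_element("ac0e4345-102e-4990-9b80-97acf49bf678", "3_yD_cEKoCk", ["en"])
print(ET.tostring(element, encoding="unicode"))
```

Serializing the mapping with `json.dumps` and letting the XML library handle escaping avoids hand-writing the `&quot;` entities seen in the exported block.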

Jani Monoses added 5 commits May 27, 2021 09:22
This adds a new column to the output CSV that contains dash-separated,
alphabetically ordered language codes. This is used by cc2olx to
add transcript tags in the generated video block.
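
The column format described in this commit message can be sketched as follows. The helper names `language_column` and `parse_language_column` are hypothetical, and the sketch assumes simple codes such as `en` or `fr`.

```python
def language_column(language_codes):
    """Join unique language codes alphabetically with dashes, e.g. "en-fr"."""
    return "-".join(sorted(set(language_codes)))


def parse_language_column(value):
    """Split the column back into a list; an empty cell means no transcripts."""
    return value.split("-") if value else []


print(language_column(["fr", "en"]))
print(parse_language_column("en-fr"))
```

One caveat with this encoding: a code that itself contains a dash (for example `zh-CN`) cannot be split back unambiguously, so a consumer would need a different separator or a fixed list of known codes in that case.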
@janimo janimo force-pushed the jani/video-upload-transcripts branch 2 times, most recently from 074b2a0 to 87f65ee Compare May 27, 2021 08:48
@janimo

janimo commented May 27, 2021

I've added a related pull request (#66) that addresses the cc2olx part, for use once the video upload script generates a CSV with language codes.

@kaizoku

kaizoku commented May 31, 2021

@janimo, I set up the video pipeline in my devstack, and it looks like the transcript upload endpoint doesn't accept the API credentials, as it uses the basic django auth_required rather than the API auth classes used by the video upload handler.

It looks like we'll need to find another way to upload transcripts programmatically. Did you find any other endpoints while working on this?

@janimo

janimo commented Jun 1, 2021

@kaizoku I did not find another endpoint, but TBH I did not look any further after I found this one.

@alangsto

alangsto commented Jun 1, 2021

@kaizoku @janimo do you need further help on the development of this? I was hoping to test the tool this week, but can set aside time to assist you both if needed. Please let me know.

-----------------

The video upload tool uploads video files to edX's video encoding pipeline.
The video upload tool uploads video files and associated transcripts to edX's video encoding pipeline.

@kaizoku can we also include documentation for the video download tool? It can be added to this file, but maybe under another section

Yes, I added a section below describing the options to the video download tool, its expected input and output, and how it can be used with the video upload tool. Does this extra section look good?

That looks great, thanks for adding!

Arguments:
* filename: the transcript filename
* edx_video_id: the video ID of the video this transcript is for
* language_code: the language of the transcript

Please add access token to docstring args

Thanks for catching that. It's added now.

mocker.patch("cc2olx.tools.video_upload.open")
mocker.patch("cc2olx.tools.video_upload.requests.post", side_effect=upload_transcript_side_effect)

response = upload_transcript("filename", "edxid", "en")

Missing access_token arg
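
A self-contained sketch of what a test that passes the token might look like. Everything here is an assumption for illustration: the stand-in `upload_transcript` below is not the function from this pull request, and its argument order, the injected `post` callable, the `JWT` header scheme, and the placeholder URL are all hypothetical.

```python
from unittest import mock


def upload_transcript(filename, edx_video_id, language_code, access_token, post):
    """Stand-in for the function under test; `post` takes the place of requests.post."""
    headers = {"Authorization": f"JWT {access_token}"}  # header scheme is an assumption
    data = {"edx_video_id": edx_video_id, "language_code": language_code}
    # Placeholder URL, not a real endpoint.
    return post("https://studio.example.com/upload", headers=headers, data=data, files={"file": filename})


# The mock records the call so the test can check that the token was forwarded.
fake_post = mock.Mock(return_value=mock.Mock(status_code=201))
response = upload_transcript("filename", "edxid", "en", "token", post=fake_post)

assert response.status_code == 201
_, kwargs = fake_post.call_args
assert kwargs["headers"]["Authorization"] == "JWT token"
```

Asserting on the recorded call arguments, rather than only on the return value, is what would catch a missing or dropped access token.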

The ``course-id`` argument is the ID of the course as it appears in Studio. For example, ``course-v1:edX+111222+111222``.

The ``directory`` argument is a directory containing video files that will be uploaded to edX's video encoding pipeline.
The ``directory`` argument is a directory (such as the one created by the video download tool) containing video files and transcripts, that will be uploaded to edX's video encoding pipeline.

How should this directory be structured? Will all of the video files and transcripts be within the same directory?

I added some clarification about its file-searching behavior in this section. Does that help on this point?

@alangsto

@kaizoku this looks good, I can merge once you fix the formatting

@kaizoku

kaizoku commented Jun 12, 2021

Ah sorry for the formatting @alangsto. I just noticed make test doesn't run the format checks.

I've fixed the formatting issue here.

@alangsto alangsto merged commit 7567270 into openedx:master Jun 14, 2021