
fix: Upgrade to youtube-transcript-api v1.0+ #10456

Merged
Cristhianzl merged 6 commits into main from cz/fix-youtube-transcript-version on Nov 5, 2025

Conversation

Cristhianzl (Member) commented Oct 30, 2025

This pull request updates the YouTube transcript analysis component to support the new youtube-transcript-api v1.0+ and refactors both the implementation and its tests for compatibility and robustness. The most important changes are grouped below:

Component and Dependency Updates

  • Updated the youtube-transcript-api dependency in pyproject.toml to require version >=1.0.0,<2.0.0, ensuring compatibility with the new API.
  • Refactored the YouTubeTranscriptsComponent in Youtube Analysis.json to use the new API, including extracting video IDs from URLs, handling transcript fetching and translation, and grouping transcript segments into time-based chunks. The implementation now robustly handles errors and supports both object and dict transcript formats.
  • Updated the starter project metadata to reflect new dependency versions and removed unused dependencies (langchain_community). [1] [2]
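The video ID extraction step mentioned above can be sketched as follows. This is a minimal illustration only: the function name `extract_video_id` and the regex patterns are assumptions, since the component's actual `_extract_video_id()` body is not shown on this page.

```python
import re

# Hypothetical patterns covering common YouTube URL shapes.
# The real component may accept more (or fewer) formats.
_VIDEO_ID_PATTERNS = (
    r"youtube\.com/watch\?(?:.*&)?v=([A-Za-z0-9_-]{11})",
    r"youtu\.be/([A-Za-z0-9_-]{11})",
    r"youtube\.com/(?:embed|shorts)/([A-Za-z0-9_-]{11})",
)


def extract_video_id(url: str) -> str:
    """Return the 11-character video ID, or raise ValueError for unknown URLs."""
    for pattern in _VIDEO_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    msg = f"Could not extract a video ID from URL: {url}"
    raise ValueError(msg)
```

Raising `ValueError` for unrecognized URLs keeps the "Invalid URL" path separate from transcript retrieval errors, which is one plausible way to get the behavior the description calls robust.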

Testing Improvements

  • Added a FetchedTranscriptSnippetMock class to simulate transcript snippets using the new API format in unit tests.
  • Updated test fixtures in test_youtube_transcript_component.py to use the new mock class and added a mock for transcript list objects, improving test coverage for the updated component logic.
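A snippet mock along these lines would mirror the v1.x format, where each snippet exposes `text`, `start`, and `duration` as attributes. This is a sketch under that assumption: the PR's real `FetchedTranscriptSnippetMock` is not shown here, and `make_snippets` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FetchedTranscriptSnippetMock:
    """Test double for a v1.x transcript snippet (attribute access, not dict keys)."""

    text: str
    start: float
    duration: float


def make_snippets(rows):
    """Build mock snippets from (text, start, duration) tuples. Hypothetical helper."""
    return [
        FetchedTranscriptSnippetMock(text=text, start=start, duration=duration)
        for text, start, duration in rows
    ]
```

A fixture can then return `make_snippets([("Hello", 0.0, 1.5), ("world", 1.5, 2.0)])` wherever the component expects fetched transcript data.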
REC-20251030173612.mp4

coderabbitai Bot (Contributor) commented Oct 30, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Updated youtube-transcript-api dependency to version >=1.0.0,<2.0.0 and refactored the YouTubeTranscriptsComponent to use the new API with video ID extraction from URLs, transcript chunking logic, improved error handling, and comprehensive test coverage.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency Updates: pyproject.toml | Updated youtube-transcript-api from exact version 0.6.3 to range >=1.0.0,<2.0.0, raising the minimum version to 1.0.0 and allowing any 1.x release. |
| YouTube Component Implementation: src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json, src/lfx/src/lfx/components/youtube/youtube_transcripts.py | Replaced legacy loader with YouTubeTranscriptApi-based implementation. Added _extract_video_id() for URL parsing, _load_transcripts() for API retrieval with translation support, and _chunk_transcript() for time-based chunking. Updated all output methods (get_dataframe_output, get_message_output, get_data_output) to use new transcript format and include explicit error handling for TranscriptsDisabled, NoTranscriptFound, and retrieval failures. |
| Test Suite: src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py | Added FetchedTranscriptSnippetMock class and comprehensive test cases covering video ID extraction, output generation, error scenarios (TranscriptsDisabled, NoTranscriptFound, general exceptions), empty transcripts, transcript chunking with configurable chunk_size_seconds, and translation support. |
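To make the time-based chunking concrete, here is one possible sketch. It assumes snippets have already been reduced to `(start, text)` pairs and that chunks are fixed windows of `chunk_size_seconds` counted from zero. The names `chunk_transcript` and `format_timestamp` are hypothetical, and the PR's `_chunk_transcript()` may group differently.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def chunk_transcript(snippets, chunk_size_seconds: int = 60):
    """Group (start, text) pairs into fixed windows keyed by start time."""
    if chunk_size_seconds <= 0:
        msg = "chunk_size_seconds must be positive"
        raise ValueError(msg)
    buckets = {}
    for start, text in snippets:
        # Bucket index is the window the snippet starts in.
        buckets.setdefault(int(start // chunk_size_seconds), []).append(text)
    return [
        {"timestamp": format_timestamp(index * chunk_size_seconds), "text": " ".join(texts)}
        for index, texts in sorted(buckets.items())
    ]
```

With a 60-second window, snippets starting at 0.0 and 30.0 land in the `00:00` chunk and one starting at 61.5 lands in `01:00`.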

Sequence Diagram

sequenceDiagram
    participant User
    participant YouTubeComponent
    participant YouTubeTranscriptApi
    
    User->>YouTubeComponent: Provide YouTube URL
    YouTubeComponent->>YouTubeComponent: _extract_video_id(url)
    alt Valid Video ID
        YouTubeComponent->>YouTubeTranscriptApi: list(video_id)
        YouTubeTranscriptApi-->>YouTubeComponent: Available transcripts
        alt Transcript Available
            YouTubeComponent->>YouTubeTranscriptApi: fetch(video_id, language)
            YouTubeTranscriptApi-->>YouTubeComponent: Transcript segments
            alt Translation Requested
                YouTubeComponent->>YouTubeTranscriptApi: fetch with translate
                YouTubeTranscriptApi-->>YouTubeComponent: Translated segments
            end
            YouTubeComponent->>YouTubeComponent: _chunk_transcript(data)
            YouTubeComponent->>YouTubeComponent: Generate outputs (DataFrame/Message/Data)
            YouTubeComponent-->>User: Success response
        else No Transcript Found
            YouTubeComponent-->>User: Error: NoTranscriptFound
        else Transcripts Disabled
            YouTubeComponent-->>User: Error: TranscriptsDisabled
        end
    else Invalid Video ID
        YouTubeComponent-->>User: Error: Invalid URL
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Video ID extraction logic: Review regex patterns for handling multiple YouTube URL formats and invalid URL validation
  • Transcript chunking algorithm: Verify time-based grouping logic and chunk_size_seconds handling are correct
  • Error handling differentiation: Ensure TranscriptsDisabled, NoTranscriptFound, and generic exceptions are properly caught and reported across all output methods
  • Data format compatibility: Confirm both old and new transcript segment formats (dict-like vs object-like) are handled correctly throughout
  • Translation flow: Validate translate parameter usage and fallback behavior in the new API
  • Test coverage: Check that test mocks accurately reflect new YouTubeTranscriptApi behavior and that all error paths are covered
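The dict-like vs object-like compatibility point can be illustrated with a small normalizer. This is a sketch, not the component's code: `normalize_snippet` and `SNIPPET_FIELDS` are hypothetical names, and `SimpleNamespace` stands in for a v1.x snippet object.

```python
from types import SimpleNamespace

SNIPPET_FIELDS = ("text", "start", "duration")


def normalize_snippet(snippet) -> dict:
    """Return a plain dict whether the snippet is a legacy dict or a v1.x-style object."""
    if isinstance(snippet, dict):
        return {name: snippet[name] for name in SNIPPET_FIELDS}
    return {name: getattr(snippet, name) for name in SNIPPET_FIELDS}


# Both shapes normalize to the same dict.
legacy = {"text": "hi", "start": 0.0, "duration": 1.2}
modern = SimpleNamespace(text="hi", start=0.0, duration=1.2)
```

Normalizing once at the boundary means the chunking and output methods only ever see one shape.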

Suggested labels

bug, size:L, lgtm

Suggested reviewers

  • jordanrfrazier

Pre-merge checks and finishing touches

❌ Failed checks (1 error, 2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Coverage For New Implementations | ❌ Error | The PR introduces significant changes to support youtube-transcript-api v1.0+ but has incomplete test coverage for the regression case identified in the review comment. The reviewer explicitly highlighted that the translation fallback logic fails for videos without English captions and that api.fetch() should be replaced with transcript.fetch(). Most critically, the review specifically requested "a regression case where no English captions exist" to be added to the test suite. The PR summary shows comprehensive test cases were added for various scenarios, but there is no indication that this specific regression test case covering non-English-only transcripts with translation has been included, which is essential for preventing future regressions of this exact scenario. | The test coverage must be expanded to include a regression test case that explicitly covers translation of transcripts when no English captions are available. This test should verify that the code successfully retrieves a non-English transcript and translates it to the target language. Additionally, the test setup should be updated to mock transcript.fetch() and .translate().fetch() on Transcript objects rather than YouTubeTranscriptApi.fetch() to properly align with the v1.x API behavior and ensure the translation path actually returns translated snippets as requested in the review comment. |
| Test Quality And Coverage | ⚠️ Warning | The tests for the YouTube Transcript API v1.0+ upgrade have significant quality and coverage gaps. Most critically, the tests are mocking at the wrong level and not validating the actual API contract. According to the review comments and the youtube-transcript-api v1.0 documentation, the correct pattern for v1.0+ is to call transcript.fetch() on Transcript objects (and translated_transcript.fetch() after translation), not YouTubeTranscriptApi.fetch(). However, the tests appear to be mocking YouTubeTranscriptApi.fetch() directly rather than mocking the Transcript.fetch() method that the implementation should be calling. This means the tests could pass even if the implementation doesn't actually work with the real v1.0 API. Furthermore, the tests lack coverage for the critical translation fallback logic mentioned in the review comments, specifically testing scenarios where no English captions exist but other languages do, and where the translation path must work correctly. The tests also do not validate that error handling distinguishes between different failure modes (TranscriptsDisabled vs. NoTranscriptFound vs. other exceptions) or test the regression case where a video has no English captions but can be translated from another language. | Update the tests to mock Transcript.fetch() and .translate().fetch() instead of YouTubeTranscriptApi.fetch() to properly validate the implementation matches the v1.0 API contract. Add specific test cases for the translation fallback logic: scenarios where only non-English captions exist and must be translated, cases where both generated and manual transcripts are available, and regression tests ensuring videos with only auto-generated non-English transcripts can still be translated. Ensure error handling tests comprehensively cover the distinct exception paths (TranscriptsDisabled, NoTranscriptFound, etc.) to validate that the current implementation properly implements the suggested patch from the review comments. |
| Excessive Mock Usage Warning | ⚠️ Warning | The PR summary indicates that new mock classes and fixtures have been introduced (FetchedTranscriptSnippetMock, mock_transcript_list) with patches for YouTubeTranscriptApi and related methods to test various code paths including error scenarios, translation, and chunking. While mocking external API dependencies is generally appropriate for unit tests, the critical review comment reveals a significant implementation issue: the current translation fallback logic is incorrect and the code calls YouTubeTranscriptApi.fetch() instead of transcript.fetch(). The reviewer explicitly states "Once this is in place, update the unit tests to stub Transcript.fetch() (and .translate().fetch() in the translation case) instead of YouTubeTranscriptApi.fetch", suggesting the current test mock structure may be misaligned with how the API should be used. This indicates that excessive or improperly layered mocking may be masking architectural problems rather than testing the correct implementation logic. | The PR should not be merged in its current state regarding mock usage. The implementation must first be corrected to match the reviewer's suggested patch (proper translation fallback chain and using transcript.fetch() instead of YouTubeTranscriptApi.fetch()). Once the implementation is fixed, the test mocks must be restructured accordingly to stub the correct API layer (Transcript object methods rather than YouTubeTranscriptApi methods). This will eliminate mock layering issues and ensure tests are validating actual component behavior rather than working around incorrect implementation patterns. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "fix: Upgrade to youtube-transcript-api v1.0+" directly corresponds to the primary change in the changeset: updating the youtube-transcript-api dependency from 0.6.3 to >=1.0.0,<2.0.0 in pyproject.toml. The title is concise, clear, and specific enough that a teammate scanning git history would immediately understand the main change. While the changeset includes substantial refactoring of the YouTubeTranscriptsComponent and tests, these are secondary consequences of supporting the new API version, not the primary driver of the PR. The title accurately captures the root cause and primary intent without being misleading or overly vague. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 96.15% which is sufficient. The required threshold is 80.00%. |
| Test File Naming And Structure | ✅ Passed | The test file test_youtube_transcript_component.py follows proper pytest backend testing conventions with the correct test_*.py naming pattern in the appropriate directory structure (src/backend/tests/unit/components/bundles/youtube/). According to the PR summary, the tests are comprehensively organized with proper fixtures (mock_transcript_list fixture) and mock classes (FetchedTranscriptSnippetMock) for setup, covering both positive and negative scenarios: video ID extraction from various URL formats and invalid URLs, DataFrame/Message/Data outputs, error handling (TranscriptsDisabled, NoTranscriptFound, general exceptions), empty transcript scenarios, chunking behavior with configurable parameters, and translation support. The test cases address edge cases and error conditions as required, with logical organization of related functionality being tested. |


github-actions Bot added the bug label Oct 30, 2025
codecov Bot commented Oct 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 31.30%. Comparing base (0cfb4fd) to head (e54ed79).
⚠️ Report is 1 commit behind head on main.

❌ Your project status has failed because the head coverage (39.37%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files


@@            Coverage Diff             @@
##             main   #10456      +/-   ##
==========================================
- Coverage   31.32%   31.30%   -0.02%     
==========================================
  Files        1324     1324              
  Lines       59920    59920              
  Branches     8966     8966              
==========================================
- Hits        18769    18760       -9     
- Misses      40254    40263       +9     
  Partials      897      897              
| Flag | Coverage Δ |
|---|---|
| backend | 50.87% <ø> (-0.06%) ⬇️ |
| frontend | 13.31% <ø> (ø) |
| lfx | 39.37% <ø> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
see 1 file with indirect coverage changes


github-actions Bot added and removed the bug label Oct 30, 2025
coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json (1)

1870-2065: Mirror the transcript fix in the serialized component

This JSON embeds the same YouTubeTranscriptsComponent source, so it still has the hard-coded English lookup and api.fetch(...) call. Please apply the identical fallback + transcript.fetch() change here (and keep the string literal in sync with the Python module) so flows created from this starter project get the corrected behavior.
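One way to guard against the JSON and the Python module drifting apart is a small consistency check. The sketch below assumes the starter project stores component source under `data.nodes[].data.node.template.code.value`; that layout is an assumption about the flow file and is not confirmed on this page. Both function names are hypothetical.

```python
def embedded_component_sources(flow: dict, class_name: str) -> list:
    """Collect embedded source strings that define the given component class."""
    sources = []
    # Assumed layout: data.nodes[].data.node.template.code.value
    for node in flow.get("data", {}).get("nodes", []):
        template = node.get("data", {}).get("node", {}).get("template", {})
        code = template.get("code", {}).get("value", "")
        if f"class {class_name}" in code:
            sources.append(code)
    return sources


def is_in_sync(flow: dict, class_name: str, module_source: str) -> bool:
    """True only if the class is embedded and every copy matches the module source."""
    sources = embedded_component_sources(flow, class_name)
    return bool(sources) and all(src.strip() == module_source.strip() for src in sources)
```

In practice `flow` would come from `json.load()` on the starter project file and `module_source` from reading `youtube_transcripts.py`. A stricter check might compare ASTs so that whitespace differences do not matter.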

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e24c387 and b05c75b.

📒 Files selected for processing (4)
  • pyproject.toml (1 hunks)
  • src/backend/base/langflow/initial_setup/starter_projects/Youtube Analysis.json (3 hunks)
  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py (6 hunks)
  • src/lfx/src/lfx/components/youtube/youtube_transcripts.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
src/backend/tests/unit/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/tests/unit/components/**/*.py: Mirror the component directory structure for unit tests in src/backend/tests/unit/components/
Use ComponentTestBaseWithClient or ComponentTestBaseWithoutClient as base classes for component unit tests
Provide file_names_mapping for backward compatibility in component tests
Create comprehensive unit tests for all new components

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
{src/backend/**/*.py,tests/**/*.py,Makefile}

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

{src/backend/**/*.py,tests/**/*.py,Makefile}: Run make format_backend to format Python code before linting or committing changes
Run make lint to perform linting checks on backend Python code

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
src/backend/tests/unit/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

Test component integration within flows using create_flow, build_flow, and get_build_events utilities

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
src/backend/tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)

src/backend/tests/**/*.py: Unit tests for backend code must be located in the 'src/backend/tests/' directory, with component tests organized by component subdirectory under 'src/backend/tests/unit/components/'.
Test files should use the same filename as the component under test, with an appropriate test prefix or suffix (e.g., 'my_component.py' → 'test_my_component.py').
Use the 'client' fixture (an async httpx.AsyncClient) for API tests in backend Python tests, as defined in 'src/backend/tests/conftest.py'.
When writing component tests, inherit from the appropriate base class in 'src/backend/tests/base.py' (ComponentTestBase, ComponentTestBaseWithClient, or ComponentTestBaseWithoutClient) and provide the required fixtures: 'component_class', 'default_kwargs', and 'file_names_mapping'.
Each test in backend Python test files should have a clear docstring explaining its purpose, and complex setups or mocks should be well-commented.
Test both sync and async code paths in backend Python tests, using '@pytest.mark.asyncio' for async tests.
Mock external dependencies appropriately in backend Python tests to isolate unit tests from external services.
Test error handling and edge cases in backend Python tests, including using 'pytest.raises' and asserting error messages.
Validate input/output behavior and test component initialization and configuration in backend Python tests.
Use the 'no_blockbuster' pytest marker to skip the blockbuster plugin in tests when necessary.
Be aware of ContextVar propagation in async tests; test both direct event loop execution and 'asyncio.to_thread' scenarios to ensure proper context isolation.
Test error handling by mocking internal functions using monkeypatch in backend Python tests.
Test resource cleanup in backend Python tests by using fixtures that ensure proper initialization and cleanup of resources.
Test timeout and performance constraints in backend Python tests using 'asyncio.wait_for' and timing assertions.
Test Langflow's Messag...

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
src/backend/**/*component*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
src/backend/**/components/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In your Python component class, set the icon attribute to a string matching the frontend icon mapping exactly (case-sensitive).

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
**/{test_*.py,*.test.ts,*.test.tsx}

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/{test_*.py,*.test.ts,*.test.tsx}: Check if tests have too many mock objects that obscure what’s actually being tested
Warn when mocks are used instead of testing real behavior and interactions
Suggest using real objects or simpler test doubles when mocks become excessive
Ensure mocks are used only for external dependencies, not core business logic
Recommend integration tests when unit tests become overly mocked
Check that test files follow the project’s naming conventions (backend: test_*.py; frontend: *.test.ts/tsx)
Verify that tests actually exercise the new or changed functionality, not placeholder assertions
Test files should have descriptive test function names explaining what is being tested
Organize tests logically with proper setup and teardown
Include edge cases and error conditions for comprehensive coverage
Verify tests cover both positive (success) and negative (failure) scenarios
Ensure tests are not mere smoke tests; they should validate behavior thoroughly
Ensure tests follow the project’s testing frameworks (pytest for backend, Playwright for frontend)

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
**/test_*.py

📄 CodeRabbit inference engine (coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt)

**/test_*.py: Backend tests must be named test_*.py and use proper pytest structure (fixtures, assertions)
For async backend code, use proper pytest async patterns (e.g., pytest-asyncio)
For API endpoints, include tests for both success and error responses

Files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
🧠 Learnings (5)
📚 Learning: 2025-07-21T14:16:14.125Z
Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : Mock external dependencies appropriately in backend Python tests to isolate unit tests from external services.

Applied to files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
📚 Learning: 2025-07-21T14:16:14.125Z
Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : Use 'MockLanguageModel' for testing language model components without external API calls in backend Python tests.

Applied to files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
📚 Learning: 2025-08-05T22:51:27.961Z
Learnt from: edwinjosechittilappilly
PR: langflow-ai/langflow#0
File: :0-0
Timestamp: 2025-08-05T22:51:27.961Z
Learning: The TestComposioComponentAuth test in src/backend/tests/unit/components/bundles/composio/test_base_composio.py demonstrates proper integration testing patterns for external API components, including real API calls with mocking for OAuth completion, comprehensive resource cleanup, and proper environment variable handling with pytest.skip() fallbacks.

Applied to files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
📚 Learning: 2025-07-18T18:25:54.486Z
Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-07-18T18:25:54.486Z
Learning: Applies to src/backend/tests/unit/components/**/*.py : Use ComponentTestBaseWithClient or ComponentTestBaseWithoutClient as base classes for component unit tests

Applied to files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
📚 Learning: 2025-07-21T14:16:14.125Z
Learnt from: CR
PR: langflow-ai/langflow#0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-07-21T14:16:14.125Z
Learning: Applies to src/backend/tests/**/*.py : When writing component tests, inherit from the appropriate base class in 'src/backend/tests/base.py' (ComponentTestBase, ComponentTestBaseWithClient, or ComponentTestBaseWithoutClient) and provide the required fixtures: 'component_class', 'default_kwargs', and 'file_names_mapping'.

Applied to files:

  • src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py
🧬 Code graph analysis (2)
src/lfx/src/lfx/components/youtube/youtube_transcripts.py (2)
src/lfx/src/lfx/schema/message.py (1)
  • Message (34-299)
src/lfx/src/lfx/schema/data.py (1)
  • Data (26-288)
src/backend/tests/unit/components/bundles/youtube/test_youtube_transcript_component.py (1)
src/lfx/src/lfx/components/youtube/youtube_transcripts.py (4)
  • _extract_video_id (51-62)
  • get_dataframe_output (147-165)
  • get_message_output (167-179)
  • get_data_output (181-206)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: codecov/project/lfx
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
  • GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
  • GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
  • GitHub Check: Lint Backend / Run Mypy (3.10)
  • GitHub Check: Lint Backend / Run Mypy (3.11)
  • GitHub Check: Lint Backend / Run Mypy (3.12)
  • GitHub Check: Run Frontend Tests / Determine Test Suites and Shard Distribution
  • GitHub Check: Test Starter Templates
  • GitHub Check: Update Component Index
  • GitHub Check: test-starter-projects
  • GitHub Check: Optimize new Python code in this PR

Comment on lines +78 to +118
        if self.translation:
            # Get any available transcript and translate it
            transcript = transcript_list.find_transcript(["en"])
            transcript = transcript.translate(self.translation)
        else:
            # Try to get transcript in available languages
            try:
                transcript = transcript_list.find_transcript(["en"])
            except NoTranscriptFound:
                # Try auto-generated English
                transcript = transcript_list.find_generated_transcript(["en"])

        # Fetch the transcript data
        transcript_data = api.fetch(transcript.video_id, [transcript.language_code])

    except (TranscriptsDisabled, NoTranscriptFound) as e:
        error_type = type(e).__name__
        msg = (
            f"Could not retrieve transcripts for video '{video_id}'. "
            "Possible reasons:\n"
            "1. This video does not have captions/transcripts enabled\n"
            "2. The video is private, restricted, or deleted\n"
            f"\nTechnical error ({error_type}): {e}"
        )
        raise RuntimeError(msg) from e
    except Exception as e:
        error_type = type(e).__name__
        msg = (
            f"Could not retrieve transcripts for video '{video_id}'. "
            "Possible reasons:\n"
            "1. This video does not have captions/transcripts enabled\n"
            "2. The video is private, restricted, or deleted\n"
            "3. YouTube is blocking automated requests\n"
            f"\nTechnical error ({error_type}): {e}"
        )
        raise RuntimeError(msg) from e

    if as_chunks:
        # Group into chunks based on chunk_size_seconds
        return self._chunk_transcript(transcript_data)
    # Return as continuous text

⚠️ Potential issue | 🔴 Critical

Fix translation fallback and fetching logic

When translation is set we always look for English captions, so videos that only expose (say) Portuguese captions still blow up with NoTranscriptFound even though they can be translated. On top of that we call YouTubeTranscriptApi.fetch(...) after translate(...), but v1.x expects you to call transcript.fetch() on the Transcript (or translated transcript) object—otherwise the translated snippets are never retrieved.

Please select an English transcript if it exists, fall back to generated-English, then fall back to any available transcript before translating, and use transcript.fetch() so the translation path actually returns translated snippets. That change also keeps the non-translation branch working for videos whose only captions are auto-generated in another language. Suggested patch:

-        if self.translation:
-            # Get any available transcript and translate it
-            transcript = transcript_list.find_transcript(["en"])
-            transcript = transcript.translate(self.translation)
-        else:
-            # Try to get transcript in available languages
-            try:
-                transcript = transcript_list.find_transcript(["en"])
-            except NoTranscriptFound:
-                # Try auto-generated English
-                transcript = transcript_list.find_generated_transcript(["en"])
-
-        # Fetch the transcript data
-        transcript_data = api.fetch(transcript.video_id, [transcript.language_code])
+        if self.translation:
+            try:
+                transcript = transcript_list.find_transcript(["en"])
+            except NoTranscriptFound as en_error:
+                try:
+                    transcript = transcript_list.find_generated_transcript(["en"])
+                except NoTranscriptFound:
+                    try:
+                        transcript = transcript_list[0]
+                    except IndexError as exc:  # pragma: no cover
+                        raise en_error from exc
+            transcript = transcript.translate(self.translation)
+        else:
+            try:
+                transcript = transcript_list.find_transcript(["en"])
+            except NoTranscriptFound as en_error:
+                try:
+                    transcript = transcript_list.find_generated_transcript(["en"])
+                except NoTranscriptFound:
+                    try:
+                        transcript = transcript_list[0]
+                    except IndexError as exc:  # pragma: no cover
+                        raise en_error from exc
+
+        transcript_data = transcript.fetch()

Once this is in place, update the unit tests to stub Transcript.fetch() (and .translate().fetch() in the translation case) instead of YouTubeTranscriptApi.fetch, and add a regression case where no English captions exist so we keep this flow covered.
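The requested regression case could look roughly like the sketch below. Several things here are assumptions: `NoTranscriptFound` is a local stand-in for the library exception so the example runs without the dependency, `select_transcript` and `load_snippets` are hypothetical helpers that fold the duplicated fallback chain into one place, and the final fallback iterates the transcript list instead of indexing it (the library documents iterating over the list; whether `transcript_list[0]` is supported is not confirmed here).

```python
from unittest.mock import MagicMock


class NoTranscriptFound(Exception):
    """Local stand-in for youtube_transcript_api's exception of the same name."""


def select_transcript(transcript_list):
    """Prefer English, then auto-generated English, then the first listed transcript."""
    try:
        return transcript_list.find_transcript(["en"])
    except NoTranscriptFound as en_error:
        try:
            return transcript_list.find_generated_transcript(["en"])
        except NoTranscriptFound:
            first = next(iter(transcript_list), None)
            if first is None:
                raise en_error
            return first


def load_snippets(transcript_list, translation=None):
    """Fetch from the Transcript object itself, translating first if requested."""
    transcript = select_transcript(transcript_list)
    if translation:
        transcript = transcript.translate(translation)
    return transcript.fetch()


def test_translates_when_no_english_captions():
    """Regression: only Portuguese captions exist, and translation is requested."""
    portuguese = MagicMock(language_code="pt")
    # Stub .translate().fetch() on the Transcript, not YouTubeTranscriptApi.fetch
    portuguese.translate.return_value.fetch.return_value = ["translated snippet"]

    transcript_list = MagicMock()
    transcript_list.find_transcript.side_effect = NoTranscriptFound()
    transcript_list.find_generated_transcript.side_effect = NoTranscriptFound()
    transcript_list.__iter__.return_value = iter([portuguese])

    assert load_snippets(transcript_list, translation="en") == ["translated snippet"]
    portuguese.translate.assert_called_once_with("en")
```

Because the mock's `fetch` is attached to the object returned by `translate`, the test fails if the implementation goes back to fetching through the API object and skips the translated transcript.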

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if self.translation:
try:
transcript = transcript_list.find_transcript(["en"])
except NoTranscriptFound as en_error:
try:
transcript = transcript_list.find_generated_transcript(["en"])
except NoTranscriptFound:
try:
transcript = transcript_list[0]
except IndexError as exc: # pragma: no cover
raise en_error from exc
transcript = transcript.translate(self.translation)
else:
try:
transcript = transcript_list.find_transcript(["en"])
except NoTranscriptFound as en_error:
try:
transcript = transcript_list.find_generated_transcript(["en"])
except NoTranscriptFound:
try:
transcript = transcript_list[0]
except IndexError as exc: # pragma: no cover
raise en_error from exc
transcript_data = transcript.fetch()
except (TranscriptsDisabled, NoTranscriptFound) as e:
error_type = type(e).__name__
msg = (
f"Could not retrieve transcripts for video '{video_id}'. "
"Possible reasons:\n"
"1. This video does not have captions/transcripts enabled\n"
"2. The video is private, restricted, or deleted\n"
f"\nTechnical error ({error_type}): {e}"
)
raise RuntimeError(msg) from e
except Exception as e:
error_type = type(e).__name__
msg = (
f"Could not retrieve transcripts for video '{video_id}'. "
"Possible reasons:\n"
"1. This video does not have captions/transcripts enabled\n"
"2. The video is private, restricted, or deleted\n"
"3. YouTube is blocking automated requests\n"
f"\nTechnical error ({error_type}): {e}"
)
raise RuntimeError(msg) from e
if as_chunks:
# Group into chunks based on chunk_size_seconds
return self._chunk_transcript(transcript_data)
# Return as continuous text
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/youtube/youtube_transcripts.py around lines
78-118, the logic currently always searches for an English transcript when
translation is requested and calls YouTubeTranscriptApi.fetch(...) after
translate(), which prevents using transcripts available only in other languages
and fails to actually fetch translated snippets; change the selection order to:
try find English transcript, then find generated-English, then fall back to any
available transcript (transcript_list.find_transcript(None) or equivalent)
before translating if self.translation is set; after selecting (and optionally
translating) a Transcript object, call transcript.fetch() on that object (or
translated_transcript.fetch()) instead of api.fetch(...); finally update unit
tests to stub Transcript.fetch() and the translated transcript’s .fetch() (not
YouTubeTranscriptApi.fetch), and add a regression test where no English captions
exist to cover the fallback-to-any-language-then-translate path.
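The selection order described in this prompt could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `NoTranscriptFound` here is a local stand-in for the library exception, `FakeTranscriptList` is a hypothetical test double, and the fallback uses plain iteration on the assumption that the real transcript list is iterable. Whether the real object also supports indexing such as `transcript_list[0]` should be checked against the installed version.

```python
class NoTranscriptFound(Exception):
    """Local stand-in for the library's NoTranscriptFound (an assumption of this sketch)."""


class FakeTranscriptList:
    """Hypothetical test double: iterable, with the two finder methods used in the review."""

    def __init__(self, transcripts):
        self._transcripts = list(transcripts)

    def __iter__(self):
        return iter(self._transcripts)

    def find_transcript(self, language_codes):
        for transcript in self._transcripts:
            if transcript["language_code"] in language_codes:
                return transcript
        raise NoTranscriptFound(language_codes)

    def find_generated_transcript(self, language_codes):
        for transcript in self._transcripts:
            if transcript["is_generated"] and transcript["language_code"] in language_codes:
                return transcript
        raise NoTranscriptFound(language_codes)


def select_transcript(transcript_list):
    """Prefer English, then generated English, then the first transcript in any language."""
    try:
        return transcript_list.find_transcript(["en"])
    except NoTranscriptFound as en_error:
        try:
            return transcript_list.find_generated_transcript(["en"])
        except NoTranscriptFound:
            fallback = next(iter(transcript_list), None)
            if fallback is None:
                # Nothing to fall back to: surface the original English lookup error.
                raise en_error
            return fallback
```

Factoring the selection into one function would also remove the duplicated fallback chain in the two branches of the suggestion, since translation can be applied to whatever the function returns.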

Collaborator

@erichare erichare left a comment

LGTM!

@Cristhianzl Cristhianzl enabled auto-merge November 4, 2025 23:39
@github-actions github-actions Bot added bug Something isn't working and removed bug Something isn't working labels Nov 4, 2025
@github-actions
Contributor

github-actions Bot commented Nov 4, 2025

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
| --- | --- | --- | --- |
| Coverage: 14% | 14.37% (3866/26885) | 7.34% (1506/20507) | 8.64% (509/5889) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 1549 | 0 💤 | 0 ❌ | 0 🔥 | 18.046s ⏱️ |

@Cristhianzl Cristhianzl added this pull request to the merge queue Nov 5, 2025
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Nov 5, 2025
@Cristhianzl Cristhianzl added this pull request to the merge queue Nov 5, 2025
Merged via the queue into main with commit 64b593b Nov 5, 2025
81 of 82 checks passed
@Cristhianzl Cristhianzl deleted the cz/fix-youtube-transcript-version branch November 5, 2025 01:35
korenLazar pushed a commit to kiran-kate/langflow that referenced this pull request Nov 12, 2025
* fix youtube transcript api

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
korenLazar pushed a commit to kiran-kate/langflow that referenced this pull request Nov 13, 2025
* fix youtube transcript api

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

* [autofix.ci] apply automated fixes

* [autofix.ci] apply automated fixes (attempt 2/3)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Labels

bug Something isn't working
