[Funimation] Rewrite extractor (See desc) (#444)

* Support direct `/player/` URL
* Treat the different versions of an episode as different formats of a single video. As a result, `experience_id` can no longer serve as the video `id`, and `episode_id` is used instead. This breaks all existing download archives
* Add extractor options `language` and `version` to pre-select which versions are extracted
* Compat option `seperate-video-versions` to fall back to old behavior (including using the old video IDs)
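
As a rough illustration of the behavior listed above, the sketch below shows how an episode's experiences could be narrowed by language, version, and the compat option, and which value ends up as the video `id`. The function name `select_experiences` and the sample tuples are hypothetical and are not part of the extractor; values are compared exactly as given, so any case normalization done by the real code is not modeled here.

```python
def select_experiences(experiences, initial_experience_id, episode_id,
                       languages=(), versions=(), compat_opts=()):
    """Return (video_id, kept) for a list of (lang, version, experience_id) tuples.

    Hypothetical helper: it mirrors the filtering condition in the diff,
    but it is only a sketch and not the extractor's actual code.
    """
    only_initial = 'seperate-video-versions' in compat_opts
    kept = [
        (lang, version, exp_id)
        for lang, version, exp_id in experiences
        if not (only_initial and exp_id != initial_experience_id
                or languages and lang not in languages
                or versions and version not in versions)
    ]
    # With the compat option the old per-experience ID is kept;
    # otherwise every version shares the episode ID.
    video_id = initial_experience_id if only_initial else episode_id
    return video_id, kept


# Made-up sample data for illustration only
sample = [
    ('english', 'uncut', '210051'),
    ('japanese', 'uncut', '210052'),
    ('english', 'simulcast', '210053'),
]
print(select_experiences(sample, '210051', '210050', languages=['english']))
print(select_experiences(sample, '210051', '210050',
                         compat_opts=['seperate-video-versions']))
```

Because the `id` changes from the experience ID to the episode ID, entries recorded in an existing download archive under the old IDs would no longer match, which is why the compat option also restores the old ID.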

Closes #428
pukkandan 2021-07-07 02:51:29 +05:30 committed by GitHub
parent 46890374f7
commit 3acf6d3856
5 changed files with 217 additions and 116 deletions

@ -128,6 +128,7 @@ ### Differences in default behavior
 * `--add-metadata` attaches the `infojson` to `mkv` files in addition to writing the metadata when used with `--write-infojson`. Use `--compat-options no-attach-info-json` to revert this
 * `playlist_index` behaves differently when used with options like `--playlist-reverse` and `--playlist-items`. See [#302](https://github.com/yt-dlp/yt-dlp/issues/302) for details. You can use `--compat-options playlist-index` if you want to keep the earlier behavior
 * The output of `-F` is listed in a new format. Use `--compat-options list-formats` to revert this
+* All *experiences* of a funimation episode are considered as a single video. This behavior breaks existing archives. Use `--compat-options seperate-video-versions` to extract information from only the default player
 * Youtube live chat (if available) is considered as a subtitle. Use `--sub-langs all,-live_chat` to download all subtitles except live chat. You can also use `--compat-options no-live-chat` to prevent live chat from downloading
 * Youtube channel URLs are automatically redirected to `/video`. Append a `/featured` to the URL to download only the videos in the home page. If the channel does not have a videos tab, we try to download the equivalent `UU` playlist instead. Also, `/live` URLs raise an error if there are no live videos instead of silently downloading the entire channel. You may use `--compat-options no-youtube-channel-redirect` to revert all these redirections
 * Unavailable videos are also listed for youtube playlists. Use `--compat-options no-youtube-unavailable-videos` to remove this
@ -1327,7 +1328,7 @@ # Set "comment" field in video metadata using description instead of webpage_url
 # EXTRACTOR ARGUMENTS
-Some extractors accept additional arguments which can be passed using `--extractor-args KEY:ARGS`. `ARGS` is a `;` (semicolon) separated string of `ARG=VAL1,VAL2`. Eg: `--extractor-args youtube:skip=dash,hls`
+Some extractors accept additional arguments which can be passed using `--extractor-args KEY:ARGS`. `ARGS` is a `;` (semicolon) separated string of `ARG=VAL1,VAL2`. Eg: `--extractor-args "youtube:skip=dash,hls;player_client=android" --extractor-args "funimation:version=uncut"`
 The following extractors use this feature:
 * **youtube**
@ -1335,8 +1336,13 @@ # EXTRACTOR ARGUMENTS
     * `player_client`: `web` (default) or `android` (force use the android client fallbacks for video extraction)
     * `player_skip`: `configs` - skip requests if applicable for client configs and use defaults
+* **funimation**
+    * `language`: Languages to extract. Eg: `funimation:language=english,japanese`
+    * `version`: The video version to extract - `uncut` or `simulcast`
 NOTE: These options may be changed/removed in the future without concern for backward compatibility
 # PLUGINS
 Plugins are loaded from `<root-dir>/ytdlp_plugins/<type>/__init__.py`. Currently only `extractor` plugins are supported. Support for `downloader` and `postprocessor` plugins may be added in the future. See [ytdlp_plugins](ytdlp_plugins) for example.
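
The `KEY:ARGS` syntax described in the EXTRACTOR ARGUMENTS hunk above can be illustrated with a small stand-alone parser. This is only a sketch of the documented string format; `parse_extractor_args` is a hypothetical name and is not claimed to match how yt-dlp parses the option internally.

```python
def parse_extractor_args(values):
    """Parse 'KEY:ARGS' strings into {key: {arg: [val, ...]}}.

    Hypothetical helper following the documented format:
    ARGS is a ';'-separated list of ARG=VAL1,VAL2 items.
    """
    parsed = {}
    for value in values:
        key, _, args = value.partition(':')
        bucket = parsed.setdefault(key.strip().lower(), {})
        for item in args.split(';'):
            if not item.strip():
                continue  # tolerate empty segments such as a trailing ';'
            name, _, vals = item.partition('=')
            bucket[name.strip()] = [v.strip() for v in vals.split(',') if v.strip()]
    return parsed


# The example command line from the README, one string per --extractor-args
print(parse_extractor_args([
    'youtube:skip=dash,hls;player_client=android',
    'funimation:version=uncut',
]))
```

Repeating the option for different extractors, as the README example does, keeps each extractor's arguments in its own bucket.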

@ -392,11 +392,9 @@ class YoutubeDL(object):
 if True, otherwise use ffmpeg/avconv if False, otherwise
 use downloader suggested by extractor if None.
 compat_opts: Compatibility options. See "Differences in default behavior".
-Note that only format-sort, format-spec, no-live-chat,
-no-attach-info-json, playlist-index, list-formats,
-no-direct-merge, embed-thumbnail-atomicparsley,
-no-youtube-unavailable-videos, no-youtube-channel-redirect,
-works when used via the API
+The following options do not work when used through the API:
+filename, abort-on-error, multistreams, no-live-chat,
+no-playlist-metafiles. Refer __init__.py for their implementation
 The following parameters are not used by YoutubeDL itself, they are used by
 the downloader (see yt_dlp/downloader/common.py):

@ -273,7 +273,7 @@ def parse_compat_opts():
 'filename', 'format-sort', 'abort-on-error', 'format-spec', 'no-playlist-metafiles',
 'multistreams', 'no-live-chat', 'playlist-index', 'list-formats', 'no-direct-merge',
 'no-youtube-channel-redirect', 'no-youtube-unavailable-videos', 'no-attach-info-json',
-'embed-thumbnail-atomicparsley',
+'embed-thumbnail-atomicparsley', 'seperate-video-versions',
 ]
 compat_opts = parse_compat_opts()

@ -457,7 +457,8 @@
 from .fujitv import FujiTVFODPlus7IE
 from .funimation import (
     FunimationIE,
-    FunimationShowIE
+    FunimationPageIE,
+    FunimationShowIE,
 )
 from .funk import FunkIE
 from .fusion import FusionIE

@ -12,52 +12,114 @@
dict_get, dict_get,
int_or_none, int_or_none,
js_to_json, js_to_json,
str_or_none,
try_get,
urlencode_postdata, urlencode_postdata,
urljoin,
ExtractorError, ExtractorError,
) )
class FunimationIE(InfoExtractor): class FunimationPageIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?funimation(?:\.com|now\.uk)/(?:[^/]+/)?shows/[^/]+/(?P<id>[^/?#&]+)' IE_NAME = 'funimation:page'
_VALID_URL = r'(?P<origin>https?://(?:www\.)?funimation(?:\.com|now\.uk))/(?P<lang>[^/]+/)?(?P<path>shows/(?P<id>[^/]+/[^/?#&]+).*$)'
_NETRC_MACHINE = 'funimation'
_TOKEN = None
_TESTS = [{ _TESTS = [{
'url': 'https://www.funimation.com/shows/hacksign/role-play/',
'info_dict': {
'id': '91144',
'display_id': 'role-play',
'ext': 'mp4',
'title': '.hack//SIGN - Role Play',
'description': 'md5:b602bdc15eef4c9bbb201bb6e6a4a2dd',
'thumbnail': r're:https?://.*\.jpg',
},
'params': {
# m3u8 download
'skip_download': True,
},
}, {
'url': 'https://www.funimation.com/shows/attack-on-titan-junior-high/broadcast-dub-preview/', 'url': 'https://www.funimation.com/shows/attack-on-titan-junior-high/broadcast-dub-preview/',
'info_dict': { 'info_dict': {
'id': '210051', 'id': '210050',
'display_id': 'broadcast-dub-preview',
'ext': 'mp4', 'ext': 'mp4',
'title': 'Attack on Titan: Junior High - Broadcast Dub Preview', 'title': 'Broadcast Dub Preview',
'thumbnail': r're:https?://.*\.(?:jpg|png)', # Other metadata is tested in FunimationIE
}, },
'params': { 'params': {
# m3u8 download 'skip_download': 'm3u8',
'skip_download': True,
}, },
'add_ie': ['Funimation'],
}, { }, {
'url': 'https://www.funimationnow.uk/shows/puzzle-dragons-x/drop-impact/simulcast/', # Not available in US
'url': 'https://www.funimation.com/shows/hacksign/role-play/',
'only_matching': True, 'only_matching': True,
}, { }, {
# with lang code # with lang code
'url': 'https://www.funimation.com/en/shows/hacksign/role-play/', 'url': 'https://www.funimation.com/en/shows/hacksign/role-play/',
'only_matching': True, 'only_matching': True,
}, {
'url': 'https://www.funimationnow.uk/shows/puzzle-dragons-x/drop-impact/simulcast/',
'only_matching': True,
}]
def _real_extract(self, url):
mobj = re.match(self._VALID_URL, url)
display_id = mobj.group('id').replace('/', '_')
if not mobj.group('lang'):
url = '%s/en/%s' % (mobj.group('origin'), mobj.group('path'))
webpage = self._download_webpage(url, display_id)
title_data = self._parse_json(self._search_regex(
r'TITLE_DATA\s*=\s*({[^}]+})',
webpage, 'title data', default=''),
display_id, js_to_json, fatal=False) or {}
video_id = (
title_data.get('id')
or self._search_regex(
(r"KANE_customdimensions.videoID\s*=\s*'(\d+)';", r'<iframe[^>]+src="/player/(\d+)'),
webpage, 'video_id', default=None)
or self._search_regex(
r'/player/(\d+)',
self._html_search_meta(['al:web:url', 'og:video:url', 'og:video:secure_url'], webpage, fatal=True),
'video id'))
return self.url_result(f'https://www.funimation.com/player/{video_id}', FunimationIE.ie_key(), video_id)
class FunimationIE(InfoExtractor):
_VALID_URL = r'https?://(?:www\.)?funimation\.com/player/(?P<id>\d+)'
_NETRC_MACHINE = 'funimation'
_TOKEN = None
_TESTS = [{
'url': 'https://www.funimation.com/player/210051',
'info_dict': {
'id': '210050',
'display_id': 'broadcast-dub-preview',
'ext': 'mp4',
'title': 'Broadcast Dub Preview',
'thumbnail': r're:https?://.*\.(?:jpg|png)',
'episode': 'Broadcast Dub Preview',
'episode_id': '210050',
'season': 'Extras',
'season_id': '166038',
'season_number': 99,
'series': 'Attack on Titan: Junior High',
'description': '',
'duration': 154,
},
'params': {
'skip_download': 'm3u8',
},
}, {
'note': 'player_id should be extracted with the relevant compat-opt',
'url': 'https://www.funimation.com/player/210051',
'info_dict': {
'id': '210051',
'display_id': 'broadcast-dub-preview',
'ext': 'mp4',
'title': 'Broadcast Dub Preview',
'thumbnail': r're:https?://.*\.(?:jpg|png)',
'episode': 'Broadcast Dub Preview',
'episode_id': '210050',
'season': 'Extras',
'season_id': '166038',
'season_number': 99,
'series': 'Attack on Titan: Junior High',
'description': '',
'duration': 154,
},
'params': {
'skip_download': 'm3u8',
'compat_opts': ['seperate-video-versions'],
},
}] }]
def _login(self): def _login(self):
@ -81,102 +143,136 @@ def _login(self):
def _real_initialize(self): def _real_initialize(self):
self._login() self._login()
@staticmethod
def _get_experiences(episode):
for lang, lang_data in episode.get('languages', {}).items():
for video_data in lang_data.values():
for version, f in video_data.items():
yield lang, version.title(), f
def _get_episode(self, webpage, experience_id=None, episode_id=None, fatal=True):
''' Extract the episode, season and show objects given either episode/experience id '''
show = self._parse_json(
self._search_regex(
r'show\s*=\s*({.+?})\s*;', webpage, 'show data', fatal=fatal),
experience_id, transform_source=js_to_json, fatal=fatal) or {}
for season in show.get('seasons', []):
for episode in season.get('episodes', []):
if episode_id is not None:
if str(episode.get('episodePk')) == episode_id:
return episode, season, show
continue
for _, _, f in self._get_experiences(episode):
if f.get('experienceId') == experience_id:
return episode, season, show
if fatal:
raise ExtractorError('Unable to find episode information')
else:
self.report_warning('Unable to find episode information')
return {}, {}, {}
def _real_extract(self, url): def _real_extract(self, url):
display_id = self._match_id(url) initial_experience_id = self._match_id(url)
webpage = self._download_webpage(url, display_id) webpage = self._download_webpage(
url, initial_experience_id, note=f'Downloading player webpage for {initial_experience_id}')
episode, season, show = self._get_episode(webpage, experience_id=int(initial_experience_id))
episode_id = str(episode['episodePk'])
display_id = episode.get('slug') or episode_id
def _search_kane(name): formats, subtitles, thumbnails, duration = [], {}, [], 0
return self._search_regex( requested_languages, requested_versions = self._configuration_arg('language'), self._configuration_arg('version')
r"KANE_customdimensions\.%s\s*=\s*'([^']+)';" % name, only_initial_experience = 'seperate-video-versions' in self.get_param('compat_opts', [])
webpage, name, default=None)
title_data = self._parse_json(self._search_regex( for lang, version, fmt in self._get_experiences(episode):
r'TITLE_DATA\s*=\s*({[^}]+})', experience_id = str(fmt['experienceId'])
webpage, 'title data', default=''), if (only_initial_experience and experience_id != initial_experience_id
display_id, js_to_json, fatal=False) or {} or requested_languages and lang not in requested_languages
or requested_versions and version not in requested_versions):
continue
thumbnails.append({'url': fmt.get('poster')})
duration = max(duration, fmt.get('duration', 0))
format_name = '%s %s (%s)' % (version, lang, experience_id)
self.extract_subtitles(
subtitles, experience_id, display_id=display_id, format_name=format_name,
episode=episode if experience_id == initial_experience_id else episode_id)
video_id = title_data.get('id') or self._search_regex([
r"KANE_customdimensions.videoID\s*=\s*'(\d+)';",
r'<iframe[^>]+src="/player/(\d+)',
], webpage, 'video_id', default=None)
if not video_id:
player_url = self._html_search_meta([
'al:web:url',
'og:video:url',
'og:video:secure_url',
], webpage, fatal=True)
video_id = self._search_regex(r'/player/(\d+)', player_url, 'video id')
title = episode = title_data.get('title') or _search_kane('videoTitle') or self._og_search_title(webpage)
series = _search_kane('showName')
if series:
title = '%s - %s' % (series, title)
description = self._html_search_meta(['description', 'og:description'], webpage, fatal=True)
subtitles = self.extract_subtitles(url, video_id, display_id)
try:
headers = {} headers = {}
if self._TOKEN: if self._TOKEN:
headers['Authorization'] = 'Token %s' % self._TOKEN headers['Authorization'] = 'Token %s' % self._TOKEN
sources = self._download_json( page = self._download_json(
'https://www.funimation.com/api/showexperience/%s/' % video_id, 'https://www.funimation.com/api/showexperience/%s/' % experience_id,
video_id, headers=headers, query={ display_id, headers=headers, expected_status=403, query={
'pinst_id': ''.join([random.choice(string.digits + string.ascii_letters) for _ in range(8)]), 'pinst_id': ''.join([random.choice(string.digits + string.ascii_letters) for _ in range(8)]),
})['items'] }, note=f'Downloading {format_name} JSON')
except ExtractorError as e: sources = page.get('items') or []
if isinstance(e.cause, compat_HTTPError) and e.cause.code == 403: if not sources:
error = self._parse_json(e.cause.read(), video_id)['errors'][0] error = try_get(page, lambda x: x['errors'][0], dict)
raise ExtractorError('%s said: %s' % ( if error:
self.IE_NAME, error.get('detail') or error.get('title')), expected=True) self.report_warning('%s said: Error %s - %s' % (
raise self.IE_NAME, error.get('code'), error.get('detail') or error.get('title')))
else:
self.report_warning('No sources found for format')
formats = [] current_formats = []
for source in sources: for source in sources:
source_url = source.get('src') source_url = source.get('src')
if not source_url:
continue
source_type = source.get('videoType') or determine_ext(source_url) source_type = source.get('videoType') or determine_ext(source_url)
if source_type == 'm3u8': if source_type == 'm3u8':
formats.extend(self._extract_m3u8_formats( current_formats.extend(self._extract_m3u8_formats(
source_url, video_id, 'mp4', source_url, display_id, 'mp4', m3u8_id='%s-%s' % (experience_id, 'hls'), fatal=False,
m3u8_id='hls', fatal=False)) note=f'Downloading {format_name} m3u8 information'))
else: else:
formats.append({ current_formats.append({
'format_id': source_type, 'format_id': '%s-%s' % (experience_id, source_type),
'url': source_url, 'url': source_url,
}) })
for f in current_formats:
# TODO: Convert language to code
f.update({'language': lang, 'format_note': version})
formats.extend(current_formats)
self._remove_duplicate_formats(formats)
self._sort_formats(formats) self._sort_formats(formats)
return { return {
'id': video_id, 'id': initial_experience_id if only_initial_experience else episode_id,
'display_id': display_id, 'display_id': display_id,
'title': title, 'duration': duration,
'description': description, 'title': episode['episodeTitle'],
'thumbnail': self._og_search_thumbnail(webpage), 'description': episode.get('episodeSummary'),
'series': series, 'episode': episode.get('episodeTitle'),
'season_number': int_or_none(title_data.get('seasonNum') or _search_kane('season')), 'episode_number': int_or_none(episode.get('episodeId')),
'episode_number': int_or_none(title_data.get('episodeNum')), 'episode_id': episode_id,
'episode': episode, 'season': season.get('seasonTitle'),
'subtitles': subtitles, 'season_number': int_or_none(season.get('seasonId')),
'season_id': title_data.get('seriesId'), 'season_id': str_or_none(season.get('seasonPk')),
'series': show.get('showTitle'),
'formats': formats, 'formats': formats,
'thumbnails': thumbnails,
'subtitles': subtitles,
} }
def _get_subtitles(self, url, video_id, display_id): def _get_subtitles(self, subtitles, experience_id, episode, display_id, format_name):
player_url = urljoin(url, '/player/' + video_id) if isinstance(episode, str):
player_page = self._download_webpage(player_url, display_id) webpage = self._download_webpage(
text_tracks_json_string = self._search_regex( f'https://www.funimation.com/player/{experience_id}', display_id,
r'"textTracks": (\[{.+?}\])', fatal=False, note=f'Downloading player webpage for {format_name}')
player_page, 'subtitles data', default='') episode, _, _ = self._get_episode(webpage, episode_id=episode, fatal=False)
text_tracks = self._parse_json(
text_tracks_json_string, display_id, js_to_json, fatal=False) or [] for _, version, f in self._get_experiences(episode):
subtitles = {} for source in f.get('sources'):
for text_track in text_tracks: for text_track in source.get('textTracks'):
url_element = {'url': text_track.get('src')} if not text_track.get('src'):
language = text_track.get('language') continue
if text_track.get('type') == 'CC': sub_type = text_track.get('type').upper()
language += '_CC' sub_type = sub_type if sub_type != 'FULL' else None
subtitles.setdefault(language, []).append(url_element) current_sub = {
'url': text_track['src'],
'name': ' '.join(filter(None, (version, text_track.get('label'), sub_type)))
}
lang = '_'.join(filter(None, (
text_track.get('language', 'und'), version if version != 'Simulcast' else None, sub_type)))
if current_sub not in subtitles.get(lang, []):
subtitles.setdefault(lang, []).append(current_sub)
return subtitles return subtitles
@ -224,7 +320,7 @@ def _real_extract(self, url):
'title': show_info['name'], 'title': show_info['name'],
'entries': [ 'entries': [
self.url_result( self.url_result(
'%s/%s' % (base_url, vod_item.get('episodeSlug')), FunimationIE.ie_key(), '%s/%s' % (base_url, vod_item.get('episodeSlug')), FunimationPageIE.ie_key(),
vod_item.get('episodeId'), vod_item.get('episodeName')) vod_item.get('episodeId'), vod_item.get('episodeName'))
for vod_item in sorted(vod_items, key=lambda x: x.get('episodeOrder'))], for vod_item in sorted(vod_items, key=lambda x: x.get('episodeOrder'))],
} }
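
The subtitle naming used by the new `_get_subtitles` can be looked at in isolation. The sketch below reproduces the two `filter(None, ...)` joins from the diff as a stand-alone function; `subtitle_key_and_name` is a hypothetical name, the sample track dictionaries are made up, and treating a missing `type` as empty is an assumption of this sketch (the code in the diff calls `.upper()` on the value directly).

```python
def subtitle_key_and_name(text_track, version):
    """Build the subtitle language key and display name for one text track.

    Hypothetical helper mirroring the joins in the diff; not the
    extractor's actual method.
    """
    # Assumption: a missing or empty 'type' is treated like 'FULL'
    sub_type = (text_track.get('type') or '').upper() or None
    if sub_type == 'FULL':
        sub_type = None
    name = ' '.join(filter(None, (version, text_track.get('label'), sub_type)))
    lang = '_'.join(filter(None, (
        text_track.get('language', 'und'),
        version if version != 'Simulcast' else None,
        sub_type)))
    return lang, name


# Made-up tracks for illustration only
print(subtitle_key_and_name({'language': 'en', 'type': 'cc', 'label': 'English'}, 'Uncut'))
print(subtitle_key_and_name({'language': 'en', 'type': 'full', 'label': 'English'}, 'Simulcast'))
```

Folding the version and track type into the key keeps tracks from different experiences apart within the single merged video, while plain Simulcast full subtitles keep the bare language code.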