youtube-dl/youtube_dl
Glenn Slayden c9a9ccf8a3
URL batch listing file improvements
These improvements apply to reading the list of URLs from the file supplied via the `--batch-file` (`-a`) command line option.

1. Skip blank and whitespace-only lines in the file. Currently, a line with leading whitespace is skipped only when that whitespace is followed by a comment character (`#`, `;`, or `]`). As a result, empty lines and lines consisting only of whitespace are returned as (trimmed) empty strings in the list of URLs to process.

2. [bug fix] Detect and remove the Unicode BOM when the file descriptor is already decoding Unicode.

With Python 3, the `batch_fd` enumerator returns the lines of the file as Unicode. For UTF-8, this means that the raw BOM bytes `\xef \xbb \xbf` from the file are decoded into a single `\ufeff` character prefixed to the first enumerated text line.

This fix resolves several buggy interactions among the presence of a BOM, the skipping of comments and/or blank lines, and the consistent trimming of the list of URLs. For example, if the first line of the file is blank, the BOM is incorrectly returned as a stand-alone URL. If the first line contains a URL, that URL is prefixed with this unwanted single character, and the presence of the character also inhibits the proper trimming of any leading whitespace. Currently, the `UnicodeBOMIE` helper attempts to recover from some of these error cases, but this fix prevents the error from happening in the first place (at least on Python 3). In any case, the `UnicodeBOMIE` approach is flawed, because it is illogical for a BOM to appear in the (non-batch) URL(s) specified directly on the command line (or, for that matter, in URLs *after the first line* of a batch list).

3. Now that `read_batch_urls` more consistently enumerates only properly trimmed URLs, it can also eliminate exact duplicates on the fly, without disturbing the order in which they are listed.
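
The three points above can be illustrated with a minimal sketch. This is not the code from the commit: the function name `read_batch_urls_sketch` and the constants `BOM` and `COMMENT_PREFIXES` are hypothetical, and the sketch assumes Python 3 and a file object that yields either text lines or UTF-8 bytes.

```python
import io

BOM = '\ufeff'                       # a decoded UTF-8 BOM is this one character
COMMENT_PREFIXES = ('#', ';', ']')   # comment characters named in the description


def read_batch_urls_sketch(batch_fd):
    """Hypothetical stand-in for read_batch_urls; illustration only."""
    seen = set()
    urls = []
    for index, line in enumerate(batch_fd):
        if isinstance(line, bytes):
            # Assumption: undecoded input is UTF-8.
            line = line.decode('utf-8', 'replace')
        # Remove the BOM before trimming: str.strip() does not treat
        # U+FEFF as whitespace, so it would otherwise survive.
        if index == 0 and line.startswith(BOM):
            line = line[len(BOM):]
        url = line.strip()
        # Skip blank lines, whitespace-only lines, and comments.
        if not url or url.startswith(COMMENT_PREFIXES):
            continue
        # Drop exact duplicates while keeping first-seen order.
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls


sample = io.StringIO(
    '\ufeff\n'
    '  # a comment\n'
    '  https://example.com/a  \n'
    '\n'
    'https://example.com/b\n'
    'https://example.com/a\n'
)
print(read_batch_urls_sketch(sample))
# prints ['https://example.com/a', 'https://example.com/b']
```

Tracking seen URLs in a set next to the output list keeps the duplicate check cheap while preserving the listing order.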
2020-10-04 22:54:59 -07:00
downloader [downloader/http] Properly handle missing message in SSLError (closes #26646) 2020-09-22 07:01:59 +07:00
extractor [expressen] Add support for di.se (closes #26670) 2020-09-24 07:37:10 +07:00
postprocessor [postprocessor/embedthumbnail] Fix issues (closes #25717) 2020-09-14 03:28:31 +07:00
YoutubeDL.py [YoutubeDL] Force redirect URL to unicode on python 2 2020-02-29 19:08:44 +07:00
__init__.py Output batch filename when it could not be read (#21915) 2019-08-01 03:54:39 +07:00
__main__.py
aes.py Switch codebase to use compat_b64decode 2018-01-23 22:23:12 +07:00
cache.py Use expand_path where appropriate (closes #12556) 2017-03-26 02:31:16 +07:00
compat.py [compat] Introduce compat_cookiejar_Cookie 2020-05-05 05:54:10 +07:00
jsinterp.py [jsinterp] Fix typo and cleanup regexes (closes #13134) 2017-05-18 22:57:38 +07:00
options.py [options] Clarify doc on --exec command (closes #19087) (#24883) 2020-04-24 02:31:38 +07:00
socks.py [socks] Report errors elegantly when credentails are required but missing 2017-04-22 21:48:41 +08:00
swfinterp.py Update coding style after pycodestyle 2.1.0 2016-11-17 19:45:42 +08:00
update.py [update] Fix updating via symlinks (closes #23991) 2020-02-08 19:46:58 +07:00
utils.py URL batch listing file improvements 2020-10-04 22:54:59 -07:00
version.py release 2020.09.20 2020-09-20 12:30:45 +07:00