From 003debed0ccb062f8de215b85886634283eac04e Mon Sep 17 00:00:00 2001 From: bastimeyer Date: Tue, 7 Nov 2023 18:51:35 +0100 Subject: [PATCH] docs: add plugin.api.validate API guide --- docs/api/validate.rst | 3 + docs/api_guide.rst | 1 + docs/api_guide/validate.rst | 335 ++++++++++++++++++++++++++++++++++++ 3 files changed, 339 insertions(+) create mode 100644 docs/api_guide/validate.rst diff --git a/docs/api/validate.rst b/docs/api/validate.rst index 2bdb436b..1e45d82a 100644 --- a/docs/api/validate.rst +++ b/docs/api/validate.rst @@ -16,6 +16,9 @@ Validation schemas . Ideally, we'd just run autodoc on the main module and configure the order of items. :( +Please see the :ref:`validation schema guides ` +for an introduction to this API and a list of examples. + .. autoclass:: streamlink.plugin.api.validate.Schema :members: :undoc-members: diff --git a/docs/api_guide.rst b/docs/api_guide.rst index 5f3b3e26..db7a1f34 100644 --- a/docs/api_guide.rst +++ b/docs/api_guide.rst @@ -5,3 +5,4 @@ API Guide :maxdepth: 2 api_guide/quickstart + api_guide/validate diff --git a/docs/api_guide/validate.rst b/docs/api_guide/validate.rst new file mode 100644 index 00000000..b405ce7d --- /dev/null +++ b/docs/api_guide/validate.rst @@ -0,0 +1,335 @@ +Validation schemas +================== + +.. currentmodule:: streamlink.plugin.api.validate + +Introduction +------------ + +The :ref:`streamlink.plugin.api.validate ` module provides an API for defining declarative +validation schemas which are used to verify and extract data from various inputs, for example HTTP responses. + +Validation schemas are a powerful tool for :ref:`plugin ` implementors to find and extract data like +stream URLs, stream metadata and more from websites and web APIs. 
+ +Instead of verifying and extracting data programmatically and having to perform error handling manually, +declarative validation schemas allow defining comprehensive validation and extraction rules which are easy to understand +and which raise errors with meaningful messages upon extraction failure. + + +Examples +-------- + +Simple schemas +^^^^^^^^^^^^^^ + +Let's begin with a few simple validation schemas which are not particularly useful yet. + +.. code-block:: pycon + + >>> from streamlink.plugin.api import validate + + >>> schema_one = validate.Schema("123") + >>> schema_two = validate.Schema(123) + >>> schema_three = validate.Schema(int, 123.0) + + >>> schema_one.validate("123") + '123' + >>> schema_two.validate(123) + 123 + >>> schema_three.validate(123) + 123 + +First, three :class:`Schema` instances are created, ``schema_one``, ``schema_two`` and ``schema_three``. + +The :class:`Schema` class is the main schema validation interface and the outer wrapper for all schema definitions. +It is a subclass of :class:`validate.all ` which additionally implements the :meth:`Schema.validate()` method. +This interface is expected by various Streamlink methods and functions when passing the ``schema`` argument/keyword, +for example to the :class:`HTTPSession ` methods or :mod:`streamlink.utils.parse` functions. + +The :class:`validate.all ` class takes a sequence of schema object arguments and validates each one in order. +All schema objects in this schema container must be valid. + +Schema objects can be anything, and depending on their type, different validations will be applied. In our example, both +``schema_one`` and ``schema_two`` contain only one schema object, namely ``"123"`` and ``123`` respectively, whereas +``schema_three`` contains two schema objects, ``int`` and ``123.0``. This means that the first two schemas validate +only one condition, while the third one validates two, first ``int``, then ``123.0``.
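The in-order behavior described above can be sketched in plain Python. This is only an illustration and not Streamlink's implementation: the helper names ``validate_value`` and ``validate_all`` are hypothetical, and a plain ``ValueError`` stands in for the module's own error type.

```python
# Hypothetical stand-ins for illustration only; not part of Streamlink's API.

def validate_value(schema, value):
    """Validate one schema object against a value and return the value."""
    if isinstance(schema, type):
        # Type objects: the input must be an instance of the schema.
        if not isinstance(value, schema):
            raise ValueError(
                f"Type of {value!r} should be {schema.__name__}, "
                f"but is {type(value).__name__}"
            )
        return value
    # Any other basic object: the input must compare equal to the schema.
    if value != schema:
        raise ValueError(f"{value!r} does not equal {schema!r}")
    return value


def validate_all(schemas, value):
    """Validate each schema in order, feeding each output into the next."""
    for schema in schemas:
        value = validate_value(schema, value)
    return value


result = validate_all((int, 123.0), 123)  # passes both checks, stays an int
```

Calling ``validate_all((int, 123.0), 123.0)`` in this sketch fails at the first step, since a ``float`` is not an ``int`` instance.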
+ +As you've probably already noticed, validation schemas also have a return value for their extraction purpose, but this isn't +very interesting in this example. + +The ``"123"``, ``123`` and ``123.0`` schemas are simple :func:`equality validations `. This is the case for +all basic objects, and all they do is validate and return the input value again. ``int`` however is a ``type`` object, +and thus a :func:`type validation <_validate_type>`, which checks whether the input is an instance of the schema object +and then also returns the input value again. Since ``123`` is an ``int``, the schema is valid for that input. +``schema_three`` however hasn't finished validating yet at this point, as it defines two validation schemas in total. +This means that the return value of the ``int`` validation gets passed to the ``123.0`` schema validation, and as expected +when checking ``123 == 123.0``, despite the input and the schema being of different types, namely ``int`` and ``float``, +the validation succeeds and returns its input value again, causing the return value of the whole +``schema_three`` to be ``123``. + +Now let's have a look at validation errors. + +.. code-block:: pycon + + >>> schema_one.validate(123) + streamlink.exceptions.PluginError: Unable to validate result: ValidationError(equality): + 123 does not equal '123' + + >>> schema_three.validate(123.0) + streamlink.exceptions.PluginError: Unable to validate result: ValidationError(type): + Type of 123.0 should be int, but is float + +The first :meth:`Schema.validate()` call passes ``123`` to ``schema_one``. ``schema_one`` however expects ``"123"``, so +a :class:`ValidationError <_exception.ValidationError>` is raised because the input value is not equal to the schema. +:meth:`Schema.validate()` catches the error and wraps it in a :class:`PluginError ` +with a specific validation message. + +The second validation also fails, but here, it's because of the input type.
The first sub-schema explicitly checks for +the type ``int``, and despite the following schema being ``123.0``, which is a ``float`` object that would obviously validate +a ``123.0`` ``float`` input when comparing equality, a :class:`ValidationError <_exception.ValidationError>` is raised. + +Extracting JSON data +^^^^^^^^^^^^^^^^^^^^ + +The next example shows how to read an optional integer value from JSON data. + +.. code-block:: pycon + + >>> from streamlink.plugin.api import validate + + >>> json_schema = validate.Schema( + ... str, + ... validate.parse_json(), + ... { + ... "status": validate.any(None, int), + ... }, + ... validate.get("status"), + ... ) + + >>> json_schema.validate("""{"status":null}""") + None + >>> json_schema.validate("""{"status":123}""") + 123 + + >>> json_schema.validate("""Not JSON""") + streamlink.exceptions.PluginError: Unable to validate result: ValidationError: + Unable to parse JSON: Expecting value: line 1 column 1 (char 0) ('Not JSON') + + >>> json_schema.validate("""{"status":"unknown"}""") + streamlink.exceptions.PluginError: Unable to validate result: ValidationError(dict): + Unable to validate value of key 'status' + Context(AnySchema): + ValidationError(equality): + 'unknown' does not equal None + ValidationError(type): + Type of 'unknown' should be int, but is str + +Once again, we start with a new :class:`Schema` object which gets assigned to ``json_schema``. This schema collection validates +four schemas in total. Each of them must be valid, with each output being the input of the next one. + +Since our goal is to parse JSON data and extract data from it, this means that we should only accept string inputs, so we set +``str`` as the first schema in this :class:`validate.all ` schema collection. 
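For comparison, the four steps of ``json_schema`` could be written imperatively roughly as follows. This is a sketch using only the standard library; the function name ``extract_status`` is hypothetical, and ``ValueError`` stands in for the validation errors shown above.

```python
import json


def extract_status(text):
    """Imperative approximation of the four-step schema (illustration only)."""
    # Step 1: only accept string input.
    if not isinstance(text, str):
        raise ValueError(f"Type of {text!r} should be str")

    # Step 2: parse the string as JSON.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"Unable to parse JSON: {err}") from err

    # Step 3: require a JSON object with a "status" key that is null or an int.
    if not isinstance(data, dict) or "status" not in data:
        raise ValueError("Key 'status' not found")
    status = data["status"]
    if status is not None and not isinstance(status, int):
        raise ValueError(f"Type of {status!r} should be int or None")

    # Step 4: return the extracted value.
    return status
```

Every check and every error message has to be written by hand here, which is the manual work the declarative schema is meant to replace.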
+ +Next is the :func:`validate.parse_json() ` validation, a call to a utility function which returns +a :class:`validate.transform ` schema object that does exactly what its name suggests: it takes an input and returns +something else. In this case, strings are the input and a parsed JSON object is the output, assuming that the input +is indeed valid JSON data. + +Now we validate the parsed JSON object. We expect the JSON data to be a JSON ``object``, so we let the next validation schema +be a :func:`dict validation <_validate_dict>`. :class:`dict` validation schemas define a set of key-value pairs which +must exist in the input, unless keys are set as optional using :class:`validate.optional `. +For the sake of simplicity, this isn't the case in this example just yet. Each value of the key-value pairs is +a validation schema of its own, against which the corresponding input value is validated. + +Here, the ``"status"`` key has a :class:`validate.any ` validation schema, which is also a schema collection, similar to +:class:`validate.all `, but :class:`validate.any ` requires at least one sub-schema to be valid, not all. +Each sub-schema receives the same input, and the output of the overall schema collection is the output of the first sub-schema +that's valid. For our example, this means that the value of the ``status`` key in the JSON data must either be +``None`` (``null``) or an ``int``. + +If any of the schemas in a nested schema definition like that fails, then a validation error stack will be generated +by :class:`ValidationError <_exception.ValidationError>`, as shown above. + +The last of the four schemas in the outer :class:`validate.all ` schema collection is a :class:`validate.get ` schema. +This schema works on any kind of input which implements :func:`__getitem__()`, for example :class:`dict` objects. +And as expected, it attempts to get and return the value of the ``"status"`` key from the output of the previous :class:`dict` validation.
+The :mod:`validation ` module also supports getting multiple values at once using +the :class:`validate.union ` or :class:`validate.union_get ` schemas, but this isn't relevant here. + +Finding stream URLs in HTML +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Let's imagine a simple website where a stream URL is embedded as JSON data in a ``data-player`` attribute of an unknown +HTML element, from which the website's web player reads it. + +Extracting this data could be done by using regular expressions, but then we'd have to take HTML syntax into account, +as well as JSON syntax, which would usually be HTML-encoded in that HTML element attribute. This would make writing +a regular expression even harder, apart from the fact that the JSON data structure could easily change at any time. + +Therefore it makes much more sense to parse the HTML data, query the resulting node tree using an XPath query +to get the attribute value, then parse the JSON data and finally find and validate the stream URL. + +We also don't want to raise validation errors unnecessarily when the user inputs a URL where no video player was found, +so we can instead return an empty list of streams in our plugin implementation and let Streamlink's CLI exit gracefully. +Validation errors are only supposed to be raised when an actual error happened due to unexpected data, +not when streams are offline or inaccessible. + +Thanks to validation schemas, we can do all this declaratively, without the mess of doing it programmatically. + +.. code-block:: pycon + + >>> from streamlink.plugin.api import validate + + >>> schema = validate.Schema( + ... validate.parse_html(), + ... validate.xml_xpath_string(".//*[@data-player][1]/@data-player"), + ... validate.none_or_all( + ... validate.parse_json(), + ... { + ... validate.optional("url"): validate.url( + ... path=validate.endswith(".m3u8"), + ... ), + ... }, + ... validate.get("url"), + ... ), + ... ) + + >>> schema.validate(""" + ... + ...
+ ... """) + None + + >>> schema.validate(""" + ... + ...
+ ... ... + ...
+ ... """) + None + + >>> schema.validate(""" + ... + ...
+ ... ... + ...
+ ... """) + 'https://host/hls-playlist.m3u8' + + >>> schema.validate(""" + ... + ...
+ ... ... + ...
+ ... """) + streamlink.exceptions.PluginError: Unable to validate result: ValidationError(NoneOrAllSchema): + ValidationError(dict): + Unable to validate value of key 'url' + Context(url): + Unable to validate URL attribute 'path' + Context(endswith): + '/dash-manifest.mpd' does not end with '.m3u8' + +We start with a new :class:`Schema` and begin by parsing HTML using the :func:`validate.parse_html() ` +utility function. Similar to :func:`validate.parse_json() `, it returns a :class:`validate.transform ` +schema. :func:`validate.parse_html() ` however returns a parsed HTML node tree via Streamlink's +:ref:`lxml dependency `. + +This is followed by an XPath query schema using the :func:`validate.xml_xpath_string() ` utility function. +:func:`validate.xml_xpath_string() ` is a wrapper for :func:`validate.xml_xpath() ` which always +returns a string or ``None``, depending on the query result. This is useful for querying text contents or single attribute +values, like in this case. XPath queries on their own always return a result set, i.e. possibly multiple values, so when +trying to find single values, it is important to limit the number of potential return values to only one in the XPath query. + +The query here attempts to find any node with a ``data-player`` attribute. It then limits the result set to the first found +element and then reads the value of its ``data-player`` attribute. :func:`validate.xml_xpath_string() ` +turns this into a single string return value, or ``None`` if the query returned no value or an empty value. + +Since we now have two different paths for our overall validation schema, either no player data or still unvalidated player data, +our next schema is a :class:`validate.none_or_all ` schema. This works similarly to :class:`validate.all `, +except that ``None`` inputs are skipped and get returned immediately without validating any sub-schemas.
This lets us handle +cases where no player was found on the website, without raising a :class:`ValidationError <_exception.ValidationError>`. + +In the :class:`validate.none_or_all ` schema, we now attempt to parse JSON data, which was already shown +previously, except for the fact that we don't need to validate the ``str`` input here, as the XPath query must have already +returned a string value. + +On to the :func:`dict validation <_validate_dict>`. We're only interested in the ``url`` key. Any other keys of the input +will get ignored. Since we're aware that ``url`` can be missing if the stream is offline, we mark it as optional using the +:class:`validate.optional ` schema. This makes the :func:`dict validation <_validate_dict>` not raise an error +if it's missing, but if it's set, then its value must validate. Talking about the value, we want its value to be a URL. + +This is where the :func:`validate.url ` utility function comes in handy. It parses the input and lets us validate +any parts of the parsed URL with further validation schemas. The return value is always the full URL string. In our example, +we want to ensure that the URL's path ends with the ``".m3u8"`` string, which is an indicator for the stream being +an HLS stream, so we can pass the URL to Streamlink's :class:`HLS implementation `. + +Lastly, we simply get the ``url`` key using :class:`validate.get `. The return value must either be ``None`` if no ``url`` +key was included in the JSON data, or a ``str`` with a URL where its path ends with ``".m3u8"``. + +This means that the overall schema can only return ``None`` or said kind of URL string. If the ``url`` key is not a URL, +or if its path does not end with ``".m3u8"``, then a :class:`ValidationError <_exception.ValidationError>` is raised, +which is what we want. The ``None`` return value should then be checked accordingly by the plugin implementation. 
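The optional-key and URL-path rules described above can be approximated with the standard library. This sketch is not how :func:`validate.url` is implemented: the function name ``hls_url_or_none`` is hypothetical, ``urllib.parse.urlparse`` is a stand-in for the URL parsing, and ``ValueError`` replaces the validation error.

```python
from urllib.parse import urlparse


def hls_url_or_none(data):
    """Return the HLS URL from parsed player data, or None if it is absent."""
    # The "url" key is optional: a missing key is not an error.
    if "url" not in data:
        return None

    url = data["url"]
    if not isinstance(url, str):
        raise ValueError(f"Type of {url!r} should be str")

    # Assumption for this sketch: a URL needs at least a scheme and a host.
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"{url!r} is not a valid URL")

    # Only accept paths that look like an HLS playlist.
    if not parsed.path.endswith(".m3u8"):
        raise ValueError(f"{parsed.path!r} does not end with '.m3u8'")

    # As in the schema, the full URL string is returned, not the parsed parts.
    return url
```

As with the schema, a caller of this sketch would still need to treat a ``None`` result as "no stream available" rather than as an error.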
+ +Validating HTTP responses +^^^^^^^^^^^^^^^^^^^^^^^^^ + +In order to validate HTTP responses directly, Streamlink's :class:`HTTPSession ` allows +setting the ``schema`` keyword in :meth:`HTTPSession.request() `, +as well as in each HTTP-verb method like ``get()``, ``post()``, etc. + +Here's a simple plugin implementation with the same schema from the `Finding stream URLs in HTML`_ example above. + +.. code-block:: python + :caption: example-plugin.py + :name: example-plugin + + import re + + from streamlink.plugin import Plugin, pluginmatcher + from streamlink.plugin.api import validate + from streamlink.stream.hls import HLSStream + + + @pluginmatcher(re.compile(r"https://example\.tld/")) + class ExamplePlugin(Plugin): + def _get_streams(self): + hls_url = self.session.http.get(self.url, schema=validate.Schema( + validate.parse_html(), + validate.xml_xpath_string(".//*[@data-player][1]/@data-player"), + validate.none_or_all( + validate.parse_json(), + { + validate.optional("url"): validate.url( + path=validate.endswith(".m3u8"), + ), + }, + validate.get("url"), + ), + )) + + if not hls_url: + return None + + return HLSStream.parse_variant_playlist(self.session, hls_url) + + + __plugin__ = ExamplePlugin