scrape: extract strings from new non-text tags (#35021)

With the upgrade of beautifulsoup4 to 4.9.0 (#34007), certain tags (`<style>`, `<script>` and `<template>`) are no longer treated as having text content (see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings and the reported bug https://bugs.launchpad.net/beautifulsoup/+bug/1868861), so the content of these tags became inaccessible to HA. Where the previous code could read `.text` on the tag, bs4 4.9 now yields an empty string; these tags require accessing `.string` instead. This PR checks the tag name (which will always be lowercase given how the parser works; https://www.crummy.com/software/BeautifulSoup/bs4/doc/#other-parser-problems) and uses `.string` to get the content of these three tags. All other tags are handled as before, via `.text`.
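
The access strategy described above can be sketched in isolation. This is a minimal illustration, not the component's actual code: `FakeTag`, `NON_TEXT_TAGS` and `extract_value` are hypothetical names introduced here, and `FakeTag` only mimics the three attributes (`name`, `string`, `text`) that the change reads from a bs4 tag, with values chosen to match the bs4 4.9 behaviour the message describes.

```python
from dataclasses import dataclass
from typing import Optional

# Tag names taken from the diff; their content is exposed via .string in bs4 4.9.
NON_TEXT_TAGS = ("style", "script", "template")


@dataclass
class FakeTag:
    """Hypothetical stand-in for a bs4 tag (assumption: only these attributes matter)."""

    name: str
    string: Optional[str]
    text: str


def extract_value(tag):
    """Return tag.string for the non-text tags, tag.text for everything else."""
    if tag.name in NON_TEXT_TAGS:
        return tag.string
    return tag.text


# Values mimic the commit message: .text is empty on <script> under bs4 4.9.
script = FakeTag(name="script", string="var x = 1;", text="")
paragraph = FakeTag(name="p", string="hello", text="hello")

script_value = extract_value(script)
paragraph_value = extract_value(paragraph)
```

With a real bs4 tag, `.string` can be `None` when the tag has more than one child, so a caller may want to handle that case separately.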
2024-09-06 10:29:55 +02:00 · 2020-05-04 18:45:40 +10:00
commit 7a73c6adf7
parent 49979d0a75
1 changed file with 5 additions and 1 deletion
--- a/homeassistant/components/scrape/sensor.py
+++ b/homeassistant/components/scrape/sensor.py
@@ -132,7 +132,11 @@ class ScrapeSensor(Entity):
            if self._attr is not None:
                value = raw_data.select(self._select)[self._index][self._attr]
            else:
-                value = raw_data.select(self._select)[self._index].text
+                tag = raw_data.select(self._select)[self._index]
+                if tag.name in ("style", "script", "template"):
+                    value = tag.string
+                else:
+                    value = tag.text
            _LOGGER.debug(value)
        except IndexError:
            _LOGGER.error("Unable to extract data from HTML")