mirror of https://github.com/home-assistant/core synced 2024-09-06 10:29:55 +02:00

scrape: extract strings from new non-text tags (#35021)

With the upgrade of beautifulsoup4 to 4.9.0 (#34007), certain tags
(`<style>`, `<script>` and `<template>`) are no longer treated as having
text content (see
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings
and reported bug https://bugs.launchpad.net/beautifulsoup/+bug/1868861)
meaning the content of these types of tags became inaccessible to HA.

Where the previous code could access `.text` on the tag, bs4 4.9 now
yields an empty string; these types of tags require accessing `.string`
instead.  This PR checks the tag name (which will always be lowercase
given how the parser works;
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#other-parser-problems)
and applies this different access strategy to get the content of the
HTML tag.  All other tags are handled in the original manner.
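
The access strategy described above can be sketched in isolation. This is
only an illustration, not the integration's code: the helper name
`extract_content`, the constant `NON_TEXT_TAGS`, and the `SimpleNamespace`
stand-ins are hypothetical, and the stand-ins merely mimic the bs4 4.9
behaviour this message describes (empty `.text`, content in `.string`).

```python
from types import SimpleNamespace

# Tag names whose content bs4 4.9 no longer exposes through `.text`,
# as listed in the commit message. (Constant name is hypothetical.)
NON_TEXT_TAGS = ("style", "script", "template")


def extract_content(tag):
    """Return a tag's content, using `.string` for non-text tags.

    `tag` is any object with `name`, `string` and `text` attributes,
    such as a bs4 Tag. (Helper name is hypothetical.)
    """
    if tag.name in NON_TEXT_TAGS:
        return tag.string
    return tag.text


# Stand-ins, NOT real bs4 objects: they imitate the behaviour described
# in the commit message so the sketch runs without bs4 installed.
script_tag = SimpleNamespace(name="script", string="var x = 1;", text="")
div_tag = SimpleNamespace(name="div", string="hello", text="hello")

print(extract_content(script_tag))
print(extract_content(div_tag))
```

With real bs4 objects, the element returned by `select(...)[index]` would
be passed in place of the stand-ins. Note that `.string` is `None` when a
tag has no single string child, so a caller may need to handle that case.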
David Beitey 2020-05-04 18:45:40 +10:00 committed by GitHub
parent 49979d0a75
commit 7a73c6adf7

@@ -132,7 +132,11 @@ class ScrapeSensor(Entity):
             if self._attr is not None:
                 value = raw_data.select(self._select)[self._index][self._attr]
             else:
-                value = raw_data.select(self._select)[self._index].text
+                tag = raw_data.select(self._select)[self._index]
+                if tag.name in ("style", "script", "template"):
+                    value = tag.string
+                else:
+                    value = tag.text
             _LOGGER.debug(value)
         except IndexError:
             _LOGGER.error("Unable to extract data from HTML")