Behold, mortal, the origins of Beautiful Soup...
================================================
Leonard Richardson is the primary programmer.
Aaron DeVore is awesome.
Mark Pilgrim provided the encoding detection code that forms the base
of UnicodeDammit.
Thomas Kluyver and Ezio Melotti finished the work of getting Beautiful
Soup 4 working under Python 3.
Simon Willison wrote soupselect, which was used to make Beautiful Soup
support CSS selectors.
Sam Ruby helped with a lot of edge cases.
Jonathan Ellis was awarded the prestigious Beau Potage D'Or for his
work in solving the nestable tags conundrum.
An incomplete list of people who have contributed patches to Beautiful
Soup:
Istvan Albert, Andrew Lin, Anthony Baxter, Andrew Boyko, Tony Chang,
Zephyr Fang, Fuzzy, Roman Gaufman, Yoni Gilad, Richie Hindle, Peteris
Krumins, Kent Johnson, Ben Last, Robert Leftwich, Staffan Malmgren,
Ksenia Marasanova, JP Moins, Adam Monsen, John Nagle, "Jon", Ed
Oskiewicz, Greg Phillips, Giles Radford, Arthur Rudolph, Marko
Samastur, Jouni Seppänen, Alexander Schmolck, Andy Theyers, Glyn
Webster, Paul Wright, Danny Yoo
An incomplete list of people who made suggestions or found bugs or
found ways to break Beautiful Soup:
Hanno Böck, Matteo Bertini, Chris Curvey, Simon Cusack, Bruce Eckel,
Matt Ernst, Michael Foord, Tom Harris, Bill de hOra, Donald Howes,
Matt Patterson, Scott Roberts, Steve Strassmann, Mike Williams,
warchild at redho dot com, Sami Kuisma, Carlos Rocha, Bob Hutchison,
Joren Mc, Michal Migurski, John Kleven, Tim Heaney, Tripp Lilley, Ed
Summers, Dennis Sutch, Chris Smith, Aaron Sweep^W Swartz, Stuart
Turner, Greg Edwards, Kevin J Kalupson, Nikos Kouremenos, Artur de
Sousa Rocha, Yichun Wei, Per Vognsen
Beautiful Soup is made available under the MIT license:
Copyright (c) 2004-2018 Leonard Richardson
Copyright (c) 2004-2019 Leonard Richardson
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
......@@ -25,3 +25,6 @@ Beautiful Soup is made available under the MIT license:
Beautiful Soup incorporates code from the html5lib library, which is
also made available under the MIT license. Copyright (c) 2006-2013
James Graham and other contributors
Beautiful Soup depends on the soupsieve library, which is also made
available under the MIT license. Copyright (c) 2018 Isaac Muse
= 4.7.1 (20190106)
* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]
* Fixed an incorrectly raised exception when inserting a tag before or
after an identical tag. [bug=1810692]
* Beautiful Soup will no longer try to keep track of namespaces that
are not defined with a prefix; this can confuse soupsieve. [bug=1810680]
* Tried even harder to avoid the deprecation warning originally fixed in
4.6.1. [bug=1778909]
= 4.7.0 (20181231)
* Beautiful Soup's CSS Selector implementation has been replaced by a
dependency on Isaac Muse's SoupSieve project (the soupsieve package
on PyPI). The good news is that SoupSieve has a much more robust and
complete implementation of CSS selectors, resolving a large number
of longstanding issues. The bad news is that from this point onward,
SoupSieve must be installed if you want to use the select() method.
You don't have to change anything if you installed Beautiful Soup
through pip (SoupSieve will be automatically installed when you
upgrade Beautiful Soup) or if you don't use CSS selectors from
within Beautiful Soup.
SoupSieve documentation: https://facelessuser.github.io/soupsieve/
* Added the PageElement.extend() method, which works like list.extend().
[bug=1514970]
* PageElement.insert_before() and insert_after() now take a variable
number of arguments. [bug=1514970]
* Fixed a number of problems with the tree builder that produced
trees that were superficially okay but fell apart when bits
were extracted. Patch by Isaac Muse. [bug=1782928,1809910]
* Fixed a problem with the tree builder in which elements that
contained no content (such as empty comments and all-whitespace
elements) were not being treated as part of the tree. Patch by Isaac
Muse. [bug=1798699]
* Fixed a problem with multi-valued attributes where the value
contained whitespace. Thanks to Jens Svalgaard for the
fix. [bug=1787453]
* Clarified ambiguous license statements in the source code. Beautiful
Soup is released under the MIT license, and has been since 4.4.0.
* This file has been renamed from NEWS.txt to CHANGELOG.
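As a rough illustration of the API-facing entries above, the following sketch exercises `extend()`, the variadic `insert_before()`, and `select()`. It assumes Beautiful Soup 4.7.0 or later with SoupSieve installed, and uses the standard-library `html.parser` builder; the markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Assumption: bs4 >= 4.7.0 and soupsieve are installed.
soup = BeautifulSoup("<p><b>one</b></p>", "html.parser")

# Tag.extend() appends every item of an iterable, like list.extend().
soup.p.extend([" two", " three"])

# insert_before() now accepts several tags or strings in one call;
# they end up in the order given, immediately before <b>.
soup.b.insert_before("zero ", soup.new_tag("i"))

# select() is delegated to SoupSieve as of 4.7.0.
matches = soup.select("p > b")

result = soup.decode()
print(result)
```

If SoupSieve is missing, only the `select()` call should be affected; the tree-modification calls do not depend on it.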
= 4.6.3 (20180812)
* Exactly the same as 4.6.2. Re-released to make the README file
......
Metadata-Version: 2.1
Name: beautifulsoup4
Version: 4.6.3
Version: 4.7.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
......@@ -58,7 +58,7 @@ Description: Beautiful Soup is a library that makes it easy to scrape informatio
* [Discussion group](http://groups.google.com/group/beautifulsoup/)
* [Development](https://code.launchpad.net/beautifulsoup/)
* [Bug tracker](https://bugs.launchpad.net/beautifulsoup/)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG)
# Building the documentation
......
......@@ -49,7 +49,7 @@ To go beyond the basics, [comprehensive documentation is available](http://www.c
* [Discussion group](http://groups.google.com/group/beautifulsoup/)
* [Development](https://code.launchpad.net/beautifulsoup/)
* [Bug tracker](https://bugs.launchpad.net/beautifulsoup/)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG)
# Building the documentation
......
Metadata-Version: 2.1
Name: beautifulsoup4
Version: 4.6.3
Version: 4.7.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
......@@ -58,7 +58,7 @@ Description: Beautiful Soup is a library that makes it easy to scrape informatio
* [Discussion group](http://groups.google.com/group/beautifulsoup/)
* [Development](https://code.launchpad.net/beautifulsoup/)
* [Bug tracker](https://bugs.launchpad.net/beautifulsoup/)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt)
* [Complete changelog](https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/CHANGELOG)
# Building the documentation
......
AUTHORS.txt
COPYING.txt
LICENSE
MANIFEST.in
......
soupsieve>=1.2
[html5lib]
html5lib
......
......@@ -17,12 +17,10 @@ http://www.crummy.com/software/BeautifulSoup/bs4/doc/
"""
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.6.3"
__copyright__ = "Copyright (c) 2004-2018 Leonard Richardson"
__version__ = "4.7.1"
__copyright__ = "Copyright (c) 2004-2019 Leonard Richardson"
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
__all__ = ['BeautifulSoup']
......@@ -237,10 +235,11 @@ class BeautifulSoup(Tag):
self.builder = builder
self.is_xml = builder.is_xml
self.known_xml = self.is_xml
self.builder.soup = self
self._namespaces = dict()
self.parse_only = parse_only
self.builder.initialize_soup(self)
if hasattr(markup, 'read'): # It's a file-type object.
markup = markup.read()
elif len(markup) <= 256 and (
......@@ -382,7 +381,7 @@ class BeautifulSoup(Tag):
def pushTag(self, tag):
#print "Push", tag.name
if self.currentTag:
if self.currentTag is not None:
self.currentTag.contents.append(tag)
self.tagStack.append(tag)
self.currentTag = self.tagStack[-1]
......@@ -421,60 +420,71 @@ class BeautifulSoup(Tag):
def object_was_parsed(self, o, parent=None, most_recent_element=None):
"""Add an object to the parse tree."""
parent = parent or self.currentTag
previous_element = most_recent_element or self._most_recent_element
if parent is None:
parent = self.currentTag
if most_recent_element is not None:
previous_element = most_recent_element
else:
previous_element = self._most_recent_element
next_element = previous_sibling = next_sibling = None
if isinstance(o, Tag):
next_element = o.next_element
next_sibling = o.next_sibling
previous_sibling = o.previous_sibling
if not previous_element:
if previous_element is None:
previous_element = o.previous_element
fix = parent.next_element is not None
o.setup(parent, previous_element, next_element, previous_sibling, next_sibling)
self._most_recent_element = o
parent.contents.append(o)
if parent.next_sibling:
# This node is being inserted into an element that has
# already been parsed. Deal with any dangling references.
index = len(parent.contents)-1
while index >= 0:
if parent.contents[index] is o:
break
index -= 1
else:
raise ValueError(
"Error building tree: supposedly %r was inserted "
"into %r after the fact, but I don't see it!" % (
o, parent
)
)
if index == 0:
previous_element = parent
previous_sibling = None
else:
previous_element = previous_sibling = parent.contents[index-1]
if index == len(parent.contents)-1:
next_element = parent.next_sibling
next_sibling = None
else:
next_element = next_sibling = parent.contents[index+1]
o.previous_element = previous_element
if previous_element:
previous_element.next_element = o
o.next_element = next_element
if next_element:
next_element.previous_element = o
o.next_sibling = next_sibling
if next_sibling:
next_sibling.previous_sibling = o
o.previous_sibling = previous_sibling
if previous_sibling:
previous_sibling.next_sibling = o
# Check if we are inserting into an already parsed node.
if fix:
self._linkage_fixer(parent)
def _linkage_fixer(self, el):
"""Make sure linkage of this fragment is sound."""
first = el.contents[0]
child = el.contents[-1]
descendant = child
if child is first and el.parent is not None:
# Parent should be linked to first child
el.next_element = child
# We are no longer linked to whatever this element is
prev_el = child.previous_element
if prev_el is not None and prev_el is not el:
prev_el.next_element = None
# First child should be linked to the parent, and no previous siblings.
child.previous_element = el
child.previous_sibling = None
# We have no sibling as we've been appended as the last.
child.next_sibling = None
# This index is a tag, dig deeper for a "last descendant"
if isinstance(child, Tag) and child.contents:
descendant = child._last_descendant(False)
# As the final step, link last descendant. It should be linked
# to the parent's next sibling (if found), else walk up the chain
# and find a parent with a sibling. It should have no next sibling.
descendant.next_element = None
descendant.next_sibling = None
target = el
while True:
if target is None:
break
elif target.next_sibling is not None:
descendant.next_element = target.next_sibling
target.next_sibling.previous_element = child
break
target = target.parent
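One way to sanity-check the kind of linkage that `_linkage_fixer` repairs is to compare two traversals of the same tree: one that follows `next_element` links (which is what `.descendants` does) and one that recurses through `.contents`. This is only a sketch, assuming bs4 is installed; `walk_contents` and `linkage_is_sound` are hypothetical helpers and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def walk_contents(node):
    """Yield descendants in document order using only .contents."""
    if isinstance(node, Tag):
        for child in node.contents:
            yield child
            yield from walk_contents(child)


def linkage_is_sound(root):
    """True if the next_element chain agrees with the .contents tree."""
    by_links = list(root.descendants)        # follows next_element
    by_contents = list(walk_contents(root))  # follows parent/child lists
    return len(by_links) == len(by_contents) and all(
        a is b for a, b in zip(by_links, by_contents)
    )


soup = BeautifulSoup("<div><p>a</p><b>x</b><p>b</p></div>", "html.parser")
sound_after_parse = linkage_is_sound(soup)

# Extraction is the operation that exposed the broken links in the bug reports.
soup.b.extract()
sound_after_extract = linkage_is_sound(soup)
```

A mismatch between the two traversals would indicate dangling `next_element` or `previous_element` references of the sort described in the comments above.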
def _popToTag(self, name, nsprefix=None, inclusivePop=True):
"""Pops the tag stack up to and including the most recent
......@@ -520,7 +530,7 @@ class BeautifulSoup(Tag):
self.currentTag, self._most_recent_element)
if tag is None:
return tag
if self._most_recent_element:
if self._most_recent_element is not None:
self._most_recent_element.next_element = tag
self._most_recent_element = tag
self.pushTag(tag)
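The switch from truthiness tests to `is not None` in these hunks matters for string-like nodes: an empty string is falsy, so `if node:` silently skips it. The changelog mentions empty comments as one affected case. The sketch below reproduces the effect with a plain `str` subclass; `FakeString`, `link_truthy`, and `link_explicit` are hypothetical stand-ins, not Beautiful Soup classes or functions.

```python
class FakeString(str):
    """Stand-in for a string-like tree node (not a bs4 class)."""


def link_truthy(previous, new):
    # Old style: an empty string node is falsy, so the link is never made.
    if previous:
        previous.next_element = new
        return True
    return False


def link_explicit(previous, new):
    # New style: only a missing node (None) skips the link.
    if previous is not None:
        previous.next_element = new
        return True
    return False


empty_comment = FakeString("")
skipped = link_truthy(empty_comment, "next node")
linked = link_explicit(empty_comment, "next node")
```

With the truthiness test, the empty node is left out of the chain; with the explicit `None` check, it is linked like any other node.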
......
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
from collections import defaultdict
import itertools
......@@ -8,7 +8,7 @@ from bs4.element import (
CharsetMetaAttributeValue,
ContentMetaAttributeValue,
HTMLAwareEntitySubstitution,
whitespace_re
nonwhitespace_re
)
__all__ = [
......@@ -102,6 +102,12 @@ class TreeBuilder(object):
def __init__(self):
self.soup = None
def initialize_soup(self, soup):
"""The BeautifulSoup object has been initialized and is now
being associated with the TreeBuilder.
"""
self.soup = soup
def reset(self):
pass
......@@ -167,7 +173,7 @@ class TreeBuilder(object):
# values. Split it into a list.
value = attrs[attr]
if isinstance(value, basestring):
values = whitespace_re.split(value)
values = nonwhitespace_re.findall(value)
else:
# html5lib sometimes calls setAttributes twice
# for the same tag when rearranging the parse
......
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
__all__ = [
'HTML5TreeBuilder',
......@@ -15,7 +15,7 @@ from bs4.builder import (
)
from bs4.element import (
NamespacedAttribute,
whitespace_re,
nonwhitespace_re,
)
import html5lib
from html5lib.constants import (
......@@ -206,7 +206,7 @@ class AttrList(object):
# A node that is being cloned may have already undergone
# this procedure.
if not isinstance(value, list):
value = whitespace_re.split(value)
value = nonwhitespace_re.findall(value)
self.element[name] = value
def items(self):
return list(self.attrs.items())
......@@ -249,7 +249,7 @@ class Element(treebuilder_base.Node):
if not isinstance(child, basestring) and child.parent is not None:
node.element.extract()
if (string_child and self.element.contents
if (string_child is not None and self.element.contents
and self.element.contents[-1].__class__ == NavigableString):
# We are appending a string onto another string.
# TODO This has O(n^2) performance, for input like
......@@ -360,16 +360,16 @@ class Element(treebuilder_base.Node):
# Set the first child's previous_element and previous_sibling
# to elements within the new parent
first_child = to_append[0]
if new_parents_last_descendant:
if new_parents_last_descendant is not None:
first_child.previous_element = new_parents_last_descendant
else:
first_child.previous_element = new_parent_element
first_child.previous_sibling = new_parents_last_child
if new_parents_last_descendant:
if new_parents_last_descendant is not None:
new_parents_last_descendant.next_element = first_child
else:
new_parent_element.next_element = first_child
if new_parents_last_child:
if new_parents_last_child is not None:
new_parents_last_child.next_sibling = first_child
# Find the very last element being moved. It is now the
......@@ -379,7 +379,7 @@ class Element(treebuilder_base.Node):
last_childs_last_descendant = to_append[-1]._last_descendant(False, True)
last_childs_last_descendant.next_element = new_parents_last_descendant_next_element
if new_parents_last_descendant_next_element:
if new_parents_last_descendant_next_element is not None:
# TODO: This code has no test coverage and I'm not sure
# how to get html5lib to go through this path, but it's
# just the other side of the previous line.
......
# encoding: utf-8
"""Use the HTMLParser library to parse HTML files that aren't too bad."""
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
__all__ = [
'HTMLParserTreeBuilder',
......
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
__all__ = [
'LXMLTreeBuilderForXML',
'LXMLTreeBuilder',
......@@ -32,6 +33,10 @@ from bs4.dammit import EncodingDetector
LXML = 'lxml'
def _invert(d):
"Invert a dictionary."
return dict((v,k) for k, v in d.items())
class LXMLTreeBuilderForXML(TreeBuilder):
DEFAULT_PARSER_CLASS = etree.XMLParser
......@@ -48,7 +53,29 @@ class LXMLTreeBuilderForXML(TreeBuilder):
# This namespace mapping is specified in the XML Namespace
# standard.
DEFAULT_NSMAPS = {'http://www.w3.org/XML/1998/namespace' : "xml"}
DEFAULT_NSMAPS = dict(xml='http://www.w3.org/XML/1998/namespace')
DEFAULT_NSMAPS_INVERTED = _invert(DEFAULT_NSMAPS)
def initialize_soup(self, soup):
"""Let the BeautifulSoup object know about the standard namespace
mapping.
"""
super(LXMLTreeBuilderForXML, self).initialize_soup(soup)
self._register_namespaces(self.DEFAULT_NSMAPS)
def _register_namespaces(self, mapping):
"""Let the BeautifulSoup object know about namespaces encountered
while parsing the document.
This might be useful later on when creating CSS selectors.
"""
for key, value in mapping.items():
if key and key not in self.soup._namespaces:
# Let the BeautifulSoup object know about a new namespace.
# If there are multiple namespaces defined with the same
# prefix, the first one in the document takes precedence.
self.soup._namespaces[key] = value
def default_parser(self, encoding):
# This can either return a parser object or a class, which
......@@ -75,8 +102,8 @@ class LXMLTreeBuilderForXML(TreeBuilder):
if empty_element_tags is not None:
self.empty_element_tags = set(empty_element_tags)
self.soup = None
self.nsmaps = [self.DEFAULT_NSMAPS]
self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
def _getNsTag(self, tag):
# Split the namespace URL out of a fully-qualified lxml tag
# name. Copied from lxml's src/lxml/sax.py.
......@@ -144,7 +171,7 @@ class LXMLTreeBuilderForXML(TreeBuilder):
raise ParserRejectedMarkup(str(e))
def close(self):
self.nsmaps = [self.DEFAULT_NSMAPS]
self.nsmaps = [self.DEFAULT_NSMAPS_INVERTED]
def start(self, name, attrs, nsmap={}):
# Make sure attrs is a mutable dict--lxml may send an immutable dictproxy.
......@@ -158,8 +185,14 @@ class LXMLTreeBuilderForXML(TreeBuilder):
self.nsmaps.append(None)
elif len(nsmap) > 0:
# A new namespace mapping has come into play.
inverted_nsmap = dict((value, key) for key, value in nsmap.items())
self.nsmaps.append(inverted_nsmap)
# First, let the BeautifulSoup object know about it.
self._register_namespaces(nsmap)
# Then, add it to our running list of inverted namespace
# mappings.
self.nsmaps.append(_invert(nsmap))
# Also treat the namespace mapping as a set of attributes on the
# tag, so we can recreate it later.
attrs = attrs.copy()
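The namespace bookkeeping in this hunk can be summarised with plain dictionaries. The sketch below is a standalone approximation, not the builder's code: `invert` and `register_namespaces` are local helpers modelled on `_invert` and `_register_namespaces`, and the `None` key for an un-prefixed namespace follows lxml's `nsmap` convention.

```python
def invert(d):
    """Swap keys and values (prefix -> URL becomes URL -> prefix)."""
    return {v: k for k, v in d.items()}


def register_namespaces(known, mapping):
    """Record prefixed namespaces; the first definition of a prefix wins."""
    for prefix, url in mapping.items():
        # Un-prefixed (default) namespaces have a falsy key and are skipped.
        if prefix and prefix not in known:
            known[prefix] = url


known = {"xml": "http://www.w3.org/XML/1998/namespace"}
register_namespaces(known, {None: "http://unprefixed-namespace.com"})
register_namespaces(known, {"prefix": "http://prefixed-namespace.com"})
register_namespaces(known, {"prefix": "http://other.example/"})  # ignored

inverted = invert({"prefix": "http://prefixed-namespace.com"})
```

The prefix-to-URL mapping is the form a CSS selector needs, while the inverted URL-to-prefix mapping is the form the builder keeps on its `nsmaps` stack.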
......
......@@ -6,8 +6,7 @@ necessary. It is heavily based on code from Mark Pilgrim's Universal
Feed Parser. It works best on XML and HTML, but it does not rewrite the
XML or HTML to reflect a new encoding; that's the tree builder's job.
"""
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
import codecs
......
"""Diagnostic functions, mainly for use when doing tech support."""
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.
# Use of this source code is governed by the MIT license.
__license__ = "MIT"
import cProfile
......
......@@ -128,3 +128,43 @@ class HTML5LibBuilderSmokeTest(SoupTest, HTML5TreeBuilderSmokeTest):
markup = b"""<table><td></tbody>A"""
soup = self.soup(markup)
self.assertEqual(u"<body>A<table><tbody><tr><td></td></tr></tbody></table></body>", soup.body.decode())
def test_extraction(self):
"""
Test that extraction does not destroy the tree.
https://bugs.launchpad.net/beautifulsoup/+bug/1782928
"""
markup = """
<html><head></head>
<style>
</style><script></script><body><p>hello</p></body></html>
"""
soup = self.soup(markup)
[s.extract() for s in soup('script')]
[s.extract() for s in soup('style')]
self.assertEqual(len(soup.find_all("p")), 1)
def test_empty_comment(self):
"""
Test that empty comment does not break structure.
https://bugs.launchpad.net/beautifulsoup/+bug/1806598
"""
markup = """
<html>
<body>
<form>
<!----><input type="text">
</form>
</body>
</html>
"""
soup = self.soup(markup)
inputs = []
for form in soup.find_all('form'):
inputs.extend(form.find_all('input'))
self.assertEqual(len(inputs), 1)
......@@ -80,3 +80,21 @@ class LXMLXMLTreeBuilderSmokeTest(SoupTest, XMLTreeBuilderSmokeTest):
@property
def default_builder(self):
return LXMLTreeBuilderForXML()
def test_namespace_indexing(self):
# We should not track un-prefixed namespaces as we can only hold one
# and it will be recognized as the default namespace by soupsieve,
# which may be confusing in some situations. When no namespace is provided
# for a selector, the default namespace (if defined) is assumed.
soup = self.soup(
'<?xml version="1.1"?>\n'
'<root>'
'<tag xmlns="http://unprefixed-namespace.com">content</tag>'
'<prefix:tag xmlns:prefix="http://prefixed-namespace.com">content</tag>'
'</root>'
)
self.assertEqual(
soup._namespaces,
{'xml': 'http://www.w3.org/XML/1998/namespace', 'prefix': 'http://prefixed-namespace.com'}
)
# -*- coding: utf-8 -*-
"""Tests for Beautiful Soup's tree traversal methods.
......@@ -932,6 +931,13 @@ class TestTreeModification(SoupTest):
soup.a.append(soup.b)
self.assertEqual(data, soup.decode())
def test_extend(self):
data = "<a><b><c><d><e><f><g></g></f></e></d></c></b></a>"
soup = self.soup(data)
l = [soup.g, soup.f, soup.e, soup.d, soup.c, soup.b]
soup.a.extend(l)
self.assertEqual("<a><g></g><f></f><e></e><d></d><c></c><b></b></a>", soup.decode())
def test_move_tag_to_beginning_of_parent(self):
data = "<a><b></b><c></c><d></d></a>"
soup = self.soup(data)
......@@ -958,6 +964,29 @@ class TestTreeModification(SoupTest):
self.assertEqual(
soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))
# Can't insert an element before itself.
b = soup.b
self.assertRaises(ValueError, b.insert_before, b)
# Can't insert before if an element has no parent.
b.extract()
self.assertRaises(ValueError, b.insert_before, "nope")
# Can insert an identical element
soup = self.soup("<a>")
soup.a.insert_before(soup.new_tag("a"))
def test_insert_multiple_before(self):
soup = self.soup("<a>foo</a><b>bar</b>")
soup.b.insert_before("BAZ", " ", "QUUX")
soup.a.insert_before("QUUX", " ", "BAZ")
self.assertEqual(
soup.decode(), self.document_for("QUUX BAZ<a>foo</a>BAZ QUUX<b>bar</b>"))
soup.a.insert_before(soup.b, "FOO")
self.assertEqual(
soup.decode(), self.document_for("QUUX BAZ<b>bar</b>FOO<a>foo</a>BAZ QUUX"))
def test_insert_after(self):
soup = self.soup("<a>foo</a><b>bar</b>")
soup.b.insert_after("BAZ")
......@@ -968,6 +997,28 @@ class TestTreeModification(SoupTest):
self.assertEqual(
soup.decode(), self.document_for("QUUX<b>bar</b><a>foo</a>BAZ"))
# Can't insert an element after itself.
b = soup.b
self.assertRaises(ValueError, b.insert_after, b)
# Can't insert after if an element has no parent.
b.extract()
self.assertRaises(ValueError, b.insert_after, "nope")
# Can insert an identical element
soup = self.soup("<a>")
soup.a.insert_after(soup.new_tag("a"))
def test_insert_multiple_after(self):
soup = self.soup("<a>foo</a><b>bar</b>")
soup.b.insert_after("BAZ", " ", "QUUX")
soup.a.insert_after("QUUX", " ", "BAZ")
self.assertEqual(
soup.decode(), self.document_for("<a>foo</a>QUUX BAZ<b>bar</b>BAZ QUUX"))
soup.b.insert_after(soup.a, "FOO ")
self.assertEqual(
soup.decode(), self.document_for("QUUX BAZ<b>bar</b><a>foo</a>FOO BAZ QUUX"))
def test_insert_after_raises_exception_if_after_has_no_meaning(self):
soup = self.soup("")
tag = soup.new_tag("a")
......@@ -1783,7 +1834,7 @@ class TestSoupSelector(TreeTest):
self.assertEqual(len(self.soup.select('del')), 0)
def test_invalid_tag(self):
self.assertRaises(ValueError, self.soup.select, 'tag%t')
self.assertRaises(SyntaxError, self.soup.select, 'tag%t')
def test_select_dashed_tag_ids(self):
self.assertSelects('custom-dashed-tag', ['dash1', 'dash2'])
......@@ -1974,8 +2025,7 @@ class TestSoupSelector(TreeTest):
NotImplementedError, self.soup.select, "a:no-such-pseudoclass")
self.assertRaises(
NotImplementedError, self.soup.select, "a:nth-of-type(a)")
SyntaxError, self.soup.select, "a:nth-of-type(a)")
def test_nth_of_type(self):
# Try to select first paragraph
......@@ -1992,9 +2042,9 @@ class TestSoupSelector(TreeTest):
els = self.soup.select('div#inner p:nth-of-type(4)')
self.assertEqual(len(els), 0)
# Pass in an invalid value.
self.assertRaises(
ValueError, self.soup.select, 'div p:nth-of-type(0)')
# Zero will select no tags.
els = self.soup.select('div p:nth-of-type(0)')
self.assertEqual(len(els), 0)
def test_nth_of_type_direct_descendant(self):
els = self.soup.select('div#inner > p:nth-of-type(1)')
......@@ -2031,7 +2081,7 @@ class TestSoupSelector(TreeTest):
self.assertEqual([], self.soup.select('#inner ~ h2'))
def test_dangling_combinator(self):
self.assertRaises(ValueError, self.soup.select, 'h1 >')
self.assertRaises(SyntaxError, self.soup.select, 'h1 >')
def test_sibling_combinator_wont_select_same_tag_twice(self):
self.assertSelects('p[lang] ~ p', ['lang-en-gb', 'lang-en-us', 'lang-fr'])
......@@ -2062,8 +2112,8 @@ class TestSoupSelector(TreeTest):
self.assertSelects('div x,y, z', ['xid', 'yid', 'zida', 'zidb', 'zidab', 'zidac'])
def test_invalid_multiple_select(self):
self.assertRaises(ValueError, self.soup.select, ',x, y')
self.assertRaises(ValueError, self.soup.select, 'x,,y')
self.assertRaises(SyntaxError, self.soup.select, ',x, y')
self.assertRaises(SyntaxError, self.soup.select, 'x,,y')
def test_multiple_select_attrs(self):
self.assertSelects('p[lang=en], p[lang=en-gb]', ['lang-en', 'lang-en-gb'])
......@@ -2087,4 +2137,3 @@ class TestSoupSelector(TreeTest):
# order.
for element in soup.find_all(class_=['c1', 'c2']):
assert element in selected
beautifulsoup4 (4.7.1-1) unstable; urgency=medium
* New upstream release.
* Now depends on {python,python3,pypy}-soupsieve.
- Tag test dependencies <!nocheck>, to keep the stack bootstrappable.
* Bump copyright years.
* Test with html5lib, where possible.
-- Stefano Rivera <stefanor@debian.org> Sat, 02 Feb 2019 11:14:00 +0100
beautifulsoup4 (4.6.3-2) unstable; urgency=medium
[ Ondřej Nový ]
......
......@@ -8,13 +8,18 @@ Build-Depends:
dh-python,
pypy (>= 1.7),
pypy-setuptools,
pypy-soupsieve <!nocheck>,
python-all,
python-lxml,
python-html5lib <!nocheck>,
python-lxml <!nocheck>,
python-setuptools,
python3-sphinx,
python-soupsieve <!nocheck>,
python3-all (>= 3.1.2),
python3-lxml,
python3-setuptools
python3-html5lib <!nocheck>,
python3-lxml <!nocheck>,
python3-setuptools,
python3-soupsieve <!nocheck>,
python3-sphinx
Standards-Version: 4.3.0
Homepage: https://www.crummy.com/software/BeautifulSoup
Vcs-Git: https://salsa.debian.org/python-team/modules/beautifulsoup4.git
......
......@@ -4,7 +4,7 @@ Upstream-Contact: Leonard Richardson <leonardr@segfault.org>
Source: https://launchpad.net/beautifulsoup
Files: *
Copyright: 2004-2018, Leonard Richardson
Copyright: 2004-2019, Leonard Richardson
License: Expat
Comment:
Beautiful Soup incorporates code from the html5lib library, which is also made
......@@ -18,7 +18,7 @@ License: public-domain
Files: debian/*
Copyright:
2005-2009, Decklin Foster <decklin@red-bean.com>
2011-2018, Stefano Rivera <stefanor@debian.org>
2011-2019, Stefano Rivera <stefanor@debian.org>
License: Expat
License: Expat
......
......@@ -11,10 +11,10 @@ This patch can be dropped when Debian stops supporting Python 3.6.
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/bs4/diagnose.py b/bs4/diagnose.py
index 7a28c09..088e709 100644
index f9835c3..f15e4ad 100644
--- a/bs4/diagnose.py
+++ b/bs4/diagnose.py
@@ -155,7 +155,7 @@ def rword(length=5):
@@ -154,7 +154,7 @@ def rword(length=5):
def rsentence(length=4):
"Generate a random sentence-like string."
......
Tests: unittests
Depends: python-all, python-bs4, python-lxml, python-nose (>= 1.3)
# Currently no pypy-nose (or pypy-lxml)
#Tests: unittests-pypy
#Depends: pypy, pypy-bs4, pypy-lxml, pypy-nose
Depends:
python-all,
python-bs4,
python-html5lib,
python-lxml,
python-nose (>= 1.3)
Tests: unittests3
Depends: python3-all, python3-bs4, python3-lxml, python3-nose (>= 1.3)
Depends:
python3-all,
python3-bs4,
python3-html5lib,
python3-lxml,
python3-nose (>= 1.3)
Tests: unittests-pypy
Restrictions: skip-not-installable
Depends: pypy, pypy-bs4, pypy-html5lib, pypy-lxml, pypy-nose
......@@ -1662,9 +1662,22 @@ tag it contains.
CSS selectors
-------------
Beautiful Soup supports the most commonly-used CSS selectors. Just
pass a string into the ``.select()`` method of a ``Tag`` object or the
``BeautifulSoup`` object itself.
As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via
the `SoupSieve <https://facelessuser.github.io/soupsieve/>`_
project. If you installed Beautiful Soup through ``pip``, SoupSieve
was installed at the same time, so you don't have to do anything extra.
``BeautifulSoup`` has a ``.select()`` method which uses SoupSieve to
run a CSS selector against a parsed document and return all the
matching elements. ``Tag`` has a similar method which runs a CSS
selector against the contents of a single tag.
(Earlier versions of Beautiful Soup also have the ``.select()``
method, but only the most commonly-used CSS selectors are supported.)
The SoupSieve `documentation
<https://facelessuser.github.io/soupsieve/>`_ lists all the currently
supported CSS selectors, but here are some of the basics:
You can find tags::
......@@ -1761,31 +1774,42 @@ Find tags by attribute value::
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
Match language codes::
multilingual_markup = """
<p lang="en">Hello</p>
<p lang="en-us">Howdy, y'all</p>
<p lang="en-gb">Pip-pip, old fruit</p>
<p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
# <p lang="en-us">Howdy, y'all</p>,
# <p lang="en-gb">Pip-pip, old fruit</p>]
Find only the first tag that matches a selector::
There's also a method called ``select_one()``, which finds only the
first tag that matches a selector::
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
This is all a convenience for users who know the CSS selector syntax. You
can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well use lxml directly: it's
a lot faster, and it supports more CSS selectors. But this lets you
`combine` simple CSS selectors with the Beautiful Soup API.
If you've parsed XML that defines namespaces, you can use them in CSS
selectors::
from bs4 import BeautifulSoup
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
<ns1:child>I'm in namespace 1</ns1:child>
<ns2:child>I'm in namespace 2</ns2:child>
</tag> """
soup = BeautifulSoup(xml, "xml")
soup.select("child")
# [<ns1:child>I'm in namespace 1</ns1:child>, <ns2:child>I'm in namespace 2</ns2:child>]
soup.select("ns1|child")
# [<ns1:child>I'm in namespace 1</ns1:child>]
When handling a CSS selector that uses namespaces, Beautiful Soup
uses the namespace abbreviations it found when parsing the
document. You can override this by passing in your own dictionary of
abbreviations::
namespaces = dict(first="http://namespace1/", second="http://namespace2/")
soup.select("second|child", namespaces=namespaces)
# [<ns2:child>I'm in namespace 2</ns2:child>]
All this CSS selector stuff is a convenience for people who already
know the CSS selector syntax. You can do all of this with the
Beautiful Soup API. And if CSS selectors are all you need, you should
parse the document with lxml: it's a lot faster. But this lets you
`combine` CSS selectors with the Beautiful Soup API.
Modifying the tree
==================
......@@ -1846,6 +1870,21 @@ like calling ``.append()`` on a Python list::
soup.a.contents
# [u'Foo', u'Bar']
``extend()``
------------
Starting in Beautiful Soup 4.7.0, ``Tag`` also supports a method
called ``.extend()``, which works just like calling ``.extend()`` on a
Python list::
soup = BeautifulSoup("<a>Soup</a>")
soup.a.extend(["'s", " ", "on"])
soup
# <html><head></head><body><a>Soup's on</a></body></html>
soup.a.contents
# [u'Soup', u"'s", u' ', u'on']
``NavigableString()`` and ``.new_tag()``
-------------------------------------------------
......@@ -1914,7 +1953,7 @@ say. It works just like ``.insert()`` on a Python list::
``insert_before()`` and ``insert_after()``
------------------------------------------
The ``insert_before()`` method inserts a tag or string immediately
The ``insert_before()`` method inserts tags or strings immediately
before something else in the parse tree::
soup = BeautifulSoup("<b>stop</b>")
......@@ -1924,14 +1963,16 @@ before something else in the parse tree::
soup.b
# <b><i>Don't</i>stop</b>
The ``insert_after()`` method moves a tag or string so that it
immediately follows something else in the parse tree::
The ``insert_after()`` method inserts tags or strings immediately
following something else in the parse tree::
soup.b.i.insert_after(soup.new_string(" ever "))
div = soup.new_tag('div')
div.string = 'ever'
soup.b.i.insert_after(" you ", div)
soup.b
# <b><i>Don't</i> ever stop</b>
# <b><i>Don't</i> you <div>ever</div>stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
# [<i>Don't</i>, u' you ', <div>ever</div>, u'stop']
``clear()``
-----------
......@@ -2061,7 +2102,8 @@ Pretty-printing
---------------
The ``prettify()`` method will turn a Beautiful Soup parse tree into a
nicely formatted Unicode string, with each HTML/XML tag on its own line::
nicely formatted Unicode string, with a separate line for each
tag and each string::
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
......
......@@ -8,12 +8,13 @@ with open("README.md", "r") as fh:
setup(
name="beautifulsoup4",
version = "4.6.3",
version = "4.7.1",
author="Leonard Richardson",
author_email='leonardr@segfault.org',
url="http://www.crummy.com/software/BeautifulSoup/bs4/",
download_url = "http://www.crummy.com/software/BeautifulSoup/bs4/download/",
description="Screen-scraping library",
install_requires=["soupsieve>=1.2"],
long_description=long_description,
long_description_content_type="text/markdown",
license="MIT",
......