Welcome to w3lib’s documentation!

Overview

This is a Python library of web-related functions, such as:

remove comments or tags from HTML snippets
extract the base URL from HTML snippets
translate entities in HTML strings
convert raw HTTP headers to dicts and vice versa
construct HTTP auth headers
convert HTML pages to Unicode
sanitize URLs (like browsers do)
extract arguments from URLs

The w3lib library is licensed under the BSD license.

Requirements

Python 3.8+

Install

pip install w3lib

Tests

pytest is the preferred way to run tests. Run pytest from the root directory to execute the tests using the default Python interpreter.

tox can be used to run the tests for all supported Python versions. Install it (using ‘pip install tox’) and then run tox from the root directory; the tests will be executed for all available Python interpreters.

Changelog

2.1.2 (2023-08-03)

Fix test failures on Python 3.11.4+ (#212, #213).
Fix an incorrect type hint (#211).
Add project URLs to setup.py (#215).

2.1.1 (2022-12-09)

safe_url_string(), safe_download_url() and canonicalize_url() now strip whitespace and control characters from URLs according to the URL living standard.
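
As a rough illustration of the stripping described in this release, the sketch below follows the two pre-processing steps of the URL living standard: remove leading and trailing C0 control characters and spaces, then remove ASCII tabs and newlines anywhere in the string. The name strip_url is hypothetical, and this is not w3lib's actual implementation.

```python
# Hypothetical helper, not part of w3lib: a sketch of the URL
# pre-processing steps described by the URL living standard.

# C0 control characters (U+0000 to U+001F) plus the space character.
_C0_CONTROL_OR_SPACE = "".join(chr(i) for i in range(0x21))
# ASCII tab and newline characters, removed wherever they appear.
_TAB_OR_NEWLINE = {ord(c): None for c in "\t\n\r"}


def strip_url(url: str) -> str:
    # Step 1: trim control characters and spaces from both ends.
    url = url.strip(_C0_CONTROL_OR_SPACE)
    # Step 2: drop tabs and newlines from the whole string.
    return url.translate(_TAB_OR_NEWLINE)


print(strip_url("  https://exam\nple.com/a\tb \x00"))  # https://example.com/ab
```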

2.1.0 (2022-11-28)

Dropped Python 3.6 support, and made Python 3.11 support official. (#195, #200)
safe_url_string() now generates safer URLs.

To make URLs safer for the URL living standard:
- ;= are percent-encoded in the URL username.
- ;:= are percent-encoded in the URL password.
- ' is percent-encoded in the URL query if the URL scheme is special.
To make URLs safer for RFC 2396 and RFC 3986, |[] are percent-encoded in URL paths, queries, and fragments.

(#80, #203)
html_to_unicode() now checks for the byte order mark before inspecting the Content-Type header when determining the content encoding, in line with the URL living standard. (#189, #191)
canonicalize_url() now strips spaces from the input URL, to be more in line with the URL living standard. (#132, #136)
get_base_url() now ignores HTML comments. (#70, #77)
Fixed safe_url_string() re-encoding percent signs on the URL username and password even when they were being used as part of an escape sequence. (#187, #196)
Fixed basic_auth_header() using the wrong flavor of base64 encoding, which could prevent authentication in rare cases. (#181, #192)
Fixed replace_entities() raising OverflowError in some cases due to a bug in CPython. (#199, #202)
Improved typing and fixed typing issues. (#190, #206)
Made CI and test improvements. (#197, #198)
Adopted a Code of Conduct. (#194)
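
To make the safe_url_string() percent-encoding rules listed for this release more concrete, here is a minimal sketch that percent-encodes only a chosen set of ASCII characters. The helper percent_encode_chars is hypothetical and only illustrates the idea; the real function handles many more cases (non-ASCII text, existing escape sequences, and so on).

```python
# Hypothetical helper, not part of w3lib.
def percent_encode_chars(text: str, chars: str) -> str:
    """Percent-encode only the ASCII characters listed in chars."""
    return "".join(f"%{ord(c):02X}" if c in chars else c for c in text)


# Character sets taken from the 2.1.0 notes above.
username = percent_encode_chars("us;er=1", ";=")    # us%3Ber%3D1
password = percent_encode_chars("p;a:s=s", ";:=")   # p%3Ba%3As%3Ds
path = percent_encode_chars("/a|b[0]", "|[]")       # /a%7Cb%5B0%5D
print(username, password, path)
```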

2.0.1 (2022-08-11)

Minor documentation fix (release date is set in the changelog).

2.0.0 (2022-08-11)

Backwards incompatible changes:

Python 2 is no longer supported; Python 3.6+ is required now (#168, #175).
w3lib.url.safe_url_string() and w3lib.url.canonicalize_url() no longer convert “%23” to “#” when it appears in the URL path. This is a bug fix. It’s listed as a backward-incompatible change because in some cases the output of w3lib.url.canonicalize_url() is going to change, and so, if this output is used to generate URL fingerprints, new fingerprints might be incompatible with those created with previous w3lib versions (#141).
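
The reason “%23” must stay encoded in a path can be seen with the standard library alone. The sketch below uses only urllib.parse (not w3lib) and an example URL chosen for illustration: once “%23” is decoded, the rest of the path is parsed as a fragment.

```python
from urllib.parse import unquote, urlsplit

url = "http://example.com/a%23b"

# With the escape sequence kept, "#" is data inside the path.
kept = urlsplit(url)
print(kept.path, repr(kept.fragment))        # /a%23b ''

# After decoding, "#" becomes the fragment delimiter and the path changes.
decoded = urlsplit(unquote(url))
print(decoded.path, repr(decoded.fragment))  # /a 'b'
```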

Deprecation removals (#169):

The w3lib.form module is removed.
The w3lib.html.remove_entities function is removed.
The w3lib.url.urljoin_rfc function is removed.

The following functions are deprecated, and will be removed in future releases (#170):

w3lib.util.str_to_unicode
w3lib.util.unicode_to_str
w3lib.util.to_native_str

Other improvements and bug fixes:

Type annotations are added (#172, #184).
Added support for Python 3.9 and 3.10 (#168, #176).
Fixed w3lib.html.get_meta_refresh() for <meta> tags where http-equiv is written after content (#179).
Fixed w3lib.url.safe_url_string() for IDNA domains with ports (#174).
w3lib.url.url_query_cleaner() no longer adds an unneeded # when keep_fragments=True is passed, and the URL doesn’t have a fragment (#159).
Removed a workaround for an ancient pathname2url bug (#142)
CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160, #182).
The code is formatted using black (#173).
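
For the IDNA fix mentioned in this release, the following sketch shows one way to encode an internationalized host name while keeping its port, using only the standard library. The function idna_encode_host is hypothetical, it ignores user information in the URL for brevity, and it is not how w3lib implements this.

```python
from urllib.parse import urlsplit, urlunsplit


# Hypothetical helper, not part of w3lib.
def idna_encode_host(url: str) -> str:
    parts = urlsplit(url)
    # Encode only the host name; the port must not go through IDNA.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(idna_encode_host("http://bücher.example:8080/path?q=1"))
# http://xn--bcher-kva.example:8080/path?q=1
```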

1.22.0 (2020-05-13)

Python 3.4 is no longer supported (issue #156)
w3lib.url.safe_url_string() now supports an optional quote_path parameter to disable the percent-encoding of the URL path (issue #119)
w3lib.url.add_or_replace_parameter() and w3lib.url.add_or_replace_parameters() no longer remove duplicate parameters from the original query string that are not being added or replaced (issue #126)
w3lib.html.remove_tags() now raises a ValueError exception instead of AssertionError when using both the which_ones and the keep parameters (issue #154)
Test improvements (issues #143, #146, #148, #149)
Documentation improvements (issues #140, #144, #145, #151, #152, #153)
Code cleanup (issue #139)
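
The change to add_or_replace_parameter() in this release is about leaving unrelated duplicate parameters alone. A minimal sketch of that behaviour with urllib.parse follows; set_query_parameter is a hypothetical name, and replacing the first occurrence of the target name while dropping its later occurrences is an assumption of this sketch, not a statement about w3lib.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


# Hypothetical helper, not part of w3lib.
def set_query_parameter(url: str, name: str, value: str) -> str:
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    result, replaced = [], False
    for key, old_value in pairs:
        if key != name:
            # Unrelated parameters, including duplicates, are kept as-is.
            result.append((key, old_value))
        elif not replaced:
            result.append((key, value))
            replaced = True
    if not replaced:
        result.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(result)))


print(set_query_parameter("http://example.com/?a=1&a=2&b=3", "b", "9"))
# http://example.com/?a=1&a=2&b=9
```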

1.21.0 (2019-08-09)

Add the encoding and path_encoding parameters to w3lib.url.safe_download_url() (issue #118)
w3lib.url.safe_url_string() now also removes tabs and new lines (issue #133)
w3lib.html.remove_comments() now also removes truncated comments (issue #129)
w3lib.html.remove_tags_with_content() no longer removes tags which start with the same text as one of the specified tags (issue #114)
Recommend pytest instead of nose to run tests (issue #124)
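
Removing truncated comments, as described for remove_comments() in this release, can be sketched with a regular expression that accepts either the closing “-->” or the end of the input. The helper strip_comments is hypothetical and only approximates the idea.

```python
import re

# A comment runs until "-->" or, if it is never closed, to the end of input.
_COMMENT_RE = re.compile(r"<!--.*?(?:-->|$)", re.DOTALL)


# Hypothetical helper, not part of w3lib.
def strip_comments(html: str) -> str:
    return _COMMENT_RE.sub("", html)


print(strip_comments("a<!-- closed -->b"))    # ab
print(strip_comments("a<!-- never closed"))   # a
```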

1.20.0 (2019-01-11)

Fix url_query_cleaner so that it does not append “?” to URLs without a query string (issue #109)
Add support for Python 3.7 and drop Python 3.3 (issue #113)
Add w3lib.url.add_or_replace_parameters helper (issue #117)
Documentation fixes (issue #115)

1.19.0 (2018-01-25)

Add a workaround for a CPython segfault (https://bugs.python.org/issue32583) which affects w3lib.encoding functions. This is technically backwards incompatible because it changes the way non-decodable bytes are replaced (in some cases instead of two \ufffd chars you can get one). As a side effect, the fix speeds up decoding in Python 3.4+.
Add ‘encoding’ parameter for w3lib.http.basic_auth_header.
Fix pypy testing setup, add pypy3 to CI.
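
For readers unfamiliar with the \ufffd replacement mentioned in this release, the short sketch below shows what replacing non-decodable bytes means, using only the built-in bytes.decode(). It does not reproduce the w3lib workaround or the one-versus-two character difference.

```python
# b"\xe9" is "é" in Latin-1 but is not valid UTF-8 on its own.
raw = b"caf\xe9 au lait"

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
text = raw.decode("utf-8", errors="replace")
print(text == "caf\ufffd au lait")  # True
```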

1.18.0 (2017-08-03)

Include additional assets used for distribution packages in the source tarball
Consider [ and ] as safe characters in path and query components of URLs, i.e. they are not escaped anymore
Disable codecov project coverage check

1.17.0 (2017-02-08)

Add Python 3.5 and 3.6 support
Add w3lib.url.parse_data_uri helper for parsing “data:” URIs
Add w3lib.html.strip_html5_whitespace function to strip leading and trailing whitespace as per W3C recommendations, e.g. for cleaning “href” attribute values
Fix w3lib.http.headers_raw_to_dict for multiple headers with same name
Do not distribute tests/test_*.pyc artifacts
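
The headers_raw_to_dict fix in this release concerns repeated header names. One way to keep every value is to map each name to a list, as in the sketch below; raw_headers_to_dict is a hypothetical name, and the code is an illustration rather than the library's implementation.

```python
from typing import Dict, List


# Hypothetical helper, not part of w3lib.
def raw_headers_to_dict(raw: bytes) -> Dict[bytes, List[bytes]]:
    result: Dict[bytes, List[bytes]] = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        # Repeated names accumulate values instead of overwriting them.
        result.setdefault(name.strip(), []).append(value.strip())
    return result


headers = raw_headers_to_dict(
    b"Content-type: text/html\r\nAccept: gzip\r\nAccept: deflate\r\n"
)
print(headers[b"Accept"])  # [b'gzip', b'deflate']
```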

1.16.0 (2016-11-10)

canonicalize_url() and safe_url_string(): strip “:” when no port is specified (as per RFC 3986; see also https://github.com/scrapy/scrapy/issues/2377)
url_query_cleaner(): support new keep_fragments argument (defaulting to False)
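
Stripping the “:” when no port follows it, as described for this release, might look like the sketch below. It uses only urllib.parse, and strip_empty_port is a hypothetical name rather than a w3lib function.

```python
from urllib.parse import urlsplit, urlunsplit


# Hypothetical helper, not part of w3lib.
def strip_empty_port(url: str) -> str:
    parts = urlsplit(url)
    netloc = parts.netloc
    # A trailing ":" means a port delimiter with no port after it.
    if netloc.endswith(":"):
        netloc = netloc[:-1]
    return urlunsplit(parts._replace(netloc=netloc))


print(strip_empty_port("http://example.com:/a?b=1"))  # http://example.com/a?b=1
```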

1.15.0 (2016-07-29)

Add canonicalize_url() to w3lib.url

1.14.3 (2016-07-14)

Bugfix release:

Handle IDNA encoding failures in safe_url_string() (issue #62)

1.14.2 (2016-04-11)

Bugfix release:

fix function import for (deprecated) urljoin_rfc (issue #51)
only expose wanted functions from w3lib.url, via __all__ (see issue #54, https://github.com/scrapy/scrapy/issues/1917)

1.14.1 (2016-04-07)

Bugfix release:

For bytes URLs, when supplied encoding (or default UTF8) is wrong, safe_url_string falls back to percent-encoding offending bytes.

1.14.0 (2016-04-06)

Changes to safe_url_string:

proper handling of non-ASCII characters in Python 2 and Python 3
support IDNs
new path_encoding to override default UTF-8 when serializing non-ASCII characters before percent-encoding

html_body_declared_encoding also detects encoding when not sole attribute in <meta>.

Package is now properly marked as zip_safe.

1.13.0 (2015-11-05)

remove_tags removes uppercase tags as well;
ignore meta-redirects inside script or noscript tags by default, but add an option to not ignore them;
replace_entities now handles entities without trailing semicolon;
fixed uncaught UnicodeDecodeError when decoding entities.
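
Entities without a trailing semicolon, as mentioned for replace_entities in this release, are also accepted by the standard library's html.unescape(). The sketch below uses that function purely as an illustration; it is not w3lib's replace_entities and the two may differ in details.

```python
import html

# "&amp" and "&#38" lack the trailing semicolon but are still recognized.
text = html.unescape("AT&amp;T &amp more, &#38 more")
print(text)  # AT&T & more, & more
```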

1.12.0 (2015-06-29)

meta_refresh regex now handles leading newlines and whitespace in the URL;
include tests folder in source distribution.

1.11.0 (2015-01-13)

url_query_cleaner now supports str or list parameters;
add support for resolving base URLs in <base> tags with attributes before href.
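
Resolving a base URL from a <base> tag regardless of attribute order can be sketched with the standard library's HTML parser, which also skips HTML comments. The names below (_BaseFinder, find_base_url) are hypothetical, and this is not the approach w3lib itself takes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseFinder(HTMLParser):
    """Remember the href of the first <base> tag seen."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so order does not matter.
        if tag == "base" and self.href is None:
            self.href = dict(attrs).get("href") or None


# Hypothetical helper, not part of w3lib.
def find_base_url(html_text: str, page_url: str) -> str:
    finder = _BaseFinder()
    finder.feed(html_text)
    return urljoin(page_url, finder.href) if finder.href else page_url


page = (
    '<html><head><!-- <base href="http://old.example/"> -->'
    '<base target="_blank" href="/static/"></head></html>'
)
print(find_base_url(page, "http://example.com/page.html"))
# http://example.com/static/
```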

1.10.0 (2014-08-20)

reverted all 1.9.0 changes.

1.9.0 (2014-08-16)

all url-related functions accept bytes and unicode and now return bytes.

1.8.1 (2014-08-14)

w3lib.http.basic_auth_header now returns bytes
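
For context, an HTTP Basic authentication header value is the word “Basic” followed by the standard Base64 encoding of “username:password”. The sketch below builds such a value as bytes; make_basic_auth_header is a hypothetical name, and the ISO-8859-1 default is taken from the 1.4 notes in this changelog.

```python
import base64


# Hypothetical helper, not part of w3lib.
def make_basic_auth_header(
    username: str, password: str, encoding: str = "ISO-8859-1"
) -> bytes:
    credentials = f"{username}:{password}".encode(encoding)
    # b64encode uses the standard alphabet ("+" and "/"), not the URL-safe one.
    return b"Basic " + base64.b64encode(credentials)


header = make_basic_auth_header("someuser", "somepass")
print(header)  # b'Basic c29tZXVzZXI6c29tZXBhc3M='
```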

1.8.0 (2014-07-31)

add support for big5-hkscs encoding.

1.7.1 (2014-07-26)

PY3 fixed headers_raw_to_dict and headers_dict_to_raw;
documentation improvements;
provide wheels.

1.6 (2014-06-03)

w3lib.form.encode_multipart is deprecated;
docstrings and docs are improved;
w3lib.url.add_or_replace_parameter is re-implemented on top of stdlib functions;
remove_entities is renamed to replace_entities.

1.5 (2013-11-09)

Python 2.6 support is dropped.

1.4 (2013-10-18)

Python 3 support;
get_meta_refresh encoding handling is fixed;
check for ‘?’ in add_or_replace_parameter;
ISO-8859-1 is used for HTTP Basic Auth;
fixed unicode handling in replace_escape_chars;

1.3 (2012-05-13)

support non-standard gb_2312_80 encoding;
drop Python 2.5 support.

1.2 (2012-05-02)

Detect encoding for content attr before http-equiv in meta tag.

1.1 (2012-04-18)

w3lib.html.remove_comments handles multiline comments;
Added w3lib.encoding module, containing functions for working with character encoding, like encoding autodetection from HTML pages.
w3lib.url.urljoin_rfc is deprecated.
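
As an illustration of encoding autodetection, the sketch below checks for a byte order mark first and only then looks at the Content-Type header, which is the order the 2.1.0 notes describe for html_to_unicode(). The function guess_encoding, its default value, and the simple charset regular expression are assumptions of this sketch; the real module also inspects the HTML body.

```python
import codecs
import re
from typing import Optional

# UTF-32 marks are checked first because the UTF-32 LE mark starts
# with the same bytes as the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF8, "utf-8"),
)
_CHARSET_RE = re.compile(r"charset=[\"']?([\w-]+)", re.IGNORECASE)


# Hypothetical helper, not part of w3lib.
def guess_encoding(
    body: bytes, content_type: Optional[str] = None, default: str = "utf-8"
) -> str:
    # 1. A byte order mark wins over any declared encoding.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 2. Otherwise use the charset from the Content-Type header, if any.
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. Fall back to the assumed default.
    return default


print(guess_encoding(codecs.BOM_UTF8 + b"<html>", "text/html; charset=iso-8859-1"))
# utf-8
```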

1.0 (2011-04-17)

First release of w3lib.

History

The code of w3lib was originally part of the Scrapy framework but was later split out of Scrapy, with the aim of making it more reusable and providing a useful library of web functions without depending on Scrapy.
