Welcome to w3lib’s documentation!

Overview

This is a Python library of web-related functions, such as:

  • remove comments or tags from HTML snippets

  • extract the base URL from HTML snippets

  • translate entities in HTML strings

  • convert raw HTTP headers to dicts and vice versa

  • construct HTTP auth headers

  • convert HTML pages to unicode

  • sanitize URLs (like browsers do)

  • extract arguments from URLs

The w3lib library is licensed under the BSD license.

Modules

Requirements

Python 3.8+

Install

pip install w3lib

Tests

pytest is the preferred way to run tests. Run pytest from the root directory to execute the tests using the default Python interpreter.

tox can be used to run tests for all supported Python versions. Install it (using ‘pip install tox’) and then run tox from the root directory; tests will be executed for all available Python interpreters.

Changelog

2.1.2 (2023-08-03)

  • Fix test failures on Python 3.11.4+ (#212, #213).

  • Fix an incorrect type hint (#211).

  • Add project URLs to setup.py (#215).

2.1.1 (2022-12-09)

2.1.0 (2022-11-28)

  • Dropped Python 3.6 support, and made Python 3.11 support official. (#195, #200)

  • safe_url_string() now generates safer URLs.

    To make URLs safer for the URL living standard:

    • ;= are percent-encoded in the URL username.

    • ;:= are percent-encoded in the URL password.

    • ' is percent-encoded in the URL query if the URL scheme is special.

    To make URLs safer for RFC 2396 and RFC 3986, |[] are percent-encoded in URL paths, queries, and fragments.

    (#80, #203)

  • html_to_unicode() now checks for the byte order mark before inspecting the Content-Type header when determining the content encoding, in line with the URL living standard. (#189, #191)

  • canonicalize_url() now strips spaces from the input URL, to be more in line with the URL living standard. (#132, #136)

  • get_base_url() now ignores HTML comments. (#70, #77)

  • Fixed safe_url_string() re-encoding percent signs on the URL username and password even when they were being used as part of an escape sequence. (#187, #196)

  • Fixed basic_auth_header() using the wrong flavor of base64 encoding, which could prevent authentication in rare cases. (#181, #192)

  • Fixed replace_entities() raising OverflowError in some cases due to a bug in CPython. (#199, #202)

  • Improved typing and fixed typing issues. (#190, #206)

  • Made CI and test improvements. (#197, #198)

  • Adopted a Code of Conduct. (#194)
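The byte order mark precedence described for html_to_unicode() above can be sketched with the standard library alone. This is a minimal illustration, not w3lib's implementation: sniff_bom and pick_encoding are hypothetical helper names, and the "utf-8" fallback is an assumption made for the example.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(body: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, encoding in _BOMS:
        if body.startswith(bom):
            return encoding, len(bom)
    return None, 0


def pick_encoding(body: bytes, header_charset=None, default="utf-8"):
    """Hypothetical helper: BOM first, then the Content-Type charset, then a default."""
    bom_encoding, _ = sniff_bom(body)
    return bom_encoding or header_charset or default


body = codecs.BOM_UTF8 + "café".encode("utf-8")
print(pick_encoding(body, header_charset="latin-1"))      # the BOM wins
print(pick_encoding(b"plain", header_charset="latin-1"))  # no BOM: header charset
```

Checking the BOM first means a body that announces its own encoding is decoded consistently even when the server sends a conflicting charset.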

2.0.1 (2022-08-11)

Minor documentation fix (release date is set in the changelog).

2.0.0 (2022-08-11)

Backwards incompatible changes:

  • Python 2 is no longer supported; Python 3.6+ is required now (#168, #175).

  • w3lib.url.safe_url_string() and w3lib.url.canonicalize_url() no longer convert “%23” to “#” when it appears in the URL path. This is a bug fix. It’s listed as a backward-incompatible change because in some cases the output of w3lib.url.canonicalize_url() is going to change, and so, if this output is used to generate URL fingerprints, new fingerprints might be incompatible with those created with previous w3lib versions (#141).
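To see why turning “%23” into “#” changes a URL, the two forms can be compared with the standard library. This sketch does not call w3lib; the fingerprint function is a hypothetical stand-in (a SHA-1 digest of the URL text) used only to show that the two spellings produce different fingerprints.

```python
import hashlib
from urllib.parse import urlsplit


def fingerprint(url: str) -> str:
    """Hypothetical fingerprint: SHA-1 hex digest of the URL text."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


escaped = "http://example.com/a%23b"
literal = "http://example.com/a#b"

# With the escape kept, "#" stays part of the path.
print(urlsplit(escaped).path, repr(urlsplit(escaped).fragment))
# With a literal "#", the path is cut short and a fragment appears.
print(urlsplit(literal).path, repr(urlsplit(literal).fragment))

print(fingerprint(escaped) == fingerprint(literal))
```

Because the two URLs identify different resources, any stored fingerprints computed from the old, unescaped output may need to be regenerated.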

Deprecation removals (#169):

  • The w3lib.form module is removed.

  • The w3lib.html.remove_entities function is removed.

  • The w3lib.url.urljoin_rfc function is removed.

The following functions are deprecated, and will be removed in future releases (#170):

  • w3lib.util.str_to_unicode

  • w3lib.util.unicode_to_str

  • w3lib.util.to_native_str

Other improvements and bug fixes:

  • Type annotations are added (#172, #184).

  • Added support for Python 3.9 and 3.10 (#168, #176).

  • Fixed w3lib.html.get_meta_refresh() for <meta> tags where http-equiv is written after content (#179).

  • Fixed w3lib.url.safe_url_string() for IDNA domains with ports (#174).

  • w3lib.url.url_query_cleaner() no longer adds an unneeded # when keep_fragments=True is passed, and the URL doesn’t have a fragment (#159).

  • Removed a workaround for an ancient pathname2url bug (#142).

  • CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160, #182).

  • The code is formatted using black (#173).

1.22.0 (2020-05-13)

1.21.0 (2019-08-09)

1.20.0 (2019-01-11)

  • Fix url_query_cleaner to not append “?” to URLs without a query string (issue #109)

  • Add support for Python 3.7 and drop Python 3.3 (issue #113)

  • Add w3lib.url.add_or_replace_parameters helper (issue #117)

  • Documentation fixes (issue #115)

1.19.0 (2018-01-25)

  • Add a workaround for a CPython segfault (https://bugs.python.org/issue32583) which affects w3lib.encoding functions. This is technically backwards incompatible because it changes the way non-decodable bytes are replaced (in some cases instead of two \ufffd chars you can get one). As a side effect, the fix speeds up decoding in Python 3.4+.

  • Add ‘encoding’ parameter for w3lib.http.basic_auth_header.

  • Fix pypy testing setup, add pypy3 to CI.

1.18.0 (2017-08-03)

  • Include additional assets used for distribution packages in the source tarball

  • Consider [ and ] as safe characters in path and query components of URLs, i.e. they are not escaped anymore

  • Disable codecov project coverage check
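What treating [ and ] as safe characters means for percent-encoding can be shown with urllib.parse.quote from the standard library. This is only an illustration of the concept, not the code w3lib uses; the sample path is made up.

```python
from urllib.parse import quote

path = "/items[0]/a b"

# Only "/" is marked safe, so the brackets are percent-encoded.
print(quote(path, safe="/"))
# Adding "[]" to the safe set leaves the brackets untouched.
print(quote(path, safe="/[]"))
```

Note that the 2.1.0 entry above records a later change in which |[] are percent-encoded in URL paths, queries, and fragments.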

1.17.0 (2017-02-08)

  • Add Python 3.5 and 3.6 support

  • Add w3lib.url.parse_data_uri helper for parsing “data:” URIs

  • Add w3lib.html.strip_html5_whitespace function to strip leading and trailing whitespace as per W3C recommendations, e.g. for cleaning “href” attribute values

  • Fix w3lib.http.headers_raw_to_dict for multiple headers with same name

  • Do not distribute tests/test_*.pyc artifacts
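The headers_raw_to_dict fix above concerns header names that appear more than once. One way to avoid losing values in that case is to map each name to a list. The sketch below is a simplified stand-in written for this example (parse_raw_headers is a hypothetical name), not w3lib's implementation.

```python
def parse_raw_headers(raw: bytes) -> dict:
    """Hypothetical stand-in: map each header name to a list of its values."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # skip blank or malformed lines
        # setdefault keeps earlier values instead of overwriting them
        headers.setdefault(name.strip(), []).append(value.strip())
    return headers


raw = b"Set-Cookie: a=1\r\nSet-Cookie: b=2\r\nContent-Type: text/html\r\n"
parsed = parse_raw_headers(raw)
print(parsed[b"Set-Cookie"])
print(parsed[b"Content-Type"])
```

A plain name-to-value dict would keep only the last Set-Cookie line, which is the kind of data loss a list-valued mapping avoids.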

1.16.0 (2016-11-10)

1.15.0 (2016-07-29)

  • Add canonicalize_url() to w3lib.url

1.14.3 (2016-07-14)

Bugfix release:

  • Handle IDNA encoding failures in safe_url_string() (issue #62)

1.14.2 (2016-04-11)

Bugfix release:

1.14.1 (2016-04-07)

Bugfix release:

  • For bytes URLs, when supplied encoding (or default UTF8) is wrong, safe_url_string falls back to percent-encoding offending bytes.

1.14.0 (2016-04-06)

Changes to safe_url_string:

  • proper handling of non-ASCII characters in Python2 and Python3

  • support IDNs

  • new path_encoding to override default UTF-8 when serializing non-ASCII characters before percent-encoding
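The effect of choosing an encoding before percent-encoding, which is what path_encoding controls, can be illustrated with urllib.parse.quote. This sketch uses the standard library only and does not call safe_url_string; the sample path is made up.

```python
from urllib.parse import quote

path = "/café"

# "é" is two bytes in UTF-8, so it becomes two escape sequences.
print(quote(path, encoding="utf-8"))
# In Latin-1 the same character is a single byte.
print(quote(path, encoding="latin-1"))
```

Servers that expect a legacy encoding in paths will only match the second form, which is why an override for the UTF-8 default can be useful.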

html_body_declared_encoding also detects the encoding when it is not the sole attribute in <meta>.

Package is now properly marked as zip_safe.

1.13.0 (2015-11-05)

  • remove_tags removes uppercase tags as well;

  • ignore meta-redirects inside script or noscript tags by default, but add an option to not ignore them;

  • replace_entities now handles entities without trailing semicolon;

  • fixed uncaught UnicodeDecodeError when decoding entities.

1.12.0 (2015-06-29)

  • meta_refresh regex now handles leading newlines and whitespace in the URL;

  • include tests folder in source distribution.

1.11.0 (2015-01-13)

  • url_query_cleaner now supports str or list parameters;

  • add support for resolving base URLs in <base> tags with attributes before href.

1.10.0 (2014-08-20)

  • reverted all 1.9.0 changes.

1.9.0 (2014-08-16)

  • all url-related functions accept bytes and unicode and now return bytes.

1.8.1 (2014-08-14)

  • w3lib.http.basic_auth_header now returns bytes

1.8.0 (2014-07-31)

  • add support for big5-hkscs encoding.

1.7.1 (2014-07-26)

  • PY3 fixed headers_raw_to_dict and headers_dict_to_raw;

  • documentation improvements;

  • provide wheels.

1.6 (2014-06-03)

  • w3lib.form.encode_multipart is deprecated;

  • docstrings and docs are improved;

  • w3lib.url.add_or_replace_parameter is re-implemented on top of stdlib functions;

  • remove_entities is renamed to replace_entities.

1.5 (2013-11-09)

  • Python 2.6 support is dropped.

1.4 (2013-10-18)

  • Python 3 support;

  • get_meta_refresh encoding handling is fixed;

  • check for ‘?’ in add_or_replace_parameter;

  • ISO-8859-1 is used for HTTP Basic Auth;

  • fixed unicode handling in replace_escape_chars.
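How the credential encoding affects an HTTP Basic Auth header can be sketched with the standard library. The function below is a hypothetical stand-in named basic_auth_value, not w3lib.http.basic_auth_header itself; it assumes the usual "username:password" layout and the standard base64 alphabet.

```python
import base64


def basic_auth_value(username: str, password: str,
                     encoding: str = "ISO-8859-1") -> bytes:
    """Hypothetical stand-in: build the value of an Authorization header."""
    credentials = f"{username}:{password}".encode(encoding)
    return b"Basic " + base64.b64encode(credentials)


print(basic_auth_value("user", "pass"))
# Non-ASCII credentials give different bytes under a different encoding.
print(basic_auth_value("josé", "x"))
print(basic_auth_value("josé", "x", encoding="utf-8"))
```

For pure ASCII credentials the encoding makes no difference; it matters only when the username or password contains other characters.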

1.3 (2012-05-13)

  • support non-standard gb_2312_80 encoding;

  • drop Python 2.5 support.

1.2 (2012-05-02)

  • Detect encoding for content attr before http-equiv in meta tag.

1.1 (2012-04-18)

  • w3lib.html.remove_comments handles multiline comments;

  • Added w3lib.encoding module, containing functions for working with character encoding, like encoding autodetection from HTML pages.

  • w3lib.url.urljoin_rfc is deprecated.

1.0 (2011-04-17)

First release of w3lib.

History

The code of w3lib was originally part of the Scrapy framework but was later split out of Scrapy, with the aim of making it more reusable and providing a useful library of web functions without depending on Scrapy.
