lxml juno patches #1

Open
igormarkoff wants to merge 9 commits into lxml-6.0.2+juno from lxml-6.0.2+juno_patches

@igormarkoff
Collaborator

Juno patches for the lxml library

Bump ``__version__`` to ``6.0.2+juno`` in ``src/lxml/__init__.py``.
PEP 440 local version segment marks the fork-patched build so that
``importlib.metadata.metadata("lxml")["version"]`` and
``lxml.__version__`` agree on the suffix.
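
A quick way to confirm that the two version sources agree is sketched below. The helper names ``local_label`` and ``check_fork_version`` are illustrative and not part of lxml, and the comparison assumes the installed distribution metadata reports the same string that ``src/lxml/__init__.py`` declares.

```python
from importlib import metadata


def local_label(version):
    """Return the PEP 440 local label (text after '+'), or '' if absent."""
    return version.partition("+")[2]


def check_fork_version(module_version, dist_version, label="juno"):
    """True when both strings are identical and carry the expected label."""
    return module_version == dist_version and local_label(module_version) == label


if __name__ == "__main__":
    try:
        import lxml

        print(check_fork_version(lxml.__version__, metadata.version("lxml")))
    except (ImportError, metadata.PackageNotFoundError):
        print("lxml is not installed in this environment")
```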

Replace the prior fork's custom ``tools.setup_common``-based build
loop in ``setup.py`` with the standard upstream ``setup(...)`` call
plus an ``IOSBuildExt(_build_ext)`` cmdclass gated on
``IOS_BUILD_PLATFORM``. Without the env var set, the cmdclass is a
transparent passthrough to upstream setuptools — host ``pip
install`` and CI smoke builds keep working unchanged. With it, the
extension build picks up iOS-specific compile/link flags
(``-arch arm64 -isysroot <iOS-SDK> -m<plat>-version-min=16.0
-Wl,-undefined,dynamic_lookup``) and an optional
``IOS_PYTHON_INCLUDE`` for prepending the iOS Python framework's
``Headers/`` to the include path.
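
A minimal sketch of how such a gated cmdclass could look. It is not the actual ``setup.py`` from this branch: the ``IOS_SDK_PATH`` variable, the ``ios_flags`` helper, and the ``make_ios_build_ext`` factory are assumptions made for illustration, and the value of ``IOS_BUILD_PLATFORM`` is assumed to slot directly into the ``-m<plat>-version-min`` flag.

```python
import os

MIN_IOS_VERSION = "16.0"


def ios_flags(env=os.environ):
    """Return (compile_args, link_args, include_dirs) for an iOS build.

    All three lists are empty when IOS_BUILD_PLATFORM is unset, which makes
    the cmdclass below a passthrough on host builds.
    """
    plat = env.get("IOS_BUILD_PLATFORM")
    if not plat:
        return [], [], []
    compile_args = ["-arch", "arm64"]
    sdk = env.get("IOS_SDK_PATH")  # hypothetical variable naming the iOS SDK root
    if sdk:
        compile_args += ["-isysroot", sdk]
    compile_args.append(f"-m{plat}-version-min={MIN_IOS_VERSION}")
    link_args = compile_args + ["-Wl,-undefined,dynamic_lookup"]
    py_include = env.get("IOS_PYTHON_INCLUDE")
    include_dirs = [py_include] if py_include else []
    return compile_args, link_args, include_dirs


def make_ios_build_ext():
    """Build the cmdclass lazily so importing this sketch does not need setuptools."""
    from setuptools.command.build_ext import build_ext as _build_ext

    class IOSBuildExt(_build_ext):
        def build_extension(self, ext):
            compile_args, link_args, include_dirs = ios_flags()
            ext.extra_compile_args = list(ext.extra_compile_args or []) + compile_args
            ext.extra_link_args = list(ext.extra_link_args or []) + link_args
            # Prepend so the iOS framework headers win over host headers.
            ext.include_dirs = include_dirs + list(ext.include_dirs or [])
            super().build_extension(ext)

    return IOSBuildExt
```

Under these assumptions, ``setup.py`` would wire it in as ``setup(..., cmdclass={"build_ext": make_ios_build_ext()})``.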

Companion changes:

- ``pyproject.toml``: declare ``build-backend =
  "setuptools.build_meta"`` so PEP 517 frontends (``pip wheel``,
  ``python -m build``) drive the build correctly.
- ``versioninfo.py``: anchor ``get_base_dir()`` on ``__file__``
  rather than ``sys.argv[0]``. Under PEP 517 backends
  ``sys.argv[0]`` is the pyproject_hooks subprocess script, not the
  project's ``setup.py``, so the legacy heuristic resolves to the
  wrong directory and breaks callers such as ``version()``, which
  opens ``src/lxml/__init__.py`` relative to that directory.
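
The difference between the two anchors can be illustrated with plain paths. ``base_dir_from`` is a hypothetical helper and both paths are made up for the example; in ``versioninfo.py`` the argument would simply be ``__file__``.

```python
import os


def base_dir_from(path):
    """Return the absolute directory containing ``path``."""
    return os.path.abspath(os.path.dirname(path))


# Illustrative paths only.
project_file = os.path.join(os.sep, "src", "lxml-checkout", "versioninfo.py")
hook_script = os.path.join(os.sep, "tmp", "build-env", "_in_process.py")

# Anchoring on the module's own file finds the project root ...
print(base_dir_from(project_file))
# ... while anchoring on the launching script (what sys.argv[0] holds under a
# PEP 517 frontend) finds the frontend's helper directory instead.
print(base_dir_from(hook_script))
```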

``src/lxml/etree.pyx`` was calling
``xmlparser.xmlCleanupParser()`` immediately before
``xmlparser.xmlInitParser()`` at module-import time. libxml2's
``xmlCleanupParser`` is destructive of *process-global* state —
catalogs, encoding handlers, registered IO callbacks, schema type
tables, error handlers — and the docs reserve it for controlled
process / interpreter teardown when no libxml2 objects can still be
alive.

Calling it on every lxml import means that, in any embedding that
runs multiple Python interpreters in the same process, a second
interpreter's lxml import wipes the first interpreter's still-live
registrations. The first interpreter's parsers / XInclude / DTD
operations then run against half-initialised state on subsequent
calls.

Drop the call. ``xmlInitParser`` is internally idempotent so a
fresh import is safe without any preceding cleanup.

``src/lxml/xslt.pxi`` was calling ``xslt.xsltUninit()`` immediately
before ``xslt.xsltSetLoaderFunc(NULL)`` at module-import time.
``xsltUninit`` only flips libxslt's initialisation-once flag — it
does not clear extension/element/style registries. Used at module
import without a follow-up ``xsltCleanupGlobals()``, the next
sub-interpreter's lxml import re-runs ``xsltInit()``'s built-in
registrations on top of the already-populated tables.

``xsltSetLoaderFunc(NULL)`` alone is the right scoped operation for
the loader reset; full table cleanup belongs in a controlled
teardown path (``xsltCleanupGlobals``), not at import.

Drop the call and the matching ``cdef void xsltUninit() nogil``
declaration from ``src/lxml/includes/xslt.pxd`` (the only caller
goes away).

When ``lxml.__version__`` carries a PEP 440 local version segment
(e.g. ``6.0.2+juno`` for an embedding's fork-patched build),
upstream's split-by-dot unpacker produces ``(6, 0, '2+juno', 0)``
— a tuple whose third element is a string, breaking
``isinstance(LXML_VERSION[2], int)`` checks (and the
``test_etree.ETreeOnlyTestCase.test_version`` self-test).

Strip the ``+xyz`` segment before splitting so ``LXML_VERSION``
stays ``(int, int, int, int)`` regardless of whether a local-
version suffix is present.
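
The effect can be reproduced in a few lines. This is a simplified stand-in for the unpacker in ``etree.pyx``, not a copy of it: it only handles plain numeric components, and the function names are illustrative.

```python
def _int_or_raw(item):
    """Convert to int when possible, otherwise keep the raw component."""
    try:
        return int(item)
    except ValueError:
        return item


def naive_unpack(version):
    """Split on dots only; a local segment leaks into the third element."""
    parts = (version.split(".") + [0] * 4)[:4]
    return tuple(_int_or_raw(p) for p in parts)


def unpack_without_local(version):
    """Drop the '+local' segment first, then split and pad to four ints."""
    public = version.split("+", 1)[0]
    parts = (public.split(".") + ["0"] * 4)[:4]
    return tuple(int(p) for p in parts)


print(naive_unpack("6.0.2+juno"))          # (6, 0, '2+juno', 0)
print(unpack_without_local("6.0.2+juno"))  # (6, 0, 2, 0)
```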

``parser.pxi``'s ``super(_ParseError, self).__init__(...)`` and
``etree.pyx``'s ``super(_Error, self).__init__(...)`` relied on
class objects cached in the process-static
``cdef object _ParseError = ParseError`` /
``cdef object _Error = Error``. Cython emits these cdef-level
objects as file-scope ``static PyObject *`` outside
``__pyx_mstate_global``, so a second concurrent interpreter's
import overwrites the first's. After the overwrite, the first
interpreter raising e.g. ``XMLSyntaxError`` (which inherits from
``ParseError``) trips
``TypeError: super(type, obj): obj (instance of XMLSyntaxError) is
not an instance or subtype of type (ParseError).``

Replace with name-lookup forms that resolve via the importing
module's per-interpreter ``__dict__``:

- ``parser.pxi`` (``ParseError`` is a plain Python class) →
  ``super().__init__(message)``. Cython emits the ``__class__``
  cell that the no-arg form needs.
- ``etree.pyx`` (``LxmlError`` is a ``cdef class`` — no ``__class__``
  cell available) → ``super(Error, self).__init__(message)``.

Crucially, both forms preserve the cooperative super chain that
ultimately reaches ``SyntaxError.__init__`` and populates
``self.msg`` for SyntaxError-derived subclasses; bypassing the
chain (e.g. by calling ``Error.__init__`` directly) leaves
``self.msg`` unset and ``str(exception)`` shows ``"None …"``.

Drop the now-orphan ``cdef object _ParseError = ParseError`` and
``cdef object _Error = Error`` definitions.
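
Both failure modes can be reproduced with pure-Python stand-ins. The classes below only mimic the shape of lxml's hierarchy (in lxml, ``Error`` and ``LxmlError`` are ``cdef`` classes), and ``fresh_classes`` is a hypothetical helper that imitates two interpreters each creating their own class objects.

```python
def fresh_classes():
    """Imitate one interpreter's import: new class objects on every call."""
    class ParseErrorCopy(SyntaxError):
        pass

    class XMLSyntaxErrorCopy(ParseErrorCopy):
        pass

    return ParseErrorCopy, XMLSyntaxErrorCopy


# A stale cached class from one "import" used with an instance from another.
stale_parse_error, _ = fresh_classes()
_, live_xml_syntax_error = fresh_classes()
stale_lookup_failed = False
try:
    super(stale_parse_error, live_xml_syntax_error("x")).__init__("x")
except TypeError:
    stale_lookup_failed = True


class Error(Exception):
    pass


class LxmlError(Error):
    def __init__(self, message, error_log=None):
        # Two-argument form, with the class looked up by name at call time.
        super(Error, self).__init__(message)
        self.error_log = error_log


class ParseError(LxmlError, SyntaxError):
    def __init__(self, message):
        # Zero-argument form; continues along the MRO to SyntaxError.__init__.
        super().__init__(message)


class BypassingParseError(LxmlError, SyntaxError):
    def __init__(self, message):
        # Direct base-class call: SyntaxError.__init__ never runs.
        Error.__init__(self, message)


good = ParseError("unexpected end tag")
bad = BypassingParseError("unexpected end tag")
print(stale_lookup_failed)
print(repr(good.msg), "|", str(good))
print(repr(bad.msg), "|", str(bad))
```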

Three independent fixes for failures that all surface when the test suite runs
against the installed wheel rather than the in-repo source tree
(``pip install lxml && pytest <site-packages>/lxml/tests`` style),
which is the deploy shape any embedding will see.

1. ``tests/common_imports.py``: ``DOC_DIR`` source change + a
   ``make_doctest`` graceful skip. The legacy ``DOC_DIR`` walked
   four ``dirname`` levels up from ``__file__`` and resolved to a
   non-existent path under wheel installs (lxml's ``doc/`` only
   ships in the source tree). Allow callers to override via
   ``LXML_DOC_DIR`` (or ``SITE_PACKAGES_DIR``) for deployers that
   ship the doc tree, fall back to the legacy walk otherwise, and
   make ``make_doctest`` return an empty TestSuite when the file
   isn't on disk — instead of letting ``DocFileSuite`` raise
   ``FileNotFoundError`` at collection time and torch the
   surrounding ``test_suite`` (~12 such cascades observed before).

2. ``html/tests/test_feedparser_data.py``: add ``__test__ = False``
   to ``FeedTestCase``. The class's ``__init__`` requires a
   ``filename`` arg; pytest's auto-discovery instantiates it as
   ``FeedTestCase('runTest')``, assigning the method name to
   ``self.filename``, and downstream ``open('runTest')`` then
   raises ``FileNotFoundError``. The surrounding ``test_suite()``
   constructs instances with proper file paths.

3. ``tests/test_etree.py``: route the
   ``test_python3_problem_filebased_*`` tests through
   ``tempfile.NamedTemporaryFile`` instead of
   ``open('test.xml', 'w+b')``. ``tests/test.xml`` is the bundled
   fixture used by ``test_parse_file``, ``test_xinclude``,
   ``test_dtd_*`` and ~15 other tests; on platforms where the
   resource bundle is writable (notably the iOS Simulator),
   overwriting it on cycle 1 corrupted every subsequent test that
   read it (``b'<a><b></b></a>' != b'<some_ns_id:some_head_elem
   ...>'``-style mismatches).
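
Items 1 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the code in ``tests/common_imports.py`` or ``tests/test_etree.py``: ``resolve_doc_dir`` and ``write_scratch_xml`` are hypothetical names, ``make_doctest`` takes the directory as an explicit argument here, and the ``SITE_PACKAGES_DIR`` override is left out because the description does not say how it maps to a doc path.

```python
import doctest
import os
import tempfile
import unittest


def resolve_doc_dir(tests_file, env=os.environ):
    """Prefer the LXML_DOC_DIR override, else walk four levels up to doc/."""
    override = env.get("LXML_DOC_DIR")
    if override:
        return override
    base = tests_file
    for _ in range(4):
        base = os.path.dirname(base)
    return os.path.join(base, "doc")


def make_doctest(filename, doc_dir):
    """Return a doctest suite, or an empty suite when the file is not on disk."""
    path = os.path.join(doc_dir, filename)
    if not os.path.isfile(path):
        return unittest.TestSuite()
    return doctest.DocFileSuite(path, module_relative=False, encoding="utf-8")


def write_scratch_xml(data):
    """Write bytes to a fresh temporary file instead of a bundled fixture."""
    with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
        handle.write(data)
        return handle.name  # the caller removes the file when done
```

Writing to a temporary file keeps the bundled ``test.xml`` untouched even where the resource bundle is writable.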

Python 3.13 rejects ``func.__code__`` assignment when the new code
object's ``co_freevars`` length differs from the number of closure
cells the function already has. ``_RestoreChecker.install_clone()``
swaps in a code object that may not satisfy this guard, raising
``ValueError: <name>() requires a code object with N free vars,
not M`` and torching the collection of every doctest that opted in
via ``temp_install`` (typically the ``html/tests/test_*.txt``
files).

Wrap the swap in ``try / except ValueError`` and fall back to no
swap when the guard rejects it. The override-via-
``_temp_call_super_check_output`` mechanism stays in place, so the
default-strict comparison runs for doctests that wanted lxml's
HTML-aware comparison — a soft regression versus the swap success
path, but better than crashing every dependent doctest. Some
HTML-aware doctests may pass under strict comparison anyway when
the expected output happens to match exactly; the rest are clear
follow-ups for a proper rewrite of ``temp_install`` (subclass +
bound-method shadow rather than ``__code__`` replacement).
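
The guarded swap can be sketched in isolation. ``try_swap_code`` and the small functions around it are hypothetical and only demonstrate the ``try / except ValueError`` fallback; they are not the ``_RestoreChecker`` code itself.

```python
def try_swap_code(target, replacement):
    """Swap target's code object; return False if the interpreter rejects it."""
    try:
        target.__code__ = replacement.__code__
    except ValueError:
        # Free-variable count mismatch: leave the target function unchanged.
        return False
    return True


def doubled(x):
    return x * 2


def plain(x):
    return x


def identity(x):
    return x


def make_closure():
    offset = 1

    def with_freevar(x):
        return x + offset  # 'offset' is a free variable of this function

    return with_freevar


swapped = try_swap_code(plain, doubled)             # both have no free variables
rejected = try_swap_code(identity, make_closure())  # 0 closure cells vs 1 free var
print(swapped, plain(3))
print(rejected, identity(3))
```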

2 participants