diff --git a/Doc/library/difflib.rst b/Doc/library/difflib.rst index e56c4f5e7dfbf7..fd77715900d78f 100644 --- a/Doc/library/difflib.rst +++ b/Doc/library/difflib.rst @@ -14,14 +14,37 @@ -------------- This module provides classes and functions for comparing sequences. It -can be used for example, for comparing files, and can produce information +can be used, for example, for comparing files, and can produce information about file differences in various formats, including HTML and context and unified -diffs. For comparing directories and files, see also, the :mod:`filecmp` module. +diffs. For comparing directories and files, see the :mod:`filecmp` module. + + +.. class:: SequenceMatcherBase + :noindex: + + Base class for implementing sequence matchers. + + At minimum, derived classes must implement ``_get_matching_blocks`` method, + which returns a list of tuples of the form ``(start_in_a, start_in_b, length)``. + See :meth:`~SequenceMatcherBase._get_matching_blocks` and + :meth:`~SequenceMatcherBase.get_matching_blocks` for more information. + + Once implemented, the following methods are available: + - :meth:`~SequenceMatcherBase.get_matching_blocks` + - :meth:`~SequenceMatcherBase.get_opcodes` + - :meth:`~SequenceMatcherBase.get_grouped_opcodes` + - :meth:`~SequenceMatcherBase.ratio` + - :meth:`~SequenceMatcherBase.quick_ratio` + - :meth:`~SequenceMatcherBase.real_quick_ratio` + + See :class:`SequenceMatcher` for example implementation. .. class:: SequenceMatcher :noindex: + Implementation of :class:`SequenceMatcherBase`. + This is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are :term:`hashable`. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and @@ -88,7 +111,8 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. The constructor for this class is: - .. 
method:: __init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK) + .. method:: __init__(tabsize=8, wrapcolumn=None, + linejunk=None, charjunk=IS_CHARACTER_JUNK, differ=None) Initializes instance of :class:`HtmlDiff`. @@ -98,9 +122,12 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. *wrapcolumn* is an optional keyword to specify column number where lines are broken and wrapped, defaults to ``None`` where lines are not wrapped. - *linejunk* and *charjunk* are optional keyword arguments passed into :func:`ndiff` - (used by :class:`HtmlDiff` to generate the side by side HTML differences). See - :func:`ndiff` documentation for argument default values and descriptions. + *linejunk*, *charjunk* and *differ* are optional keyword arguments passed into + :func:`ndiff` (used by :class:`HtmlDiff` to generate the side by side HTML differences). + See :func:`ndiff` documentation for argument default values and descriptions. + + .. versionchanged:: 3.15 + Added *differ* argument. The following methods are public: @@ -143,7 +170,8 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. -.. function:: context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n') +.. function:: context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', \ + n=3, lineterm='\n', matcher=None) Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` generating the delta lines) in context diff format. @@ -161,6 +189,10 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. For inputs that do not have trailing newlines, set the *lineterm* argument to ``""`` so that the output will be uniformly newline free. + Optional argument *matcher* is a callable with 3 optional arguments and returns + :class:`SequenceMatcherBase` instance. i.e. ``matcher(isjunk=None, a='', b='')``. 
+ Default (if ``None``) is a :class:`SequenceMatcher` class. + The context diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for *fromfile*, *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally @@ -189,8 +221,11 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. See :ref:`difflib-interface` for a more detailed example. + .. versionchanged:: 3.15 + Added *matcher* argument. -.. function:: get_close_matches(word, possibilities, n=3, cutoff=0.6) + +.. function:: get_close_matches(word, possibilities, n=3, cutoff=0.6, matcher=None) Return a list of the best "good enough" matches. *word* is a sequence for which close matches are desired (typically a string), and *possibilities* is a list of @@ -202,6 +237,10 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. Optional argument *cutoff* (default ``0.6``) is a float in the range [0, 1]. Possibilities that don't score at least that similar to *word* are ignored. + Optional argument *matcher* is a callable with 3 optional arguments and returns + :class:`SequenceMatcherBase` instance. i.e. ``matcher(isjunk=None, a='', b='')``. + Default (if ``None``) is a :class:`SequenceMatcher` class. + The best (no more than *n*) matches among the possibilities are returned in a list, sorted by similarity score, most similar first. @@ -215,8 +254,11 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. >>> get_close_matches('accept', keyword.kwlist) ['except'] + .. versionchanged:: 3.15 + Added *matcher* argument. -.. function:: ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK) + +.. function:: ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK, differ=None) Compare *a* and *b* (lists of strings); return a :class:`Differ`\ -style delta (a :term:`generator` generating the delta lines). @@ -233,10 +275,14 @@ diffs. 
For comparing directories and files, see also, the :mod:`filecmp` module. usually works better than using this function. *charjunk*: A function that accepts a character (a string of length 1), and - returns if the character is junk, or false if not. The default is module-level + returns true if the character is junk, or false if not. The default is module-level function :func:`IS_CHARACTER_JUNK`, which filters out whitespace characters (a blank or tab; it's a bad idea to include newline in this!). + *differ*: callable that takes 2 optional arguments and returns + :class:`Differ` instance. i.e. ``differ(linejunk=None, charjunk=None)``. + Default (if ``None``) is a :class:`Differ` class. + >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True), ... 'ore\ntree\nemu\n'.splitlines(keepends=True)) >>> print(''.join(diff), end="") @@ -250,6 +296,9 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. + tree + emu + .. versionchanged:: 3.15 + Added *differ* argument. + .. function:: restore(sequence, which) @@ -274,7 +323,8 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. emu -.. function:: unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\n', *, color=False) +.. function:: unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', \ + n=3, lineterm='\n', *, color=False, matcher=None) Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` generating the delta lines) in unified diff format. @@ -297,6 +347,10 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. :program:`git diff --color`. Even if enabled, it can be :ref:`controlled using environment variables `. + Optional argument *matcher* is a callable with 3 optional arguments and returns + :class:`SequenceMatcherBase` instance. i.e. ``matcher(isjunk=None, a='', b='')``. + Default (if ``None``) is a :class:`SequenceMatcher` class. 
+ The unified diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for *fromfile*, *tofile*, *fromfiledate*, and *tofiledate*. The modification times are normally @@ -321,6 +375,7 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. .. versionchanged:: 3.15 Added the *color* parameter. + Added *matcher* argument. .. function:: diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\n') @@ -360,15 +415,14 @@ diffs. For comparing directories and files, see also, the :mod:`filecmp` module. was published in Dr. Dobb's Journal in July, 1988. -.. _sequence-matcher: +.. _sequencematcher-base: -SequenceMatcher Objects ------------------------ +SequenceMatcherBase +------------------- -The :class:`SequenceMatcher` class has this constructor: +The :class:`SequenceMatcherBase` class has this constructor: - -.. class:: SequenceMatcher(isjunk=None, a='', b='', autojunk=True) +.. class:: SequenceMatcherBase(isjunk=None, a='', b='') Optional argument *isjunk* must be ``None`` (the default) or a one-argument function that takes a sequence element and returns true if and only if the @@ -384,33 +438,16 @@ The :class:`SequenceMatcher` class has this constructor: The optional arguments *a* and *b* are sequences to be compared; both default to empty strings. The elements of both sequences must be :term:`hashable`. - The optional argument *autojunk* can be used to disable the automatic junk - heuristic. - - .. versionchanged:: 3.2 - Added the *autojunk* parameter. - - SequenceMatcher objects get three data attributes: *bjunk* is the - set of elements of *b* for which *isjunk* is ``True``; *bpopular* is the set of - non-junk elements considered popular by the heuristic (if it is not - disabled); *b2j* is a dict mapping the remaining elements of *b* to a list - of positions where they occur. 
All three are reset whenever *b* is reset - with :meth:`set_seqs` or :meth:`set_seq2`. - - .. versionadded:: 3.2 - The *bjunk* and *bpopular* attributes. - - :class:`SequenceMatcher` objects have the following methods: + :class:`SequenceMatcherBase` objects have the following methods: .. method:: set_seqs(a, b) Set the two sequences to be compared. - :class:`SequenceMatcher` computes and caches detailed information about the - second sequence, so if you want to compare one sequence against many - sequences, use :meth:`set_seq2` to set the commonly used sequence once and - call :meth:`set_seq1` repeatedly, once for each of the other sequences. - + :class:`SequenceMatcherBase` caches detailed information about the + second sequence. :meth:`set_seq2` clears cache of :meth:`quick_ratio` method. + In addition :meth:`_prepare_seq2`, which is called at the end of :meth:`set_seq2`, + can be implemented by derived class for alignment algorithm cache logic. .. method:: set_seq1(a) @@ -423,47 +460,20 @@ The :class:`SequenceMatcher` class has this constructor: Set the second sequence to be compared. The first sequence to be compared is not changed. + .. method:: _get_matching_blocks() + :abstractmethod: - .. method:: find_longest_match(alo=0, ahi=None, blo=0, bhi=None) + Returns list of tuples of the form ``(start_in_a, start_in_b, length)`` + describing matching subsequences. - Find longest matching block in ``a[alo:ahi]`` and ``b[blo:bhi]``. + Validity of whether blocks actually match is not checked + and it is up to the user to make sure of result's correctness. - If *isjunk* was omitted or ``None``, :meth:`find_longest_match` returns - ``(i, j, k)`` such that ``a[i:i+k]`` is equal to ``b[j:j+k]``, where ``alo - <= i <= i+k <= ahi`` and ``blo <= j <= j+k <= bhi``. For all ``(i', j', - k')`` meeting those conditions, the additional conditions ``k >= k'``, ``i - <= i'``, and if ``i == i'``, ``j <= j'`` are also met. 
In other words, of - all maximal matching blocks, return one that starts earliest in *a*, and - of all those maximal matching blocks that start earliest in *a*, return - the one that starts earliest in *b*. - - >>> s = SequenceMatcher(None, " abcd", "abcd abcd") - >>> s.find_longest_match(0, 5, 0, 9) - Match(a=0, b=4, size=5) - - If *isjunk* was provided, first the longest matching block is determined - as above, but with the additional restriction that no junk element appears - in the block. Then that block is extended as far as possible by matching - (only) junk elements on both sides. So the resulting block never matches - on junk except as identical junk happens to be adjacent to an interesting - match. - - Here's the same example as before, but considering blanks to be junk. That - prevents ``' abcd'`` from matching the ``' abcd'`` at the tail end of the - second sequence directly. Instead only the ``'abcd'`` can match, and - matches the leftmost ``'abcd'`` in the second sequence: - - >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") - >>> s.find_longest_match(0, 5, 0, 9) - Match(a=1, b=0, size=4) - - If no blocks match, this returns ``(alo, blo, 0)``. - - This method returns a :term:`named tuple` ``Match(a, b, size)``. - - .. versionchanged:: 3.9 - Added default arguments. + This method implements the core matching logic, while + :meth:`get_matching_blocks` takes care of the maintenance and caching. + For custom maintenance and caching, :meth:`get_matching_blocks` can be + overridden by derived class without making use of this method. .. method:: get_matching_blocks() @@ -487,6 +497,14 @@ The :class:`SequenceMatcher` class has this constructor: [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)] + .. method:: _prepare_seq2() + + Preparation method that is called at the end of :meth:`set_seq2`. + + By default it does nothing, but can be implemented by derived class + for alignment algorithm cache logic. + + .. 
method:: get_opcodes() Return list of 5-tuples describing how to turn *a* into *b*. Each tuple is @@ -576,8 +594,8 @@ The :class:`SequenceMatcher` class has this constructor: The three methods that return the ratio of matching to total characters can give different results due to differing levels of approximation, although -:meth:`~SequenceMatcher.quick_ratio` and :meth:`~SequenceMatcher.real_quick_ratio` -are always at least as large as :meth:`~SequenceMatcher.ratio`: +:meth:`~SequenceMatcherBase.quick_ratio` and :meth:`~SequenceMatcherBase.real_quick_ratio` +are always at least as large as :meth:`~SequenceMatcherBase.ratio`: >>> s = SequenceMatcher(None, "abcd", "bcde") >>> s.ratio() @@ -588,6 +606,91 @@ are always at least as large as :meth:`~SequenceMatcher.ratio`: 1.0 +.. _sequence-matcher: + +SequenceMatcher Objects +----------------------- + +The :class:`SequenceMatcher` class has this constructor: + + +.. class:: SequenceMatcher(isjunk=None, a='', b='', autojunk=True) + + *isjunk*, *a* and *b* are passed on to ``SequenceMatcherBase`` constructor. + See :class:`SequenceMatcherBase` documentation. + + The optional argument *autojunk* can be used to disable the automatic junk + heuristic. + + SequenceMatcher objects get three data attributes: *bjunk* is the + set of elements of *b* for which *isjunk* is ``True``; *bpopular* is the set of + non-junk elements considered popular by the heuristic (if it is not + disabled); *b2j* is a dict mapping the remaining elements of *b* to a list + of positions where they occur. All three are reset whenever *b* is reset + with :meth:`~SequenceMatcherBase.set_seqs` or :meth:`~SequenceMatcherBase.set_seq2`. + + .. versionchanged:: 3.2 + Added the *autojunk* parameter. 
+ + :class:`SequenceMatcher` computes and caches detailed information about the + second sequence, so if you want to compare one sequence against many + sequences, use :meth:`~SequenceMatcherBase.set_seq2` to set the commonly used + sequence once and call :meth:`~SequenceMatcherBase.set_seq1` repeatedly, + once for each of the other sequences. + + .. versionadded:: 3.2 + The *bjunk* and *bpopular* attributes. + + In addition to methods implemented by :class:`SequenceMatcherBase`, + :class:`SequenceMatcher` objects have the following methods: + + + .. method:: _prepare_seq2() + + Implemented to prepare *b2j*, *bjunk* and *bpopular* caches. + + + .. method:: find_longest_match(alo=0, ahi=None, blo=0, bhi=None) + + Find longest matching block in ``a[alo:ahi]`` and ``b[blo:bhi]``. + + If *isjunk* was omitted or ``None``, :meth:`find_longest_match` returns + ``(i, j, k)`` such that ``a[i:i+k]`` is equal to ``b[j:j+k]``, where ``alo + <= i <= i+k <= ahi`` and ``blo <= j <= j+k <= bhi``. For all ``(i', j', + k')`` meeting those conditions, the additional conditions ``k >= k'``, ``i + <= i'``, and if ``i == i'``, ``j <= j'`` are also met. In other words, of + all maximal matching blocks, return one that starts earliest in *a*, and + of all those maximal matching blocks that start earliest in *a*, return + the one that starts earliest in *b*. + + >>> s = SequenceMatcher(None, " abcd", "abcd abcd") + >>> s.find_longest_match(0, 5, 0, 9) + Match(a=0, b=4, size=5) + + If *isjunk* was provided, first the longest matching block is determined + as above, but with the additional restriction that no junk element appears + in the block. Then that block is extended as far as possible by matching + (only) junk elements on both sides. So the resulting block never matches + on junk except as identical junk happens to be adjacent to an interesting + match. + + Here's the same example as before, but considering blanks to be junk. 
That + prevents ``' abcd'`` from matching the ``' abcd'`` at the tail end of the + second sequence directly. Instead only the ``'abcd'`` can match, and + matches the leftmost ``'abcd'`` in the second sequence: + + >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") + >>> s.find_longest_match(0, 5, 0, 9) + Match(a=1, b=0, size=4) + + If no blocks match, this returns ``(alo, blo, 0)``. + + This method returns a :term:`named tuple` ``Match(a, b, size)``. + + .. versionchanged:: 3.9 + Added default arguments. + + .. _sequencematcher-examples: SequenceMatcher Examples @@ -599,15 +702,15 @@ This example compares two strings, considering blanks to be "junk": ... "private Thread currentThread;", ... "private volatile Thread currentThread;") -:meth:`~SequenceMatcher.ratio` returns a float in [0, 1], measuring the similarity of the -sequences. As a rule of thumb, a :meth:`~SequenceMatcher.ratio` value over 0.6 means the -sequences are close matches: +:meth:`~SequenceMatcherBase.ratio` returns a float in [0, 1], measuring +the similarity of the sequences. As a rule of thumb, a :meth:`~SequenceMatcherBase.ratio` +value over 0.6 means the sequences are close matches: >>> print(round(s.ratio(), 3)) 0.866 If you're only interested in where the sequences match, -:meth:`~SequenceMatcher.get_matching_blocks` is handy: +:meth:`~SequenceMatcherBase.get_matching_blocks` is handy: >>> for block in s.get_matching_blocks(): ... print("a[%d] and b[%d] match for %d elements" % block) @@ -615,12 +718,12 @@ If you're only interested in where the sequences match, a[8] and b[17] match for 21 elements a[29] and b[38] match for 0 elements -Note that the last tuple returned by :meth:`~SequenceMatcher.get_matching_blocks` +Note that the last tuple returned by :meth:`~SequenceMatcherBase.get_matching_blocks` is always a dummy, ``(len(a), len(b), 0)``, and this is the only case in which the last tuple element (number of elements matched) is ``0``. 
If you want to know how to change the first sequence into the second, use -:meth:`~SequenceMatcher.get_opcodes`: +:meth:`~SequenceMatcherBase.get_opcodes`: >>> for opcode in s.get_opcodes(): ... print("%6s a[%d:%d] b[%d:%d]" % opcode) @@ -653,7 +756,7 @@ locality, at the occasional cost of producing a longer diff. The :class:`Differ` class has this constructor: -.. class:: Differ(linejunk=None, charjunk=None) +.. class:: Differ(linejunk=None, charjunk=None, linematcher=None, charmatcher=None) :noindex: Optional keyword parameters *linejunk* and *charjunk* are for filter functions @@ -673,6 +776,14 @@ The :class:`Differ` class has this constructor: :meth:`~SequenceMatcher.find_longest_match` method's *isjunk* parameter for an explanation. + *linematcher*: callable with 3 optional arguments which returns + :class:`~SequenceMatcherBase` instance. i.e. ``matcher(isjunk=None, a='', b='')``. + Default (if ``None``) is a :class:`SequenceMatcher` class. + + *charmatcher*: callable with 3 optional arguments which returns + :class:`~SequenceMatcherBase` instance. i.e. ``matcher(isjunk=None, a='', b='')``. + Default (if ``None``) is a :class:`SequenceMatcher` class. + :class:`Differ` objects are used (deltas generated) via a single method: diff --git a/Lib/difflib.py b/Lib/difflib.py index 8f3cdaed9564d8..7877383f001c37 100644 --- a/Lib/difflib.py +++ b/Lib/difflib.py @@ -16,6 +16,9 @@ Function unified_diff(a, b): For two lists of strings, return a delta in unified diff format. +class SequenceMatcherBase: + Base class for implementing sequence matchers. + Class SequenceMatcher: A flexible class for comparing pairs of sequences of any type. 
@@ -35,6 +38,68 @@ from collections import namedtuple as _namedtuple from types import GenericAlias +######################################################################## +### Utilities +######################################################################## + +def _expand_block(block, a, b, alo, ahi, blo, bhi, *, pred=None): + """Expands block for consecutive matches at both sides if characters match + + pred: callable + additionally, only expand if pred(matching_element) returns True + + Examples: + >>> a, b = '_cabbac_', '.cabbac.' + >>> _expand_block((3, 3, 2), a, b, 0, 8, 0, 8) + (1, 1, 6) + >>> _expand_block((3, 3, 2), a, b, 0, 8, 0, 8, pred='a'.__contains__) + (2, 2, 4) + """ + i, j, k = block + while i > alo and j > blo: + el2 = b[j - 1] + if a[i - 1] != el2 or pred is not None and not pred(el2): + break + i -= 1 + j -= 1 + k += 1 + while i + k < ahi and j + k < bhi: + el2 = b[j + k] + if a[i + k] != el2 or pred is not None and not pred(el2): + break + k += 1 + return (i, j, k) + +def _collapse_adjacent_blocks(blocks): + """Collapses adjacent blocks and removes null blocks + + Examples: + >>> blocks = [(1, 1, 2), (3, 3, 2), (6, 6, 0), (10, 10, 1)] + >>> list(_collapse_adjacent_blocks(blocks)) + [(1, 1, 4), (10, 10, 1)] + """ + i1 = j1 = k1 = 0 + for i2, j2, k2 in blocks: + # Is this block adjacent to i1, j1, k1? + if i1 + k1 == i2 and j1 + k1 == j2: + # Yes, so collapse them -- this just increases the length of + # the first block by the length of the second, and the first + # block so lengthened remains the block to compare against. + k1 += k2 + else: + # Not adjacent. Remember the first block (k1==0 means it's + # the dummy we started with), and make the second block the + # new block to compare against. 
+ if k1: + yield (i1, j1, k1) + i1, j1, k1 = i2, j2, k2 + if k1: + yield (i1, j1, k1) + +######################################################################## +### SequenceMatcherBase +######################################################################## + Match = _namedtuple('Match', 'a b size') def _calculate_ratio(matches, length): @@ -42,9 +107,381 @@ def _calculate_ratio(matches, length): return 2.0 * matches / length return 1.0 -class SequenceMatcher: +def _process_matcher_arg(matcher, argname='matcher'): + if matcher is None: + return SequenceMatcher + elif callable(matcher): + test_inst = matcher() + if not isinstance(test_inst, SequenceMatcherBase): + msg = "%r must return SequenceMatcherBase instance. Returned: %r" + raise TypeError(msg % (argname, test_inst)) + return matcher + else: + raise TypeError("%r must be a callable. Got %r" % (argname, matcher)) + +class SequenceMatcherBase: + """Base class for implementing sequence matchers. + + At minimum, derived classes must implement `_get_matching_blocks` method, + which returns a list of tuples of the form (start_in_a, start_in_b, length). + See `_get_matching_blocks` and `get_matching_blocks` for more information. + Once implemented, the following methods are available: + - get_matching_blocks + - get_opcodes + - get_grouped_opcodes + - ratio + - quick_ratio + - real_quick_ratio + + See `SequenceMatcher` for example implementation. """ + + def __init__(self, isjunk=None, a='', b=''): + """ + Optional arg isjunk is None (the default), or a one-argument + function that takes a sequence element and returns true iff the + element is junk. None is equivalent to passing "lambda x: 0", i.e. + no elements are considered to be junk. For example, pass + lambda x: x in " \\t" + if you're comparing lines as sequences of characters, and don't + want to synch up on blanks or hard tabs. + + Optional arg a is the first of two sequences to be compared. By + default, an empty string. 
The elements of a must be hashable. See + also .set_seqs() and .set_seq1(). + + Optional arg b is the second of two sequences to be compared. By + default, an empty string. The elements of b must be hashable. See + also .set_seqs() and .set_seq2(). + + Members: + a : Sequence + first sequence + b : Sequence + second sequence; differences are computed as "what do + we need to do to 'a' to change it into 'b'?" + isjunk : Callable | None + a user-supplied function taking a sequence element and + returning true iff the element is "junk" + "junk" elements are unmatchable elements + matching_blocks : list + a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k]; + ascending & non-overlapping in i and in j; terminated by + a dummy (len(a), len(b), 0) sentinel + opcodes : list + a list of (tag, i1, i2, j1, j2) tuples, where tag is + one of + 'replace' a[i1:i2] should be replaced by b[j1:j2] + 'delete' a[i1:i2] should be deleted + 'insert' b[j1:j2] should be inserted + 'equal' a[i1:i2] == b[j1:j2] + """ + self.isjunk = isjunk + self.a = None + self.b = None + self.set_seqs(a, b) + + def set_seqs(self, a, b): + """Set the two sequences to be compared.""" + self.set_seq1(a) + self.set_seq2(b) + + def set_seq1(self, a): + """Set the first sequence to be compared. + + The second sequence to be compared is not changed. + + >>> s = SequenceMatcher(None, "abcd", "bcde") + >>> s.ratio() + 0.75 + >>> s.set_seq1("bcde") + >>> s.ratio() + 1.0 + >>> + + SequenceMatcher computes and caches detailed information about the + second sequence, so if you want to compare one sequence S against + many sequences, use .set_seq2(S) once and call .set_seq1(x) + repeatedly for each of the other sequences. + + See also set_seqs() and set_seq2(). + """ + + if a is self.a: + return + self.a = a + self.matching_blocks = self.opcodes = None + + def set_seq2(self, b): + """Set the second sequence to be compared. + + The first sequence to be compared is not changed. 
+ + >>> s = SequenceMatcher(None, "abcd", "bcde") + >>> s.ratio() + 0.75 + >>> s.set_seq2("abcd") + >>> s.ratio() + 1.0 + >>> + + SequenceMatcherBase caches detailed information about the + second sequence. set_seq2 clears cache of quick_ratio method. + In addition _prepare_seq2, which is called at the end of set_seq2, + can be implemented by derived class for alignment algorithm cache logic. + + See also set_seqs() and set_seq1(). + """ + + if b is self.b: + return + self.b = b + self.matching_blocks = self.opcodes = None + self.fullbcount = None + self._prepare_seq2() + + def _prepare_seq2(self): + """Preparation method that is called at the end of `set_seq2`. + + By default it does nothing, but can be implemented by derived class + for alignment algorithm cache logic. + """ + pass + + # Abstract Methods ---------------- + # --------------------------------- + + def _get_matching_blocks(self): + """Returns list of tuples of the form (start_in_a, start_in_b, length) + describing matching subsequences. + + Validity of whether blocks actually match is not checked + and it is up to the user to make sure of result's correctness. + + This method implements the core matching logic, while + `get_matching_blocks` takes care of the maintenance and caching. + + For custom maintenance and caching, get_matching_blocks can be + overridden by derived class without making use of this method. + """ + raise NotImplementedError + + # Implemented Methods ------------- + # --------------------------------- + + def get_matching_blocks(self): + """Return list of triples describing matching subsequences. + + Each triple is of the form (i, j, n), and means that + a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in + i and in j. New in Python 2.5, it's also guaranteed that if + (i, j, n) and (i', j', n') are adjacent triples in the list, and + the second is not the last triple in the list, then i+n != i' or + j+n != j'. 
IOW, adjacent triples never describe adjacent equal + blocks. + + The last triple is a dummy, (len(a), len(b), 0), and is the only + triple with n==0. + + When `_get_matching_blocks` is implemented, this method takes care of: + 1. Appending last dummy triple + 2. Collapsing adjacent blocks (after removing empty blocks) + 3. Caching + """ + blocks = self.matching_blocks + if blocks is None: + blocks = self._get_matching_blocks() + blocks = _collapse_adjacent_blocks(blocks) + blocks = list(map(Match._make, blocks)) + # Append dummy at the end + blocks.append(Match(len(self.a), len(self.b), 0)) + # Cache + self.matching_blocks = blocks + return blocks + + def get_opcodes(self): + """Return list of 5-tuples describing how to turn a into b. + + Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple + has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the + tuple preceding it, and likewise for j1 == the previous j2. + + The tags are strings, with these meanings: + + 'replace': a[i1:i2] should be replaced by b[j1:j2] + 'delete': a[i1:i2] should be deleted. + Note that j1==j2 in this case. + 'insert': b[j1:j2] should be inserted at a[i1:i1]. + Note that i1==i2 in this case. + 'equal': a[i1:i2] == b[j1:j2] + + >>> a = "qabxcd" + >>> b = "abycdf" + >>> s = SequenceMatcher(None, a, b) + >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): + ... print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % + ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))) + delete a[0:1] (q) b[0:0] () + equal a[1:3] (ab) b[0:2] (ab) + replace a[3:4] (x) b[2:3] (y) + equal a[4:6] (cd) b[3:5] (cd) + insert a[6:6] () b[5:6] (f) + """ + + if self.opcodes is not None: + return self.opcodes + i = j = 0 + self.opcodes = answer = [] + for ai, bj, size in self.get_matching_blocks(): + # invariant: we've pumped out correct diffs to change + # a[:i] into b[:j], and the next matching block is + # a[ai:ai+size] == b[bj:bj+size]. 
So we need to pump + # out a diff to change a[i:ai] into b[j:bj], pump out + # the matching block, and move (i,j) beyond the match + tag = '' + if i < ai and j < bj: + tag = 'replace' + elif i < ai: + tag = 'delete' + elif j < bj: + tag = 'insert' + if tag: + answer.append( (tag, i, ai, j, bj) ) + i, j = ai+size, bj+size + # the list of matching blocks is terminated by a + # sentinel with size 0 + if size: + answer.append( ('equal', ai, i, bj, j) ) + return answer + + def get_grouped_opcodes(self, n=3): + """ Isolate change clusters by eliminating ranges with no changes. + + Return a generator of groups with up to n lines of context. + Each group is in the same format as returned by get_opcodes(). + + >>> from pprint import pprint + >>> a = list(map(str, range(1,40))) + >>> b = a[:] + >>> b[8:8] = ['i'] # Make an insertion + >>> b[20] += 'x' # Make a replacement + >>> b[23:28] = [] # Make a deletion + >>> b[30] += 'y' # Make another replacement + >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes())) + [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)], + [('equal', 16, 19, 17, 20), + ('replace', 19, 20, 20, 21), + ('equal', 20, 22, 21, 23), + ('delete', 22, 27, 23, 23), + ('equal', 27, 30, 23, 26)], + [('equal', 31, 34, 27, 30), + ('replace', 34, 35, 30, 31), + ('equal', 35, 38, 31, 34)]] + """ + + codes = self.get_opcodes() + if not codes: + codes = [("equal", 0, 1, 0, 1)] + # Fixup leading and trailing groups if they show no changes. + if codes[0][0] == 'equal': + tag, i1, i2, j1, j2 = codes[0] + codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2 + if codes[-1][0] == 'equal': + tag, i1, i2, j1, j2 = codes[-1] + codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n) + + nn = n + n + group = [] + for tag, i1, i2, j1, j2 in codes: + # End the current group and start a new one whenever + # there is a large range with no changes. 
+ if tag == 'equal' and i2-i1 > nn: + group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n))) + yield group + group = [] + i1, j1 = max(i1, i2-n), max(j1, j2-n) + group.append((tag, i1, i2, j1 ,j2)) + if group and not (len(group)==1 and group[0][0] == 'equal'): + yield group + + def ratio(self): + """Return a measure of the sequences' similarity (float in [0,1]). + + Where T is the total number of elements in both sequences, and + M is the number of matches, this is 2.0*M / T. + Note that this is 1 if the sequences are identical, and 0 if + they have nothing in common. + + .ratio() is expensive to compute if you haven't already computed + .get_matching_blocks() or .get_opcodes(), in which case you may + want to try .quick_ratio() or .real_quick_ratio() first to get an + upper bound. + + >>> s = SequenceMatcher(None, "abcd", "bcde") + >>> s.ratio() + 0.75 + >>> s.quick_ratio() + 0.75 + >>> s.real_quick_ratio() + 1.0 + """ + + matches = sum(triple[-1] for triple in self.get_matching_blocks()) + return _calculate_ratio(matches, len(self.a) + len(self.b)) + + def quick_ratio(self): + """Return an upper bound on ratio() relatively quickly. + + This isn't defined beyond that it is an upper bound on .ratio(), and + is faster to compute. + """ + + # viewing a and b as multisets, set matches to the cardinality + # of their intersection; this counts the number of matches + # without regard to order, so is clearly an upper bound + if self.fullbcount is None: + self.fullbcount = fullbcount = {} + for elt in self.b: + fullbcount[elt] = fullbcount.get(elt, 0) + 1 + fullbcount = self.fullbcount + # avail[x] is the number of times x appears in 'b' less the + # number of times we've seen it in 'a' so far ... 
kinda + avail = {} + matches = 0 + for elt in self.a: + if elt in avail: + numb = avail[elt] + else: + numb = fullbcount.get(elt, 0) + avail[elt] = numb - 1 + if numb > 0: + matches += 1 + return _calculate_ratio(matches, len(self.a) + len(self.b)) + + def real_quick_ratio(self): + """Return an upper bound on ratio() very quickly. + + This isn't defined beyond that it is an upper bound on .ratio(), and + is faster to compute than either .ratio() or .quick_ratio(). + """ + + la, lb = len(self.a), len(self.b) + # can't have more matches than the number of elements in the + # shorter sequence + return _calculate_ratio(min(la, lb), la + lb) + + __class_getitem__ = classmethod(GenericAlias) + +######################################################################## +### SequenceMatcher +######################################################################## + +class SequenceMatcher(SequenceMatcherBase): + + """ + Implementation of SequenceMatcherBase. + SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm @@ -103,150 +540,52 @@ class SequenceMatcher: ... print("%6s a[%d:%d] b[%d:%d]" % opcode) equal a[0:8] b[0:8] insert a[8:8] b[8:17] - equal a[8:29] b[17:38] - - See the Differ class for a fancy human-friendly file differencer, which - uses SequenceMatcher both to compare sequences of lines, and to compare - sequences of characters within similar (near-matching) lines. - - See also function get_close_matches() in this module, which shows how - simple code building on SequenceMatcher can be used to do useful work. - - Timing: Basic R-O is cubic time worst case and quadratic time expected - case. SequenceMatcher is quadratic time for the worst case and has - expected-case behavior dependent in a complicated way on how many - elements the sequences have in common; best case time is linear. 
- """ - - def __init__(self, isjunk=None, a='', b='', autojunk=True): - """Construct a SequenceMatcher. - - Optional arg isjunk is None (the default), or a one-argument - function that takes a sequence element and returns true iff the - element is junk. None is equivalent to passing "lambda x: 0", i.e. - no elements are considered to be junk. For example, pass - lambda x: x in " \\t" - if you're comparing lines as sequences of characters, and don't - want to synch up on blanks or hard tabs. - - Optional arg a is the first of two sequences to be compared. By - default, an empty string. The elements of a must be hashable. See - also .set_seqs() and .set_seq1(). - - Optional arg b is the second of two sequences to be compared. By - default, an empty string. The elements of b must be hashable. See - also .set_seqs() and .set_seq2(). - - Optional arg autojunk should be set to False to disable the - "automatic junk heuristic" that treats popular elements as junk - (see module documentation for more information). - """ - - # Members: - # a - # first sequence - # b - # second sequence; differences are computed as "what do - # we need to do to 'a' to change it into 'b'?" 
- # b2j - # for x in b, b2j[x] is a list of the indices (into b) - # at which x appears; junk and popular elements do not appear - # fullbcount - # for x in b, fullbcount[x] == the number of times x - # appears in b; only materialized if really needed (used - # only for computing quick_ratio()) - # matching_blocks - # a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k]; - # ascending & non-overlapping in i and in j; terminated by - # a dummy (len(a), len(b), 0) sentinel - # opcodes - # a list of (tag, i1, i2, j1, j2) tuples, where tag is - # one of - # 'replace' a[i1:i2] should be replaced by b[j1:j2] - # 'delete' a[i1:i2] should be deleted - # 'insert' b[j1:j2] should be inserted - # 'equal' a[i1:i2] == b[j1:j2] - # isjunk - # a user-supplied function taking a sequence element and - # returning true iff the element is "junk" -- this has - # subtle but helpful effects on the algorithm, which I'll - # get around to writing up someday <0.9 wink>. - # DON'T USE! Only __chain_b uses this. Use "in self.bjunk". - # bjunk - # the items in b for which isjunk is True. - # bpopular - # nonjunk items in b treated as junk by the heuristic (if used). - - self.isjunk = isjunk - self.a = self.b = None - self.autojunk = autojunk - self.set_seqs(a, b) - - def set_seqs(self, a, b): - """Set the two sequences to be compared. - - >>> s = SequenceMatcher() - >>> s.set_seqs("abcd", "bcde") - >>> s.ratio() - 0.75 - """ - - self.set_seq1(a) - self.set_seq2(b) - - def set_seq1(self, a): - """Set the first sequence to be compared. - - The second sequence to be compared is not changed. - - >>> s = SequenceMatcher(None, "abcd", "bcde") - >>> s.ratio() - 0.75 - >>> s.set_seq1("bcde") - >>> s.ratio() - 1.0 - >>> - - SequenceMatcher computes and caches detailed information about the - second sequence, so if you want to compare one sequence S against - many sequences, use .set_seq2(S) once and call .set_seq1(x) - repeatedly for each of the other sequences. 
- - See also set_seqs() and set_seq2(). - """ + equal a[8:29] b[17:38] - if a is self.a: - return - self.a = a - self.matching_blocks = self.opcodes = None + See the Differ class for a fancy human-friendly file differencer, which + uses SequenceMatcher both to compare sequences of lines, and to compare + sequences of characters within similar (near-matching) lines. - def set_seq2(self, b): - """Set the second sequence to be compared. + See also function get_close_matches() in this module, which shows how + simple code building on SequenceMatcher can be used to do useful work. - The first sequence to be compared is not changed. + Timing: Basic R-O is cubic time worst case and quadratic time expected + case. SequenceMatcher is quadratic time for the worst case and has + expected-case behavior dependent in a complicated way on how many + elements the sequences have in common; best case time is linear. + """ - >>> s = SequenceMatcher(None, "abcd", "bcde") - >>> s.ratio() - 0.75 - >>> s.set_seq2("abcd") - >>> s.ratio() - 1.0 - >>> + def __init__(self, isjunk=None, a='', b='', autojunk=True): + """Construct a SequenceMatcher. - SequenceMatcher computes and caches detailed information about the - second sequence, so if you want to compare one sequence S against - many sequences, use .set_seq2(S) once and call .set_seq1(x) - repeatedly for each of the other sequences. + isjunk,a,b are passed on to `SequenceMatcherBase` constructor. + See `SequenceMatcherBase` documentation. - See also set_seqs() and set_seq1(). + Optional arg autojunk should be set to False to disable the + "automatic junk heuristic" that treats popular elements as junk + (see module documentation for more information). 
""" - if b is self.b: - return - self.b = b - self.matching_blocks = self.opcodes = None - self.fullbcount = None - self.__chain_b() + # Members specific to Sequence Matcher: + # b2j + # for x in b, b2j[x] is a list of the indices (into b) + # at which x appears; junk and popular elements do not appear + # fullbcount + # for x in b, fullbcount[x] == the number of times x + # appears in b; only materialized if really needed (used + # only for computing quick_ratio()) + # isjunk + # a user-supplied function taking a sequence element and + # returning true iff the element is "junk" -- this has + # subtle but helpful effects on the algorithm, which I'll + # get around to writing up someday <0.9 wink>. + # DON'T USE! Only __chain_b uses this. Use "in self.bjunk". + # bjunk + # the items in b for which isjunk is True. + # bpopular + # nonjunk items in b treated as junk by the heuristic (if used). + self.autojunk = autojunk + super().__init__(isjunk, a, b) # For each element x in b, set b2j[x] to a list of the indices in # b where x appears; the indices are in increasing order; note that @@ -264,6 +603,9 @@ def set_seq2(self, b): # kinds of matches, it's best to call set_seq2 once, then set_seq1 # repeatedly + def _prepare_seq2(self): + self.__chain_b() + def __chain_b(self): # Because isjunk is a user-defined (not C) function, and we test # for junk a LOT, it's important to minimize the number of calls. @@ -361,7 +703,8 @@ def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None): # Windiff ends up at the same place as diff, but by pairing up # the unique 'b's and then matching the first two 'a's. 
- a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__ + a, b, b2j = self.a, self.b, self.b2j + bjunk, bpopular = self.bjunk, self.bpopular if ahi is None: ahi = len(a) if bhi is None: @@ -388,38 +731,29 @@ def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None): besti, bestj, bestsize = i-k+1, j-k+1, k j2len = newj2len - # Extend the best by non-junk elements on each end. In particular, - # "popular" non-junk elements aren't in b2j, which greatly speeds - # the inner loop above, but also means "the best" match so far - # doesn't contain any junk *or* popular non-junk elements. - while besti > alo and bestj > blo and \ - not isbjunk(b[bestj-1]) and \ - a[besti-1] == b[bestj-1]: - besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 - while besti+bestsize < ahi and bestj+bestsize < bhi and \ - not isbjunk(b[bestj+bestsize]) and \ - a[besti+bestsize] == b[bestj+bestsize]: - bestsize += 1 - - # Now that we have a wholly interesting match (albeit possibly - # empty!), we may as well suck up the matching junk on each - # side of it too. Can't think of a good reason not to, and it - # saves post-processing the (possibly considerable) expense of - # figuring out what to do with it. In the case of an empty - # interesting match, this is clearly the right thing to do, - # because no other kind of match is possible in the regions. - while besti > alo and bestj > blo and \ - isbjunk(b[bestj-1]) and \ - a[besti-1] == b[bestj-1]: - besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 - while besti+bestsize < ahi and bestj+bestsize < bhi and \ - isbjunk(b[bestj+bestsize]) and \ - a[besti+bestsize] == b[bestj+bestsize]: - bestsize = bestsize + 1 - - return Match(besti, bestj, bestsize) - - def get_matching_blocks(self): + block = besti, bestj, bestsize + if bpopular: + # Extend the best by non-junk elements on each end. 
In particular, + # "popular" non-junk elements aren't in b2j, which greatly speeds + # the inner loop above, but also means "the best" match so far + # doesn't contain any junk *or* popular non-junk elements. + block = _expand_block(block, a, b, alo, ahi, blo, bhi, + pred=bpopular.__contains__) + + if bjunk: + # Now that we have a wholly interesting match (albeit possibly + # empty!), we may as well suck up the matching junk on each + # side of it too. Can't think of a good reason not to, and it + # saves post-processing the (possibly considerable) expense of + # figuring out what to do with it. In the case of an empty + # interesting match, this is clearly the right thing to do, + # because no other kind of match is possible in the regions. + block = _expand_block(block, a, b, alo, ahi, blo, bhi, + pred=bjunk.__contains__) + + return Match._make(block) + + def _get_matching_blocks(self): """Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that @@ -438,8 +772,6 @@ def get_matching_blocks(self): [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)] """ - if self.matching_blocks is not None: - return self.matching_blocks la, lb = len(self.a), len(self.b) # This is most naturally expressed as a recursive algorithm, but @@ -463,208 +795,13 @@ def get_matching_blocks(self): if i+k < ahi and j+k < bhi: queue.append((i+k, ahi, j+k, bhi)) matching_blocks.sort() + return matching_blocks - # It's possible that we have adjacent equal blocks in the - # matching_blocks list now. Starting with 2.5, this code was added - # to collapse them. - i1 = j1 = k1 = 0 - non_adjacent = [] - for i2, j2, k2 in matching_blocks: - # Is this block adjacent to i1, j1, k1? - if i1 + k1 == i2 and j1 + k1 == j2: - # Yes, so collapse them -- this just increases the length of - # the first block by the length of the second, and the first - # block so lengthened remains the block to compare against. 
- k1 += k2 - else: - # Not adjacent. Remember the first block (k1==0 means it's - # the dummy we started with), and make the second block the - # new block to compare against. - if k1: - non_adjacent.append((i1, j1, k1)) - i1, j1, k1 = i2, j2, k2 - if k1: - non_adjacent.append((i1, j1, k1)) - - non_adjacent.append( (la, lb, 0) ) - self.matching_blocks = list(map(Match._make, non_adjacent)) - return self.matching_blocks - - def get_opcodes(self): - """Return list of 5-tuples describing how to turn a into b. - - Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple - has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the - tuple preceding it, and likewise for j1 == the previous j2. - - The tags are strings, with these meanings: - - 'replace': a[i1:i2] should be replaced by b[j1:j2] - 'delete': a[i1:i2] should be deleted. - Note that j1==j2 in this case. - 'insert': b[j1:j2] should be inserted at a[i1:i1]. - Note that i1==i2 in this case. - 'equal': a[i1:i2] == b[j1:j2] - - >>> a = "qabxcd" - >>> b = "abycdf" - >>> s = SequenceMatcher(None, a, b) - >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): - ... print(("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % - ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))) - delete a[0:1] (q) b[0:0] () - equal a[1:3] (ab) b[0:2] (ab) - replace a[3:4] (x) b[2:3] (y) - equal a[4:6] (cd) b[3:5] (cd) - insert a[6:6] () b[5:6] (f) - """ - - if self.opcodes is not None: - return self.opcodes - i = j = 0 - self.opcodes = answer = [] - for ai, bj, size in self.get_matching_blocks(): - # invariant: we've pumped out correct diffs to change - # a[:i] into b[:j], and the next matching block is - # a[ai:ai+size] == b[bj:bj+size]. 
So we need to pump - # out a diff to change a[i:ai] into b[j:bj], pump out - # the matching block, and move (i,j) beyond the match - tag = '' - if i < ai and j < bj: - tag = 'replace' - elif i < ai: - tag = 'delete' - elif j < bj: - tag = 'insert' - if tag: - answer.append( (tag, i, ai, j, bj) ) - i, j = ai+size, bj+size - # the list of matching blocks is terminated by a - # sentinel with size 0 - if size: - answer.append( ('equal', ai, i, bj, j) ) - return answer - - def get_grouped_opcodes(self, n=3): - """ Isolate change clusters by eliminating ranges with no changes. - - Return a generator of groups with up to n lines of context. - Each group is in the same format as returned by get_opcodes(). - - >>> from pprint import pprint - >>> a = list(map(str, range(1,40))) - >>> b = a[:] - >>> b[8:8] = ['i'] # Make an insertion - >>> b[20] += 'x' # Make a replacement - >>> b[23:28] = [] # Make a deletion - >>> b[30] += 'y' # Make another replacement - >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes())) - [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)], - [('equal', 16, 19, 17, 20), - ('replace', 19, 20, 20, 21), - ('equal', 20, 22, 21, 23), - ('delete', 22, 27, 23, 23), - ('equal', 27, 30, 23, 26)], - [('equal', 31, 34, 27, 30), - ('replace', 34, 35, 30, 31), - ('equal', 35, 38, 31, 34)]] - """ - - codes = self.get_opcodes() - if not codes: - codes = [("equal", 0, 1, 0, 1)] - # Fixup leading and trailing groups if they show no changes. - if codes[0][0] == 'equal': - tag, i1, i2, j1, j2 = codes[0] - codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2 - if codes[-1][0] == 'equal': - tag, i1, i2, j1, j2 = codes[-1] - codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n) - - nn = n + n - group = [] - for tag, i1, i2, j1, j2 in codes: - # End the current group and start a new one whenever - # there is a large range with no changes. 
- if tag == 'equal' and i2-i1 > nn: - group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n))) - yield group - group = [] - i1, j1 = max(i1, i2-n), max(j1, j2-n) - group.append((tag, i1, i2, j1 ,j2)) - if group and not (len(group)==1 and group[0][0] == 'equal'): - yield group - - def ratio(self): - """Return a measure of the sequences' similarity (float in [0,1]). - - Where T is the total number of elements in both sequences, and - M is the number of matches, this is 2.0*M / T. - Note that this is 1 if the sequences are identical, and 0 if - they have nothing in common. - - .ratio() is expensive to compute if you haven't already computed - .get_matching_blocks() or .get_opcodes(), in which case you may - want to try .quick_ratio() or .real_quick_ratio() first to get an - upper bound. - - >>> s = SequenceMatcher(None, "abcd", "bcde") - >>> s.ratio() - 0.75 - >>> s.quick_ratio() - 0.75 - >>> s.real_quick_ratio() - 1.0 - """ - - matches = sum(triple[-1] for triple in self.get_matching_blocks()) - return _calculate_ratio(matches, len(self.a) + len(self.b)) - - def quick_ratio(self): - """Return an upper bound on ratio() relatively quickly. - - This isn't defined beyond that it is an upper bound on .ratio(), and - is faster to compute. - """ - - # viewing a and b as multisets, set matches to the cardinality - # of their intersection; this counts the number of matches - # without regard to order, so is clearly an upper bound - if self.fullbcount is None: - self.fullbcount = fullbcount = {} - for elt in self.b: - fullbcount[elt] = fullbcount.get(elt, 0) + 1 - fullbcount = self.fullbcount - # avail[x] is the number of times x appears in 'b' less the - # number of times we've seen it in 'a' so far ... 
kinda - avail = {} - matches = 0 - for elt in self.a: - if elt in avail: - numb = avail[elt] - else: - numb = fullbcount.get(elt, 0) - avail[elt] = numb - 1 - if numb > 0: - matches += 1 - return _calculate_ratio(matches, len(self.a) + len(self.b)) - - def real_quick_ratio(self): - """Return an upper bound on ratio() very quickly. - - This isn't defined beyond that it is an upper bound on .ratio(), and - is faster to compute than either .ratio() or .quick_ratio(). - """ - - la, lb = len(self.a), len(self.b) - # can't have more matches than the number of elements in the - # shorter sequence - return _calculate_ratio(min(la, lb), la + lb) - - __class_getitem__ = classmethod(GenericAlias) - +######################################################################## +### get_close_matches +######################################################################## -def get_close_matches(word, possibilities, n=3, cutoff=0.6): +def get_close_matches(word, possibilities, n=3, cutoff=0.6, matcher=None): """Use SequenceMatcher to return list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a @@ -679,6 +816,10 @@ def get_close_matches(word, possibilities, n=3, cutoff=0.6): Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities that don't score at least that similar to word are ignored. + Optional arg matcher is a callable with 3 optional arguments and returns + SequenceMatcherBase instance. i.e. matcher(isjunk=None, a='', b=''). + Default (if None) is a SequenceMatcher class. + The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first. 
@@ -693,12 +834,14 @@ def get_close_matches(word, possibilities, n=3, cutoff=0.6): ['except'] """ - if not n > 0: + if not n > 0: raise ValueError("n must be > 0: %r" % (n,)) if not 0.0 <= cutoff <= 1.0: raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,)) + matcher = _process_matcher_arg(matcher, 'matcher') + result = [] - s = SequenceMatcher() + s = matcher() s.set_seq2(word) for x in possibilities: s.set_seq1(x) @@ -714,6 +857,22 @@ def get_close_matches(word, possibilities, n=3, cutoff=0.6): # Strip scores for the best n matches return [x for score, x in result] +######################################################################## +### Differ +######################################################################## + +def _get_differ(differ, linejunk=None, charjunk=None, argname='differ'): + if differ is None: + differ = Differ + elif not callable(differ): + msg = "%r must be a callable. Got %r" + raise TypeError(msg % (argname, differ)) + + differ_inst = differ(linejunk, charjunk) + if not isinstance(differ_inst, Differ): + msg = "%r must return Differ instance. Returned: %r" + raise TypeError(msg % (argname, differ_inst)) + return differ_inst def _keep_original_ws(s, tag_s): """Replace whitespace with the original whitespace characters in `s`""" @@ -722,8 +881,6 @@ def _keep_original_ws(s, tag_s): for c, tag_c in zip(s, tag_s) ) - - class Differ: r""" Differ is a class for comparing sequences of lines of text, and @@ -810,7 +967,8 @@ class Differ: + 5. Flat is better than nested. """ - def __init__(self, linejunk=None, charjunk=None): + def __init__(self, linejunk=None, charjunk=None, + linematcher=None, charmatcher=None): """ Construct a text differencer, with optional filters. @@ -828,10 +986,19 @@ def __init__(self, linejunk=None, charjunk=None): module-level function `IS_CHARACTER_JUNK` may be used to filter out whitespace characters (a blank or tab; **note**: bad idea to include newline in this!). Use of IS_CHARACTER_JUNK is recommended. 
- """ + - `linematcher`: callable with 3 optional arguments which returns + SequenceMatcherBase instance. i.e. matcher(isjunk=None, a='', b=''). + Default (if None) is a SequenceMatcher class. + + - `charmatcher`: callable with 3 optional arguments which returns + SequenceMatcherBase instance. i.e. matcher(isjunk=None, a='', b=''). + Default (if None) is a SequenceMatcher class. + """ self.linejunk = linejunk self.charjunk = charjunk + self.linematcher = _process_matcher_arg(linematcher, 'linematcher') + self.charmatcher = _process_matcher_arg(charmatcher, 'charmatcher') def compare(self, a, b): r""" @@ -859,7 +1026,7 @@ def compare(self, a, b): + emu """ - cruncher = SequenceMatcher(self.linejunk, a, b) + cruncher = self.linematcher(self.linejunk, a, b) for tag, alo, ahi, blo, bhi in cruncher.get_opcodes(): if tag == 'replace': g = self._fancy_replace(a, alo, ahi, b, blo, bhi) @@ -920,7 +1087,7 @@ def _fancy_replace(self, a, alo, ahi, b, blo, bhi): # Later, more pathological cases prompted removing recursion # entirely. cutoff = 0.74999 - cruncher = SequenceMatcher(self.charjunk) + cruncher = self.charmatcher(self.charjunk) crqr = cruncher.real_quick_ratio cqr = cruncher.quick_ratio cr = cruncher.ratio @@ -930,7 +1097,7 @@ def _fancy_replace(self, a, alo, ahi, b, blo, bhi): dump_i, dump_j = alo, blo # smallest indices not yet resolved for j in range(blo, bhi): cruncher.set_seq2(b[j]) - # Search the corresponding i's within WINDOW for rhe highest + # Search the corresponding i's within WINDOW for the highest # ratio greater than `cutoff`. aequiv = alo + (j - blo) arange = range(max(aequiv - WINDOW, dump_i), @@ -1060,8 +1227,8 @@ def IS_LINE_JUNK(line, pat=None): if pat is None: # Default: match '#' or the empty string return line.strip() in '#' - # Previous versions used the undocumented parameter 'pat' as a - # match function. Retain this behaviour for compatibility. + # Previous versions used the undocumented parameter 'pat' as a + # match function. 
Retain this behaviour for compatibility. return pat(line) is not None def IS_CHARACTER_JUNK(ch, ws=" \t"): @@ -1099,7 +1266,8 @@ def _format_range_unified(start, stop): return '{},{}'.format(beginning, length) def unified_diff(a, b, fromfile='', tofile='', fromfiledate='', - tofiledate='', n=3, lineterm='\n', *, color=False): + tofiledate='', n=3, lineterm='\n', *, color=False, + matcher=None): r""" Compare two sequences of lines; generate the delta as a unified diff. @@ -1120,6 +1288,10 @@ def unified_diff(a, b, fromfile='', tofile='', fromfiledate='', 'git diff --color'. Even if enabled, it can be controlled using environment variables such as 'NO_COLOR'. + Optional arg matcher is a callable with 3 optional arguments and returns + SequenceMatcherBase instance. i.e. matcher(isjunk=None, a='', b=''). + Default (if None) is a SequenceMatcher class. + The unidiff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. 
@@ -1142,6 +1314,7 @@ def unified_diff(a, b, fromfile='', tofile='', fromfiledate='', +tree four """ + matcher = _process_matcher_arg(matcher, 'matcher') if color and can_colorize(): t = get_theme(force_color=True).difflib @@ -1150,7 +1323,7 @@ def unified_diff(a, b, fromfile='', tofile='', fromfiledate='', _check_types(a, b, fromfile, tofile, fromfiledate, tofiledate, lineterm) started = False - for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n): + for group in matcher(None,a,b).get_grouped_opcodes(n): if not started: started = True fromdate = '\t{}'.format(fromfiledate) if fromfiledate else '' @@ -1192,8 +1365,8 @@ def _format_range_context(start, stop): return '{},{}'.format(beginning, beginning + length - 1) # See http://www.unix.org/single_unix_specification/ -def context_diff(a, b, fromfile='', tofile='', - fromfiledate='', tofiledate='', n=3, lineterm='\n'): +def context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', + n=3, lineterm='\n', matcher=None): r""" Compare two sequences of lines; generate the delta as a context diff. @@ -1210,6 +1383,10 @@ def context_diff(a, b, fromfile='', tofile='', For inputs that do not have trailing newlines, set the lineterm argument to "" so that the output will be uniformly newline free. + Optional arg matcher is a callable with 3 optional arguments and returns + SequenceMatcherBase instance. i.e. matcher(isjunk=None, a='', b=''). + Default (if None) is a SequenceMatcher class. + The context diff format normally has a header for filenames and modification times. Any or all of these may be specified using strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. @@ -1236,10 +1413,11 @@ def context_diff(a, b, fromfile='', tofile='', four """ + matcher = _process_matcher_arg(matcher, 'matcher') _check_types(a, b, fromfile, tofile, fromfiledate, tofiledate, lineterm) prefix = dict(insert='+ ', delete='- ', replace='! 
', equal=' ') started = False - for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n): + for group in matcher(None,a,b).get_grouped_opcodes(n): if not started: started = True fromdate = '\t{}'.format(fromfiledate) if fromfiledate else '' @@ -1321,7 +1499,7 @@ def decode(s): for line in lines: yield line.encode('ascii', 'surrogateescape') -def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK): +def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK, differ=None): r""" Compare `a` and `b` (lists of strings); return a `Differ`-style delta. @@ -1329,7 +1507,7 @@ def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK): functions, or can be None: - linejunk: A function that should accept a single string argument and - return true iff the string is junk. The default is None, and is + returns true iff the string is junk. The default is None, and is recommended; the underlying SequenceMatcher class has an adaptive notion of "noise" lines. @@ -1339,6 +1517,10 @@ def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK): whitespace characters (a blank or tab; note: it's a bad idea to include newline in this!). + - differ: callable that takes 2 optional arguments and returns + Differ instance. i.e. differ(linejunk=None, charjunk=None). + Default (if None) is a Differ class. + Tools/scripts/ndiff.py is a command-line front-end to this function. Example: @@ -1356,10 +1538,11 @@ def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK): + tree + emu """ - return Differ(linejunk, charjunk).compare(a, b) + differ_inst = _get_differ(differ, linejunk, charjunk, 'differ') + return differ_inst.compare(a, b) -def _mdiff(fromlines, tolines, context=None, linejunk=None, - charjunk=IS_CHARACTER_JUNK): +def _mdiff(fromlines, tolines, context=None, + linejunk=None, charjunk=IS_CHARACTER_JUNK, differ=None): r"""Returns generator yielding marked up from/to side by side differences. 
Arguments: @@ -1369,6 +1552,7 @@ def _mdiff(fromlines, tolines, context=None, linejunk=None, if None, all from/to text lines will be generated. linejunk -- passed on to ndiff (see ndiff documentation) charjunk -- passed on to ndiff (see ndiff documentation) + differ -- passed on to ndiff (see ndiff documentation) This function returns an iterator which returns a tuple: (from line tuple, to line tuple, boolean flag) @@ -1398,7 +1582,7 @@ def _mdiff(fromlines, tolines, context=None, linejunk=None, change_re = re.compile(r'(\++|\-+|\^+)') # create the difference iterator to generate the differences - diff_lines_iterator = ndiff(fromlines,tolines,linejunk,charjunk) + diff_lines_iterator = ndiff(fromlines,tolines,linejunk,charjunk,differ) def _make_line(lines, format_key, side, num_lines=[0,0]): """Returns line of text with user's change markup and line formatting. @@ -1627,6 +1811,9 @@ def _line_pair_iterator(): # Catch exception from next() and return normally return +######################################################################## +### HtmlDiff +######################################################################## _file_template = """ @@ -1737,15 +1924,15 @@ class HtmlDiff(object): _legend = _legend _default_prefix = 0 - def __init__(self,tabsize=8,wrapcolumn=None,linejunk=None, - charjunk=IS_CHARACTER_JUNK): + def __init__(self, tabsize=8, wrapcolumn=None, + linejunk=None, charjunk=IS_CHARACTER_JUNK, differ=None): """HtmlDiff instance initializer Arguments: tabsize -- tab stop spacing, defaults to 8. wrapcolumn -- column number where lines are broken and wrapped, defaults to None where lines are not wrapped. - linejunk,charjunk -- keyword arguments passed into ndiff() (used by + linejunk,charjunk,differ -- keyword arguments passed into ndiff() (used by HtmlDiff() to generate the side by side HTML differences). See ndiff() documentation for argument default values and descriptions. 
""" @@ -1753,6 +1940,7 @@ def __init__(self,tabsize=8,wrapcolumn=None,linejunk=None, self._wrapcolumn = wrapcolumn self._linejunk = linejunk self._charjunk = charjunk + self._differ = differ def make_file(self, fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5, *, charset='utf-8'): @@ -2024,7 +2212,7 @@ def make_table(self,fromlines,tolines,fromdesc='',todesc='',context=False, else: context_lines = None diffs = _mdiff(fromlines,tolines,context_lines,linejunk=self._linejunk, - charjunk=self._charjunk) + charjunk=self._charjunk,differ=self._differ) # set up iterator to wrap lines that exceed desired width if self._wrapcolumn: diff --git a/Lib/test/test_difflib.py b/Lib/test/test_difflib.py index 771fd46e042a41..92800893d9f279 100644 --- a/Lib/test/test_difflib.py +++ b/Lib/test/test_difflib.py @@ -1,4 +1,5 @@ import difflib +from functools import partial from test.support import findfile, force_colorized import unittest import doctest @@ -211,6 +212,8 @@ def test_html_diff(self): t2 = patch914575_to2.splitlines() f3 = patch914575_from3 t3 = patch914575_to3 + # Set prefix manually so that other tests are indepedent + difflib.HtmlDiff._default_prefix = 0 i = difflib.HtmlDiff() j = difflib.HtmlDiff(tabsize=2) k = difflib.HtmlDiff(wrapcolumn=14) @@ -640,6 +643,78 @@ def test_invalid_input(self): ''.join(difflib.restore([], 3)) +class NullMatcher(difflib.SequenceMatcherBase): + def _get_matching_blocks(self): + return [] + + +class TestMatcherDifferArgs(unittest.TestCase): + def test_process_matcher_arg(self): + matcher = difflib._process_matcher_arg(None, 'matcher') + self.assertEqual(matcher, difflib.SequenceMatcher) + with self.assertRaisesRegex(TypeError, "^'matcher' must be a callable. Got"): + difflib._process_matcher_arg(0, 'matcher') + + func = lambda j=None, s1='', s2='': None + regex = "must return SequenceMatcherBase instance. 
Returned: None$" + with self.assertRaisesRegex(TypeError, regex): + difflib._process_matcher_arg(func, 'matcher') + + def test_get_close_matches(self): + results = difflib.get_close_matches('a', ['a', 'aa'], matcher=NullMatcher) + self.assertEqual(results, []) + + def test_differ(self): + result = difflib.Differ().compare(['a', 'a'], ['a', 'a']) + self.assertEqual(list(result), [' a', ' a']) + + null_differ = difflib.Differ(linematcher=NullMatcher, charmatcher=NullMatcher) + result = null_differ.compare(['a', 'a'], ['a', 'a']) + self.assertEqual(list(result), ['- a', '- a', '+ a', '+ a']) + + # although linematcher matches nothing, charmatcher compensates + differ = difflib.Differ(linematcher=NullMatcher) + result = differ.compare(['a', 'a'], ['a', 'a']) + self.assertEqual(list(result), [' a', ' a']) + + def test_unified_diff(self): + result = difflib.unified_diff(['a'], ['a'], matcher=NullMatcher) + self.assertEqual(list(result), ['--- \n', '+++ \n', '@@ -1 +1 @@\n', '-a', '+a']) + + def test_context_diff(self): + result = difflib.context_diff(['a'], ['a'], matcher=NullMatcher) + self.assertEqual(list(result), ['*** \n', + '--- \n', + '***************\n', + '*** 1 ****\n', + '! a', + '--- 1 ----\n', + '! a']) + + def test_ndiff(self): + with self.assertRaisesRegex(TypeError, "^'differ' must be a callable. Got"): + difflib.ndiff([], [], differ=0) + + dfunc = lambda j1=None, j2=None: None + regex = "'differ' must return Differ instance. 
Returned: None" + with self.assertRaisesRegex(TypeError, regex): + difflib.ndiff([], [], differ=dfunc) + + null_differ = partial(difflib.Differ, linematcher=NullMatcher, charmatcher=NullMatcher) + result = difflib.ndiff(['a'], ['a'], differ=null_differ) + self.assertEqual(list(result), ['- a', '+ a']) + + def test_html_diff(self): + difflib.HtmlDiff._default_prefix = 0 + table = difflib.HtmlDiff().make_table(['a'], ['a']) + null_differ = partial(difflib.Differ, linematcher=NullMatcher, charmatcher=NullMatcher) + difflib.HtmlDiff._default_prefix = 0 + hdiff = difflib.HtmlDiff(differ=null_differ) + null_table = hdiff.make_table(['a'], ['a']) + self.assertEqual(len(table), 599) + self.assertEqual(len(null_table), 725) + + def setUpModule(): difflib.HtmlDiff._default_prefix = 0 diff --git a/Misc/NEWS.d/next/Library/2026-02-25-12-30-12.gh-issue-145209.tSDKnR.rst b/Misc/NEWS.d/next/Library/2026-02-25-12-30-12.gh-issue-145209.tSDKnR.rst new file mode 100644 index 00000000000000..7e897e440c34d0 --- /dev/null +++ b/Misc/NEWS.d/next/Library/2026-02-25-12-30-12.gh-issue-145209.tSDKnR.rst @@ -0,0 +1 @@ +Users can now implement and/or provide custom ``Matcher`` and ``Diff`` classes to functions of :mod:`difflib`.