@Kaushik-Kumar-CEG
Fixes #4690

The Issue

ScanCode was failing to correctly handle composite license expressions in two specific edge cases:

  • The "AND" Failure (Dropping Data):

    • Input: Apache-2.0 AND MIT
    • Old Behavior: Dropped "MIT" as noise. Output: Apache-2.0
    • Expected: Apache-2.0 AND MIT
  • The "OR" Failure (Redundancy):

    • Input: Apache-2.0 OR MIT
    • Old Behavior: Reported both the combined license AND the individual part. Output: Apache-2.0 OR MIT + Apache-2.0
    • Expected: Only Apache-2.0 OR MIT

Summary of Changes

This PR fixes both issues by tightening the data rules and adding a clean-up filter:

  • Added New Rule: Created apache-2.0_and_mit_37.RULE to explicitly catch "Apache-2.0 AND MIT" as a single unit. This prevents the tokenizer from treating the short "MIT" string as discardable noise.
  • Added Redundancy Filter: Updated detection.py to filter out subset matches. If a detected license (e.g., Apache-2.0) is fully contained within a larger detected expression (e.g., Apache-2.0 OR MIT), the redundant subset is now discarded.
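For illustration only, the subset idea can be sketched in plain Python. This is not the code added to detection.py: the helper names `license_keys` and `drop_subset_expressions` are hypothetical, and the sketch compares license keys only, ignoring operators, match positions, and scores.

```python
import re

# Operator keywords that are not license keys.
OPERATORS = {"AND", "OR", "WITH"}


def license_keys(expression: str) -> frozenset:
    """Return the license keys in an expression, ignoring operators and parentheses."""
    tokens = re.findall(r"[A-Za-z0-9.+\-]+", expression)
    return frozenset(t.lower() for t in tokens if t.upper() not in OPERATORS)


def drop_subset_expressions(expressions: list) -> list:
    """Drop any expression whose keys are a strict subset of another expression's keys.

    Hypothetical helper: input order is preserved, and expressions with
    identical key sets are all kept (only strict subsets are redundant).
    """
    keyed = [(expr, license_keys(expr)) for expr in expressions]
    kept = []
    for i, (expr, keys) in enumerate(keyed):
        redundant = any(
            keys < other_keys
            for j, (_, other_keys) in enumerate(keyed)
            if j != i
        )
        if not redundant:
            kept.append(expr)
    return kept


if __name__ == "__main__":
    print(drop_subset_expressions(["Apache-2.0 OR MIT", "Apache-2.0"]))
```

A key-only comparison like this would also drop Apache-2.0 when it sits next to an unrelated Apache-2.0 AND MIT match elsewhere in the file, so a real filter would likely need to look at match positions as well.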

Verification

Verified locally with a new test case (tests/licensedcode/test_issue_4690.py) covering both scenarios. The output now correctly reports single, accurate composite license expressions for both cases without dropping data or adding duplicates.

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
  • Commits are in a uniquely named feature branch and have no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

…ches in 'OR' expressions

Signed-off-by: Kaushik <kaushikrjpm10@gmail.com>
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

@Kaushik-Kumar-CEG IMHO the main fix for this particular issue would be to add rules.

  1. You have lots of test failures; have you even checked these? You need to regenerate the test expectations and see whether your changes make sense or break other tests.
  2. For a case like apache-2.0 AND (apache-2.0 OR mit), you need to ensure these are perfect matches, on the same line or right next to each other, before merging them. The expressions matter too.
  3. You have not crafted any test to show that the issue is actually fixed.
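As a rough sketch of the condition in point 2, the check below only allows two matches to be merged when both are perfect and they sit on the same line. Everything here is an assumption for illustration: `SketchMatch`, `can_merge`, and `max_line_gap` are hypothetical names, not ScanCode's match API, and "perfect" is modelled as a score of 100.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchMatch:
    """Hypothetical stand-in for a license match; not ScanCode's own class."""
    expression: str
    start_line: int
    end_line: int
    score: float  # assumption: 100.0 means a perfect match


def can_merge(a: SketchMatch, b: SketchMatch, max_line_gap: int = 0) -> bool:
    """Return True when both matches are perfect and positioned close together.

    With the default max_line_gap of 0, the later match must start on the
    line where the earlier one ends (or earlier, if the two overlap).
    """
    first, second = sorted((a, b), key=lambda m: (m.start_line, m.end_line))
    both_perfect = first.score == 100.0 and second.score == 100.0
    line_gap = second.start_line - first.end_line
    return both_perfect and line_gap <= max_line_gap


if __name__ == "__main__":
    left = SketchMatch("apache-2.0", 1, 1, 100.0)
    right = SketchMatch("apache-2.0 OR mit", 1, 1, 100.0)
    print(can_merge(left, right))
```

The sketch deliberately does not compare the expressions themselves; deciding which operator combinations are safe to simplify is a separate question.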

is_license_reference: yes
relevance: 100
---
Apache-2.0 AND MIT
(No newline at end of file)
Member

We need to add license rules for the whole texts instead:

  • licensed under Apache-2.0 AND MIT
  • licensed under Apache-2.0 OR MIT

Development

Successfully merging this pull request may close these issues.

Incorrect license expression normalization for AND / OR combinations