
Refactor cli internals #54

Merged
fantazio merged 12 commits into LexiFi:master from fantazio:refactor_cli
Feb 18, 2026
Conversation

@fantazio (Collaborator) commented Feb 5, 2026

This is a significant change and part of an even larger refactor planned (#43).

Key changes

Introducing the Config module

Configuration-related code is now contained in its own module (and its submodule Sections). This will make it easier to debug the configuration of the analysis when necessary (debugging facilities such as pretty-printing are planned for later work).

This module provides an immutable API with private types, ensuring there is no risk of an involuntary update of the configuration. There are only two ways to obtain a configuration: either use Config.default_config or parse the command line with Config.parse_cli ().
The only updatable field is sections.style, via the Config.update_style function. This is necessary for now because the analysis must not track stylistic issues when processing the paths passed to the --references command line argument 1.
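A minimal sketch of what such an immutable, private-type API can look like in OCaml. All field, constructor, and value names here (style, verbose, Check_style, Ignore_style) are illustrative assumptions, not the actual ones from this PR:

```ocaml
(* Hypothetical sketch of an immutable config API with a private type. *)
module Config : sig
  type style = Check_style | Ignore_style             (* assumed variant *)
  type t = private { style : style; verbose : bool }  (* assumed fields *)
  val default_config : t
  val update_style : style -> t -> t
end = struct
  type style = Check_style | Ignore_style
  type t = { style : style; verbose : bool }
  let default_config = { style = Check_style; verbose = false }
  (* Returns a new config; the old one is untouched. *)
  let update_style style t = { t with style }
end

let () =
  let c = Config.(update_style Ignore_style default_config) in
  assert (c.Config.style = Config.Ignore_style)
```

The private annotation lets consumers read fields but prevents them from constructing or mutating a config outside the module, which is exactly the guarantee described above.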

State.t has been updated to hold the configuration, and the analysis state can only be initialized by providing a configuration 2.
Rather than accessing configuration information through globally accessible refs, the analyzer now reads the appropriate state.config fields.

Simplifying the command line argument order

Previously, the following command meant "analyze path1 for unused exported values only and path2 for both unused exported values and unused methods":
`dead_code_analyzer --nothing -E all path1 -M all path2`
Now all the paths are gathered before any analysis is run, so the configuration is the same for all files. Hence, the above command now means "analyze path1 and path2 for unused exported values and methods". Consequently, one may prefer its "normalized" form instead:
`dead_code_analyzer --nothing -E all -M all path1 path2`

Disabling the complex command line semantics is deliberate. It could be re-enabled in the future if there is demand for it, but for now the simplified meaning should lead to clearer documentation and easier maintenance.

Note

This change will facilitate the 5.3 update, which will require knowing which .cm* files are provided in order to find some location information. (The 5.3 update currently fails on cases like #53.)

In OCaml 5.3, cmt_infos.cmt_value_dependencies has been replaced by cmt_infos.cmt_declaration_dependencies, which uses Shape.Uid.t instead of Types.value_description. This change of type makes locations harder to obtain: they were directly accessible in Types.value_description but do not exist in Shape.Uid.t.
The analysis relies on locations. Finding a uid's location can be done using cmt_infos.cmt_uid_to_decl but, as described in Shape.mli, one needs to read the .cmt of external units to find the location of the corresponding "external" uids (even though they appear in cmt_declaration_dependencies).

Thus, knowing which files are given to the tool will indicate in which files "external" uids can be looked up, and the analyzer will be able to correctly associate locations as it does in OCaml 5.2.

Footnotes

  1. I think this rule should be changed to still track stylistic issues on "reference" paths. The idea behind --references was to not track new declarations (be it values, types, methods or optional arguments). This idea does not contradict the identification of stylistic issues.

  2. A default state exists and provides the default_config, but it is not explicitly exposed in the API. It is only used as the initial value for the current state before any State.update happens.

This is the first of a series of commits aiming at extracting
configuration related code from the analysis code.

NOTE:

There are multiple end goals at stake:
- for the user interface, its specification should be tightened to
  avoid confusion on command lines such as
  `dead_code_analyzer --nothing -E all path1 -M all path2`
  which would currently mean "analyze path1 for the exported values only
  and path2 for both the exported values and the methods".
  Ideally, it should be simplified to "analyze path1 and path2 for
  exported values and methods" and encourage the use of
  `dead_code_analyzer --nothing -E all -M all path1 path2`
  instead.
- for the configuration, it should appear as immutable in order to
  facilitate debugging and accessing its information.
- for the analysis, the end goal is to be able to split the analysis in
  2 passes:
  1. on the interfaces to gather all the declarations
  2. on the implementations to gather all the references.
  They are currently interleaved (for each compilation unit, we
  first process the interface and then the implementation). This split,
  combined with a more robust configuration, should help the transition
  to OCaml 5.3
The main sections config was a structure with fields `print`,
`threshold`, and `call_sites`. The opt args sections config was a
structure with fields `print`, `threshold`, and `call_sites`. The only
difference between the two is the type of `threshold`.

Hence, they can be grouped under a single record type parameterized by
`threshold`'s type.

Additionally, having the `threshold` and `call_sites` fields mandatory
does not make sense when the main usage of the tool does not require
them, and the `print` field would express if the reports were activated
for the section.
Hence, the sections configuration is simplified to express more
explicitly the 3 states it can be in:
- deactivated (`Off`)
- activated without threshold (`On`)
- activated with threshold (`Threshold _`)

The `threshold` and `call_sites` information are now only available in
the `Threshold` mode, where they make sense.
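The three states described above can be sketched as an OCaml variant; the record fields below are assumptions based on the description, not the actual definitions:

```ocaml
(* Hypothetical sketch of the three-state section configuration. *)
type threshold_info = { threshold : int; call_sites : bool }  (* assumed fields *)

type section_mode =
  | Off                          (* reports deactivated *)
  | On                           (* reports activated, no threshold *)
  | Threshold of threshold_info  (* reports filtered by a threshold *)

(* A report is emitted unless the section is Off. *)
let must_report = function
  | Off -> false
  | On | Threshold _ -> true
```

Making `threshold` and `call_sites` reachable only through the `Threshold` constructor encodes in the type system that they are meaningless in the other two modes.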
It was a structure with fields `percentage`, `exceptions` and
`optional`. The `optional` field was actually a mode selection (`Both`
or `Percent`). If the mode was `Both`, then the `exceptions` field
would be used. Otherwise, only `percentage` would.

Rather than keeping this implicit semantics with everything at hand all
the time, the type is now properly split into 2 constructors, `Both` and
`Percent`, which each hold only the relevant information.
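As a hedged sketch, the split might look like this in OCaml (field names and types are assumptions based on the description):

```ocaml
(* Hypothetical split of the opt-args configuration into two constructors. *)
type opt_args_mode =
  | Percent of float                                  (* only the percentage is used *)
  | Both of { percentage : float; exceptions : int }  (* exceptions are also consulted *)

(* Each constructor carries only the data that is meaningful for it. *)
let percentage = function
  | Percent p -> p
  | Both { percentage; _ } -> percentage
```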
It contains all the configuration. For now it simply regroups all the
configuration values that were previously independent in `Config`.
The extra indirection makes the code heavier to read for now.
The section configuration fields are still `ref`s, hence the
`!(config._)` constructs.
The other fields are now mutable.
`Config.config` is now a ref and its fields are neither `mutable` nor
`ref`.
Additionally, now that the cli parsing is in `Config`, `Config`'s API is
reduced as to not expose the "update" functions. Only `update_style`
remains for now.
The new `Config.Sections` module exposes an immutable API for the
sections configuration. The sections configuration is now exposed
through `Config.config.sections`.
This simplifies the code in consumers of the config API.
Paths considered for analysis, to gather references from, and those
excluded from the analysis are now available in the config. This step
will enable moving to an immutable config and an "out-of-arg-parsing"
analysis (analysis is currently triggered during the parsing of
arguments).
The `Config` API is now immutable. `parse_cli` will produce a new
`Config.t` each time rather than update the `Config.config`, and the old
`config : t ref` is now simply `default_config : t`.
`State.init` now expects a config as argument to produce the initial
state. During the analysis, the configuration that must be accounted for
is the one in the state.
This protects the configuration from updates outside the expected
channels (the functions in the API).
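The init-from-config flow described above can be sketched as follows; both modules here are simplified hypothetical stand-ins, not the actual implementation:

```ocaml
(* Hypothetical, minimal stand-ins for the Config and State modules. *)
module Config = struct
  type t = { verbose : bool }  (* placeholder field *)
  let default_config = { verbose = false }
end

module State = struct
  type t = { config : Config.t; files_seen : int }
  (* The state can only be created from a configuration. *)
  let init config = { config; files_seen = 0 }
end

let () =
  let st = State.init Config.default_config in
  assert (st.State.files_seen = 0)
```

Because `State.init` is the only way to obtain a state, every analysis run necessarily starts from an explicit configuration rather than from mutable globals.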
+ Fix some typos in comments.
+ Move `StringSet` to `Utils` (for now)
+ Easier to read `Arg.parse`.
+ Rename `Config.*_activated` into `Config.must_report_*`
+ Rewrite `Config.normalize_path` for clarity
@nojb (Contributor) commented Feb 11, 2026

Thanks for the useful description.

This change will facilitate the 5.3 update, which would require to know which .cm* files are provided to find out some location information. (The 5.3 update currently fails on cases like #53).

About this: when I last looked at it I thought that for an accurate analysis a full dependency tree of each module was needed (but perhaps I was missing something obvious). I considered using the output of dune describe to retrieve it from Dune itself. I thought I would just mention it in case it makes sense (but it may very well not :)).

@fantazio (Collaborator, Author) commented Feb 11, 2026

Thanks for the pointer to dune describe. Sorry for the lengthy response.

when I last looked at it I thought that for an accurate analysis a full dependency tree of each module was needed

You are right that the full dependency tree will help provide a more accurate analysis. What I usually do for projects built with dune is run dead_code_analyzer [options] _build and let the analyzer discover all the .cm* files it can analyze in _build.

In practice, we don't build the dependency tree (yet?) because the analysis is mostly order-insensitive. The only* order-sensitive part of the analysis relies on the fact that a compilation unit's interface is analyzed before its implementation.

* There may be some corner cases where order may have an influence. It was the case for re-exports (07408d8). To ensure the results of the analyzer are insensitive to the order of the paths in the command line, they are analyzed in lexicographical order, and the reports are also sorted in that order. Before the current change, the paths on the command line were analyzed in their order of appearance, and only the paths found when traversing directories would be analyzed in lexicographical order.
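Normalizing the order of analyzed paths, as described above, boils down to a lexicographic sort (a trivial sketch, with an illustrative function name):

```ocaml
(* Sort paths lexicographically so results do not depend on CLI order. *)
let normalize_order paths = List.sort String.compare paths
```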

I considered using the output of dune describe to retrieve it from Dune itself.

If I understand correctly, you mean parsing that output to pass all the right files to the analyzer. I am not sure this would provide value in most cases vs passing _build on the command line. There are probably situations where a .cm* file is not stored in the same _build directory, but wouldn't that mean it is independent from the rest?

Note

This is actually the case for the dead_code_analyzer: its build files are in _build, the examples/using_dune build files are in examples/using_dune/_build, the examples/using_make build files are spread in examples/using_make (not using dune), and the check/src build files are in check/src (not using dune).
This separation is deliberate and there is no dependency between the listed components.

If you meant using the output of dune describe to build the dependency tree for each module in the analyzer itself, this is a good idea. The dependency trees could probably be built using information stored in .cmi files as well. I think this is not critical for now, but I see 2 immediate ways such trees could help:

  • Scheduling the analysis: analyze the leaves first, then their parents, and group files for parallel analysis. In the long term I actually plan to have an incremental/differential-analysis mode (à la Infer), which would allow analyzing and reporting on only the subsets of a project impacted by a change, and this kind of scheduling would greatly help.
  • Dependency-aware caching: for the issue described in the Note section of this PR, and combined with a dependency-aware scheduler, .cm* files can be cached for exactly as long as they are needed, and information retrieval would be straightforward.

@fantazio (Collaborator, Author) commented
@nojb Let me know if you'd like to review this PR more in depth. Otherwise, I'll merge it Wednesday ;)

@fantazio fantazio merged commit f022714 into LexiFi:master Feb 18, 2026
5 checks passed