
Feature/ted9 47/document probing #590

Closed
kaleanych wants to merge 257 commits into OP-TED:develop from
meaningfy-ws:feature/TED9-47/document-probing

Conversation

@kaleanych
Contributor

No description provided.

schivmeister and others added 28 commits March 10, 2026 21:26
Use the one-stop MSSDK service to detect, convert and load packages of
any given version, normalizing to a unified "v3". This yields support
also for v3L, aka lightweight, as v3 is a superset (the lightweight
variant excludes all data except bare transformation necessities).
Standard Forms and eForms are henceforth "v1" and "v2", respectively.

The pipeline's native model is now an MSSDK v3-extended one, with
JSON-LD as the canonical metadata model. Older models have no
equivalent for this or for the accompanying `context.jsonld`, and the
datetime datatype also needs special handling/conversion when used in
legacy contexts.

A key distinguishing feature of the new unified package is the complete
refactor of the constraints model, removing one level of nesting but
also adding more structure and possibilities with one model (like a
range of document schema versions as seen in v1 or a list of such as
seen in v2). Repurposing these constraints for legacy contexts therefore
needs extra care, if not refactored completely.

Recap of model differences from v1/v2 to v3:

- `identifier` -> `id`
- `issue_date (str)` -> `created_at (datetime)`
- `ontology_version` -> `model_version`
- `metadata_constraints.constraints` -> `applicability_constraints`
- `eforms_subtype` -> `document_type_list`
- `start_date/end_date` -> `document_time_interval.start/end`
- `min/max_xsd_versions` -> `document_version_range.min/max`
- `eforms_sdk_versions` -> `document_schema_version_list`

Note that _applicability constraints_ is a package perspective -- the
same constraints are to be interpreted by the pipeline as _eligibility
constraints_ for a notice.
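
The recap above can be sketched as a plain dictionary transformation. This is only an illustration under stated assumptions, not the MSSDK conversion code: the function name `legacy_to_v3`, the ISO 8601 format of `issue_date`, and the singular legacy key spellings `min_xsd_version`/`max_xsd_version` are all hypothetical.

```python
from datetime import datetime

# Hypothetical rename tables derived from the recap above; not the MSSDK API.
TOP_LEVEL_RENAMES = {
    "identifier": "id",
    "ontology_version": "model_version",
}
CONSTRAINT_RENAMES = {
    "eforms_subtype": "document_type_list",
    "eforms_sdk_versions": "document_schema_version_list",
}


def legacy_to_v3(legacy: dict) -> dict:
    """Map a v1/v2-style metadata dict onto v3-style field names."""
    v3 = {new: legacy[old] for old, new in TOP_LEVEL_RENAMES.items() if old in legacy}

    # issue_date (str) -> created_at (datetime); ISO 8601 input is an assumption.
    if "issue_date" in legacy:
        v3["created_at"] = datetime.fromisoformat(legacy["issue_date"])

    # metadata_constraints.constraints -> applicability_constraints (one level less).
    old = legacy.get("metadata_constraints", {}).get("constraints", {})
    new = {v3_key: old[key] for key, v3_key in CONSTRAINT_RENAMES.items() if key in old}

    if "start_date" in old or "end_date" in old:
        new["document_time_interval"] = {
            "start": old.get("start_date"),
            "end": old.get("end_date"),
        }
    # Assumed legacy key spellings: min_xsd_version / max_xsd_version.
    if "min_xsd_version" in old or "max_xsd_version" in old:
        new["document_version_range"] = {
            "min": old.get("min_xsd_version"),
            "max": old.get("max_xsd_version"),
        }

    v3["applicability_constraints"] = new
    return v3
```

Going the other way (v3 back to a legacy shape) would additionally need the `created_at` datetime serialised back to a string, which is the special handling mentioned earlier.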

There is an additional transitional field `project_identifier`, which
stands in for the `mapping_type`, but only barely. This interpretation
may be deprecated at any point, but not before support is added for
alternative detection mechanisms.
- sparql_test_suites <-> test_suites_sparql
- shacl_test_suites <-> test_suites_shacl

This also fixes some tests that rely on this prerequisite validation data.
Fix the test assumptions by reducing the expected number of test data
files to match. In the case of the Standard Forms (v1) package
`package_F03_test`, there are 105 test data files, of which 82 are unique.
However, only 81 of the unique ones are found within folders under
`test_data`; the rest, including the unique `example.xml`, are not
contained within a folder.
Pass the type along as it does not matter as much anymore since we
normalize to MSSDK v3 and the native pipeline model is an extension of
it.
We now delegate to the MSSDK for validation, which is carried out during
the package parsing/loading.
If an error occurs in the package loading, due to validation or other
failures, simply forward the error and continue loading the other
packages.
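
The "forward the error and continue" behaviour can be sketched as follows. This is a minimal illustration, not the pipeline's actual loader: `load_all_packages` and the `load_one` callable are hypothetical names standing in for whatever performs the MSSDK parsing and validation.

```python
from typing import Callable, Iterable


def load_all_packages(sources: Iterable[str], load_one: Callable[[str], dict]):
    """Load every package, collecting failures instead of aborting the batch."""
    loaded, errors = [], []
    for source in sources:
        try:
            loaded.append(load_one(source))
        except Exception as exc:  # validation or any other loading failure
            # Record the error for the caller and move on to the next package.
            errors.append((source, exc))
    return loaded, errors
```

Returning the errors alongside the loaded packages leaves it to the caller to decide whether a partial load is acceptable.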
- pass MongoDB client to normalise_notice function
- reparse MSSDK CSV list object w/ Pandas to reinterpret numbers
- update tests
Tests were failing with ModelNotFoundError because:
- Notice fixtures didn't set mapping_package_identifier
- Mapping suite/package weren't loaded into test MongoDB instances
- normalise_notice() calls didn't pass mongodb_client parameter

Changes:
- Add load_mapping_suite_and_package fixture to features/conftest.py
- Update notice fixtures to set mapping_package_identifier
- Pass mongodb_client to normalise_notice() in test steps
- Add load_mapping_suite_and_package_fake for e2e tests using mongomock
- Update e2e fixtures to link GitHub-loaded packages to local mapping suite

Fixes 30+ e2e/feature tests that were failing after metadata resource
refactoring with dynamic MS Config loading via MSSDK.
There was a hidden circular dependency in the metadata resource
migration to MS Config via MSSDK.

The previous design required a notice with `mapping_package_identifier`
to load resources, but this created a circular dependency: normalisation
needs resources, yet eligibility checking (which returns a package
identifier but does not set one on the notice) needs normalised
metadata.

Initial assumptions may have been anchored on the resources being
project-specific. However, this is problematic as not all projects may
be updated with the mapping suite configuration. Therefore, resource
files (country.json, languages.json, etc.) can be interpreted to be
global for now during the transition period.

Once all currently known production projects are updated with the
configuration, a more dynamic method of selecting the mapping suite can
be implemented, e.g. via the `document_probing` conditions specified in
the config, which define which XPaths must and must not be present for
a document to be compatible with the project.
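
A check of this kind could look roughly like the sketch below. It is only an illustration: the function name and the `must_exist`/`must_not_exist` parameters are hypothetical and do not reflect the actual `document_probing` schema, and the standard-library `ElementTree` supports only a subset of XPath (relative paths such as `.//tag`).

```python
import xml.etree.ElementTree as ET


def matches_probing(xml_text, must_exist, must_not_exist, namespaces=None):
    """Return True if all required paths match and no forbidden path matches."""
    root = ET.fromstring(xml_text)

    def has(path):
        # find() returns the first matching element or None.
        return root.find(path, namespaces) is not None

    return all(has(p) for p in must_exist) and not any(has(p) for p in must_not_exist)
```

For example, `matches_probing("<notice><eform/></notice>", [".//eform"], [".//FORM_SECTION"])` holds because the required element is present and the forbidden one is absent.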

Changes:
- MappingFilesRegistry now loads resources from any available MappingSuite
- Removed notice parameter from DefaultNoticeMetadataNormaliser and
  EformsNoticeMetadataNormaliser constructors
- Updated find_metadata_normaliser_based_on_xml_manifestation() and
  extract_and_normalise_notice_metadata() to not require notice
- Added MappingSuiteConfigError for when no MappingSuite is available
- Updated all test fixtures to use the new API
- Remove all traces and dependence on a Notice
  mapping_package_identifier
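
The "any available MappingSuite" fallback can be sketched as below. This is a simplified stand-in, not the real `MappingFilesRegistry`: mapping suites are modelled as plain dicts with a hypothetical `resources` key, and `MappingSuiteConfigError` is redefined locally only so the example is self-contained.

```python
class MappingSuiteConfigError(Exception):
    """Local stand-in: raised when no mapping suite with resources is available."""


def load_global_resources(mapping_suites):
    """Return the resource files of the first suite that provides any."""
    for suite in mapping_suites:
        resources = suite.get("resources")  # hypothetical key
        if resources:
            return resources
    raise MappingSuiteConfigError("no MappingSuite with resource files is available")
```

Because the lookup no longer depends on a notice, it can run before eligibility checking, which is what breaks the circular dependency described above.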

TODO: The mapping suite must be made mandatory and be fetched from a
default known project with the configuration if not given.
The actual fetch of the GitHub repo would get no MS config, and the fake
would be adding one. There appears to be an inconsistency, with this test
passing locally but failing on the server, so let us remove the MS config part.
feat(infra): add Fuseki seed data init container and healthcheck
This is required for passing the mongodb client to the
MappingFilesRegistry, which picks up mapping metadata resource files
from the MS config. Without this there is a mismatch in the mongodb
client in tests, whose first entrypoint usually gets a mock, but in this
case, the normalisation would've defaulted to a real one retrieved from
the environment.
Simply specifying the `develop` branch causes problems. Pip will not
reinstall the dependency because it sees no change in the version of
the installed package. In turn, the Docker cache layer for the file
would remain unchanged and the dependency resolution would not be
rerun. This would leave stale dependencies in the images.
Pin development version of MSSDK to avoid stale dependencies
…feature/TED9-47-Enhance-the-eForms-eligibility-checking-component
…-component' into feature/TED9-47/document-probing
@kaleanych kaleanych changed the base branch from main to develop March 11, 2026 18:48
@kaleanych kaleanych closed this Mar 11, 2026


6 participants