RFC 002 - Grammatical Normalization And Noise Tolerance Layer Before Intent Parsing

Status

Status: requested
Scope: text preprocessing before resolution_to_parsed
Primary surface: ontobdc run intent parsing flow

Purpose

This RFC proposes adding a lightweight text normalization layer before the current spaCy parsing step used by resolution_to_parsed.

The goal is to improve grammatical robustness for low-friction terminal input, especially when the user omits accentuation or introduces small orthographic noise in structurally important words.

Context

The current parsing flow relies directly on a lightweight spaCy model such as pt_core_news_sm.

That model is fast and useful, but its dependency parse can degrade noticeably when the user types noisy terminal input.

A concrete failure case already observed is the loss of accentuation in short grammatical terms, such as:

é becoming e
são becoming sao

In those cases, the dependency parse may degrade enough for the flow to:

identify the wrong root
classify a structural token incorrectly
generate invalid or misleading triples in context.rdf
mislead downstream capability validation and filtering

Motivation

Terminal-oriented usage has a low barrier to entry and should tolerate imperfect input.

Users should not need perfect orthography to obtain a valid dependency structure for common operational intents.

This is especially important for:

short interactive prompts
fast command-line usage
Portuguese grammatical markers that depend on accentuation
structural terms whose misclassification can corrupt the parse tree

The intent parser should therefore tolerate a meaningful level of orthographic noise without collapsing the grammatical backbone of the sentence.

Proposal

Introduce a preprocessing stage before the current spaCy model call in resolution_to_parsed.

This new stage should:

receive the raw user intent text
apply a lightweight local normalization pass
return a normalized string for the parser
optionally preserve both original and normalized forms for debugging or RDF traceability
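
The four responsibilities above can be sketched as a small data contract. This is a minimal illustration, not the accepted design: the names NormalizedIntent, normalize_intent and token_rule are hypothetical, and tokenization is a plain whitespace split chosen only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class NormalizedIntent:
    """Hypothetical result type: keeps both forms for debugging or RDF traceability."""
    original: str
    normalized: str
    rewrites: Tuple[Tuple[str, str], ...] = ()

    @property
    def changed(self) -> bool:
        return self.original != self.normalized


def normalize_intent(raw_text: str, token_rule: Callable[[str], str]) -> NormalizedIntent:
    """Apply a per-token rule and record every rewrite that it performed.

    Assumption: tokens are separated by whitespace, so runs of spaces
    are collapsed to a single space in the normalized string.
    """
    fixed_tokens = []
    rewrites = []
    for token in raw_text.split():
        fixed = token_rule(token)
        if fixed != token:
            rewrites.append((token, fixed))
        fixed_tokens.append(fixed)
    return NormalizedIntent(raw_text, " ".join(fixed_tokens), tuple(rewrites))
```

Recording the individual rewrites, rather than only the final string, keeps the stage explainable: a caller can log exactly which tokens were touched, or ignore the record and pass only the normalized string to the parser.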

Proposed Responsibilities

The preprocessing layer should focus on:

grammatical normalization
tolerance to small orthographic errors
restoration of high-impact accentuation in structural tokens

Proposed Normalization Strategy

Use a lightweight local orthographic normalizer, such as an edit-distance-based approach inspired by SymSpell, but constrained to a narrow linguistic scope.

The normalizer should prioritize:

closed-class grammatical words
common auxiliary verbs
interrogative forms
short structural terms that strongly affect dependency parsing

Examples of desirable corrections:

e -> é when the surrounding structure strongly indicates the verb form (é, "is") rather than the conjunction (e, "and")
sao -> são
quais sao -> quais são
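
The example corrections above could be approximated with a tiny lexicon plus one context rule. The sketch below is an assumption-laden illustration: the lexicon contents, the interrogative list and the rule for e are all hypothetical, and it uses a direct lookup rather than the edit-distance approach named above.

```python
from typing import List

# Hypothetical lexicon: unaccented forms that are not valid Portuguese words
# on their own, so restoring the accent is low risk.
UNAMBIGUOUS_ACCENTS = {"sao": "são", "nao": "não", "estao": "estão"}

# Hypothetical context triggers: an "e" right after one of these is treated
# as the verb "é" instead of the conjunction "e".
INTERROGATIVES = {"qual", "quem", "onde", "quando", "como", "que"}


def _match_case(original: str, replacement: str) -> str:
    """Keep a leading capital letter if the user typed one."""
    return replacement.capitalize() if original[:1].isupper() else replacement


def restore_structural_accents(text: str) -> str:
    """Restore accents on a few structural tokens (whitespace-split, no punctuation handling)."""
    tokens = text.split()
    fixed: List[str] = []
    for i, token in enumerate(tokens):
        key = token.lower()
        prev_key = tokens[i - 1].lower() if i > 0 else ""
        next_key = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if key in UNAMBIGUOUS_ACCENTS:
            fixed.append(_match_case(token, UNAMBIGUOUS_ACCENTS[key]))
        elif (
            key == "e"
            and prev_key in INTERROGATIVES
            and next_key
            and next_key not in INTERROGATIVES
        ):
            # "qual e o ..." reads as a question; "como e quando" stays a conjunction.
            fixed.append(_match_case(token, "é"))
        else:
            fixed.append(token)
    return " ".join(fixed)
```

With these assumptions, "qual e o maior arquivo" is rewritten while "listar arquivos e pastas" is left alone, because the rewrite of e depends on its neighbours and not on the token in isolation.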

Proposed Tolerance Rules

The layer should be conservative.

It should:

prefer low-risk corrections
target structural grammar before general vocabulary
avoid rewriting domain-specific nouns aggressively
avoid turning the pre-parser into a full spell-checker
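
One way to read the tolerance rules above is as an acceptance gate around an edit-distance lookup. The following is a sketch under stated assumptions: the function names, the max_distance default, the short-token rule and the protected set are all hypothetical, and it uses a brute-force Levenshtein comparison rather than SymSpell's precomputed delete index.

```python
import unicodedata
from typing import Optional, Set


def fold_accents(word: str) -> str:
    """Strip combining marks so that accent omission costs zero edits."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def conservative_correction(
    token: str, lexicon: Set[str], protected: Set[str], max_distance: int = 1
) -> Optional[str]:
    """Return a correction only when it is low risk, otherwise None."""
    key = token.lower()
    if key in protected or key in lexicon:
        return None  # domain nouns and already-valid structural words are left alone
    folded = fold_accents(key)
    candidates = {
        word for word in lexicon
        if edit_distance(folded, fold_accents(word)) <= max_distance
    }
    if len(key) < 4:
        # Very short tokens: accept accent restoration only, never a letter edit.
        candidates = {word for word in candidates if fold_accents(word) == folded}
    # An ambiguous match is treated as no match.
    return candidates.pop() if len(candidates) == 1 else None
```

Returning None for ambiguous or protected tokens means the gate fails closed: when in doubt, the original token reaches the parser unchanged, which matches the preference for low-risk corrections.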

Possible Runtime Contract

If implemented, the parsing flow may evolve into:

raw user input
grammatical normalization and noise tolerance
spaCy parsing
semantic mapping and RDF materialization
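
The ordering above can be sketched with the normalizer and the parser injected as callables, which keeps the boundary between the two stages explicit. The function name run_parse_flow and the returned keys are hypothetical; in the real flow the parse argument would presumably be the loaded spaCy pipeline, which is itself callable on a string.

```python
from typing import Any, Callable, Dict


def run_parse_flow(
    raw_text: str,
    normalize: Callable[[str], str],
    parse: Callable[[str], Any],
) -> Dict[str, Any]:
    """Run normalization strictly before parsing and keep both text forms."""
    normalized = normalize(raw_text)   # step 2: grammatical normalization
    doc = parse(normalized)            # step 3: parser sees only normalized text
    return {
        "original_text": raw_text,
        "normalized_text": normalized,
        "normalization_applied": normalized != raw_text,
        "doc": doc,                    # input to semantic mapping and RDF materialization
    }
```

Because both stages are plain callables, the flow can be exercised in tests with a stub parser, and the normalizer can be swapped or disabled without touching the parsing code.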

Constraints

The preprocessing layer should:

remain lightweight and local
avoid network calls or external remote dependencies
avoid introducing a heavy language-model dependency before spaCy
remain explainable and testable
minimize false positive rewrites
work well for terminal-sized inputs rather than large text corpora

Expected Impact

If implemented, this change would likely affect:

src/ontobdc/run/plugin/capability/resolution_to_parsed.py
parsing-related runtime tests
RDF expectations for parsedIntent
state-machine reliability when parsing noisy user input

Likely test impact:

add cases for accent omission in Portuguese
verify stable root extraction after normalization
verify that parsedIntent remains semantically valid under noisy terminal input
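
A root-stability check of the kind listed above might look like the sketch below. The helper names are hypothetical, and fake_parse is a stand-in so the example runs without a model; a real test would pass the spaCy pipeline instead, relying on spaCy labelling the root token's dep_ attribute as "ROOT".

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List


def root_text(doc: Iterable[Any]) -> str:
    """Return the text of the token whose dependency label is ROOT."""
    return next(tok.text for tok in doc if tok.dep_ == "ROOT")


def root_is_stable(
    parse: Callable[[str], Iterable[Any]],
    normalize: Callable[[str], str],
    clean: str,
    noisy: str,
) -> bool:
    """True when the noisy input, once normalized, yields the same root as the clean input."""
    return root_text(parse(clean)) == root_text(parse(normalize(noisy)))


def fake_parse(text: str) -> List[SimpleNamespace]:
    """Stand-in parser (not spaCy): picks "são" as root if present, else the first word."""
    words = text.split()
    root_index = words.index("são") if "são" in words else 0
    return [
        SimpleNamespace(text=word, dep_="ROOT" if i == root_index else "dep")
        for i, word in enumerate(words)
    ]
```

Pairing each noisy input with its clean counterpart avoids hard-coding expected roots in the test corpus: the clean sentence acts as the reference, so the assertion stays valid if the underlying model changes.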

Likely documentation impact:

documentation/spec/SPEC008_run_intent_resolution_state_machine.md
documentation/test/TEST005_run_context_state_machine_tests.md

Open Questions

Should the normalized text replace the original text entirely, or should both be preserved?
Should normalization support multiple languages from day one, or only Portuguese initially?
Should the correction layer operate only on a curated structural lexicon?
How much correction confidence is required before rewriting a token?
Should the RDF runtime artifact record that normalization occurred?

Follow-Up

If accepted, the next step should be to define:

the exact boundary between normalization and parsing
the first structural lexicon for Portuguese
the acceptance rules for conservative token correction
the test corpus for noisy terminal inputs