Learning

The learn capability is what makes A2E agents self-improving. It defines a standard protocol for feedback (human, environment, or self-critique), experience replay (on-policy and off-policy), and adaptation (UCB1, epsilon-greedy, softmax, or custom strategies). Agent actions can be recorded as training signals, and corrections can be turned into policy improvements.

Overview

The learn capability provides a feedback-driven learning system: agents can submit feedback signals, record reinforcement-learning (RL) experience, trigger skill adaptation, and query performance statistics. It connects agent evaluation to policy optimization.

Protocol Messages (8 types)

| Type String | Model | Direction |
| --- | --- | --- |
| learn/feedback/req | LearnFeedbackRequest | Agent → Host |
| learn/feedback/resp | LearnFeedbackResponse | Host → Agent |
| learn/experience/req | LearnExperienceRequest | Agent → Host |
| learn/experience/resp | LearnExperienceResponse | Host → Agent |
| learn/adapt/req | LearnAdaptRequest | Agent → Host |
| learn/adapt/resp | LearnAdaptResponse | Host → Agent |
| learn/stats/req | LearnStatsRequest | Agent → Host |
| learn/stats/resp | LearnStatsResponse | Host → Agent |
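
The eight type strings follow one pattern: each operation has a `learn/<operation>/req` message sent from agent to host and a matching `learn/<operation>/resp` reply. A minimal sketch of that naming rule follows; the helper names `request_type` and `response_type_for` are hypothetical and are not part of the A2E API.

```python
# Hypothetical helpers illustrating the type-string pattern in the table above.
LEARN_OPERATIONS = ("feedback", "experience", "adapt", "stats")


def request_type(operation: str) -> str:
    """Build the Agent -> Host type string for a learn operation."""
    if operation not in LEARN_OPERATIONS:
        raise ValueError(f"unknown learn operation: {operation!r}")
    return f"learn/{operation}/req"


def response_type_for(req_type: str) -> str:
    """Map a request type string to the matching Host -> Agent response type."""
    parts = req_type.split("/")
    if len(parts) != 3 or parts[0] != "learn" or parts[2] != "req":
        raise ValueError(f"not a learn request type: {req_type!r}")
    if parts[1] not in LEARN_OPERATIONS:
        raise ValueError(f"unknown learn operation: {parts[1]!r}")
    return f"learn/{parts[1]}/resp"


# Enumerate all request/response type strings, in table order.
ALL_TYPES = [
    type_string
    for op in LEARN_OPERATIONS
    for type_string in (request_type(op), response_type_for(request_type(op)))
]
```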

Feedback Model

FeedbackPolarity: POSITIVE, NEGATIVE, NEUTRAL, CORRECTIVE

FeedbackDimension: CORRECTNESS, HELPFULNESS, SAFETY, TONE, PLAN_QUALITY

FeedbackSource: HUMAN, ENV, SELF

| Field | Type | Description |
| --- | --- | --- |
| correlation_id | str | Links to the original request |
| polarity | FeedbackPolarity | Positive/negative/neutral/corrective |
| score | float | -1.0 to +1.0 |
| dimension | FeedbackDimension | What aspect is being evaluated |
| confidence | float | 0-1 confidence in this feedback |
| comment | str | Free-text explanation |
| correction | str | Corrected output (for CORRECTIVE polarity) |
| correction_span | dict | Position of the correction |
| source | FeedbackSource | Who gave the feedback |
| annotator_id | str | Annotator identifier |
| rated_turn | RatedTurn | Associated prompt/response pair |

Validation: CORRECTIVE polarity requires correction text (enforced by Pydantic @model_validator).
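
To make the validation rule concrete, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the Pydantic `@model_validator`. The class name `FeedbackSketch` is hypothetical, the enum values are assumed to equal their names, and the range checks on `score` and `confidence` are an assumption based on the documented ranges rather than confirmed behavior.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FeedbackPolarity(str, Enum):
    # Assumption: enum values equal their names.
    POSITIVE = "POSITIVE"
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    CORRECTIVE = "CORRECTIVE"


@dataclass
class FeedbackSketch:
    """Hypothetical stand-in for the feedback model (subset of fields)."""

    polarity: FeedbackPolarity
    score: float
    confidence: float = 1.0
    correction: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed range checks, taken from the documented field ranges.
        if not -1.0 <= self.score <= 1.0:
            raise ValueError("score must be within [-1.0, 1.0]")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")
        # Documented rule: CORRECTIVE polarity requires correction text.
        if self.polarity is FeedbackPolarity.CORRECTIVE and not self.correction:
            raise ValueError("CORRECTIVE feedback requires correction text")
```

In the real model the same cross-field check would run inside the Pydantic validator, which raises a validation error instead of a bare `ValueError`.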

Conversion methods:

  • to_preference_pair() → DPO training pair (chosen vs rejected)
  • to_reward_sample() → Reward model training sample
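
The output shapes of these two methods are not specified here, so the following is only a sketch of what such conversions might look like. The dictionary keys, the standalone function form, and the choice to treat the correction as "chosen" and the original response as "rejected" are all assumptions.

```python
from typing import Optional


def to_preference_pair(prompt: str, response: str,
                       correction: Optional[str]) -> Optional[dict]:
    """Hypothetical DPO pair: the correction is preferred over the response."""
    if not correction:
        # Without a corrected output there is nothing to prefer.
        return None
    return {"prompt": prompt, "chosen": correction, "rejected": response}


def to_reward_sample(prompt: str, response: str, score: float,
                     confidence: float = 1.0) -> dict:
    """Hypothetical reward-model sample: score as target, confidence as weight."""
    return {
        "prompt": prompt,
        "response": response,
        "reward": score,
        "weight": confidence,
    }


pair = to_preference_pair("What is 2+2?", "5", "4")
sample = to_reward_sample("What is 2+2?", "4", score=0.9, confidence=0.95)
```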

Experience Model (RL Replay)

```python
Experience(
    state: dict,        # Current state
    action: dict,       # Action taken
    reward: float,      # Reward received
    next_state: dict,   # Resulting state
    done: bool          # Terminal flag
)
```
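
Records of this shape are the usual input to a replay buffer. As an illustration only, the sketch below stores transitions as plain dicts in a bounded buffer and samples them uniformly, which is the off-policy case. The `ReplayBuffer` class is hypothetical and says nothing about how an A2E host actually stores experience.

```python
import random
from collections import deque
from typing import Optional


class ReplayBuffer:
    """Hypothetical fixed-capacity buffer with uniform random sampling."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        # deque(maxlen=...) silently evicts the oldest item when full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, experience: dict) -> None:
        self._items.append(experience)

    def sample(self, batch_size: int) -> list:
        # Never ask for more items than the buffer holds.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


buffer = ReplayBuffer(capacity=2, seed=0)
for i in range(3):
    buffer.add({
        "state": {"count": i},
        "action": {"type": "inc"},
        "reward": 1.0,
        "next_state": {"count": i + 1},
        "done": False,
    })
batch = buffer.sample(batch_size=5)
```

An on-policy learner would instead consume the newest transitions in order and discard them after each update.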

SkillPerformanceRecord

Rolling per-skill/tool performance stats:

| Field | Type | Description |
| --- | --- | --- |
| name | str | Skill or tool name |
| calls_total | int | Total invocations |
| calls_success | int | Successful calls |
| calls_failed | int | Failed calls |
| avg_duration_ms | float | Average execution time |
| avg_score | float | Average feedback score |
| p95_duration_ms | float | P95 latency |
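
How these fields are computed is not described here. The sketch below shows one plausible derivation from raw call samples, using the nearest-rank method for the 95th percentile. The function name `summarize_calls`, the input tuple layout, and the percentile method are assumptions.

```python
import math
from typing import Optional


def summarize_calls(name: str,
                    calls: list[tuple[float, bool, Optional[float]]]) -> dict:
    """Aggregate (duration_ms, success, score_or_None) samples into one record."""
    total = len(calls)
    durations = sorted(duration for duration, _, _ in calls)
    scores = [score for _, _, score in calls if score is not None]
    succeeded = sum(1 for _, ok, _ in calls if ok)

    if total:
        avg_duration = sum(durations) / total
        # Nearest-rank percentile: smallest value with at least 95% at or below it.
        rank = math.ceil(0.95 * total)
        p95 = durations[rank - 1]
    else:
        avg_duration = 0.0
        p95 = 0.0

    return {
        "name": name,
        "calls_total": total,
        "calls_success": succeeded,
        "calls_failed": total - succeeded,
        "avg_duration_ms": avg_duration,
        "avg_score": sum(scores) / len(scores) if scores else 0.0,
        "p95_duration_ms": p95,
    }


record = summarize_calls("my_skill", [
    (100.0, True, 1.0),
    (200.0, True, 0.5),
    (300.0, False, None),
    (400.0, True, None),
])
```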

Adaptation Strategies

| Strategy | Description |
| --- | --- |
| ucb1 | Upper Confidence Bound: explore/exploit based on confidence intervals |
| epsilon_greedy | Random exploration with epsilon probability |
| softmax | Boltzmann exploration over value estimates |
| custom | User-defined strategy |
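
For reference, the three built-in strategy names correspond to standard bandit selection rules. The sketch below gives textbook versions of each, using `calls_total` as the pull count and `avg_score` as the value estimate. The function names and the `stats` layout are hypothetical, and the actual A2E implementations may differ. In particular, UCB1 is normally stated for rewards in [0, 1], so scores in [-1, 1] may need rescaling.

```python
import math
import random

# stats maps a skill name to (calls_total, avg_score).
Stats = dict[str, tuple[int, float]]


def select_ucb1(stats: Stats) -> str:
    """UCB1: pick the highest mean plus exploration bonus; try untried skills first."""
    for name, (calls, _) in stats.items():
        if calls == 0:
            return name
    total = sum(calls for calls, _ in stats.values())
    return max(
        stats,
        key=lambda n: stats[n][1] + math.sqrt(2 * math.log(total) / stats[n][0]),
    )


def select_epsilon_greedy(stats: Stats, epsilon: float, rng: random.Random) -> str:
    """With probability epsilon pick uniformly at random, else pick the best mean."""
    if rng.random() < epsilon:
        return rng.choice(list(stats))
    return max(stats, key=lambda n: stats[n][1])


def select_softmax(stats: Stats, temperature: float, rng: random.Random) -> str:
    """Boltzmann exploration: sample proportionally to exp(value / temperature)."""
    names = list(stats)
    peak = max(stats[n][1] for n in names)  # subtract the max for numerical stability
    weights = [math.exp((stats[n][1] - peak) / temperature) for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

A lower temperature makes softmax behave more greedily, while a higher one moves it toward uniform sampling.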

LearnPlugin ABC

```python
from abc import abstractmethod

# A2EPlugin and SkillPerformanceRecord are provided by the a2e package;
# their import paths are not shown in this section.


class LearnPlugin(A2EPlugin):
    name = "learn"
    priority = 5

    @abstractmethod
    def _record_feedback(self, feedbacks) -> tuple[int, dict]: ...

    @abstractmethod
    def _store_experiences(self, experiences) -> int: ...

    @abstractmethod
    def _adapt(self, skill_name, strategy) -> list[SkillPerformanceRecord]: ...

    @abstractmethod
    def _get_stats(self, skill_name, tool_name) -> dict: ...
```

LearnAPI (Client)

```python
from a2e.caps.learn.client import LearnAPI

# `client` is an existing A2E client instance (its setup is not shown here)
learn = LearnAPI(client)

# Submit feedback
resp = learn.feedback(
    polarity="POSITIVE",
    score=0.9,
    dimension="CORRECTNESS",
    confidence=0.95,
    prompt="What is 2+2?",
    response="4",
    source="HUMAN",
    comment="Correct answer"
)

# Record RL experience
count = learn.experience([
    {"state": {"count": 0}, "action": {"type": "inc"}, "reward": 1.0,
     "next_state": {"count": 1}, "done": False}
])

# Trigger adaptation
records = learn.adapt(skill_name="my_skill", strategy="ucb1")

# Query stats
skills, tools = learn.stats(skill_name="my_skill")

# Convenience: send scalar reward
learn.reward(skill_name="my_skill", value=1.0, correlation_id="req_123")
```

A2E Protocol v1.0 — Released under the MIT License.