INDEX
    Explanations

    understanding user difficulties

    Detects content about dangerous or illicit requests, the model's safety refusals, and crisis or help-seeking language (e.g., offers of resources and warnings).

    Negative Logits
     optimizing        0.39
    etchup             0.37
     Critics           0.37
     shrimps           0.37
     martini           0.37
     topologically     0.36
     preserving        0.35
    Critics            0.35
    ographers          0.35
     Shaping           0.35
    Positive Logits
    寻求               0.52
     urges             0.50
     желание           0.49
     vragen            0.49
     motiva            0.47
     keinginan         0.47
     möglicherweise    0.46
     kebutuhan         0.45
     Bedür             0.45
     मनात             0.45
    Act Density 0.353%

    No Known Activations