INDEX
Explanations
understanding user difficulties
Detects content about dangerous or illicit requests, the model's safety refusals, and crisis or help-seeking language (e.g., offers of resources and warnings).
Negative Logits
optimizing       0.39
etchup           0.37
Critics          0.37
shrimps          0.37
martini          0.37
topologically    0.36
preserving       0.35
Critics          0.35
ographers        0.35
Shaping          0.35
Positive Logits
寻求             0.52
urges            0.50
желание          0.49
vragen           0.49
motiva           0.47
keinginan        0.47
möglicherweise   0.46
kebutuhan        0.45
Bedür            0.45
मनात             0.45
Activations Density 0.353%