INDEX
Explanations
deliberate disruption or refusal
Negative Logits
%	0.40
ول	0.37
奋	0.37
சன்	0.35
阐	0.34
특징	0.34
isom	0.34
}	0.33
玄	0.32
深刻	0.32
Positive Logits
refused	0.64
refuses	0.62
disrespect	0.61
refuse	0.61
maliciously	0.60
disrespectful	0.58
sabotage	0.57
disregard	0.57
purposely	0.56
deliberately	0.56
Activations Density 0.110%