INDEX
Explanations
text related to harmful, intentional actions
terms related to malicious intent or harmful actions
Negative Logits
ĸļ          -1.26
arist       -0.86
ills        -0.81
akeru       -0.76
gdala       -0.73
ļéĨĴ        -0.72
illy        -0.71
hene        -0.71
ère         -0.69
blance      -0.69
Positive Logits
ly          1.23
intent      1.06
mischief    0.87
behaviour   0.87
activity    0.84
behavi      0.82
payload     0.82
icious      0.82
LY          0.81
malicious   0.80
Activations Density: 0.021%