INDEX
Explanations
terms related to harmful intent or behavior
terms related to malicious behavior or intent
Negative Logits
Token      Logit
ĸļ         -0.87
arist      -0.85
ILA        -0.76
ills       -0.74
hetti      -0.70
Passage    -0.70
alon       -0.69
Gap        -0.69
quart      -0.69
Prayer     -0.67
Positive Logits
Token      Logit
malicious  1.18
mischief   0.96
intent     0.94
payload    0.87
icious     0.86
ly         0.82
vertising  0.78
behavi     0.78
behaviour  0.77
fully      0.76
Activations Density: 0.005%