Explanations
expressions related to hatred or strong negative feelings
feeling hate or related emotions
Negative Logits
Rohy             -0.60
erſt             -0.59
SBATCH           -0.59
stiefe           -0.59
}{@              -0.58
encodeWith       -0.57
Климат           -0.56
moveToFirst      -0.55
ьаж              -0.55
verifyException  -0.53
Positive Logits
hate             1.11
hatred           1.09
hates            1.03
hated            1.03
HATE             1.00
hating           0.99
Hate             0.97
hate             0.96
Hate             0.96
ненави           0.85
Activations Density 0.035%