INDEX
Explanations
verbs related to causing harm
references to the concept of "ruin" and its derivatives
Negative Logits
Token       Logit
appa        -0.74
duino       -0.68
heter       -0.67
enne        -0.67
bors        -0.67
leground    -0.67
arij        -0.66
gencies     -0.66
taboola     -0.63
WER         -0.63
Positive Logits
Token       Logit
havoc       1.07
ous         0.96
ously       0.94
OUS         0.81
fully       0.81
stal        0.81
spoil       0.79
spo         0.77
strument    0.76
ifully      0.76
Activations Density: 0.018%
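
The page does not say how the activation density figure is computed. One common reading, assumed here, is the share of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of sampled tokens on which
    the feature is active, expressed as a percentage.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    # Count tokens where the feature fires above the threshold.
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 18 of 100,000 tokens.
sample = [0.5] * 18 + [0.0] * 99_982
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this assumption, a density of 0.018% corresponds to the feature firing on roughly 18 of every 100,000 tokens, which fits a narrow feature tied to a small family of words such as "havoc" and "spoil".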