Explanations
negation or denial in statements
Negative Logits
Logit   Token
-0.54   featureID
-0.43   @"/
-0.40   PerformLayout
-0.40   Jurí
-0.39   arşivlendi
-0.39   paravant
-0.38   TargetApi
-0.38   Económica
-0.38   Artículos
-0.37   pleaſure
Positive Logits
Logit   Token
0.77    而非
0.70    而不是
0.62    وليس
0.61    rather
0.54    Rather
0.53    bukan
0.52    Rather
0.50    вместо
0.50    statt
0.50    piuttosto
Activations Density: 0.193%