Explanations
statements highlighting societal biases and inconsistencies
Negative Logits
lí        -0.15
iola      -0.15
ンズ      -0.14
_BUF      -0.14
ング      -0.14
oretical  -0.14
yonel     -0.14
arak      -0.14
erable    -0.13
rams      -0.13
Positive Logits
even      0.29
even      0.25
almost    0.24
даже      0.22
almost    0.21
sogar     0.21
EVEN      0.20
Even      0.19
Even      0.19
it        0.19
Activations Density 0.101%