INDEX
Explanations
phrases related to topics of debate and controversy
Negative Logits
Token       Logit
367         -0.14
tul         -0.14
whatever    -0.14
oi          -0.13
ội          -0.13
ERM         -0.12
ANNEL       -0.12
inders      -0.12
helf        -0.12
아니        -0.12
Positive Logits
Token       Logit
how         0.40
如何        0.28
how         0.28
whether     0.27
why         0.27
cómo        0.24
HOW         0.19
是否        0.19
-how        0.19
whether     0.18
Activations Density 0.147%
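
An activation density figure like the one above is commonly read as the share of sampled token positions on which the feature fires at all. The page does not state how its number was computed, so the sketch below assumes that reading: the function name, the zero threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions with a non-zero
    activation. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 3 of 2,000 token positions.
sample = [0.0] * 1997 + [0.8, 1.3, 0.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.147% would correspond to the feature firing on roughly 15 of every 10,000 sampled tokens.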