Explanations
negative or limiting phrases
Negative Logits
甚至        -0.21
zwar        -0.21
sice        -0.20
even        -0.20
Even        -0.18
but         -0.18
thậm        -0.18
even        -0.18
iola        -0.16
虽然        -0.16
Positive Logits
necessarily 0.21
enough      0.16
Traversal   0.15
vida        0.15
theless     0.14
υνα         0.14
consequ     0.14
essler      0.14
ilden       0.14
daq         0.14
Activations Density 0.126%