INDEX
Explanations
phrases indicating ongoing or prior events and actions
Negative Logits
ory         -0.14
burger      -0.14
same        -0.14
olini       -0.14
Oliv        -0.14
alles       -0.14
Rear        -0.13
overseas    -0.13
_same       -0.13
deleg       -0.13
Positive Logits
ерин        0.15
deen        0.15
adní        0.14
owie        0.14
chein       0.14
UNUSED      0.14
ubs         0.13
дра         0.13
ailable     0.13
_ORD        0.13
Activations Density 0.322%