INDEX
Explanations
phrases indicating tendencies or behaviors
Negative Logits
adia      -0.18
oste      -0.17
aving     -0.17
esan      -0.16
icism     -0.15
opard     -0.15
å¥ı       -0.15
idable    -0.14
ιÏİν      -0.14
ourd      -0.14
Positive Logits
erness    0.28
ENCIES    0.20
tend      0.19
entially  0.18
tends     0.18
toward    0.16
encias    0.16
entious   0.16
reds      0.16
ży        0.15
Activations Density: 0.009%