Explanations
words related to behavioral tendencies or characteristics
phrases indicating tendencies or behavioral patterns
Negative Logits
gur     -0.80
arta    -0.79
lain    -0.70
yz      -0.64
fil     -0.64
zbek    -0.62
aban    -0.61
ania    -0.60
ZA      -0.59
oğ      -0.58
Positive Logits
rils      1.36
entious   1.03
ril       0.96
erest     0.85
entimes   0.85
erers     0.81
erer      0.81
uce       0.80
toward    0.77
ensical   0.75
Activations Density 0.015%