INDEX
Explanations
references to introverted personality traits and behaviors
Negative Logits
emd        -0.16
vocab      -0.16
ocr        -0.16
apus       -0.15
esture     -0.15
IMENT      -0.15
onnement   -0.15
adge       -0.15
steller    -0.15
lez        -0.14
Positive Logits
ennon          0.17
trait          0.16
Mattis         0.16
raits          0.15
traits         0.15
309            0.15
compensation   0.15
Tow            0.14
toward         0.14
subs           0.14
Activations Density: 0.026%