INDEX
Explanations
concepts related to social class and status
Negative Logits
boys      -0.21
adores    -0.19
adoras    -0.17
stalk     -0.16
innen     -0.16
ータ      -0.16
ughters   -0.16
embros    -0.15
wives     -0.15
sons      -0.15
Positive Logits
person    0.65
person    0.42
Person    0.40
guy       0.38
Person    0.35
woman     0.34
member    0.34
personne  0.33
player    0.33
_person   0.32
Activations Density 2.020%