Explanations
references to relationships and personal attributes of individuals
Negative Logits
herself      -0.33
Frau         -0.28
Woman        -0.28
woman        -0.28
woman        -0.27
female       -0.27
actresses    -0.26
actress      -0.26
atrice       -0.26
Actress      -0.26
Positive Logits
guy          0.33
男子         0.31
men          0.31
boys         0.30
guys         0.30
male         0.30
gentlemen    0.30
boy          0.28
handsome     0.28
males        0.28
Activations Density 0.786%