Explanations
references to violent events and incidents
Negative Logits
Father         -0.30
fathers        -0.30
grandfather    -0.29
himself        -0.29
boy            -0.29
guy            -0.28
masculinity    -0.28
brothers       -0.28
gentleman      -0.28
Fathers        -0.28
Positive Logits
woman          0.44
women          0.44
herself        0.42
female         0.41
actresses      0.40
woman          0.39
girl           0.38
Women          0.38
women          0.37
females        0.37
Activations Density: 1.475%