Explanations
mentions of family relationships, particularly uncles and aunts
Negative Logits
Token            Logit
Daughter         -0.19
granddaughter    -0.19
Wife             -0.17
303              -0.17
Sons             -0.16
grandson         -0.16
Fathers          -0.16
413              -0.15
476              -0.15
妻               -0.15
Positive Logits
Token            Logit
uncle            0.49
Uncle            0.49
Unc              0.49
unc              0.47
Unc              0.45
aunt             0.42
_unc             0.39
Aunt             0.39
Cous             0.35
叔               0.35
Activations Density: 0.230%
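
The density figure can be read as the share of sampled token positions on which the feature fires at all. The sketch below is only an illustration under that assumption: the function name `activation_density`, the zero threshold, and the sample data are hypothetical and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 23 active positions out of 10,000 tokens.
sample = [1.5] * 23 + [0.0] * 9977
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.230%
```

A density this low would mean the feature is sparse: it stays silent on the vast majority of tokens and fires only on a narrow set of contexts, which is consistent with the explanation and the top tokens listed above.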