INDEX
Explanations
references to past events and actions related to discussions and conclusions
Negative Logits
Token       Logit
elop        -0.15
wives       -0.13
erna        -0.13
wives       -0.13
cial        -0.13
تاب         -0.13
daughters   -0.13
.Adam       -0.12
luž         -0.12
ucked       -0.12
Positive Logits
Token       Logit
ol          0.32
ole         0.25
Mr          0.24
poor        0.23
dear        0.23
old         0.22
mr          0.21
OUR         0.21
Herr        0.20
Mr          0.20
Activations Density: 0.267%