INDEX
Explanations
references to personal relationships and familial connections
Negative Logits
Token    Logit
HAS      -0.27
Were     -0.27
Were     -0.26
Has      -0.26
—are     -0.25
_are     -0.24
.are     -0.23
hanno    -0.23
_has     -0.23
aren     -0.23
Positive Logits
Token    Logit
was      0.40
wasn     0.30
became   0.27
could    0.25
couldn   0.25
was      0.24
took     0.23
began    0.23
had      0.22
would    0.22
Activations Density 0.512%