Explanations
words reflecting personal relationships and emotional connections
Negative Logits
unavailable      -0.65
absent           -0.60
impossible       -0.57
unseen           -0.57
Absent           -0.54
absent           -0.53
impossibility    -0.50
lacked           -0.49
Impossible       -0.48
unthinkable      -0.48
Positive Logits
doesn            1.61
doesn            1.40
Doesn            1.39
does             1.39
Does             1.29
don              1.29
Does             1.28
does             1.27
Doesn            1.27
doesnt           1.24
Activations Density: 0.228%