Explanations
statements related to personal reflection and self-awareness
Negative Logits
Token          Logit
brid           -0.83
masked         -0.80
pressing       -0.79
continuous     -0.79
carriage       -0.78
raft           -0.78
applicable     -0.77
camp           -0.76
revers         -0.76
previously     -0.76
Positive Logits
Token          Logit
And             1.68
Because         1.54
That            1.54
It              1.52
Then            1.50
But             1.50
If              1.49
Maybe           1.49
Advertisements  1.49
They            1.48
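One common way dashboards derive logit lists like these for a sparse-autoencoder feature (assumed here, not confirmed by this page) is to take the dot product of the feature's decoder direction with each column of the model's unembedding matrix, then rank the results. The minimal sketch below uses hypothetical toy values and ignores any final layer norm; `feature_logits` and `top_and_bottom` are illustrative names, not a library API.

```python
# Hypothetical sketch: rank vocabulary tokens by how strongly a feature's
# decoder direction pushes their logits up or down. Toy values only.

def feature_logits(decoder_dir, unembed, vocab):
    """Return {token: logit}, where logit = decoder_dir . unembed[:, j].

    decoder_dir: list of length d_model (assumed feature decoder vector)
    unembed:     d_model rows x len(vocab) columns (assumed W_U layout)
    """
    d_model = len(decoder_dir)
    scores = {}
    for j, tok in enumerate(vocab):
        scores[tok] = sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
    return scores


def top_and_bottom(scores, k):
    """Return (k most positive, k most negative) (token, logit) pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # ascending by logit
    bottom = ranked[:k]        # most negative first
    top = ranked[::-1][:k]     # most positive first
    return top, bottom


# Toy example: d_model = 2, vocabulary of three made-up tokens.
decoder_dir = [1.0, -0.5]
unembed = [
    [1.0, 0.0, -1.0],
    [0.0, 2.0, 1.0],
]
vocab = ["tok_a", "tok_b", "tok_c"]

scores = feature_logits(decoder_dir, unembed, vocab)
top, bottom = top_and_bottom(scores, k=1)
print(top)     # [('tok_a', 1.0)]
print(bottom)  # [('tok_c', -1.5)]
```

With a real model, `decoder_dir` would be one row of the SAE decoder and `unembed` the model's unembedding matrix; the ranking step is unchanged.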
Activation Density: 0.409%
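Activation density is typically reported as the fraction of sampled token positions on which the feature fires, that is, where its activation is above zero. Assuming that definition, 0.409% means the feature is active on roughly 4 of every 1,000 tokens (about 1 in 245). A minimal sketch with a hypothetical `activation_density` helper and made-up activations:

```python
# Hypothetical sketch: activation density = share of token positions
# where the feature's activation exceeds a threshold (assumed 0.0).

def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly greater than threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up activations over four token positions: one position fires.
print(activation_density([0.0, 0.0, 3.2, 0.0]))  # 0.25, i.e. 25%

# 409 firing positions out of 100,000 gives the density shown above.
sample = [1.0] * 409 + [0.0] * (100_000 - 409)
print(f"{activation_density(sample):.3%}")  # 0.409%
```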