INDEX
Explanations
references to individuals and interactions with them
Negative Logits
’s          -0.21
has         -0.17
shown       -0.16
demanded    -0.16
deemed      -0.16
awaited     -0.15
shown       -0.15
hasn        -0.15
presumed    -0.15
's          -0.15
Positive Logits
want        0.58
think       0.45
believe     0.44
wish        0.40
prefer      0.39
know        0.38
need        0.36
hope        0.35
want        0.35
expect      0.35
Activations Density: 0.077%