INDEX
Explanations
specific names or entities in a document
references to authority figures or political elements
Negative Logits
Token       Logit
Pg          -0.70
-->         -0.69
Scroll      -0.69
Replay      -0.68
KR          -0.67
GOODMAN     -0.67
Doors       -0.66
>>          -0.66
Secondly    -0.63
ouch        -0.63
Positive Logits
Token        Logit
consolid     0.80
indul        0.72
hypothes     0.69
pione        0.68
langu        0.66
sleek        0.65
relaxing     0.64
devoted      0.64
ambig        0.64
integrating  0.63
Activations Density: 0.586%