Explanations
information revealing surprising or unexpected facts
phrases that emphasize revelations or surprising conclusions
Negative Logits
Token       Logit
icipated    -0.72
cious       -0.70
oided       -0.67
ilater      -0.67
uli         -0.66
ombs        -0.66
comm        -0.66
shaw        -0.65
notations   -0.65
resents     -0.64
Positive Logits
Token       Logit
there       0.84
nobody      0.71
âĶĢ         0.71
ymes        0.65
they        0.65
quite       0.63
Professor   0.62
none        0.62
din         0.62
THERE       0.62
Activations Density: 0.036%