INDEX
Explanations
statements starting with "In fact" and "actually"
statements that emphasize factual information
Negative Logits
GBT          -0.56
ggles        -0.56
eded         -0.55
Lastly       -0.54
rounder      -0.53
arthed       -0.53
Lastly       -0.53
Flavoring    -0.53
prus         -0.52
peat         -0.52
Positive Logits
,            0.98
,.           0.78
terday       0.77
,...         0.73
.,           0.70
!,           0.70
rophe        0.65
oln          0.64
,,           0.64
,-           0.62
Activations Density 0.056%