INDEX
Explanations
phrases indicating unlikelihood or improbability
phrases indicating improbability or doubt
Negative Logits
tein          -0.84
oola          -0.79
cience        -0.75
utterstock    -0.74
ravings       -0.73
ription       -0.72
ngth          -0.72
utations      -0.71
aeper         -0.71
rongh         -0.71
Positive Logits
ever           0.85
theless        0.85
anymore        0.76
Squirrel       0.76
EVER           0.76
anything       0.75
icably         0.73
coincidence    0.72
bably          0.68
anyone         0.67
Activations Density 0.041%
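
The page does not define how the density figure is computed. A plausible reading is that it is the fraction of sampled token positions on which the feature activates at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are assumptions for illustration, not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, the feature fires on 4 of them.
sample = [0.0] * 10_000
for position in (12, 857, 4096, 9001):
    sample[position] = 1.5

density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

Under this reading, a density of 0.041% would mean the feature fires on roughly 4 out of every 10,000 tokens, which fits a feature tied to a narrow set of phrases.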