INDEX
Explanations
phrases related to surprising or unknowingly discovered information
phrases that express awareness or knowledge
Negative Logits
ĪĴ        -0.85
assi      -0.78
stros     -0.76
utic      -0.75
uckles    -0.75
ŃĶ        -0.74
thren     -0.74
ressor    -0.73
empl      -0.70
elin      -0.70
Positive Logits
anymore       0.92
yet           0.80
whatsoever    0.77
DERR          0.75
ledge         0.74
beforehand    0.74
wrongdoing    0.73
nor           0.72
anything      0.72
aloud         0.71
Activations Density 0.092%