INDEX
Explanations
phrases that convey realization and understanding
Negative Logits
ÏĮν       -0.14
rint      -0.14
εÏģγ      -0.14
acades    -0.13
arov      -0.13
ÑĢек      -0.13
|/        -0.13
oment     -0.13
ãĢľ       -0.13
zell      -0.13
Positive Logits
just      0.72
just      0.59
how       0.57
exactly   0.47
Just      0.47
JUST      0.46
Just      0.44
how       0.43
why       0.42
juste     0.41
Activations Density: 0.279%