INDEX
Explanations
phrases that involve future events or anticipated outcomes
Negative Logits
Token        Logit
_INTR        -0.14
harma        -0.14
Davidson     -0.14
UNS          -0.14
WINDOWS      -0.13
ilim         -0.13
metro        -0.13
panse        -0.13
oplan        -0.13
heartbeat    -0.13
Positive Logits
Token        Logit
anka         0.18
illow        0.17
heim         0.17
iale         0.16
eka          0.16
rig          0.15
furt         0.15
вай          0.15
imir         0.15
joy          0.15
Activations Density: 0.177%