INDEX
Explanations
phrases that indicate movement or transformation towards a goal or state
Negative Logits
ernet       -0.16
ose         -0.15
579         -0.15
ffa         -0.15
ipt         -0.14
ffer        -0.14
lep         -0.14
trand       -0.14
ersh        -0.14
elts        -0.14
Positive Logits
levels      0.20
stell       0.16
Level       0.15
orelease    0.15
zero        0.15
level       0.15
Poll        0.15
completion  0.14
Levels      0.14
owski       0.14
Activations Density: 0.092%