Explanations
phrases indicating universality or generalization
Negative Logits
Token        Logit
onio         -0.44
ствия        -0.42
Storyboard   -0.42
主意         -0.41
mologie      -0.41
ahal         -0.41
low          -0.41
an           -0.40
|>           -0.39
Ref          -0.39
Positive Logits
Token        Logit
every        1.99
every        1.99
Every        1.94
Every        1.93
Chaque       1.82
Chaque       1.80
EVERY        1.74
Each         1.73
Each         1.72
each         1.72
Activations Density: 0.408%
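As a rough illustration of how a density figure like this could be computed, the sketch below assumes that "activations density" means the percentage of tokens on which the feature's activation is above zero. That definition is an assumption, not something stated on this page, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: density = (tokens where the feature fires) / (all tokens).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical activations for one feature over 1,000 tokens; 4 are nonzero.
sample = [0.0] * 996 + [1.2, 0.7, 3.1, 0.05]
density = activation_density(sample)
print(f"{density:.3f}%")  # prints 0.400%
```

Under this reading, a density of 0.408% would mean the feature fires on roughly 4 of every 1,000 tokens.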