Explanations
phrases indicating exclusions or limitations
Negative Logits
zeug      -0.16
zag       -0.16
gest      -0.15
undi      -0.15
elic      -0.15
ugh       -0.15
hong      -0.15
ellig     -0.14
aris      -0.14
uggage    -0.14
Positive Logits
limited       0.52
limited       0.43
Limited       0.40
LIMITED       0.38
Limited       0.38
LIMIT         0.30
restricted    0.29
limit         0.28
limitation    0.28
exclusive     0.28
Activations Density: 0.006%