INDEX
Explanations
mentions of lack of specific attributes or actions
the word "any" in various contexts
Negative Logits
rex       -0.73
gypt      -0.71
ip        -0.70
rox       -0.67
plex      -0.67
bach      -0.64
gal       -0.63
vier      -0.62
rored     -0.62
seless    -0.61
Positive Logits
THING          1.27
particular     1.01
meaningful     0.96
WHERE          0.96
place          0.94
sort           0.92
body           0.92
ones           0.91
significant    0.88
kind           0.85
Activations Density: 0.094%