INDEX
Explanations
words related to physical actions or impacts
Negative Logits
iser       -0.08
istic      -0.08
hin        -0.08
strict     -0.08
.au        -0.08
istics     -0.07
readcr     -0.07
estruct    -0.07
hang       -0.07
dest       -0.07
Positive Logits
ively      0.08
ingly      0.07
aller      0.07
nowled     0.07
et         0.07
ur         0.07
al         0.06
nowledge   0.06
able       0.06
¯ÃĤ        0.06
Activations Density 0.012%