INDEX
Explanations
instances of sameness or similarity in concepts or experiences
Negative Logits
rud            -0.17
beyond         -0.16
itself         -0.16
amac           -0.15
alone          -0.15
alian          -0.15
besonders      -0.15
ivec           -0.15
FormControl    -0.14
ntag           -0.14
Positive Logits
except         0.27
except         0.23
than           0.21
minus          0.21
identical      0.21
minus          0.21
Except         0.21
Except         0.20
same           0.20
same           0.19
Activations Density: 0.104%