INDEX
Explanations
words associated with reflection or representation
Negative Logits
stant       -0.76
iott        -0.75
aii         -0.74
ensable     -0.71
uilt        -0.71
--------------------------------------------------------    -0.70
yright      -0.69
ccoli       -0.68
CVE         -0.68
TAIN        -0.68
Positive Logits
ror         1.01
neuron      0.94
image       0.86
angelo      0.83
mirror      0.82
Mirror      0.80
neurons     0.79
shards      0.78
ing         0.77
ocular      0.76
Activations Density: 0.050%