Explanations
mentions of the color red
Negative Logits
Token         Logit
gray          -0.16
blackColor    -0.16
grey          -0.16
Haram         -0.15
ightly        -0.15
led           -0.15
asted         -0.15
Gray          -0.14
turquoise     -0.14
egrated       -0.14
Positive Logits
Token         Logit
oub           0.28
dest          0.27
acted         0.26
dish          0.26
/red          0.24
emption       0.23
empt          0.23
-hot          0.23
shift         0.22
uces          0.22
Activations Density: 0.037%