INDEX
Explanations
References to color combinations and their descriptions
Negative Logits
adir     -0.15
kud      -0.15
quirrel  -0.14
-alist   -0.14
illez    -0.14
owell    -0.13
orney    -0.13
adla     -0.13
há       -0.13
ابر      -0.13
Positive Logits
white    0.51
green    0.48
yellow   0.48
blue     0.45
black    0.40
orange   0.39
white    0.38
brown    0.38
red      0.38
green    0.34
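The two logit lists are the extremes of one per-token score vector: the tokens this feature most promotes and most suppresses. A minimal sketch of how such lists can be read off follows. It assumes a convention that is common for SAE feature dashboards but is not stated on this page: each token's score is the dot product of the feature's decoder direction with that token's unembedding column. The names `w_dec_f` and `W_U` and the toy vocabulary and values are hypothetical. Because the ranking is over token IDs, two entries can render identically, as with the two `white` rows above. These are presumably tokens that differ only in a leading space or in casing that was lost in this copy.

```python
import numpy as np

def logit_effects(w_dec_f: np.ndarray, W_U: np.ndarray) -> np.ndarray:
    # Assumed convention (hypothetical names): project the feature's decoder
    # direction, shape (d_model,), through the unembedding, shape
    # (d_model, d_vocab), to get one score per vocabulary token.
    return w_dec_f @ W_U

def top_bottom_logits(effects, vocab, k=10):
    """Return the k most positive and k most negative (token, value) pairs."""
    order = np.argsort(effects)  # token indices sorted by ascending score
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example with hypothetical scores (not real model weights).
vocab = ["white", "adir", "blue", "kud", "the"]
effects = np.array([0.51, -0.15, 0.45, -0.14, 0.0])
positive, negative = top_bottom_logits(effects, vocab, k=2)
print(positive)  # [('white', 0.51), ('blue', 0.45)]
print(negative)  # [('adir', -0.15), ('kud', -0.14)]
```

Sorting once and slicing from both ends yields both lists from a single pass over the vocabulary.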
Activation Density: 0.257%
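Activation density is conventionally the fraction of sampled token positions on which the feature has a nonzero activation. Under that assumption, 0.257% means the feature fires on roughly 1 token in 389. The sketch below computes it. The strict `> 0` threshold and the synthetic activation array are assumptions for illustration, not this dashboard's documented procedure.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    # Fraction of token positions where the feature fires (activation > threshold).
    acts = np.asarray(activations, dtype=float)
    return float(np.mean(acts > threshold))

# Hypothetical sample: 257 firing positions out of 100,000 tokens.
acts = np.zeros(100_000)
acts[:257] = 1.0
print(f"{activation_density(acts):.3%}")  # 0.257%
```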