Explanations
phrases indicating transformations or changes in circumstances
Negative Logits
Token     Logit
uti       -0.15
ontent    -0.15
byn       -0.15
Truy      -0.15
allo      -0.14
omik      -0.14
prites    -0.14
elps      -0.14
pry       -0.14
्तक       -0.14

Positive Logits
Token     Logit
nowhere   0.48
thin      0.27
nothing   0.26
Thin      0.25
blue      0.25
Thin      0.24
blue      0.23
thin      0.22
-blue     0.21
nothing   0.21
Activations Density 0.019%