Explanations
phrases indicating a contrast or comparison between two situations
Negative Logits
Token       Logit
jack        -0.16
enton       -0.15
uilt        -0.15
èħ°         -0.15
LETE        -0.14
éŀ          -0.14
proved      -0.14
ãģ£ãģı      -0.14
ailability  -0.14
uit         -0.14
Positive Logits
Token       Logit
flip        0.24
flip        0.22
other       0.20
Flip        0.19
flips       0.19
upside      0.18
.flip       0.18
Flip        0.17
761         0.17
flipping    0.16
Activations Density 0.029%