INDEX
Explanations
phrases related to comparisons and ratios
Negative Logits
experiment      -0.16
lers            -0.16
experiment      -0.15
Experiment      -0.15
ardown          -0.15
Experiment      -0.15
gui             -0.14
avad            -0.14
dó              -0.14
experimental    -0.14
Positive Logits
topic       0.20
theme       0.19
theme       0.19
chosen      0.19
Topic       0.18
themes      0.18
selected    0.18
topic       0.18
-topic      0.17
_theme      0.16
Activations Density: 0.007%