Explanations
conditional statements and metrics related to the effectiveness of interventions
Negative Logits
Token       Logit
arry        -0.18
stat        -0.15
³           -0.15
mos         -0.15
razier      -0.15
uary        -0.15
canonical   -0.14
nat         -0.14
overall     -0.14
Overall     -0.14
Positive Logits
Token       Logit
rud         0.17
ibir        0.17
ikon        0.16
iber        0.15
еÑĢб        0.15
eson        0.15
caff        0.15
antino      0.14
only        0.14
SSIP        0.13
Activations Density 0.211%