Explanations
instances of conditional phrases indicating hypothetical scenarios
Negative Logits
Token     Logit
bedo      -0.16
adera     -0.16
sburg     -0.15
sville    -0.15
bole      -0.15
itra      -0.15
inas      -0.15
undra     -0.15
swick     -0.15
nant      -0.14
Positive Logits
Token     Logit
pier      0.16
piercing  0.14
Harvey    0.14
corner    0.13
ota       0.13
else      0.13
/loader   0.13
else      0.13
wrapper   0.13
Motion    0.13
Activations Density: 0.009%