Explanations
phrases indicating quantities or comparisons that suggest a lower measure
Negative Logits
Token       Logit
agedList    -0.15
oug         -0.15
YD          -0.15
erken       -0.15
tas         -0.14
)test       -0.14
trl         -0.14
munition    -0.14
apesh       -0.14
Lists       -0.14
Positive Logits
Token       Logit
ened        0.47
ening       0.46
than        0.41
-than       0.40
Than        0.31
ens         0.30
_than       0.29
THAN        0.29
Than        0.28
ons         0.28
Activations Density: 0.034%