Explanations
phrases or expressions indicating a reduction or lower amount
Negative Logits
Token    Logit
oug      -0.16
tas      -0.15
trl      -0.14
ulist    -0.14
pas      -0.14
ady      -0.14
issy     -0.14
erken    -0.14
Resp     -0.14
ū        -0.13
Positive Logits
Token    Logit
ened     0.44
ening    0.43
-than    0.39
than     0.36
Than     0.31
_than    0.30
ens      0.27
Than     0.27
THAN     0.27
ons      0.27
Activation Density: 0.032%
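
To put the density figure in concrete terms, here is a minimal sketch of how such a number could be computed. It assumes that density means the fraction of sampled token positions on which the feature has a nonzero activation; the page does not state its exact definition. The function name `activation_density` and the toy data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (made up): 100,000 token positions, 32 of which fire.
toy = [0.0] * 99_968 + [1.5] * 32
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.032%
```

Under this reading, a density of 0.032% corresponds to roughly 32 active positions per 100,000 tokens, which would be consistent with a sparse feature that fires only on specific comparative or reduction-related wording.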