Explanations
comparisons and evaluations related to quantity
Negative Logits
Token            Logit
onut             -0.77
bons             -0.72
licts            -0.71
ividual          -0.71
network          -0.71
ggles            -0.69
prus             -0.67
ourn             -0.67
neys             -0.67
oké              -0.65
Positive Logits
Token            Logit
understatement   1.07
description      0.84
exaggeration     0.83
bullshit         0.83
rhetorical       0.79
untrue           0.79
reassuring       0.79
explanation      0.78
characterization 0.78
conjecture       0.77
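The logit lists above rank vocabulary tokens by how strongly this feature pushes their output logits up or down. A common way to obtain such lists, assumed here rather than confirmed for this dashboard, is to take the feature's decoder direction, dot it with each token's unembedding vector, and keep the k lowest and k highest scores. The sketch below uses plain Python; the function name logit_effects and the toy vectors are hypothetical and are not this feature's real weights. It ignores details a real pipeline might add, such as final layer normalization or decoder-norm scaling.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]],
                  k: int = 10) -> Tuple[Ranked, Ranked]:
    """Hypothetical helper: score each token by the dot product of the
    feature's decoder direction with that token's unembedding vector,
    then return (k most negative, k most positive) token/score pairs."""
    scores = {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1])  # ascending
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


if __name__ == "__main__":
    # Toy 2-dimensional unembeddings (made-up values, for illustration only).
    toy_unembed = {
        "understatement": [1.0, 0.5],
        "description": [0.8, 0.2],
        "onut": [-0.7, -0.1],
        "bons": [-0.6, -0.2],
    }
    neg, pos = logit_effects([1.0, 0.1], toy_unembed, k=2)
    print("negative:", neg)
    print("positive:", pos)
```

Sorting once and slicing both ends keeps the example short; for a full vocabulary, a partial selection such as heapq.nsmallest and heapq.nlargest would avoid sorting every token.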
Activation Density: 0.186%
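Activation density is usually read as the percentage of sampled tokens on which the feature fires, so 0.186% means roughly 186 firing tokens per 100,000, or about one token in 538. The sketch below assumes that definition, with "fires" meaning an activation strictly above a threshold of 0. The function name activation_density and the threshold parameter are hypothetical, and the dashboard's exact sampling and threshold are not stated.

```python
from typing import Iterable


def activation_density(activations: Iterable[float],
                       threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose feature activation
    is strictly greater than `threshold`."""
    total = 0
    firing = 0
    for a in activations:
        total += 1
        if a > threshold:
            firing += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * firing / total


if __name__ == "__main__":
    # Synthetic data: 186 nonzero activations out of 100,000 tokens.
    acts = [0.0] * 99_814 + [1.2] * 186
    print(f"{activation_density(acts):.3f}%")
```

Streaming over an iterable keeps memory constant, which matters when density is estimated over millions of tokens.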