Explanations
expressions of agreement or disagreement
Negative Logits
rego     -0.19
pedo     -0.17
sto      -0.15
onth     -0.15
redd     -0.14
dre      -0.14
룬       -0.14
homme    -0.14
ologi    -0.14
Lei      -0.14
Positive Logits
alon     0.16
izu      0.16
flen     0.15
ATUS     0.14
ALAR     0.14
orate    0.14
lim      0.14
valid    0.14
princ    0.14
opacity  0.14
Activations Density: 0.024%