INDEX
Explanations
phrases indicating a sense of loss or detachment
Negative Logits
icz       -0.15
785       -0.14
achi      -0.14
arios     -0.14
beit      -0.14
lad       -0.13
itals     -0.13
ÐĴС       -0.13
ustral    -0.13
stances   -0.13
Positive Logits
away      1.83
Away      1.61
away      1.45
Away      1.41
-away     1.34
aways     0.77
weg       0.77
AW        0.57
.aw       0.55
awy       0.48
Activations Density 0.551%