Explanations
mentions of negative outcomes, specifically losses
occurrences of the word "loss" in various contexts
Negative Logits
Token       Logit
Instruct    -0.67
indo        -0.66
Surve       -0.66
omet        -0.65
imaru       -0.65
commun      -0.64
Instruct    -0.63
JB          -0.63
ç«          -0.62
ulhu        -0.62
Positive Logits
Token       Logit
loss        3.76
Loss        2.94
loss        2.91
losses      2.62
losing      1.74
lose        1.60
lost        1.55
defeat      1.46
loses       1.42
setback     1.40
Activations Density: 0.019%