INDEX
Explanations
phrases that express negativity or extreme criticism
Negative Logits
441         -0.17
ized        -0.16
tron        -0.15
üçük        -0.15
hunt        -0.15
oria        -0.15
otate       -0.15
ρίου        -0.14
_maximum    -0.14
ect         -0.14
Positive Logits
ger         0.22
-case       0.22
Worse       0.21
worst       0.20
worse       0.20
Worst       0.16
худ         0.16
ening       0.15
scenario    0.15
luck        0.15
Activations Density 0.020%
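An activation density such as the 0.020% above is commonly read as the fraction of sampled tokens on which the feature fires. The sketch below assumes that definition; the function name, the zero threshold, and the sample data are hypothetical and do not come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 tokens, 2 of which activate the feature.
sample = [0.0] * 9998 + [1.3, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # 2 / 10000 tokens fire
```

Under this reading, a density of 0.020% would correspond to roughly 2 firing tokens per 10,000 sampled.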