Explanations
phrases indicating skepticism or denial
Negative Logits
Edge            -0.65
Less            -0.62
ÅĤ              -0.61
skept           -0.60
Vers            -0.58
Unfortunately   -0.58
wan             -0.58
edge            -0.57
advertisement   -0.57
Unfortunately   -0.57
Positive Logits
nor             0.88
existent        0.84
anywhere        0.78
plete           0.76
whatsoever      0.76
slightest       0.73
anymore         0.72
etheless        0.72
necessarily     0.70
anything        0.70
Activations Density: 0.260%