Explanations
contrast with unexpected description
Negative Logits
notoriously       0.49
famously          0.48
redos             0.44
arabangsa         0.43
psychologically   0.41
价值观            0.41
認為              0.40
にとって          0.40
inspired          0.39
inherently        0.39
Positive Logits
strange           0.83
strangely         0.82
似乎              0.80
oddly             0.75
seemed            0.72
étr               0.70
seeming           0.70
стран             0.66
unfamiliar        0.64
whitish           0.64
Activations Density: 0.065%