INDEX
Explanations
questions and expressions of curiosity
Negative Logits
autorytatywna    -0.92
<unused41>       -0.86
<unused79>       -0.86
<unused16>       -0.86
<unused23>       -0.86
<unused28>       -0.86
<unused43>       -0.86
<unused8>        -0.86
[@BOS@]          -0.86
<unused3>        -0.86
Positive Logits
WHAT     0.46
what     0.44
waste    0.43
what     0.41
perd     0.40
WHAT     0.39
What     0.37
WTF      0.36
wrong    0.36
What     0.36
Activations Density 0.028%