INDEX
Explanations
sentences expressing personal opinions or emotional reflections
NEGATIVE LOGITS
Token         Logit
Luckily       -0.17
fortunately   -0.15
thankfully    -0.15
ãĥ³ãĤº        -0.15
Luckily       -0.14
Typed         -0.14
Dummy         -0.14
enerator      -0.14
luckily       -0.14
_compat       -0.13
POSITIVE LOGITS
Token         Logit
truth         0.79
truth         0.62
honest        0.57
Truth         0.56
Truth         0.54
honestly      0.53
_truth        0.48
honesty       0.47
verdad        0.45
truthful      0.43
Activations Density 0.333%
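
The density figure is presumably the fraction of sampled tokens on which this feature fires; 0.333% is roughly 1 token in 300. A minimal sketch under that assumption follows. The function name, the zero threshold, and the toy activations are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly above `threshold`.

    Assumption: "density" means the share of tokens where the feature
    is active. The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy data (made up for illustration): 1 firing token out of 300.
toy = [0.0] * 299 + [1.7]
density = activation_density(toy)
print(f"{density:.3%}")  # prints "0.333%"
```

A different threshold or sampling scheme would change the number, so the sketch only illustrates the scale of the reported value.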