INDEX
Explanations
statements or judgments of correctness
statements about accuracy or correctness
Negative Logits
aden          -0.74
GGGGGGGG      -0.74
Valhalla      -0.72
EMOTE         -0.72
CHO           -0.66
neys          -0.63
ILY           -0.63
thin          -0.62
doms          -0.62
Connector     -0.61
Positive Logits
ives          0.95
eous          0.86
fully         0.85
ibly          0.84
guiActiveUn   0.80
answers       0.78
ible          0.77
aber          0.75
translations  0.73
correct       0.73
Activations Density 0.015%