INDEX
Explanations
words related to revealing information or uncovering secrets
Negative Logits
oslav         -0.71
otor          -0.70
stead         -0.67
hovah         -0.67
creation      -0.66
oday          -0.66
compensate    -0.65
acqu          -0.64
atomic        -0.64
upiter        -0.64
Positive Logits
secrets        1.00
loopholes      0.98
truths         0.93
ibility        0.86
orial          0.83
flaws          0.81
clues          0.80
weaknesses     0.80
details        0.79
revelations    0.79
Activations Density: 0.067%
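
To give a sense of scale for the density figure, here is a minimal sketch. It assumes that "activations density" means the fraction of sampled tokens on which the feature is active at all; the page does not define the term, so treat that reading as an assumption. The helper name expected_active_tokens and the 100,000-token sample size are hypothetical and chosen only for illustration.

```python
def expected_active_tokens(density_percent: float, n_tokens: int) -> float:
    """Convert a density given in percent into an expected count of active tokens.

    Assumption: density is the share of tokens with a nonzero activation.
    """
    return density_percent / 100.0 * n_tokens


# Density value taken from the page; the sample size is a hypothetical example.
density_percent = 0.067
expected = expected_active_tokens(density_percent, 100_000)
print(f"{expected:.1f} of 100,000 tokens")
```

Under that reading, the feature would fire on roughly 67 of every 100,000 tokens, which is consistent with a sparse feature tied to a narrow topic.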