Explanations
questions, tips, answers, and instructional prompts
structured query and response formats in text
Negative Logits
disg         -0.76
cale         -0.70
ité          -0.68
undai        -0.67
aps          -0.64
creen        -0.63
indiscrim    -0.63
glac         -0.62
paces        -0.61
ides         -0.60
Positive Logits
#            0.99
Number       0.98
Yourself     0.96
Summary      0.93
Explan       0.89
Description  0.86
!:           0.86
Regarding    0.85
Abuse        0.84
:            0.84
Activations Density 0.200%