Explanations
prompts encouraging critical thinking and reflection
Negative Logits
arness         -0.15
usercontent    -0.14
Hacker         -0.14
ãĥ¡ãĥ©         -0.14
æĤł            -0.14
.azure         -0.14
ecycle         -0.14
ucker          -0.13
ibaba          -0.13
aran           -0.13
Positive Logits
yourself       0.17
tout           0.16
åIJ§            0.16
ance           0.15
ables          0.15
865            0.14
Yourself       0.14
ZA             0.14
able           0.14
778            0.14
Activations Density 0.067%
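Several token strings in the logit lists above (for example `ãĥ¡ãĥ©` and `æĤł`) look like raw vocabulary entries from a byte-level BPE tokenizer rather than readable text. The sketch below assumes the model uses the GPT-2 style byte-to-character table, which this page does not confirm, and inverts that table to recover the underlying UTF-8 text. The names `bytes_to_unicode`, `UNICODE_TO_BYTE`, and `decode_token` are illustrative and are not part of the dashboard.

```python
def bytes_to_unicode() -> dict:
    """Byte -> printable-character table, as used by GPT-2 style byte-level BPE.

    Assumption: the tokenizer behind this dashboard follows this convention.
    """
    # Bytes that are already printable map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte is assigned the next code point from U+0100 upward.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table


# Inverse table: vocabulary character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Map a vocabulary string back to text; return it unchanged if it
    contains a character outside the byte table."""
    if any(ch not in UNICODE_TO_BYTE for ch in token):
        return token
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A token may hold only part of a multi-byte character, so invalid
    # sequences are replaced instead of raising an error.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥ¡ãĥ©", "æĤł", "yourself"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, `ãĥ¡ãĥ©` decodes to `メラ` and `æĤł` to `悠`, while plain ASCII tokens such as `yourself` come back unchanged.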