Explanations
phrases that indicate reasoning or justification
Negative Logits
allas       -0.16
Josh        -0.15
esto        -0.14
mond        -0.14
lication    -0.14
Ske         -0.14
Mind        -0.14
uche        -0.14
beef        -0.13
Josh        -0.13
Positive Logits
озем            0.16
ikon            0.15
ihad            0.15
ovny            0.14
ople            0.14
/Instruction    0.14
apolis          0.14
ãĥ³ãĥĹ          0.14
.partial        0.14
ssel            0.14
Activations Density 0.146%