Explanations
as such, I cannot
sentences where the assistant refers to itself and issues safety/refusal disclaimers (e.g., "I am programmed..." / "As such, I cannot...").
Negative Logits
vx                 0.40
\|_{               0.40
Synchronization    0.39
Foreign            0.39
trained            0.39
orr                0.38
untrained          0.38
Synchron           0.38
Glasgow            0.38
რაც                0.38
Positive Logits
obviously          0.45
sadly              0.43
robots             0.42
компью             0.41
lines              0.41
wrists             0.40
actually           0.40
computer           0.40
funny              0.39
rêves              0.39
Activations Density 0.027%