INDEX
Explanations
instructions or prompts to visit specific websites or take specific actions online
commands or directives to perform specific actions
Negative Logits
rejuven    -0.58
ament      -0.58
ariat      -0.57
ingham     -0.55
IDs        -0.53
ÏĦ         -0.53
achel      -0.53
ORED       -0.53
winner     -0.53
Coul       -0.52
Positive Logits
ahead      1.00
og         0.95
HERE       0.86
verning    0.83
forth      0.82
ogly       0.81
browse     0.81
ethe       0.80
quartered  0.80
ogl        0.76
Activations Density 0.060%