INDEX
Explanations
instructions or suggestions for taking specific actions
phrases emphasizing the necessity of verifying or consulting information
Negative Logits
gery        -0.73
hum         -0.71
alf         -0.70
rock        -0.69
bled        -0.68
folk        -0.67
bern        -0.67
MpServer    -0.66
pher        -0.66
OT          -0.66
Positive Logits
icio        0.78
Thrones     0.70
patience    0.67
compr       0.67
beforehand  0.65
Titus       0.64
ilus        0.62
ppo         0.62
clicking    0.62
Siren       0.61
Activations Density 0.037%