Explanations
phrases related to caution or warning
references to authority figures or experts discussing policy or situations
Negative Logits
surprisingly  -0.72
Slate  -0.60
!:  -0.60
Reborn  -0.59
Eater  -0.57
ネ  -0.57
Yon  -0.56
Edited  -0.56
Edit  -0.55
echoed  -0.55
Positive Logits
)."  1.49
..."  1.39
),"  1.33
',"  1.30
,'"  1.30
)",  1.26
…"  1.25
."  1.25
.'"  1.25
.""  1.23
Activations Density: 2.127%