INDEX
Explanations
Mentions of the assistant identifying itself as an AI (self-referential statements about being an AI).
Negative Logits
PREFIX      -0.07
luv         -0.06
Belgian     -0.06
Administr   -0.06
der         -0.06
Franc       -0.06
ně          -0.06
GRE         -0.06
ISC         -0.06
affects     -0.06
Positive Logits
solicit     0.07
AI          0.06
exter       0.06
){↵         0.06
diets       0.06
ysical      0.06
/#{         0.06
McKin       0.06
の子        0.06
()){↵       0.06
Activations Density 0.022%
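
The density figure can be read as the share of sampled token positions on which the feature fires. The sketch below is a minimal illustration under that assumption; the page does not state how the value was computed, and the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature is
    active; the source page does not confirm this definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 2 of which activate the feature.
sample = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.022% would correspond to the feature firing on roughly 2 of every 10,000 tokens.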