INDEX
Explanations
references to personal thoughts, feelings, and experiences
NEGATIVE LOGITS
ilog        -0.16
reck        -0.15
wondered    -0.15
suy         -0.14
duk         -0.14
VICE        -0.14
wonder      -0.14
opus        -0.14
_try        -0.14
ضا          -0.14
POSITIVE LOGITS
mentioned   0.24
Mention     0.22
mention     0.21
mentioned   0.21
mentioning  0.21
mention     0.20
disclaimer  0.18
mentions    0.18
touched     0.17
claimer     0.16
Activations Density: 0.148%