INDEX
Explanations
mentions of interest or being interested in something
expressions of curiosity or engagement towards various topics or activities
Negative Logits
Token          Logit
Fail           -0.65
UES            -0.63
misunder       -0.62
âĶĢ            -0.62
patriarch      -0.61
welf           -0.60
unts           -0.59
stacked        -0.59
llan           -0.57
ãĥ¼ãĥĨãĤ£      -0.57
Positive Logits
Token          Logit
ãĥĦ            0.77
enough         0.74
therein        0.74
iltr           0.71
inery          0.71
ately          0.70
in             0.67
igent          0.66
illed          0.65
iotics         0.64
Activations Density 0.037%
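
Some token strings in the logit lists (for example "ãĥĦ" or "âĶĢ") look like the printable stand-ins that byte-level BPE tokenizers use for raw bytes. The page does not say which tokenizer produced them, so the sketch below assumes the GPT-2 byte-to-unicode table; the helper names `bytes_to_unicode` and `decode_token` are illustrative and not taken from this page. Under that assumption, the strings can be mapped back to readable text:

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: printable stand-in character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a byte-level token string back into UTF-8 text.

    Raises KeyError if the token contains a character outside the table,
    which would suggest the tokenizer assumption is wrong.
    """
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Incomplete UTF-8 sequences are shown as U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥĦ", "âĶĢ", "ãĥ¼ãĥĨãĤ£", "enough"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

If the assumption holds, the top positive-logit token decodes to the katakana character ツ, the negative-logit token "âĶĢ" to the box-drawing character ─, and "ãĥ¼ãĥĨãĤ£" to ーティ. Plain ASCII tokens such as "enough" pass through unchanged.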