INDEX
Explanations
personal names
mentions of specific people or public figures, particularly on social media
Negative Logits
Token    Logit
''.      -0.87
──       -0.76
".[      -0.76
.).      -0.75
.",      -0.74
]."      -0.74
).[      -0.73
.」      -0.71
.�       -0.71
�        -0.69
Positive Logits
Token       Logit
@           1.11
Jr          1.05
Originally  0.97
congr       0.93
Thanks      0.87
steen       0.87
afort       0.84
why         0.83
Yep         0.83
_           0.82
Activations Density 0.075%