INDEX
Explanations
details that are formatted as bullet points or subheadings within a longer text
bullet points or lists in the text
Negative Logits
udic     -0.81
othal    -0.79
erer     -0.74
aults    -0.67
ierre    -0.67
uve      -0.66
ERC      -0.63
aughter  -0.62
olyn     -0.61
enthal   -0.60
Positive Logits
··       1.25
••       0.82
¼        0.76
¾        0.72
Joined   0.71
••••     0.71
ting     0.70
thia     0.70
µ        0.68
lat      0.67
Activations Density 0.014%