Explanations
unwanted or contentious mentions, potentially involving political and legal matters
phrases indicating significant events or emergencies
Negative Logits
ij士          -0.76
½             -0.70
ichick        -0.68
²             -0.67
rer           -0.67
ķ             -0.67
ーティ        -0.66
Ĥ             -0.65
ケ            -0.64
_.            -0.64
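Several negative-logit tokens (e.g. Ĥ, ķ) are byte-level BPE strings rather than readable text. Assuming the model uses a GPT-2-style byte-level tokenizer (the source does not name the model), such strings can be decoded by inverting the byte-to-unicode table from OpenAI's GPT-2 encoder. The sketch below reimplements that table; `decode_token` and `BYTE_DECODER` are hypothetical names. Tokens that carry only part of a multi-byte UTF-8 character cannot be decoded on their own and come out as the replacement character.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2 byte -> printable-character table (assumed tokenizer)."""
    # Printable bytes map to the character with the same code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # The remaining bytes (space, control bytes, etc.) are shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Hypothetical helper: inverse table, printable character -> raw byte.
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a byte-level BPE token string back into UTF-8 text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    # Partial UTF-8 sequences (e.g. a lone continuation byte) become U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĥ¼ãĥĨãĤ£"))  # ーティ
print(decode_token("ãĤ±"))        # ケ
print(decode_token("ĠBritain"))   # " Britain" (Ġ encodes a leading space)
```

Under this assumption, single-character tokens such as Ĥ and ķ each hold one UTF-8 continuation byte, which is why they cannot be shown as readable text.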
Positive Logits
Scientists     0.79
Britain        0.73
Britain        0.69
BBC            0.69
ccording       0.69
Researchers    0.68
Sorry          0.68
Analysis       0.68
labour         0.67
IMAGES         0.67
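The source does not say how these logit values are computed. A common approach for sparse-autoencoder feature dashboards, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and list the most positive and most negative entries. The minimal pure-Python sketch below shows that idea on a toy two-dimensional model; `logit_effects` and all numbers in it are hypothetical, and real dashboards may also apply normalization that this sketch omits.

```python
import heapq


def logit_effects(decoder_vec, W_U, vocab, k=3):
    """Project a feature direction through the unembedding (assumed method).

    decoder_vec: length d_model list (the feature's decoder direction)
    W_U:         d_model x n_vocab nested list (unembedding matrix)
    vocab:       n_vocab token strings
    Returns the k most positive and k most negative (token, effect) pairs.
    """
    d_model = len(decoder_vec)
    # effect[j] = sum_i decoder_vec[i] * W_U[i][j]
    effects = [
        sum(decoder_vec[i] * W_U[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    pairs = list(zip(vocab, effects))
    top = heapq.nlargest(k, pairs, key=lambda p: p[1])
    bottom = heapq.nsmallest(k, pairs, key=lambda p: p[1])
    return top, bottom


# Toy example with made-up values.
vocab = ["Scientists", "ichick", "BBC"]
W_U = [[0.2, -0.1, 0.4],
       [0.0, 0.6, -0.2]]
top, bottom = logit_effects([1.0, -0.5], W_U, vocab, k=1)
print(top, bottom)
```

In the toy example, "BBC" receives the largest positive effect (about 0.5) and "ichick" the most negative (about -0.4).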
Activation Density: 0.044%
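Activation density is usually read as the fraction of sampled tokens on which the feature fires (activation above zero); that definition is an assumption, since the page does not state it. At 0.044%, that corresponds to roughly 44 firing tokens per 100,000. A minimal sketch, with `activation_density` as a hypothetical helper and synthetic data:

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Synthetic data: 44 nonzero activations among 100,000 tokens.
acts = [0.0] * 100_000
for i in range(44):
    acts[i * 1000] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")  # 0.044%
```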