Commit 6350ada

MTBench: Added more data to leaderboard and revised content
1 parent 14f92a9 commit 6350ada

5 files changed

Lines changed: 125 additions & 67 deletions

app/projects/mtbench/data/data.json

Lines changed: 0 additions & 26 deletions
This file was deleted.

app/projects/mtbench/page.mdx

Lines changed: 5 additions & 28 deletions
Original file line number | Diff line number | Diff line change
@@ -18,37 +18,14 @@ import Table from '@/components/table'
1818

1919
## Introduction
2020

21+
News influences the world around us—from stock markets reacting to financial reports to temperature trends following extreme weather events. However, understanding this impact is not straightforward. While AI models are improving at handling both text and numbers, most datasets fail to test how well they connect these different types of data.
2122

22-
Figuring out how news articles influence changes in time-series data (like stock prices or weather trends) is an important but still under-explored area in applied data science. While AI models that handle multiple types of data (like text and numbers) are becoming more popular, most existing datasets don’t do a great job of testing how well these models can connect information across different formats.
23+
To address this, we introduce **MTBench** (**M**ultimodal **T**ime Series **Bench**mark), a dataset designed to evaluate how well AI models understand the relationship between text and time-series data. MTBench pairs financial news with stock market movements and weather reports with historical temperature changes. Unlike existing benchmarks that focus on text or numbers separately, MTBench challenges models to analyze both together, helping to assess their ability to detect trends, interpret news, and make predictions.
2324

24-
To address this challenge, we introduce **MTBench** (**M**ultimodal **T**ime Series **Bench**mark), a large-scale dataset designed to test how well large language models (LLMs) understand both time-series data and text in financial and weather-related contexts. MTBench pairs numerical data with relevant text—for example, financial news linked to stock price changes and weather reports aligned with historical temperature trends. Unlike existing benchmarks that focus on either text or numbers separately, MTBench challenges models to analyze both together, helping assess their ability to recognize patterns, make predictions, and answer questions based on cross-modal reasoning. This enables a wide range of tasks, such as forecasting trends, interpreting news impact on data, and extracting meaningful insights from both structured (numerical) and unstructured (textual) information.
25+
- **Finance**: 200K+ news articles with stock movements from 2021–2023.
26+
- **Weather**: Historical temperature trends covering nearly two decades with reports of extreme events.
2527

26-
We test the latest large language models (LLMs) on MTBench to see how well they understand the connection between news stories and time-based trends. Our results highlight major challenges—these models struggle to recognize long-term patterns, understand cause-and-effect relationships in financial and weather data, and seamlessly combine insights from both text and numerical information.
27-
28-
## Dataset Collection
29-
30-
### Finance Dataset
31-
![Figure 1. The pipeline of finance dataset collection.|scale=0.5](./assets/fin_data_collection_pipeline.png)
32-
33-
The process of collecting the finance dataset is illustrated in Figure 1. To build a diverse dataset linking stock market news with time-series data, we gathered over 200,000 financial news article URLs from reputable sources such as GlobeNews, MarketWatch, SeekingAlpha, Zacks, Invezz, Quartz (QZ), PennyStocks, and Benzinga, spanning May 2021 to September 2023. We then scraped key details from these articles, including text content, titles, stock names, and publication dates. From this collection, we selected a subset of 20,000 news articles, ensuring a balanced distribution of article lengths.
34-
35-
To enhance the dataset with structured metadata, we used GPT-4o to annotate each article with its content type, the time range of its impact, and sentiment analysis.
36-
37-
**Stock Time-Series Collection.**
38-
39-
For each financial news article, we identified relevant stock time-series data based on the extracted sentiment and stock name. We retrieved historical stock prices with opening values sampled at different time scales. To maintain high data quality, we excluded cases where stock price data was missing for more than 70% of the timeframe, often due to market closures (e.g., weekends, holidays).
40-
41-
To align news articles with stock trends, we assumed that each article corresponds to the 90th percentile of its input time-series window. We created two forecasting scenarios:
42-
**Short-Term Prediction**: Using 7 days of stock price data at a 5-minute resolution to predict price movements for the next day.
43-
**Long-Term Prediction**: Using 30 days of stock price data at a 1-hour resolution to forecast stock movements for the following 7 days.
44-
45-
### Weather Dataset
46-
47-
We selected 50 airports across the United States as data sources, using the Global Historical Climatology Network Hourly (GHCN-H) dataset. The data, which spans from 2003 to 2020, is collected hourly. Each weather station records various attributes, including geographic location, temperature, humidity, wind speed, wind direction, visibility, pressure, and precipitation. Airports were chosen because their weather data is generally more reliable and accurate than data from other stations. In this study, we focus on temperature as the primary parameter, since it is a key factor in weather forecasting. However, our raw data pipeline includes additional channels, allowing for future expansion into multi-channel weather analysis.
48-
49-
Unlike stock price data, systematically collecting weather-related news is challenging, as routine reports often lack the context needed for complex analysis. To overcome this, we use the Storm Events Database, which records storm occurrences in the United States from 1950 to 2020. This dataset includes details such as storm type, location, fatalities, and injuries, covering various severe weather conditions, including hail, tornadoes, thunderstorms, floods, hurricanes, and typhoons.
50-
51-
Each entry in the database contains an <em>event ID</em>, which uniquely identifies an occurrence, and an <em>episode ID</em>, which links related events. For example, a hurricane might trigger multiple tornadoes, hailstorms, and thunderstorms, all grouped under the same episode ID. Each event also includes a textual description, offering valuable contextual information.
28+
We evaluate state-of-the-art large language models (LLMs) on MTBench to measure their ability to link news with data trends (see our **Leaderboard**). The results reveal key challenges—models struggle with long-term pattern recognition, cause-and-effect relationships, and seamlessly combining insights from text and numbers.
5229

5330
## Leaderboard
5431

components/sortable-table.tsx

Lines changed: 91 additions & 3 deletions
@@ -1,6 +1,5 @@
11
'use client'
22
import { MouseEventHandler, useCallback, useState } from "react";
3-
// import data from "../app/projects/mtbench/data/data.json";
43
import data from "../app/projects/mtbench/data/data_leaderboard.json";
54

65
type Data = typeof data;
@@ -21,6 +20,13 @@ function sortData({
2120
if (!sortKey) return tableData;
2221

2322
const sortedData = data.sort((a, b) => {
23+
// nulls sort after anything else
24+
if (a[sortKey] === null) {
25+
return 1;
26+
}
27+
if (b[sortKey] === null) {
28+
return -1;
29+
}
2430
return a[sortKey] > b[sortKey] ? 1 : -1;
2531
});
2632
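
The null checks added in this hunk return `1` whenever `a[sortKey]` is null, including when both values are null, and `sort` reorders the array it is called on. A minimal sketch of a comparator that treats two nulls as equal and sorts a copy might look like the following. The `Row` type, the `sortRows` name, and the `order` parameter are assumptions for illustration and are not part of this commit.

```typescript
// Hypothetical row shape; the real rows come from data_leaderboard.json.
type Row = Record<string, string | number | null>;

type SortOrder = "asc" | "desc";

// Returns a sorted copy of `rows`. Nulls are placed last in both directions.
function sortRows(rows: Row[], sortKey: string, order: SortOrder = "asc"): Row[] {
  const direction = order === "asc" ? 1 : -1;
  // Copy first so the caller's array is not mutated by sort().
  return [...rows].sort((a, b) => {
    const x = a[sortKey];
    const y = b[sortKey];
    // Treat missing keys the same as nulls.
    const xMissing = x === null || x === undefined;
    const yMissing = y === null || y === undefined;
    if (xMissing && yMissing) return 0; // two nulls compare as equal
    if (xMissing) return 1;             // null sorts after any value
    if (yMissing) return -1;
    if (typeof x === "number" && typeof y === "number") {
      return (x - y) * direction;
    }
    // Fall back to string comparison for non-numeric values such as model names.
    return String(x).localeCompare(String(y)) * direction;
  });
}
```

Returning `0` for two nulls keeps the comparator consistent, which `Array.prototype.sort` expects, and copying the array avoids mutating data that may be held in props or state.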

@@ -62,7 +68,7 @@ function SortableTable({ data }: { data: Data }) {
6268

6369
const headers: { key: SortKeys; label: string }[] = [
6470
{ key: "model_name", label: "Model" },
65-
{ key: "stock_price_forecast_7_day_mae_ts", label: "Stock price predict. for 7 days under TS (MAE)" },
71+
{ key: "stock_price_forecast_7_day_mae_ts", label: "Stock price predict. \n for 7 days under TS (MAE)" },
6672
{ key: "stock_price_forecast_7_day_mae_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAE)" },
6773
{ key: "stock_price_forecast_7_day_mape_ts", label: "Stock price predict. for 7 days under TS (MAPE)" },
6874
{ key: "stock_price_forecast_7_day_mape_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAPE)" },
@@ -78,6 +84,47 @@ function SortableTable({ data }: { data: Data }) {
7884
{ key: "temp_forecast_14_day_mse_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MSE)" },
7985
{ key: "temp_forecast_14_day_mae_ts", label: "Temp. predict. for 14 days under TS (MAE)" },
8086
{ key: "temp_forecast_14_day_mae_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MAE)" },
87+
{ key: "stock_trend_predict_acc_7_day_3_way_ts", label: "Stock trend predict. for 7 days 3-way under TS (Acc)"},
88+
{ key: "stock_trend_predict_acc_7_day_3_way_ts_w_text", label: "Stock trend predict. for 7 days 3-way under TS+Text (Acc)"},
89+
{ key: "stock_trend_predict_acc_7_day_5_way_ts", label: "Stock trend predict. for 7 days 5-way under TS (Acc)"},
90+
{ key: "stock_trend_predict_acc_7_day_5_way_ts_w_text", label: "Stock trend predict. for 7 days 5-way under TS+Text (Acc)"},
91+
{ key: "stock_trend_predict_acc_30_day_3_way_ts", label: "Stock trend predict. for 30 days 3-way under TS (Acc)"},
92+
{ key: "stock_trend_predict_acc_30_day_3_way_ts_w_text", label: "Stock trend predict. for 30 days 3-way under TS+Text (Acc)"},
93+
{ key: "stock_trend_predict_acc_30_day_5_way_ts", label: "Stock trend predict. for 30 days 5-way under TS (Acc)"},
94+
{ key: "stock_trend_predict_acc_30_day_5_way_ts_w_text", label: "Stock trend predict. for 30 days 5-way under TS+Text (Acc)"},
95+
{ key: "temp_trend_predict_acc_past_ts", label: "Temp. trend predict. past under TS (Acc)"},
96+
{ key: "temp_trend_predict_acc_past_ts_w_text", label: "Temp. trend predict. past under TS+Text (Acc)"},
97+
{ key: "temp_trend_predict_acc_future_ts", label: "Temp. trend predict. future under TS (Acc)"},
98+
{ key: "temp_trend_predict_acc_future_ts_w_text", label: "Temp. trend predict. future under TS+Text (Acc)"},
99+
{ key: "stock_indicator_predict_mse_7_day_macd_ts", label: "MACD predict. for 7 days under TS (MSE)"},
100+
{ key: "stock_indicator_predict_mse_7_day_macd_ts_w_text", label: "MACD predict. for 7 days under TS+Text (MSE)"},
101+
{ key: "stock_indicator_predict_mse_7_day_bb_ts", label: "Bollinger Bands predict. for 7 days under TS (MSE)"},
102+
{ key: "stock_indicator_predict_mse_7_day_bb_ts_w_text", label: "Bollinger Bands predict. for 7 days under TS+Text (MSE)"},
103+
{ key: "stock_indicator_predict_mse_30_day_macd_ts", label: "MACD predict. for 30 days under TS (MSE)"},
104+
{ key: "stock_indicator_predict_mse_30_day_macd_ts_w_text", label: "MACD predict. for 30 days under TS+Text (MSE)"},
105+
{ key: "stock_indicator_predict_mse_30_day_bb_ts", label: "Bollinger Bands predict. for 30 days under TS (MSE)"},
106+
{ key: "stock_indicator_predict_mse_30_day_bb_ts_w_text", label: "Bollinger Bands predict. for 30 days under TS+Text (MSE)"},
107+
{ key: "temp_predict_max_mse_ts", label: "Temp. predict. max under TS (MSE)"},
108+
{ key: "temp_predict_max_mse_ts_w_text", label: "Temp. predict. max under TS+Text (MSE)"},
109+
{ key: "temp_predict_max_mae_ts", label: "Temp. predict. max under TS (MAE)"},
110+
{ key: "temp_predict_max_mae_ts_w_text", label: "Temp. predict. max under TS+Text (MAE)"},
111+
{ key: "temp_predict_min_mse_ts", label: "Temp. predict. min under TS (MSE)"},
112+
{ key: "temp_predict_min_mse_ts_w_text", label: "Temp. predict. min under TS+Text (MSE)"},
113+
{ key: "temp_predict_min_mae_ts", label: "Temp. predict. min under TS (MAE)"},
114+
{ key: "temp_predict_min_mae_ts_w_text", label: "Temp. predict. min under TS+Text (MAE)"},
115+
{ key: "temp_predict_diff_mse_ts", label: "Temp. predict. diff. under TS (MSE)"},
116+
{ key: "temp_predict_diff_mse_ts_w_text", label: "Temp. predict. diff. under TS+Text (MSE)"},
117+
{ key: "temp_predict_diff_mae_ts", label: "Temp. predict. diff. under TS (MAE)"},
118+
{ key: "temp_predict_diff_mae_ts_w_text", label: "Temp. predict. diff. under TS+Text (MAE)"},
119+
{ key: "news_stock_corr_acc_7_day_3_way", label: "News stock corr. for 7 days 3-way (Acc)"},
120+
{ key: "news_stock_corr_acc_7_day_5_way", label: "News stock corr. for 7 days 5-way (Acc)"},
121+
{ key: "news_stock_corr_acc_30_day_3_way", label: "News stock corr. for 30 days 3-way (Acc)"},
122+
{ key: "news_stock_corr_acc_30_day_5_way", label: "News stock corr. for 30 days 5-way (Acc)"},
123+
{ key: "news_driven_mcqa_acc_7_day_fin", label: "News driven MCQA for 7 days for Finance data (Acc)"},
124+
{ key: "news_driven_mcqa_acc_7_day_weather", label: "News driven MCQA for 7 days for Weather data (Acc)"},
125+
{ key: "news_driven_mcqa_acc_30_day_fin", label: "News driven MCQA for 30 days for Finance data (Acc)"},
126+
{ key: "news_driven_mcqa_acc_30_day_weather", label: "News driven MCQA for 30 days for Weather data (Acc)"}
127+
81128
];
82129

83130
const sortedData = useCallback(
@@ -117,7 +164,7 @@ function SortableTable({ data }: { data: Data }) {
117164
{sortedData().map((model) => {
118165
return (
119166
<tr key={model.model_name}>
120-
<td>{model.model_name}</td>
167+
<td className="headcol">{model.model_name}</td>
121168
<td>{model.stock_price_forecast_7_day_mae_ts}</td>
122169
<td>{model.stock_price_forecast_7_day_mae_ts_w_text}</td>
123170
<td>{model.stock_price_forecast_7_day_mape_ts}</td>
@@ -134,6 +181,47 @@ function SortableTable({ data }: { data: Data }) {
134181
<td>{model.temp_forecast_14_day_mse_ts_w_text}</td>
135182
<td>{model.temp_forecast_14_day_mae_ts}</td>
136183
<td>{model.temp_forecast_14_day_mae_ts_w_text}</td>
184+
<td>{model.stock_trend_predict_acc_7_day_3_way_ts}</td>
185+
<td>{model.stock_trend_predict_acc_7_day_3_way_ts_w_text}</td>
186+
<td>{model.stock_trend_predict_acc_7_day_5_way_ts}</td>
187+
<td>{model.stock_trend_predict_acc_7_day_5_way_ts_w_text}</td>
188+
<td>{model.stock_trend_predict_acc_30_day_3_way_ts}</td>
189+
<td>{model.stock_trend_predict_acc_30_day_3_way_ts_w_text}</td>
190+
<td>{model.stock_trend_predict_acc_30_day_5_way_ts}</td>
191+
<td>{model.stock_trend_predict_acc_30_day_5_way_ts_w_text}</td>
192+
<td>{model.temp_trend_predict_acc_past_ts}</td>
193+
<td>{model.temp_trend_predict_acc_past_ts_w_text}</td>
194+
<td>{model.temp_trend_predict_acc_future_ts}</td>
195+
<td>{model.temp_trend_predict_acc_future_ts_w_text}</td>
196+
<td>{model.stock_indicator_predict_mse_7_day_macd_ts}</td>
197+
<td>{model.stock_indicator_predict_mse_7_day_macd_ts_w_text}</td>
198+
<td>{model.stock_indicator_predict_mse_7_day_bb_ts}</td>
199+
<td>{model.stock_indicator_predict_mse_7_day_bb_ts_w_text}</td>
200+
<td>{model.stock_indicator_predict_mse_30_day_macd_ts}</td>
201+
<td>{model.stock_indicator_predict_mse_30_day_macd_ts_w_text}</td>
202+
<td>{model.stock_indicator_predict_mse_30_day_bb_ts}</td>
203+
<td>{model.stock_indicator_predict_mse_30_day_bb_ts_w_text}</td>
204+
<td>{model.temp_predict_max_mse_ts}</td>
205+
<td>{model.temp_predict_max_mse_ts_w_text}</td>
206+
<td>{model.temp_predict_max_mae_ts}</td>
207+
<td>{model.temp_predict_max_mae_ts_w_text}</td>
208+
<td>{model.temp_predict_min_mse_ts}</td>
209+
<td>{model.temp_predict_min_mse_ts_w_text}</td>
210+
<td>{model.temp_predict_min_mae_ts}</td>
211+
<td>{model.temp_predict_min_mae_ts_w_text}</td>
212+
<td>{model.temp_predict_diff_mse_ts}</td>
213+
<td>{model.temp_predict_diff_mse_ts_w_text}</td>
214+
<td>{model.temp_predict_diff_mae_ts}</td>
215+
<td>{model.temp_predict_diff_mae_ts_w_text}</td>
216+
<td>{model.news_stock_corr_acc_7_day_3_way}</td>
217+
<td>{model.news_stock_corr_acc_7_day_5_way}</td>
218+
<td>{model.news_stock_corr_acc_30_day_3_way}</td>
219+
<td>{model.news_stock_corr_acc_30_day_5_way}</td>
220+
<td>{model.news_driven_mcqa_acc_7_day_fin}</td>
221+
<td>{model.news_driven_mcqa_acc_7_day_weather}</td>
222+
<td>{model.news_driven_mcqa_acc_30_day_fin}</td>
223+
<td>{model.news_driven_mcqa_acc_30_day_weather}</td>
224+
137225
</tr>
138226
);
139227
})}
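
This commit lists each metric twice: once in the `headers` array and once as a `<td>` in the row body. The two lists must stay in the same order by hand. One possible alternative, sketched below, derives the cell values from the header keys so the order is defined in one place. `LeaderboardRow`, `Header`, `cellValues`, and the `placeholder` parameter are hypothetical names, and only three of the keys from the diff are reproduced here.

```typescript
// Hypothetical row shape; the real rows come from data_leaderboard.json.
type LeaderboardRow = Record<string, string | number | null>;

interface Header {
  key: string;
  label: string;
}

// A short excerpt of the header list from the diff, for illustration only.
const headers: Header[] = [
  { key: "model_name", label: "Model" },
  { key: "stock_price_forecast_7_day_mae_ts", label: "Stock price predict. for 7 days under TS (MAE)" },
  { key: "news_stock_corr_acc_7_day_3_way", label: "News stock corr. for 7 days 3-way (Acc)" },
];

// Returns one display string per column, in header order.
// Null or missing values become `placeholder` (empty by default).
function cellValues(
  model: LeaderboardRow,
  columns: Header[],
  placeholder: string = ""
): string[] {
  return columns.map(({ key }) => {
    const value = model[key];
    return value == null ? placeholder : String(value);
  });
}

// In the component, the row body could then be written roughly as:
//   <tr key={model.model_name}>
//     {headers.map((h) => <td key={h.key}>{model[h.key]}</td>)}
//   </tr>
```

With this shape, adding a metric means adding one header entry, and the `headcol` class from this commit could be applied to whichever column has the key `model_name`.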

components/table.tsx

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ import "../styles/table.css";
55

66
function Table() {
77
return (
8-
<div className="Table">
8+
<div className="table-wrapper">
99
<SortableTable data={data} />
1010
</div>
1111
);

styles/table.css

Lines changed: 28 additions & 9 deletions
@@ -1,45 +1,64 @@
11
table {
22
width: 100%;
3-
table-layout: fixed;
43
border: 1px solid;
5-
overflow-x: auto;
6-
max-width: fit-content;
4+
border-spacing: 0px;
5+
border-collapse: separate;
6+
min-width: max-content;
7+
table-layout: fixed;
78
}
89

910
tbody {
1011
width: 100%;
1112
border: 1px solid rgba(255, 255, 255, 0.3);
1213
}
1314

15+
th {
16+
position: sticky;
17+
}
18+
1419
tr {
1520
width: 100%;
1621
}
1722

23+
tr:hover {
24+
background-color: rgba(84, 83, 83, 0.7);
25+
}
26+
1827
th,
1928
td {
2029
border-bottom: 1px solid #ddd;
21-
/* padding: 10px 8px; */
2230
text-align: left;
2331
font-weight: 500;
2432
font-size: 12px;
25-
/* border: 1px solid; */
26-
/* color: black; */
33+
width: 100px;
34+
overflow: hidden;
2735
}
28-
tr:hover {background-color: rgba(84, 83, 83, 0.7);}
36+
2937
.sort-button {
3038
background-color: transparent;
3139
border: none;
3240

3341
padding: 5px 10px;
3442
margin: 0;
3543
line-height: 1;
36-
font-size: 15px;
37-
/* color: black; */
44+
font-size: 12px;
3845
cursor: pointer;
3946

4047
transition: transform 0.05s ease-out;
4148
}
4249

4350
.sort-reverse {
4451
transform: rotate(180deg);
52+
}
53+
54+
.table-wrapper {
55+
overflow-y: scroll;
56+
overflow-x: scroll;
57+
height: fit-content;
58+
margin: 20px;
59+
max-height: 300px;
60+
}
61+
62+
.headcol {
63+
position: absolute;
4564
}
