Commit 14f92a9

MTBench: Added leaderboard and revised main page's text
1 parent 9ae6944 commit 14f92a9

8 files changed

Lines changed: 613 additions & 30 deletions

-28.3 KB
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+[
+  {
+    "id": 1,
+    "first_name": "Marrilee",
+    "last_name": "Le Clercq",
+    "email": "mleclercq0@wunderground.com",
+    "gender": "Female",
+    "ip_address": "51.189.247.168"
+  },
+  {
+    "id": 2,
+    "first_name": "Nancey",
+    "last_name": "Garioch",
+    "email": "ngarioch1@adobe.com",
+    "gender": "Non-binary",
+    "ip_address": "106.123.76.196"
+  },
+  {
+    "id": 3,
+    "first_name": "Malanie",
+    "last_name": "Decroix",
+    "email": "mdecroix2@goodreads.com",
+    "gender": "Non-binary",
+    "ip_address": "95.153.34.2"
+  }
+]

app/projects/mtbench/data/data_leaderboard.json

Lines changed: 356 additions & 0 deletions
Large diffs are not rendered by default.

app/projects/mtbench/page.mdx

Lines changed: 25 additions & 28 deletions
@@ -1,5 +1,6 @@
 
-import { Authors, Badges } from '@/components/utils'
+import { Authors, Badges} from '@/components/utils'
+import Table from '@/components/table'
 
 # MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering
 
@@ -8,51 +9,47 @@ import { Authors, Badges } from '@/components/utils'
 />
 
 <Badges
-  venue="KDD 2025"
+  venue=""
   github="https://github.com/Graph-and-Geometric-Learning/MTBench"
   arxiv=""
   pdf=""
 />
 
 
 ## Introduction
-Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing complex interactions between narrative information and temporal patterns. To bridge this gap, we introduce **M**ultimodal **T**ime Series **Bench**mark **MTBench**, a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTBench comprises of paired time-series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTBench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTBench enables formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target the model’s ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.
 
 
-## Dataset Overview
+Figuring out how news articles influence changes in time-series data (like stock prices or weather trends) is an important but still under-explored area in applied data science. While AI models that handle multiple types of data (like text and numbers) are becoming more popular, most existing datasets don’t do a great job of testing how well these models can connect information across different formats.
+
+To address this challenge, we introduce **MTBench** (**M**ultimodal **T**ime Series **Bench**mark), a large-scale dataset designed to test how well large language models (LLMs) understand both time-series data and text in financial and weather-related contexts. MTBench pairs numerical data with relevant text—for example, financial news linked to stock price changes and weather reports aligned with historical temperature trends. Unlike existing benchmarks that focus on either text or numbers separately, MTBench challenges models to analyze both together, helping assess their ability to recognize patterns, make predictions, and answer questions based on cross-modal reasoning. This enables a wide range of tasks, such as forecasting trends, interpreting news impact on data, and extracting meaningful insights from both structured (numerical) and unstructured (textual) information.
+
+We test the latest large language models (LLMs) on MTBench to see how well they understand the connection between news stories and time-based trends. Our results highlight major challenges—these models struggle to recognize long-term patterns, understand cause-and-effect relationships in financial and weather data, and seamlessly combine insights from both text and numerical information.
+
+## Dataset Collection
 
 ### Finance Dataset
-![The pipeline of finance dataset collection.|scale=0.5](./assets/fin_data_collection_pipeline.png)
+![Figure 1. The pipeline of finance dataset collection.|scale=0.5](./assets/fin_data_collection_pipeline.png)
 
-The pipeline of finance dataset collection is shown in Figure 2. To construct a diverse stock news and time-series dataset, we collected over 200,000 financial news article URLs from professional financial websites, including GlobeNews, Market-Watch, SeekingAlpha, Zacks, Invezz, Quartz (QZ), PennyStocks, and Benzinga, covering the period from May 2021 to September 2023. We then scraped and parsed the corresponding textual content, titles, stock names, and publishing dates from these URLs. From this collection, we derived a 20,000-news subset while ensuring a balanced distribution of article lengths. To enrich the dataset with structured metadata, we employed GPT-4o to annotate each article with news content type, temporal effect range, and sentiment.
-**Stock Time-Series Collection.** For each financial news article, we identified the corresponding stock time-series data by utilizing the extracted sentiment and stock name. The historical stock price data
-was retrieved with open prices sampled at varying granularities. To ensure data quality, we discarded samples where stock price
-data was missing for more than 70% of the time period due to market closures (e.g., holidays, weekends). To construct aligned input-output time-series pairs, we assumed that each news article happens at the 0.9 percentile of its input time-series window. We curated two forecasting settings:
-**Short-Term Prediction**: Use 7 days of stock prices at a 5-minute granularity to predict the next 1-day price movements.
-**Long-Term Prediction**: Use 30 days of stock prices at a 1-hour granularity to predict the next 7 days’ stock movements.
+The process of collecting the finance dataset is illustrated in Figure 1. To build a diverse dataset linking stock market news with time-series data, we gathered over 200,000 financial news article URLs from reputable sources such as GlobeNews, MarketWatch, SeekingAlpha, Zacks, Invezz, Quartz (QZ), PennyStocks, and Benzinga, spanning May 2021 to September 2023. We then scraped key details from these articles, including text content, titles, stock names, and publication dates. From this collection, we selected a subset of 20,000 news articles, ensuring a balanced distribution of article lengths.
 
-### Weather Dataset
-We selected 50 airports in the United States as data sources, using the Global Historical Climatology Network Hourly (GHCN-H) dataset. The data spans from 2003 to 2020 and is collected hourly. Each weather station records multiple attributes, including geographical location, temperature, humidity, wind speed, wind direction, visibility, pressure, and precipitation. Airports were chosen due to the higher reliability and accuracy of their weather data compared to other stations. In this study, we focus on single-channel data, specifically temperature, as it is the most critical parameter for weather forecasting. Meanwhile, within our raw data creation pipeline, additional channels are available, allowing for future expansion to multi-channel weather analysis.
-Unlike stock price datasets, systematically collecting weather-related news is challenging, as routine weather reports may not provide sufficient context for complex reasoning. To address this, we use the Storm Events Database, which documents storm occurrences in the United States from 1950 to 2020. This dataset includes details such as storm type, location, fatalities, and injuries, covering a range of severe weather conditions, such as hail, tornadoes, thunderstorms, floods, hurricanes, and typhoons.
-Each entry contains an <em>event ID</em> and an <em>episode ID</em>, where the event ID uniquely identifies an occurrence, and the <em>episode ID</em> links related events. For example, a hurricane may trigger multiple tornadoes, hailstorms, and thunderstorms, all grouped under the same <em>episode ID</em>. Each event also includes a textual description, providing valuable contextual information.
+To enhance the dataset with structured metadata, we used GPT-4o to annotate each article with its content type, the time range of its impact, and sentiment analysis.
+
+**Stock Time-Series Collection.**
 
-## Experiments
+For each financial news article, we identified relevant stock time-series data based on the extracted sentiment and stock name. We retrieved historical stock prices with opening values sampled at different time scales. To maintain high data quality, we excluded cases where stock price data was missing for more than 70% of the timeframe, often due to market closures (e.g., weekends, holidays).
 
-### Time-series Forecasting
-![Time-series forecasting performance for finance dataset.|scale=0.5](./assets/ts_forecasting_finance.png)
-![Time-series forecasting performance for weather dataset.|scale=0.5](./assets/ts_forecasting_weather.png)
+To align news articles with stock trends, we assumed that each article corresponds to the 90th percentile of its input time-series window. We created two forecasting scenarios:
+**Short-Term Prediction**: Using 7 days of stock price data at a 5-minute resolution to predict price movements for the next day.
+**Long-Term Prediction**: Using 30 days of stock price data at a 1-hour resolution to forecast stock movements for the following 7 days.
 
+### Weather Dataset
 
-### Semantic Trend Prediction
-![Semantic trend prediction for finance dataset.|scale=0.5](./assets/trend_prediction_finance.png)
-![Semantic trend prediction for weather dataset.|scale=0.5](./assets/trend_prediction_weather.png)
+We selected 50 airports across the United States as data sources, using the Global Historical Climatology Network Hourly (GHCN-H) dataset. The data, which spans from 2003 to 2020, is collected hourly. Each weather station records various attributes, including geographic location, temperature, humidity, wind speed, wind direction, visibility, pressure, and precipitation. Airports were chosen because their weather data is generally more reliable and accurate than data from other stations. In this study, we focus on temperature as the primary parameter, since it is a key factor in weather forecasting. However, our raw data pipeline includes additional channels, allowing for future expansion into multi-channel weather analysis.
 
+Unlike stock price data, systematically collecting weather-related news is challenging, as routine reports often lack the context needed for complex analysis. To overcome this, we use the Storm Events Database, which records storm occurrences in the United States from 1950 to 2020. This dataset includes details such as storm type, location, fatalities, and injuries, covering various severe weather conditions, including hail, tornadoes, thunderstorms, floods, hurricanes, and typhoons.
 
-### Technical Indicator Calculation
-![Technical indicator calculation for finance dataset.|scale=0.5](./assets/tech_indicator_finance.png)
-![Technical indicator calculation for weather dataset.|scale=0.5](./assets/tech_indicator_weather.png)
-![Technical indicator calculation for weather dataset.|scale=0.5](./assets/tech_indicator_fin_weather.png)
+Each entry in the database contains an <em>event ID</em>, which uniquely identifies an occurrence, and an <em>episode ID</em>, which links related events. For example, a hurricane might trigger multiple tornadoes, hailstorms, and thunderstorms, all grouped under the same episode ID. Each event also includes a textual description, offering valuable contextual information.
 
+## Leaderboard
 
-### News-driven Question Answering
-![News-driven question answering.|scale=0.5](./assets/news_driven_qa_answering.png)
+<Table/>
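
Note on the `page.mdx` hunk above: the revised text describes two forecasting settings (7 days at 5-minute resolution, 30 days at 1-hour resolution), a news position at the 90th percentile of the input window, and a 70% missing-data cutoff. The following is a minimal sketch of that arithmetic, not the authors' pipeline. The names `ForecastSetting`, `slotsInWindow`, `newsIndex`, and `tooSparse` are hypothetical, and the slot count assumes round-the-clock sampling, which real market data does not have.

```typescript
// Hypothetical helpers; only the numbers (7 d @ 5 min, 30 d @ 1 h,
// 90th percentile, 70% missing threshold) come from the page text.
interface ForecastSetting {
  name: string;
  inputDays: number;   // length of the input window
  outputDays: number;  // length of the forecast horizon
  stepMinutes: number; // sampling resolution
}

const SHORT_TERM: ForecastSetting = { name: "short-term", inputDays: 7, outputDays: 1, stepMinutes: 5 };
const LONG_TERM: ForecastSetting = { name: "long-term", inputDays: 30, outputDays: 7, stepMinutes: 60 };

// Number of slots in a window if it were sampled around the clock.
// Assumption: real stock data has gaps (nights, weekends, holidays).
function slotsInWindow(days: number, stepMinutes: number): number {
  return Math.floor((days * 24 * 60) / stepMinutes);
}

// Index of the news timestamp inside an input window of `length` points,
// placed at the given quantile (0.9 in the text).
function newsIndex(length: number, quantile = 0.9): number {
  return Math.floor(quantile * (length - 1));
}

// True when more than `threshold` of the slots are missing (null).
// An empty window is treated as unusable.
function tooSparse(prices: Array<number | null>, threshold = 0.7): boolean {
  if (prices.length === 0) return true;
  const missing = prices.filter((p) => p === null).length;
  return missing / prices.length > threshold;
}
```

Under these assumptions, a sample would be kept only when `tooSparse(window)` is false, and the article timestamp would sit near the end of the input window rather than at its boundary.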

components/sortable-table.tsx

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
+'use client'
+import { MouseEventHandler, useCallback, useState } from "react";
+// import data from "../app/projects/mtbench/data/data.json";
+import data from "../app/projects/mtbench/data/data_leaderboard.json";
+
+type Data = typeof data;
+
+type SortKeys = keyof Data[0];
+
+type SortOrder = "ascn" | "desc";
+
+function sortData({
+  tableData,
+  sortKey,
+  reverse,
+}: {
+  tableData: Data;
+  sortKey: SortKeys;
+  reverse: boolean;
+}) {
+  if (!sortKey) return tableData;
+
+  const sortedData = data.sort((a, b) => {
+    return a[sortKey] > b[sortKey] ? 1 : -1;
+  });
+
+  if (reverse) {
+    return sortedData.reverse();
+  }
+
+  return sortedData;
+}
+
+function SortButton({
+  sortOrder,
+  columnKey,
+  sortKey,
+  onClick,
+}: {
+  sortOrder: SortOrder;
+  columnKey: SortKeys;
+  sortKey: SortKeys;
+  onClick: MouseEventHandler<HTMLButtonElement>;
+}) {
+  return (
+    <button
+      onClick={onClick}
+      className={`${
+        sortKey === columnKey && sortOrder === "desc"
+          ? "sort-button sort-reverse"
+          : "sort-button"
+      }`}
+    >
+
+    </button>
+  );
+}
+
+function SortableTable({ data }: { data: Data }) {
+  const [sortKey, setSortKey] = useState<SortKeys>("model_name");
+  const [sortOrder, setSortOrder] = useState<SortOrder>("ascn");
+
+  const headers: { key: SortKeys; label: string }[] = [
+    { key: "model_name", label: "Model" },
+    { key: "stock_price_forecast_7_day_mae_ts", label: "Stock price predict. for 7 days under TS (MAE)" },
+    { key: "stock_price_forecast_7_day_mae_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAE)" },
+    { key: "stock_price_forecast_7_day_mape_ts", label: "Stock price predict. for 7 days under TS (MAPE)" },
+    { key: "stock_price_forecast_7_day_mape_ts_w_text", label: "Stock price predict. for 7 days under TS+Text (MAPE)" },
+    { key: "stock_price_forecast_30_day_mae_ts", label: "Stock price predict. for 30 days under TS (MAE)" },
+    { key: "stock_price_forecast_30_day_mae_ts_w_text", label: "Stock price predict. for 30 days under TS+Text (MAE)" },
+    { key: "stock_price_forecast_30_day_mape_ts", label: "Stock price predict. for 30 days under TS (MAPE)" },
+    { key: "stock_price_forecast_30_day_mape_ts_w_text", label: "Stock price predict. for 30 days under TS+Text (MAPE)" },
+    { key: "temp_forecast_7_day_mse_ts", label: "Temp. predict. for 7 days under TS (MSE)" },
+    { key: "temp_forecast_7_day_mse_ts_w_text", label: "Temp. predict. for 7 days under TS+Text (MSE)" },
+    { key: "temp_forecast_7_day_mae_ts", label: "Temp. predict. for 7 days under TS (MAE)" },
+    { key: "temp_forecast_7_day_mae_ts_w_text", label: "Temp. predict. for 7 days under TS+Text (MAE)" },
+    { key: "temp_forecast_14_day_mse_ts", label: "Temp. predict. for 14 days under TS (MSE)" },
+    { key: "temp_forecast_14_day_mse_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MSE)" },
+    { key: "temp_forecast_14_day_mae_ts", label: "Temp. predict. for 14 days under TS (MAE)" },
+    { key: "temp_forecast_14_day_mae_ts_w_text", label: "Temp. predict. for 14 days under TS+Text (MAE)" },
+  ];
+
+  const sortedData = useCallback(
+    () => sortData({ tableData: data, sortKey, reverse: sortOrder === "desc" }),
+    [data, sortKey, sortOrder]
+  );
+
+  function changeSort(key: SortKeys) {
+    setSortOrder(sortOrder === "ascn" ? "desc" : "ascn");
+
+    setSortKey(key);
+  }
+
+  return (
+    <table>
+      <thead>
+        <tr>
+          {headers.map((row) => {
+            return (
+              <td key={row.key}>
+                {row.label}{" "}
+                <SortButton
+                  columnKey={row.key}
+                  onClick={() => changeSort(row.key)}
+                  {...{
+                    sortOrder,
+                    sortKey,
+                  }}
+                />
+              </td>
+            );
+          })}
+        </tr>
+      </thead>
+
+      <tbody>
+        {sortedData().map((model) => {
+          return (
+            <tr key={model.model_name}>
+              <td>{model.model_name}</td>
+              <td>{model.stock_price_forecast_7_day_mae_ts}</td>
+              <td>{model.stock_price_forecast_7_day_mae_ts_w_text}</td>
+              <td>{model.stock_price_forecast_7_day_mape_ts}</td>
+              <td>{model.stock_price_forecast_7_day_mape_ts_w_text}</td>
+              <td>{model.stock_price_forecast_30_day_mae_ts}</td>
+              <td>{model.stock_price_forecast_30_day_mae_ts_w_text}</td>
+              <td>{model.stock_price_forecast_30_day_mape_ts}</td>
+              <td>{model.stock_price_forecast_30_day_mape_ts_w_text}</td>
+              <td>{model.temp_forecast_7_day_mse_ts}</td>
+              <td>{model.temp_forecast_7_day_mse_ts_w_text}</td>
+              <td>{model.temp_forecast_7_day_mae_ts}</td>
+              <td>{model.temp_forecast_7_day_mae_ts_w_text}</td>
+              <td>{model.temp_forecast_14_day_mse_ts}</td>
+              <td>{model.temp_forecast_14_day_mse_ts_w_text}</td>
+              <td>{model.temp_forecast_14_day_mae_ts}</td>
+              <td>{model.temp_forecast_14_day_mae_ts_w_text}</td>
+            </tr>
+          );
+        })}
+      </tbody>
+    </table>
+  );
+}
+
+export default SortableTable;
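
Note on `components/sortable-table.tsx`: as committed, `sortData` calls `data.sort(...)` on the module-level import rather than on its `tableData` argument. `Array.prototype.sort` and `reverse` both mutate in place, and the comparator never returns 0 for equal values. `changeSort` also flips the direction on every click, even when a different column is selected. The following is one possible sketch of a non-mutating alternative, not a drop-in patch. `Cell`, `Row`, `compareCells`, `sortRows`, and `nextSort` are hypothetical names, and cell values are assumed to be strings or numbers.

```typescript
type SortOrder = "ascn" | "desc";
type Cell = string | number;          // assumption: leaderboard cells are strings or numbers
type Row = Record<string, Cell>;

// Three-way comparison: numeric when both cells are numbers,
// otherwise a string comparison. Returns 0 for equal values.
function compareCells(x: Cell, y: Cell): number {
  if (typeof x === "number" && typeof y === "number") return x - y;
  return String(x).localeCompare(String(y));
}

// Returns a sorted copy of `rows`; the input array is left untouched.
function sortRows(rows: readonly Row[], key: string, order: SortOrder): Row[] {
  const direction = order === "desc" ? -1 : 1;
  return [...rows].sort((a, b) => direction * compareCells(a[key], b[key]));
}

// Toggle the direction only when the same column is clicked again;
// a newly selected column starts ascending.
function nextSort<K>(
  current: { key: K; order: SortOrder },
  clicked: K
): { key: K; order: SortOrder } {
  if (current.key !== clicked) return { key: clicked, order: "ascn" };
  return { key: clicked, order: current.order === "ascn" ? "desc" : "ascn" };
}
```

Copying before sorting keeps the imported JSON and the `data` prop stable across renders, so the result could be memoized with `useMemo` instead of being recomputed through a `useCallback` wrapper on every render.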

components/table.tsx

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+import { useState } from "react";
+import SortableTable from "./sortable-table";
+import data from "../app/projects/mtbench/data/data_leaderboard.json";
+import "../styles/table.css";
+
+function Table() {
+  return (
+    <div className="Table">
+      <SortableTable data={data} />
+    </div>
+  );
+}
+
+export default Table;
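
Note on the leaderboard columns: the keys rendered by this table name three error metrics (MAE, MAPE, MSE). The commit does not show how the values in `data_leaderboard.json` were computed, so the sketch below gives only the textbook definitions. The function names are hypothetical, and reporting MAPE as a percentage rather than a fraction is an assumption.

```typescript
// Textbook error metrics; not taken from the MTBench evaluation code.
function checkLengths(actual: number[], predicted: number[]): void {
  if (actual.length === 0 || actual.length !== predicted.length) {
    throw new Error("actual and predicted must be non-empty and equally long");
  }
}

// Mean absolute error.
function mae(actual: number[], predicted: number[]): number {
  checkLengths(actual, predicted);
  const total = actual.reduce((sum, a, i) => sum + Math.abs(a - predicted[i]), 0);
  return total / actual.length;
}

// Mean squared error.
function mse(actual: number[], predicted: number[]): number {
  checkLengths(actual, predicted);
  const total = actual.reduce((sum, a, i) => sum + (a - predicted[i]) ** 2, 0);
  return total / actual.length;
}

// Mean absolute percentage error, returned in percent (assumption).
// Undefined when an actual value is zero, so that case throws.
function mape(actual: number[], predicted: number[]): number {
  checkLengths(actual, predicted);
  const total = actual.reduce((sum, a, i) => {
    if (a === 0) throw new Error("MAPE is undefined for zero actual values");
    return sum + Math.abs((a - predicted[i]) / a);
  }, 0);
  return (total / actual.length) * 100;
}
```

All three are lower-is-better, so an ascending sort on any metric column puts the strongest model first.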

config/publications.ts

Lines changed: 2 additions & 2 deletions
@@ -21,8 +21,8 @@ export interface Publication {
 export const publications: Publication[] = [
   {
     title: "MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering",
-    authors: "Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Ali Maatouk, Leandros Tassiulas, Yifeng Gao3, Rex Ying",
-    venue: "KDD, 2025",
+    authors: "Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying",
+    venue: "",
     page: "mtbench",
     code: "https://github.com/Graph-and-Geometric-Learning/MTBencht",
     paper: "",

styles/table.css

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+table {
+  width: 100%;
+  table-layout: fixed;
+  border: 1px solid;
+  overflow-x: auto;
+  max-width: fit-content;
+}
+
+tbody {
+  width: 100%;
+  border: 1px solid rgba(255, 255, 255, 0.3);
+}
+
+tr {
+  width: 100%;
+}
+
+th,
+td {
+  border-bottom: 1px solid #ddd;
+  /* padding: 10px 8px; */
+  text-align: left;
+  font-weight: 500;
+  font-size: 12px;
+  /* border: 1px solid; */
+  /* color: black; */
+}
+tr:hover {background-color: rgba(84, 83, 83, 0.7);}
+.sort-button {
+  background-color: transparent;
+  border: none;
+
+  padding: 5px 10px;
+  margin: 0;
+  line-height: 1;
+  font-size: 15px;
+  /* color: black; */
+  cursor: pointer;
+
+  transition: transform 0.05s ease-out;
+}
+
+.sort-reverse {
+  transform: rotate(180deg);
+}
