QTSumm: Query-Focused Summarization over Tabular Data


Page 1 (0s)

[Audio] Thank you for watching our video! In this presentation, we introduce our EMNLP paper titled 'QTSumm: Query-Focused Summarization over Tabular Data.'

Page 2 (15s)

[Audio] In this paper, we define a new task: query-focused table summarization. This task challenges text generation models to mimic human reasoning and analysis when interacting with tabular data, producing a summary tailored to a specific query. We have developed a new benchmark for this task called QTSumm. It includes 7,111 human-annotated query-summary pairs over 2,934 tables covering a wide array of topics. The QTSumm dataset introduces challenges on three fronts. First, it demands that models understand the user's information need to create a summary specific to their query. Second, models must perform human-like analysis of the table data, extracting facts and reasoning over them so that the summary stays faithful to the table's information. Lastly, the task brings new challenges in developing automated evaluation systems that can accurately assess model performance in this domain.
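
To make the task concrete, the sketch below shows what a single query-summary pair over a table might look like. The field names are illustrative assumptions for exposition, not the dataset's actual schema (the table excerpt uses real 2008 Olympics medal figures).

```python
# Illustrative only: a hypothetical QTSumm-style instance. The field
# names are assumptions, not the dataset's actual schema.
example = {
    "table": {
        "title": "2008 Summer Olympics medal table (excerpt)",
        "header": ["Rank", "Nation", "Gold", "Silver", "Bronze", "Total"],
        "rows": [
            ["1", "China", "48", "22", "30", "100"],
            ["2", "United States", "36", "39", "37", "112"],
        ],
    },
    "query": "Which nation won the most gold medals, and how does its "
             "total medal count compare with the runner-up?",
    "summary": "China topped the gold-medal count with 48 golds, though "
               "its 100 total medals trailed the United States' 112.",
}
```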

Page 3 (1m 25s)

[Audio] QTSumm requires models to perform human-like reasoning to generate summaries that provide a comprehensive and precise analysis of the source table, fulfilling the user's information need. However, existing end-to-end text generation models rely on error-prone implicit reasoning processes, which diminishes explainability and makes reasoning difficult. To address this issue, we present REFACTOR, which retrieves and reasons over query-relevant information from tabular data to generate data insights in natural-language form. These generated facts serve as explicit reasoning results, enhancing the comprehensiveness and faithfulness of text generation systems.
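
The core idea is retrieve-then-generate: turn query-relevant table regions into explicit natural-language facts, then condition the generator on those facts rather than on implicit reasoning alone. The sketch below illustrates that interface with a simple word-overlap heuristic; REFACTOR's actual retrieval and reasoning components are more sophisticated, and all names here are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the retrieve-then-generate idea, assuming a simple
# word-overlap retriever. REFACTOR's actual components are learned and
# more sophisticated; all names here are illustrative.
from typing import List


def linearize_row(header: List[str], row: List[str]) -> str:
    """Turn one table row into a natural-language fact string."""
    return "; ".join(f"{col} is {val}" for col, val in zip(header, row))


def retrieve_facts(query: str, header: List[str],
                   rows: List[List[str]], top_k: int = 3) -> List[str]:
    """Keep the top-k row facts ranked by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = []
    for row in rows:
        fact = linearize_row(header, row)
        overlap = len(query_words & set(fact.lower().split()))
        scored.append((overlap, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]


def build_generation_input(query: str, facts: List[str]) -> str:
    """Prepend retrieved facts so the generator conditions on explicit
    reasoning results instead of the raw table alone."""
    return f"query: {query} facts: " + " | ".join(facts)
```

The resulting string can be fed to any sequence-to-sequence summarizer, which is what makes the explicit facts a drop-in enhancement for existing text generation systems.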

Page 4 (2m 10s)

[Audio] We investigate a set of strong baselines on QTSumm, including text generation models, table-to-text generation models, and large language models. Our experiments led to five key findings. First, our analysis reveals that an enhanced understanding of table structures significantly improves the performance of fine-tuned models. Table-to-text generation models such as ReasTAP and TAPEX, for instance, outperform their underlying base models, underscoring the value of specialized table structure comprehension. Second, the capacity for human-like reasoning and analysis emerges as a critical determinant of model effectiveness on the QTSumm task. Notably, Flan-T5, which augments T5 with scaled instruction fine-tuning, outperforms the original T5 model, and large language models with advanced reasoning abilities, such as GPT-4, also perform better. Third, we found that the REFACTOR framework significantly bolsters model performance, particularly in terms of the faithfulness of the generated summaries. Fourth, our case studies identified four prevalent types of errors commonly exhibited by existing models: hallucination, factual incorrectness, misunderstanding of intent, and repetitive outputs. Finally, we observe a mismatch between automated evaluation and human judgment: for example, despite receiving low BLEU and ROUGE scores, GPT-4 outperforms state-of-the-art fine-tuned models in human evaluation. This finding underscores the need for future research into automated evaluation metrics for the QTSumm task that better align with human judgment.
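
The metric mismatch is easy to reproduce: n-gram metrics such as BLEU and ROUGE reward surface overlap with the reference, so a faithful paraphrase can score poorly. The example below is a minimal sketch using the sacrebleu and rouge-score libraries; the sentences are illustrative, not drawn from the paper's outputs.

```python
# Why n-gram metrics can diverge from human judgment: a faithful
# paraphrase scores poorly because it shares few surface n-grams with
# the reference. Requires: pip install sacrebleu rouge-score
import sacrebleu
from rouge_score import rouge_scorer

reference = "China topped the gold-medal count with 48 golds."
paraphrase = "With 48 gold medals, China ranked first among all nations."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, paraphrase)["rougeL"].fmeasure
bleu = sacrebleu.sentence_bleu(paraphrase, [reference]).score

print(f"ROUGE-L F1: {rouge_l:.2f}, BLEU: {bleu:.1f}")
# Both scores come out low even though the paraphrase is factually
# faithful, mirroring the BLEU/ROUGE vs. human-evaluation mismatch.
```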

Page 5 (4m 9s)

[Audio] We have made the dataset available on Hugging Face, and our codebase can be found on GitHub. If you have any questions regarding the paper, please feel free to contact the first author.
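
For readers who want to explore the data, a typical loading pattern with the Hugging Face datasets library is sketched below. The repo id "yale-nlp/QTSumm" is an assumption; consult the paper's GitHub repository for the exact path.

```python
# Hypothetical loading sketch. The dataset id is an assumption; check
# the paper's GitHub repository for the exact Hugging Face path.
from datasets import load_dataset

qtsumm = load_dataset("yale-nlp/QTSumm")
print(qtsumm)  # inspect the available splits and fields
```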