Summarizing Large Data Sets

Jeffrey A. Bilmes

Citation
Jeffrey A. Bilmes. "Summarizing Large Data Sets". Tutorial, 20 August 2015.

Abstract
The recent growth of available data is both a blessing and a curse for the field of data science. While large data sets can lead to improved predictive accuracy and can motivate research in parallel computing, they can also be plagued with redundancy, leading to wasted computation. In this talk we will discuss a class of approaches to data summarization and subset selection based on submodular functions. We will see how a form of "combinatorial dependence" over data sets can be naturally induced via submodular functions, and how resulting submodular programs (that often have approximation guarantees) can yield practical and high-quality data summarization strategies. The effectiveness of this approach will be demonstrated based on results from a wide range of applications, including document summarization, machine learning training data subset selection, image summarization, and some recent work on online streaming video summarization.
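
As an illustration of the kind of submodular subset selection the abstract describes, the following minimal Python sketch greedily maximizes a facility-location function under a cardinality constraint. It is not taken from the tutorial itself: the function names (facility_location, greedy_summary), the toy similarity matrix, and the choice of facility location as the objective are all assumptions made for illustration. For monotone submodular objectives such as this one (with non-negative similarities), the greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value, which is one example of the approximation guarantees mentioned above.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], subset: Sequence[int]) -> float:
    """f(S) = sum over all items i of the best similarity between i and any j in S.

    `sim` is a hypothetical non-negative pairwise similarity matrix.
    """
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_summary(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Select up to k items, each time adding the item with the largest marginal gain."""
    chosen: List[int] = []
    remaining = set(range(len(sim)))
    current = 0.0
    for _ in range(min(k, len(sim))):
        best_item, best_gain = None, 0.0
        for j in sorted(remaining):  # sorted for deterministic tie-breaking
            gain = facility_location(sim, chosen + [j]) - current
            if best_item is None or gain > best_gain:
                best_item, best_gain = j, gain
        chosen.append(best_item)
        remaining.remove(best_item)
        current += best_gain
    return chosen


# Toy data (assumed): items 0 and 1 are near-duplicates, item 2 is distinct.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
summary = greedy_summary(sim, 2)
print(summary)
```

On this toy input the second pick skips the near-duplicate item 1 in favor of item 2, because adding a redundant item yields little marginal gain. This diminishing-returns behavior is what makes submodular objectives a natural fit for removing redundancy from large data sets.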

Electronic downloads

Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.

Citation formats

HTML

Jeffrey A. Bilmes. <a href="http://www.terraswarm.org/pubs/595.html"><i>Summarizing Large Data Sets</i></a>, Tutorial, 20 August 2015.

Plain text

Jeffrey A. Bilmes. "Summarizing Large Data Sets". Tutorial, 20 August 2015.

BibTeX

@tutorial{Bilmes15_SummarizingLargeDataSets,
    author = {Jeffrey A. Bilmes},
    title = {Summarizing Large Data Sets},
    day = {20},
    month = {August},
    year = {2015},
    abstract = {The recent growth of available data is both a
              blessing and a curse for the field of data
              science. While large data sets can lead to
              improved predictive accuracy and can motivate
              research in parallel computing, they can also be
              plagued with redundancy, leading to wasted
              computation. In this talk we will discuss a class
              of approaches to data summarization and subset
              selection based on submodular functions. We will
              see how a form of "combinatorial dependence" over
              data sets can be naturally induced via submodular
              functions, and how resulting submodular programs
              (that often have approximation guarantees) can
              yield practical and high-quality data
              summarization strategies. The effectiveness of
              this approach will be demonstrated based on
              results from a wide range of applications,
              including document summarization, machine learning
              training data subset selection, image
              summarization, and some recent work on online
              streaming video summarization.},
    URL = {http://terraswarm.org/pubs/595.html}
}

Posted by Barb Hoversten on 30 Jul 2015.

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.