Summarizing Large Data Sets
Jeffrey A. Bilmes

Citation
Jeffrey A. Bilmes. "Summarizing Large Data Sets". Tutorial, 20 August 2015.

Abstract
The recent growth of available data is both a blessing and a curse for the field of data science. While large data sets can lead to improved predictive accuracy and can motivate research in parallel computing, they can also be plagued with redundancy, leading to wasted computation. In this talk we will discuss a class of approaches to data summarization and subset selection based on submodular functions. We will see how a form of "combinatorial dependence" over data sets can be naturally induced via submodular functions, and how resulting submodular programs (that often have approximation guarantees) can yield practical and high-quality data summarization strategies. The effectiveness of this approach will be demonstrated based on results from a wide range of applications, including document summarization, machine learning training data subset selection, image summarization, and some recent work on online streaming video summarization.
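
As a rough illustration of the kind of submodular program the abstract refers to, the sketch below greedily maximizes a facility-location function under a size budget. This is one standard textbook instance, not necessarily the method presented in the talk. The names `facility_location`, `greedy_summary`, and the toy similarity matrix `sim` are assumptions made for this example. For a monotone submodular objective such as this one (with nonnegative similarities), the greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
from typing import List, Sequence


def facility_location(similarity: Sequence[Sequence[float]], subset: List[int]) -> float:
    """f(S) = sum over every item i of max_{j in S} similarity[i][j]; f(empty set) = 0.

    Assumes nonnegative similarities, which makes f monotone and submodular.
    """
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in similarity)


def greedy_summary(similarity: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick up to `budget` items, each time adding the item with the largest marginal gain."""
    n = len(similarity)
    selected: List[int] = []
    current_value = 0.0
    for _ in range(min(budget, n)):
        best_gain, best_item = float("-inf"), -1
        for j in range(n):
            if j in selected:
                continue
            gain = facility_location(similarity, selected + [j]) - current_value
            if gain > best_gain:
                best_gain, best_item = gain, j
        selected.append(best_item)
        current_value += best_gain
    return selected


# Hypothetical toy data: items 0-2 form one redundant cluster, items 3-4 another.
sim = [
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.10, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.10],
    [0.10, 0.10, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.10, 0.90, 1.00],
]

summary = greedy_summary(sim, budget=2)
print(summary)  # one representative from each cluster
```

On this toy matrix the first pick is item 1, the most central item of the larger cluster, and the second pick comes from the other cluster, because a second item from the first cluster would add almost no coverage. That diminishing return is the redundancy penalty that submodularity encodes.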

Electronic downloads
Internal. This publication has been marked by the author for TerraSwarm-only distribution, so electronic downloads are not available without logging in.

Citation formats
  • HTML
    Jeffrey A. Bilmes. <a
    href="http://www.terraswarm.org/pubs/595.html"
    ><i>Summarizing Large Data
    Sets</i></a>, Tutorial, 20 August 2015.
  • Plain text
    Jeffrey A. Bilmes. "Summarizing Large Data Sets".
    Tutorial, 20 August 2015.
  • BibTeX
    @tutorial{Bilmes15_SummarizingLargeDataSets,
        author = {Jeffrey A. Bilmes},
        title = {Summarizing Large Data Sets},
        day = {20},
        month = {August},
        year = {2015},
        abstract = {The recent growth of available data is both a
                  blessing and a curse for the field of data
                  science. While large data sets can lead to
                  improved predictive accuracy and can motivate
                  research in parallel computing, they can also be
                  plagued with redundancy, leading to wasted
                  computation. In this talk we will discuss a class
                  of approaches to data summarization and subset
                  selection based on submodular functions. We will
                  see how a form of "combinatorial dependence" over
                  data sets can be naturally induced via submodular
                  functions, and how resulting submodular programs
                  (that often have approximation guarantees) can
                  yield practical and high-quality data
                  summarization strategies. The effectiveness of
                  this approach will be demonstrated based on
                  results from a wide range of applications,
                  including document summarization, machine learning
                  training data subset selection, image
                  summarization, and some recent work on online
                  streaming video summarization.},
        URL = {http://terraswarm.org/pubs/595.html}
    }
    

Posted by Barb Hoversten on 30 Jul 2015.
Groups: services

Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright.