PDF Clown
0.1.2

org.pdfclown.tools
Class TextExtractor

java.lang.Object
  extended by org.pdfclown.tools.TextExtractor

public final class TextExtractor
extends Object

Tool for extracting text from content contexts.

Since:
0.0.8
Version:
0.1.1, 11/01/11
Author:
Stefano Chizzolini (http://www.stefanochizzolini.it)

Nested Class Summary
static class TextExtractor.AreaModeEnum
          Text-to-area matching mode.
static interface TextExtractor.IIntervalFilter
          Text filter by interval.
 
Constructor Summary
TextExtractor()
           
TextExtractor(boolean sorted, boolean dehyphenated)
           
TextExtractor(List<Rectangle2D> areas, boolean sorted, boolean dehyphenated)
           
 
Method Summary
 Map<Rectangle2D,List<ITextString>> extract(Contents contents)
          Extracts text strings from the specified contents.
 Map<Rectangle2D,List<ITextString>> extract(IContentContext contentContext)
          Extracts text strings from the specified content context.
 Map<Rectangle2D,List<ITextString>> filter(List<? extends ITextString> textStrings, Rectangle2D... areas)
          Gets the text strings matching the specified areas.
 List<ITextString> filter(List<? extends ITextString> textStrings, Rectangle2D area)
          Gets the text strings matching the specified area.
 List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings, List<Interval<Integer>> intervals)
          Gets the text strings matching the specified intervals.
 Map<Rectangle2D,List<ITextString>> filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D... areas)
          Gets the text strings matching the specified areas.
 List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D area)
          Gets the text strings matching the specified area.
 void filter(Map<Rectangle2D,List<ITextString>> textStrings, TextExtractor.IIntervalFilter filter)
          Processes the text strings matching the specified filter.
 TextExtractor.AreaModeEnum getAreaMode()
          Gets the text-to-area matching mode.
 List<Rectangle2D> getAreas()
          Gets the graphic areas whose text has to be extracted.
 double getAreaTolerance()
          Gets the margin (in points) admitted outside the area bounds for containment matching purposes.
 boolean isDehyphenated()
          Gets whether the text strings have to be dehyphenated.
 boolean isSorted()
          Gets whether the text strings have to be sorted.
 void setAreaMode(TextExtractor.AreaModeEnum value)
           
 void setAreas(List<Rectangle2D> value)
           
 void setAreaTolerance(double value)
           
 void setDehyphenated(boolean value)
           
 void setSorted(boolean value)
           
static String toString(Map<Rectangle2D,List<ITextString>> textStrings)
          Converts text information into plain text.
static String toString(Map<Rectangle2D,List<ITextString>> textStrings, String lineSeparator, String areaSeparator)
          Converts text information into plain text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TextExtractor

public TextExtractor()

TextExtractor

public TextExtractor(boolean sorted,
                     boolean dehyphenated)

TextExtractor

public TextExtractor(List<Rectangle2D> areas,
                     boolean sorted,
                     boolean dehyphenated)

Method Detail

toString

public static String toString(Map<Rectangle2D,List<ITextString>> textStrings)
Converts text information into plain text.

Parameters:
textStrings - Text information to convert.
Returns:
Plain text.

toString

public static String toString(Map<Rectangle2D,List<ITextString>> textStrings,
                              String lineSeparator,
                              String areaSeparator)
Converts text information into plain text.

Parameters:
textStrings - Text information to convert.
lineSeparator - Separator to apply on line break.
areaSeparator - Separator to apply on area break.
Returns:
Plain text.
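
The role of the two separators can be sketched as follows. This is a minimal illustration under stated assumptions, not PDF Clown's implementation: the hypothetical PlainTextSketch.toPlainText uses plain String values as stand-ins for ITextString, and it assumes that lineSeparator is placed between the text strings of one area and areaSeparator between consecutive areas.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in; not part of org.pdfclown.tools.
final class PlainTextSketch {
    /**
     * Joins the strings of each area with lineSeparator and joins
     * consecutive areas with areaSeparator (assumed placement).
     */
    static String toPlainText(Map<Rectangle2D, List<String>> textByArea,
                              String lineSeparator,
                              String areaSeparator) {
        StringBuilder out = new StringBuilder();
        boolean firstArea = true;
        for (List<String> lines : textByArea.values()) {
            if (!firstArea) {
                out.append(areaSeparator);
            }
            firstArea = false;
            out.append(String.join(lineSeparator, lines));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the areas in insertion order.
        Map<Rectangle2D, List<String>> textByArea = new LinkedHashMap<>();
        textByArea.put(new Rectangle2D.Double(0, 0, 200, 40),
                       Arrays.asList("Title", "Subtitle"));
        textByArea.put(new Rectangle2D.Double(0, 50, 200, 300),
                       Arrays.asList("Body"));
        System.out.println(toPlainText(textByArea, "\n", "\n---\n"));
    }
}
```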

extract

public Map<Rectangle2D,List<ITextString>> extract(IContentContext contentContext)
Extracts text strings from the specified content context.

Parameters:
contentContext - Source content context.

extract

public Map<Rectangle2D,List<ITextString>> extract(Contents contents)
Extracts text strings from the specified contents.

Parameters:
contents - Source contents.

filter

public List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings,
                                List<Interval<Integer>> intervals)
Gets the text strings matching the specified intervals.

Parameters:
textStrings - Text strings to filter.
intervals - Text intervals to match. They MUST be ordered and non-overlapping.
Returns:
A list of text strings corresponding to the specified intervals.
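
The precondition on intervals can be checked before calling this method. The sketch below is an illustration only: Span is a hypothetical closed integer interval standing in for Interval<Integer>, and treating a shared endpoint as an overlap is an assumption that depends on how the real intervals define inclusivity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of org.pdfclown.tools.
final class IntervalOrderSketch {
    /** Hypothetical closed interval [low, high], a stand-in for Interval<Integer>. */
    static final class Span {
        final int low;
        final int high;

        Span(int low, int high) {
            this.low = low;
            this.high = high;
        }
    }

    /**
     * Returns true if every interval is well-formed and starts strictly
     * after the previous one ends (ordered and non-overlapping).
     */
    static boolean isOrderedAndDisjoint(List<Span> spans) {
        Span previous = null;
        for (Span span : spans) {
            if (span.low > span.high) {
                return false; // malformed interval
            }
            if (previous != null && span.low <= previous.high) {
                return false; // out of order or overlapping
            }
            previous = span;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Span> valid = Arrays.asList(new Span(0, 4), new Span(10, 15));
        List<Span> overlapping = Arrays.asList(new Span(0, 4), new Span(3, 8));
        System.out.println(isOrderedAndDisjoint(valid));       // prints "true"
        System.out.println(isOrderedAndDisjoint(overlapping)); // prints "false"
    }
}
```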

filter

public void filter(Map<Rectangle2D,List<ITextString>> textStrings,
                   TextExtractor.IIntervalFilter filter)
Processes the text strings matching the specified filter.

Parameters:
textStrings - Text strings to filter.
filter - Matching processor.

filter

public List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings,
                                Rectangle2D area)
Gets the text strings matching the specified area.

Parameters:
textStrings - Text strings to filter, grouped by source area.
area - Graphic area that the text strings have to match.

filter

public Map<Rectangle2D,List<ITextString>> filter(Map<Rectangle2D,List<ITextString>> textStrings,
                                                 Rectangle2D... areas)
Gets the text strings matching the specified areas.

Parameters:
textStrings - Text strings to filter, grouped by source area.
areas - Graphic areas that the text strings have to match.

filter

public List<ITextString> filter(List<? extends ITextString> textStrings,
                                Rectangle2D area)
Gets the text strings matching the specified area.

Parameters:
textStrings - Text strings to filter.
area - Graphic area that the text strings have to match.

filter

public Map<Rectangle2D,List<ITextString>> filter(List<? extends ITextString> textStrings,
                                                 Rectangle2D... areas)
Gets the text strings matching the specified areas.

Parameters:
textStrings - Text strings to filter.
areas - Graphic areas that the text strings have to match.

getAreaMode

public TextExtractor.AreaModeEnum getAreaMode()
Gets the text-to-area matching mode.


getAreas

public List<Rectangle2D> getAreas()
Gets the graphic areas whose text has to be extracted.


getAreaTolerance

public double getAreaTolerance()
Gets the margin (in points) admitted outside the area bounds for containment matching purposes.

This measure is useful to ensure that text whose boxes overlap with the area bounds is not excluded from the match.
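
The effect of such a tolerance can be sketched with java.awt.geom alone. This is an illustration under an assumption, not the library's code: the hypothetical AreaToleranceSketch grows the area by the tolerance on every side and then applies a plain containment test to the text box.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical illustration; not PDF Clown's implementation.
final class AreaToleranceSketch {
    /** Returns the area grown outward by the given tolerance (in points) on every side. */
    static Rectangle2D grow(Rectangle2D area, double tolerance) {
        return new Rectangle2D.Double(
            area.getX() - tolerance,
            area.getY() - tolerance,
            area.getWidth() + 2 * tolerance,
            area.getHeight() + 2 * tolerance);
    }

    /** Containment match: the text box must lie fully inside the grown area. */
    static boolean matches(Rectangle2D area, Rectangle2D textBox, double tolerance) {
        return grow(area, tolerance).contains(textBox);
    }

    public static void main(String[] args) {
        Rectangle2D area = new Rectangle2D.Double(100, 100, 200, 50);
        // Text box that overshoots the left edge of the area by 1.5 points.
        Rectangle2D textBox = new Rectangle2D.Double(98.5, 110, 60, 12);
        System.out.println(matches(area, textBox, 0)); // prints "false"
        System.out.println(matches(area, textBox, 2)); // prints "true"
    }
}
```

With a zero tolerance the slightly overshooting box is rejected; a tolerance of 2 points is enough to admit it.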


isDehyphenated

public boolean isDehyphenated()
Gets whether the text strings have to be dehyphenated.


isSorted

public boolean isSorted()
Gets whether the text strings have to be sorted.


setAreaMode

public void setAreaMode(TextExtractor.AreaModeEnum value)
See Also:
getAreaMode()

setAreas

public void setAreas(List<Rectangle2D> value)
See Also:
getAreas()

setAreaTolerance

public void setAreaTolerance(double value)
See Also:
getAreaTolerance()

setDehyphenated

public void setDehyphenated(boolean value)
See Also:
isDehyphenated()

setSorted

public void setSorted(boolean value)
See Also:
isSorted()

Copyright © 2006-2013 Stefano Chizzolini. Some Rights Reserved.
This documentation is available under the terms of the GNU Free Documentation License.