PDF Clown 0.1.2 |
java.lang.Object
  org.pdfclown.tools.TextExtractor
public final class TextExtractor
Tool for extracting text from content contexts.
Nested Class Summary | |
---|---|
static class | TextExtractor.AreaModeEnum - Text-to-area matching mode. |
static interface | TextExtractor.IIntervalFilter - Text filter by interval. |
Constructor Summary | |
---|---|
TextExtractor() |
TextExtractor(boolean sorted, boolean dehyphenated) |
TextExtractor(List<Rectangle2D> areas, boolean sorted, boolean dehyphenated) |
Method Summary | |
---|---|
Map<Rectangle2D,List<ITextString>> | extract(Contents contents) - Extracts text strings from the specified contents. |
Map<Rectangle2D,List<ITextString>> | extract(IContentContext contentContext) - Extracts text strings from the specified content context. |
Map<Rectangle2D,List<ITextString>> | filter(List<? extends ITextString> textStrings, Rectangle2D... areas) - Gets the text strings matching the specified areas. |
List<ITextString> | filter(List<? extends ITextString> textStrings, Rectangle2D area) - Gets the text strings matching the specified area. |
List<ITextString> | filter(Map<Rectangle2D,List<ITextString>> textStrings, List<Interval<Integer>> intervals) - Gets the text strings matching the specified intervals. |
Map<Rectangle2D,List<ITextString>> | filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D... areas) - Gets the text strings matching the specified areas. |
List<ITextString> | filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D area) - Gets the text strings matching the specified area. |
void | filter(Map<Rectangle2D,List<ITextString>> textStrings, TextExtractor.IIntervalFilter filter) - Processes the text strings matching the specified filter. |
TextExtractor.AreaModeEnum | getAreaMode() - Gets the text-to-area matching mode. |
List<Rectangle2D> | getAreas() - Gets the graphic areas whose text has to be extracted. |
double | getAreaTolerance() - Gets the admitted outer area (in points) for containment matching purposes. |
boolean | isDehyphenated() - Gets whether the text strings have to be dehyphenated. |
boolean | isSorted() - Gets whether the text strings have to be sorted. |
void | setAreaMode(TextExtractor.AreaModeEnum value) |
void | setAreas(List<Rectangle2D> value) |
void | setAreaTolerance(double value) |
void | setDehyphenated(boolean value) |
void | setSorted(boolean value) |
static String | toString(Map<Rectangle2D,List<ITextString>> textStrings) - Converts text information into plain text. |
static String | toString(Map<Rectangle2D,List<ITextString>> textStrings, String lineSeparator, String areaSeparator) - Converts text information into plain text. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TextExtractor()
public TextExtractor(boolean sorted, boolean dehyphenated)
public TextExtractor(List<Rectangle2D> areas, boolean sorted, boolean dehyphenated)
Method Detail |
---|
public static String toString(Map<Rectangle2D,List<ITextString>> textStrings)
textStrings - Text information to convert.
public static String toString(Map<Rectangle2D,List<ITextString>> textStrings, String lineSeparator, String areaSeparator)
textStrings - Text information to convert.
lineSeparator - Separator to apply on line break.
areaSeparator - Separator to apply on area break.
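
To make the roles of the two separators concrete, here is a minimal sketch of how lineSeparator and areaSeparator might combine. It is not PDF Clown's implementation: it uses plain String values in place of ITextString, a hypothetical helper named join, and it assumes separators are placed only between items rather than after each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeparatorSketch {
    // Hypothetical helper (not part of PDF Clown): joins the lines of each
    // area with lineSeparator, then joins the areas with areaSeparator.
    static String join(List<List<String>> areas, String lineSeparator, String areaSeparator) {
        List<String> blocks = new ArrayList<>();
        for (List<String> lines : areas) {
            blocks.add(String.join(lineSeparator, lines));
        }
        return String.join(areaSeparator, blocks);
    }

    public static void main(String[] args) {
        // Two areas: the first holds two lines, the second holds one.
        List<List<String>> areas = Arrays.asList(
            Arrays.asList("Title", "Subtitle"),
            Arrays.asList("Body line"));
        System.out.println(join(areas, "\n", "\n---\n"));
    }
}
```

Whether the real method also appends a trailing separator after the last line or area is not stated on this page, so check the actual output before relying on exact string comparisons.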
public Map<Rectangle2D,List<ITextString>> extract(IContentContext contentContext)
contentContext - Source content context.
public Map<Rectangle2D,List<ITextString>> extract(Contents contents)
contents - Source contents.
public List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings, List<Interval<Integer>> intervals)
textStrings - Text strings to filter.
intervals - Text intervals to match. They MUST be ordered and not overlapping.
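
Because the intervals must be ordered and not overlapping, it can help to validate them before calling filter. The sketch below shows one way to do that check. It uses hypothetical {low, high} int pairs in place of Interval<Integer>, and it assumes closed intervals, so two intervals that share an endpoint count as overlapping; the page does not say how the library treats that case.

```java
public class IntervalOrderSketch {
    // Each entry is a hypothetical {low, high} pair standing in for Interval<Integer>.
    // Returns true only if every interval is well formed and starts strictly
    // after the previous one ends (closed-interval assumption).
    static boolean isOrderedAndDisjoint(int[][] intervals) {
        for (int i = 0; i < intervals.length; i++) {
            if (intervals[i][0] > intervals[i][1]) {
                return false; // malformed: low is greater than high
            }
            if (i > 0 && intervals[i][0] <= intervals[i - 1][1]) {
                return false; // out of order, or overlapping the previous interval
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isOrderedAndDisjoint(new int[][] {{0, 4}, {10, 15}}));  // prints true
        System.out.println(isOrderedAndDisjoint(new int[][] {{0, 12}, {10, 15}})); // prints false
    }
}
```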
public void filter(Map<Rectangle2D,List<ITextString>> textStrings, TextExtractor.IIntervalFilter filter)
textStrings - Text strings to filter.
filter - Matching processor.
public List<ITextString> filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D area)
textStrings - Text strings to filter, grouped by source area.
area - Graphic area to match text strings against.
public Map<Rectangle2D,List<ITextString>> filter(Map<Rectangle2D,List<ITextString>> textStrings, Rectangle2D... areas)
textStrings - Text strings to filter, grouped by source area.
areas - Graphic areas to match text strings against.
public List<ITextString> filter(List<? extends ITextString> textStrings, Rectangle2D area)
textStrings - Text strings to filter.
area - Graphic area to match text strings against.
public Map<Rectangle2D,List<ITextString>> filter(List<? extends ITextString> textStrings, Rectangle2D... areas)
textStrings - Text strings to filter.
areas - Graphic areas to match text strings against.
public TextExtractor.AreaModeEnum getAreaMode()
public List<Rectangle2D> getAreas()
public double getAreaTolerance()
This measure is useful to ensure that text whose boxes overlap with the area bounds is not excluded from the match.
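
The effect of the tolerance can be pictured as growing the area by that many points on every side before testing containment. The sketch below illustrates the idea with java.awt.geom.Rectangle2D only. The helper containsWithTolerance is hypothetical, and the way PDF Clown applies the tolerance internally is an assumption here.

```java
import java.awt.geom.Rectangle2D;

public class ToleranceSketch {
    // Hypothetical helper (not part of PDF Clown): grows the area by
    // `tolerance` points on each side, then tests whether the text box
    // lies entirely inside the grown rectangle.
    static boolean containsWithTolerance(Rectangle2D area, Rectangle2D textBox, double tolerance) {
        Rectangle2D grown = new Rectangle2D.Double(
            area.getX() - tolerance,
            area.getY() - tolerance,
            area.getWidth() + 2 * tolerance,
            area.getHeight() + 2 * tolerance);
        return grown.contains(textBox);
    }

    public static void main(String[] args) {
        Rectangle2D area = new Rectangle2D.Double(100, 100, 200, 50);
        // The text box starts 1 point to the left of the area's left edge.
        Rectangle2D textBox = new Rectangle2D.Double(99, 101, 50, 12);
        System.out.println(containsWithTolerance(area, textBox, 0)); // prints false
        System.out.println(containsWithTolerance(area, textBox, 2)); // prints true
    }
}
```

With zero tolerance the slightly protruding text box is rejected; a tolerance of 2 points is enough to admit it.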
public boolean isDehyphenated()
public boolean isSorted()
public void setAreaMode(TextExtractor.AreaModeEnum value)
See Also: getAreaMode()
public void setAreas(List<Rectangle2D> value)
See Also: getAreas()
public void setAreaTolerance(double value)
See Also: getAreaTolerance()
public void setDehyphenated(boolean value)
See Also: isDehyphenated()
public void setSorted(boolean value)
See Also: isSorted()