skillNer.cleaner.Cleaner

class skillNer.cleaner.Cleaner(to_lowercase: bool = True, include_cleaning_functions: List[str] = ['remove_punctuation', 'remove_redundant', 'stem_text', 'lem_text', 'remove_extra_space'], exclude_cleaning_function: List[str] = [])

A class to build pipelines to clean text.

the constructor of the class.

Parameters
  • to_lowercase (bool, optional) – whether to lowercase the text before cleaning it, by default True

  • include_cleaning_functions (List, optional) – List of cleaning operations to include in the pipeline, by default all_cleaning

  • exclude_cleaning_function (List, optional) – List of cleaning operations to exclude for the pipeline, by default []

__init__(to_lowercase: bool = True, include_cleaning_functions: List[str] = ['remove_punctuation', 'remove_redundant', 'stem_text', 'lem_text', 'remove_extra_space'], exclude_cleaning_function: List[str] = [])

the constructor of the class.

Parameters
  • to_lowercase (bool, optional) – whether to lowercase the text before cleaning it, by default True

  • include_cleaning_functions (List, optional) – List of cleaning operations to include in the pipeline, by default all_cleaning

  • exclude_cleaning_function (List, optional) – List of cleaning operations to exclude for the pipeline, by default []

Methods

__init__([to_lowercase, ...])

the constructor of the class.

__call__(text: str) str

To apply the initiallized cleaning pipeline on a given text.

Parameters

text (str) – text to clean

Returns

returns the text after applying all cleaning operations on it.

Return type

str

Examples

>>> from skillNer.cleaner import Cleaner
>>> cleaner = Cleaner(
                to_lowercase=True,
                include_cleaning_functions=["remove_punctuation", "remove_extra_space"]
            )
>>> text = " I am   sentence with   a lot of  annoying extra    spaces    , and !! some ,., meaningless punctuation ?! .! AH AH AH"
>>> cleaner(text)
'i am sentence with a lot of annoying extra spaces and some meaningless punctuation ah ah ah'