Utilities

This section documents utility functions and helper classes used throughout Pydoll.

pydoll.utils

logger module-attribute

logger = getLogger(__name__)

TextExtractor

TextExtractor()

Bases: HTMLParser

HTML parser for text extraction.

Extracts visible text content from an HTML string, excluding the contents of tags specified in _skip_tags.

handle_starttag

handle_starttag(tag, attrs)

Tells the parser to start skipping content when it enters a tag listed in _skip_tags.

PARAMETER DESCRIPTION
tag

The tag name.

TYPE: str

attrs

A list of (attribute, value) pairs.

TYPE: list

handle_endtag

handle_endtag(tag)

Tells the parser that a skipped tag has ended, so text collection can resume.

PARAMETER DESCRIPTION
tag

The tag name.

TYPE: str

handle_data

handle_data(data)

Handles text nodes. Adds them to the result unless they are within a skip tag.

PARAMETER DESCRIPTION
data

The text data.

TYPE: str

get_strings

get_strings(strip)

Yields all collected visible text fragments.

PARAMETER DESCRIPTION
strip

Whether to strip leading/trailing whitespace from each fragment.

TYPE: bool

YIELDS DESCRIPTION
str

Visible text fragments.

get_text

get_text(separator, strip)

Returns all visible text.

PARAMETER DESCRIPTION
separator

String inserted between extracted text fragments.

TYPE: str

strip

Whether to strip whitespace from each fragment.

TYPE: bool

RETURNS DESCRIPTION
str

The visible text.

TYPE: str
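The handler methods above follow the standard html.parser.HTMLParser callback pattern. The following is a minimal sketch of that pattern, not Pydoll's source: the class name VisibleTextParser, the contents of _skip_tags, the default argument values, and the depth counter are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Illustrative stand-in for TextExtractor (hypothetical, not Pydoll's code)."""

    # Assumed skip set; the real _skip_tags may contain different tags.
    _skip_tags = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._parts = []       # collected text fragments
        self._skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_strings(self, strip=False):
        for part in self._parts:
            text = part.strip() if strip else part
            if text:
                yield text

    def get_text(self, separator="", strip=False):
        return separator.join(self.get_strings(strip=strip))


parser = VisibleTextParser()
parser.feed("<p>Hello <b>world</b></p><script>var x = 1;</script>")
parser.close()
text = parser.get_text(separator=" ", strip=True)
print(text)
```

With strip enabled, fragments that consist only of whitespace are dropped in this sketch; whether Pydoll does the same is not stated on this page.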

extract_text_from_html

extract_text_from_html(html, separator='', strip=False)

Extracts visible text content from an HTML string.

PARAMETER DESCRIPTION
html

The HTML string to extract text from.

TYPE: str

separator

String inserted between extracted text fragments. Defaults to ''.

TYPE: str DEFAULT: ''

strip

Whether to strip whitespace from text fragments. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
str

The extracted visible text.

TYPE: str

decode_base64_to_bytes

decode_base64_to_bytes(image)

Decodes a base64 image string to bytes.

PARAMETER DESCRIPTION
image

The base64 image string to decode.

TYPE: str

RETURNS DESCRIPTION
bytes

The decoded image as bytes.

TYPE: bytes
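As a rough illustration of what this decoding step involves, the sketch below uses the standard library's base64.b64decode. The helper name decode_image is hypothetical, and the sample payload is just the 8-byte PNG signature rather than a real screenshot.

```python
import base64


def decode_image(image: str) -> bytes:
    """Hypothetical equivalent: turn a base64 string into raw bytes."""
    return base64.b64decode(image.encode("utf-8"))


# Round-trip the PNG file signature to show the decoding step.
png_signature = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_signature).decode("ascii")
raw = decode_image(encoded)
print(raw == png_signature)
```

The resulting bytes can be written to disk with a file opened in binary mode.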

get_browser_ws_address async

get_browser_ws_address(port)

Fetches the WebSocket address for the browser instance listening on the given port.

RETURNS DESCRIPTION
str

The WebSocket address for the browser.

TYPE: str

RAISES DESCRIPTION
NetworkError

If the address cannot be fetched due to network errors or missing data.

InvalidResponse

If the response is not valid JSON.
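The two error cases above can be pictured with a small offline sketch. It assumes the address comes from the webSocketDebuggerUrl field that Chromium-based browsers expose at the /json/version endpoint of the remote debugging port; this page does not confirm that Pydoll uses that endpoint. The exception classes here are local stand-ins that only borrow the documented names, and parse_ws_address is a hypothetical helper.

```python
import json


class NetworkError(Exception):
    """Local stand-in for the documented NetworkError."""


class InvalidResponse(Exception):
    """Local stand-in for the documented InvalidResponse."""


def parse_ws_address(body: str) -> str:
    """Hypothetical helper: pull the WebSocket URL out of a response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise InvalidResponse("response is not valid JSON") from exc
    try:
        return data["webSocketDebuggerUrl"]
    except (KeyError, TypeError) as exc:
        # Valid JSON, but the expected field is missing.
        raise NetworkError("missing webSocketDebuggerUrl") from exc


# Sample body shaped like a /json/version response (values are made up).
sample = (
    '{"Browser": "Chrome/120.0", '
    '"webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc"}'
)
address = parse_ws_address(sample)
print(address)
```

A real implementation would also need to perform the HTTP request asynchronously and map connection failures to NetworkError.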

validate_browser_paths

validate_browser_paths(paths)

Validates potential browser executable paths and returns the first valid one.

Checks a list of possible browser binary locations to find an existing, executable browser. This is used by browser-specific subclasses to locate the browser executable when no explicit binary path is provided.

PARAMETER DESCRIPTION
paths

List of potential file paths to check for the browser executable. These should be absolute paths appropriate for the current OS.

TYPE: list[str]

RETURNS DESCRIPTION
str

The first valid browser executable path found.

TYPE: str

RAISES DESCRIPTION
InvalidBrowserPath

If no valid browser executable is found at any of the given paths.
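The "first valid path wins" behaviour described above might look roughly like the sketch below. It is an assumption that validity means "exists and is executable" as checked with os.path.exists and os.access; the function name first_executable and the local InvalidBrowserPath class are hypothetical stand-ins.

```python
import os
import sys


class InvalidBrowserPath(Exception):
    """Local stand-in for the documented InvalidBrowserPath."""


def first_executable(paths):
    """Hypothetical helper: return the first existing, executable path."""
    for path in paths:
        if os.path.exists(path) and os.access(path, os.X_OK):
            return path
    raise InvalidBrowserPath(f"No valid browser path found in: {paths}")


# sys.executable is used only as a file that is known to be executable.
found = first_executable(["/nonexistent/browser", sys.executable])
print(found)
```

Ordering the candidate list from most to least preferred location lets the caller control which installation is picked when several exist.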

clean_script_for_analysis

clean_script_for_analysis(script)

Clean JavaScript code by removing comments and string literals.

This helps avoid false positives when analyzing script structure.

PARAMETER DESCRIPTION
script

JavaScript code to clean.

TYPE: str

RETURNS DESCRIPTION
str

Cleaned script with comments and strings removed.

TYPE: str
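To see why this cleaning step avoids false positives, consider a script where the word return appears only inside a string or a comment. The sketch below is one regex-based way to do such cleaning, not necessarily how Pydoll does it. The name strip_comments_and_strings is hypothetical, string literals are replaced with an empty literal rather than deleted (a choice made here for illustration), and JavaScript regex literals are not handled.

```python
import re

# Each alternative starts with a different character, so the leftmost
# token wins: "//" inside a string is not mistaken for a comment.
_TOKEN = re.compile(
    r"//[^\n]*"              # line comment
    r"|/\*.*?\*/"            # block comment
    r'|"(?:\\.|[^"\\])*"'    # double-quoted string
    r"|'(?:\\.|[^'\\])*'"    # single-quoted string
    r"|`(?:\\.|[^`\\])*`",   # template literal
    re.DOTALL,
)


def _replace(match):
    token = match.group(0)
    # Drop comments entirely; keep an empty placeholder for strings.
    return "" if token.startswith("/") else '""'


def strip_comments_and_strings(script: str) -> str:
    """Hypothetical helper: remove comments and string contents."""
    return _TOKEN.sub(_replace, script)


script = 'var a = "return 1"; // return 2\n/* return 3 */ var b = \'x\';'
cleaned = strip_comments_and_strings(script)
print(cleaned)
```

After cleaning, a keyword search over the result sees only real code tokens.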

is_script_already_function

is_script_already_function(script)

Check if a JavaScript script is already wrapped in a function.

PARAMETER DESCRIPTION
script

JavaScript code to analyze.

TYPE: str

RETURNS DESCRIPTION
bool

True if script is already a function, False otherwise.

TYPE: bool
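A check like this can be approximated with a regular expression over the whole script. The sketch below is only a heuristic written for illustration: the name looks_like_function and the three recognised forms are assumptions, and nested parentheses in parameter lists (for example default values that call functions) are not handled.

```python
import re

_FUNCTION_FORMS = (
    r"(?:async\s+)?function\b[^(]*\([^)]*\)\s*\{.*\}",  # function (a, b) { ... }
    r"(?:async\s+)?\([^)]*\)\s*=>.*",                   # (a, b) => ...
    r"(?:async\s+)?[A-Za-z_$][\w$]*\s*=>.*",            # a => ...
)
_FUNCTION_PATTERN = re.compile(
    r"^(?:" + "|".join(_FUNCTION_FORMS) + r")$", re.DOTALL
)


def looks_like_function(script: str) -> bool:
    """Hypothetical helper: does the whole script look like one function?"""
    return _FUNCTION_PATTERN.match(script.strip()) is not None


print(looks_like_function("function() { return 1; }"))
print(looks_like_function("() => document.title"))
print(looks_like_function("return document.title"))
```

Running the pattern on a script that has already had comments and strings removed reduces the chance that text inside a literal triggers a match.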

has_return_outside_function

has_return_outside_function(script)

Check if a JavaScript script has return statements outside of functions.

PARAMETER DESCRIPTION
script

JavaScript code to analyze.

TYPE: str

RETURNS DESCRIPTION
bool

True if script has return outside function, False otherwise.

TYPE: bool
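One way to approximate this check is to track which open braces belong to function bodies and flag any return seen while no function body is open. The sketch below is such a heuristic, not Pydoll's implementation and not a JavaScript parser: the name return_outside_function is hypothetical, the input is assumed to be free of comments and string literals, and method shorthand such as foo() { ... } is not recognised as a function.

```python
import re

_TOKEN = re.compile(r"\bfunction\b|=>|\breturn\b|[{};]")


def return_outside_function(script: str) -> bool:
    """Hypothetical helper: is there a return not enclosed by a function body?"""
    stack = []                 # one bool per open brace: True = function body
    pending_function = False   # saw "function" or "=>" and expect its "{"
    for match in _TOKEN.finditer(script):
        token = match.group(0)
        if token in ("function", "=>"):
            pending_function = True
        elif token == "{":
            stack.append(pending_function)
            pending_function = False
        elif token == "}":
            if stack:
                stack.pop()
        elif token == ";":
            # An expression-bodied arrow ends without opening a brace.
            pending_function = False
        elif token == "return" and not any(stack):
            return True
    return False


print(return_outside_function("return document.title;"))
print(return_outside_function("function f() { return 1; } f();"))
```

A script flagged by a check like this would be a syntax error if evaluated as-is, which is presumably why it needs to be wrapped in a function first.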