Utilities
This section documents utility functions and helper classes used throughout Pydoll.
pydoll.utils
TextExtractor
Bases: HTMLParser
HTML parser for text extraction.
Extracts visible text content from an HTML string, excluding the contents of tags specified in _skip_tags.
handle_starttag
Marks the parser to skip content inside tags specified in _skip_tags.
PARAMETER | DESCRIPTION |
---|---|
tag | The tag name. |
attrs | A list of (attribute, value) pairs. |
handle_endtag
Marks the end of a skipped tag so that the parser resumes collecting text.
PARAMETER | DESCRIPTION |
---|---|
tag | The tag name. |
handle_data
Handles text nodes. Adds them to the result unless they are within a skip tag.
PARAMETER | DESCRIPTION |
---|---|
data | The text data. |
get_strings
Yields all collected visible text fragments.
PARAMETER | DESCRIPTION |
---|---|
strip | Whether to strip leading/trailing whitespace from each fragment. |
YIELDS | DESCRIPTION |
---|---|
str | Visible text fragments. |
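The handler methods above can be sketched with the standard library's `HTMLParser`. This is a minimal illustration, not Pydoll's actual implementation: the class name `VisibleTextParser`, the contents of its `_skip_tags` set, and the choice to drop empty fragments are all assumptions made for the example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical stand-in for TextExtractor (illustration only)."""

    # Assumed skip set; the real _skip_tags may contain other tags.
    _skip_tags = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        # Entering a skipped tag: stop collecting text.
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Leaving a skipped tag: resume collecting once depth returns to 0.
        if tag in self._skip_tags and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text nodes only when not inside a skipped tag.
        if self._skip_depth == 0:
            self._parts.append(data)

    def get_strings(self, strip=False):
        for part in self._parts:
            text = part.strip() if strip else part
            if text:  # assumption: empty fragments are dropped
                yield text


parser = VisibleTextParser()
parser.feed("<p>Hello <b>world</b></p><script>var x = 1;</script>")
fragments = list(parser.get_strings(strip=True))
print(fragments)
```

The script body never reaches the result because `handle_data` ignores text while the skip depth is non-zero.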
extract_text_from_html
Extracts visible text content from an HTML string.
PARAMETER | DESCRIPTION |
---|---|
html | The HTML string to extract text from. |
separator | String inserted between extracted text fragments. Defaults to ''. |
strip | Whether to strip whitespace from text fragments. Defaults to False. |
RETURNS | DESCRIPTION |
---|---|
str | The extracted visible text. |
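To show how `separator` and `strip` might interact, here is a small sketch that joins already-extracted fragments. The helper `join_fragments` is hypothetical, and dropping fragments that are empty after stripping is an assumption, not confirmed Pydoll behavior.

```python
def join_fragments(fragments, separator="", strip=False):
    """Hypothetical helper: join text fragments the way the parameters suggest."""
    if strip:
        fragments = [fragment.strip() for fragment in fragments]
    # Assumption: fragments that are empty are left out of the result.
    return separator.join(fragment for fragment in fragments if fragment)


raw = ["Title", "\n  ", "First paragraph.", " "]

# Defaults: fragments are concatenated as-is, whitespace included.
default_text = join_fragments(raw)

# A space separator with stripping gives a compact single-line result.
spaced_text = join_fragments(raw, separator=" ", strip=True)
print(repr(default_text), repr(spaced_text))
```

With the documented defaults (`separator=''`, `strip=False`), adjacent fragments run together, so passing an explicit separator is usually what you want for readable output.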
decode_base64_to_bytes
Decodes a base64 image string to bytes.
PARAMETER | DESCRIPTION |
---|---|
image | The base64 image string to decode. |
RETURNS | DESCRIPTION |
---|---|
bytes | The decoded image as bytes. |
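The decoding step can be illustrated with the standard library's `base64` module. This sketch does not call Pydoll. It round-trips the 8-byte PNG signature to show what a decoded result looks like. Whether the real function accepts a `data:` URI prefix is not stated here, so the example uses a plain base64 string.

```python
import base64

# The first 8 bytes of every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Encode to a base64 string, as a screenshot payload might arrive.
encoded = base64.b64encode(PNG_SIGNATURE).decode("ascii")

# Decode back to raw bytes, ready to be written to a file.
decoded = base64.b64decode(encoded)
print(encoded, decoded)
```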
get_browser_ws_address
async
Fetches the WebSocket address for the browser instance.
RETURNS | DESCRIPTION |
---|---|
str | The WebSocket address for the browser. |
RAISES | DESCRIPTION |
---|---|
NetworkError | If the address cannot be fetched due to network errors or missing data. |
InvalidResponse | If the response is not valid JSON. |
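The two error cases can be sketched by looking only at how a response body might be parsed. This is a minimal sketch under stated assumptions: the exception classes below are local stand-ins rather than Pydoll's own, the helper `parse_ws_address` is hypothetical, and the `webSocketDebuggerUrl` field is the one Chromium-based browsers expose on their `/json/version` debugging endpoint. Whether Pydoll reads exactly that field is an assumption.

```python
import json


class NetworkError(Exception):
    """Local stand-in for the exception named above (hypothetical definition)."""


class InvalidResponse(Exception):
    """Local stand-in for the exception named above (hypothetical definition)."""


def parse_ws_address(body: str) -> str:
    """Hypothetical helper: pull the WebSocket URL out of a response body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise InvalidResponse("response is not valid JSON") from exc
    try:
        # Assumed field name, as served by Chromium's /json/version endpoint.
        return payload["webSocketDebuggerUrl"]
    except (KeyError, TypeError) as exc:
        raise NetworkError("response has no WebSocket address") from exc


sample = (
    '{"Browser": "Chrome/120.0", '
    '"webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc"}'
)
address = parse_ws_address(sample)
print(address)
```

Keeping the parsing separate from the network call makes both failure modes testable without a running browser.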
validate_browser_paths
Validates potential browser executable paths and returns the first valid one.
Checks a list of possible browser binary locations to find an existing, executable browser. This is used by browser-specific subclasses to locate the browser executable when no explicit binary path is provided.
PARAMETER | DESCRIPTION |
---|---|
paths | List of potential file paths to check for the browser executable. These should be absolute paths appropriate for the current OS. |
RETURNS | DESCRIPTION |
---|---|
str | The first valid browser executable path found. |
RAISES | DESCRIPTION |
---|---|
InvalidBrowserPath | If none of the given paths points to a valid browser executable. |
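A first-match search over candidate paths could look like the sketch below. It is not Pydoll's implementation: `first_executable` is a hypothetical helper, the exception class is a local stand-in, and "valid" is assumed to mean "an existing file with execute permission".

```python
import os
import sys


class InvalidBrowserPath(Exception):
    """Local stand-in for the exception named above (hypothetical definition)."""


def first_executable(paths):
    """Hypothetical helper: return the first path that is an executable file."""
    for path in paths:
        # Assumption: valid means the file exists and is executable.
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    raise InvalidBrowserPath(f"No valid browser path found in: {paths}")


# The Python interpreter stands in for a browser binary so the example runs anywhere.
candidates = ["/nonexistent/browser-binary", sys.executable]
chosen = first_executable(candidates)
print(chosen)
```

Ordering the candidate list from most to least preferred location lets the first match act as the default.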
clean_script_for_analysis
Cleans JavaScript code by removing comments and string literals.
This helps avoid false positives when analyzing script structure.
PARAMETER | DESCRIPTION |
---|---|
script | JavaScript code to clean. |
RETURNS | DESCRIPTION |
---|---|
str | Cleaned script with comments and strings removed. |
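One common way to strip comments and string literals is a single regular expression that matches whichever token starts first, so a `//` inside a string is not mistaken for a comment. The sketch below illustrates that technique. It is not Pydoll's code, `strip_comments_and_strings` is a hypothetical name, and it does not handle regex literals or nested template expressions.

```python
import re

# One alternation, scanned left to right: the token that starts first wins.
_TOKEN = re.compile(
    r"""
      //[^\n]*                  # line comment
    | /\*.*?\*/                 # block comment
    | "(?:\\.|[^"\\\n])*"       # double-quoted string
    | '(?:\\.|[^'\\\n])*'       # single-quoted string
    | `(?:\\.|[^`\\])*`         # template literal without nested handling
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_comments_and_strings(script: str) -> str:
    """Hypothetical helper: blank out comments and string literals."""
    # Replace with a space so neighboring tokens do not merge.
    return _TOKEN.sub(" ", script)


source = 'var s = "return 1"; // return here\nfoo(s); /* return */'
cleaned = strip_comments_and_strings(source)

url_source = 'var u = "http://example.test"; bar();'
url_cleaned = strip_comments_and_strings(url_source)
print(repr(cleaned), repr(url_cleaned))
```

After cleaning, a search for keywords such as `return` only sees real code, which is the false-positive problem described above.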
is_script_already_function
Checks whether a JavaScript script is already wrapped in a function.
PARAMETER | DESCRIPTION |
---|---|
script | JavaScript code to analyze. |
RETURNS | DESCRIPTION |
---|---|
bool | True if script is already a function, False otherwise. |
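A simple heuristic for this check is to test whether the script begins with a function expression or an arrow function. The sketch below is an illustration only: `looks_like_function` is a hypothetical name, and the pattern only inspects the start of the script, so a function declaration followed by other statements would also match.

```python
import re

_FUNCTION_START = re.compile(
    r"""
    ^\s*(?:
        (?:async\s+)?function\b                  # function keyword form
      | (?:async\s+)?\([^)]*\)\s*=>              # (a, b) => ...
      | (?:async\s+)?[A-Za-z_$][\w$]*\s*=>       # x => ...
    )
    """,
    re.VERBOSE,
)


def looks_like_function(script: str) -> bool:
    """Hypothetical helper: True if the script starts like a function."""
    return bool(_FUNCTION_START.match(script))


checks = {
    "function() { return 1; }": looks_like_function("function() { return 1; }"),
    "() => document.title": looks_like_function("() => document.title"),
    "async x => x * 2": looks_like_function("async x => x * 2"),
    "return document.title": looks_like_function("return document.title"),
    "functionName()": looks_like_function("functionName()"),
}
print(checks)
```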
has_return_outside_function
Checks whether a JavaScript script has return statements outside of functions.
PARAMETER | DESCRIPTION |
---|---|
script | JavaScript code to analyze.
RETURNS | DESCRIPTION |
---|---|
bool | True if script has return outside function, False otherwise. |
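Detecting a top-level `return` can be approximated by tracking which open braces belong to function bodies. The following is a minimal sketch, not Pydoll's algorithm: `return_outside_function` is a hypothetical name, the input is assumed to be already free of comments and string literals, and method shorthand such as `foo() { ... }` is not recognized as a function.

```python
import re

# Tokens of interest: function keyword, arrow with block body, return, braces.
_TOKENS = re.compile(r"\bfunction\b|=>\s*\{|\breturn\b|[{}]")


def return_outside_function(script: str) -> bool:
    """Hypothetical helper: True if a return appears outside any function body."""
    stack = []  # one bool per open brace: True if it opened a function body
    pending_function = False  # saw `function`, waiting for its opening brace
    for match in _TOKENS.finditer(script):
        token = match.group()
        if token == "function":
            pending_function = True
        elif token.startswith("=>"):
            stack.append(True)  # arrow function with a block body
        elif token == "{":
            stack.append(pending_function)
            pending_function = False
        elif token == "}":
            if stack:
                stack.pop()
        elif token == "return" and not any(stack):
            return True  # no enclosing function body
    return False


top_level = return_outside_function("return document.title;")
inside_fn = return_outside_function("function f() { return 1; } f();")
inside_if = return_outside_function("if (x) { return 1; }")
inside_arrow = return_outside_function("const f = () => { if (a) { return 1; } };")
print(top_level, inside_fn, inside_if, inside_arrow)
```

A script with a top-level `return` is only valid once wrapped in a function, which is why a check like this is useful before evaluating user-supplied code.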