redshred package
Subpackages
- redshred.api package
- redshred.cli package
- redshred.enrichments package
- Submodules
- redshred.enrichments.base module
- redshred.enrichments.defined_acronyms module
- redshred.enrichments.external_api module
- redshred.enrichments.grouper module
GrouperPerspective
GrouperPerspectiveConfig
GrouperPerspectiveConfig.Config
GrouperPerspectiveConfig.hull_method
GrouperPerspectiveConfig.hull_method_options
GrouperPerspectiveConfig.operation_labels
GrouperPerspectiveConfig.operations
GrouperPerspectiveConfig.root_label
GrouperPerspectiveConfig.whitespace_calculation_method
GrouperPerspectiveConfig.whitespace_method_options
GrouperPerspectiveConfig.x_gap
GrouperPerspectiveConfig.y_gap
GrouperPerspectiveHullMethod
object_setattr()
- redshred.enrichments.huggingface module
HuggingfacePerspective
HuggingfacePerspectiveConfig
HuggingfacePerspectiveConfig.Config
HuggingfacePerspectiveConfig.model
HuggingfacePerspectiveConfig.model_class
HuggingfacePerspectiveConfig.model_source
HuggingfacePerspectiveConfig.pipeline_task
HuggingfacePerspectiveConfig.task_config
HuggingfacePerspectiveConfig.task_config_class
HuggingfacePerspectiveConfig.task_specific_template
HuggingfacePerspectiveConfig.tokenizer
HuggingfacePerspectiveConfig.tokenizer_class
object_setattr()
- redshred.enrichments.iris module
- redshred.enrichments.page_images module
BackendOptions
PageImagesPerspective
PageImagesPerspectiveBackend
PageImagesPerspectiveConfig
object_setattr()
- redshred.enrichments.pdftotext module
- redshred.enrichments.preprocess module
- redshred.enrichments.regex module
- redshred.enrichments.sentences module
- redshred.enrichments.spacy module
- redshred.enrichments.tfidf module
TFIDFPerspective
TFIDFPerspectiveConfig
TFIDFPerspectiveNorm
object_setattr()
- redshred.enrichments.typography module
- Module contents
- redshred.microservices package
- redshred.models package
- Submodules
- redshred.models.api module
APIObjectIterator
ApiObject
Collection
Collection.client
Collection.config
Collection.create()
Collection.created_at
Collection.created_by
Collection.delete()
Collection.description
Collection.document()
Collection.documents()
Collection.documents_link
Collection.id
Collection.load()
Collection.marked_for_delete
Collection.metadata
Collection.name
Collection.owner
Collection.perspective()
Collection.perspectives()
Collection.perspectives_link
Collection.segment()
Collection.segments()
Collection.segments_link
Collection.self_link
Collection.slug
Collection.updated_at
Collection.updated_by
Collection.upload_csv()
Collection.upload_file()
Collection.upload_text()
Collection.upload_url()
Collection.user_data
CollectionIterator
Document
Document.collection()
Document.collection_link
Document.collection_slug
Document.config
Document.content_hash
Document.create()
Document.created_at
Document.created_by
Document.csv_metadata
Document.description
Document.document_segment_link
Document.download()
Document.download_bytes()
Document.errors
Document.file_link
Document.file_size
Document.id
Document.index
Document.metadata
Document.n_pages
Document.name
Document.original_name
Document.page()
Document.pages()
Document.pages_link
Document.pdf_link
Document.perspective()
Document.perspectives()
Document.perspectives_link
Document.read_state
Document.read_state_updated_at
Document.region
Document.reread_document()
Document.segment()
Document.segments()
Document.segments_link
Document.self_link
Document.slug
Document.source
Document.summary
Document.text
Document.uniqueness_id
Document.updated_at
Document.updated_by
Document.user_data
Document.wait_until_read()
Document.warnings
DocumentIterator
Page
Page.collection()
Page.collection_link
Page.collection_slug
Page.content_hash
Page.created_at
Page.created_by
Page.document()
Page.document_index
Page.document_name
Page.dpi
Page.height
Page.id
Page.index
Page.metadata
Page.name
Page.next()
Page.page_segment_link
Page.perspective()
Page.perspectives()
Page.perspectives_link
Page.previous()
Page.region
Page.segment()
Page.segments()
Page.segments_link
Page.self_link
Page.summary
Page.text
Page.tokens()
Page.tokens_file_link
Page.units
Page.updated_at
Page.updated_by
Page.user_data
Page.width
PageIterator
Perspective
Perspective.bulk_create_segments()
Perspective.cache_id
Perspective.collection()
Perspective.collection_link
Perspective.collection_slug
Perspective.create()
Perspective.created_at
Perspective.created_by
Perspective.description
Perspective.document()
Perspective.document_link
Perspective.document_name
Perspective.enrichment_config
Perspective.enrichment_name
Perspective.errors
Perspective.id
Perspective.metadata
Perspective.name
Perspective.segment()
Perspective.segment_types
Perspective.segments()
Perspective.segments_link
Perspective.self_link
Perspective.slug
Perspective.updated_at
Perspective.updated_by
Perspective.user_data
Perspective.warnings
PerspectiveIterator
RedShredUser
Segment
Segment.between()
Segment.bounding_box
Segment.cache_id
Segment.collection()
Segment.collection_link
Segment.collection_slug
Segment.create()
Segment.created_at
Segment.created_by
Segment.document()
Segment.document_link
Segment.document_name
Segment.enrichment_data
Segment.enrichment_name
Segment.errors
Segment.get_segment_image()
Segment.get_segments_from_perspective()
Segment.get_text()
Segment.id
Segment.labels
Segment.max_x
Segment.max_y
Segment.metadata
Segment.min_x
Segment.min_y
Segment.perspective()
Segment.perspective_link
Segment.q()
Segment.regions
Segment.segment_type
Segment.self_link
Segment.summary
Segment.text
Segment.updated_at
Segment.updated_by
Segment.user_data
Segment.warnings
SegmentIterator
SerializableModel
Token
get_type()
- redshred.models.configuration module
AdvancedOCRTokenizerConfig
AdvancedOCRTokenizerConfigOptions
CollectionConfiguration
CollectionConfiguration.Config
CollectionConfiguration.allow_anonymous_downloads
CollectionConfiguration.dict()
CollectionConfiguration.document_uniqueness
CollectionConfiguration.enrichments
CollectionConfiguration.from_dict()
CollectionConfiguration.json()
CollectionConfiguration.notifications
CollectionConfiguration.tokenizer
CollectionConfiguration.validate_remote_schema()
CollectionConfiguration.yaml()
ConfiguredTokenizer
DocumentUniqueness
NotificationConfiguration
PerspectiveConfiguration
TesseractTokenizerConfig
TesseractTokenizerConfigOptions
Tokenizers
- Module contents
- redshred.visualize package
Submodules
redshred.configuration module
- class redshred.configuration.Configuration(token=None, host=None, host_verify=None, config_path=None, context_override=None)[source]
Bases: object
Handles configuration for the RedShred API client.
The order in which the configuration sources are looked up is as follows:
1. Explicit values given to the constructor.
2. Environment variables: REDSHRED_TOKEN and REDSHRED_HOST.
3. Environment variables: REDSHRED_CONTEXT and REDSHRED_CONFIG.
4. REDSHRED_CONTEXT alone (uses the default config path).
5. REDSHRED_CONFIG alone (defaults to currentContext).
6. None given: uses the default config path with currentContext.
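The lookup order above can be sketched as a small decision function. This is a minimal illustration, not the library's implementation: `describe_config_source` and its return labels are hypothetical, and treating "explicit values" as requiring both a token and a host is an assumption. Only the environment variable names come from the documentation above.

```python
import os
from typing import Mapping, Optional


def describe_config_source(
    token: Optional[str] = None,
    host: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a label for the first configuration source that applies.

    Hypothetical helper: it mirrors the documented lookup order only and
    never reads a configuration file.
    """
    env = os.environ if env is None else env

    # 1. Explicit constructor values win (assumed: both must be given).
    if token is not None and host is not None:
        return "explicit"
    # 2. Token and host taken directly from the environment.
    if "REDSHRED_TOKEN" in env and "REDSHRED_HOST" in env:
        return "env: token and host"

    has_context = "REDSHRED_CONTEXT" in env
    has_config = "REDSHRED_CONFIG" in env
    # 3. Both a context name and a config path are given.
    if has_context and has_config:
        return "env: context and config"
    # 4. Context only: fall back to the default config path.
    if has_context:
        return "env: context, default config path"
    # 5. Config path only: fall back to currentContext.
    if has_config:
        return "env: config, currentContext"
    # 6. Nothing given at all.
    return "default config path, currentContext"
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.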
- Attributes:
context (str): The current context name.
user (Optional[str]): The username, if available.
host (str): The RedShred API server URL.
token (str): The authentication token for RedShred API.
verify (bool): Flag indicating whether to verify the server’s SSL certificate.
options (Dict[str, Any]): Additional configuration options.
source (str): The source of the configuration, such as environment variables or config file path.
- Raises:
ConfigurationError: If there is a general configuration error.
ConfigurationFileError: If there is a problem specifically with the configuration file format.
- context: str
- host: str
- info()[source]
Provides a multiline string representation of the configuration details.
The returned string includes the configuration context, host, obfuscated token, username, and the source from which the configuration was obtained.
- Returns:
str: A human-readable, formatted summary of the configuration settings.
- options: Dict[str, Any]
- source: str
- token: str
- user: str | None
- verify: bool
- exception redshred.configuration.ConfigurationError[source]
Bases: Exception
Exception raised for general configuration errors.
This error is used for broader configuration issues not necessarily related to the file format itself, such as missing environment variables or failure to locate a required configuration file.
- exception redshred.configuration.ConfigurationFileError[source]
Bases: Exception
Exception raised for problems with the configuration file format.
This error is raised when the RedShred configuration file cannot be parsed, for example because of improper formatting or missing required fields.
redshred.exceptions module
- exception redshred.exceptions.RedShredAPIError(*args, reason=None, **kwargs)[source]
Bases: Exception
- reason: Any
- exception redshred.exceptions.RedShredFileExistsError(*args, **kwargs)[source]
Bases: RedShredHTTPError
redshred.spatial module
- class redshred.spatial.BoundingBox(initlist: Iterable | None = None)[source]
Bases: list, _BaseGeometricMixin
- as_boundingbox() BoundingBox [source]
Get a BoundingBox object from a GeoJSON object
- classmethod from_shape(geometry) BoundingBox [source]
- get_bounds() BoundingBox [source]
Provided for API consistency with GeoJSON.
- get_offsets(numpy=False) tuple[int, int] [source]
The x and y offsets for each coordinate of the GeoJSON object.
- property height
- property min_x
- property min_y
- normalize_to_page() BoundingBox [source]
- rotate(rotation, origin=(0.5, 0.5)) BoundingBox [source]
- scale(xfact=1.0, yfact=1.0, origin=(0, 0)) BoundingBox [source]
Shapely scaling transformation.
- translate(xoff=0.0, yoff=0.0) BoundingBox [source]
Shapely offset transformation.
- property width
- class redshred.spatial.GeoJSON[source]
Bases: dict, _BaseGeometricMixin
- as_boundingbox() BoundingBox [source]
Get a BoundingBox object from a GeoJSON object
- as_shape() BaseGeometry [source]
Get a shapely object from the GeoJSON dictionary.
- Returns:
BaseGeometry
- classmethod from_bounds(x_min, y_min, x_max, y_max, base_class='polygon')[source]
Create a new object from x_min, y_min, x_max, y_max values. By default, a polygon is returned; pass "multipolygon" as base_class to override this.
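As a rough picture of what building a polygon from bounds involves, the sketch below constructs a GeoJSON-style Polygon dictionary from the four values using only the standard library. `polygon_from_bounds` is a hypothetical helper, and the vertex order it uses is an assumption; the library's own from_bounds may order or wrap the coordinates differently.

```python
from typing import Dict, List


def polygon_from_bounds(
    x_min: float, y_min: float, x_max: float, y_max: float
) -> Dict:
    """Build a GeoJSON-style Polygon dict covering the given rectangle."""
    ring: List[List[float]] = [
        [x_min, y_min],
        [x_max, y_min],
        [x_max, y_max],
        [x_min, y_max],
        [x_min, y_min],  # a GeoJSON ring repeats its first point to close
    ]
    # A Polygon holds a list of rings; the first one is the exterior.
    return {"type": "Polygon", "coordinates": [ring]}
```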
- classmethod from_coords(coords)[source]
Create a new GeoJSON object from either a list of coordinates (for a Polygon) or a list of lists of coordinates (for a MultiPolygon).
- get_bounds() BoundingBox [source]
Get a BoundingBox object from a GeoJSON object
- get_coordinates(canonical=True, numpy=False) List[List[float]] [source]
Get a coordinates list object from a GeoJSON object
- get_offsets(numpy=False) Tuple[int, int] [source]
The x and y offsets for each coordinate of the GeoJSON object.
- property height
- property min_x
- property min_y
- normalize_to_page() GeoJSON [source]
Normalize the given GeoJSON from the quiltspace or document space to a [0,1], [0,1] scale. This returns a new GeoJSON instance.
- rotate(angle, origin='center') GeoJSON [source]
Rotate a GeoJSON object using affine transformations. This returns a new GeoJSON instance.
- scale(xfact=1.0, yfact=1.0, origin=(0, 0)) GeoJSON [source]
Scale a GeoJSON object using affine transformations. This returns a new GeoJSON instance.
- translate(xoff=0.0, yoff=0.0) GeoJSON [source]
Offset a GeoJSON object by the specified amount. This returns a new GeoJSON instance.
- property width
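The translate, scale, and normalize_to_page methods above each return a new instance instead of modifying the object in place. The sketch below shows the same arithmetic on a plain list of (x, y) tuples. The function names mirror the methods, but everything here is illustrative: in particular, `normalize_to_page` takes the page size as explicit arguments, which is an assumption, since the documented method takes none.

```python
from typing import List, Tuple

Coords = List[Tuple[float, float]]


def translate(coords: Coords, xoff: float = 0.0, yoff: float = 0.0) -> Coords:
    """Return a new coordinate list shifted by (xoff, yoff)."""
    return [(x + xoff, y + yoff) for x, y in coords]


def scale(
    coords: Coords,
    xfact: float = 1.0,
    yfact: float = 1.0,
    origin: Tuple[float, float] = (0.0, 0.0),
) -> Coords:
    """Return a new coordinate list scaled about `origin`."""
    ox, oy = origin
    return [(ox + (x - ox) * xfact, oy + (y - oy) * yfact) for x, y in coords]


def normalize_to_page(
    coords: Coords, page_width: float, page_height: float
) -> Coords:
    """Map page-space coordinates onto a [0, 1] x [0, 1] scale."""
    return [(x / page_width, y / page_height) for x, y in coords]
```

Because each function builds a fresh list, the input coordinates stay usable after the call, which matches the "returns a new GeoJSON instance" wording above.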
redshred.util module
Utility functions.
Module contents
This is the RedShred client library in Python.
RedShred, LLC 2018-2023
- class redshred.Collection(*, self_link: str = None, id: str = None, config: CollectionConfiguration | None = None, created_at: datetime = None, created_by: str = None, description: str | None = None, documents_link: str = None, marked_for_delete: bool | None = False, metadata: dict | None = None, name: str = None, owner: str = None, perspectives_link: str = None, segments_link: str = None, slug: str = None, updated_at: datetime = None, updated_by: str = None, user_data: dict | None = None, client: Any = None, **data)[source]
Bases: ApiObject
A collection class.
This class is used to create collections. It provides methods to create, read, update, and delete the collection, get documents, perspectives, and segments from the collection, and upload CSVs, files, URLs, and text to the collection.
- Attributes:
id: The ID of the collection.
config: The configuration of the collection.
created_at: The date the collection was created.
created_by: The user who created the collection.
description: The description of the collection.
documents_link: The link to the documents in the collection.
marked_for_delete: Whether the collection is marked for deletion.
metadata: The metadata of the collection.
name: The name of the collection.
owner: The owner of the collection.
perspectives_link: The link to the perspectives in the collection.
segments_link: The link to the segments in the collection.
self_link: The self link of the collection.
slug: The slug of the collection.
updated_at: The date the collection was last updated.
updated_by: The user who last updated the collection.
user_data: The user data of the collection.
- client: Any
- config: CollectionConfiguration | None
- create(client: 'redshred.api.http.RedShredAPI' | 'redshred.api.client.RedShredClient') TApiObject [source]
Creates the local object on the remote server.
This method is used to create the local object on the remote server. It uses the provided client to make the API request. The method identifies the creatable fields, makes a POST request to the server, and updates the local object with the response.
- Args:
client (Union[RedShredAPI, RedShredClient]): The client to use for the API request.
- Returns:
TApiObject: The local object updated with the response from the server.
- Raises:
RedShredAPIError: If the server response status code is not 201 (Created).
- created_at: datetime.datetime
- created_by: str
- description: str | None
- document(document_id) Document [source]
Retrieves a specific document from the collection.
This method uses the provided document ID to load a Document object from the collection. The document ID can be provided in various formats and is converted to a standard format using the id_from_any function.
- Args:
document_id (str): The ID of the document to retrieve.
- Returns:
Document: The loaded Document object.
- documents(q: str | None = None, fields: List[str] | None = None, **url_params) DocumentIterator[Document] [source]
Returns an iterator over the documents in the collection.
This method creates a DocumentIterator object that can be used to iterate over the documents in the collection. The documents can be filtered using a query string and specific fields can be included in the output. Additional URL parameters can be provided as keyword arguments.
- Args:
q (str, optional): The query string to filter the documents. Defaults to None.
fields (List[str], optional): The fields to include in the output. Defaults to None.
**url_params: Additional URL parameters.
- Returns:
DocumentIterator[Document]: An iterator over the documents in the collection.
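A short usage sketch for the iterator follows. It relies only on the documented signature of documents() and on the `name` attribute of Document. `document_names` is a hypothetical helper, and `FakeCollection` is a stand-in so the example runs without a server; its substring filter is an assumption, since the real query syntax is not described here, as is the idea that requesting `fields=["name"]` leaves `name` populated.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def document_names(collection: Any, q: Optional[str] = None) -> List[str]:
    """Collect the names of the documents matched by `q`.

    `collection` is expected to behave like redshred.Collection.
    """
    return [doc.name for doc in collection.documents(q=q, fields=["name"])]


class FakeCollection:
    """Offline stand-in that mimics the documents() signature."""

    def __init__(self, names: List[str]):
        self._docs = [SimpleNamespace(name=n) for n in names]

    def documents(self, q=None, fields=None, **url_params):
        # Assumption: treat q as a plain substring match on the name.
        return iter(d for d in self._docs if q is None or q in d.name)
```

With a real client, the same helper would be called with a loaded Collection in place of the stand-in.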
- documents_link: str
- id: str
- classmethod load(client: 'redshred.api.http.RedShredAPI' | 'redshred.api.client.RedShredClient', slug: str = None, url: str = None) Collection [source]
See ApiObject.load.
- marked_for_delete: bool | None
- metadata: dict | None
- name: str
- owner: str
- perspective(perspective_id) Perspective [source]
Retrieves a specific perspective from the collection.
This method uses the provided perspective ID to load a Perspective object from the collection. The perspective ID can be provided in various formats and is converted to a standard format using the id_from_any function.
- Args:
perspective_id (str): The ID of the perspective to retrieve.
- Returns:
Perspective: The loaded Perspective object.
- perspectives(q: str | None = None, fields: List[str] | None = None, **url_params) PerspectiveIterator[Perspective] [source]
Returns an iterator over the perspectives in the collection.
This method creates a PerspectiveIterator object that can be used to iterate over the perspectives in the collection. The perspectives can be filtered using a query string and specific fields can be included in the output. Additional URL parameters can be provided as keyword arguments.
- Args:
q (str, optional): The query string to filter the perspectives. Defaults to None.
fields (List[str], optional): The fields to include in the output. Defaults to None.
**url_params: Additional URL parameters.
- Returns:
PerspectiveIterator[Perspective]: An iterator over the perspectives in the collection.
- perspectives_link: str
- segment(segment_id) Segment [source]
Retrieves a specific segment from the collection.
- Args:
segment_id (str): The ID of the segment to retrieve.
- Returns:
Segment: The loaded Segment object.
- segments(q: str | None = None, fields: List[str] | None = None, **url_params) SegmentIterator[Segment] [source]
Returns an iterator over the segments in the collection.
This method creates a SegmentIterator object that can be used to iterate over the segments in the collection. The segments can be filtered using a query string and specific fields can be included in the output. Additional URL parameters can be provided as keyword arguments.
- Args:
q (str, optional): The query string to filter the segments. Defaults to None.
fields (List[str], optional): The fields to include in the output. Defaults to None.
**url_params: Additional URL parameters.
- Returns:
SegmentIterator[Segment]: An iterator over the segments in the collection.
- segments_link: str
- self_link: str
- slug: str
- updated_at: datetime.datetime
- updated_by: str
- upload_csv(file, content_columns: List[str], delimiter=',', rename: str | None = None, **user_data)[source]
- Args:
file: Source file to upload.
content_columns: Used for multi-document upload via CSV. A list that specifies the column(s) that will be used for the document body.
delimiter: Delimiter for the CSV. Defaults to ",".
rename: A new name, if desired.
**user_data: Any additional user data.
- Returns:
None
- upload_file(file: Path | BufferedReader, rename: str | None = None, save_origin: bool | None = False, **user_data) Document [source]
Convenience method to upload a file-like object into RedShred.
- Args:
file (str, filelike): Either a filename, URL of a file, or an open() file-like object.
rename (str, optional): File name override. Defaults to the existing filename.
save_origin (bool, optional): Save the path to the file on disk. Defaults to False.
user_data (dict): Arbitrary dictionary to store with the document on the server.
- Raises:
ValueError: Name argument missing for URL upload
- Returns:
Document: The uploaded document, as returned by the API server.
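A common pattern is to upload a file and then block until RedShred has finished reading it, using Document.wait_until_read, which is documented later in this reference. The sketch below is duck-typed so that it runs offline: `upload_and_wait` is a hypothetical helper, and `FakeCollection` and `FakeDocument` are stand-ins that only copy the documented method signatures.

```python
from pathlib import Path


def upload_and_wait(collection, path, wait_time_seconds=5, **user_data):
    """Upload `path` to `collection`, then wait until it has been read."""
    document = collection.upload_file(Path(path), **user_data)
    document.wait_until_read(wait_time_seconds=wait_time_seconds)
    return document


class FakeDocument:
    """Stand-in for redshred.Document that records the wait call."""

    def __init__(self, name, user_data):
        self.name = name
        self.user_data = user_data
        self.waited_with = None

    def wait_until_read(self, wait_time_seconds=5):
        self.waited_with = wait_time_seconds


class FakeCollection:
    """Stand-in for redshred.Collection with the same upload signature."""

    def upload_file(self, file, rename=None, save_origin=False, **user_data):
        return FakeDocument(rename or Path(file).name, user_data)
```

Extra keyword arguments are forwarded as user data, matching the `**user_data` parameter in the signature above.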
- upload_text(text: str, name: str, **user_data) Document [source]
Convenience method to upload raw text into RedShred.
Given a text string and a name, upload that text into this collection.
- Args:
text (str): Text to upload.
name (str): File name to save the text as.
user_data (dict): Arbitrary dictionary to store with the document on the server.
- Returns:
Document: The uploaded document, as returned by the API server.
- upload_url(url: str, rename: str | None = None, save_origin: bool | None = True, **user_data) Document [source]
Convenience method to upload a URL into RedShred.
Given a URL, upload the file it points to into this collection.
- Args:
url (str): URL of the file to upload.
rename (str, optional): File name override. Defaults to the existing filename.
save_origin (bool, optional): Save the URL of the file. Defaults to True.
user_data (dict): Arbitrary dictionary to store with the document on the server.
- Raises:
ValueError: Name argument missing for URL upload
- Returns:
Document: The uploaded document, as returned by the API server.
- user_data: dict | None
- class redshred.CollectionConfiguration(*, tokenizer: List[Tokenizers | str | ConfiguredTokenizer | TesseractTokenizerConfig | AdvancedOCRTokenizerConfig] | Tokenizers | str | ConfiguredTokenizer | TesseractTokenizerConfig | AdvancedOCRTokenizerConfig = None, enrichments: List[DefinedAcronymsPerspective | ExternalAPIPerspective | GrouperPerspective | HuggingfacePerspective | IrisPerspective | PageImagesPerspective | PdftotextPerspective | PreprocessPerspective | RegexPerspective | SentencesPerspective | SpacyPerspective | TFIDFPerspective | TypographyPerspective | PerspectiveConfiguration] = None, notifications: List[NotificationConfiguration] = None, document_uniqueness: DocumentUniqueness = 'contents', allow_anonymous_downloads: bool = False)[source]
Bases: BaseModel
- allow_anonymous_downloads: bool
- dict(*, include: AbstractSetIntStr | MappingIntStrAny | None = None, exclude: AbstractSetIntStr | MappingIntStrAny | None = None, by_alias: bool = False, skip_defaults: bool | None = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False) DictStrAny
Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
- document_uniqueness: DocumentUniqueness
- enrichments: List[DefinedAcronymsPerspective | ExternalAPIPerspective | GrouperPerspective | HuggingfacePerspective | IrisPerspective | PageImagesPerspective | PdftotextPerspective | PreprocessPerspective | RegexPerspective | SentencesPerspective | SpacyPerspective | TFIDFPerspective | TypographyPerspective | PerspectiveConfiguration]
- json(*, include: AbstractSetIntStr | MappingIntStrAny | None = None, exclude: AbstractSetIntStr | MappingIntStrAny | None = None, by_alias: bool = False, skip_defaults: bool | None = None, exclude_unset: bool = False, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Callable[[Any], Any] | None = None, models_as_dict: bool = True, **dumps_kwargs: Any) unicode
Generate a JSON representation of the model, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- notifications: List[NotificationConfiguration]
- tokenizer: List[Tokenizers | str | ConfiguredTokenizer | TesseractTokenizerConfig | AdvancedOCRTokenizerConfig] | Tokenizers | str | ConfiguredTokenizer | TesseractTokenizerConfig | AdvancedOCRTokenizerConfig
- validate_remote_schema(client: redshred.api.client.RedShredClient)[source]
- yaml(*, include: Set[str] | None = None, exclude: Set[str] | None = None, by_alias: bool = False, skip_defaults: bool | None = None, exclude_unset: bool = True, exclude_defaults: bool = False, exclude_none: bool = False, encoder: Callable[[Any], Any] | None = None, models_as_dict: bool = True, **dumps_kwargs: Any)[source]
Generate a YAML representation of the model from the JSON representation, include and exclude arguments as per dict().
encoder is an optional function to supply as default to json.dumps(), other arguments as per json.dumps().
- class redshred.Document(*, self_link: str = None, id: str = None, collection_link: str = None, collection_slug: str = None, config: CollectionConfiguration | None = None, content_hash: str = None, created_at: datetime = None, created_by: str = None, csv_metadata: dict | None = None, description: str | None = None, document_segment_link: str = None, errors: dict | str | None = None, file_link: str = None, file_size: int = None, index: int = None, metadata: dict | None = None, n_pages: int = None, name: str = None, original_name: str = None, pages_link: str = None, pdf_link: str = None, perspectives_link: str = None, read_state: str = None, read_state_updated_at: datetime = None, region: GeoJSON = None, segments_link: str = None, slug: str = None, source: str = None, summary: str = None, text: str = None, updated_at: datetime = None, updated_by: str = None, user_data: dict | None = None, warnings: dict | None = None, uniqueness_id: str | None = None, **data)[source]
Bases: ApiObject
A document class.
This class is used to create documents. It provides methods to create, read, update, and delete the document, get pages, perspectives, and segments from the document, and download the document.
- Attributes:
id: The ID of the document.
collection_link: The link to the collection the document belongs to.
collection_slug: The slug of the collection the document belongs to.
config: The configuration of the document.
content_hash: The content hash of the document.
created_at: The date the document was created.
created_by: The user who created the document.
csv_metadata: The CSV metadata of the document.
description: The description of the document.
document_segment_link: The link to the document segment.
errors: The errors of the document.
file_link: The link to the file of the document.
file_size: The size of the file of the document.
index: The index of the document.
metadata: The metadata of the document.
n_pages: The number of pages in the document.
name: The name of the document.
original_name: The original name of the document.
pages_link: The link to the pages in the document.
pdf_link: The link to the PDF of the document.
perspectives_link: The link to the perspectives in the document.
read_state: The read state of the document.
read_state_updated_at: The date the read state of the document was last updated.
region: The region of the document.
segments_link: The link to the segments in the document.
self_link: The self link of the document.
slug: The slug of the document.
source: The source of the document.
summary: The summary of the document.
text: The text of the document.
updated_at: The date the document was last updated.
updated_by: The user who last updated the document.
user_data: The user data of the document.
warnings: The warnings of the document.
uniqueness_id: The uniqueness ID of the document.
- collection() Collection [source]
- collection_link: str
- collection_slug: str
- config: CollectionConfiguration | None
- content_hash: str
- created_at: datetime.datetime
- created_by: str
- csv_metadata: dict | None
- description: str | None
- document_segment_link: str
- download(path: str | 'pathlib.Path') int [source]
Download the original file uploaded to RedShred to the specified path, returning the total number of bytes written.
- Args:
path: A path on the local filesystem.
- Returns:
int: The number of bytes written.
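The sketch below shows one way download() might be used to save a document under its original name inside a chosen directory. It depends only on the documented `original_name` attribute and on download() returning the byte count. `download_into` and `demo` are hypothetical helpers, `FakeDocument` is a stand-in so the example runs offline, and using `original_name` directly as a file name is an assumption.

```python
import tempfile
from pathlib import Path


def download_into(document, directory):
    """Save `document` inside `directory`, keeping its original name."""
    target = Path(directory) / document.original_name
    written = document.download(target)
    return target, written


class FakeDocument:
    """Stand-in for redshred.Document that writes a few stub bytes."""

    original_name = "report.pdf"

    def download(self, path):
        # Path.write_bytes returns the number of bytes written.
        return Path(path).write_bytes(b"%PDF-stub")


def demo():
    """Run the helper against the stand-in in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        target, written = download_into(FakeDocument(), tmp)
        return target.name, written, target.stat().st_size
```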
- download_bytes() bytes [source]
Download the original file uploaded to RedShred and return its contents as bytes.
- Returns:
bytes: The document as bytes.
- errors: dict | str | None
- file_link: str
- file_size: int
- id: str
- index: int
- metadata: dict | None
- n_pages: int
- name: str
- original_name: str
- page(index) Page [source]
Retrieves a specific page from the document.
- Args:
index (int): The index of the page to retrieve.
- Returns:
Page: The loaded Page object.
- pages(q: str | None = None, fields: List[str] | None = None, **url_params) PageIterator[Page] [source]
Returns an iterator over the pages in the document.
- Args:
q (str, optional): The query string to filter the pages. Defaults to None.
fields (List[str], optional): The fields to include in the output. Defaults to None.
**url_params: Additional URL parameters.
- Returns:
PageIterator[Page]: An iterator over the pages in the document.
- pages_link: str
- pdf_link: str
- perspective(perspective_id) Perspective [source]
- perspectives(q: str | None = None, fields: List[str] | None = None, **url_params) PerspectiveIterator[Perspective] [source]
- perspectives_link: str
- read_state: str
- read_state_updated_at: datetime.datetime
- reread_document(force=False)[source]
Reread the document and generate any new or changed perspectives and retry any failed perspectives.
- Args:
force: force a reread even if the document is not in a state that allows it to be read
- segment(segment_id) Perspective [source]
- segments(q: str | None = None, fields: List[str] | None = None, **url_params) SegmentIterator[Segment] [source]
- segments_link: str
- self_link: str
- slug: str
- source: str
- summary: str
- text: str
- uniqueness_id: str | None
- updated_at: datetime.datetime
- updated_by: str
- user_data: dict | None
- wait_until_read(wait_time_seconds: int = 5)[source]
Synchronously wait until the document has been read
- Args:
wait_time_seconds: time to wait between checks
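Waiting synchronously with a fixed interval between checks is a plain polling loop. The sketch below illustrates that pattern in isolation. It is not the library's implementation: `wait_until` is a hypothetical helper, and how the client actually decides that a document has been read is not described here, so the condition is passed in as a callable.

```python
import time
from typing import Callable


def wait_until(
    is_done: Callable[[], bool],
    wait_time_seconds: float = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `is_done` until it returns True; return the number of sleeps."""
    sleeps = 0
    while not is_done():
        sleeps += 1
        # Pause between checks so the server is not polled in a tight loop.
        sleep(wait_time_seconds)
    return sleeps
```

The `sleep` parameter exists only so the loop can be exercised without real delays.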
- warnings: dict | None
- class redshred.Page(*, self_link: str = None, collection_link: str = None, collection_slug: str = None, content_hash: str = None, created_at: datetime = None, created_by: str = None, document_index: int = None, document_name: str = None, dpi: int = None, height: float = None, id: str = None, index: int = None, metadata: dict | None = None, name: str = None, page_segment_link: str = None, perspectives_link: str = None, region: GeoJSON = None, segments_link: str = None, summary: str = None, text: str = None, tokens_file_link: str = None, units: str = None, updated_at: datetime = None, updated_by: str = None, user_data: dict | None = None, width: float = None, **data)[source]
Bases: ApiObject
A page class.
This class is used to create pages. It provides methods to set attributes.
- Attributes:
collection_link: The link to the collection the page belongs to.
collection_slug: The slug of the collection the page belongs to.
content_hash: The content hash of the page.
created_at: The date the page was created.
created_by: The user who created the page.
document_index: The index of the document the page belongs to.
document_name: The name of the document the page belongs to.
dpi: The DPI of the page.
height: The height of the page.
id: The ID of the page.
index: The index of the page.
metadata: The metadata of the page.
name: The name of the page.
page_segment_link: The link to the page segment.
perspectives_link: The link to the perspectives in the page.
region: The region of the page.
segments_link: The link to the segments in the page.
self_link: The self link of the page.
summary: The summary of the page.
text: The text of the page.
tokens_file_link: The link to the tokens file of the page.
units: The units of the page.
updated_at: The date the page was last updated.
updated_by: The user who last updated the page.
user_data: The user data of the page.
width: The width of the page.
- collection_link: str
- collection_slug: str
- content_hash: str
- created_at: datetime.datetime
- created_by: str
- document_index: int
- document_name: str
- dpi: int
- height: float
- id: str
- index: int
- metadata: dict | None
- name: str
- next()[source]
Returns the next page in the document.
- Returns:
Page: The next page in the document.
- Raises:
ValueError: If there is no next page.
- page_segment_link: str
- perspective(perspective_id) Perspective [source]
- perspectives(q: str | None = None, fields: List[str] | None = None, **url_params) PerspectiveIterator[Perspective] [source]
- perspectives_link: str
- previous()[source]
Returns the previous page in the document.
- Returns:
Page: The previous page in the document.
- Raises:
ValueError: If there is no previous page.
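Because next() and previous() raise ValueError at the ends of a document, walking forward from a page means catching that exception to stop. A minimal sketch, assuming only the documented behavior of next(): `following_pages` is a hypothetical helper, and `FakePage` with `make_pages` are stand-ins so the example runs without a server.

```python
from typing import Iterator, List


def following_pages(page) -> Iterator:
    """Yield every page after `page`, stopping when next() raises."""
    current = page
    while True:
        try:
            current = current.next()
        except ValueError:
            # Documented signal that there is no next page.
            return
        yield current


class FakePage:
    """Stand-in for redshred.Page backed by a shared list of pages."""

    def __init__(self, pages: List["FakePage"], index: int):
        self._pages = pages
        self.index = index

    def next(self) -> "FakePage":
        if self.index + 1 >= len(self._pages):
            raise ValueError("no next page")
        return self._pages[self.index + 1]


def make_pages(count: int) -> List[FakePage]:
    """Build `count` linked stand-in pages."""
    pages: List[FakePage] = []
    for i in range(count):
        pages.append(FakePage(pages, i))
    return pages
```

The same loop works backwards by calling previous() instead of next().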
- segments(q: str | None = None, fields: List[str] | None = None, **url_params) SegmentIterator[Segment] [source]
- segments_link: str
- self_link: str
- summary: str
- text: str
- tokens()[source]
Returns a list of tokens in the page.
- Returns:
List[Token]: A list of tokens in the page.
- tokens_file_link: str
- units: str
- updated_at: datetime.datetime
- updated_by: str
- user_data: dict | None
- width: float
- class redshred.Perspective(*, self_link: str = None, name: str = None, enrichment_name: str = None, collection_link: str = None, collection_slug: str = None, created_at: datetime = None, created_by: str = None, document_link: str = None, description: str | None = None, document_name: str = None, enrichment_config: dict = None, errors: dict | None = None, id: str = None, metadata: dict | None = None, segment_types: list = None, segments_link: str = None, slug: str = None, updated_at: datetime = None, updated_by: str = None, user_data: dict | None = None, warnings: dict | None = None, cache_id: str | None = None, **data)[source]
Bases:
ApiObject
A Perspective class.
This class is used to create Perspective objects. It provides methods to create, read, update, and delete the Perspective object, get segments from the Perspective object, and set the API.
- Attributes:
name: The name of the Perspective.
enrichment_name: The name of the enrichment.
collection_link: The link to the collection the Perspective belongs to.
collection_slug: The slug of the collection the Perspective belongs to.
created_at: The date the Perspective was created.
created_by: The user who created the Perspective.
document_link: The link to the document the Perspective belongs to.
description: The description of the Perspective.
document_name: The name of the document the Perspective belongs to.
enrichment_config: The configuration of the enrichment.
errors: The errors of the Perspective.
id: The ID of the Perspective.
metadata: The metadata of the Perspective.
segment_types: The types of segments in the Perspective.
segments_link: The link to the segments in the Perspective.
self_link: The self link of the Perspective.
slug: The slug of the Perspective.
updated_at: The date the Perspective was last updated.
updated_by: The user who last updated the Perspective.
user_data: The user data of the Perspective.
warnings: The warnings of the Perspective.
cache_id: The cache ID of the Perspective.
- bulk_create_segments(segments: List[dict | Segment], batch_size=128)[source]
Creates multiple segments in the perspective in batches, which is faster than creating them individually.
This method takes a list of segments and creates them in the perspective. The segments can be provided as dictionaries or Segment objects. The method uses the provided batch size to determine how many segments to create at a time. The default batch size is 128. If a segment is provided as a Segment object and it already has an ID, a ValueError is raised. The method returns a list of the created segments.
- Args:
segments (List[Union[dict, Segment]]): The segments to create. Can be provided as dictionaries or Segment objects.
batch_size (int, optional): The number of segments to create at a time. Defaults to 128.
- Returns:
List[Segment]: The created segments.
- Raises:
ValueError: If a segment is provided as a Segment object and it already has an ID.
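The batching and ID rule described above can be illustrated without a server. This is a sketch only: the helpers batches and reject_existing are hypothetical names, and how the library actually splits or validates its input is not shown in this reference. It demonstrates how a batch size of 128 divides a list, and a local pre-check that mirrors the documented ValueError condition.

```python
from typing import Any, Iterator, List, Sequence


def batches(items: Sequence[Any], batch_size: int = 128) -> Iterator[List[Any]]:
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def reject_existing(segments: Sequence[Any]) -> None:
    """Raise ValueError if a non-dict segment already carries an ID.

    Mirrors the documented rule for Segment objects; plain dicts are not checked.
    """
    for position, segment in enumerate(segments):
        if not isinstance(segment, dict) and getattr(segment, "id", None) is not None:
            raise ValueError(f"segment at position {position} already has an ID")


# 300 dictionaries using documented Segment field names.
payload = [{"segment_type": "note", "text": str(i)} for i in range(300)]
batch_sizes = [len(batch) for batch in batches(payload)]
```

Running the pre-check before calling bulk_create_segments would surface an offending segment before any request is made, assuming the documented behaviour holds.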
- cache_id: str | None
- collection_link: str
- collection_slug: str
- create(collection: str | Collection | None = None, document: Document | None = None, client: 'redshred.api.http.RedShredAPI' | 'redshred.api.client.RedShredClient' | None = None)[source]
Create the local object on the remote server
- created_at: datetime.datetime
- created_by: str
- description: str | None
- document_link: str
- document_name: str
- enrichment_config: dict
- enrichment_name: str
- errors: dict | None
- id: str
- metadata: dict | None
- name: str
- segment_types: list
- segments(q: str | None = None, fields: List[str] | None = None, **url_params) SegmentIterator[Segment] [source]
- segments_link: str
- self_link: str
- slug: str
- updated_at: datetime.datetime
- updated_by: str
- user_data: dict | None
- warnings: dict | None
- class redshred.PerspectiveConfiguration(*, name: str, perspective: str, segments: SegmentQuery | Dict = None, description: str = '', config: Dict[str, Any] = None, debug: bool = False)[source]
Bases:
BaseModel
- config: Dict[str, Any]
- debug: bool
- description: str
- name: str
- perspective: str
- segments: SegmentQuery | Dict
- exception redshred.RedShredAPIError(*args, reason=None, **kwargs)[source]
Bases:
Exception
- reason: Any
- class redshred.RedShredClient(token: str | None = None, host: str | None = None, host_verify: bool | None = None, config_path: str | None = None, context_override: str | None = None, config: Configuration | None = None)[source]
Bases:
object
Client for interacting with the RedShred Platform.
This client provides easy access to the RedShred API, enabling users to interact with various RedShred Platform services such as retrieving user data, fetching contents of files, and accessing collection statistics.
The client can be configured via environment variables, a .rsconfig file, or manual initialization arguments.
- Attributes:
config (Configuration): The configuration for the RedShred API.
api (RedShredAPI): Interface to the RedShred API services.
_user (RedShredUser, optional): The authenticated user’s details. Default is None.
- Args:
token (str, optional): The authentication token.
host (str, optional): The RedShred API host address.
host_verify (bool, optional): Flag to enable or disable SSL verification.
config_path (str, optional): Path to the .rsconfig file.
context_override (str, optional): Overrides the context specified in the configuration.
config (Configuration, optional): A pre-initialized Configuration object.
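For manual initialization, one possible approach is to pass only the arguments that were explicitly set, so that anything left out can still come from the other configuration sources. The sketch below assumes that omitted arguments fall back to those sources, which this reference implies but does not spell out. make_client is a hypothetical helper, not part of redshred; it takes the client class as a parameter so it can be tried without the package installed.

```python
from typing import Any

# Keyword names taken from the documented RedShredClient signature.
_CLIENT_ARGS = ("token", "host", "host_verify", "config_path", "context_override", "config")


def make_client(client_cls, **settings: Any):
    """Instantiate `client_cls`, forwarding only settings that are not None."""
    unknown = sorted(set(settings) - set(_CLIENT_ARGS))
    if unknown:
        raise TypeError(f"unexpected client settings: {unknown}")
    explicit = {name: value for name, value in settings.items() if value is not None}
    return client_cls(**explicit)


# Intended use (requires the redshred package):
#   from redshred import RedShredClient
#   client = make_client(RedShredClient, token="...", host_verify=True)
```

Rejecting unknown keys up front catches misspelled setting names early instead of relying on the constructor's own error.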
- collection(slug: str) Collection [source]
Retrieve a specific collection by its slug.
- Args:
slug (str): The slug, short reference, or link field of the collection.
- Returns:
Collection: The retrieved collection object.
- Raises:
HTTPError: If the requested collection cannot be found or another error occurs.
- collections(**client_params) CollectionIterator [source]
Fetch collections accessible to the user.
- Args:
**client_params: Additional parameters for the client request.
- Returns:
CollectionIterator: An iterator to access user’s collections.
- file(storage_path: str, inline: bool = False, unconfined: bool = True, width: int = 800, **kwargs) bytes [source]
Fetch a file stored in RedShred by its relative path.
Many enrichments in RedShred can generate files (e.g. extracted images) and will serve these back with a relative path that can be passed to this method to retrieve.
If inline is True, this will display images directly in a notebook context. In that case, unconfined and width are passed directly to Image().
- Args:
storage_path (str): Path to the file as given in API response data.
inline (bool, optional): Whether to attempt to display the file inline in a notebook (images only). Defaults to False.
unconfined (bool, optional): Passed to IPython.core.display.Image; requires inline=True. Defaults to True.
width (int, optional): Passed to IPython.core.display.Image; requires inline=True. Defaults to 800.
**kwargs: Passed to IPython.core.display.Image; requires inline=True.
- Returns:
bytes: file contents as bytes
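Since file() returns raw bytes, saving an enrichment-generated file locally is a matter of writing those bytes to disk. A minimal sketch follows. download_file is a hypothetical helper, and `client` is assumed to be any object exposing the documented file(storage_path) method.

```python
from pathlib import Path


def download_file(client, storage_path: str, dest_dir: str) -> Path:
    """Fetch `storage_path` through client.file() and write it under `dest_dir`."""
    data = client.file(storage_path)  # inline defaults to False, so bytes come back
    target = Path(dest_dir) / Path(storage_path).name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

Only the final path component of storage_path is used for the local file name, so two files with the same name in different remote folders would overwrite each other; keep the relative path if that matters.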
- get_text(api_object: redshred.models.api.ApiObject)[source]
Extract the text from a given api object.
- Args:
api_object: Any API object that has a get_text method.
- Returns:
str: The text extracted from the API object.
- stats(collection_name: str | Collection) dict [source]
Review the current states of documents in a collection.
Read state is one of [‘unread’, ‘queued’, ‘reading’, ‘read’, ‘crashed’]:
unread - newly uploaded documents that are not yet fully enriched and indexed
queued - documents that are awaiting reading
reading - documents that are currently being enriched by the RedShred reader
read - documents that have been read and are “at rest” in RedShred
crashed - documents that could not be successfully processed by RedShred.
Documents in crashed states can be reported to RedShred through the chat window in the documentation. It is our goal that all documents should be read successfully although the amount of enrichment may vary (e.g. encrypted PDFs shouldn’t crash, but will likely be sparsely enriched.)
- Args:
collection_name (str | Collection): Name of the target collection, or a Collection object.
- Returns:
dict: Read statistics for collection
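The exact layout of the returned dictionary is not described here. As an illustration only, the sketch below assumes a flat mapping from read state to document count and groups the five documented states into a short summary. Both that assumed shape and the helper name summarize_states are hypothetical.

```python
from typing import Dict, Mapping

# The five read states listed in the description above.
READ_STATES = ("unread", "queued", "reading", "read", "crashed")


def summarize_states(counts: Mapping[str, int]) -> Dict[str, int]:
    """Group per-state counts (ASSUMED shape: state -> count) into a summary."""
    known = {state: int(counts.get(state, 0)) for state in READ_STATES}
    return {
        "total": sum(known.values()),
        # Documents that have not yet reached a final state.
        "pending": known["unread"] + known["queued"] + known["reading"],
        "read": known["read"],
        "crashed": known["crashed"],
    }
```

If the real response nests these counts under another key, the mapping passed in would need to be extracted first.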
- property user
Retrieve the current authenticated user’s details.
- Returns:
RedShredUser: An object representing the authenticated user.
- Raises:
HTTPError: If authentication fails or an error occurs with the HTTP call.
TypeError: If the data returned is not in the expected format.
ConnectionError: If a connection error occurs.
- exception redshred.RedShredFileExistsError(*args, **kwargs)[source]
Bases:
RedShredHTTPError
- class redshred.Segment(*, self_link: str = None, segment_type: str = None, regions: GeoJSON = None, bounding_box: BoundingBox | None = None, collection_link: str = None, collection_slug: str = None, created_at: datetime = None, created_by: str = None, document_link: str = None, document_name: str = None, enrichment_data: dict | None = None, enrichment_name: str = None, errors: dict | None = None, id: str = None, labels: list = None, metadata: dict | None = None, max_x: float | None = None, max_y: float | None = None, min_x: float | None = None, min_y: float | None = None, perspective_link: str = None, summary: str = None, text: str = None, updated_at: datetime = None, updated_by: str = None, user_data: dict | None = None, warnings: dict | None = None, cache_id: str | None = None, **data)[source]
Bases:
ApiObject
A Segment class.
This class is used to create Segment objects. It provides methods to create, read, update, and delete the Segment object, get segments from the Segment object, and set the API.
- Attributes:
segment_type: The type of the Segment.
regions: The regions of the Segment.
bounding_box: The bounding box of the Segment.
collection_link: The link to the collection the Segment belongs to.
collection_slug: The slug of the collection the Segment belongs to.
created_at: The date the Segment was created.
created_by: The user who created the Segment.
document_link: The link to the document the Segment belongs to.
enrichment_data: The enrichment data of the Segment.
enrichment_name: The name of the enrichment.
errors: The errors of the Segment.
id: The ID of the Segment.
labels: The labels of the Segment.
metadata: The metadata of the Segment.
max_x: The maximum x-coordinate of the Segment.
max_y: The maximum y-coordinate of the Segment.
min_x: The minimum x-coordinate of the Segment.
min_y: The minimum y-coordinate of the Segment.
perspective_link: The link to the perspective of the Segment.
self_link: The self link of the Segment.
summary: The summary of the Segment.
text: The text of the Segment.
updated_at: The date the Segment was last updated.
updated_by: The user who last updated the Segment.
user_data: The user data of the Segment.
warnings: The warnings of the Segment.
cache_id: The cache ID of the Segment.
- between(segment: Segment, strict: bool = False) BoundingBox [source]
Temporarily not implemented. Provides a helper function to generate the bounding box between two segments.
- Args:
segment (Segment): The other segment that bounds the space.
strict (bool, optional): If True, the returned bounding box is the area exactly between the two segments. If False, the returned bounding box spans the entire page width between the two segments. Defaults to False.
- Returns:
list: Bounding box of the area between two segments.
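While between() is unavailable, the gap can be approximated from the documented min_x, min_y, max_x, and max_y fields. The sketch below is one possible reading of the description, not the library's implementation. It assumes the two segments are stacked vertically on the same page, that y grows downward, and that "strict" means the horizontal extent covered by the two segments together. The names Box and gap_between are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Box:
    """Hypothetical stand-in exposing the documented Segment extent fields."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def gap_between(first, second, strict: bool = False,
                page_width: Optional[float] = None) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of the vertical gap between two boxes."""
    # ASSUMPTION: y increases downward, so the box with the smaller min_y is on top.
    upper, lower = sorted((first, second), key=lambda box: box.min_y)
    top, bottom = upper.max_y, lower.min_y
    if bottom < top:
        raise ValueError("segments overlap vertically; there is no gap between them")
    if strict:
        # ASSUMPTION: horizontal span covered by the two segments together.
        left, right = min(first.min_x, second.min_x), max(first.max_x, second.max_x)
    else:
        if page_width is None:
            raise ValueError("page_width is required when strict is False")
        left, right = 0.0, float(page_width)
    return (left, top, right, bottom)


heading = Box(min_x=10, min_y=0, max_x=50, max_y=20)
paragraph = Box(min_x=20, min_y=40, max_x=80, max_y=60)
strict_gap = gap_between(heading, paragraph, strict=True)
full_width_gap = gap_between(heading, paragraph, page_width=100)
```

The page width has to be supplied by the caller here; with real objects it could come from the width attribute of the corresponding Page, provided both use the same units.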
- bounding_box: BoundingBox | None
- cache_id: str | None
- collection_link: str
- collection_slug: str
- create(perspective: Perspective | None = None)[source]
Create the local object on the remote server
- created_at: datetime.datetime
- created_by: str
- document_link: str
- document_name: str
- enrichment_data: dict | None
- enrichment_name: str
- errors: dict | None
- get_segment_image(path_to_save_folder: str | None = None, return_bytes=False, inline=False, **url_params) bytes | str [source]
Retrieves the image of the segment.
This method uses the SegmentCropper to get the cropped image of the segment. The image can be returned as bytes, opened inline using PIL, or saved to a specified folder.
- Args:
path_to_save_folder (str, optional): The path to the folder where the image will be saved. Defaults to the current working directory.
return_bytes (bool, optional): If True, the image will be returned as bytes. Defaults to False.
inline (bool, optional): If True, the image will be opened inline using PIL. Defaults to False.
**url_params: Additional URL parameters.
- Returns:
Union[bytes, str]: The image of the segment, either as bytes or a path to the saved image.
- get_segments_from_perspective(perspective_name: str, **params)[source]
Get all segments that are in the same perspective as this segment
- get_text(**url_params)[source]
Retrieves the text of the segment.
This method uses the TokenLookup to get the text of the segment.
- Args:
**url_params: Additional URL parameters.
- Returns:
str: The text of the segment.
- id: str
- labels: list
- max_x: float | None
- max_y: float | None
- metadata: dict | None
- min_x: float | None
- min_y: float | None
- perspective_link: str
- q(query: str, search_type: Literal['documents', 'pages', 'perspectives', 'segments'] = 'segments', **url_params)[source]
Executes a query on the API object.
This method uses the provided query and search type to execute a search on the API object. The search type can be one of “documents”, “pages”, “perspectives”, or “segments”. Additional URL parameters can be provided as keyword arguments.
- Args:
query (str): The query to execute.
search_type (str): The type of search to perform. Defaults to “segments”.
**url_params: Additional URL parameters.
- Returns:
An iterator over the results of the query.
- segment_type: str
- self_link: str
- summary: str
- text: str
- updated_at: datetime.datetime
- updated_by: str
- user_data: dict | None
- warnings: dict | None