Extracting Tabular Data From Text
In this example, we will create a basic agent that can extract tabular data from unstructured text using OpenAI function calls.
Step 1 Let’s set up an Azure OpenAI Client and then use ActionWeaver patch to enhance its API
[1]:
import os
from openai import AzureOpenAI
import actionweaver.llms as llms
# Azure OpenAI
model="gpt-35-turbo-0613-16k"
client = llms.wrap(AzureOpenAI(
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
api_key=os.getenv("AZURE_OPENAI_KEY"),
api_version="2023-10-01-preview"
))
# If you want to use OpenAI endpoint
# from openai import OpenAI
# client = llms.patch(OpenAI())
Step 2 We will use Pydantic Model to define structures we want to extract from the raw text.
[2]:
# Example inspired by https://jxnl.github.io/instructor/examples/autodataframe/
from typing import List, Any
from pydantic import BaseModel, Field
class Dataframe(BaseModel):
"""
Class representing a dataframe.
"""
name: str = Field(..., description="The name of the dataframe")
data: List[List[Any]] = Field(
...,
description="""Correct rows of data aligned to column names, Nones are allowed.""",
)
columns: List[str] = Field(
...,
description="Column names relevant from source data, should be in snake_case",
)
def to_pandas(self):
import pandas as pd
return pd.DataFrame(data=self.data, columns=self.columns)
def chat(self, query):
messages = [
{
"role": "system",
"content": f"You're a helpful agent trying to answer questions based on {str(self)}",
},
{
"role": "user",
"content": f"{query}",
},
]
return client.create(
model=model,
messages=messages,
temperature=1,
stream=False,
)
class Database(BaseModel):
"""
A set of dataframes
"""
tables: List[Dataframe] = Field(..., description="List of dataframes")
Step 3: Utilize the action_from_model function to convert the PyDantic model into an ActionWeaver action. Then, employ the invoke method to evaluate an example. By using force=True, you compel the LLM to execute the action.
[3]:
from actionweaver.actions.factories.pydantic_model_to_action import action_from_model
data ="""Mary, 35, from Chicago, is a soccer aficionado. She heads the 'Falcons,' a soccer team with 14 members.
David, a 28-year-old swimming enthusiast from Miami, leads the 'Sharks,' a team of 12 swimmers.
Emily, 22, in Boston, is devoted to volleyball. She captains the 'Eagles,' which includes 16 volleyball players.
Chris, 32, from Seattle, is an avid cyclist and heads the 'Wolves,' a 12-member cycling team.
Anna, 27, in Denver, excels in golf and captains the 'Bears,' a team of 9 golfers.
In Dallas, there's Leo, 24, who loves hockey. He leads the 'Panthers,' a hockey team of 13.
Grace, 29, in Atlanta, is passionate about rowing. She captains the 'Dolphins,' a rowing team of 11 athletes.
In Phoenix, Alex, 26, is a skilled archer and heads the 'Hawks,' an archery team with 8 members.
Finally, in Portland, we have Zoe, 31, who is a master at badminton. She captains the 'Rabbits,' a badminton team of 10 players.
"""
db = action_from_model(Database).invoke(client, model=model, messages=[{"role": "user", "content": data}], force=True)
[4]:
db.tables[0].to_pandas()
[4]:
| Team Name | Sport | Captain Name | Captain Age | Location | Number of Members | |
|---|---|---|---|---|---|---|
| 0 | Falcons | Soccer | Mary | 35 | Chicago | 14 |
| 1 | Sharks | Swimming | David | 28 | Miami | 12 |
| 2 | Eagles | Volleyball | Emily | 22 | Boston | 16 |
| 3 | Wolves | Cycling | Chris | 32 | Seattle | 12 |
| 4 | Bears | Golf | Anna | 27 | Denver | 9 |
| 5 | Panthers | Hockey | Leo | 24 | Dallas | 13 |
| 6 | Dolphins | Rowing | Grace | 29 | Atlanta | 11 |
| 7 | Hawks | Archery | Alex | 26 | Phoenix | 8 |
| 8 | Rabbits | Badminton | Zoe | 31 | Portland | 10 |
Let’s wrap it inside of an agent class. This time, we include an additional actions parameter orch when calling the chat completion API. This feature will allow us for more precise control over the specific set of tools available to the LLM during each interaction.
Example:
client.create(
messages = ...
actions=[a1, a2, a3], # First, LLM respond with either a1, a2 or a3, or text without action
# Define the orchestration logic for actions:
orch={
a1.name: [a2, a3], # If a1 is invoked, the next response will be either a2, a3 or a text response.
a2.name: a3, # If a2 is invoked, the next action will be a3
a3.name: [a4] # If a3 is invoked, the next response will be a4 or a text response.
a4.name: None # If a4 is invoked, the next response will guarantee to be a text message
}
)
For details please take a look at here
This agent has the capability to perform two actions:
ReadDataAndCreateTable: It can read raw text, extract tabular data, and store it in a database.
AnswerQuestion: It can provide answers to questions by using queries and information from specific tables.
[7]:
from actionweaver import action
class DBAgent:
def __init__(self):
self.db = Database(tables=[])
self.messages = [{"role": "system", "content": f"You're a helpful agent. You answer questions in plain English. Tables available: {[table.name for table in self.db.tables]}"}]
@action("ReadDataAndCreateTable")
def read(self, data: str):
"""
This method is used to read unstructured data and create tables.
Args:
data (str): The unstructured data to be processed
"""
print (f"[Reading data]: \n {data} \n")
messages=[
{
"role": "system",
"content": """Map this data into a dataframe and correctly define the correct columns and rows""",
},
{
"role": "user",
"content": f"{data}",
},
]
# Utilize the **action_from_model** function to convert the PyDantic model into an ActionWeaver action. Then, employ the `invoke` method to evaluate an example.
# Stop=True, so LLM will return the Database immediately
db = action_from_model(Database, stop=True).invoke(client, messages=messages, model=model, temperature=0.1, stream=False, force=True)
# save the table
self.db.tables.extend(db.tables)
return f"{[table.name for table in db.tables]} have been created"
@action("AnswerQuestion")
def answer(self, query: str, tables:List[str]):
"""
Responds to plain English queries by extracting relevant information from predefined data tables.
This method is specifically designed to process and answer questions that can be addressed using the data available in the provided tables. The tables are not SQL tables.
Parameters:
query (str): A plain English question or query that the user wants to be answered. The method processes this query to understand and extract the necessary information from the available tables.
tables (List[str]): A list of table names or identifiers. These tables contain the data that will be used to answer the query.
Usage:
This method should be invoked when there is a need to answer questions based on the information contained in the specified tables. It is essential that the tables provided have the relevant data to address the query.
"""
candidates = [table for table in self.db.tables if table.name in tables]
context = [table.to_pandas().to_json() for table in candidates]
print (f"[Answering]: {query} using context {context} from tables {tables}")
messages=[
{
"role": "system",
"content": f"You're a helpful agent trying to answer questions based on context: \n {context}"
},
{
"role": "user",
"content": f"{query}",
},
]
response = client.create(
model=model,
temperature=.5,
messages=messages,
stream=False,
)
return response
def chat(self, query):
self.messages.append({"role": "user", "content": f"{query}"})
response = client.create(
model=model,
messages=self.messages,
temperature=.1,
actions = [self.read, self.answer],
orch = {self.read.name: None, self.answer.name: None}, # function is called at most once
stream=False
)
self.messages.append({"role": "assistant", "content": response.choices[0].message.content})
return response
agent = DBAgent()
[8]:
data ="""Mary, 35, from Chicago, is a soccer aficionado. She heads the 'Falcons,' a soccer team with 14 members.
David, a 28-year-old swimming enthusiast from Miami, leads the 'Sharks,' a team of 12 swimmers.
Emily, 22, in Boston, is devoted to volleyball. She captains the 'Eagles,' which includes 16 volleyball players.
Chris, 32, from Seattle, is an avid cyclist and heads the 'Wolves,' a 12-member cycling team.
Anna, 27, in Denver, excels in golf and captains the 'Bears,' a team of 9 golfers.
In Dallas, there's Leo, 24, who loves hockey. He leads the 'Panthers,' a hockey team of 13.
Grace, 29, in Atlanta, is passionate about rowing. She captains the 'Dolphins,' a rowing team of 11 athletes.
In Phoenix, Alex, 26, is a skilled archer and heads the 'Hawks,' an archery team with 8 members.
Finally, in Portland, we have Zoe, 31, who is a master at badminton. She captains the 'Rabbits,' a badminton team of 10 players.
"""
agent.chat(data)
[Reading data]:
Name, Age, Location, Sport, Team
Mary, 35, Chicago, Soccer, Falcons
David, 28, Miami, Swimming, Sharks
Emily, 22, Boston, Volleyball, Eagles
Chris, 32, Seattle, Cycling, Wolves
Anna, 27, Denver, Golf, Bears
Leo, 24, Dallas, Hockey, Panthers
Grace, 29, Atlanta, Rowing, Dolphins
Alex, 26, Phoenix, Archery, Hawks
Zoe, 31, Portland, Badminton, Rabbits
[8]:
ChatCompletion(id='chatcmpl-8ciifleLhHclJXBDOOQKOnyXAdnFc', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I have created a table called 'Players' with the following information:\n\n| Name | Age | Location | Sport | Team |\n|-------|-----|----------|------------|----------|\n| Mary | 35 | Chicago | Soccer | Falcons |\n| David | 28 | Miami | Swimming | Sharks |\n| Emily | 22 | Boston | Volleyball | Eagles |\n| Chris | 32 | Seattle | Cycling | Wolves |\n| Anna | 27 | Denver | Golf | Bears |\n| Leo | 24 | Dallas | Hockey | Panthers |\n| Grace | 29 | Atlanta | Rowing | Dolphins |\n| Alex | 26 | Phoenix | Archery | Hawks |\n| Zoe | 31 | Portland | Badminton | Rabbits |\n\nLet me know if there's anything specific you would like to know about these teams and players.", role='assistant', function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1704239057, model='gpt-35-turbo-16k', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=209, prompt_tokens=440, total_tokens=649), prompt_filter_results=[{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}])
[9]:
agent.db.tables
[9]:
[Dataframe(name='Players', data=[['Mary', '35', 'Chicago', 'Soccer', 'Falcons'], ['David', '28', 'Miami', 'Swimming', 'Sharks'], ['Emily', '22', 'Boston', 'Volleyball', 'Eagles'], ['Chris', '32', 'Seattle', 'Cycling', 'Wolves'], ['Anna', '27', 'Denver', 'Golf', 'Bears'], ['Leo', '24', 'Dallas', 'Hockey', 'Panthers'], ['Grace', '29', 'Atlanta', 'Rowing', 'Dolphins'], ['Alex', '26', 'Phoenix', 'Archery', 'Hawks'], ['Zoe', '31', 'Portland', 'Badminton', 'Rabbits']], columns=['Name', 'Age', 'Location', 'Sport', 'Team'])]
[10]:
agent.chat("find out people below age 30")
[Answering]: Who are the people below the age of 30? using context ['{"Name":{"0":"Mary","1":"David","2":"Emily","3":"Chris","4":"Anna","5":"Leo","6":"Grace","7":"Alex","8":"Zoe"},"Age":{"0":"35","1":"28","2":"22","3":"32","4":"27","5":"24","6":"29","7":"26","8":"31"},"Location":{"0":"Chicago","1":"Miami","2":"Boston","3":"Seattle","4":"Denver","5":"Dallas","6":"Atlanta","7":"Phoenix","8":"Portland"},"Sport":{"0":"Soccer","1":"Swimming","2":"Volleyball","3":"Cycling","4":"Golf","5":"Hockey","6":"Rowing","7":"Archery","8":"Badminton"},"Team":{"0":"Falcons","1":"Sharks","2":"Eagles","3":"Wolves","4":"Bears","5":"Panthers","6":"Dolphins","7":"Hawks","8":"Rabbits"}}'] from tables ['Players']
[10]:
ChatCompletion(id='chatcmpl-8ciikYcz8k8HX10oC5UU5toQSBF73', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The people below the age of 30 are David, Emily, Anna, Leo, and Alex.', role='assistant', function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1704239062, model='gpt-35-turbo-16k', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=20, prompt_tokens=981, total_tokens=1001), prompt_filter_results=[{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}])
[11]:
agent.chat("""create new table from it""")
[Reading data]:
Name, Age, Location, Sport, Team
David, 28, Miami, Swimming, Sharks
Emily, 22, Boston, Volleyball, Eagles
Anna, 27, Denver, Golf, Bears
Leo, 24, Dallas, Hockey, Panthers
Alex, 26, Phoenix, Archery, Hawks
[11]:
ChatCompletion(id='chatcmpl-8ciipPDPPigy6mFpA9MFy9PXhC4aU', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="I have created a new table called 'Players' with the information of the people below the age of 30:\n\n| Name | Age | Location | Sport | Team |\n|-------|-----|----------|------------|----------|\n| David | 28 | Miami | Swimming | Sharks |\n| Emily | 22 | Boston | Volleyball | Eagles |\n| Anna | 27 | Denver | Golf | Bears |\n| Leo | 24 | Dallas | Hockey | Panthers |\n| Alex | 26 | Phoenix | Archery | Hawks |\n\nLet me know if there's anything else I can assist you with.", role='assistant', function_call=None, tool_calls=None), content_filter_results={'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}})], created=1704239067, model='gpt-35-turbo-16k', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=145, prompt_tokens=1114, total_tokens=1259), prompt_filter_results=[{'prompt_index': 0, 'content_filter_results': {'hate': {'filtered': False, 'severity': 'safe'}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}])
[12]:
len(agent.db.tables)
[12]:
2