Turning Conversation Data Into Insight: Practical Guide For Beginners
Transform Conversations into Valuable Insights
On February 22, 2024, Google and Reddit announced a new partnership. This agreement allows Google to use Reddit’s data to train its AI systems. Google noticed that many people turn to Reddit for product recommendations and advice. By teaming up, Reddit aims to improve its search and other features using Google’s Vertex AI technology. As a result, Reddit has become a key data source for Google, and its relevance is expected to grow significantly.
In 2023, Reddit's relevance increased by 400%, and its 2024 partnership with Google makes it even more valuable for gathering advice and ideas. This makes Reddit a powerful tool for gaining insights, especially in areas like the hospitality industry. One example is using a Reddit thread—what we’ll call a "conversation" in this article—to explore customer feedback and industry trends.
In this guide, we’ll show you how to turn conversations into actionable insights. We’ll cover practical steps and tools, including automation software, data storage platforms, insight tools, and, most importantly, a Large Language Model.
An Overview of Conversation Data
Before diving into the practical guide, let’s take a closer look at the dataset we’ll be working with.
As mentioned earlier, our dataset consists of Reddit threads, specifically focusing on topics within the hospitality industry. These include discussions on Travel Hacks, Hotels, and Travel Advice from subreddits like r/TravelHacks. Reddit threads are structured using a unique system called a hierarchical comment structure. Let’s break this down with an example.
In Reddit's hierarchy, the top-level comment is the first level. Replies to that comment form the second level, and replies to those comments form the third level, and so on. This creates a continuous chain of interactions, which we refer to as a "conversation."
For instance, in the screenshot example below, based on usernames, the hierarchy flows like this:
- Top-level comment (first hierarchy).
- Two second-level replies.
- Additional replies forming further levels in the hierarchy.
- jsacik99 → Bitter-Square-3963
- jsacik99 → Alameda540 → Sbmizzou → picklestirfry
- …
This structure allows us to analyze conversations and extract insights, which we’ll demonstrate step by step in this guide.
Since some conversations can have extensive hierarchies, we’ll limit our analysis to the first three levels of hierarchy. For instance, in the given example, the hierarchy would be limited to:
jsacik99 → Alameda540 → Sbmizzou.
This approach helps keep the data manageable while still capturing meaningful insights.
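To make the three-level limit concrete, here is a minimal Python sketch (not part of the original pipeline) that walks a nested comment tree and emits author chains cut off at three levels. The thread structure and field names are illustrative assumptions, not Reddit's API format.

# Minimal sketch: flatten a nested comment tree into chains of at most three reply levels.
# The `thread` structure and field names are illustrative assumptions, not Reddit's API format.
thread = {
    "title": "Spirit Airlines and the Drop by Gravity Rule",
    "comments": [
        {"author": "jsacik99", "text": "...", "replies": [
            {"author": "Alameda540", "text": "...", "replies": [
                {"author": "Sbmizzou", "text": "...", "replies": [
                    {"author": "picklestirfry", "text": "...", "replies": []}
                ]}
            ]},
            {"author": "Bitter-Square-3963", "text": "...", "replies": []}
        ]}
    ]
}

def chains(comment, path=(), max_depth=3):
    """Yield author chains, cutting the hierarchy off after `max_depth` levels."""
    path = path + (comment["author"],)
    if len(path) == max_depth or not comment["replies"]:
        yield path
    else:
        for reply in comment["replies"]:
            yield from chains(reply, path, max_depth)

for top_level in thread["comments"]:
    for chain in chains(top_level):
        print(" → ".join(chain))
# e.g. jsacik99 → Alameda540 → Sbmizzou
#      jsacik99 → Bitter-Square-3963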
To understand how we gathered this data from Reddit, you can refer to this article. Below are the defined fields for our conversation data:
- ID: a unique 12-character ID that identifies each conversation.
- Title: the title of the post.
- Post: the text below the title, containing the question from the original poster.
- Response 1: the response at the first hierarchy level.
- Response 2: the response at the second hierarchy level, replying to the first.
- Response 3: the response at the third hierarchy level, replying to the second.
Here is an example of the conversation in JSON.
[{
"ID": "TNb1IwJKTHTu",
"Title": "Spirit Airlines and the Drop by Gravity Rule",
"Post": "My wife and I fly Spirit Airlines often, and it's always only with a personal item. We got pulled once to check the bag size. The bag in question fit with a minimal downward push, but because it didn't freely fall into the measuring container, they said it counted as a carry-on. \n Has anyone had this experience? Does this "rule", as they called it, sound made up, or does anyone know if it's an actual thing?",
"Response 1": "I remember reading about a budget airline that paid their agents a commission on gate-checked baggage fees.",
"Response 2": "Frontier?!",
"Response 3": "Yup. Watched it happen."},
...
...
}]
Based on our insight article, we have around 5,000 conversations about hotels and travel from Reddit. Now that we are more familiar with our dataset, let’s take a look at the essential tools for transforming it and unlocking its insights.
Essential Tools for Unlocking Data Insights
In the world of data, the ETL process—Extract, Transform, Load—is essential for creating consistent datasets ready for analysis. In the previous section, we covered how to scrape Reddit threads, which corresponds to the "Extract" step. Now, let’s explore the tools used in the remaining steps.
PostgreSQL
PostgreSQL is a widely-used open-source Relational Database Management System (RDBMS) designed to store structured data. In the ETL process:
- Extract: Data is gathered from sources like Reddit threads.
- Transform: The data is processed and enhanced for analysis.
- Load: The data is stored in a database, like PostgreSQL, for further use.
In our workflow, PostgreSQL handles the "Load" process. We store conversation data in PostgreSQL, transform it using tools like n8n, and reload it back into the database. To simplify managing PostgreSQL, we use pgAdmin v4, a graphical interface that makes it easier to interact with and organize our data.
n8n
n8n is an open-source workflow automation tool that integrates over 350 services, including AI tools, SQL databases, and Google Workspace. It plays a vital role in the Transform step of ETL by processing and enriching our data.
Key features include:
- Data manipulation in JSON format.
- Integration with AI tools like OpenAI API to create new fields such as classification, reasoning, or sentiment analysis.
- The ability to update and reload the data directly into PostgreSQL using queries.
The detailed setup process for n8n will be discussed in the next chapter.
OpenAI API
The OpenAI API integrates AI models into applications. In our case, we use it within n8n for data transformation tasks, including:
- Creating new fields such as sentiment analysis, classification, or reasoning.
The specifics of using OpenAI API in this workflow will also be explained in detail later.
Python
Python is our primary tool for data analysis. Using Google Colab as an interface, we employ libraries like pandas to clean, analyze, and visualize the conversation data.
Automating the ETL Process with n8n
As discussed earlier, n8n is an open-source tool that allows you to automate workflows and run them on a local computer. If you’d like to install n8n, visit their GitHub page for detailed instructions.
Setting Up n8n
- Create an Account: After installation, n8n will prompt you to set up an email and login information.
- Explore Workflows: Once logged in, you’ll be directed to the workflows dashboard.
- Add New Workflows: Click “Add workflows” to open a blank canvas where you can design your workflows.
Using Nodes in n8n
Workflows in n8n are built using nodes—the core building blocks that handle specific tasks. Here are the types of nodes we used:
Trigger Nodes:
These nodes activate the workflow based on specific events, such as manual clicks, new database rows, or messages on platforms like Telegram.
- For this project, we use the Manual Click Trigger to start our workflow.
Pre-Built Nodes:
These nodes integrate with external services like OpenAI, PostgreSQL, Google Docs, and more.
- In our case, we use the PostgreSQL Node to interact with our database and the OpenAI Chat Model Node to generate insights.
Utility Nodes:
Utility nodes manipulate data, control the workflow, or handle conditional logic.
- For this workflow, we use:
- Set Node: To define or modify data fields.
- Merge Node: To combine data from multiple sources.
- Function Node: To perform custom logic or calculations.
Workflow Overview
The goal of our workflow is straightforward:
- Start with a trigger (manual click).
- Use nodes to retrieve, process, and enrich data.
- Add new fields based on the specific needs of the analysis.
In the next chapter, we’ll delve into the step-by-step implementation of this workflow, showing how to connect these nodes effectively.
Importing and Enhancing Data in PostgreSQL
To prepare our data for analysis, we use PostgreSQL to import, organize, and enhance our dataset. Here's a step-by-step guide:
Step 1: Import Existing Data
We start by importing our dataset into PostgreSQL. First, create a table with fields that match the structure of your data. Use a SQL query similar to the following:
CREATE TABLE subreddit_conversation (
id VARCHAR(12) PRIMARY KEY,
title TEXT,
post TEXT,
response_1 TEXT,
response_2 TEXT,
response_3 TEXT
);
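If the scraped threads are sitting in a JSON file like the example shown earlier, a minimal sketch along the following lines can load them into that table. The file name, connection settings, and the psycopg2 driver are assumptions for illustration, not part of the original workflow.

# Minimal sketch: load scraped conversations from a JSON file into PostgreSQL.
# The file name, connection settings, and psycopg2 driver are assumptions.
import json
import psycopg2

with open("subreddit_conversations.json", "r", encoding="utf-8") as f:
    conversations = json.load(f)

conn = psycopg2.connect(host="localhost", dbname="reddit", user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    for row in conversations:
        cur.execute(
            """
            INSERT INTO subreddit_conversation (id, title, post, response_1, response_2, response_3)
            VALUES (%s, %s, %s, %s, %s, %s)
            ON CONFLICT (id) DO NOTHING
            """,
            (row["ID"], row["Title"], row["Post"],
             row.get("Response 1"), row.get("Response 2"), row.get("Response 3")),
        )
conn.close()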
Step 2: Create a "Conversation" Field
To analyze the entire conversation, we create a new field called conversation. This field concatenates the title, post, and responses into a single string.
Here’s an example query to add the field:
ALTER TABLE subreddit_conversation
ADD COLUMN conversation TEXT;
UPDATE subreddit_conversation
SET conversation =
'Title: ' || COALESCE(title, '') || E'\n' ||
'Post: ' || COALESCE(post, '') || E'\n' ||
'Response 1: ' || COALESCE(response_1, '') || E'\n' ||
'Response 2: ' || COALESCE(response_2, '') || E'\n' ||
'Response 3: ' || COALESCE(response_3, '');
Step 3: Send Data to OpenAI for Sentiment Analysis
Next, the conversation field is sent to OpenAI along with a specific query. For instance, we can ask OpenAI to categorize the sentiment of the conversation as positive, negative, or neutral.
- Data Cleaning: Before sending data to OpenAI, ensure that the text is clean and formatted properly. This might involve removing unwanted characters or handling null values.
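For reference, here is a rough standalone sketch of that request using the OpenAI Python SDK. In the article itself this step runs inside n8n; the prompt wording and model choice simply mirror the workflow described later and should be treated as an illustration.

# Rough standalone sketch of the sentiment request (the article performs this step in n8n).
# Assumes the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()

def classify_sentiment(conversation: str) -> str:
    prompt = (
        "Determine the sentiment of this conversation.\n"
        f"Conversation: {conversation}\n"
        "Determine whether the sentiment is negative, neutral, or positive.\n"
        "Reply in this form:\nconversation_sentiment: [The Sentiment]"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: print(classify_sentiment("Title: ... Post: ... Response 1: ..."))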
Step 4: Update PostgreSQL with OpenAI Output
After receiving the output from OpenAI, we update the sentiment field in our PostgreSQL table. Here’s how we can do it:
- Add a new sentiment column to store the results:
ALTER TABLE subreddit_conversation ADD COLUMN sentiment TEXT;
- Update the table with OpenAI's response:
UPDATE subreddit_conversation
SET sentiment = 'positive' -- Replace with the actual OpenAI output
WHERE id = 'TNb1IwJKTHTu'; -- Use the unique ID to target a specific row
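If you script this update outside n8n, a parameterized query is a safer option because it sidesteps the quote-escaping issues discussed later in this guide. The sketch below assumes the same psycopg2 connection settings as before.

# Minimal sketch: write the model output back with a parameterized query,
# which avoids manual quote escaping. Connection settings are assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="reddit", user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE subreddit_conversation SET sentiment = %s WHERE id = %s",
        ("positive", "TNb1IwJKTHTu"),  # replace with the actual OpenAI output and row ID
    )
conn.close()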
Automating Steps 3 and 4 in n8n
The manual steps (sending data to OpenAI and updating PostgreSQL) are automated in n8n workflows.
- Workflow Steps:
- Extract data from PostgreSQL using the PostgreSQL Node.
- Process the data with OpenAI using the OpenAI Chat Model Node.
- Update PostgreSQL with the results using a PostgreSQL Update Node.
In the next section, we’ll provide a detailed view of the n8n workflow to efficiently perform these tasks.
You can see the full workflow in n8n in the image above. This workflow can be used to add any field to our data in Postgres. Let me show you how each node in this workflow works.
Manual Click Trigger
This node triggers the workflow to run.
Loading Data to n8n
This is a PostgreSQL node. It executes a query and is used to load the data from our database. For example, we use the query
SELECT id, conversation
FROM subreddit_conversation
WHERE (depends on what condition we want)
The WHERE condition is typically applied if there are specific fields we want to filter. For instance, if there is a field named sentiment and we want to add explanations to rows with negative sentiment, we would use the condition WHERE sentiment = 'negative'.
The node will load the data in the form of JSON.
Looping to Avoid Token Overload
This node is called the 'Split In Batches' node. It loops over the data repeatedly, based on the batch size we set.
'Split In Batches' has two outputs: loop and done. If we use the loop output, we need to build a workflow after it that connects back to this 'Split In Batches' node. Notice that in our workflow, we connect the 'Run The Query' node back to 'Looping to Avoid Token Overload', creating a loop. We don’t connect anything to the done output, since the process is finished once the loop ends.
We can set the batch size of our process. If we have 5,000 conversations and use a batch size of 100, the workflow runs 50 loops. Why do we need to split the data into batches?
OpenAI's standard limit is 8,192 tokens. If we assume roughly 4 tokens for every 3 words, that is an input limit of around 6,000 words. Since a conversation on Reddit can be long, we use batches. In this workflow, we used a batch size of 100 inputs; assuming each conversation is around 500 words, this keeps us well clear of OpenAI's input token limit.
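As a rough sanity check on that arithmetic, here is a tiny sketch using the same heuristic of about 4 tokens per 3 words; it is only an estimate, not an exact tokenizer.

# Rough sanity check using the article's heuristic of ~4 tokens per 3 words (not an exact tokenizer).
def estimate_tokens(word_count: int) -> int:
    return round(word_count * 4 / 3)

print(estimate_tokens(500))    # ~667 tokens for an average 500-word conversation
print(estimate_tokens(6000))   # ~8000 tokens, close to the 8,192-token limit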
Notice that after Looping to Avoid Token Overload, the process splits into two different branches. Let me explain each of the steps.
Keeping the ID
This node is called the Set Node. It takes a value from the previous node and keeps it to be used later. We can also rename fields in this node, but in this case, we only need to keep the ID for later use.
OpenAI Chat Model
This node asks OpenAI to create a new field for us. To use the OpenAI Chat Model, we need an API key and credits for the OpenAI chat model. In this node, we set the operation to message a model and choose the GPT-4o model. After that, we set the message we want to send, for example, "Determine the sentiment of this conversation."
We can reference input data in n8n using {{ $json.fieldName }}. For example, we can insert the conversation from the previous node's output into the message like this.
Determine the sentiment of this conversation.
Conversation: {{ json.conversation }}
Determine the sentiment if it’s negative, neutral, or positive.
Reply in this form.
- - -
conversation_sentiment: [The Sentiment]
- - -
Cleaning the output
This node is called a 'Function Node', where we can write custom code to transform our data. In this node, we want to clean the output from the OpenAI Chat Model, which usually has a format like the following.
{
"id": "chatcmpl-12345",
"object": "chat.completion",
"created": 1700000000,
"model": "gpt-4",
"usage": {
"prompt_tokens": 56,
"completion_tokens": 129,
"total_tokens": 185
},
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "sentiment: negative \n sentiment_reasoning: "The guest is not happy with the amenities provided by the hotel.""
},
"finish_reason": "stop"
}
]
}
We want to extract only the content. Using a Function Node, we can clean this output and keep only the "content" field.
First, we set our Function Node to "Run Once for Each Item". That means if there are 100 inputs, the code will run 100 times. We extract the content with this code.
// Input JSON from the previous node (the OpenAI Chat Model output)
const inputData = $json;
// Extract the `content` field from the current item
const content = inputData.message.content;
// Return the extracted content as a single output item
return {
  json: {
    content: content
  }
};
With this code, we get an output like this in JSON format.
[
{
"content": "Here is the assistant's response."
}
]
This is the output that we want to see. We can also change the field name from content to anything we prefer.
Other times, we need to extract multiple fields from OpenAI output, like this one.
{
"id": "chatcmpl-12345",
"object": "chat.completion",
"created": 1700000000,
"model": "gpt-4",
"usage": {
"prompt_tokens": 56,
"completion_tokens": 129,
"total_tokens": 185
},
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "sentiment: negative \n sentiment_reasoning: "The guest is not happy with the amenities provided by the hotel.""
},
"finish_reason": "stop"
}
]
}
This is a special case: we want the sentiment and sentiment_reasoning fields from within the content. We can achieve this with the following code.
// Get the assistant's reply from the previous node
const content = $json["message"]["content"];
// Initialize an object to hold the parsed fields
const parsed = {};
// Split the content by newlines and process each line
content.split("\n").forEach(line => {
  const separatorIndex = line.indexOf(":"); // Split only on the first ":" so values may contain colons
  if (separatorIndex === -1) return;        // Skip lines without a key/value pair
  const key = line.slice(0, separatorIndex).trim();
  const value = line.slice(separatorIndex + 1).trim();
  if (key && value) {
    parsed[key.replace(/ /g, "_").toLowerCase()] = value; // Normalize keys, e.g. "Sentiment" -> "sentiment"
  }
});
// Return the parsed fields as the node output
return {
  json: parsed
};
The output after this code would look like the one below.
[
{
"sentiment": "negative",
"sentiment_reasoning": "\"The guest is not happy with the amenities provided by the hotel.\""
}
]
Now, we have sentiment and sentiment_reasoning, which will be loaded into the PostgreSQL database. But before that, we need more nodes to transform and then load them.
Cleaning Quotes
Because we build the SQL query as a plain text string, stray quotes inside the text can break it: PostgreSQL uses double quotes to identify column names, table names, or aliases, and single quotes to delimit string literals. That means Postgres is sensitive about quotes inside our generated queries.
Let’s take a look at the previous sentiment and sentiment_reasoning output.
[
{
"sentiment": "negative",
"sentiment_reasoning": "\"The guest is not happy with the amenities provided by the hotel.\""
}
]
The double quotes inside sentiment_reasoning are not allowed. So, we need a transformation before we can add it to our database. One of the following options can be used:
- Removing all double quotes inside the text.
- Replacing double quotes with single quotes, which are meant for use inside text.
- Escaping with a backslash (from "test" to \"test\"). The backslash escapes double quotes inside a string.
In this case, we replace double quotes with single quotes, using the code below.
const items = $input.all();
const fieldsToClean = ['field name'];
items.forEach(item => {
  // Loop through each field in the fieldsToClean array
  fieldsToClean.forEach(field => {
    if (item.json[field]) { // Check if the field exists in the item
      item.json[field] = item.json[field].replace(/"/g, "'"); // Replace double quotes with single quotes
    }
  });
});
return items;
Make sure to set the Function Node to "Run Once for All Items", since this code loops over all the input items at once. If we run the previous sentiment and sentiment_reasoning output through it, the result looks like this.
[
{
"sentiment": "negative",
"sentiment_reasoning": "'The guest is not happy with the amenities provided by the hotel.'"
}
]
Now we can move on to the next node.
Merging ID and OpenAI Output
This node is called the Merge Node. It merges two or more branches from previous outputs into one JSON structure. In this case, we merge the cleaned OpenAI output with the ID kept earlier in the Set Node. To do this, we just need to connect both branches to the Merge Node's inputs and set the node to combine by position.
Creating SQL Query
This node is a Function Node. Its code turns the previous fields and their values into SQL queries for the PostgreSQL database. Make sure your PostgreSQL table has the same fields with the same data types.
Let’s take a look at our previous output.
[
{
"sentiment": "negative",
"sentiment_reasoning": "'The guest is not happy with the amenities provided by the hotel.'"
}
]
We want to add this to our database. Assuming we have sentiment and sentiment_reasoning columns in our database, we can add it using the following code.
const items = $input.all();
// Map the JSON field names to their corresponding database column names
const fieldMapping = {
  'sentiment': 'sentiment',
  'sentiment_reasoning': 'sentiment_reasoning',
};
// Initialize a string to store the SQL queries
let queries = '';
items.forEach(item => {
  // Initialize lists to hold the columns and values for the SQL query
  let columns = [];
  let values = [];
  let setStatements = []; // For the UPDATE query
  // Get the 'id' of the current item (kept earlier in the Set Node)
  const itemId = item.json.id;
  // Loop through each field in the fieldMapping
  Object.keys(fieldMapping).forEach(field => {
    if (item.json[field]) { // Check if the field exists in the item
      // Collect the column name and its cleaned value
      columns.push(fieldMapping[field]);
      values.push(`'${item.json[field].replace(/["']/g, '')}'`); // Strip quotes so the value is safe inside the SQL string
      setStatements.push(`${fieldMapping[field]} = '${item.json[field].replace(/["']/g, '')}'`); // For the UPDATE query
    }
  });
  // If there are columns and values to insert or update
  if (columns.length > 0) {
    // Build the SQL query
    let sqlQuery;
    if (itemId) {
      // If 'id' exists, create an UPDATE query
      sqlQuery = `UPDATE subreddit_conversation SET ${setStatements.join(', ')} WHERE id = '${itemId}'`;
    } else {
      // If 'id' does not exist, create an INSERT query
      sqlQuery = `INSERT INTO subreddit_conversation (${columns.join(', ')}) VALUES (${values.join(', ')})`;
    }
    // Append the query to the string with a semicolon at the end
    queries += sqlQuery + ';\n';
  }
});
// Return the final queries as a plain string
return { json: { queries } };
This code turns our previous fields into queries like the following.
UPDATE subreddit_conversation
SET sentiment = 'negative', sentiment_reasoning = 'The guest is not happy with the amenities provided by the hotel.' WHERE id = (Saved ID on the Set Node);
Now that the query is available, all we need to do is execute it.
Run The Query
This node is a PostgreSQL node. It executes the queries we created in the previous node. After this node runs, the flow loops back to 'Looping to Avoid Token Overload', keeping the process going.
Usage of Workflow
The general process has been outlined, and now we’ll move on to more specific steps. Since this workflow is used frequently, the explanation will be straightforward: it covers what we do in the OpenAI Chat Model and the fields we create.
There are a few ways to utilize this workflow. Here are a couple of them.
- Generating Classes
We create a field called classification. This field is a text datatype and contains the classification of a conversation. It would be divided into four classes: Solution Requests, Advice Requests, Pain & Anger, and Money Talk. You can check the details here.
After we create this field, we call it through Loading Data to n8n. Use a query to select the conversation with null classification. After that, we set the query for OpenAI, like this:
Classify this Reddit conversation into four classes: Solution Requests, Advice Requests, Pain & Anger, and Money Talk. A conversation can be classified into multiple classes. Reply in this format:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
classification: (classification here), (2nd here), …
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
After setting the classification prompt, we can move on to the cleaning process. You just need to replace field_name in the Cleaning the Output node, then replace the fieldMapping values and the query itself in Creating SQL Query. Then everything is set.
- Generating Sentiment
We create a field called sentiment. This field is a text datatype and contains the sentiment of a conversation. It would be divided into three classes: Positive, Neutral, and Negative. You can check the details here.
After we create this field, we call it through Loading Data to n8n. Use a query to select the conversation with null sentiment. After that, we set the query for OpenAI, like this:
Classify this Reddit conversation into three sentiments: Positive, Neutral, and Negative. Reply in this format.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Sentiment: (sentiment here)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
After setting the sentiment prompt, we can move on to the cleaning process. You just need to replace field_name in the Cleaning the Output node, then replace the fieldMapping values and the query itself in Creating SQL Query. Then everything is set.
Sometimes OpenAI doesn’t follow the output format exactly; a few errors may appear in the output. For example, the output sometimes omits the comma, like “classification”: Solution Requests Pain & Anger. This can be cleaned up before we interpret it, either manually, with an SQL query, or in Python.
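One possible way to repair such outputs in Python is to search for the known class names and rebuild a clean, comma-separated string. The column name and class labels below follow the workflow above; the sample DataFrame is only for illustration.

# Sketch: normalize malformed classification strings by matching the known class names.
# Assumes a pandas DataFrame with a 'classification' column, as in the workflow above.
import pandas as pd

KNOWN_CLASSES = ["Solution Requests", "Advice Requests", "Pain & Anger", "Money Talk"]

def normalize_classification(raw: str) -> str:
    found = [label for label in KNOWN_CLASSES if label in str(raw)]
    return ", ".join(found)

df = pd.DataFrame({"classification": ["Solution Requests Pain & Anger", "Money Talk"]})
df["classification"] = df["classification"].apply(normalize_classification)
print(df["classification"].tolist())  # ['Solution Requests, Pain & Anger', 'Money Talk']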
After using n8n to manipulate and structure the conversation data, we move on to the next phase of our workflow. In this step, we focus on extracting valuable insights from the processed data using Python. This allows us to perform deeper analysis and create meaningful insights.
Extracting Insights with Python
After creating the important fields, we interpret the data by exporting the dataset from PostgreSQL to Google Colab through Google Drive. Python itself is an incredibly versatile tool for data analysis and visualization. In this step, we focus on extracting valuable insights from our transformed conversation data. Using libraries like pandas, numpy, and matplotlib, we can perform data analysis and generate actionable insights.
The Rise of Multi-Class Text Classification
With the rise of digital communication, vast amounts of text data are generated across platforms like social media and forums, especially Reddit. Multi-class text classification offers an efficient way to organize this unstructured data, turning conversations and feedback into actionable insights.
To gain a better understanding of the different types of conversations occurring in the Hotel Subreddit, we used Python to create a pie chart visualizing the distribution of discussion categories.
# Import the analysis libraries; `df` holds the dataset exported from PostgreSQL
import pandas as pd
import matplotlib.pyplot as plt
# Specify the columns to sum
columns_to_sum = ['Solution Requests', 'Pain & Anger', 'Advice Requests', 'Money Talk']
# Calculate the summation for the specified columns
sums = df[columns_to_sum].sum()
# Create the pie chart
plt.figure(figsize=(8, 8))
plt.pie(sums, labels=sums.index, autopct='%1.1f%%', startangle=140, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Classification of Hotel Subreddit Conversation')
plt.show()
By aggregating the data for each category and presenting it as a pie chart, we can easily interpret the relative proportions of different themes, such as “Solution Requests”, “Pain & Anger”, “Advice Requests”, and “Money Talk”. This visualization helps identify dominant discussion topics, providing valuable insights into user sentiment and priorities, which is essential for understanding patterns in multi-class classification tasks.
This analysis demonstrates how multi-class classification not only organizes complex textual data but also enables businesses to prioritize actions based on the type and volume of user concerns.
Sentiment Analysis: Visualizing Emotional Engagement
Based on our previous findings, we can also understand the complexity and emotional nuance of conversation based on examining our selected categories. For example, users might express “Pain & Anger” alongside “Solution Requests”, indicating an emotionally charged call for resolution.
# Step 1: Define the classes to include
classes_to_include = ['Solution Requests', 'Pain & Anger', 'Advice Requests', 'Money Talk']
# Step 2: Filter rows in 'classification' containing relevant classes
filtered_classifications = df['classification'].apply(
lambda x: ', '.join(sorted(set(x.split(', ')) & set(classes_to_include)))
)
# Step 3: Remove empty rows (rows that didn't match any classes to include)
filtered_classifications = filtered_classifications[filtered_classifications != '']
# Step 4: Count occurrences of each unique combination
combination_counts = filtered_classifications.value_counts()
# Step 5: Create a bar chart
plt.figure(figsize=(8, 6))
combination_counts.plot(kind='bar', color='#66b3ff')
plt.title('Distribution of Overlapping Classes', fontsize=16)
plt.xlabel('Class Combination', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Here, we focus on analyzing and interpreting overlapping sentiment categories, providing insights into how users express multiple emotions or concerns in their discussions. It identifies the combination of relevant classes, counts their occurrences, and represents them in a bar chart. The resulting visualization reveals which class combinations are most prevalent, helping to prioritize areas requiring attention or intervention in multi-class sentiment analysis.
Advice Requests vs Solutions Requests: Navigating the Path from Inquiry to Answer
In this part, we aim to explore the overlap between the “Advice Request” and “Solution Request” categories in the dataset using a Venn diagram. This technique allows us to visually assess how often these two classifications intersect, helping us understand the relationship between user inquiries and their calls for solutions.
from matplotlib_venn import venn2
import matplotlib.pyplot as plt
# Assuming `df` is your DataFrame
# Step 1: Define the sets
advice_request = set(
df[df['classification'].str.contains('Advice Request', na=False)].index
)
solution_request = set(
df[df['classification'].str.contains('Solution Request', na=False)].index
)
# Step 2: Create the Venn diagram
plt.figure(figsize=(8, 8))
venn = venn2(
[advice_request, solution_request],
('Advice Request', 'Solution Request'),
set_colors=('skyblue', 'lightgreen')
)
# Step 3: Add title and show plot
plt.title('Overlap Between Advice Request and Solution Request')
plt.show()
The Venn diagram is created by defining two sets: one for rows classified under “Advice Request” and the other for “Solution Request”. By visualizing these sets, we can observe how many conversations fall into both categories, revealing important insights into the progression of discussion from asking for advice to seeking solutions. For instance, a substantial overlap may indicate that users often seek solutions after receiving advice, which can be crucial for tailoring content or services. The diagram’s clear, intuitive representation aids in navigating the journey from inquiry to resolution in the conversation.
Next, we can also see the transition between ‘Advice Request’ and ‘Solution Request’. The process starts by defining a dataset that captures transitions between two categories, “From” and “To”, with a count representing the number of occurrences of each transition. The data is structured into a DataFrame, and a pivot operation is performed to reshape it, making it suitable for plotting the transitions in a stacked format.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Example data (replace with actual SQL query output)
data = {
'From': ['Advice Requests', 'Advice Requests', 'Solution Requests', 'Solution Requests'],
'To': ['Advice Requests', 'Solution Requests', 'Advice Requests', 'Solution Requests'],
'Count': [80, 1422, 166, 186]
}
# Convert to DataFrame
df1 = pd.DataFrame(data)
# Pivot for stacking
data_pivot = df1.pivot(index="From", columns="To", values="Count").fillna(0)
# Plot stacked bar chart
data_pivot.plot(kind="bar", stacked=True, figsize=(8, 6), colormap="viridis")
plt.title("Transitions Between Advice and Solution Requests")
plt.xlabel("From")
plt.ylabel("Count")
plt.legend(title="To")
plt.tight_layout()
plt.show()
The stacked bar chart is created using pandas plotting functions, with the colormap="viridis" parameter providing a visually appealing color scheme. The chart shows how discussions shift between “Advice Requests” and “Solution Requests”, with the stacked bars indicating the counts for each transition.
By visualizing the transitions between these two categories, this chart helps uncover the flow of discussions: whether users who ask for advice later seek solutions, or vice versa. This can provide valuable insights into user behavior and the effectiveness of conversations in addressing user needs.
Overlapping Categories: Advice, Solution, and Pain & Anger
This analysis explores the overlap between the “Solution Requests”, “Advice Requests”, and “Pain & Anger” categories using a Venn diagram. It visually reveals how often these classifications intersect, providing insights into the relationship between user inquiries, emotional expressions, and requests for solutions.
Before we start creating the Venn diagram, we will add new columns, one per class, each containing 0 or 1 for false and true. First, in PostgreSQL, we add the new columns as follows.
ALTER TABLE subreddit_conversation
ADD COLUMN "Solution Requests" INT DEFAULT 0,
ADD COLUMN "Advice Requests" INT DEFAULT 0,
ADD COLUMN "Pain & Anger" INT DEFAULT 0,
ADD COLUMN "Money Talk" INT DEFAULT 0;
After that, we can use the LIKE operator with a wildcard to set each column to 0 or 1 with this SQL query.
-- Update "Solution Requests" column
UPDATE subreddit_conversation
SET "Solution Requests" = 1
WHERE classification LIKE '%Solution Requests%';
-- Update "Advice Requests" column
UPDATE subreddit_conversation
SET "Advice Requests" = 1
WHERE classification LIKE '%Advice Requests%';
-- Update "Pain & Anger" column
UPDATE subreddit_conversation
SET "Pain & Anger" = 1
WHERE classification LIKE '%Pain%' OR classification LIKE '%Anger%';
-- Update "Money Talk" column
UPDATE subreddit_conversation
SET "Money Talk" = 1
WHERE classification LIKE '%Money Talk%';
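If you prefer to create the same flag columns on the pandas side instead of in SQL, a short sketch like this would work; it assumes the DataFrame df already holds the classification text.

# Sketch: build the same 0/1 flag columns in pandas instead of SQL.
# Assumes `df` has the 'classification' text column produced by the workflow.
for label in ["Solution Requests", "Advice Requests", "Pain & Anger", "Money Talk"]:
    df[label] = df["classification"].fillna("").str.contains(label, regex=False).astype(int)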
Now we have reference columns to use when interpreting the data, for example, in a Venn diagram like this one.
from matplotlib_venn import venn3
# Create a new column 'class' combining the three flag columns
df['class'] = df.apply(lambda row: ', '.join([col for col in ['Solution Requests', 'Advice Requests', 'Pain & Anger'] if row[col] == 1]), axis=1)
# Count the occurrences of each combination in the 'class' column
combination_counts = df['class'].value_counts()
# Prepare data for the Venn Diagram
venn_data = {
'100': combination_counts.get('Solution Requests', 0),
'010': combination_counts.get('Advice Requests', 0),
'001': combination_counts.get('Pain & Anger', 0),
'110': combination_counts.get('Solution Requests, Advice Requests', 0),
'101': combination_counts.get('Solution Requests, Pain & Anger', 0),
'011': combination_counts.get('Advice Requests, Pain & Anger', 0),
'111': combination_counts.get('Solution Requests, Advice Requests, Pain & Anger', 0)
}
# Plot the Venn Diagram
plt.figure(figsize=(8, 8))
venn3(subsets=venn_data, set_labels=('Solution Requests', 'Advice Requests', 'Pain & Anger'))
plt.title('Venn Diagram of Solutions, Advice, and Pain')
plt.show()
The Venn diagram defines three sets: the “Solution Requests”, “Advice Requests”, and “Pain & Anger” categories. It shows how many conversations fall into multiple categories, helping us understand user dynamics. For example, overlap between “Solution Requests” and “Pain & Anger” may suggest users seek solutions to emotionally charged problems, while intersections between “Advice Requests” and “Pain & Anger” indicate users seeking guidance in distressing situations.
Financial Conversation: The Role of Money Talk
Moving on, we focus on the intersection between the “Money Talk” and “Solution Requests” categories within the dataset, using a Venn diagram to visualize the relationship. By examining these two classifications, we gain insights into how often financial discussions align with requests for solutions, providing a deeper understanding of users' needs and behaviors.
# Create boolean columns for Money Talk and Solution Requests
df['Money Talk'] = df['classification'].apply(lambda x: 'Money Talk' in x)
df['Solution Requests'] = df['classification'].apply(lambda x: 'Solution Requests' in x)
# Count the number of occurrences for each combination
money_talk = df['Money Talk'].sum()
solution_requests = df['Solution Requests'].sum()
overlap = (df['Money Talk'] & df['Solution Requests']).sum()
# Create a Venn Diagram with two sets
venn2(subsets=(money_talk - overlap, solution_requests - overlap, overlap),
set_labels=('Money Talk', 'Solution Requests'))
# Title and display the plot
plt.title('Venn Diagram Showing Overlap between Money Talk and Solution Requests')
plt.show()
We create two boolean columns to generate the Venn diagram: one indicating whether a conversation involves “Money Talk” and another for “Solution Requests”. We then calculate each category's total number of occurrences and the overlap, representing conversations containing both elements. The Venn diagram helps us visualize these overlaps and their distribution.
Visualizing Insights Leads to Better Decisions
To conclude our analysis, we’ll be visualizing insights related to sentiment distribution across various classifications. By processing the dataset and preparing it for analysis, we can better understand the overall sentiment of different categories, which is crucial for making data-driven decisions.
# Strip leading/trailing spaces and normalize whitespace
df2['classification'] = df2['classification'].str.strip().str.replace(r'\s*,\s*', ', ', regex=True)
# Convert to lowercase (optional, only if the case shouldn't matter)
df2['classification'] = df2['classification'].str.lower()
# Verify unique classifications
extra_final = df2['classification'].value_counts()[df2['classification'].value_counts() > 50]
First, we clean and preprocess the dataset, normalizing the classification values by stripping unnecessary spaces and converting them to lowercase for consistency. This ensures that all the classification labels are uniform and easy to process. After cleaning the data, we verify the frequency of unique classifications to ensure that only the relevant categories are considered.
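The sentiment_counts table used in the next snippet can be built with a simple cross-tabulation of classification against sentiment; this assumes the sentiment column created earlier is present in df2.

# Build the classification-by-sentiment table used below (assumes df2 has a 'sentiment' column)
sentiment_counts = pd.crosstab(df2['classification'], df2['sentiment'])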
# Filter classifications with total counts greater than 50
total_counts = sentiment_counts.sum(axis=1)
filtered_sentiment_counts = sentiment_counts[total_counts > 50]
# Plotting the stacked bar chart
filtered_sentiment_counts.plot(kind='bar', stacked=True, figsize=(10, 7), color=['green', 'gray', 'red'])
plt.title('Sentiment Distribution for Classifications with Counts > 50')
plt.ylabel('Count')
plt.xlabel('Classification')
plt.legend(title='Sentiment')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
Next, we aggregate sentiment counts for each classification to determine the sentiment distribution across different categories. We then filter out classifications with fewer than 50 occurrences, focusing on the more significant groups to ensure reliable analysis. The resulting sentiment counts are plotted using a stacked bar chart, clearly representing how different sentiments (positive, neutral, negative) are distributed within each classification.
The stacked bar chart allows us to quickly identify patterns in the data, such as which classifications are predominantly positive or negative. This visualization is valuable for decision-making processes, as it highlights areas where businesses, services, or content can be improved based on the sentiments expressed by users.
Elevate Your Strategy with Insights
You have seen how we extract insights from hotel conversations on Reddit using OpenAI, n8n, and Python. We streamlined the ETL process, AI-based sentiment analysis, AI-based classification, and the data analysis itself.
We used an existing dataset and transformed the data inside PostgreSQL using n8n, with multiple nodes, including OpenAI and Function Nodes, to create new fields. After that, we used Python to interpret the resulting dataset.
The benefits of integrating these tools are clear:
- Efficiency: Automate time-consuming tasks with tools like n8n.
- Scalability: Analyze thousands of conversations seamlessly.
- Actionable Results: Gain insights that help improve services, resolve issues, and identify growth opportunities.
This workflow is just one example; there are many ways to use it to automate your process. You can also experiment with your own way of using n8n to create simple automation workflows and generate your own insights.