📖 Working with an Existing Messy Database
Converting CSV Data to a Relational Schema
At DataForge Systems, we frequently work with clients who hand over messy exports, partial databases, or spreadsheets filled with business-critical data. Our job is to convert that material into clean, reliable, and well-structured relational databases.
This article outlines a practical, real-world workflow for transforming unstructured or denormalized data — especially from flat files like CSVs — into a properly normalized MySQL database. The process you'll follow depends on what you're given, but the end goal is always the same: a well-formed relational model that serves your client's needs.
Step 1: Assess What You've Inherited
Before you do anything, review the resources provided by the client. These might include:
- A flat CSV data file
- A partial database model (e.g. a .mwb or .sql file)
- Business requirements or a list of fields
- Some combination of the above
The starting point determines your next move:
- If no schema is provided, you'll begin by exploring and staging the data (see Step 3).
- If a schema is provided, normalize and revise it before attempting to import any data.
Step 2 (Optional): Normalize the Provided Schema
When a client supplies a partial database model, your first task is to review its structure and apply normalization principles:
- Look for repeating fields or redundant data
- Identify separate entities that need their own tables
- Define primary and foreign keys to enforce relationships
You may not need to rebuild the database from scratch — just clean up and extend what already exists. This helps ensure your design is solid before you start loading in new data.
💡 Tip: If you're also provided with a CSV file, take a peek at it before finalizing your schema. The field names and values might reveal additional structure or missing relationships.
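As a concrete sketch, suppose the client's model stores author and category names directly on every post row. Pulling those repeating values out into their own tables might look like the following (the table and column names are illustrative, borrowed from the microblog example used later in this article):

```sql
-- Illustrative only: each separate entity gets its own table,
-- with a surrogate primary key the database assigns.
CREATE TABLE authors (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE categories (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE posts (
  id             INT AUTO_INCREMENT PRIMARY KEY,
  title          VARCHAR(255) NOT NULL,
  content        TEXT,
  publish_status VARCHAR(20),
  publish_date   DATE,
  author_id      INT,  -- will reference authors(id)
  category_id    INT   -- will reference categories(id)
);
```

Deferring the actual FOREIGN KEY constraints until after the data is migrated (Step 6) keeps the loading steps flexible while you clean the data.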
Step 3: Import the Raw Data into a Staging Table
Regardless of whether you designed the schema or received one, the next step is to load the CSV data into a temporary staging table. This table should match the CSV file exactly — same fields, no constraints, no normalization.
💡 Tip: This staging table acts as a working copy — you can safely explore, transform, and test queries without affecting your normalized schema.
Option A: Use the Table Data Import Wizard
- Create or select a schema (e.g. microblog) and name the new table raw_posts
- In the SCHEMAS panel, right-click the Tables heading for the schema
- Select Table Data Import Wizard
- Choose your CSV file and follow the prompts
- Workbench will auto-create a new table with inferred column types and import the data
This is the simplest method for getting the data into MySQL quickly. However, you may want to revise the column names or types afterward for accuracy.
Option B: Create the Staging Table Manually
If you prefer more control or are working without the wizard, you can write a CREATE TABLE statement and import the data using Workbench's import options or SQL.
CREATE TABLE raw_posts (
post_title VARCHAR(255),
author_name VARCHAR(100),
category_name VARCHAR(50),
publish_status VARCHAR(20),
publish_date DATE,
content TEXT
);
Then use the Table > Import > Import Records from File option, or a script using LOAD DATA LOCAL INFILE (if enabled).
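If LOAD DATA LOCAL INFILE is enabled on both the server and the client, a minimal load script might look like this — the file path is a placeholder, and the field/line options assume a conventional comma-delimited export with a header row, so adjust them to match your actual file:

```sql
-- Assumes a header row and columns in the same order as raw_posts;
-- adjust the path, delimiters, and column list to your export.
LOAD DATA LOCAL INFILE '/path/to/posts.csv'
INTO TABLE raw_posts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(post_title, author_name, category_name, publish_status, publish_date, content);
```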
Step 4: Explore and Analyze the Data
With the data inside MySQL, you can now explore it using queries:
- Use SELECT DISTINCT to find categories or authors
- Aggregate counts to detect duplicates or gaps
- Spot inconsistent or missing values
SELECT DISTINCT category_name FROM raw_posts;
SELECT author_name, COUNT(*) FROM raw_posts GROUP BY author_name;
This is your chance to make sense of the raw dataset and identify how it will map into your normalized schema.
Step 5: Populate the Normalized Tables
Use INSERT INTO ... SELECT statements to migrate data from the staging table into your final schema. For example:
INSERT INTO categories (name)
SELECT DISTINCT category_name FROM raw_posts;
INSERT INTO authors (name)
SELECT DISTINCT author_name FROM raw_posts;
This allows the database to assign new primary key values to the authors and categories.
Update raw_posts so that each row directly stores the correct foreign keys:
ALTER TABLE raw_posts ADD COLUMN category_id INT;
ALTER TABLE raw_posts ADD COLUMN author_id INT;
UPDATE raw_posts rp
JOIN categories c ON rp.category_name = c.name
SET rp.category_id = c.id;
UPDATE raw_posts rp
JOIN authors a ON rp.author_name = a.name
SET rp.author_id = a.id;
Now your staging table has author_id and category_id ready to use. Visually confirm that the new foreign key values match the intended records, then carry them into your primary data table:
INSERT INTO posts (title, content, publish_status, publish_date, author_id, category_id)
SELECT
rp.post_title,
rp.content,
rp.publish_status,
rp.publish_date,
  rp.author_id,
  rp.category_id
FROM raw_posts rp;
Step 6: Validate and Clean Up
Run validation queries to confirm that all records were imported correctly, foreign keys are working, and the final schema matches expectations:
SELECT COUNT(*) FROM posts;
SELECT p.title, a.name, c.name
FROM posts p
JOIN authors a ON p.author_id = a.id
JOIN categories c ON p.category_id = c.id
LIMIT 5;
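One check worth adding, for example, is a query for staging rows whose names never matched a lookup id — left unfixed, these would carry NULL foreign keys into posts:

```sql
-- Staging rows that failed to match an author or category;
-- a non-empty result means a lookup in Step 5 needs fixing.
SELECT post_title, author_name, category_name
FROM raw_posts
WHERE author_id IS NULL OR category_id IS NULL;
```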
Once you're confident in the results, you can:
- Archive or drop the staging table
- Enforce referential integrity with foreign key constraints
- Export a clean SQL script for deployment
💡 Tip: It's good practice to keep your raw import table until final validation is complete, then archive or drop it once you've confirmed data integrity.
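Those cleanup steps might look like this in SQL — a sketch only, and the constraint names are illustrative:

```sql
-- Enforce referential integrity now that the data is migrated.
ALTER TABLE posts
  ADD CONSTRAINT fk_posts_author
    FOREIGN KEY (author_id) REFERENCES authors (id),
  ADD CONSTRAINT fk_posts_category
    FOREIGN KEY (category_id) REFERENCES categories (id);

-- Archive the staging table under a new name...
RENAME TABLE raw_posts TO raw_posts_archive;

-- ...or drop it once data integrity is confirmed.
-- DROP TABLE raw_posts_archive;
```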
Recap: Workflow Summary
This is the standard approach we use at DataForge when normalizing client data:
- Assess what you've been given (schema, CSV, or both)
- Normalize the structure (if provided or required)
- Import the raw data into a staging table
- Explore the data using SQL
- Populate the normalized schema
- Validate your results and clean up
This workflow reflects professional practice and mirrors the real-world tasks you'll face in client-based data projects. Keep your staging tables clean, your constraints intentional, and your schema normalized.
Last updated: October 10, 2025 at 9:14 PM