Streamlined Data Ingestion with pandas
Introduction to flat files
Flat files
- Data stored as plain text (no special formatting)
- One row per line
- Values separated by a delimiter (comma, tab, etc.)
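A minimal sketch (file names hypothetical); read_csv handles any delimiter via sep:

import pandas as pd

# Comma-separated values: the default delimiter
data = pd.read_csv("data.csv")
# Tab-separated values: pass the delimiter explicitly
data_tsv = pd.read_csv("data.tsv", sep="\t")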
Modifying flat file imports
import pandas as pd

col_names = ['STATEFIPS', 'STATE', 'zipcode', 'agi_stub', 'N1']
col_nums = [0, 1, 2, 3, 4]
# Choose columns to load by name
tax_data_v1 = pd.read_csv('us_tax_data_2016.csv', usecols=col_names)
# Choose columns to load by number
tax_data_v2 = pd.read_csv('us_tax_data_2016.csv',
                          usecols=col_nums)
print(tax_data_v1.equals(tax_data_v2))
True
tax_data_next500 = pd.read_csv('us_tax_data_2016.csv', nrows=500, skiprows=1000, header=None)
- nrows: read only the first N rows
- skiprows: skip the first N rows (or a list of specific row numbers); header=None is needed above because the header row is skipped
# tax_data_first1000 is a dataframe loaded earlier with nrows=1000
# extract its column names as a list
col_names = list(tax_data_first1000)
tax_data_next500 = pd.read_csv('us_tax_data_2016.csv', nrows=500, skiprows=1000,
                               header=None, names=col_names)
print(tax_data_next500.head(1))
Handling errors and missing data
Specifying Data Types
tax_data = pd.read_csv("us_tax_data_2016.csv", dtype={"zipcode": str})
Customizing Missing Data Values
tax_data = pd.read_csv("us_tax_data_2016.csv", na_values={"zipcode":0})
print(tax_data[tax_data.zipcode.isna()])
Lines with Errors
>>> tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv", error_bad_lines=False, warn_bad_lines=True)
b'Skipping line 3: expected 147 fields, saw 148\n'
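Note: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0; the replacement is a single on_bad_lines argument:

tax_data = pd.read_csv("us_tax_data_2016_corrupt.csv", on_bad_lines="warn")  # or "skip"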
Introduction to spreadsheets
Loading Select Columns and Rows
# Read columns W-AB and AR of file, skipping metadata header
survey_data = pd.read_excel("fcc_survey_with_headers.xlsx", skiprows=2, usecols="W:AB, AR")
# View data
print(survey_data.head())
Selecting Sheets to Load
Loading Select Sheets
# Get the second sheet by position index
survey_data_sheet2 = pd.read_excel('fcc_survey.xlsx', sheet_name=1)
# Get the second sheet by name
survey_data_2017 = pd.read_excel('fcc_survey.xlsx', sheet_name='2017')
>>> print(survey_data_sheet2.equals(survey_data_2017))
True
- Passing sheet_name=None reads all sheets in a workbook
>>> survey_responses = pd.read_excel("fcc_survey.xlsx", sheet_name=None)
>>> print(type(survey_responses))
<class 'collections.OrderedDict'>
# keys are the sheet names: '2016' and '2017' here
>>> for key, value in survey_responses.items():
...     print(key, type(value))
2016 <class 'pandas.core.frame.DataFrame'>
2017 <class 'pandas.core.frame.DataFrame'>
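With all sheets loaded, a common next step is stacking them into one dataframe; a sketch assuming the sheets share columns and that sheet names are the years:

# Combine all sheets, tagging each row with its sheet of origin
all_responses = pd.DataFrame()
for sheet_name, frame in survey_responses.items():
    frame["Year"] = sheet_name
    all_responses = pd.concat([all_responses, frame], ignore_index=True)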
Modifying imports: true/false data
# Load data, casting True/False columns as Boolean
bool_data = pd.read_excel("fcc_survey_booleans.xlsx",
                          dtype={"AttendedBootcamp": bool,
                                 "AttendedBootCampYesNo": bool,
                                 "AttendedBootcampTF": bool,
                                 "BootcampLoan": bool,
                                 "LoanYesNo": bool,
                                 "LoanTF": bool},
                          true_values=["Yes"],
                          false_values=["No"])
Modifying imports: parsing dates
Parsing Dates
# List columns of dates to parse
date_cols = ["Part1StartTime", "Part1EndTime"]
# Load file, parsing standard datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
# List columns of dates to parse
date_cols = ["Part1StartTime","Part1EndTime", [["Part2StartDate","Part2StartTime"]]]
# Load file, parsing standard and split datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
print(survey_df.head(3))
 | Part2StartDate_Part2StartTime | Age | ... | SchoolMajor |
---|---|---|---|---|
0 | 2016-03-29 21:24:57 | 28.0 | ... | NaN |
1 | 2016-03-29 21:27:14 | 22.0 | ... | NaN |
2 | 2016-03-29 21:27:13 | 19.0 | ... | NaN |

[3 rows x 98 columns]
# List columns of dates to parse
date_cols = {"Part1Start":"Part1StartTime",
"Part1End":"Part1EndTime","Part2Start": ["Part2StartDate","Part2StartTime"]}
# Load file, parsing standard and split datetime columns
survey_df = pd.read_excel("fcc_survey.xlsx", parse_dates=date_cols)
print(survey_df.Part2Start.head(3))
0 2016-03-29 21:24:57
1 2016-03-29 21:27:14
2 2016-03-29 21:27:13
Name: Part2Start, dtype: datetime64[ns]
Datetime Formatting
Code | Meaning | Example |
---|---|---|
%Y | Year (4-digit) | 1999 |
%m | Month (zero-padded) | 03 |
%d | Day (zero-padded) | 01 |
%H | Hour (24-hour clock) | 21 |
%M | Minute (zero-padded) | 09 |
%S | Second (zero-padded) | 05 |
# Matches strings like "03292016 21:27:25"
format_string = "%m%d%Y %H:%M:%S"
survey_df["Part2EndTime"] = pd.to_datetime(survey_df["Part2EndTime"], format=format_string)
print(survey_df.Part2EndTime.head())
0 2016-03-29 21:27:25
1 2016-03-29 21:29:10
2 2016-03-29 21:28:21
3 2016-03-29 21:30:51
4 2016-03-29 21:31:54
Name: Part2EndTime, dtype: datetime64[ns]
Introduction to databases
(Skipped in these notes.)
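For reference, a minimal sketch of the usual pandas approach with SQLAlchemy (database file and table name hypothetical):

import pandas as pd
from sqlalchemy import create_engine

# Connect to a SQLite database file
engine = create_engine("sqlite:///data.db")
# Load a whole table by name, or pass any SQL query string instead
weather = pd.read_sql("weather", engine)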
Importing JSON Data and Working with APIs
Introduction to JSON
Record Orientation
[ { "age_adjusted_death_rate": "7.6", "death_rate": "6.2", "deaths": "32", "leading_cause": "Accidents Except Drug Posioning (V01-X39, X43, X45-X59, Y85-Y86)", "race_ethnicity": "Asian and Pacific Islander", "sex": "F", "year": "2007" }, { "age_adjusted_death_rate": "8.1", "death_rate": "8.3", "deaths": "87", ...
Column Orientation
{ "age_adjusted_death_rate": { "0": "7.6", "1": "8.1", "2": "7.1", "3": ".", "4": ".", "5": "7.3", "6": "13", "7": "20.6", "8": "17.4", "9": ".", "10": ".", "11": "19.8", ...
Split Orientation
{
  "columns": ["age_adjusted_death_rate", "death_rate", "deaths",
              "leading_cause", "race_ethnicity", "sex", "year"],
  "index": [...],
  "data": [
    ["7.6",
    ...
Specifying Orientation
import pandas as pd
death_causes = pd.read_json("nyc_death_causes.json", orient="split")
print(death_causes.head())
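The three orientations are easy to see by writing a small frame back out; a sketch with arbitrary contents:

import pandas as pd

df = pd.DataFrame({"deaths": ["32", "87"], "sex": ["F", "M"]})
print(df.to_json(orient="records"))  # list of {column: value} objects
print(df.to_json(orient="columns"))  # {column: {index: value}} mapping
print(df.to_json(orient="split"))    # {"columns": ..., "index": ..., "data": ...}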
Introduction to APIs
Making Requests
import requests
import pandas as pd
api_url = "https://api.yelp.com/v3/businesses/search"
# Set up parameter dictionary according to documentation
params = {"term": "bookstore", "location": "San Francisco"}
# Set up header dictionary w/ API key according to documentation
# (api_key is assumed to be defined earlier)
headers = {"Authorization": "Bearer {}".format(api_key)}
# Call the API
response = requests.get(api_url, params=params, headers=headers)
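# Optional safeguard (standard requests API): fail fast on HTTP errors
response.raise_for_status()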
# Extract JSON data from the response
data = response.json()
# Load the businesses list into a dataframe
bookstores = pd.DataFrame(data["businesses"])
Working with nested JSONs
Loading Nested JSON Data
import pandas as pd
import requests
from pandas.io.json import json_normalize  # pandas 1.0+: from pandas import json_normalize
# Set up headers, parameters, and API endpoint
api_url = "https://api.yelp.com/v3/businesses/search"
headers = {"Authorization": "Bearer {}".format(api_key)}
params = {"term": "bookstore", "location": "San Francisco"}
# Make the API call and extract the JSON data
response = requests.get(api_url, headers=headers, params=params)
data = response.json()
# Flatten data and load to data frame, with _ separators
bookstores = json_normalize(data["businesses"], sep="_")
>>> print(list(bookstores))
['alias', 'categories', 'coordinates_latitude', 'coordinates_longitude', ... 'location_address1', 'location_address2', 'location_address3', 'location_city', 'location_country', 'location_display_address', 'location_state', 'location_zip_code', ... 'url']
Deeply Nested Data
>>> print(bookstores.categories.head())
0 [{'alias': 'bookstores', 'title': 'Bookstores'}]
1 [{'alias': 'bookstores', 'title': 'Bookstores'...
2 [{'alias': 'bookstores', 'title': 'Bookstores'}]
3 [{'alias': 'bookstores', 'title': 'Bookstores'}]
4 [{'alias': 'bookstores', 'title': 'Bookstores'...
Name: categories, dtype: object
# Flatten categories data, bring in business details
df = json_normalize(data["businesses"],
                    sep="_",
                    record_path="categories",  # nested list column to expand into rows
                    meta=["name", "alias", "rating",
                          ["coordinates", "latitude"],
                          ["coordinates", "longitude"]],  # business-level attributes to carry over
                    meta_prefix="biz_")  # prefix avoids collisions (categories also have an "alias")
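Each category record becomes its own row, with the prefixed business-level meta columns repeated alongside it; the expected columns (a sketch, not captured output):

print(list(df))
# ['alias', 'title', 'biz_name', 'biz_alias', 'biz_rating',
#  'biz_coordinates_latitude', 'biz_coordinates_longitude']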
Combining multiple datasets
# Put bookstore datasets together, renumber rows
bookstores = first_20_bookstores.append(next_20_bookstores, ignore_index=True)
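Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the pd.concat equivalent:

# Stack the two result pages and renumber the rows
bookstores = pd.concat([first_20_bookstores, next_20_bookstores], ignore_index=True)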