Take Home Project 1: Geodata Processing

Many of our projects focus on processing and analyzing documents on a regular basis, some public and some proprietary. Because NLP is hard, let’s simply worry about processing location data in a scalable way.

Data Description

The data we’re working with is open data from Geonames.org, specifically a compressed dataset of cities worldwide with a population greater than 1K. Although the file is small enough to inspect directly, the .tsv data fields are described in plain-text form at the bottom of the index page. There are about 150K cities in total.

Processing

Most data we process is normalized to a particular form so we can make some generalizable assumptions about the information we have access to. For locations, we will make up a Primer Normalized Location format that includes the following information:

● Name (Latin-1 compatible)
● Shape data (Binary)
● Latitude
● Longitude
● Country Code (ISO-3166)
● Administrative Level 1
● Administrative Level 2

This data should be persisted, and we will also search against the database, either by nearest-neighbor search or by full-text search on name.

Assignment

Using the system as described above, design a pipeline to process, transform, and persist the cities file sourced in two particular ways:

● As is (Single file)
● As multiple files (e.g. Each line as an individual file, same format)

Assume that the only transformation you need to handle is conversion to the form given above. You should also write a PoC version showing the single-file scenario. The design should optimize for scalability, such that increasing the number of locations (e.g. using all the GeoNames country data) does not yield a linear increase in wall-clock processing time.
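To make the Primer Normalized Location format concrete, here is a minimal sketch of one way a single .tsv row could map onto it. It is an illustration under stated assumptions, not a reference solution: the names NormalizedLocation, latin1_name, and parse_line are hypothetical, the column positions are assumed from the GeoNames readme for the main table, falling back to the ASCII name is only one possible reading of "Latin-1 compatible", and shape data is left empty on the assumption that it would come from a separate source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedLocation:
    """Hypothetical container for the Primer Normalized Location fields."""
    name: str                # Latin-1 compatible name
    shape: Optional[bytes]   # binary shape data; assumed absent from the cities file
    latitude: float
    longitude: float
    country_code: str        # ISO-3166 code
    admin1: str              # Administrative Level 1 code
    admin2: str              # Administrative Level 2 code


# Zero-based column positions, assumed from the GeoNames readme.
COL_NAME, COL_ASCIINAME = 1, 2
COL_LAT, COL_LON = 4, 5
COL_COUNTRY, COL_ADMIN1, COL_ADMIN2 = 8, 10, 11


def latin1_name(name: str, asciiname: str) -> str:
    """Keep the name if it encodes as Latin-1, otherwise use the ASCII name."""
    try:
        name.encode("latin-1")
        return name
    except UnicodeEncodeError:
        return asciiname


def parse_line(line: str) -> NormalizedLocation:
    """Convert one tab-separated row into a NormalizedLocation."""
    # Strip only line endings so trailing empty columns are preserved.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) <= COL_ADMIN2:
        raise ValueError(f"expected at least {COL_ADMIN2 + 1} columns, got {len(fields)}")
    return NormalizedLocation(
        name=latin1_name(fields[COL_NAME], fields[COL_ASCIINAME]),
        shape=None,
        latitude=float(fields[COL_LAT]),
        longitude=float(fields[COL_LON]),
        country_code=fields[COL_COUNTRY],
        admin1=fields[COL_ADMIN1],
        admin2=fields[COL_ADMIN2],
    )
```

Because parse_line takes one row and keeps no state, the same function could in principle serve both the single-file case (called once per line) and the one-file-per-line case (called once per file).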
The design must address the following:

● Resource requirements as the system scales
● Developer ease of use
● Any technology dependencies
● Any shortcomings or limitations of the system

Optionally, address the following longer-term concerns:

● Reprocessing data, such as propagating a shape file format change
● Non-trivial increases in processing time for an individual city (i.e. new algorithms get introduced, additional fields are added)
● Monitoring and alerting
● Location data sourced from a database instead of individual files

Include any assumptions about the application as needed for simplicity, although we are available to answer any question you may have about the system, assignment, or life in general.

General Assessment Criteria

Although we are primarily a Python company, the tools you use do not have to be in Python; they only need to be clearly described. Be sure to state the purpose of each tool as you introduce it.

Make sure you hit all the requirements. Some involve coding; others can be strictly presentational. Some interesting and free tools for simple diagram drawing, if you don’t have one of your own, are Google Drawings and yUML.me.