Proceedings of the Second Workshop on Computational Approaches to Code Switching, pages 21–29, Austin, TX, November 1, 2016. © 2016 Association for Computational Linguistics

Word-Level Language Identification and Predicting Codeswitching Points in Swahili-English Language Data

Mario Piergallini, Rouzbeh Shirvani, Gauri S. Gautam, and Mohamed Chouikha
Howard University Department of Electrical Engineering and Computer Science
2366 Sixth St NW, Washington, DC 20059
[email protected], [email protected], [email protected], [email protected]

Abstract

Codeswitching is a very common behavior among Swahili speakers, but of the little computational work done on Swahili, none has focused on codeswitching. This paper addresses two tasks relating to Swahili-English codeswitching: word-level language identification and prediction of codeswitch points. Our two-step model achieves high accuracy at labeling the language of words using a simple feature set combined with label probabilities on the adjacent words. This system is used to label a large Swahili-English internet corpus, which is in turn used to train a model for predicting codeswitch points.

1 Introduction

Language technology has progressed rapidly in many applications (speech recognition and synthesis, parsing, translation, sentiment analysis, etc.), but efforts have been focused mainly on large, high-resource languages and on monolingual data. Many tools have not been developed for low-resource languages nor can they be applied to mixed-language data containing codeswitching. In many cases, dealing with low-resource languages requires the ability to deal with codeswitching. For example, it is quite common to codeswitch between the lingua franca and English in many former English colonies in Africa, such as Kenya, Zimbabwe and South Africa (Myers-Scotton, 1993b). Thus, expanding the reach of language technologies to users of these languages may require the ability to handle mixed-language data, depending on which domains it is intended
for.

Codeswitching produces additional challenges for NLP due to the simple fact that monolingual tools cannot be applied to mixed-language data. Beyond that, codeswitching also has its own peculiarities and can convey meaning in and of itself, and these aspects are worthy of study as well. Codeswitching can be used to increase or decrease social distance, indicate something about a speaker's social identity or their stance towards the subject of discussion, or to draw attention to particular phrases (Myers-Scotton, 1993b). Sometimes, of course, it may simply indicate that the speaker does not know the word in the other language, or is not able to recall it quickly in this instance. Computational approaches to discourse analysis will require tools specific to codeswitching in order to be able to make use of these social meanings.

Multiple theories propose grammatical constraints on codeswitching (Myers-Scotton, 1993a), and computational approaches may contribute to providing stronger evidence for or against these theories (Solorio and Liu, 2008). These grammatical constraints also can inform the social interpretation of codeswitching. If a codeswitch occurs in a position that is less expected, it may be more likely to have been used for effect. Similarly, when a codeswitch occurs in a less likely context based on features of the discourse, this also affects the interpretation.

The longer a discussion is carried out in a single language, the more likely it would seem that a switch indicates a change in the discourse. For example, Carol Myers-Scotton (1993b) analyzes a conversation where a switch to Swahili and then to English after small talk in the local language adds force to the speaker's rejection of a request. This type of switch could also be precipitated by a change in conversation topic, task (e.g. pre-class small talk transitioning into the beginning of lessons), location, etc. By contrast, in conversations where participants switch frequently between languages, each
individual switch carries less social meaning. In those situations, it is the overall pattern of codeswitching that conveys meaning (Myers-Scotton, 1993b). A model should be able to see this pattern and adjust the likelihood of switches accordingly. Being able to predict how likely a switch is to occur in a particular position may thus provide information to aid in the social analysis of codeswitching behavior.

In this paper, we will be introducing two corpora of Swahili-English data. One

